Re: no default hash partition

2019-08-06 Thread Amit Langote
Horiguchi-san,

On Wed, Aug 7, 2019 at 1:59 PM Kyotaro Horiguchi
 wrote:
> At Tue, 6 Aug 2019 23:26:19 -0400, Robert Haas  wrote:
> > On Tue, Aug 6, 2019 at 6:58 PM Tom Lane  wrote:
> > I think, as Amit says, that having an automatic partition creation
> > feature for hash partitions (and maybe other kinds, but certainly for
> > hash) would be a useful thing to add to the system. I also think that
> > it might be useful to add some commands to automate partition
> > splitting (and maybe combining) although I think there's some design
> > work to be done there to figure out exactly what we should build.  I
> > don't think it's ever useful to have a hash-partitioned table with an
> > incomplete set of partitions long term, but it makes things simpler to
> > allow that temporarily, for example during dump restoration.
> > Therefore, I see no reason why we would want to go to the trouble of
> > allowing hash-partitioned tables to have default partitions; it would
> > just encourage people to do things that don't really make any sense.
>
> +1.
>
> By the way, couldn't we offer a means to check for gaps in a hash
> partition? For example, the output of current \d+ 
> contains the Partitions section that shows a list of
> partitions. I think that we can show all gaps there.
>
> =# \d+ p
>Partitioned table "public.p"
> ...
> Partition key: HASH (a)
> Partitions: c1 FOR VALUES WITH (modulus 4, remainder 0),
> c3 FOR VALUES WITH (modulus 4, remainder 3),
> GAP (modulus 4, remainder 1),
> GAP (modulus 4, remainder 2)
>
> Or
>
> Partitions: c1 FOR VALUES WITH (modulus 4, remainder 0),
> c3 FOR VALUES WITH (modulus 4, remainder 3),
> Gaps: (modulus 4, remainder 1), (modulus 4, remainder 2)

I imagine showing this output would require some non-trivial code on
the client side (?) to figure out the gaps.  If our intention in the
long run is to make sure that such gaps only ever appear temporarily,
that is, when running a command to increase the number of hash
partitions (as detailed in Robert's email), then a user would never
see those gaps.  So, maybe writing such code wouldn't be worthwhile in
the long run?
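
For what it's worth, a client could list the existing bounds with a
query along these lines (a rough sketch, using the example parent
table "p" from above); the fiddly part is parsing modulus/remainder
out of the bound text and working out which combinations are missing:

select c.relname, pg_get_expr(c.relpartbound, c.oid) as bound
from pg_inherits i
join pg_class c on c.oid = i.inhrelid
where i.inhparent = 'p'::regclass;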

Thanks,
Amit




is necessary to recheck cached data in fn_extra?

2019-08-06 Thread Pavel Stehule
Hi

I need to use a cache accessed via fn_extra. It will store data about
function parameters (their types). If I understand correctly, these data
should be stable within a query, so a recheck is not necessary. Is that true?

Regards

Pavel


Re: stress test for parallel workers

2019-08-06 Thread Thomas Munro
On Wed, Aug 7, 2019 at 5:07 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Another question is whether the build farm should be setting the Linux
> > oom score adjust thing.
>
> AFAIK you can't do that without being root.

Rats, yeah you need CAP_SYS_RESOURCE or root to lower it.

-- 
Thomas Munro
https://enterprisedb.com




Re: stress test for parallel workers

2019-08-06 Thread Tom Lane
Thomas Munro  writes:
> Another question is whether the build farm should be setting the Linux
> oom score adjust thing.

AFAIK you can't do that without being root.

regards, tom lane




Re: stress test for parallel workers

2019-08-06 Thread Thomas Munro
On Wed, Aug 7, 2019 at 4:29 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > I wondered if the build farm should try to report OOM kill -9 or other
> > signal activity affecting the postmaster.
>
> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
> that would fork the postmaster, wait for the postmaster to exit, and then
> report the exit status.  Where to report it *to* seems like the hard part,
> but maybe an answer that worked for the buildfarm would be enough for now.

Oh, right, you don't even need subreaper tricks (I was imagining we
had a double fork somewhere, which we don't).

Another question is whether the build farm should be setting the Linux
oom score adjust thing.

-- 
Thomas Munro
https://enterprisedb.com




Re: no default hash partition

2019-08-06 Thread Kyotaro Horiguchi
At Tue, 6 Aug 2019 23:26:19 -0400, Robert Haas  wrote in 

> On Tue, Aug 6, 2019 at 6:58 PM Tom Lane  wrote:
> I think, as Amit says, that having an automatic partition creation
> feature for hash partitions (and maybe other kinds, but certainly for
> hash) would be a useful thing to add to the system. I also think that
> it might be useful to add some commands to automate partition
> splitting (and maybe combining) although I think there's some design
> work to be done there to figure out exactly what we should build.  I
> don't think it's ever useful to have a hash-partitioned table with an
> incomplete set of partitions long term, but it makes things simpler to
> allow that temporarily, for example during dump restoration.
> Therefore, I see no reason why we would want to go to the trouble of
> allowing hash-partitioned tables to have default partitions; it would
> just encourage people to do things that don't really make any sense.

+1.

By the way, couldn't we offer a means to check for gaps in a hash
partition? For example, the output of current \d+ 
contains the Partitions section that shows a list of
partitions. I think that we can show all gaps there.

=# \d+ p
   Partitioned table "public.p"
...
Partition key: HASH (a)
Partitions: c1 FOR VALUES WITH (modulus 4, remainder 0),
c3 FOR VALUES WITH (modulus 4, remainder 3),
GAP (modulus 4, remainder 1),
GAP (modulus 4, remainder 2)

Or

Partitions: c1 FOR VALUES WITH (modulus 4, remainder 0),
c3 FOR VALUES WITH (modulus 4, remainder 3),
Gaps: (modulus 4, remainder 1), (modulus 4, remainder 2)


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: stress test for parallel workers

2019-08-06 Thread Tom Lane
Thomas Munro  writes:
> I wondered if the build farm should try to report OOM kill -9 or other
> signal activity affecting the postmaster.

Yeah, I've been wondering whether pg_ctl could fork off a subprocess
that would fork the postmaster, wait for the postmaster to exit, and then
report the exit status.  Where to report it *to* seems like the hard part,
but maybe an answer that worked for the buildfarm would be enough for now.

regards, tom lane




Re: Built-in connection pooler

2019-08-06 Thread Li Japin
Hi, Konstantin

I tested patch-16 on the PostgreSQL master branch, and I find that the
temporary table is not removed when we re-connect. Here is my test:

japin@ww-it:~/WwIT/postgresql/Debug/connpool$ initdb
The files belonging to this database system will be owned by user "japin".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /home/japin/WwIT/postgresql/Debug/connpool/DATA ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Asia/Shanghai
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

     pg_ctl -D /home/japin/WwIT/postgresql/Debug/connpool/DATA -l 
logfile start

japin@ww-it:~/WwIT/postgresql/Debug/connpool$ pg_ctl -l /tmp/log start
waiting for server to start done
server started
japin@ww-it:~/WwIT/postgresql/Debug/connpool$ psql postgres
psql (13devel)
Type "help" for help.

postgres=# ALTER SYSTEM SET connection_proxies TO 1;
ALTER SYSTEM
postgres=# ALTER SYSTEM SET session_pool_size TO 1;
ALTER SYSTEM
postgres=# \q
japin@ww-it:~/WwIT/postgresql/Debug/connpool$ pg_ctl -l /tmp/log restart
waiting for server to shut down done
server stopped
waiting for server to start done
server started
japin@ww-it:~/WwIT/postgresql/Debug/connpool$ psql -p 6543 postgres
psql (13devel)
Type "help" for help.

postgres=# CREATE TEMP TABLE test(id int, info text);
CREATE TABLE
postgres=# INSERT INTO test SELECT id, md5(id::text) FROM 
generate_series(1, 10) id;
INSERT 0 10
postgres=# select * from pg_pooler_state();
  pid  | n_clients | n_ssl_clients | n_pools | n_backends | n_dedicated_backends | n_idle_backends | n_idle_clients | tx_bytes | rx_bytes | n_transactions
 ------+-----------+---------------+---------+------------+----------------------+-----------------+----------------+----------+----------+----------------
  3885 |         1 |             0 |       1 |          1 |                    0 |               0 |              0 |     1154 |     2880 |              6
(1 row)

postgres=# \q
japin@ww-it:~/WwIT/postgresql/Debug/connpool$ psql -p 6543 postgres
psql (13devel)
Type "help" for help.

postgres=# \d
       List of relations
   Schema   | Name | Type  | Owner
 -----------+------+-------+-------
  pg_temp_3 | test | table | japin
(1 row)

postgres=# select * from pg_pooler_state();
  pid  | n_clients | n_ssl_clients | n_pools | n_backends | n_dedicated_backends | n_idle_backends | n_idle_clients | tx_bytes | rx_bytes | n_transactions
 ------+-----------+---------------+---------+------------+----------------------+-----------------+----------------+----------+----------+----------------
  3885 |         1 |             0 |       1 |          1 |                    0 |               0 |              0 |     2088 |     3621 |              8
(1 row)

postgres=# select * from test ;
  id |   info
 ----+----------------------------------
   1 | c4ca4238a0b923820dcc509a6f75849b
   2 | c81e728d9d4c2f636f067f89cc14862c
   3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
   4 | a87ff679a2f3e71d9181a67b7542122c
   5 | e4da3b7fbbce2345d7772b0674a318d5
   6 | 1679091c5a880faf6fb5e6087eb1b2dc
   7 | 8f14e45fceea167a5a36dedd4bea2543
   8 | c9f0f895fb98ab9159f51fd0297e236d
   9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
  10 | d3d9446802a44259755d38e6d163e820
(10 rows)

I inspected the code and found the following in the DefineRelation function:

if (stmt->relation->relpersistence != RELPERSISTENCE_TEMP
     && stmt->oncommit != ONCOMMIT_DROP)
     MyProc->is_tainted = true;

For temporary tables, MyProc->is_tainted should be true, so I changed it
as follows:

if (stmt->relation->relpersistence == RELPERSISTENCE_TEMP
     || stmt->oncommit == ONCOMMIT_DROP)
     MyProc->is_tainted = true;

With this change, temporary tables work. I am not sure the change is right.

On 8/2/19 7:05 PM, Konstantin Knizhnik wrote:
>
>
> On 02.08.2019 12:57, DEV_OPS wrote:
>> Hello Konstantin
>>
>>
>> would you please re-base this patch? I'm going to test it, and back port
>> into PG10 stable and PG9 stable
>>
>>
>> thank you very much
>>
>>
>
> Thank you.
> Rebased patch is attached.
>
>



Re: no default hash partition

2019-08-06 Thread David Fetter
On Tue, Aug 06, 2019 at 06:58:44PM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > On 2019-Aug-06, Tom Lane wrote:
> >> Seems like "it's likely to cause trouble for users" is just going to
> >> beg the question "why?".  Can we explain the hazard succinctly?
> >> Or point to a comment somewhere else that explains it?
> 
> > Right ... the "trouble" is just that if the user later wants to add the
> > missing partitions, they'll need to acquire some strong lock (IIRC it's AEL)
> > in the partitioned table, so it effectively means an outage.  With
> > list/range partitioning, there's the slight advantage that you don't
> > have to guess all your partitions in advance, or cover data values that
> > are required for a very small number of rows.  In hash partitioning you
> > can't really predict which values those are going to be, and the set of
> > missing partitions is perfectly known.
> 
> Hmm.  So given the point about it being hard to predict which hash
> partitions would receive what values ... under what circumstances
> would it be sensible to not create a full set of partitions?  Should
> we just enforce that there is a full set, somehow?

+1 for requiring that hash partitions not have gaps, ideally by making
one call create all the partitions.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: no default hash partition

2019-08-06 Thread Robert Haas
On Tue, Aug 6, 2019 at 6:58 PM Tom Lane  wrote:
> Hmm.  So given the point about it being hard to predict which hash
> partitions would receive what values ... under what circumstances
> would it be sensible to not create a full set of partitions?  Should
> we just enforce that there is a full set, somehow?

I think it would only be sensible as a temporary state.  The system
allows more than one modulus so that you can do partition split
incrementally.  For example if you have 8 partitions all with modulus
8 and with remainders 0..7, you could:

- detach the partition with (modulus 8, remainder 0)
- attach two new partitions with (modulus 16, remainder 0) and
(modulus 16, remainder 8)
- move the data from the old partition to the new ones

Then you'd have 9 partitions, and you'd only have taken the amount of
downtime needed to repartition 1/8th of your data.  You could then
repeat this process one partition at a time during additional
maintenance windows, and end up with 16 partitions in the end.
Without the ability to have more than one modulus, or if you had
chosen not to double the modulus but to change it to some other value
like 13, you would've needed to repartition all the data at once,
which would have required one much longer outage.  You can argue about
whether the ability to do this kind of thing is useful, but it seemed
to me that it was.
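
In SQL terms, that detach/attach/move sequence might look something like
the following (just a sketch of the idea, assuming a table t
hash-partitioned with partitions t_p0 .. t_p7 at modulus 8; these exact
commands are not from anywhere above):

alter table t detach partition t_p0;
create table t_p0a partition of t for values with (modulus 16, remainder 0);
create table t_p0b partition of t for values with (modulus 16, remainder 8);
insert into t select * from t_p0;  -- rows re-route into the two new partitions
drop table t_p0;

Between the DETACH and the creation of the two new partitions, rows
hashing to (modulus 8, remainder 0) have nowhere to go, which is exactly
the temporarily-incomplete state being discussed.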

I think, as Amit says, that having an automatic partition creation
feature for hash partitions (and maybe other kinds, but certainly for
hash) would be a useful thing to add to the system. I also think that
it might be useful to add some commands to automate partition
splitting (and maybe combining) although I think there's some design
work to be done there to figure out exactly what we should build.  I
don't think it's ever useful to have a hash-partitioned table with an
incomplete set of partitions long term, but it makes things simpler to
allow that temporarily, for example during dump restoration.
Therefore, I see no reason why we would want to go to the trouble of
allowing hash-partitioned tables to have default partitions; it would
just encourage people to do things that don't really make any sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Built-in connection pooler

2019-08-06 Thread Ryan Lambert
Hi Konstantin,

I did some testing with the latest patch [1] on a small local VM with 1 CPU
and 2GB RAM with the intention of exploring pg_pooler_state().

Configuration:

idle_pool_worker_timeout = 0 (default)
connection_proxies = 2
max_sessions = 10 (default)
max_connections = 1000

Initialized pgbench w/ scale 10 for the small server.

Running pgbench w/out connection pooler with 300 connections:

pgbench -p 5432 -c 300 -j 1 -T 60 -P 15 -S bench_test
starting vacuum...end.
progress: 15.0 s, 1343.3 tps, lat 123.097 ms stddev 380.780
progress: 30.0 s, 1086.7 tps, lat 155.586 ms stddev 376.963
progress: 45.1 s, 1103.8 tps, lat 156.644 ms stddev 347.058
progress: 60.6 s, 652.6 tps, lat 271.060 ms stddev 575.295
transaction type: 
scaling factor: 10
query mode: simple
number of clients: 300
number of threads: 1
duration: 60 s
number of transactions actually processed: 63387
latency average = 171.079 ms
latency stddev = 439.735 ms
tps = 1000.918781 (including connections establishing)
tps = 1000.993926 (excluding connections establishing)


It crashes when I attempt to run with the connection pooler, 300
connections:

pgbench -p 6543 -c 300 -j 1 -T 60 -P 15 -S bench_test
starting vacuum...end.
connection to database "bench_test" failed:
server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.

In the logs I get:

WARNING:  PROXY: Failed to add new client - too much sessions: 18 clients,
1 backends. Try to increase 'max_sessions' configuration parameter.

The logs report 1 backend even though max_sessions is the default of 10.
Why is there only 1 backend reported?  Is that error message getting the
right value?

Minor grammar fix, the logs on this warning should say "too many sessions"
instead of "too much sessions."

Reducing pgbench to only 30 connections keeps it from completely crashing
but it still does not run successfully.

pgbench -p 6543 -c 30 -j 1 -T 60 -P 15 -S bench_test
starting vacuum...end.
client 9 aborted in command 1 (SQL) of script 0; perhaps the backend died
while processing
client 11 aborted in command 1 (SQL) of script 0; perhaps the backend died
while processing
client 13 aborted in command 1 (SQL) of script 0; perhaps the backend died
while processing
...
...
progress: 15.0 s, 5734.5 tps, lat 1.191 ms stddev 10.041
progress: 30.0 s, 7789.6 tps, lat 0.830 ms stddev 6.251
progress: 45.0 s, 8211.3 tps, lat 0.810 ms stddev 5.970
progress: 60.0 s, 8466.5 tps, lat 0.789 ms stddev 6.151
transaction type: 
scaling factor: 10
query mode: simple
number of clients: 30
number of threads: 1
duration: 60 s
number of transactions actually processed: 453042
latency average = 0.884 ms
latency stddev = 7.182 ms
tps = 7549.373416 (including connections establishing)
tps = 7549.402629 (excluding connections establishing)
Run was aborted; the above results are incomplete.

Logs for that run show (truncated):


2019-08-07 00:19:37.707 UTC [22152] WARNING:  PROXY: Failed to add new
client - too much sessions: 18 clients, 1 backends. Try to increase
'max_sessions' configuration parameter.
2019-08-07 00:31:10.467 UTC [22151] WARNING:  PROXY: Failed to add new
client - too much sessions: 15 clients, 4 backends. Try to increase
'max_sessions' configuration parameter.
2019-08-07 00:31:10.468 UTC [22152] WARNING:  PROXY: Failed to add new
client - too much sessions: 15 clients, 4 backends. Try to increase
'max_sessions' configuration parameter.
...
...


Here it is reporting fewer clients with more backends. Still, only 4
backends reported with 15 clients doesn't seem right.  Looking at the
results from pg_pooler_state() at the same time (below) showed 5 and 7
backends for the two different proxies, so why are the logs only reporting
4 backends when pg_pooler_state() reports 12 total?

Why is n_idle_clients negative?  In this case it showed -21 and -17.  Each
proxy reported 7 clients, with max_sessions = 10, having those
n_idle_client results doesn't make sense to me.


postgres=# SELECT * FROM pg_pooler_state();
 pid   | n_clients | n_ssl_clients | n_pools | n_backends | n_dedicated_backends | n_idle_backends | n_idle_clients | tx_bytes | rx_bytes | n_transactions
-------+-----------+---------------+---------+------------+----------------------+-----------------+----------------+----------+----------+----------------
 25737 |         7 |             0 |       1 |          5 |                    0 |               0 |            -21 |  4099541 |  3896792 |          61959
 25738 |         7 |             0 |       1 |          7 |                    0 |               2 |            -17 |  4530587 |  4307474 |          68490
(2 rows)


I get errors running pgbench down to only 20 connections with this
configuration. I tried adjusting connection_proxies = 1 and it handles even
fewer connections.  Setting connection_proxies = 4 allows it to handle 20
connections without error, but by 40 connections it starts having issues.

While I don't have expectations of this working great 

Re: FETCH FIRST clause PERCENT option

2019-08-06 Thread Kyotaro Horiguchi
Hello.

At Thu, 1 Aug 2019 15:54:11 +0300, Surafel Temesgen  
wrote in 
> > Other than that, we can rip the clause if it is 100%
> >
> >
> > You mean if PERCENT=100 it should short circuit and run the query
> > normally?  I like that.
> >
> 
> The patch does not do it automatically. It's currently the query writer's
> obligation to do that.

I have some comments.

This patch uses distinct parameters for the exact number and the
percentage. On the other hand, the planner has a notion of
tuple_fraction covering both. The same handling is also used
for tuple number estimation. Using the same scheme will make
things far simpler. See the comment of grouping_planner().

In the executor part, in addition to LimitState.position, this patch
uses LimitState.backwardPosition to count how many tuples we're
back from the end of the current tuplestore. But things will get
simpler by just asking the tuplestore for the number of tuples it
holds.


+slot = node->subSlot;

Why is this needed? The variable is properly set before use and
the assignment is bogus in the first place.

The new code block in LIMIT_RESCAN in ExecLimit is useless since
it is exactly the same as the existing code block. Why didn't you
use the existing if block?

>if (node->limitOption == PERCENTAGE)
+{
+node->perExactCount = ceil(node->percent * node->position 
/ 100.0);
+
+

node->position is the number of tuples returned to upper node so
far (not the number of tuples this node has read from the lower
node so far).  I don't understand what the expression means.


+if (node->perExactCount == node->perExactCount + 1)
+node->perExactCount++;

What? The condition can never be true. As a result, the following
if block won't run.

>/*
+ * Return the tuple up to the number of exact count in OFFSET
+ * clause without percentage value consideration.
+ */
+if (node->perExactCount > 0)
+{
+




+/*
+ * We may needed this tuple in backward scan so put it into
+ * tuplestore.
+ */
+if (node->limitOption == PERCENTAGE)
+{
+tuplestore_puttupleslot(node->tupleStore, slot);
+tuplestore_advance(node->tupleStore, true);
+}

"needed"->"need" ? The comment says that this is needed for
backward scan, but it is actually required even in forward
scan. More to the point, tuplestore_advance lacks comment.

Anyway, the code in LIMIT_RESCAN is broken in some ways. For
example, if it is used as the inner scan of a NestLoop, the
tuplestore accumulates tuples on every scan. You will see that
the tuplestore stores up to 1000 tuples (10 times the size of the
inner) with the following query.

create table t0 as (select a from generate_series(0, 99) a);
create table t1 as (select a from generate_series(0, 9) a);
analyze;
set enable_hashjoin to false;
set enable_mergejoin to false;
set enable_material to false;
explain analyze select t.*, t1.* from (select a from t0 fetch first 10 percent 
rows only) as t join t1 on (t.a = t1.a);
   QUERY PLAN 
---
 Nested Loop  (cost=0.20..7.35 rows=10 width=8) (actual ...)
   ->  Seq Scan on t1  (cost=0.00..1.10 rows=10 width=4) (actual ...)
   ->  Limit  (cost=0.20..0.40 rows=10 width=4) (actual ...)
 ->  Seq Scan on t0  (cost=0.00..2.00 rows=100 width=4) (actual ...)

That's it for now.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: partition routing layering in nodeModifyTable.c

2019-08-06 Thread Etsuro Fujita
Hi,

On Wed, Aug 7, 2019 at 11:47 AM Amit Langote  wrote:
> On Wed, Aug 7, 2019 at 11:30 AM Etsuro Fujita  wrote:
> > On Wed, Aug 7, 2019 at 10:24 AM Amit Langote  
> > wrote:
> > > * Regarding setting ForeignScan.resultRelIndex even for non-direct
> > > modifications, maybe that's not a good idea anymore.  A foreign table
> > > result relation might be involved in a local join, which prevents it
> > > from being directly-modifiable and also hides the ForeignScan node
> > > from being easily modifiable in PlanForeignModify.  Maybe, we should
> > > just interpret resultRelIndex as being set only when
> > > direct-modification is feasible.
> >
> > Yeah, I think so; when using PlanForeignModify because for example,
> > the foreign table result relation is involved in a local join, as you
> > mentioned, ForeignScan.operation would be left unchanged (ie,
> > CMD_SELECT), so to me it's more understandable to not set
> > ForeignScan.resultRelIndex.
>
> OK.
>
> > > Should we rename the field
> > > accordingly to be self-documenting?
> >
> > IMO I like the name resultRelIndex, but do you have any better idea?
>
> On second thought, I'm fine with sticking to resultRelIndex.  Trying
> to make it self documenting might make the name very long.

OK

> Here are the updated patches.

IIUC, I think we reached a consensus at least on the 0001 patch.
Andres, would you mind if I commit that patch?

Best regards,
Etsuro Fujita




Re: remove "msg" parameter from convert_tuples_by_name

2019-08-06 Thread Amit Langote
On Wed, Aug 7, 2019 at 7:47 AM Alvaro Herrera  wrote:
>
> Hello, here's a pretty trivial cleanup.
>
> Currently, you have to pass the errmsg text to convert_tuples_by_name
> and convert_tuples_by_position that's going to be raised if the tuple
> descriptors don't match.  In the latter's case that makes sense, as each
> case is pretty specific and tailored messages can be offered, so this is
> useful.
>
> However, in the case of convert_tuples_by_name, it seems we don't have
> enough control over what is being called, so there's no way to
> produce tailored messages -- all the callers are using the same generic
> wording: "could not convert row type".
>
> This code was introduced by dcb2bda9b704; I think back then we were
> thinking that it would be possible to give different error messages for
> different cases (as convert_tuples_by_position was already doing then),
> however it seems clear now that that'll never happen.
>
> I propose we get rid of it by having convert_tuples_by_name supply the
> error message by itself, as in the attached patch.

+1.  I always wondered when writing partitioning patches why I have to
pass the same string.

If we're reducing the message string to occur only once in the source
code, can we maybe write it to be more informative?  I wonder if users
aren't normally supposed to see this message?

Thanks,
Amit




Re: partition routing layering in nodeModifyTable.c

2019-08-06 Thread Amit Langote
Fujita-san,

Thanks for the quick follow up.

On Wed, Aug 7, 2019 at 11:30 AM Etsuro Fujita  wrote:
> On Wed, Aug 7, 2019 at 10:24 AM Amit Langote  wrote:
> > * Regarding setting ForeignScan.resultRelIndex even for non-direct
> > modifications, maybe that's not a good idea anymore.  A foreign table
> > result relation might be involved in a local join, which prevents it
> > from being directly-modifiable and also hides the ForeignScan node
> > from being easily modifiable in PlanForeignModify.  Maybe, we should
> > just interpret resultRelIndex as being set only when
> > direct-modification is feasible.
>
> Yeah, I think so; when using PlanForeignModify because for example,
> the foreign table result relation is involved in a local join, as you
> mentioned, ForeignScan.operation would be left unchanged (ie,
> CMD_SELECT), so to me it's more understandable to not set
> ForeignScan.resultRelIndex.

OK.

> > Should we rename the field
> > accordingly to be self-documenting?
>
> IMO I like the name resultRelIndex, but do you have any better idea?

On second thought, I'm fine with sticking to resultRelIndex.  Trying
to make it self documenting might make the name very long.

Here are the updated patches.

Thanks,
Amit


v5-0001-Revise-BeginDirectModify-API-to-pass-ResultRelInf.patch
Description: Binary data


v5-0003-Rearrange-partition-update-row-movement-code-a-bi.patch
Description: Binary data


v5-0004-Refactor-transition-tuple-capture-code-a-bit.patch
Description: Binary data


v5-0002-Remove-es_result_relation_info.patch
Description: Binary data


Re: no default hash partition

2019-08-06 Thread Amit Langote
Hi,

On Wed, Aug 7, 2019 at 8:02 AM Stephen Frost  wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Alvaro Herrera  writes:
> > > On 2019-Aug-06, Tom Lane wrote:
> > >> Seems like "it's likely to cause trouble for users" is just going to
> > >> beg the question "why?".  Can we explain the hazard succinctly?
> > >> Or point to a comment somewhere else that explains it?
> >
> > > Right ... the "trouble" is just that if the user later wants to add the
> > > missing partitions, they'll need to acquire some strong lock (IIRC it's 
> > > AEL)
> > > in the partitioned table, so it effectively means an outage.  With
> > > list/range partitioning, there's the slight advantage that you don't
> > > have to guess all your partitions in advance, or cover data values that
> > > are required for a very small number of rows.  In hash partitioning you
> > > can't really predict which values those are going to be, and the set of
> > > missing partitions is perfectly known.
> >
> > Hmm.  So given the point about it being hard to predict which hash
> > partitions would receive what values ... under what circumstances
> > would it be sensible to not create a full set of partitions?  Should
> > we just enforce that there is a full set, somehow?
>
> I imagine there's good reasons this wasn't just done (for this or
> various other things), but couldn't we enforce it by just creating them
> all..?  Sure would simplify a lot of things for users.  Similarly for
> list partitions, I would think.  Again, I feel like there's probably a
> reason why it doesn't just work(tm) like that, but it sure would be
> nice.

Maybe the reason that we don't create all partitions automatically is
that hash-partitioning developers thought that such a feature could be
built later [1].  Maybe you know, but I think it's just that we
implemented the syntax needed to get things like pg_dump/upgrade
working sanely, that is, a command to define each partition
separately, and... stopped there.  There're no other intrinsic reasons
that I know of for this implementation order.  pg_partman helps with
the automation, with features that users want in most or all cases --
define all needed partitions for a given modulus, define time series
partitions for a given window, etc.  Maybe not everyone likes to rely
on an external tool, so the core at some point will have features to
perform some if not all of the tasks that pg_partman does, with the
added benefit that the new feature might allow the core to optimize
partitioning better.
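
(To make that concrete: the kind of thing such a feature would automate
is roughly the following, done here by hand for a hypothetical table t
hash-partitioned on a column a.)

do $$
begin
    for r in 0..7 loop
        execute format(
            'create table t_p%s partition of t for values with (modulus 8, remainder %s)',
            r, r);
    end loop;
end
$$;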

Btw, there was even a discussion started recently to discuss the
user-level feature:

Subject: Creating partitions automatically at least on HASH?
https://www.postgresql.org/message-id/alpine.DEB.2.21.1907150711080.22273%40lancre

Thanks,
Amit

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobGH4zK27y42gGbtvfWFPnATHcocMZ%3DHkJF51KLkKY_xw%40mail.gmail.com




Re: partition routing layering in nodeModifyTable.c

2019-08-06 Thread Etsuro Fujita
Amit-san,

On Wed, Aug 7, 2019 at 10:24 AM Amit Langote  wrote:
> On Tue, Aug 6, 2019 at 9:56 PM Etsuro Fujita  wrote:
> >  What
> > I'm thinking for the setrefs.c change is to modify ForeignScan (ie,
> > set_foreignscan_references) rather than ModifyTable, like the
> > attached.
>
> Thanks for the patch.  I have couple of comments:
>
> * I'm afraid that we've implicitly created an ordering constraint on
> some code in set_plan_refs().  That is, a ModifyTable's plans now must
> always be processed before adding its result relations to the global
> list, which for good measure, should be written down somewhere; I
> propose this comment in the ModifyTable's case block in set_plan_refs:
>
> @@ -877,6 +877,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int 
> rtoffset)
>  rc->rti += rtoffset;
>  rc->prti += rtoffset;
>  }
> +/*
> + * Caution: Do not change the relative ordering of this loop
> + * and the statement below that adds the result relations to
> + * root->glob->resultRelations, because we need to use the
> + * current value of list_length(root->glob->resultRelations)
> + * in some plans.
> + */
>  foreach(l, splan->plans)
>  {
>  lfirst(l) = set_plan_refs(root,

+1

> * Regarding setting ForeignScan.resultRelIndex even for non-direct
> modifications, maybe that's not a good idea anymore.  A foreign table
> result relation might be involved in a local join, which prevents it
> from being directly-modifiable and also hides the ForeignScan node
> from being easily modifiable in PlanForeignModify.  Maybe, we should
> just interpret resultRelIndex as being set only when
> direct-modification is feasible.

Yeah, I think so; when using PlanForeignModify because for example,
the foreign table result relation is involved in a local join, as you
mentioned, ForeignScan.operation would be left unchanged (ie,
CMD_SELECT), so to me it's more understandable to not set
ForeignScan.resultRelIndex.

> Should we rename the field
> accordingly to be self-documenting?

IMO I like the name resultRelIndex, but do you have any better idea?

> > @@ -895,6 +898,12 @@ BeginDirectModify(ForeignScanState *node,
> >   for ExplainDirectModify and EndDirectModify.
> >  
> >
> > +
> > + Also note that it's a good idea to store the rinfo
> > + in the fdw_state for
> > + IterateDirectModify to use.
> > +
> >
> > Actually, if the FDW only supports direct modifications for queries
> > without RETURNING, it wouldn't need the rinfo in IterateDirectModify,
> > so I think we would probably need to update this as such.  Having said
> > that, it seems too detailed to me to describe such a thing in the FDW
> > documentation.  To avoid making the documentation verbose, it would be
> > better to not add such kind of thing at all?
>
> Hmm OK.  Perhaps, others who want to implement the direct modification
> API can work that out by looking at postgres_fdw implementation.

Yeah, I think so.

Best regards,
Etsuro Fujita




Re: no default hash partition

2019-08-06 Thread Amit Langote
Hi Alvaro,

On Wed, Aug 7, 2019 at 7:27 AM Alvaro Herrera  wrote:
>
> Given the discussion starting at
> https://postgr.es/m/cafjfprdbiqjzm8sg9+s0x8re-afhds6mflgguw0wvunlgrv...@mail.gmail.com
> we don't have default-partition support with the hash partitioning
> scheme.  That seems a reasonable outcome, but I think we should have a
> comment about it (I had to search the reason for this restriction in the
> hash-partitioning patch set).

That hash-partitioned tables can't have default partition is mentioned
in the CREATE TABLE page:

"If DEFAULT is specified, the table will be created as a default
partition of the parent table. The parent can either be a list or
range partitioned table. A partition key value not fitting into any
other partition of the given parent will be routed to the default
partition. There can be only one default partition for a given parent
table."

>  How about the attached?  Does anyone see
> a reason to make this more verbose, and if so to what?

If the outcome of this discussion is that we expand our internal
documentation of why there's no default hash partition, then should we
also expand the user documentation somehow?
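
To be clear, the user-visible behavior today is just an error at
partition creation time, roughly (made-up table names):

CREATE TABLE h (a int) PARTITION BY HASH (a);
CREATE TABLE h_default PARTITION OF h DEFAULT;  -- rejected: a hash-partitioned
                                                -- table cannot have a default
                                                -- partition

so the question is only whether the docs should also say a bit more
about why.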

Thanks,
Amit




Re: More issues with pg_verify_checksums and checksum verification in base backups

2019-08-06 Thread Michael Paquier
On Tue, Aug 06, 2019 at 04:20:43PM -0400, Stephen Frost wrote:
> On Tue, Aug 6, 2019 at 15:45 Magnus Hagander  wrote:
>> When agreement cannot be found, perhaps a parameter is in order?
>>
>> That is, have the tool complain about such files by default but with a
>> HINT that it may or may not be a problem, and a switch that makes it stop
>> complaining?
> 
> WFM.

Fine by me.  I'd also rather not change the behavior that we have now
without the switch.
--
Michael


signature.asc
Description: PGP signature


Re: Assertion for logically decoding multi inserts into the catalog

2019-08-06 Thread Michael Paquier
On Tue, Aug 06, 2019 at 03:08:48PM +0200, Daniel Gustafsson wrote:
> Thanks, this is a much better approach and it passes tests for me.  +1 on this
> version (regardless of outcome of the other patch as this is separate).

I had another look at this stuff this morning and applied the
patch.  Please note that I have kept the assertion that tupledata
cannot be NULL, and added a comment about why: it is not currently
possible to finish in a state where we have no tuple data in this
context, but it could become possible if we begin to use
multi-inserts with system catalogs, so the assertion is there to make
future patch authors careful about that.  We could in this case bypass
DecodeMultiInsert() if tupledata is NULL and assert that
XLH_INSERT_CONTAINS_NEW_TUPLE should not be set, or we could just
bypass the logic if XLH_INSERT_CONTAINS_NEW_TUPLE is not set at all.
Let's sort that out in your other patch.
--
Michael


signature.asc
Description: PGP signature


Re: partition routing layering in nodeModifyTable.c

2019-08-06 Thread Amit Langote
Fujita-san,

Thanks a lot for the review.

On Tue, Aug 6, 2019 at 9:56 PM Etsuro Fujita  wrote:
> On Mon, Aug 5, 2019 at 6:16 PM Amit Langote  wrote:
> > I first thought to set it
> > only if direct modification is being used, but maybe it'd be simpler
> > to set it even if direct modification is not used.  To set it, the
> > patch teaches set_plan_refs() to initialize resultRelIndex of
> > ForeignScan plans that appear under ModifyTable.  Fujita-san said he
> > plans to revise the planning of direct-modification style queries to
> > not require a ModifyTable node anymore, but maybe he'll just need to
> > add similar code elsewhere but not outside setrefs.c.
>
> Yeah, but I'm not sure this is a good idea:
>
> @@ -877,12 +878,6 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
> rc->rti += rtoffset;
> rc->prti += rtoffset;
> }
> -   foreach(l, splan->plans)
> -   {
> -   lfirst(l) = set_plan_refs(root,
> - (Plan *) lfirst(l),
> - rtoffset);
> -   }
>
> /*
>  * Append this ModifyTable node's final result relation RT
> @@ -908,6 +903,27 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int 
> rtoffset)
> lappend_int(root->glob->rootResultRelations,
> splan->rootRelation);
> }
> +
> +   resultRelIndex = splan->resultRelIndex;
> +   foreach(l, splan->plans)
> +   {
> +   lfirst(l) = set_plan_refs(root,
> + (Plan *) lfirst(l),
> + rtoffset);
> +
> +   /*
> +* For foreign table result relations, save their index
> +* in the global list of result relations into the
> +* corresponding ForeignScan nodes.
> +*/
> +   if (IsA(lfirst(l), ForeignScan))
> +   {
> +   ForeignScan *fscan = (ForeignScan *) lfirst(l);
> +
> +   fscan->resultRelIndex = resultRelIndex;
> +   }
> +   resultRelIndex++;
> +   }
> }
>
> because I still feel the same way as mentioned above by Andres.

Reading Andres' emails again, I now understand why we shouldn't set
ForeignScan's resultRelIndex the way my patches did.

>  What
> I'm thinking for the setrefs.c change is to modify ForeignScan (ie,
> set_foreignscan_references) rather than ModifyTable, like the
> attached.

Thanks for the patch.  I have couple of comments:

* I'm afraid that we've implicitly created an ordering constraint on
some code in set_plan_refs().  That is, a ModifyTable's plans now must
always be processed before adding its result relations to the global
list, which for good measure, should be written down somewhere; I
propose this comment in the ModifyTable's case block in set_plan_refs:

@@ -877,6 +877,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 rc->rti += rtoffset;
 rc->prti += rtoffset;
 }
+/*
+ * Caution: Do not change the relative ordering of this loop
+ * and the statement below that adds the result relations to
+ * root->glob->resultRelations, because we need to use the
+ * current value of list_length(root->glob->resultRelations)
+ * in some plans.
+ */
 foreach(l, splan->plans)
 {
 lfirst(l) = set_plan_refs(root,

* Regarding setting ForeignScan.resultRelIndex even for non-direct
modifications, maybe that's not a good idea anymore.  A foreign table
result relation might be involved in a local join, which prevents it
from being directly-modifiable and also hides the ForeignScan node
from being easily modifiable in PlanForeignModify.  Maybe, we should
just interpret resultRelIndex as being set only when
direct-modification is feasible.  Should we rename the field
accordingly to be self-documenting?

Please let me know your thoughts, so that I can modify the patch.

>  Maybe I'm missing something, but for direct modification
> without ModifyTable, I think we would probably only have to modify
> that function further so that it not only adjusts resultRelIndex but
> does some extra work such as appending the result relation RT index to
> root->glob->resultRelations as done for ModifyTable.

Yeah, that seems reasonable.

> > > Then we could just have BeginForeignModify, BeginDirectModify,
> > > BeginForeignScan all be called from ExecInitForeignScan().
>
> Sorry, previously, I mistakenly agreed with that.  As I said before, I
> think I was too tired.
>
> > I too think that 

Re: Refactoring code stripping trailing \n and \r from strings

2019-08-06 Thread Michael Paquier
On Tue, Aug 06, 2019 at 03:10:33PM -0400, Bruce Momjian wrote:
> On Thu, Aug  1, 2019 at 12:18:20PM +0900, Michael Paquier wrote:
>> b654714 has reworked the way we handle removal of CRLF for several
>> code paths, and has repeated the same code patterns to do that in 8
>> different places.  Could it make sense to refactor things as per the
>> attached with a new routine in common/string.c?
> 
> Yes, I think this is a good idea.

Thanks for the review, Bruce!  Tom, do you have any objections?
--
Michael


signature.asc
Description: PGP signature


Re: stress test for parallel workers

2019-08-06 Thread Thomas Munro
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane  wrote:
> > Do you have an example to hand?  Is this
> > failure always happening on Linux?
>
> I dug around a bit further, and while my recollection of a lot of
> "postmaster exited during a parallel transaction" failures is accurate,
> there is a very strong correlation I'd not noticed: it's just a few
> buildfarm critters that are producing those.  To wit, I find that
> string in these recent failures (checked all runs in the past 3 months):
>
>   sysname  |branch |  snapshot
> -----------+---------------+---------------------
>  lorikeet  | HEAD  | 2019-06-16 20:28:25
>  lorikeet  | HEAD  | 2019-07-07 14:58:38
>  lorikeet  | HEAD  | 2019-07-02 10:38:08
>  lorikeet  | HEAD  | 2019-06-14 14:58:24
>  lorikeet  | HEAD  | 2019-07-04 20:28:44
>  lorikeet  | HEAD  | 2019-04-30 11:00:49
>  lorikeet  | HEAD  | 2019-06-19 20:29:27
>  lorikeet  | HEAD  | 2019-05-21 08:28:26
>  lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
>  lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
>  lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
>  lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
>  lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
>  vulpes| HEAD  | 2019-06-14 09:18:18
>  vulpes| HEAD  | 2019-06-27 09:17:19
>  vulpes| HEAD  | 2019-07-21 09:01:45
>  vulpes| HEAD  | 2019-06-12 09:11:02
>  vulpes| HEAD  | 2019-07-05 08:43:29
>  vulpes| HEAD  | 2019-07-15 08:43:28
>  vulpes| HEAD  | 2019-07-19 09:28:12
>  wobbegong | HEAD  | 2019-06-09 20:43:22
>  wobbegong | HEAD  | 2019-07-02 21:17:41
>  wobbegong | HEAD  | 2019-06-04 21:06:07
>  wobbegong | HEAD  | 2019-07-14 20:43:54
>  wobbegong | HEAD  | 2019-06-19 21:05:04
>  wobbegong | HEAD  | 2019-07-08 20:55:18
>  wobbegong | HEAD  | 2019-06-28 21:18:46
>  wobbegong | HEAD  | 2019-06-02 20:43:20
>  wobbegong | HEAD  | 2019-07-04 21:01:37
>  wobbegong | HEAD  | 2019-06-14 21:20:59
>  wobbegong | HEAD  | 2019-06-23 21:36:51
>  wobbegong | HEAD  | 2019-07-18 21:31:36
> (32 rows)
>
> We already knew that lorikeet has its own peculiar stability
> problems, and these other two critters run different compilers
> on the same Fedora 27 ppc64le platform.
>
> So I think I've got to take back the assertion that we've got
> some lurking generic problem.  This pattern looks way more
> like a platform-specific issue.  Overaggressive OOM killer
> would fit the facts on vulpes/wobbegong, perhaps, though
> it's odd that it only happens on HEAD runs.

chipmunk also:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk=2019-08-06%2014:16:16

I wondered if the build farm should try to report OOM kill -9 or other
signal activity affecting the postmaster.

On some systems (depending on sysctl kernel.dmesg_restrict on Linux,
security.bsd.unprivileged_read_msgbuf on FreeBSD etc) you can run
dmesg as a non-root user, and there the OOM killer's footprints or
signaled exit statuses for processes under init might normally be found,
but that seems a bit invasive for the host system (I guess you'd
filter it carefully).  Unfortunately it isn't enabled on many common
systems anyway.

Maybe there is a systemd-specific way to get the info we need without
being root?

Another idea: start the postmaster under a subreaper (Linux 3.4+
prctl(PR_SET_CHILD_SUBREAPER), FreeBSD 10.2+
procctl(PROC_REAP_ACQUIRE)) that exists just to report on its
children's exit status, so the build farm could see "pid XXX was
killed by signal 9" message if it is nuked by the OOM killer.  Perhaps
there is a common subreaper wrapper out there that would wait, print
messages like that, rinse and repeat until it has no children and then
exit, or perhaps pg_ctl or even a perl script could do something like
that if requested.  Another thought, not explored, is the brand new
Linux pidfd stuff that can be used to wait and get exit status for a
non-child process (or the older BSD equivalent), but the paint isn't
even dry on that stuff anyway.

--
Thomas Munro
https://enterprisedb.com




Re: no default hash partition

2019-08-06 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Hmm.  So given the point about it being hard to predict which hash
> >> partitions would receive what values ... under what circumstances
> >> would it be sensible to not create a full set of partitions?  Should
> >> we just enforce that there is a full set, somehow?
> 
> > I imagine there's good reasons this wasn't just done (for this or
> > various other things), but couldn't we enforce it by just creating them
> > all..?  Sure would simplify a lot of things for users.  Similairly for
> > list partitions, I would think.
> 
> Well, with lists Alvaro's point holds: you might know a priori that
> some of the values are infrequent and don't deserve their own partition.
> The thing about hash is that the entries should (in theory) get spread
> out to all partitions pretty evenly, so it's hard to see why a user
> would want to treat any partition differently from any other.

Yeah, that's a fair argument, but giving the user a way to say that
would address it.  As in, "create me a list-partitioned table for these
values, plus a default."  Anyhow, I'm sure that I'm taking this beyond
what we need to do right now, just sharing where I think it'd be good
for things to go.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: no default hash partition

2019-08-06 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Hmm.  So given the point about it being hard to predict which hash
>> partitions would receive what values ... under what circumstances
>> would it be sensible to not create a full set of partitions?  Should
>> we just enforce that there is a full set, somehow?

> I imagine there's good reasons this wasn't just done (for this or
> various other things), but couldn't we enforce it by just creating them
> all..?  Sure would simplify a lot of things for users.  Similairly for
> list partitions, I would think.

Well, with lists Alvaro's point holds: you might know a priori that
some of the values are infrequent and don't deserve their own partition.
The thing about hash is that the entries should (in theory) get spread
out to all partitions pretty evenly, so it's hard to see why a user
would want to treat any partition differently from any other.

regards, tom lane




Re: partition routing layering in nodeModifyTable.c

2019-08-06 Thread Andres Freund
Hi,

On 2019-08-05 18:16:10 +0900, Amit Langote wrote:
> The patch adds a resultRelIndex field to ForeignScan node, which is
> set to >= 0 value for non-SELECT queries.  I first thought to set it
> only if direct modification is being used, but maybe it'd be simpler
> to set it even if direct modification is not used.

Yea, I think we should just always set it.


> To set it, the
> patch teaches set_plan_refs() to initialize resultRelIndex of
> ForeignScan plans that appear under ModifyTable.  Fujita-san said he
> plans to revise the planning of direct-modification style queries to
> not require a ModifyTable node anymore, but maybe he'll just need to
> add similar code elsewhere but not outside setrefs.c.

I think I prefer the approach in Fujita-san's email. While not extremely
pretty either, it would allow for having nodes between the foreign scan
and the modify node.


> > Then we could just have BeginForeignModify, BeginDirectModify,
> > BeginForeignScan all be called from ExecInitForeignScan().
> 
> I too think that it would've been great if we could call both
> BeginForeignModify and BeginDirectModify from ExecInitForeignScan, but
> the former's API seems to be designed to be called from
> ExecInitModifyTable from the get-go.  Maybe we should leave that
> as-is?

Yea, we should leave it where it is.  I think the API here is fairly
ugly, but it's probably not worth changing. And if we were to change it,
it'd need a lot bigger hammer.

Greetings,

Andres Freund




Re: dropdb --force

2019-08-06 Thread Ryan Lambert
I set the status to Waiting on Author since Tom's concerns [1] have not been 
addressed.

[1] https://www.postgresql.org/message-id/15707.1564024305%40sss.pgh.pa.us

Thanks,
Ryan

Re: no default hash partition

2019-08-06 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Alvaro Herrera  writes:
> > On 2019-Aug-06, Tom Lane wrote:
> >> Seems like "it's likely to cause trouble for users" is just going to
> >> beg the question "why?".  Can we explain the hazard succinctly?
> >> Or point to a comment somewhere else that explains it?
> 
> > Right ... the "trouble" is just that if the user later wants to add the
> > missing partitions, they'll need to acquire some strong lock (IIRC it's AEL)
> > in the partitioned table, so it effectively means an outage.  With
> > list/range partitioning, there's the slight advantage that you don't
> > have to guess all your partitions in advance, or cover data values that
> > are required for a very small number of rows.  In hash partitioning you
> can't really predict which values those are going to be, and the set of
> > missing partitions is perfectly known.
> 
> Hmm.  So given the point about it being hard to predict which hash
> partitions would receive what values ... under what circumstances
> would it be sensible to not create a full set of partitions?  Should
> we just enforce that there is a full set, somehow?

I imagine there's good reasons this wasn't just done (for this or
various other things), but couldn't we enforce it by just creating them
all..?  Sure would simplify a lot of things for users.  Similarly for
list partitions, I would think.  Again, I feel like there's probably a
reason why it doesn't just work(tm) like that, but it sure would be
nice.

Of course, there's the other side of things where it'd sure be nice to
automatically have partitions created for time-based partitions when
appropriate (yes, basically doing what pg_partman already does, but in
core somehow..), but for hash partitions we don't need to deal with
that.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: no default hash partition

2019-08-06 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Aug-06, Tom Lane wrote:
>> Seems like "it's likely to cause trouble for users" is just going to
>> beg the question "why?".  Can we explain the hazard succinctly?
>> Or point to a comment somewhere else that explains it?

> Right ... the "trouble" is just that if the user later wants to add the
> missing partitions, they'll need to acquire some strong lock (IIRC it's AEL)
> in the partitioned table, so it effectively means an outage.  With
> list/range partitioning, there's the slight advantage that you don't
> have to guess all your partitions in advance, or cover data values that
> are required for a very small number of rows.  In hash partitioning you
> can't really predict which values those are going to be, and the set of
> missing partitions is perfectly known.

Hmm.  So given the point about it being hard to predict which hash
partitions would receive what values ... under what circumstances
would it be sensible to not create a full set of partitions?  Should
we just enforce that there is a full set, somehow?

regards, tom lane




Re: no default hash partition

2019-08-06 Thread Alvaro Herrera
On 2019-Aug-06, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Given the discussion starting at
> > https://postgr.es/m/cafjfprdbiqjzm8sg9+s0x8re-afhds6mflgguw0wvunlgrv...@mail.gmail.com
> > we don't have default-partition support with the hash partitioning
> > scheme.  That seems a reasonable outcome, but I think we should have a
> > comment about it (I had to search the reason for this restriction in the
> > hash-partitioning patch set).  How about the attached?  Does anyone see
> > a reason to make this more verbose, and if so to what?
> 
> Seems like "it's likely to cause trouble for users" is just going to
> beg the question "why?".  Can we explain the hazard succinctly?
> Or point to a comment somewhere else that explains it?

Right ... the "trouble" is just that if the user later wants to add the
missing partitions, they'll need to acquire some strong lock (IIRC it's AEL)
in the partitioned table, so it effectively means an outage.  With
list/range partitioning, there's the slight advantage that you don't
have to guess all your partitions in advance, or cover data values that
are required for a very small number of rows.  In hash partitioning you
can't really predict which values those are going to be, and the set of
missing partitions is perfectly known.

Not enlightened enough ATM for a succinct enough explanation, but I'll
take suggestions.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: FETCH FIRST clause PERCENT option

2019-08-06 Thread Ryan Lambert
Surafel,

> The patch does not do it automatically. It's currently the query writer's
> obligation to do that.


Ok.
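
(Just to spell that out with a made-up table t: with PERCENT = 100 the
two queries below return the same rows, so the query writer can simply
drop the clause; as the patch stands, the first form still goes through
the percentage code path.)

SELECT * FROM t ORDER BY a FETCH FIRST 100 PERCENT ROWS ONLY;
SELECT * FROM t ORDER BY a;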

Your latest patch [1] passes make installcheck-world; I didn't test the
actual functionality this round.

   plan = (Plan *) make_limit(plan,
> subparse->limitOffset,
> -   subparse->limitCount);
> +   subparse->limitCount,
> +   subparse->limitOption);
>


I assume the limit percentage number goes into subparse->limitCount?  If
so, I don't see that documented.  Or does this truly only store the count?

The remainder of the code seems to make sense.  I attached a patch with a
few minor changes in the comments.  I need to go back to my notes and see
if I covered all the testing I had thought of, I should get to that later
this week.

[1]
https://www.postgresql.org/message-id/attachment/103028/percent-incremental-v6.patch

*Ryan Lambert*


>


percent-incremental-v6-comment-cleanup.patch
Description: Binary data


remove "msg" parameter from convert_tuples_by_name

2019-08-06 Thread Alvaro Herrera
Hello, here's a pretty trivial cleanup.

Currently, you have to pass the errmsg text to convert_tuples_by_name
and convert_tuples_by_position that's going to be raised if the tuple
descriptors don't match.  In the latter's case that makes sense, as each
case is pretty specific and tailored messages can be offered, so this is
useful.

However, in the case of convert_tuples_by_name, it seems we don't have
enough control over what is being called, so there's no way to
produce tailored messages -- all the callers are using the same generic
wording: "could not convert row type".

This code was introduced by dcb2bda9b704; I think back then we were
thinking that it would be possible to give different error messages for
different cases (as convert_tuples_by_position was already doing then),
however it seems clear now that that'll never happen.

I propose we get rid of it by having convert_tuples_by_name supply the
error message by itself, as in the attached patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From eb4da7ddd5c596ec8585761f02e55d2f03262f62 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 26 Jun 2019 16:36:50 -0400
Subject: [PATCH] Remove 'msg' param from convert_tuples_by_name and friends

---
 src/backend/access/common/tupconvert.c | 17 +++--
 src/backend/catalog/partition.c|  3 +--
 src/backend/commands/analyze.c |  3 +--
 src/backend/commands/indexcmds.c   |  3 +--
 src/backend/commands/tablecmds.c   | 24 
 src/backend/executor/execExprInterp.c  |  4 +---
 src/backend/executor/execMain.c| 12 
 src/backend/executor/execPartition.c   | 18 ++
 src/backend/executor/nodeModifyTable.c |  3 +--
 src/include/access/tupconvert.h|  9 +++--
 10 files changed, 33 insertions(+), 63 deletions(-)

diff --git a/src/backend/access/common/tupconvert.c b/src/backend/access/common/tupconvert.c
index 8cda16431c..0ec9cd5870 100644
--- a/src/backend/access/common/tupconvert.c
+++ b/src/backend/access/common/tupconvert.c
@@ -203,15 +203,14 @@ convert_tuples_by_position(TupleDesc indesc,
  */
 TupleConversionMap *
 convert_tuples_by_name(TupleDesc indesc,
-	   TupleDesc outdesc,
-	   const char *msg)
+	   TupleDesc outdesc)
 {
 	TupleConversionMap *map;
 	AttrNumber *attrMap;
 	int			n = outdesc->natts;
 
 	/* Verify compatibility and prepare attribute-number map */
-	attrMap = convert_tuples_by_name_map_if_req(indesc, outdesc, msg);
+	attrMap = convert_tuples_by_name_map_if_req(indesc, outdesc);
 
 	if (attrMap == NULL)
 	{
@@ -244,8 +243,7 @@ convert_tuples_by_name(TupleDesc indesc,
  */
 AttrNumber *
 convert_tuples_by_name_map(TupleDesc indesc,
-		   TupleDesc outdesc,
-		   const char *msg)
+		   TupleDesc outdesc)
 {
 	AttrNumber *attrMap;
 	int			outnatts;
@@ -299,7 +297,7 @@ convert_tuples_by_name_map(TupleDesc indesc,
 if (atttypid != inatt->atttypid || atttypmod != inatt->atttypmod)
 	ereport(ERROR,
 			(errcode(ERRCODE_DATATYPE_MISMATCH),
-			 errmsg_internal("%s", _(msg)),
+			 errmsg("could not convert row type"),
 			 errdetail("Attribute \"%s\" of type %s does not match corresponding attribute of type %s.",
 	   attname,
 	   format_type_be(outdesc->tdtypeid),
@@ -311,7 +309,7 @@ convert_tuples_by_name_map(TupleDesc indesc,
 		if (attrMap[i] == 0)
 			ereport(ERROR,
 	(errcode(ERRCODE_DATATYPE_MISMATCH),
-	 errmsg_internal("%s", _(msg)),
+	 errmsg("could not convert row type"),
 	 errdetail("Attribute \"%s\" of type %s does not exist in type %s.",
 			   attname,
 			   format_type_be(outdesc->tdtypeid),
@@ -327,8 +325,7 @@ convert_tuples_by_name_map(TupleDesc indesc,
  */
 AttrNumber *
 convert_tuples_by_name_map_if_req(TupleDesc indesc,
-  TupleDesc outdesc,
-  const char *msg)
+  TupleDesc outdesc)
 {
 	AttrNumber *attrMap;
 	int			n = outdesc->natts;
@@ -336,7 +333,7 @@ convert_tuples_by_name_map_if_req(TupleDesc indesc,
 	bool		same;
 
 	/* Verify compatibility and prepare attribute-number map */
-	attrMap = convert_tuples_by_name_map(indesc, outdesc, msg);
+	attrMap = convert_tuples_by_name_map(indesc, outdesc);
 
 	/*
 	 * Check to see if the map is one-to-one, in which case we need not do a
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 07b8223348..53af14e939 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -210,8 +210,7 @@ map_partition_varattnos(List *expr, int fromrel_varno,
 		AttrNumber *part_attnos;
 
 		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
- RelationGetDescr(from_rel),
- gettext_noop("could not convert row type"));
+ RelationGetDescr(from_rel));
 		expr = (List *) map_variable_attnos((Node *) expr,
 			fromrel_varno, 0,
 			

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-06 Thread Bruce Momjian
On Tue, Aug  6, 2019 at 06:13:30PM -0400, Jonathan Katz wrote:
> Hi,
> 
> On 8/6/19 3:01 PM, Bruce Momjian wrote:
> > On Tue, Aug  6, 2019 at 01:55:38PM -0400, Bruce Momjian wrote:
> >> CTR mode creates a bit stream for the first 16 bytes with nonce of
> >> (segment_number, counter = 0), and the next 16 bytes with 
> >> (segment_number, counter = 1), etc.  We only XOR using the parts of the
> >> bit stream we want to use.  We don't care what the WAL content is --- we
> >> just XOR it with the stream with the matching counter for that part of
> >> the WAL.
> > 
> > The diagram which is part of this section might be helpful:
> > 
> > 
> > https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_(CTR)
> > 
> > https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#/media/File:CTR_encryption_2.svg
> 
> This is going to be a slightly long (understatement) email that I
> thought would be easier to try to communicate all in one place vs.
> replying to individual parts on this long thread. My main goal was to
> present some things I had researched on TDE, some of which had been
> mentioned on thread, and compile it in one place (it's also why I was
> slow to respond on some other things on the thread -- sorry!)

This basically tries to re-litigate many discussions we have already
had, and I don't see much value in replying point by point.  It
relitigates:

*  table/tablespace-level encryption keys (single WAL file and unlocked
keys for recovery)

*  CTR mode

*  Authentication of data (we decided we would not do this for v1 of
this feature)

* Use of something like "ssl_passphrase"

If you want to relitigate something, you will need to state that, and
reference the previous arguments in explaining your disagreement.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: no default hash partition

2019-08-06 Thread Tom Lane
Alvaro Herrera  writes:
> Given the discussion starting at
> https://postgr.es/m/cafjfprdbiqjzm8sg9+s0x8re-afhds6mflgguw0wvunlgrv...@mail.gmail.com
> we don't have default-partition support with the hash partitioning
> scheme.  That seems a reasonable outcome, but I think we should have a
> comment about it (I had to search the reason for this restriction in the
> hash-partitioning patch set).  How about the attached?  Does anyone see
> a reason to make this more verbose, and if so to what?

Seems like "it's likely to cause trouble for users" is just going to
beg the question "why?".  Can we explain the hazard succinctly?
Or point to a comment somewhere else that explains it?

regards, tom lane




no default hash partition

2019-08-06 Thread Alvaro Herrera
Given the discussion starting at
https://postgr.es/m/cafjfprdbiqjzm8sg9+s0x8re-afhds6mflgguw0wvunlgrv...@mail.gmail.com
we don't have default-partition support with the hash partitioning
scheme.  That seems a reasonable outcome, but I think we should have a
comment about it (I had to search the reason for this restriction in the
hash-partitioning patch set).  How about the attached?  Does anyone see
a reason to make this more verbose, and if so to what?

... unless somebody wants to argue that we should have the feature; if
so please share your patch.
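
For reference, what that check rejects looks like this (a sketch; the
exact error wording is from memory):

  CREATE TABLE ph (a int) PARTITION BY HASH (a);
  CREATE TABLE ph_default PARTITION OF ph DEFAULT;
  -- ERROR:  a hash-partitioned table may not have a default partition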

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From f1f5ceb124b8662f63bebc6808bcf0472b579eaa Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 6 Aug 2019 12:25:56 -0400
Subject: [PATCH] Add comment on no default partition with hash partitioning

---
 src/backend/parser/parse_utilcmd.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index a9b2f8bacd..83c8048387 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3727,6 +3727,10 @@ transformPartitionBound(ParseState *pstate, Relation parent,
 
 	if (spec->is_default)
 	{
+		/*
+		 * Hash partitioning does not support a default partition; there's no
+		 * use case for it and it's likely to cause trouble for users.
+		 */
 		if (strategy == PARTITION_STRATEGY_HASH)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
-- 
2.17.1



Re: Problem with default partition pruning

2019-08-06 Thread Alvaro Herrera
On 2019-Aug-06, Alvaro Herrera wrote:

> Well, if this is really all that duplicative, one thing we could do is
> run this check in get_partprune_steps_internal only if
> constraint_exclusion is a value other than on; we should achieve the
> same effect with no repetition.  Patch for that is attached.  However,
> if I run the server with constraint_exclusion=on, the regression test
> fail with the attached diff.  I didn't look at what test is failing, but
> it seems to me that it's not really duplicative in all cases, only some.
> Therefore we can't do it.

Right ... One of the failing cases is one that was benefitting from
constraint_exclusion in the location where we were doing it previously.

I think trying to fix this problem by selectively moving where we apply
constraint exclusion would be very bug-prone, and it would be hard to detect
whether we're missing one spot or doing it multiple times in some other cases.
So I think we shouldn't try.  If this is a real problem, then we should
add a flag to the reloptinfo and set it when we've done pruning, then
do nothing if we already did it.  I'm not sure that this is correct, and
I'm even less sure that it is worth the trouble.

In short, I propose to get this done as the patch I posted in
https://postgr.es/m/20190806133053.GA23706@alvherre.pgsql

Cheers

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-06 Thread Jonathan S. Katz
Hi,

On 8/6/19 3:01 PM, Bruce Momjian wrote:
> On Tue, Aug  6, 2019 at 01:55:38PM -0400, Bruce Momjian wrote:
>> CTR mode creates a bit stream for the first 16 bytes with nonce of
>> (segment_number, counter = 0), and the next 16 bytes with 
>> (segment_number, counter = 1), etc.  We only XOR using the parts of the
>> bit stream we want to use.  We don't care what the WAL content is --- we
>> just XOR it with the stream with the matching counter for that part of
>> the WAL.
> 
> The diagram which is part of this section might be helpful:
> 
>   
> https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_(CTR)
>   
> https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#/media/File:CTR_encryption_2.svg

This is going to be a slightly long (understatement) email that I
thought would be easier to try to communicate all in one place vs.
replying to individual parts on this long thread. My main goal was to
present some things I had researched on TDE, some of which had been
mentioned on thread, and compile it in one place (it's also why I was
slow to respond on some other things on the thread -- sorry!)

While compiling this note, one idea that came to mind is that given the
complexity of this topic and getting the key pieces (no pun intended)
correct, it may be worthwhile if the people interested in working on TDE
make some time where we can all either get together and/or set up a
group call where we hash out the architecture we want to build to ensure
we have a fundamentally sound implementation (even if it takes a few
versions to implement it fully).

Since my last note, I really dove in to understand what other RDBMS
systems are doing and what we can learn from them, as well as what would
make sense in PostgreSQL. In particular, one goal is to be able to build
a TDE system that keeps data at rest both confidential and
integrity-protected while minimizing the amount of overhead we
introduce into the system.

Additionally, I would strongly suggest, if not require, that what we
build follows guidelines such as those outlined by NIST, as failure to
do so could end up that some teams would be unable to utilize our TDE
solution. And of course, mitigate the risk that we introduce security
vulnerabilities :)

(I've also continued to build out my terrible prototype to experiment
with some of the methods suggested. It's written in Python and leverages
the "cryptography"[0] library (which itself has some good
recommendations on how to use its various parts), and it's still not worth
sharing yet (though happy to share if asked off-list -- you will be
underwhelmed).)

Below I outline some of my findings from looking at other systems,
looking at our own code, and make recommendations to the best of my
abilities on this matter. I broke it up into these 3 sections, which are
interspersed with research and recommendations:

1. Encryption Key Management
2. Encryption/Decryption of Pages + WAL Records
3. Automated Key Rotation

Of course, they are tightly intertwined, but thought it would be easier
to look at it in this way. It does stop short of certain implementation
details.

Anyway, without further ado:

#1 Encryption Key Management


While I thought what we had proposed on list (KEK, MDEK, TDEK, WDEK)
system made a lot of sense, I still decided to look at what other
systems did. In particular, I looked at SQL Server and the "redwood
city" database.

It turns out that SQL Server has a somewhat similar architecture to what
we propose[1]:

https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/transparent-data-encryption?view=sql-server-2017

- A Service Master Key (equivalent to the key-encrypting key KEK) is
created when SQL Server is setup (equivalent to initdb)
- SQL Server uses Windows facilities to protect the master cluster key.
We are trying to achieve this with our KEK, which we are saying we would
store in the GUC `tde_passphrase` (or equivalent, I forget the name we
proposed, if we proposed one. Sorry.). If we wanted to protect it at the
OS level, Peter Eisentraut shows an equivalent way to do this with the
"ssl_passphrase" GUC and systemd[2]
- SQL Server lets you create a "master key" for all the clusters, explicitly
- Here is where the differences start: SQL Server then lets you create a
certificate that can be protected by the master key.
- For the specific databases in the SQL Server cluster, you can then
create a "database encryption key" (they call it "DEK") that is used to
encrypt the data, and ask that said key is protected by a certificate.

(For SQL Server, it appears that TDE is all or nothing within a
database, but *not* within the entire cluster.)

SQL Server does support a multi-key approach which allows for, amongst
other things, key rotation. For what we are looking to do, it may be too
much, but it seems similar to the ideas proposed.

The redwood city database also has something similar to what we are
proposing[3]:


Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-08-06 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Ian Barwick  writes:
> >>> I dislike the special-casing of ALTER SYSTEM here, where we're basically
> >>> saying that only ALTER SYSTEM is allowed to do this cleanup and that if
> >>> such cleanup is wanted then ALTER SYSTEM must be run.
> 
> > This is just saying what ALTER SYSTEM will do, which IMHO we should describe
> > somewhere. Initially when I stated working with pg.auto.conf I had
> > my application append a comment line to show where the entries came from,
> > but not having any idea how pg.auto.conf was modified at that point, I was
> > wondering why the comment subsequently disappeared. Perusing the source 
> > code has
> > explained that for me, but would be mighty useful to document that.
> 
> I feel fairly resistant to making the config.sgml explanation much longer
> than what I wrote.  That chapter is material that every Postgres DBA has
> to absorb, so we should *not* be burdening it with stuff that few people
> need to know.

Sure, I agree with that.

> Perhaps we could put some of these details into the Notes section of the
> ALTER SYSTEM ref page.  But I wonder how much of this is needed at all.

I'd be alright with that too, but I'd be just as fine with even a README
or something that we feel other hackers and external tool developers
would be likely to find.  I agree that all of this isn't something that
your run-of-the-mill DBA needs to know, but they are things that I'm
sure external tool authors will care about (including myself, David S,
probably the other backup/restore tool maintainers, and at least the
author of pg_conftool, presumably).

Of course, for my 2c anyway, the "low level backup API" is in the same
realm as this stuff (though it's missing important things like "what
magic exit code do you return from archive command to make PG give up
instead of retry"...) and we've got a whole ton of text in our docs
about that.

Thanks,

Stephen




Re: POC: Cleaning up orphaned files using undo logs

2019-08-06 Thread Andres Freund
Hi,

On 2019-08-06 00:56:26 -0700, Andres Freund wrote:
> Out of energy.

Here's the last section of my low-level review. I plan to write a higher
level summary afterwards, now that I have a better picture of the code.


> +static void
> +UndoDiscardOneLog(UndoLogSlot *slot, TransactionId xmin, bool *hibernate)

I think the naming here is pretty confusing.  We have UndoDiscard(),
UndoDiscardOneLog(), UndoLogDiscard(). I don't think anybody really can
be expected to understand what is supposed to be what from these names.


> + /* Loop until we run out of discardable transactions in the given log. 
> */
> + do
> + {

for(;;) or while (true)


> + TransactionId wait_xid = InvalidTransactionId;
> + bool pending_abort = false;
> + bool request_rollback = false;
> + UndoStatus status;
> + UndoRecordFetchContext  context;
> +
> + next_insert = UndoLogGetNextInsertPtr(logno);
> +
> + /* There must be some undo data for a transaction. */
> + Assert(next_insert != undo_recptr);
> +
> + /* Fetch the undo record for the given undo_recptr. */
> + BeginUndoFetch();
> + uur = UndoFetchRecord(, undo_recptr);
> + FinishUndoFetch();
> +
> + if (uur != NULL)
> + {
> + if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED)

FWIW, this is precisely my problem with exposing such small
informational functions, which actually have to perform some work. As is,
there are several places looking up the underlying undo slot within just
these lines of code.

We do it once in UndoLogGetNextInsertPtr(). Then again in
UndoFetchRecord(). And then again in UndoRecPtrGetCategory(). And then
later again multiple times when actually discarding.  That perhaps
doesn't matter from a performance POV, but for me that indicates that
the APIs aren't quite right.


> + {
> + /*
> +  * For the "shared" category, we only discard 
> when the
> +  * rm_undo_status callback tells us we can.
> +  */

Is there a description as to what the rm_status callback is intended to
do? It currently is mandatory, is that intended?  Why does this only
apply to shared records? And why just for SHARED, not for any of the others?



> + else
> + {
> + TransactionId xid = 
> XidFromFullTransactionId(uur->uur_fxid);
> +
> + /*
> +  * Otherwise we use the CLOG and xmin to decide 
> whether to
> +  * wait, discard or roll back.
> +  *
> +  * XXX: We've added the transaction-in-progress 
> check to
> +  * avoid xids of in-progress autovacuum as 
> those are not
> +  * computed for oldestxmin calculation.

Hm. xids of autovacuum?  The concern here is the xid that autovacuum
might acquire when locking a relation for truncating a table at the end,
with wal_level=replica?  Because otherwise it shouldn't have any xids?



> See
> +  * DiscardWorkerMain.

Hm. This actually reminds me of a complaint I have about this. ISTM that
the logic for discarding itself should be separate from the discard
worker. I'd just add that, and a UDF to invoke it, in a separate commit.



> + /*
> +  * Add the aborted transaction to the rollback request 
> queues.
> +  *
> +  * We can ignore the abort for transactions whose 
> corresponding
> +  * database doesn't exist.
> +  */
> + if (request_rollback && 
> dbid_exists(uur->uur_txn->urec_dbid))
> + {
> + (void) RegisterUndoRequest(InvalidUndoRecPtr,
> + 
>undo_recptr,
> + 
>uur->uur_txn->urec_dbid,
> + 
>uur->uur_fxid);
> +
> + pending_abort = true;
> + }

As I, I think, said before: This imo should not be necessary.


> +
> + /*
> +  * We can discard upto this point when one of following 
> conditions is

*up to


> +  * met: (a) we need to wait for a transaction first. (b) there 
> is no
> +  * more log to process. (c) the transaction undo in current log 
> is
> +  * finished. (d) there is a pending abort.
> +  */

This comment is hard to understand. Perhaps you're missing some words?
Because it's e.g. 

Re: Cleanup of intro.sgml

2019-08-06 Thread Joshua D. Drake

Rev 2 attached.


Added:

SQL/JSON

SQL/XML

Fixed spelling mistakes

Fixed a missing closing tag.




--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
Postgres centered full stack support, consulting and development.
Advocate: @amplifypostgres || Get help: https://commandprompt.com/
* Unless otherwise stated, opinions are my own.   *

diff --git a/doc/src/sgml/intro.sgml b/doc/src/sgml/intro.sgml
index 3038826311..2a05985bb3 100644
--- a/doc/src/sgml/intro.sgml
+++ b/doc/src/sgml/intro.sgml
@@ -4,78 +4,77 @@
  Preface
 
  
-  This book is the official documentation of
+  This is the official documentation of
   PostgreSQL.  It has been written by the
-  PostgreSQL developers and other
-  volunteers in parallel to the development of the
-  PostgreSQL software.  It describes all
-  the functionality that the current version of
-  PostgreSQL officially supports.
+  PostgreSQL community in parallel with the 
+   development of the PostgreSQL database.  
+   It describes the functionality of current version of
+  PostgreSQL.
  
-
+	
  
-  To make the large amount of information about
-  PostgreSQL manageable, this book has been
-  organized in several parts.  Each part is targeted at a different
-  class of users, or at users in different stages of their
+  This book has been broken up into categories to better manage
+  the large amount of information about
+  PostgreSQL.  Each part is targeted at 
+  a different category of user, or at users in different stages of their
   PostgreSQL experience:
 
   

 
   is an informal introduction for new users.
+  It includes information on topics such as installation, architecture
+  and how to create your first database.
 

 

 
   documents the SQL query
- language environment, including data types and functions, as well
- as user-level performance tuning.  Every
- PostgreSQL user should read this.
+ language, including data types and functions. It also includes 
+ user-level performance tuning hints. All
+ PostgreSQL users should read this.
 

 

 
   describes the installation and
- administration of the server.  Everyone who runs a
- PostgreSQL server, be it for private
- use or for others, should read this part.
+ administration of the PostgreSQL server.  
+ Everyone who administrates a
+ PostgreSQL server, should read this part.
 

 

 
-  describes the programming
- interfaces for PostgreSQL client
- programs.
+  describes the C based programming
+ interfaces for PostgreSQL. Topics such as
+ Python or Go are not discussed in this section. 
 

 
 

 
-  contains information for
- advanced users about the extensibility capabilities of the
- server.  Topics include user-defined data types and
- functions.
+  contains information on
+ the extensibility of PostgreSQL.
+ server.  Topics include user-defined data types, procedural languages,
+ and triggers.
 

 

 
-  contains reference information about
- SQL commands, client and server programs.  This part supports
- the other parts with structured information sorted by command or
- program.
+  contains reference information including
+ SQL commands, client applications and server applications.  
 

 

 
   contains assorted information that might be of
- use to PostgreSQL developers.
+ use to PostgreSQL internals developers.
 

   
@@ -120,6 +119,33 @@
 

 
+   PostgreSQL also supports a comprehensive array of Enterprise level
+   features:
+
+   
+
+ Unstructured data via JSON and SQL/JSON
+
+
+ SQL/XML
+
+
+ Binary and Logical Replication
+
+
+ Hot Backups
+
+
+ Partitioning
+
+
+ Materialized Views
+
+
+ Procedures
+
+   
+
Also, PostgreSQL can be extended by the
user in many ways, for example by adding new
 
@@ -146,8 +172,8 @@
   
 
   
-   And because of the liberal license,
-   PostgreSQL can be used, modified, and
+   The liberal license allows
+   PostgreSQL to be used, modified, and
distributed by anyone free of charge for any purpose, be it
private, commercial, or academic.
   


Re: More issues with pg_verify_checksums and checksum verification in base backups

2019-08-06 Thread Stephen Frost
Greetings,

On Tue, Aug 6, 2019 at 15:45 Magnus Hagander  wrote:

> On Tue, Aug 6, 2019 at 6:07 PM Stephen Frost  wrote:
>
>> Greetings,
>>
>> * Andres Freund (and...@anarazel.de) wrote:
>> > On 2019-08-06 10:58:15 -0400, Stephen Frost wrote:
>> > > * Michael Banck (michael.ba...@credativ.de) wrote:
>> > > > Independently of the whitelist/blacklist question, I believe
>> > > > pg_checksums should not error out as soon as it encounters a weird
>> looking
>> > > > file, but either (i) still checksum it or (ii) skip it? Or is that
>> to be
>> > > > considered a pilot error and it's fine for pg_checksums to fold?
>> > >
>> > > imv, random files that we don't know about are exactly 'pilot error'
>> to
>> > > be complained about..  This is exactly why the whitelist idea falls
>> > > over.
>> >
>> > I still think this whole assumption is bad, and that you're fixing
>> > non-problems, and creating serious usability issues with zero benefits.
>>
>> I doubt we're going to get to agreement on this, unfortunately.
>>
>
> When agreement cannot be found, perhaps a parameter is in order?
>
> That is, have the tool complain about such files by default but with a
> HINT that it may or may not be a problem, and a switch that makes it stop
> complaining?
>

WFM.

Thanks!

Stephen

>


Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-08-06 Thread Tom Lane
Ian Barwick  writes:
> On 8/6/19 11:16 AM, Stephen Frost wrote:
>>> Erm, those are duplicates though and we're saying that ALTER SYSTEM
>>> removes those...  Seems like we should be normalizing the file to be
>>> consistent in this regard too.

> True. (Switches brain on)... Ah yes, with the patch previously provided
> by Tom, it seems to be just a case of replacing "strcmp" with 
> "guc_name_compare"
> to match the existing string; the name will be rewritten with the value 
> provided
> to ALTER SYSTEM, which will be normalized to lower case anyway.

Good catch.

>>> I dislike the special-casing of ALTER SYSTEM here, where we're basically
>>> saying that only ALTER SYSTEM is allowed to do this cleanup and that if
>>> such cleanup is wanted then ALTER SYSTEM must be run.

> This is just saying what ALTER SYSTEM will do, which IMHO we should describe
> somewhere. Initially when I stated working with pg.auto.conf I had
> my application append a comment line to show where the entries came from,
> but not having any idea how pg.auto.conf was modified at that point, I was
> wondering why the comment subsequently disappeared. Perusing the source code 
> has
> explained that for me, but would be mighty useful to document that.

I feel fairly resistant to making the config.sgml explanation much longer
than what I wrote.  That chapter is material that every Postgres DBA has
to absorb, so we should *not* be burdening it with stuff that few people
need to know.

Perhaps we could put some of these details into the Notes section of the
ALTER SYSTEM ref page.  But I wonder how much of this is needed at all.

regards, tom lane




Small patch to fix build on Windows

2019-08-06 Thread Dmitry Igrishin
Hi,

The attached self-documented patch fixes the build on Windows in the case
when the path to Python contains embedded spaces.
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index d1d0aed07e..76834f5188 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -495,7 +495,7 @@ sub mkvcbuild
 		my $pythonprog = "import sys;print(sys.prefix);"
 		  . "print(str(sys.version_info[0])+str(sys.version_info[1]))";
 		my $prefixcmd =
-		  $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
+		  "\"$solution->{options}->{python}\\python\" -c \"$pythonprog\"";
 		my $pyout = `$prefixcmd`;
 		die "Could not query for python version!\n" if $?;
 		my ($pyprefix, $pyver) = split(/\r?\n/, $pyout);
diff --git a/src/tools/msvc/Project.pm b/src/tools/msvc/Project.pm
index b5d1dc6e89..28893f072d 100644
--- a/src/tools/msvc/Project.pm
+++ b/src/tools/msvc/Project.pm
@@ -132,10 +132,9 @@ sub AddLibrary
 {
 	my ($self, $lib, $dbgsuffix) = @_;
 
-	if ($lib =~ m/\s/)
-	{
-		$lib = '' . $lib . "";
-	}
+	# Since VC automatically quotes paths specified as the data of
+	#  in VC project file, it's mistakably
+	# to quote them here. Thus, it's okay if $lib contains spaces.
 
 	push @{ $self->{libraries} }, $lib;
 	if ($dbgsuffix)


Re: More issues with pg_verify_checksums and checksum verification in base backups

2019-08-06 Thread Magnus Hagander
On Tue, Aug 6, 2019 at 6:07 PM Stephen Frost  wrote:

> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2019-08-06 10:58:15 -0400, Stephen Frost wrote:
> > > * Michael Banck (michael.ba...@credativ.de) wrote:
> > > > Independently of the whitelist/blacklist question, I believe
> > > > pg_checksums should not error out as soon as it encounters a weird
> looking
> > > > file, but either (i) still checksum it or (ii) skip it? Or is that
> to be
> > > > considered a pilot error and it's fine for pg_checksums to fold?
> > >
> > > imv, random files that we don't know about are exactly 'pilot error' to
> > > be complained about..  This is exactly why the whitelist idea falls
> > > over.
> >
> > I still think this whole assumption is bad, and that you're fixing
> > non-problems, and creating serious usability issues with zero benefits.
>
> I doubt we're going to get to agreement on this, unfortunately.
>

When agreement cannot be found, perhaps a parameter is in order?

That is, have the tool complain about such files by default but with a HINT
that it may or may not be a problem, and a switch that makes it stop
complaining?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Refactoring code stripping trailing \n and \r from strings

2019-08-06 Thread Bruce Momjian
On Thu, Aug  1, 2019 at 12:18:20PM +0900, Michael Paquier wrote:
> Hi Tom,
> 
> b654714 has reworked the way we handle removal of CLRF for several
> code paths, and has repeated the same code patterns to do that in 8
> different places.  Could it make sense to refactor things as per the
> attached with a new routine in common/string.c?

Yes, I think this is a good idea.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-06 Thread Bruce Momjian
On Tue, Aug  6, 2019 at 01:55:38PM -0400, Bruce Momjian wrote:
> CTR mode creates a bit stream for the first 16 bytes with nonce of
> (segment_number, counter = 0), and the next 16 bytes with 
> (segment_number, counter = 1), etc.  We only XOR using the parts of the
> bit stream we want to use.  We don't care what the WAL content is --- we
> just XOR it with the stream with the matching counter for that part of
> the WAL.

The diagram which is part of this section might be helpful:


https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_(CTR)

https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#/media/File:CTR_encryption_2.svg

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: intarray GiST index gets wrong answers for '{}' <@ anything

2019-08-06 Thread Tom Lane
Alexander Korotkov  writes:
> Users, who likes existing behavior of handling <@ operator in intarray
> opclasses, may be advised to rewrite their queries as following.

> "col <@ const" => "col <@ const AND col && const"

Oh, that's a good suggestion --- it will work, and work reasonably
well, with either unpatched or patched intarray code; and also with
some future version that doesn't consider <@ indexable at all.

regards, tom lane




Re: block-level incremental backup

2019-08-06 Thread Ibrar Ahmed
On Tue, Aug 6, 2019 at 11:31 PM Ibrar Ahmed  wrote:

>
> I have not looked at the patch in detail, but just some nits from my side.
>
> On Fri, Aug 2, 2019 at 6:13 PM vignesh C  wrote:
>
>> On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
>>  wrote:
>> >
>> > On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
>> jeevan.cha...@enterprisedb.com> wrote:
>> >>
>> >> I am almost done writing the patch for pg_combinebackup and will post
>> soon.
>> >
>> >
>> > Attached patch which implements the pg_combinebackup utility used to
>> combine
>> > full basebackup with one or more incremental backups.
>> >
>> > I have tested it manually and it works for all best cases.
>> >
>> > Let me know if you have any inputs/suggestions/review comments?
>> >
>> Some comments:
>> 1) There will be some link files created for tablespace, we might
>> require some special handling for it
>>
>> 2)
>> + while (numretries <= maxretries)
>> + {
>> + rc = system(copycmd);
>> + if (rc == 0)
>> + return;
>>
>> Use API to copy the file instead of "system", better to use the secure
> copy.
>
Ah, it is a local copy, simple copy API is enough.

>
>
>> + pg_log_info("could not copy, retrying after %d seconds",
>> + sleeptime);
>> + pg_usleep(numretries++ * sleeptime * 100L);
>> + }
>> Retry functionality is handled only for copying of full files, should
>> we handle retry for copying of partial files
>>
>> The log and the sleep time does not match, you are multiplying sleeptime
> with numretries++ and logging only "sleeptime"
>
> Why are we retrying here? Capture the proper copy error and act accordingly.
> Blindly retrying does not make sense.
>
> 3)
>> + maxretries = atoi(optarg);
>> + if (maxretries < 0)
>> + {
>> + pg_log_error("invalid value for maxretries");
>> + fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
>> + exit(1);
>> + }
>> + break;
>> + case 's':
>> + sleeptime = atoi(optarg);
>> + if (sleeptime <= 0 || sleeptime > 60)
>> + {
>> + pg_log_error("invalid value for sleeptime");
>> + fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"),
>> progname);
>> + exit(1);
>> + }
>> + break;
>> we can have some range for maxretries similar to sleeptime
>>
>> 4)
>> + fp = fopen(filename, "r");
>> + if (fp == NULL)
>> + {
>> + pg_log_error("could not read file \"%s\": %m", filename);
>> + exit(1);
>> + }
>> +
>> + labelfile = malloc(statbuf.st_size + 1);
>> + if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
>> + {
>> + pg_log_error("corrupted file \"%s\": %m", filename);
>> + free(labelfile);
>> + exit(1);
>> + }
>> Should we check for malloc failure
>>
>> Use pg_malloc instead of malloc
>
>
>> 5) Should we add display of progress as backup may take some time,
>> this can be added as enhancement. We can get other's opinion on this.
>>
>> Yes, we should, but this is not the right time to do that.
>
>
>> 6)
>> + if (nIncrDir == MAX_INCR_BK_COUNT)
>> + {
>> + pg_log_error("too many incremental backups to combine");
>> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> progname);
>> + exit(1);
>> + }
>> +
>> + IncrDirs[nIncrDir] = optarg;
>> + nIncrDir++;
>> + break;
>>
>> If the backup count increases providing the input may be difficult,
>> Shall user provide all the incremental backups from a parent folder
>> and can we handle the ordering of incremental backup internally
>>
>> Why we have that limit at first place?
>
>
>> 7)
>> + if (isPartialFile)
>> + {
>> + if (verbose)
>> + pg_log_info("combining partial file \"%s.partial\"", fn);
>> +
>> + combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
>> + }
>> + else
>> + copy_whole_file(infn, outfn);
>>
>> Add verbose for copying whole file
>>
>> 8) We can also check if approximate space is available in disk before
>> starting combine backup, this can be added as enhancement. We can get
>> other's opinion on this.
>>
>> 9)
>> + printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
>> (maximum %d)\n"), MAX_INCR_BK_COUNT);
>> + printf(_("  -o, --output-dir=DIRECTORY  combine backup into
>> directory\n"));
>> + printf(_("\nGeneral options:\n"));
>> + printf(_("  -n, --no-clean  do not clean up after
>> errors\n"));
>>
>> Combine backup into directory can be combine backup directory
>>
>> 10)
>> +/* Max number of incremental backups to be combined. */
>> +#define MAX_INCR_BK_COUNT 10
>> +
>> +/* magic number in incremental backup's .partial file */
>>
>> MAX_INCR_BK_COUNT can be increased little, some applications use 1
>> full backup at the beginning of the month and use 30 incremental
>> backups rest of the days in the month
>>
>> Regards,
>> Vignesh
>> EnterpriseDB: http://www.enterprisedb.com
>>
>>
>>
>
> --
> Ibrar Ahmed
>


-- 
Ibrar Ahmed


Re: block-level incremental backup

2019-08-06 Thread Ibrar Ahmed
I have not looked at the patch in detail, but just some nits from my side.

On Fri, Aug 2, 2019 at 6:13 PM vignesh C  wrote:

> On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
>  wrote:
> >
> > On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
> >>
> >> I am almost done writing the patch for pg_combinebackup and will post
> soon.
> >
> >
> > Attached patch which implements the pg_combinebackup utility used to
> combine
> > full basebackup with one or more incremental backups.
> >
> > I have tested it manually and it works for all best cases.
> >
> > Let me know if you have any inputs/suggestions/review comments?
> >
> Some comments:
> 1) There will be some link files created for tablespace, we might
> require some special handling for it
>
> 2)
> + while (numretries <= maxretries)
> + {
> + rc = system(copycmd);
> + if (rc == 0)
> + return;
>
> Use API to copy the file instead of "system", better to use the secure
copy.


> + pg_log_info("could not copy, retrying after %d seconds",
> + sleeptime);
> + pg_usleep(numretries++ * sleeptime * 100L);
> + }
> Retry functionality is handled only for copying of full files, should
> we handle retry for copying of partial files
>
> The log and the sleep time does not match, you are multiplying sleeptime
with numretries++ and logging only "sleeptime"

Why are we retrying here? Capture the proper copy error and act accordingly.
Blindly retrying does not make sense.

3)
> + maxretries = atoi(optarg);
> + if (maxretries < 0)
> + {
> + pg_log_error("invalid value for maxretries");
> + fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
> + exit(1);
> + }
> + break;
> + case 's':
> + sleeptime = atoi(optarg);
> + if (sleeptime <= 0 || sleeptime > 60)
> + {
> + pg_log_error("invalid value for sleeptime");
> + fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"),
> progname);
> + exit(1);
> + }
> + break;
> we can have some range for maxretries similar to sleeptime
>
> 4)
> + fp = fopen(filename, "r");
> + if (fp == NULL)
> + {
> + pg_log_error("could not read file \"%s\": %m", filename);
> + exit(1);
> + }
> +
> + labelfile = malloc(statbuf.st_size + 1);
> + if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
> + {
> + pg_log_error("corrupted file \"%s\": %m", filename);
> + free(labelfile);
> + exit(1);
> + }
> Should we check for malloc failure
>
> Use pg_malloc instead of malloc


> 5) Should we add display of progress as backup may take some time,
> this can be added as enhancement. We can get other's opinion on this.
>
> Yes, we should, but this is not the right time to do that.


> 6)
> + if (nIncrDir == MAX_INCR_BK_COUNT)
> + {
> + pg_log_error("too many incremental backups to combine");
> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
> progname);
> + exit(1);
> + }
> +
> + IncrDirs[nIncrDir] = optarg;
> + nIncrDir++;
> + break;
>
> If the backup count increases providing the input may be difficult,
> Shall user provide all the incremental backups from a parent folder
> and can we handle the ordering of incremental backup internally
>
> Why we have that limit at first place?


> 7)
> + if (isPartialFile)
> + {
> + if (verbose)
> + pg_log_info("combining partial file \"%s.partial\"", fn);
> +
> + combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
> + }
> + else
> + copy_whole_file(infn, outfn);
>
> Add verbose for copying whole file
>
> 8) We can also check if approximate space is available in disk before
> starting combine backup, this can be added as enhancement. We can get
> other's opinion on this.
>
> 9)
> + printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
> (maximum %d)\n"), MAX_INCR_BK_COUNT);
> + printf(_("  -o, --output-dir=DIRECTORY  combine backup into
> directory\n"));
> + printf(_("\nGeneral options:\n"));
> + printf(_("  -n, --no-clean  do not clean up after
> errors\n"));
>
> Combine backup into directory can be combine backup directory
>
> 10)
> +/* Max number of incremental backups to be combined. */
> +#define MAX_INCR_BK_COUNT 10
> +
> +/* magic number in incremental backup's .partial file */
>
> MAX_INCR_BK_COUNT can be increased little, some applications use 1
> full backup at the beginning of the month and use 30 incremental
> backups rest of the days in the month
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>
>
>

-- 
Ibrar Ahmed


Re: How am I supposed to fix this?

2019-08-06 Thread Larry Rosenman

On 08/06/2019 1:16 pm, Peter Geoghegan wrote:
> On Tue, Aug 6, 2019 at 11:11 AM Larry Rosenman  wrote:
>> As a followup, btcheck found another index that had issues, and a toast
>> table was missing a chunk.
>>
>> I have ALL the data I used to create this table still around so I just
>> dropped it and am reloading the data.
>
> It sounds like there is a generic storage issue at play here. Often
> TOAST data is the apparent first thing that gets corrupted, but that's
> only because the inconsistencies are relatively obvious.
>
> I suggest that you rerun amcheck using the same query, though this
> time specify "heapallindexed=true" to bt_index_check(). Increase
> maintenance_work_mem if it's set to a low value first (ideally you can
> crank it up to 600MB). This type of verification will take a lot
> longer, but will find more subtle inconsistencies that could easily be
> missed.
>
> Please let us know how this goes. I am always keen to hear about how
> much the tooling helps in the real world.

I've already dropped and re-created the table involved (a metric crapton
of DNS queries). I know why and how this happened as well.  I had to
fully restore my system, and bacula didn't catch all the data etc. since it
was being modified, and I didn't do the smart thing then and restore
from a pg_dump.



--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106




Re: intarray GiST index gets wrong answers for '{}' <@ anything

2019-08-06 Thread Alexander Korotkov
Hi!

On Tue, Aug 6, 2019 at 8:56 PM Tom Lane  wrote:
> The reason appears to be that the condition for descending through a
> non-leaf index key for the RTContainedBy case is incorrectly optimistic:
> it supposes that we only need to descend into subtrees whose union key
> overlaps the query array.  But this does not guarantee to find subtrees
> that contain empty-array entries.  Worse, such entries could be anywhere
> in the tree, and because of the way that the insertion penalty is
> calculated, they probably are.  (We will compute a zero penalty to add
> an empty array item to any subtree.)  The reason it sometimes works
> seems to be that GiST randomizes its insertion decisions when there are
> equal penalties (cf gistchoose()), and sometimes by luck it puts all
> of the empty-array entries into subtrees that the existing rule will
> search.

Right, the existing logic could work correctly when the dataset contains no
empty arrays.  But it clearly doesn't handle empty arrays.

> So as far as I can see, we have little choice but to lobotomize the
> RTContainedBy case and force a whole-index search.  This applies to
> both the gist__int_ops and gist__intbig_ops opclasses.  This is
> pretty awful for any applications that are depending on such queries
> to be fast, but it's hard to argue with "it gets the wrong answer,
> and not even reproducibly so".

+1 for pushing this

> In the future we might think about removing <@ from these opclasses,
> or making a non-backward-compatible change to segregate empty arrays
> from everything else in the index.  But neither answer seems very
> back-patchable, and I'm not really sure I want to put so much work
> into a second-class-citizen contrib module anyway.

+1 for removing <@ from the opclasses.  Trying to segregate empty arrays
looks like the invention of a new opclass rather than a bugfix for the
current one.  Anyone who is interested in this piece of work can implement
such a new opclass.

Users who like the existing behavior of the <@ operator in the intarray
opclasses may be advised to rewrite their queries as follows:

"col <@ const" => "col <@ const AND col && const"

The new queries would have opclass support and handle non-empty arrays in
the same way.  They will be slightly slower because of the evaluation of
two operators instead of one, but that doesn't seem critical.
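
Spelled out against the test__int table used elsewhere in this thread,
the rewrite is just the following sketch.  Note that the added &&
condition means rows with empty arrays are no longer returned, since
'{}' overlaps nothing.

  -- original form; with the fix, <@ alone forces a whole-index scan:
  SELECT count(*) FROM test__int WHERE a <@ '{73,23,20}';

  -- rewritten: && lets the GiST index prune subtrees, <@ is rechecked:
  SELECT count(*) FROM test__int
  WHERE a <@ '{73,23,20}' AND a && '{73,23,20}';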

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: How am I supposed to fix this?

2019-08-06 Thread Peter Geoghegan
On Tue, Aug 6, 2019 at 11:11 AM Larry Rosenman  wrote:
> As a followup, btcheck found another index that had issues, and a toast
> table was missing a chunk.
>
> I have ALL the data I used to create this table still around so I just
> dropped it and am reloading the data.

It sounds like there is a generic storage issue at play here. Often
TOAST data is the apparent first thing that gets corrupted, but that's
only because the inconsistencies are relatively obvious.

I suggest that you rerun amcheck using the same query, though this
time specify "heapallindexed=true" to bt_index_check(). Increase
maintenance_work_mem if it's set to a low value first (ideally you can
crank it up to 600MB). This type of verification will take a lot
longer, but will find more subtle inconsistencies that could easily be
missed.
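
Concretely, building on the bt_index_check() query from earlier in this
thread, that would look something like the sketch below (the catalog
joins and the 600MB figure are only suggestions; adjust to taste):

  CREATE EXTENSION IF NOT EXISTS amcheck;
  SET maintenance_work_mem = '600MB';

  SELECT bt_index_check(index => c.oid, heapallindexed => true),
         c.relname
  FROM pg_index i
  JOIN pg_class c ON i.indexrelid = c.oid
  JOIN pg_am am ON c.relam = am.oid
  WHERE am.amname = 'btree'
  ORDER BY c.relpages DESC;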

Please let us know how this goes. I am always keen to hear about how
much the tooling helps in the real world.

-- 
Peter Geoghegan




Re: How am I supposed to fix this?

2019-08-06 Thread Larry Rosenman

On 08/06/2019 12:45 pm, Larry Rosenman wrote:
> On 08/06/2019 12:35 pm, Peter Geoghegan wrote:
>> On Tue, Aug 6, 2019 at 10:34 AM Larry Rosenman  wrote:
>>> ERROR:  function bt_index_check(index => oid) does not exist
>>> LINE 1: SELECT bt_index_check(index => c.oid),
>>>         ^
>>> HINT:  No function matches the given name and argument types. You might
>>> need to add explicit type casts.
>>
>> It's a contrib extension, so you have to "create extension amcheck" first.
>
> the check is running (this is a HUGE table).
>
> For the initial error, it would be nice if:
> 1) the pg_toast schema was mentioned
> or
> 2) reindex searched pg_toast as well.
>
> I did do the reindex pg_toast. index.

As a followup, btcheck found another index that had issues, and a toast
table was missing a chunk.

I have ALL the data I used to create this table still around so I just
dropped it and am reloading the data.

I still think that the error message should mention the fully qualified
index name.



--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106




Re: SQL:2011 PERIODS vs Postgres Ranges?

2019-08-06 Thread Ibrar Ahmed
On Tue, Aug 6, 2019 at 8:28 PM Paul Jungwirth 
wrote:

> Hi Ibrar,
>
> On 8/6/19 3:26 AM, Ibrar Ahmed wrote:
> > - Why we are not allowing any other datatype other than ranges in the
> > primary key. Without that there is no purpose of a primary key.
>
> A temporal primary key always has at least one ordinary column (of any
> type), so it is just a traditional primary key *plus* a PERIOD and/or
> range column to indicate when the record was true.
>
> > - Thinking about some special token to differentiate between normal
> > primary key and temporal primary key
>
> There is already some extra syntax. For the time part of a PK, you say
> `WITHOUT OVERLAPS`, like this:
>
>  CONSTRAINT pk_on_t PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
>
> In this example `id` is an ordinary column, and `valid_at` is either a
> Postgres range or a SQL:2011 PERIOD. (The latter is not yet implemented
> in my patch but there are some placeholder comments.)
>
> Similarly a foreign key has one or more traditional columns *plus* a
> range/PERIOD. It needs to have a range/PERIOD on both sides. It too has
> some special syntax, but instead of `WITHOUT OVERLAPS` it is `PERIOD`.
> (Don't blame me, I didn't write the standard :-) So here is an example:
>
>  CONSTRAINT fk_t2_to_t FOREIGN KEY (id, PERIOD valid_at)
>REFERENCES t (id, PERIOD valid_at)
>
> You should be able to see my changes to gram.y to support this new syntax.
>
> I hope this clears up how it works! I'm happy to answer more questions
> if you have any. Also if you want to read more:
>
> - This paper by Kulkarni & Michels is a 10-page overview of SQL:2011:
>
>
> https://sigmodrecord.org/publications/sigmodRecord/1209/pdfs/07.industry.kulkarni.pdf
>
> - This is a talk I gave at PGCon 2019 going over the concepts, with a
> lot of pictures. You can find text, slides, and a link to the video here:
>
> https://github.com/pjungwir/postgres-temporal-talk
>
> - This link is ostensibly an annotated bibliography but really tells a
> story about how the research has developed:
>
>
> https://illuminatedcomputing.com/posts/2017/12/temporal-databases-bibliography/
>
> - There is also some discussion about PERIODs vs ranges upthread here,
> as well as here:
>
> https://www.postgresql-archive.org/Periods-td6022563.html
>
>
Thanks, Paul, for the explanation.  I think it's a good start; now I am
looking at the range_agg patch to integrate this work with it and test it.
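
For anyone else following along, the two constraint forms quoted above,
put together as full DDL (a sketch only; it assumes the patch's proposed
WITHOUT OVERLAPS / PERIOD syntax, which is not in core PostgreSQL today,
plus made-up table and column names):

  CREATE TABLE t (
    id       int,
    valid_at daterange,
    CONSTRAINT pk_on_t PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
  );

  CREATE TABLE t2 (
    id       int,
    valid_at daterange,
    CONSTRAINT fk_t2_to_t FOREIGN KEY (id, PERIOD valid_at)
      REFERENCES t (id, PERIOD valid_at)
  );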


> Yours,
>
> --
> Paul  ~{:-)
> p...@illuminatedcomputing.com
>


-- 
Ibrar Ahmed


Re: [PATCH] Atomic pgrename on Windows

2019-08-06 Thread Alexander Korotkov
Hi!

I'd like to resume the discussion on this subject.  Sorry for so long delay.

On Sat, Jan 20, 2018 at 6:13 PM Magnus Hagander  wrote:
> On Tue, Nov 28, 2017 at 2:47 AM, Michael Paquier  
> wrote:
>>
>> On Mon, Nov 27, 2017 at 3:28 PM, Alexander Korotkov
>>  wrote:
>> > Attached patch atomic-pgrename-windows-1.patch fixes this problem.  It
>> > appears to be possible to atomically replace file on Windows – 
>> > ReplaceFile()
>> > does that.  ReplaceFiles() requires target file to exist, this is why we
>> > still need to call MoveFileEx() when it doesn't exist.
>>
>> Do you think that it could be safer to unlink the target file first
>> with pgunlink()? This way you make sure that the target file is
>> removed and not locked. This change makes me worrying about the
>> introduction of more race conditions.
>
> Unlinking it first seems dangerous, as pointed out by Andres.
>
> What about first trying ReplaceFile() and then if it fails with "target 
> doesn't exist", then call MoveFileEx().
>
> Or the other way around -- try MoveFileEx() first since that seems to work 
> most of the time today (if it mostly broke we'd be in trouble already), and 
> if it fails with a sharing violation, try ReplaceFile()? And perhaps end up 
> doing it something similar to what we do with shared memory which is just to 
> loop over it and try  each a couple of time, before giving up and failing?

Unfortunately, it seems that none of these strategies fits all the cases.

Basically, we have two options for renaming a file.
 * MoveFileEx() – the safest possible option, less likely to lose files, but
it takes a lock on the target.
 * ReplaceFile() – a "lockless" option, but it may lose files on an OS crash.

Also we have two different cases of file renaming:
1) Renaming durable files.  When the target already exists, after an OS
crash we expect to find either the old or the new file at the target
filename.  We don't expect to find nothing there.
2) Renaming temporary files.  In this case we don't much care about
losing files on an OS crash.  But locking appears to be an issue in some
cases.

So, in the 1st case, trying ReplaceFile() either first or second doesn't
seem like an option.  Either way it risks losing the target file, which
seems unacceptable.

In the 2nd case, trying MoveFileEx() then ReplaceFile() or vice versa
might work.  But the error codes of these functions don't tell us
explicitly whether the target file exists.  So, I prefer checking that
explicitly with GetFileAttributes().

Attached patch implements the idea described in [1].  It implements a
separate pgrename_temp() function, which is better for concurrent
operation but less safe.

Links
1. 
https://www.postgresql.org/message-id/CAPpHfds9trA6ipezK3BsuuOSQwEmESiqj8pkOxACFJpoLpcoNw%40mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pgrename_temp-1.patch
Description: Binary data


intarray GiST index gets wrong answers for '{}' <@ anything

2019-08-06 Thread Tom Lane
While looking at the pending patch for faster GIN index searches
on no-key queries, I was motivated to improve contrib/intarray's
regression test to exercise the GIN_SEARCH_MODE_ALL case, because
it didn't.  And then I thought well, let's try to bring the code
coverage of _int_gin.c up to something respectable, which led me
to the regression test additions shown in the attached.  And I
was astonished to observe that the GiST index cases mostly got
the wrong answer for the <@ query.  Sometimes they got the right
answer, but mostly not.  After some digging I saw that the problem
was that there are a number of empty arrays ('{}') in the data,
and those should surely all match the WHERE a <@ '{73,23,20}'
condition, but the GiST opclasses were not reliably finding them.

The reason appears to be that the condition for descending through a
non-leaf index key for the RTContainedBy case is incorrectly optimistic:
it supposes that we only need to descend into subtrees whose union key
overlaps the query array.  But this does not guarantee to find subtrees
that contain empty-array entries.  Worse, such entries could be anywhere
in the tree, and because of the way that the insertion penalty is
calculated, they probably are.  (We will compute a zero penalty to add
an empty array item to any subtree.)  The reason it sometimes works
seems to be that GiST randomizes its insertion decisions when there are
equal penalties (cf gistchoose()), and sometimes by luck it puts all
of the empty-array entries into subtrees that the existing rule will
search.

So as far as I can see, we have little choice but to lobotomize the
RTContainedBy case and force a whole-index search.  This applies to
both the gist__int_ops and gist__intbig_ops opclasses.  This is
pretty awful for any applications that are depending on such queries
to be fast, but it's hard to argue with "it gets the wrong answer,
and not even reproducibly so".

In the future we might think about removing <@ from these opclasses,
or making a non-backward-compatible change to segregate empty arrays
from everything else in the index.  But neither answer seems very
back-patchable, and I'm not really sure I want to put so much work
into a second-class-citizen contrib module anyway.

Comments?

regards, tom lane

diff --git a/contrib/intarray/_int_gist.c b/contrib/intarray/_int_gist.c
index 13dd7ac..e5a8011 100644
--- a/contrib/intarray/_int_gist.c
+++ b/contrib/intarray/_int_gist.c
@@ -96,8 +96,13 @@ g_int_consistent(PG_FUNCTION_ARGS)
 retval = inner_int_contains(query,
 			(ArrayType *) DatumGetPointer(entry->key));
 			else
-retval = inner_int_overlap((ArrayType *) DatumGetPointer(entry->key),
-		   query);
+			{
+/*
+ * Unfortunately, because empty arrays could be anywhere in
+ * the index, we must search the whole tree.
+ */
+retval = true;
+			}
 			break;
 		default:
 			retval = false;
diff --git a/contrib/intarray/_intbig_gist.c b/contrib/intarray/_intbig_gist.c
index 8d3b3f5..2a20abe 100644
--- a/contrib/intarray/_intbig_gist.c
+++ b/contrib/intarray/_intbig_gist.c
@@ -567,7 +567,13 @@ g_intbig_consistent(PG_FUNCTION_ARGS)
 }
 			}
 			else
-retval = _intbig_overlap((GISTTYPE *) DatumGetPointer(entry->key), query);
+			{
+/*
+ * Unfortunately, because empty arrays could be anywhere in
+ * the index, we must search the whole tree.
+ */
+retval = true;
+			}
 			break;
 		default:
 			retval = false;
diff --git a/contrib/intarray/expected/_int.out b/contrib/intarray/expected/_int.out
index 105c951..c92a565 100644
--- a/contrib/intarray/expected/_int.out
+++ b/contrib/intarray/expected/_int.out
@@ -431,6 +431,18 @@ SELECT count(*) from test__int WHERE a @> '{20,23}';
 12
 (1 row)
 
+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+ count 
+---
+10
+(1 row)
+
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
+ count 
+---
+ 1
+(1 row)
+
 SELECT count(*) from test__int WHERE a @@ '50&68';
  count 
 ---
@@ -449,6 +461,19 @@ SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
 21
 (1 row)
 
+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+ count 
+---
+  6566
+(1 row)
+
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';
+ count 
+---
+  6343
+(1 row)
+
+SET enable_seqscan = off;  -- not all of these would use index by default
 CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
 SELECT count(*) from test__int WHERE a && '{23,50}';
  count 
@@ -480,6 +505,18 @@ SELECT count(*) from test__int WHERE a @> '{20,23}';
 12
 (1 row)
 
+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+ count 
+---
+10
+(1 row)
+
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
+ count 
+---
+ 1
+(1 row)
+
 SELECT count(*) from test__int WHERE a @@ '50&68';
  count 
 ---
@@ -498,6 +535,18 @@ SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
 21
 (1 row)
 
+SELECT 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-06 Thread Bruce Momjian
On Wed, Aug  7, 2019 at 12:31:58AM +0900, Masahiko Sawada wrote:
> Well, so you mean that for example we encrypt only 100 bytes WAL
> record when append 100 bytes WAL records?
> 
> For WAL encryption, if we encrypt the entire 8k WAL page and write the
> entire page, the encrypted-and-written page will contain 100 bytes WAL
> record data and (8192-100) bytes garbage (omitted WAL page header for
> simplify), although WAL data on WAL buffer is still not encrypted
> state. And then if we append 200 bytes again, the
> encrypted-and-written page will contain 300 bytes WAL record data and
> (8192-300)bytes garbage, data on WAL buffer is still not encrypted
> state though.
> 
> In this case I think the first 100 bytes of two 8k WAL pages are the
> same because we encrypted both from the beginning of the page with the
> counter = 0. But the next 200 bytes are different; it's (encrypted)
> garbage in the former case but it's (encrypted) WAL record data in the
> latter case. I think that's a problem.
> 
> On the other hand, if we encrypt 8k WAL page with the different
> counter of nonce after append 200 bytes WAL record, the first 100 bytes
> (and of course the entire 8k page also) will be different. However
> since it's effectively the same as changing an already-flushed WAL record
> on the disk, it's bad.
> 
> Also, if we encrypt only append data instead of entire 8k page, we
> would need to have the information somewhere about how many bytes of
> the WAL page hold valid values. Otherwise reading WAL would not work
> fine.

OK, onlist reply.  We are going to encrypt the _entire_ WAL stream as we
write it, which is possible with CTR.  If we write 200 bytes of WAL, we
encrypt/XOR 200 bytes of WAL.  If we write 10k of WAL, and 8k of that is
an 8k page, we encrypt the entire 10k of WAL --- we don't care if there
is an 8k page in there or not.

CTR mode creates a bit stream for the first 16 bytes with nonce of
(segment_number, counter = 0), and the next 16 bytes with 
(segment_number, counter = 1), etc.  We only XOR using the parts of the
bit stream we want to use.  We don't care what the WAL content is --- we
just XOR it with the stream with the matching counter for that part of
the WAL.
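
As a bit of illustrative arithmetic (assuming the counter simply advances by
one for every 16 bytes from the start of the segment), the keystream block
covering byte offset b is just b / 16, so the stream can be generated for
only the bytes being appended.  For example:

-- illustrative only: keystream blocks touched when appending 200 bytes
-- of WAL starting at byte offset 100 of the segment
SELECT min(ofs / 16) AS first_counter, max(ofs / 16) AS last_counter
FROM generate_series(100, 299) AS ofs;   -- counters 6 through 18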

It is true we are encrypting the same 8k page in the heap/index page,
and in WAL, with different key/nonce combinations, which I think is
secure.  What is insecure is to encrypt two different pieces of data
with the same key/nonce combination.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Cleanup of intro.sgml

2019-08-06 Thread Joshua D. Drake

On 8/5/19 1:13 PM, Chapman Flack wrote:

On 8/5/19 3:20 PM, Joshua D. Drake wrote:

intro.sgml today. Patch attached.

Things I noticed quickly:

broken up in to categories

s/in to/into/


Got it, I can make that change.



Unstructured data via JSON(or XML ?)


On this one, there is a lot of argument about whether XML is structured
or not. I do agree that adding XML support would be good, though; many
people think that JSON rules the world, but the old-money companies are
still using XML.




s/Partioniing/Partitioning/


Thanks for the catch. I will make the change.

JD




Regards,
-Chap




--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
Postgres centered full stack support, consulting and development.
Advocate: @amplifypostgres || Get help: https://commandprompt.com/
* Unless otherwise stated, opinions are my own.   *





Re: How am I supposed to fix this?

2019-08-06 Thread Larry Rosenman

On 08/06/2019 12:35 pm, Peter Geoghegan wrote:

On Tue, Aug 6, 2019 at 10:34 AM Larry Rosenman  wrote:

ERROR:  function bt_index_check(index => oid) does not exist
LINE 1: SELECT bt_index_check(index => c.oid),
^
HINT:  No function matches the given name and argument types. You might
need to add explicit type casts.


It's a contrib extension, so you have to "create extension amcheck" 
first.



the check is running (this is a HUGE table).

For the initial error, it would be nice if:
1) the pg_toast schema was mentioned
or
2) reindex searched pg_toast as well.

I did do the reindex of the pg_toast index.


--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106




Re: How am I supposed to fix this?

2019-08-06 Thread Peter Geoghegan
On Tue, Aug 6, 2019 at 10:34 AM Larry Rosenman  wrote:
> ERROR:  function bt_index_check(index => oid) does not exist
> LINE 1: SELECT bt_index_check(index => c.oid),
> ^
> HINT:  No function matches the given name and argument types. You might
> need to add explicit type casts.

It's a contrib extension, so you have to "create extension amcheck" first.
-- 
Peter Geoghegan




Re: How am I supposed to fix this?

2019-08-06 Thread Larry Rosenman

On 08/06/2019 12:30 pm, Peter Geoghegan wrote:

On Tue, Aug 6, 2019 at 10:19 AM Tomas Vondra
 wrote:

The question is how much other data corruption is there ...


Larry could try running amcheck on the other indexes. Just the basic
bt_check_index() checks should be enough to detect problems like this.
They can be run fairly non-disruptively. Something like this should do
it:

SELECT bt_index_check(index => c.oid),
   c.relname,
   c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

If this takes too long, you can always adjust the query to only verify
system indexes or TOAST indexes.

ler=# SELECT bt_index_check(index => c.oid),
ler-#c.relname,
ler-#c.relpages
ler-# FROM pg_index i
ler-# JOIN pg_opclass op ON i.indclass[0] = op.oid
ler-# JOIN pg_am am ON op.opcmethod = am.oid
ler-# JOIN pg_class c ON i.indexrelid = c.oid
ler-# JOIN pg_namespace n ON c.relnamespace = n.oid
ler-# WHERE am.amname = 'btree'
ler-# -- Don't check temp tables, which may be from another session:
ler-# AND c.relpersistence != 't'
ler-# -- Function may throw an error when this is omitted:
ler-# AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ler-# ORDER BY c.relpages DESC;
ERROR:  function bt_index_check(index => oid) does not exist
LINE 1: SELECT bt_index_check(index => c.oid),
   ^
HINT:  No function matches the given name and argument types. You might 
need to add explicit type casts.

ler=#
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106




Re: How am I supposed to fix this?

2019-08-06 Thread Peter Geoghegan
On Tue, Aug 6, 2019 at 10:19 AM Tomas Vondra
 wrote:
> The question is how much other data corruption is there ...

Larry could try running amcheck on the other indexes. Just the basic
bt_check_index() checks should be enough to detect problems like this.
They can be run fairly non-disruptively. Something like this should do
it:

SELECT bt_index_check(index => c.oid),
   c.relname,
   c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

If this takes too long, you can always adjust the query to only verify
system indexes or TOAST indexes.
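
For instance (an illustrative narrowing, not part of the query above), adding
a namespace filter limits the check to TOAST indexes; use n.nspname =
'pg_catalog' instead for system indexes:

SELECT bt_index_check(index => c.oid),
   c.relname,
   c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
AND n.nspname = 'pg_toast'
AND c.relpersistence != 't'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;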

-- 
Peter Geoghegan




Re: How am I supposed to fix this?

2019-08-06 Thread Alvaro Herrera
On 2019-Aug-06, Larry Rosenman wrote:

> ler=# reindex index pg_toast_17760_index;
> ERROR:  relation "pg_toast_17760_index" does not exist

Maybe try "reindex index pg_toast.pg_toast_17760_index"

> ler=# reindex (verbose) database ler;
[...]
> ERROR:  index "pg_toast_17760_index" contains unexpected zero page at block
> 23686
> HINT:  Please REINDEX it.

I suspect REINDEX is trying to access that index for some other reason
than reindexing it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: How am I supposed to fix this?

2019-08-06 Thread Tomas Vondra

On Tue, Aug 06, 2019 at 12:06:45PM -0500, Larry Rosenman wrote:

I'm getting the below, and am unaware of how to fix it

11.4 on FreeBSD 12.



ler=# reindex (verbose) table dns_query ;
INFO:  index "dns_query_pkey" was reindexed
DETAIL:  CPU: user: 114.29 s, system: 207.94 s, elapsed: 698.87 s
ERROR:  index "pg_toast_17760_index" contains unexpected zero page at 
block 23686

HINT:  Please REINDEX it.
CONTEXT:  parallel worker
ler=# reindex index pg_toast_17760_index;
ERROR:  relation "pg_toast_17760_index" does not exist


You probably need to explicitly say pg_toast.pg_toast_17760_index here,
because pg_toast schema is not part of the search_path by default.
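
For example:

REINDEX INDEX pg_toast.pg_toast_17760_index;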


ler=# reindex (verbose) database ler;
INFO:  index "pg_class_oid_index" was reindexed
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "pg_class_relname_nsp_index" was reindexed
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "pg_class_tblspc_relfilenode_index" was reindexed
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  table "pg_catalog.pg_class" was reindexed
load: 14.53  cmd: psql 2675 [select] 2765.27r 0.01u 0.01s 0% 8292k
INFO:  index "dns_query_pkey" was reindexed
DETAIL:  CPU: user: 112.91 s, system: 205.51 s, elapsed: 688.28 s
ERROR:  index "pg_toast_17760_index" contains unexpected zero page at 
block 23686

HINT:  Please REINDEX it.
ler=#



Assuming the toast index is corrupted, this is kinda expected (when trying
to reindex an index on the toasted data).

The question is how much other data corruption is there ...


regards
--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Problem with default partition pruning

2019-08-06 Thread Alvaro Herrera
Hello,

On 2019-Aug-06, Amit Langote wrote:

> On Mon, Aug 5, 2019 at 11:39 PM Alvaro Herrera  
> wrote:

> > I don't think that we care about what happens with constraint_exclusion
> > is on.  That's not the recommended value for that setting anyway.
> 
> One issue I expressed with unconditionally applying constraint
> exclusion in partprune.c the way the third patch proposes to do it is
> that it means we end up performing the same processing twice for a
> given relation in some cases.  Specifically, when constraint_exclusion
> is set to on, relation_excluded_by_constraints() that occurs when
> set_rel_sizes() is applied to that relation would perform the same
> proof.  But maybe we shouldn't worry about the repetition too much,
> because it's not likely to be very problematic in practice,
> considering that setting constraint_exclusion to on is not
> recommended.

Well, if this is really all that duplicative, one thing we could do is
run this check in gen_partprune_steps_internal only if
constraint_exclusion is a value other than on; we should achieve the
same effect with no repetition.  Patch for that is attached.  However,
if I run the server with constraint_exclusion=on, the regression tests
fail with the attached diff.  I didn't look at what test is failing, but
it seems to me that it's not really duplicative in all cases, only some.
Therefore we can't do it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 2ed1e44c18..2f84f30423 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -1023,8 +1023,12 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 		 * If the clause contradicts the partition constraint, mark the clause
 		 * as contradictory and we're done.  This is particularly helpful to
 		 * prune the default partition.
+		 *
+		 * However, we skip this if constraint_exclusion is set to ON, because
+		 * this check will be done by set_rel_size later on anyway.
 		 */
-		if (context->rel->partition_qual)
+		if (context->rel->partition_qual &&
+			constraint_exclusion != CONSTRAINT_EXCLUSION_ON)
 		{
 			List	   *partconstr;
 
diff -U3 /pgsql/source/master/src/test/regress/expected/partition_prune.out 
/home/alvherre/Code/pgsql/build/master/src/test/regress/results/partition_prune.out
--- /pgsql/source/master/src/test/regress/expected/partition_prune.out  
2019-08-06 12:27:10.041137540 -0400
+++ 
/home/alvherre/Code/pgsql/build/master/src/test/regress/results/partition_prune.out
 2019-08-06 13:12:20.367678444 -0400
@@ -529,6 +529,12 @@
  Filter: ((a = 1) OR ((b)::text = 'ab'::text))
->  Seq Scan on rlp3abcd
  Filter: ((a = 1) OR ((b)::text = 'ab'::text))
+   ->  Seq Scan on rlp3efgh
+ Filter: ((a = 1) OR ((b)::text = 'ab'::text))
+   ->  Seq Scan on rlp3nullxy
+ Filter: ((a = 1) OR ((b)::text = 'ab'::text))
+   ->  Seq Scan on rlp3_default
+ Filter: ((a = 1) OR ((b)::text = 'ab'::text))
->  Seq Scan on rlp4_1
  Filter: ((a = 1) OR ((b)::text = 'ab'::text))
->  Seq Scan on rlp4_2
@@ -547,7 +553,7 @@
  Filter: ((a = 1) OR ((b)::text = 'ab'::text))
->  Seq Scan on rlp_default_default
  Filter: ((a = 1) OR ((b)::text = 'ab'::text))
-(25 rows)
+(31 rows)
 
 explain (costs off) select * from rlp where a > 20 and a < 27;
QUERY PLAN
@@ -599,9 +605,11 @@
  Append
->  Seq Scan on rlp4_1
  Filter: ((a = 20) OR (a = 40))
+   ->  Seq Scan on rlp4_default
+ Filter: ((a = 20) OR (a = 40))
->  Seq Scan on rlp5_default
  Filter: ((a = 20) OR (a = 40))
-(5 rows)
+(7 rows)
 
 explain (costs off) select * from rlp3 where a = 20;   /* empty */
 QUERY PLAN


How am I supposed to fix this?

2019-08-06 Thread Larry Rosenman

I'm getting the below, and am unaware of how to fix it

11.4 on FreeBSD 12.



ler=# reindex (verbose) table dns_query ;
INFO:  index "dns_query_pkey" was reindexed
DETAIL:  CPU: user: 114.29 s, system: 207.94 s, elapsed: 698.87 s
ERROR:  index "pg_toast_17760_index" contains unexpected zero page at 
block 23686

HINT:  Please REINDEX it.
CONTEXT:  parallel worker
ler=# reindex index pg_toast_17760_index;
ERROR:  relation "pg_toast_17760_index" does not exist
ler=# reindex (verbose) database ler;
INFO:  index "pg_class_oid_index" was reindexed
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "pg_class_relname_nsp_index" was reindexed
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "pg_class_tblspc_relfilenode_index" was reindexed
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  table "pg_catalog.pg_class" was reindexed
load: 14.53  cmd: psql 2675 [select] 2765.27r 0.01u 0.01s 0% 8292k
INFO:  index "dns_query_pkey" was reindexed
DETAIL:  CPU: user: 112.91 s, system: 205.51 s, elapsed: 688.28 s
ERROR:  index "pg_toast_17760_index" contains unexpected zero page at 
block 23686

HINT:  Please REINDEX it.
ler=#

ler=# select version();
   
version

-
 PostgreSQL 11.4 on amd64-portbld-freebsd12.0, compiled by FreeBSD clang 
version 8.0.0 (tags/RELEASE_800/final 356365) (based on LLVM 8.0.0), 
64-bit

(1 row)

ler=#

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-06 Thread Masahiko Sawada
Hi,

On Wed, Aug 7, 2019, 00:31 Masahiko Sawada  wrote:

> Hi Bruce,
> (off-list)
>
> I think I'm missing something about basic of encryption. Please let me
> question about it on off-list.
>

Sorry for the noise, it was not off-list. I made a mistake.


> On Tue, Aug 6, 2019 at 11:36 PM Bruce Momjian  wrote:
> >
> > On Tue, Aug  6, 2019 at 12:00:27PM +0900, Masahiko Sawada wrote:
> > > What I'm thinking about WAL encryption is that WAL records on WAL
> > > buffer is not encrypted. When writing to the disk we copy the contents
> > > of 8k WAL page to a temporary buffer and encrypt it, and then write
> > > it. And according to the current behavior, every time we write WAL we
> > > write WAL per 8k WAL pages rather than WAL records.
> > >
> > > The nonce for WAL encryption is {segment number, counter}. Suppose we
> > > write 100 bytes WAL at beginning of the first 8k WAL page in WAL
> > > segment 50. We encrypt the entire 8k WAL page with the nonce starting
> > > from {50, 0} and write to the disk. After that, suppose we append 200
> > > bytes WAL to the same WAL page. We again encrypt the entire 8k WAL
> > > page with the nonce starting from {50, 0} and write to the disk. The
> > > two 8k WAL pages we wrote to the disk are different but we encrypted
> > > them with the same nonce, which I think it's bad.
> >
> > OK, I think you are missing something.   Let me go over the details.
> > First, I think we are all agreed we are using CTR for heap/index pages,
> > and for WAL, because CTR allows byte granularity, it is faster, and
> > might be more secure.
> >
> > So, to write 8k heap/index pages, we use the agreed-on LSN/page-number
> > to encrypt each page.  In CTR mode, we do that by creating an 8k bit
> > stream, which is created in 16-byte chunks with AES by incrementing the
> > counter used for each 16-byte chunk.  We then XOR the bits with what we
> > want to encrypt, and skip the LSN and CRC parts of the page.
> >
> > For WAL, we effectively create a 16MB bitstream, though we can create it
> > in parts as needed.  (Creating it in parts is easier in CTR mode.)  The
> > nonce is the segment number, but each 16-byte chunk uses a different
> > counter.  Therefore, even if you are encrypting the same 8k page several
> > times in the WAL, the 8k page would be different because of the LSN (and
> > other changes), and the bitstream you encrypt/XOR it with would be
> > different because the counter would be different for that offset in the
> > WAL.
>
> Well, so you mean that for example we encrypt only 100 bytes WAL
> record when append 100 bytes WAL records?
>
> For WAL encryption, if we encrypt the entire 8k WAL page and write the
> entire page, the encrypted-and-written page will contain 100 bytes WAL
> record data and (8192-100) bytes garbage (omitted WAL page header for
> simplicity), although WAL data on WAL buffer is still not encrypted
> state. And then if we append 200 bytes again, the
> encrypted-and-written page will contain 300 bytes WAL record data and
> (8192-300)bytes garbage, data on WAL buffer is still not encrypted
> state though.
>
> In this case I think the first 100 bytes of two 8k WAL pages are the
> same because we encrypted both from the beginning of the page with the
> counter = 0. But the next 200 bytes are different; it's (encrypted)
> garbage in the former case but it's (encrypted) WAL record data in the
> latter case. I think that's a problem.
>
> On the other hand, if we encrypt 8k WAL page with the different
> counter of nonce after append 200 bytes WAL record, the first 100 bytes
> (and of course the entire 8k page also) will be different. However
> since it's effectively the same as changing an already-flushed WAL record
> on the disk, it's bad.
>
> Also, if we encrypt only append data instead of entire 8k page, we
> would need to have the information somewhere about how many bytes of
> the WAL page hold valid values. Otherwise reading WAL would not work
> fine.
>
> Please advise me what I am missing.
>
> Regards,
>
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>

Regards,

>


Re: More issues with pg_verify_checksums and checksum verification in base backups

2019-08-06 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-08-06 10:58:15 -0400, Stephen Frost wrote:
> > * Michael Banck (michael.ba...@credativ.de) wrote:
> > > Independently of the whitelist/blacklist question, I believe
> > > pg_checksums should not error out as soon as it encounters a weird looking
> > > file, but either (i) still checksum it or (ii) skip it? Or is that to be
> > > considered a pilot error and it's fine for pg_checksums to fold?
> > 
> > imv, random files that we don't know about are exactly 'pilot error' to
> > be complained about..  This is exactly why the whitelist idea falls
> > over.
> 
> I still think this whole assumption is bad, and that you're fixing
> non-problems, and creating serious usability issues with zero benefits.

I doubt we're going to get to agreement on this, unfortunately.

Thanks,

Stephen




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-06 Thread Masahiko Sawada
Hi Bruce,
(off-list)

I think I'm missing something about basic of encryption. Please let me
question about it on off-list.

On Tue, Aug 6, 2019 at 11:36 PM Bruce Momjian  wrote:
>
> On Tue, Aug  6, 2019 at 12:00:27PM +0900, Masahiko Sawada wrote:
> > What I'm thinking about WAL encryption is that WAL records on WAL
> > buffer is not encrypted. When writing to the disk we copy the contents
> > of 8k WAL page to a temporary buffer and encrypt it, and then write
> > it. And according to the current behavior, every time we write WAL we
> > write WAL per 8k WAL pages rather than WAL records.
> >
> > The nonce for WAL encryption is {segment number, counter}. Suppose we
> > write 100 bytes WAL at beginning of the first 8k WAL page in WAL
> > segment 50. We encrypt the entire 8k WAL page with the nonce starting
> > from {50, 0} and write to the disk. After that, suppose we append 200
> > bytes WAL to the same WAL page. We again encrypt the entire 8k WAL
> > page with the nonce starting from {50, 0} and write to the disk. The
> > two 8k WAL pages we wrote to the disk are different but we encrypted
> > them with the same nonce, which I think it's bad.
>
> OK, I think you are missing something.   Let me go over the details.
> First, I think we are all agreed we are using CTR for heap/index pages,
> and for WAL, because CTR allows byte granularity, it is faster, and
> might be more secure.
>
> So, to write 8k heap/index pages, we use the agreed-on LSN/page-number
> to encrypt each page.  In CTR mode, we do that by creating an 8k bit
> stream, which is created in 16-byte chunks with AES by incrementing the
> counter used for each 16-byte chunk.  We then XOR the bits with what we
> want to encrypt, and skip the LSN and CRC parts of the page.
>
> For WAL, we effectively create a 16MB bitstream, though we can create it
> in parts as needed.  (Creating it in parts is easier in CTR mode.)  The
> nonce is the segment number, but each 16-byte chunk uses a different
> counter.  Therefore, even if you are encrypting the same 8k page several
> times in the WAL, the 8k page would be different because of the LSN (and
> other changes), and the bitstream you encrypt/XOR it with would be
> different because the counter would be different for that offset in the
> WAL.

Well, so you mean that for example we encrypt only 100 bytes WAL
record when append 100 bytes WAL records?

For WAL encryption, if we encrypt the entire 8k WAL page and write the
entire page, the encrypted-and-written page will contain 100 bytes WAL
record data and (8192-100) bytes garbage (omitted WAL page header for
simplicity), although WAL data on WAL buffer is still not encrypted
state. And then if we append 200 bytes again, the
encrypted-and-written page will contain 300 bytes WAL record data and
(8192-300)bytes garbage, data on WAL buffer is still not encrypted
state though.

In this case I think the first 100 bytes of two 8k WAL pages are the
same because we encrypted both from the beginning of the page with the
counter = 0. But the next 200 bytes are different; it's (encrypted)
garbage in the former case but it's (encrypted) WAL record data in the
latter case. I think that's a problem.

On the other hand, if we encrypt 8k WAL page with the different
counter of nonce after append 200 bytes WAL record, the first 100 bytes
(and of course the entire 8k page also) will be different. However
since it's effectively the same as changing an already-flushed WAL record
on the disk, it's bad.

Also, if we encrypt only append data instead of entire 8k page, we
would need to have the information somewhere about how many bytes of
the WAL page hold valid values. Otherwise reading WAL would not work
fine.

Please advise me what I am missing.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Use PageIndexTupleOverwrite() within nbtsort.c

2019-08-06 Thread Anastasia Lubennikova

16.07.2019 1:12, Peter Geoghegan wrote:

Attached patch slightly simplifies nbtsort.c by making it use
PageIndexTupleOverwrite() to overwrite the last right non-pivot tuple
with the new high key (pivot tuple). PageIndexTupleOverwrite() is
designed so that code like this doesn't need to delete and re-insert
to replace an existing tuple.

This slightly simplifies the code, and also makes it marginally
faster. I'll add this to the 2019-09 CF.


I'm okay with this patch.

Should we also update similar code in _bt_mark_page_halfdead()?
I attached a new version of the patch with this change.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de..0e6a27e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1653,8 +1653,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	opaque->btpo_flags |= BTP_HALF_DEAD;
 
-	PageIndexTupleDelete(page, P_HIKEY);
-	Assert(PageGetMaxOffsetNumber(page) == 0);
+	Assert(PageGetMaxOffsetNumber(page) == P_HIKEY);
 	MemSet(&trunctuple, 0, sizeof(IndexTupleData));
 	trunctuple.t_info = sizeof(IndexTupleData);
 	if (target != leafblkno)
@@ -1662,8 +1661,8 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	else
 		BTreeTupleSetTopParent(&trunctuple, InvalidBlockNumber);
 
-	if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
-	false, false) == InvalidOffsetNumber)
+	if (!PageIndexTupleOverwrite(page, P_HIKEY, (Item) &trunctuple,
+ IndexTupleSize(&trunctuple)))
 		elog(ERROR, "could not add dummy high key to half-dead page");
 
 	/* Must mark buffers dirty before XLogInsert */
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b30cf9e..82e0881 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -943,7 +943,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		{
 			IndexTuple	lastleft;
 			IndexTuple	truncated;
-			Size		truncsz;
 
 			/*
 			 * Truncate away any unneeded attributes from high key on leaf
@@ -961,26 +960,18 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * choosing a split point here for a benefit that is bound to be
 			 * much smaller.
 			 *
-			 * Since the truncated tuple is often smaller than the original
-			 * tuple, it cannot just be copied in place (besides, we want to
-			 * actually save space on the leaf page).  We delete the original
-			 * high key, and add our own truncated high key at the same
-			 * offset.
-			 *
-			 * Note that the page layout won't be changed very much.  oitup is
-			 * already located at the physical beginning of tuple space, so we
-			 * only shift the line pointer array back and forth, and overwrite
-			 * the tuple space previously occupied by oitup.  This is fairly
-			 * cheap.
+			 * Overwrite the old item with new truncated high key directly.
+			 * oitup is already located at the physical beginning of tuple
+			 * space, so this should directly reuse the existing tuple space.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
 
 			truncated = _bt_truncate(wstate->index, lastleft, oitup,
 	 wstate->inskey);
-			truncsz = IndexTupleSize(truncated);
-			PageIndexTupleDelete(opage, P_HIKEY);
-			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
+			if (!PageIndexTupleOverwrite(opage, P_HIKEY, (Item) truncated,
+		 IndexTupleSize(truncated)))
+elog(ERROR, "failed to add hikey to the index page");
 			pfree(truncated);
 
 			/* oitup should continue to point to the page's high key */


Re: SQL:2011 PERIODS vs Postgres Ranges?

2019-08-06 Thread Paul Jungwirth

Hi Ibrar,

On 8/6/19 3:26 AM, Ibrar Ahmed wrote:

- Why are we not allowing any datatype other than ranges in the
primary key? Without that, there is no purpose to a primary key.


A temporal primary key always has at least one ordinary column (of any 
type), so it is just a traditional primary key *plus* a PERIOD and/or 
range column to indicate when the record was true.



- Thinking about some special token to differentiate between a normal
primary key and a temporal primary key


There is already some extra syntax. For the time part of a PK, you say 
`WITHOUT OVERLAPS`, like this:


CONSTRAINT pk_on_t PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)

In this example `id` is an ordinary column, and `valid_at` is either a 
Postgres range or a SQL:2011 PERIOD. (The latter is not yet implemented 
in my patch but there are some placeholder comments.)


Similarly a foreign key has one or more traditional columns *plus* a 
range/PERIOD. It needs to have a range/PERIOD on both sides. It too has 
some special syntax, but instead of `WITHOUT OVERLAPS` it is `PERIOD`. 
(Don't blame me, I didn't write the standard :-) So here is an example:


CONSTRAINT fk_t2_to_t FOREIGN KEY (id, PERIOD valid_at)
  REFERENCES t (id, PERIOD valid_at)

You should be able to see my changes to gram.y to support this new syntax.
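
Putting those two together, a rough sketch of a pair of tables under the
proposed syntax (the table, column, and range-type names here are only
illustrative) might be:

CREATE TABLE t (
  id int,
  valid_at daterange,
  CONSTRAINT pk_on_t PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
);

CREATE TABLE t2 (
  id int,
  valid_at daterange,
  CONSTRAINT fk_t2_to_t FOREIGN KEY (id, PERIOD valid_at)
    REFERENCES t (id, PERIOD valid_at)
);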

I hope this clears up how it works! I'm happy to answer more questions 
if you have any. Also if you want to read more:


- This paper by Kulkarni & Michels is a 10-page overview of SQL:2011:

https://sigmodrecord.org/publications/sigmodRecord/1209/pdfs/07.industry.kulkarni.pdf

- This is a talk I gave at PGCon 2019 going over the concepts, with a 
lot of pictures. You can find text, slides, and a link to the video here:


https://github.com/pjungwir/postgres-temporal-talk

- This link is ostensibly an annotated bibliography but really tells a 
story about how the research has developed:


https://illuminatedcomputing.com/posts/2017/12/temporal-databases-bibliography/

- There is also some discussion about PERIODs vs ranges upthread here, 
as well as here:


https://www.postgresql-archive.org/Periods-td6022563.html


Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: More issues with pg_verify_checksums and checksum verification in base backups

2019-08-06 Thread Andres Freund
Hi,

On 2019-08-06 10:58:15 -0400, Stephen Frost wrote:
> * Michael Banck (michael.ba...@credativ.de) wrote:
> > Independently of the whitelist/blacklist question, I believe
> > pg_checksums should not error out as soon as it encounters a weird looking
> > file, but either (i) still checksum it or (ii) skip it? Or is that to be
> > considered a pilot error and it's fine for pg_checksums to fold?
> 
> imv, random files that we don't know about are exactly 'pilot error' to
> be complained about..  This is exactly why the whitelist idea falls
> over.

I still think this whole assumption is bad, and that you're fixing
non-problems, and creating serious usability issues with zero benefits.

Greetings,

Andres Freund




Re: More issues with pg_verify_checksums and checksum verification in base backups

2019-08-06 Thread Stephen Frost
Greetings,

* Michael Banck (michael.ba...@credativ.de) wrote:
> Independently of the whitelist/blacklist question, I believe
> pg_checksums should not error out as soon as it encounters a weird looking
> file, but either (i) still checksum it or (ii) skip it? Or is that to be
> considered a pilot error and it's fine for pg_checksums to fold?

imv, random files that we don't know about are exactly 'pilot error' to
be complained about..  This is exactly why the whitelist idea falls
over.

Thanks,

Stephen




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-06 Thread Bruce Momjian
On Tue, Aug  6, 2019 at 12:00:27PM +0900, Masahiko Sawada wrote:
> What I'm thinking about WAL encryption is that WAL records on WAL
> buffer is not encrypted. When writing to the disk we copy the contents
> of 8k WAL page to a temporary buffer and encrypt it, and then write
> it. And according to the current behavior, every time we write WAL we
> write WAL per 8k WAL pages rather than WAL records.
> 
> The nonce for WAL encryption is {segment number, counter}. Suppose we
> write 100 bytes WAL at beginning of the first 8k WAL page in WAL
> segment 50. We encrypt the entire 8k WAL page with the nonce starting
> from {50, 0} and write to the disk. After that, suppose we append 200
> bytes WAL to the same WAL page. We again encrypt the entire 8k WAL
> page with the nonce starting from {50, 0} and write to the disk. The
> two 8k WAL pages we wrote to the disk are different but we encrypted
> them with the same nonce, which I think it's bad.

OK, I think you are missing something.   Let me go over the details. 
First, I think we are all agreed we are using CTR for heap/index pages,
and for WAL, because CTR allows byte granularity, it is faster, and
might be more secure.

So, to write 8k heap/index pages, we use the agreed-on LSN/page-number
to encrypt each page.  In CTR mode, we do that by creating an 8k bit
stream, which is created in 16-byte chunks with AES by incrementing the
counter used for each 16-byte chunk.  We then XOR the bits with what we
want to encrypt, and skip the LSN and CRC parts of the page.

For WAL, we effectively create a 16MB bitstream, though we can create it
in parts as needed.  (Creating it in parts is easier in CTR mode.)  The
nonce is the segment number, but each 16-byte chunk uses a different
counter.  Therefore, even if you are encrypting the same 8k page several
times in the WAL, the 8k page would be different because of the LSN (and
other changes), and the bitstream you encrypt/XOR it with would be
different because the counter would be different for that offset in the
WAL.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Problem with default partition pruning

2019-08-06 Thread Alvaro Herrera
On 2019-Aug-05, Alvaro Herrera wrote:

> So we have three locations for that test; one is where it currently is,
> which handles a small subset of the cases.  The other is where Amit
> first proposed putting it, which handles some additional cases; and the
> third one is where your latest patch puts it, which seems to handle all
> cases.  Isn't that what Amit is saying?  If that's correct (and that's
> what I want to imply with the comment changes I proposed), then we
> should just accept that version of the patch.

... actually, there's a fourth possible location, which is outside the
per-partitioning-attribute loop.  Nothing in the moved block is to be
done per attribute, so it'd be wasted work AFAICS.  I propose the
attached.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 09c4ed83b6..2ed1e44c18 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -849,10 +849,11 @@ get_matching_partitions(PartitionPruneContext *context, List *pruning_steps)
  * the context's steps list.  Each step is assigned a step identifier, unique
  * even across recursive calls.
  *
- * If we find clauses that are mutually contradictory, or a pseudoconstant
- * clause that contains false, we set context->contradictory to true and
- * return NIL (that is, no pruning steps).  Caller should consider all
- * partitions as pruned in that case.
+ * If we find clauses that are mutually contradictory, or contradictory with
+ * the partitioning constraint, or a pseudoconstant clause that contains
+ * false, we set context->contradictory to true and return NIL (that is, no
+ * pruning steps).  Caller should consider all partitions as pruned in that
+ * case.
  */
 static List *
 gen_partprune_steps_internal(GeneratePruningStepsContext *context,
@@ -942,35 +943,15 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 	}
 	else
 	{
+		PartitionPruneStep *orstep;
+
 		/*
 		 * The arg didn't contain a clause matching this
 		 * partition key.  We cannot prune using such an arg.
 		 * To indicate that to the pruning code, we must
 		 * construct a dummy PartitionPruneStepCombine whose
 		 * source_stepids is set to an empty List.
-		 *
-		 * However, if we can prove using constraint exclusion
-		 * that the clause refutes the table's partition
-		 * constraint (if it's sub-partitioned), we need not
-		 * bother with that.  That is, we effectively ignore
-		 * this OR arm.
 		 */
-		List	   *partconstr = context->rel->partition_qual;
-		PartitionPruneStep *orstep;
-
-		if (partconstr)
-		{
-			partconstr = (List *)
-expression_planner((Expr *) partconstr);
-			if (context->rel->relid != 1)
-ChangeVarNodes((Node *) partconstr, 1,
-			   context->rel->relid, 0);
-			if (predicate_refuted_by(partconstr,
-	 list_make1(arg),
-	 false))
-continue;
-		}
-
 		orstep = gen_prune_step_combine(context, NIL,
 		PARTPRUNE_COMBINE_UNION);
 		arg_stepids = lappend_int(arg_stepids, orstep->step_id);
@@ -1038,6 +1019,29 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 			 */
 		}
 
+		/*
+		 * If the clause contradicts the partition constraint, mark the clause
+		 * as contradictory and we're done.  This is particularly helpful to
+		 * prune the default partition.
+		 */
+		if (context->rel->partition_qual)
+		{
+			List	   *partconstr;
+
+			partconstr = (List *)
+expression_planner((Expr *) context->rel->partition_qual);
+			if (context->rel->relid != 1)
+ChangeVarNodes((Node *) partconstr, 1,
+			   context->rel->relid, 0);
+			if (predicate_refuted_by(partconstr,
+	 list_make1(clause),
+	 false))
+			{
+context->contradictory = true;
+return NIL;
+			}
+		}
+
 		/*
 		 * See if we can match this clause to any of the partition keys.
 		 */
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 2d3229fd73..6ccc91d390 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -592,6 +592,24 @@ explain (costs off) select * from rlp where a < 1 or (a > 20 and a < 25);
  Filter: ((a < 1) OR ((a > 20) AND (a < 25)))
 (5 rows)
 
+-- where clause contradicts sub-partition's constraint
+explain (costs off) select * from rlp where a = 20 or a = 40;
+   QUERY PLAN   
+
+ Append
+   ->  Seq Scan on rlp4_1
+ Filter: ((a = 20) OR (a = 40))
+   ->  Seq Scan on rlp5_default
+ Filter: ((a = 20) OR (a = 40))
+(5 rows)
+
+explain (costs off) select * from rlp3 where a = 20;   /* empty */
+QUERY PLAN   

Re: Assertion for logically decoding multi inserts into the catalog

2019-08-06 Thread Daniel Gustafsson
> On 6 Aug 2019, at 05:36, Michael Paquier  wrote:
> 
> On Tue, Aug 06, 2019 at 12:52:09AM +0200, Daniel Gustafsson wrote:
>> Yeah, this is clearly fat-fingered, the intent is to only run the Assert in
>> case XLH_INSERT_CONTAINS_NEW_TUPLE is set in xlrec->flags, as it only applies
>> under that condition.  The attached is tested in both in the multi-insert 
>> patch
>> and on HEAD, but I wish I could figure out a better way to express this 
>> Assert.
> 
> -   Assert(data == tupledata + tuplelen);
> +   Assert(data == tupledata + tuplelen ||
> +  ~(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE));
> I find this way to formulate the assertion a bit confusing, as what
> you want is basically to make sure that XLH_INSERT_CONTAINS_NEW_TUPLE
> is not set in the context of catalogs.  So you could just use that
> instead:
> (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0
> 
> Anyway, if you make a parallel with heap_multi_insert() and the way
> each xl_multi_insert_tuple is built, I think that the error does not
> come from this assertion, but with the way the data length is computed
> in DecodeMultiInsert as a move to the next chunk of tuple data is only
> done if XLH_INSERT_CONTAINS_NEW_TUPLE is set.  So, in my opinion,
> something to fix here is to make sure that we compute the correct
> length even if XLH_INSERT_CONTAINS_NEW_TUPLE is *not* set, and then
> make sure at the end that the tuple length matches to the end.
> 
> This way, we also make sure that we never finish on a state where
> the block data associated to the multi-insert record is NULL but
> because of a mistake there is some tuple data detected, or that the
> tuple data set has a final length which matches the expected outcome.
> And actually, it seems to me that this happens in your original patch
> to open access to multi-insert for catalogs, because for some reason
> XLogRecGetBlockData() returns NULL with a non-zero tuplelen in
> DecodeMultiInsert().  I can see that with the TAP test
> 010_logical_decoding_timelines.pl
> 
> Attached is a patch for that.  Thoughts? 

Thanks, this is a much better approach and it passes tests for me.  +1 on this
version (regardless of outcome of the other patch as this is separate).

cheers ./daniel



Re: partition routing layering in nodeModifyTable.c

2019-08-06 Thread Etsuro Fujita
Amit-san,

On Mon, Aug 5, 2019 at 6:16 PM Amit Langote  wrote:
> On Sat, Aug 3, 2019 at 3:01 AM Andres Freund  wrote:
> Based on the discussion, I have updated the patch.
>
> > I'm a bit woried about the move of BeginDirectModify() into
> > nodeModifyTable.c - it just seems like an odd control flow to me. Not
> > allowing any intermittent nodes between ForeignScan and ModifyTable also
> > seems like an undesirable restriction for the future. I realize that we
> > already do that for BeginForeignModify() (just btw, that already accepts
> > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify
> > makes sense), but it still seems like the wrong direction to me.
> >
> > The need for that move, I assume, comes from needing knowing the correct
> > ResultRelInfo, correct?  I wonder if we shouldn't instead determine the
> > at plan time (in setrefs.c), somewhat similar to how we determine
> > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard?
>
> The patch adds a resultRelIndex field to ForeignScan node, which is
> set to >= 0 value for non-SELECT queries.

Thanks for the updated patch!

> I first thought to set it
> only if direct modification is being used, but maybe it'd be simpler
> to set it even if direct modification is not used.  To set it, the
> patch teaches set_plan_refs() to initialize resultRelIndex of
> ForeignScan plans that appear under ModifyTable.  Fujita-san said he
> plans to revise the planning of direct-modification style queries to
> not require a ModifyTable node anymore, but maybe he'll just need to
> add similar code elsewhere but not outside setrefs.c.

Yeah, but I'm not sure this is a good idea:

@@ -877,12 +878,6 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
rc->rti += rtoffset;
rc->prti += rtoffset;
}
-   foreach(l, splan->plans)
-   {
-   lfirst(l) = set_plan_refs(root,
- (Plan *) lfirst(l),
- rtoffset);
-   }

/*
 * Append this ModifyTable node's final result relation RT
@@ -908,6 +903,27 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
lappend_int(root->glob->rootResultRelations,
splan->rootRelation);
}
+
+   resultRelIndex = splan->resultRelIndex;
+   foreach(l, splan->plans)
+   {
+   lfirst(l) = set_plan_refs(root,
+ (Plan *) lfirst(l),
+ rtoffset);
+
+   /*
+* For foreign table result relations, save their index
+* in the global list of result relations into the
+* corresponding ForeignScan nodes.
+*/
+   if (IsA(lfirst(l), ForeignScan))
+   {
+   ForeignScan *fscan = (ForeignScan *) lfirst(l);
+
+   fscan->resultRelIndex = resultRelIndex;
+   }
+   resultRelIndex++;
+   }
}

because I still feel the same way as mentioned above by Andres.  What
I'm thinking for the setrefs.c change is to modify ForeignScan (ie,
set_foreignscan_references) rather than ModifyTable, like the
attached.  Maybe I'm missing something, but for direct modification
without ModifyTable, I think we would probably only have to modify
that function further so that it not only adjusts resultRelIndex but
does some extra work such as appending the result relation RT index to
root->glob->resultRelations as done for ModifyTable.

> > Then we could just have BeginForeignModify, BeginDirectModify,
> > BeginForeignScan all be called from ExecInitForeignScan().

Sorry, previously, I mistakenly agreed with that.  As I said before, I
think I was too tired.

> I too think that it would've been great if we could call both
> BeginForeignModify and BeginDirectModify from ExecInitForeignScan, but
> the former's API seems to be designed to be called from
> ExecInitModifyTable from the get-go.  Maybe we should leave that
> as-is?

+1 for leaving that as-is; it seems reasonable to me to call
BeginForeignModify in ExecInitModifyTable, because the ForeignModify
API is designed based on an analogy with local table modifications, in
which case the initialization needed for performing
ExecInsert/ExecUpdate/ExecDelete is done in ModifyTable, not in the
underlying scan/join node.

@@ -895,6 +898,12 @@ BeginDirectModify(ForeignScanState *node,
  for ExplainDirectModify and EndDirectModify.
 

+
+ Also note that it's a good idea to store the rinfo
+ in the fdw_state for
+ IterateDirectModify to use.
+

Actually, if the FDW only supports direct modifications for queries
without RETURNING, 

Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-08-06 Thread Thomas Munro
On Tue, Aug 6, 2019 at 9:26 PM Heikki Linnakangas  wrote:
> I had some steam, and wrote a spec that reproduces this bug. It wasn't
> actually that hard to reproduce, fortunately. Or unfortunately; people
> might well be hitting it in production. I used the "freezetest.spec"
> from the 2013 thread as the starting point, and added one UPDATE to the
> initialization, so that the test starts with an already HOT-updated
> tuple. It should throw a serialization error, but on current master, it
> does not. After applying your fix.txt, it does.

Thanks!  Ahh, right, I was expecting it to be harder to see an
undetected anomaly, because of the index page lock, but of course we
never actually write to that page so it's just the heap tuple lock
holding everything together.

> Your fix.txt seems correct. For clarity, I'd prefer moving things around
> a bit, though, so that the t_self is set earlier in the function, at the
> same place where the other HeapTuple fields are set. And set blkno and
> offnum together, in one ItemPointerSet call. With that, I'm not sure we
> need such a verbose comment explaining why t_self needs to be updated
> but I kept it for now.

+1

> Attached is a patch that contains your fix.txt with the changes for
> clarity mentioned above, and an isolationtester test case.

LGTM.

> PS. Because heap_hot_search_buffer() now always sets heapTuple->t_self
> to the returned tuple version, updating *tid is redundant. And the call
> in heapam_index_fetch_tuple() wouldn't need to do
> "bslot->base.tupdata.t_self = *tid".

Right, that sounds like a separate improvement for master only.

-- 
Thomas Munro
https://enterprisedb.com




Re: SQL:2011 PERIODS vs Postgres Ranges?

2019-08-06 Thread Ibrar Ahmed
Hi Paul,

On Mon, Aug 5, 2019 at 3:11 AM Paul A Jungwirth 
wrote:

> On Fri, Aug 2, 2019 at 1:49 PM Ibrar Ahmed  wrote:
> > I did some clean-up on this patch. I have also refactored a small
> portion of the code
> > to reduce the footprint of the patch. For simplicity, I have divided the
> patch into 6
> > patches, now it is easy to review and debug.
> > Please follow the PostgreSQL coding guidelines. I have found places
> where you missed that, secondly code even in WIP stage must not have
> WARNING because it looks ugly.
>
> Thank you for the cleanup Ibrar! I'll try to stick to the coding
> standards more closely going forward. If you have any review comments
> I would certainly appreciate them, especially about the overall
> approach. I know that the implementation in its current form is not
> very tasteful, but I wanted to get some feedback on the ideas.
>
> I have reviewed the main design, and in my opinion, it is a good start.

- Why are we not allowing any datatype other than ranges in the
primary key? Without that, there is no purpose to a primary key.

- Thinking about some special token to differentiate between a normal
primary key and a temporal primary key



Also just to reiterate: this patch depends on my other CF entry
> (range_agg), whose scope has expanded considerably. Right now I'm
> focusing on that. And if you're trying to make this code work, it's
> important to apply the range_agg patch first, since the temporal
> foreign key implementation calls that function.
>
> Also: since this patch raises the question of how to conform to
> SQL:2011 while still supporting Postgres range types, I wrote an
> article that surveys SQL:2011 temporal features in MariaDB, DB2,
> Oracle, and MS SQL Server:
>
> https://illuminatedcomputing.com/posts/2019/08/sql2011-survey/
>
> A few highlights are:
>
> - Everyone lets you define PERIODs, but what you can do with them is
> still *very* limited.
> - No one treats PERIODs as first-class types or expressions; they are
> more like table metadata.
>


> - Oracle PERIODs do permit NULL start/end values, and it interprets
> them as "unbounded". That goes against the standard but since it's
> what Postgres does with ranges, it suggests to me that maybe we should
> follow their lead. Anyway I think a NULL is nicer than a sentinel for
> this purpose.
>

That is an open debate: whether we want to strictly follow the standard or map
it to the PostgreSQL range type, which allows NULL. But how would you define a
primary key on that?






>
> Regards,
> Paul
>


-- 
Ibrar Ahmed


Re: Identity columns should own only one sequence

2019-08-06 Thread Peter Eisentraut
On 2019-08-05 13:30, Laurenz Albe wrote:
> Peter Eisentraut wrote:
>> On 2019-05-08 16:49, Laurenz Albe wrote:
>>> I believe we should have both:
>>>
>>> - Identity columns should only use sequences with an INTERNAL dependency,
>>>   as in Peter's patch.
>>
>> I have committed this.
> 
> Since this is a bug fix, shouldn't it be backpatched?

In cases where the workaround is "don't do that then", I'm inclined to
leave it alone.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Update to DocBook 4.5

2019-08-06 Thread Peter Eisentraut
I propose to apply the attached patch (to master) to update the DocBook
version to 4.5 (from 4.2).  This basically just gets us off some random
intermediate minor version to the latest within that major version.

Most packagings put all 4.* versions into one package, so you probably
don't need to change anything in your tools.  The exception is MacPorts,
as seen in the patch.  (DocBook 4.5 was released in 2006, so it should
be available everywhere.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 2c3a14083d64840ef1f3b8db9b6c492c332f1368 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 6 Aug 2019 11:42:35 +0200
Subject: [PATCH] Update to DocBook 4.5

This moves us to the latest minor version of DocBook 4.  It requires
no markup changes.
---
 configure.in| 2 +-
 doc/src/sgml/docguide.sgml  | 6 +++---
 doc/src/sgml/postgres.sgml  | 4 ++--
 doc/src/sgml/standalone-install.xml | 2 +-
 doc/src/sgml/standalone-profile.xsl | 4 ++--
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/configure.in b/configure.in
index dde3eec89f..33edfd765b 100644
--- a/configure.in
+++ b/configure.in
@@ -2257,7 +2257,7 @@ fi
 # Check for DocBook and tools
 #
 PGAC_PATH_XMLLINT
-PGAC_CHECK_DOCBOOK(4.2)
+PGAC_CHECK_DOCBOOK(4.5)
 PGAC_PATH_PROGS(DBTOEPUB, dbtoepub)
 PGAC_PATH_PROGS(XSLTPROC, xsltproc)
 PGAC_PATH_PROGS(FOP, fop)
diff --git a/doc/src/sgml/docguide.sgml b/doc/src/sgml/docguide.sgml
index 0608b8c612..8bcaf235ad 100644
--- a/doc/src/sgml/docguide.sgml
+++ b/doc/src/sgml/docguide.sgml
@@ -84,7 +84,7 @@ Tool Sets
  
   
This is the definition of DocBook itself.  We currently use version
-   4.2; you cannot use later or earlier versions.  You need
+   4.5; you cannot use later or earlier versions.  You need
the XML variant of the DocBook DTD, not
the SGML variant.
   
@@ -214,7 +214,7 @@ macOS

 If you use MacPorts, the following will get you set up:
 
-sudo port install docbook-xml-4.2 docbook-xsl fop
+sudo port install docbook-xml-4.5 docbook-xsl fop
 
 If you use Homebrew, use this:
 
@@ -234,7 +234,7 @@ Detection by configure
like this:
 
 checking for xmllint... xmllint
-checking for DocBook XML V4.2... yes
+checking for DocBook XML V4.5... yes
 checking for dbtoepub... dbtoepub
 checking for xsltproc... xsltproc
 checking for fop... fop
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index 3e115f1c76..e59cba7997 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -1,7 +1,7 @@
 
 
-http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd;
+http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd;
 [
 
 
diff --git a/doc/src/sgml/standalone-install.xml 
b/doc/src/sgml/standalone-install.xml
index f584789f9a..019377a67f 100644
--- a/doc/src/sgml/standalone-install.xml
+++ b/doc/src/sgml/standalone-install.xml
@@ -1,5 +1,5 @@
 
-http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd;>
+http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd;>
 
 
 http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"/>
+doctype-public="-//OASIS//DTD DocBook XML V4.5//EN"
+doctype-system="http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"/>
 
 
 
-- 
2.22.0



Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-08-06 Thread Heikki Linnakangas

On 06/08/2019 07:20, Thomas Munro wrote:

On Tue, Aug 6, 2019 at 4:35 AM Andres Freund  wrote:

On 2019-08-05 20:58:05 +1200, Thomas Munro wrote:

1.  Commit dafaa3efb75 "Implement genuine serializable isolation
level." (2011) locked the root tuple, because it set t_self to *tid.
Possibly due to confusion about the effect of the preceding line
ItemPointerSetOffsetNumber(tid, offnum).

2.  Commit commit 81fbbfe3352 "Fix bugs in SSI tuple locking." (2013)
fixed that by adding ItemPointerSetOffsetNumber(>t_self,
offnum).


Hm. It's not at all sure that it's ok to report the non-root tuple tid
here. I.e. I'm fairly sure there was a reason to not just set it to the
actual tid. I think I might have written that up on the list at some
point. Let me dig in memory and list. Obviously possible that that was
also obsoleted by parallel changes.


Adding Heikki and Kevin.

I haven't found your earlier discussion about that yet, but would be
keen to read it if you can find it.   I wondered if your argument
might have had something to do with heap pruning, but I can't find a
problem there.  It's not as though the TID of any visible tuples
change, it's just that some dead stuff goes away and the root line
pointer is changed to LP_REDIRECT if it wasn't already.

As for the argument for locking the tuple we emit rather than the HOT
root, I think the questions are: (1) How exactly do we get away with
locking only one version in a regular non-HOT update chain?  (2) Is it
OK to do that with a HOT root?

The answer to the first question is given in README-SSI[1].
Unfortunately it doesn't discuss HOT directly, but I suspect the
answer is no, HOT is not special here.  By my reading, it relies on
the version you lock being the version visible to your snapshot, which
is important because later updates have to touch that tuple to write
the next version.  That doesn't apply to some arbitrarily older tuple
that happens to be a HOT root.  Concretely, heap_update() does
CheckForSerializableConflictIn(relation, , buffer), which is
only going to produce a rw conflict if T1 took an SIREAD on precisely
the version T2 locks in that path, not some arbitrarily older version
that happens to be a HOT root.  A HOT root might never be considered
again by concurrent writers, no?


Your analysis is spot on. Thanks for the clear write-up!


This must be in want of an isolation test, but I haven't yet tried to
get my head around how to write a test that would show the difference.


Indeed.


One practical problem is that the only way to reach
PredicateLockTuple() is from an index scan, and the index scan locks
the index page (or the whole index, depending on
rd_indam->ampredlocks).  So I think if you want to see a serialization
anomaly you'll need multiple indexes (so that index page locks don't
hide the problem), a long enough HOT chain and then probably several
transactions to be able to miss a cycle that should be picked up by
the logic in [1].  I'm out of steam for this problem today though.


I had some steam, and wrote a spec that reproduces this bug. It wasn't 
actually that hard to reproduce, fortunately. Or unfortunately; people 
might well be hitting it in production. I used the "freezetest.spec" 
from the 2013 thread as the starting point, and added one UPDATE to the 
initialization, so that the test starts with an already HOT-updated 
tuple. It should throw a serialization error, but on current master, it 
does not. After applying your fix.txt, it does.


Your fix.txt seems correct. For clarity, I'd prefer moving things around 
a bit, though, so that the t_self is set earlier in the function, at the 
same place where the other HeapTuple fields are set. And set blkno and 
offnum together, in one ItemPointerSet call. With that, I'm not sure we 
need such a verbose comment explaining why t_self needs to be updated 
but I kept it for now.
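
For concreteness, the reshuffled part would look roughly like this (only a
sketch using the variable names already present in heap_hot_search_buffer();
the attached patch is the authoritative version):

    /* near the top of the function */
    blkno = ItemPointerGetBlockNumber(tid);
    offnum = ItemPointerGetOffsetNumber(tid);

    /* ... in the chain-following loop, where the other fields are filled: */
    heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
    heapTuple->t_len = ItemIdGetLength(lp);
    heapTuple->t_tableOid = RelationGetRelid(relation);
    /* point t_self at the version we are returning, not at the HOT root */
    ItemPointerSet(&heapTuple->t_self, blkno, offnum);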


Attached is a patch that contains your fix.txt with the changes for 
clarity mentioned above, and an isolationtester test case.


PS. Because heap_hot_search_buffer() now always sets heapTuple->t_self 
to the returned tuple version, updating *tid is redundant. And the call 
in heapam_index_fetch_tuple() wouldn't need to do 
"bslot->base.tupdata.t_self = *tid".


- Heikki
>From 23d07e6d5b259e1fd2fe7694cde51212920fbb3a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 6 Aug 2019 12:14:14 +0300
Subject: [PATCH 1/1] Fix predicate-locking of HOT updated rows.

In serializable mode, heap_hot_search_buffer() incorrectly acquired a
predicate lock on the root tuple, not the returned tuple that satisfied
the visibility checks. As explained in README-SSI, the predicate lock does
not need to be copied or extended to other tuple versions, but for that to
work, the correct, visible, tuple version must be locked in the first
place.

The original SSI commit had this bug in it, but it was fixed back in 2013,
in commit 81fbbfe335. But unfortunately, it was reintroduced a few months
later in commit b89e151054. 

Re: Fix a typo in add_partial_path

2019-08-06 Thread Amit Langote
On Tue, Aug 6, 2019 at 6:12 PM Michael Paquier  wrote:
>
> On Tue, Aug 06, 2019 at 05:34:06PM +0900, Amit Langote wrote:
> > Attached patch for:
> >
> > s/incompable/incompatible/g
>
> Thanks, applied.

Thank you Michael.

Regards,
Amit




Re: Fix a typo in add_partial_path

2019-08-06 Thread Michael Paquier
On Tue, Aug 06, 2019 at 05:34:06PM +0900, Amit Langote wrote:
> Attached patch for:
> 
> s/incompable/incompatible/g

Thanks, applied.
--
Michael


signature.asc
Description: PGP signature


Re: Index Skip Scan

2019-08-06 Thread Floris Van Nee
> Yes, the check should be for that. However, the query in question
> doesn't have any query_pathkeys, and hence query_uniquekeys in
> standard_qp_callback(), so therefore it isn't supported

> Your scenario is covered by one of the test cases in case the
> functionality is supported. But, I think that is outside the scope of
> the current patch.

Ah alright, thanks. That makes it clear why it doesn't work.
From a user point of view I think it's rather strange that
SELECT DISTINCT ON (a) a,b FROM a WHERE a BETWEEN 2 AND 2
would give a fast skip scan, even though the more likely query that someone 
would write
SELECT DISTINCT ON (a) a,b FROM a WHERE a=2
would not.
It is something we could leave up to the next patch though.

Something else I just noticed which I'm just writing here for awareness; I 
don't think it's that pressing at the moment and can be left to another patch. 
When there are multiple indices on a table the planner gets confused and 
doesn't select an index-only skip scan even though it could. I'm guessing it 
just takes the first available index based on the DISTINCT clause and then 
doesn't look further, eg.
With an index on (a,b) and (a,c,b):
postgres=# explain select distinct on (a) a,c,b FROM a;
 QUERY PLAN

 Index Scan using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=12)
   Skip scan mode: true
(2 rows)
-> This could be an index-only scan with the (a,c,b) index.

> For the records, the purpose of `_bt_read_closest` is not so much to reduce
> amount of data we read from a page, but more to correctly handle those
> situations we were discussing before with reading forward/backward in cursors,
> since for that in some cases we need to remember previous values for stepping
> to the next. I've limited number of items, fetched in this function just
> because I was misled by having a check for dead tuples in `_bt_skip`. Of 
> course
> we can modify it to read a whole page and leave visibility check for the code
> after `index_getnext_tid` (although in case if we know that all tuples on this
> page are visilbe I guess it's not strictly necessary, but we still get
> improvement from skipping itself).

I understand and I agree - the primary purpose of choosing this function was to
make it work correctly. I don't think it would be something for this patch to
use the optimization of partially reading a page. My point was, however, that if
this optimization were allowed in a future patch, it would have great
performance benefits.
To fix the current patch, we'd indeed need to read the full page. It'd be good
to take a close look at the implementation of this function then, because
messing around with the previous/next items is also not trivial. I think the
current implementation also has a problem when the item that is skipped to is
the first item on the page. E.g. (this depends on page size):

postgres=# drop table if exists b; create table b as select a,b from 
generate_series(1,5) a, generate_series(1,366) b; create index on b (a,b); 
analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# select distinct on(a) a,b from b;
 a | b
---+---
 1 | 1
 2 | 2   <-- (2,1) is the first item on the page and doesn't get selected by
the read_closest function; it returns the second item on the page, which is (2,2)
 3 | 2
 4 | 2
 5 | 2
(5 rows)


-Floris



Fix a typo in add_partial_path

2019-08-06 Thread Amit Langote
Hi,

Attached patch for:

s/incompable/incompatible/g

Thanks,
Amit


add_partial_path-typo.patch
Description: Binary data


Re: Global temporary tables

2019-08-06 Thread Konstantin Knizhnik

New version of the patch with several fixes is attached.
Many thanks to Roman Zharkov for testing.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 1bd579f..2d93f6f 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -153,9 +153,9 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 			buf_state = LockBufHdr(bufHdr);
 
 			fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
-			fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode;
-			fctx->record[i].reltablespace = bufHdr->tag.rnode.spcNode;
-			fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode;
+			fctx->record[i].relfilenode = bufHdr->tag.rnode.node.relNode;
+			fctx->record[i].reltablespace = bufHdr->tag.rnode.node.spcNode;
+			fctx->record[i].reldatabase = bufHdr->tag.rnode.node.dbNode;
 			fctx->record[i].forknum = bufHdr->tag.forkNum;
 			fctx->record[i].blocknum = bufHdr->tag.blockNum;
 			fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 38ae240..8a04954 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -608,9 +608,9 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
 		if (buf_state & BM_TAG_VALID &&
 			((buf_state & BM_PERMANENT) || dump_unlogged))
 		{
-			block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
-			block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
-			block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
+			block_info_array[num_blocks].database = bufHdr->tag.rnode.node.dbNode;
+			block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.node.spcNode;
+			block_info_array[num_blocks].filenode = bufHdr->tag.rnode.node.relNode;
 			block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
 			block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
 			++num_blocks;
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index c945b28..14d4e48 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -95,13 +95,13 @@ ginRedoInsertEntry(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rda
 
 	if (PageAddItem(page, (Item) itup, IndexTupleSize(itup), offset, false, false) == InvalidOffsetNumber)
 	{
-		RelFileNode node;
+		RelFileNodeBackend rnode;
 		ForkNumber	forknum;
 		BlockNumber blknum;
 
-		BufferGetTag(buffer, , , );
+		BufferGetTag(buffer, , , );
 		elog(ERROR, "failed to add item to index page in %u/%u/%u",
-			 node.spcNode, node.dbNode, node.relNode);
+			 rnode.node.spcNode, rnode.node.dbNode, rnode.node.relNode);
 	}
 }
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 9726020..389466e 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1028,7 +1028,8 @@ gistGetFakeLSN(Relation rel)
 {
 	static XLogRecPtr counter = FirstNormalUnloggedLSN;
 
-	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP ||
+		rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION)
 	{
 		/*
 		 * Temporary relations are only accessible in our session, so a simple
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09bc6fe..c60effd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -671,6 +671,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 			 * init fork of an unlogged relation.
 			 */
 			if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ||
 (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
  forkNum == INIT_FORKNUM))
 log_smgrcreate(newrnode, forkNum);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5962126..60f4696 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -763,7 +763,11 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 		/* Read an existing block of the relation */
 		buf = ReadBuffer(rel, blkno);
 		LockBuffer(buf, access);
-		_bt_checkpage(rel, buf);
+		/* Session temporary relation may be not yet initialized for this backend. */
+		if (blkno == BTREE_METAPAGE && PageIsNew(BufferGetPage(buf)) && IsSessionRelationBackendId(rel->rd_backend))
+			_bt_initmetapage(BufferGetPage(buf), P_NONE, 0);
+		else
+			_bt_checkpage(rel, buf);
 	}
 	else
 	{
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3ec67d4..edec8ca 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -213,6 +213,7 @@ void

Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-08-06 Thread Kuntal Ghosh
Hello Thomas,

On Tue, Aug 6, 2019 at 9:50 AM Thomas Munro  wrote:
>
> On Tue, Aug 6, 2019 at 4:35 AM Andres Freund  wrote:
> > On 2019-08-05 20:58:05 +1200, Thomas Munro wrote:
> > > 1.  Commit dafaa3efb75 "Implement genuine serializable isolation
> > > level." (2011) locked the root tuple, because it set t_self to *tid.
> > > Possibly due to confusion about the effect of the preceding line
> > > ItemPointerSetOffsetNumber(tid, offnum).
> > >
> > > 2.  Commit commit 81fbbfe3352 "Fix bugs in SSI tuple locking." (2013)
> > > fixed that by adding ItemPointerSetOffsetNumber(>t_self,
> > > offnum).
> >
> > Hm. It's not at all sure that it's ok to report the non-root tuple tid
> > here. I.e. I'm fairly sure there was a reason to not just set it to the
> > actual tid. I think I might have written that up on the list at some
> > point. Let me dig in memory and list. Obviously possible that that was
> > also obsoleted by parallel changes.
>
> Adding Heikki and Kevin.
>
> I haven't found your earlier discussion about that yet, but would be
> keen to read it if you can find it.   I wondered if your argument
> might have had something to do with heap pruning, but I can't find a
> problem there.  It's not as though the TID of any visible tuples
> change, it's just that some dead stuff goes away and the root line
> pointer is changed to LP_REDIRECT if it wasn't already.
>
> As for the argument for locking the tuple we emit rather than the HOT
> root, I think the questions are: (1) How exactly do we get away with
> locking only one version in a regular non-HOT update chain?  (2) Is it
> OK to do that with a HOT root?
>
> The answer to the first question is given in README-SSI[1].
> Unfortunately it doesn't discuss HOT directly, but I suspect the
> answer is no, HOT is not special here.  By my reading, it relies on
> the version you lock being the version visible to your snapshot, which
> is important because later updates have to touch that tuple to write
> the next version.  That doesn't apply to some arbitrarily older tuple
> that happens to be a HOT root.  Concretely, heap_update() does
> CheckForSerializableConflictIn(relation, , buffer), which is
> only going to produce a rw conflict if T1 took an SIREAD on precisely
> the version T2 locks in that path, not some arbitrarily older version
> that happens to be a HOT root.  A HOT root might never be considered
> again by concurrent writers, no?
>
If I understand the problem, this is the same serialization issue as
with in-place updates for zheap. I had a discussion with Kevin
regarding the same in this thread [1]. It seems if we're locking the
hot root id, we may report some false positive serializable errors.


> > > This must be in want of an isolation test, but I haven't yet tried to
> > > get my head around how to write a test that would show the difference.
> >
> > Indeed.
>
> One practical problem is that the only way to reach
> PredicateLockTuple() is from an index scan, and the index scan locks
> the index page (or the whole index, depending on
> rd_indam->ampredlocks).  So I think if you want to see a serialization
> anomaly you'll need multiple indexes (so that index page locks don't
> hide the problem), a long enough HOT chain and then probably several
> transactions to be able to miss a cycle that should be picked up by
> the logic in [1].  I'm out of steam for this problem today though.
>
> The simple test from the report[3] that resulted in commit 81fbbfe3352
> doesn't work for me (ie with permutation "r1" "r2" "w1" "w2" "c1" "c2"
> twice in a row).  The best I've come up with so far is an assertion
> that we predicate-lock the same row version that we emitted to the
> user, when reached via an index lookup that visits a HOT row.  The
> test outputs 'f' for master, but 't' with the change to heapam.c.
>
Here is an example from the multiple-row-versions isolation test which
fails if we perform in-place updates for zheap. I think the same will
be relevant if we lock root tuple id instead of the tuple itself.
Step 1: T1-> BEGIN; Read FROM t where id=100;
Step 2: T2-> BEGIN; UPDATE t where id=100; COMMIT; (creates T1->T2)
Step 3: T3-> BEGIN; UPDATE t where id=100; Read FROM t where id=50;
Step 4: T4-> BEGIN; UPDATE t where id= 50; Read FROM t where id=1;
COMMIT;  (creates T3->T4)
Step 5: T3-> COMMIT;
Step 6: T1-> UPDATE t where id=1; COMMIT;  (creates T4->T1,)

At step 6, when the update statement is executed, T1 is rolled back
because of T3->T4->T1.

But for zheap, step 3 also creates a dependency T1->T3 because of
in-place update. When T4 commits in step 4, it marks T3 as doomed
because of T1 --> T3 --> T4. Hence, in step 5, T3 is rolled back.

[1]  Re: In-place updates and serializable transactions:
https://www.postgresql.org/message-id/CAGz5QCJzreUqJqHeXrbEs6xb0zCNKBHhOj6D9Tjd3btJTzydxg%40mail.gmail.com

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: BUG #15938: Corrupted WAL segment after crash recovery

2019-08-06 Thread Kyotaro Horiguchi
Hello.

At Thu, 01 Aug 2019 13:52:52 +, PG Bug reporting form 
 wrote in <15938-8591df7e95064...@postgresql.org>
> The following bug has been logged on the website:
> 
> Bug reference:  15938
> Logged by:  Alexander Kukushkin
> Email address:  cyberd...@gmail.com
> PostgreSQL version: 10.9
> Operating system:   Ubuntu 18.04.2 LTS
> Description:
> 
> On one of our cluster one of the postgres backend processes was killed by
> kernel oom.

Although I don't think replication reconnection to a
crash-recovered master is generally guaranteed, this seems
different.

> 2019-08-01 12:41:06.376 UTC,,,22697,,5d42dddb.58a9,4,,2019-08-01 12:40:59
> UTC,,0,LOG,0,"redo done at B8D/44FFE570",""
> 2019-08-01 12:41:06.376 UTC,,,22697,,5d42dddb.58a9,5,,2019-08-01 12:40:59
> UTC,,0,LOG,0,"last completed transaction was at log time 2019-08-01
> 12:40:54.782747+00",""
..
> Unfortunately I can't compare files in the archive and on the primary,
> because it was recycled despite usage of replication slots.
> pg_replication_slots view reports restart_lsn as B8D/451EB540 (the next
> segment).

WAL records from 44ffe6d0 (maybe) up to 451eb540 were somehow
ignored during crash recovery of the master, or lost even though
they should have been fsynced out. The only explanation I came up
with is the fsync problem, but that was fixed in 10.7. However,
note the following wiki description:

https://wiki.postgresql.org/wiki/Fsync_Errors

> Linux 4.13 and 4.15: fsync() only reports writeback errors that
> occurred after you called open() so our schemes for closing and
> opening files LRU-style and handing fsync() work off to the
> checkpointer process can hide write-back errors; also buffers are
> marked clean after errors so even if you opened the file before
> the failure, retrying fsync() can falsely report success and the
> modified buffer can be thrown away at any time due to memory
> pressure.

If I read this correctly, after the checkpointer fails to fsync
while creating a preallocated segment file (this is an ERROR, not
a PANIC), other processes will never receive an fsync error for
that file.  This scenario is consistent with the data loss
apparently starting at the beginning of the new segment (assuming
that the record at 44ffe6d0 continues into the next segment).
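
To be specific, the path I have in mind is the zero-fill fsync in
XLogFileInit(), which is roughly the following (abbreviated; see
xlog.c for the real code):

    if (pg_fsync(fd) != 0)
    {
        int         save_errno = errno;

        close(fd);
        errno = save_errno;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", tmppath)));
    }

Since this is only an ERROR, it aborts just the checkpointer's
current activity; a backend that later writes WAL into that
segment has no way to learn that the writeback was lost.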

Thoughts?


> But I can run pg_waldump, and the last record in this file is
> CHECKPOINT_SHUTDOWN:
> rmgr: XLOGlen (rec/tot):106/   106, tx:  0, lsn:
> B8D/44FFF6D0, prev B8D/44FFE570, desc: CHECKPOINT_SHUTDOWN redo
> B8D/44FFF6D0; tli 80; prev tli 80; fpw true; xid 0:42293547; oid 10741249;
> multi 1; offset 0; oldest xid 549 in DB 1; oldest multi 1 in DB 1;
> oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> pg_waldump: FATAL:  error in WAL record at B8D/44FFF6D0: invalid record
> length at B8D/44FFF740: wanted 24, got 0

The shutdown record is written at the end of crash recovery of
the master, then replicated. As a result, the subsequent bytes
are inconsistent with the record.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: POC: Cleaning up orphaned files using undo logs

2019-08-06 Thread Andres Freund
Hi,

On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> Need to do something else for a bit. More later.

Here we go.




> + /*
> +  * Compute the header size of the undo record.
> +  */
> +Size
> +UndoRecordHeaderSize(uint16 uur_info)
> +{
> + Sizesize;
> +
> + /* Add fixed header size. */
> + size = SizeOfUndoRecordHeader;
> +
> + /* Add size of transaction header if it presets. */
> + if ((uur_info & UREC_INFO_TRANSACTION) != 0)
> + size += SizeOfUndoRecordTransaction;
> +
> + /* Add size of rmid if it presets. */
> + if ((uur_info & UREC_INFO_RMID) != 0)
> + size += sizeof(RmgrId);
> +
> + /* Add size of reloid if it presets. */
> + if ((uur_info & UREC_INFO_RELOID) != 0)
> + size += sizeof(Oid);
> +
> + /* Add size of fxid if it presets. */
> + if ((uur_info & UREC_INFO_XID) != 0)
> + size += sizeof(FullTransactionId);
> +
> + /* Add size of cid if it presets. */
> + if ((uur_info & UREC_INFO_CID) != 0)
> + size += sizeof(CommandId);
> +
> + /* Add size of forknum if it presets. */
> + if ((uur_info & UREC_INFO_FORK) != 0)
> + size += sizeof(ForkNumber);
> +
> + /* Add size of prevundo if it presets. */
> + if ((uur_info & UREC_INFO_PREVUNDO) != 0)
> + size += sizeof(UndoRecPtr);
> +
> + /* Add size of the block header if it presets. */
> + if ((uur_info & UREC_INFO_BLOCK) != 0)
> + size += SizeOfUndoRecordBlock;
> +
> + /* Add size of the log switch header if it presets. */
> + if ((uur_info & UREC_INFO_LOGSWITCH) != 0)
> + size += SizeOfUndoRecordLogSwitch;
> +
> + /* Add size of the payload header if it presets. */
> + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> + size += SizeOfUndoRecordPayload;

There's numerous blocks with one if for each type, and the body copied
basically the same for each alternative. That doesn't seem like a
reasonable approach to me. Means that many places need to be adjusted
when we invariably add another type, and seems likely to lead to bugs
over time.
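
To illustrate the direction (a sketch only, reusing the UREC_INFO_* constants
and SizeOf* macros from the patch): one table mapping header bits to their
sizes would let all of these functions share a single loop, e.g.

    static const struct
    {
        uint16      mask;
        Size        size;
    } undo_header_parts[] =
    {
        {UREC_INFO_TRANSACTION, SizeOfUndoRecordTransaction},
        {UREC_INFO_RMID, sizeof(RmgrId)},
        {UREC_INFO_RELOID, sizeof(Oid)},
        {UREC_INFO_XID, sizeof(FullTransactionId)},
        {UREC_INFO_CID, sizeof(CommandId)},
        {UREC_INFO_FORK, sizeof(ForkNumber)},
        {UREC_INFO_PREVUNDO, sizeof(UndoRecPtr)},
        {UREC_INFO_BLOCK, SizeOfUndoRecordBlock},
        {UREC_INFO_LOGSWITCH, SizeOfUndoRecordLogSwitch},
        {UREC_INFO_PAYLOAD, SizeOfUndoRecordPayload},
    };

    Size
    UndoRecordHeaderSize(uint16 uur_info)
    {
        Size        size = SizeOfUndoRecordHeader;
        int         i;

        /* add the size of every optional header that is present */
        for (i = 0; i < lengthof(undo_header_parts); i++)
        {
            if ((uur_info & undo_header_parts[i].mask) != 0)
                size += undo_header_parts[i].size;
        }

        return size;
    }

Adding a new optional header then means adding one table entry instead of
touching every size/pack/unpack function.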

> + /* Add size of the payload header if it presets. */

FWIW, repeating the same comment, with or without minor differences, 10
times is a bad idea. Especially when the comment doesn't add *any* sort
of information.

Also, "if it presets" presumably is a typo?


> +/*
> + * Compute and return the expected size of an undo record.
> + */
> +Size
> +UndoRecordExpectedSize(UnpackedUndoRecord *uur)
> +{
> + Sizesize;
> +
> + /* Header size. */
> + size = UndoRecordHeaderSize(uur->uur_info);
> +
> + /* Payload data size. */
> + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> + {
> + size += uur->uur_payload.len;
> + size += uur->uur_tuple.len;
> + }
> +
> + /* Add undo record length size. */
> + size += sizeof(uint16);
> +
> + return size;
> +}
> +
> +/*
> + * Calculate the size of the undo record stored on the page.
> + */
> +static inline Size
> +UndoRecordSizeOnPage(char *page_ptr)
> +{
> + uint16  uur_info = ((UndoRecordHeader *) page_ptr)->urec_info;
> + Sizesize;
> +
> + /* Header size. */
> + size = UndoRecordHeaderSize(uur_info);
> +
> + /* Payload data size. */
> + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> + {
> + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr + 
> size);
> +
> + size += payload->urec_payload_len;
> + size += payload->urec_tuple_len;
> + }
> +
> + return size;
> +}
> +
> +/*
> + * Compute size of the Unpacked undo record in memory
> + */
> +Size
> +UnpackedUndoRecordSize(UnpackedUndoRecord *uur)
> +{
> + Sizesize;
> +
> + size = sizeof(UnpackedUndoRecord);
> +
> + /* Add payload size if record contains payload data. */
> + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> + {
> + size += uur->uur_payload.len;
> + size += uur->uur_tuple.len;
> + }
> +
> + return size;
> +}

These functions are all basically the same. We shouldn't copy code over
and over like this.


> +/*
> + * Initiate inserting an undo record.
> + *
> + * This function will initialize the context for inserting and undo record
> + * which will be inserted by calling InsertUndoData.
> + */
> +void
> +BeginInsertUndo(UndoPackContext *ucontext, UnpackedUndoRecord *uur)
> +{
> + ucontext->stage = UNDO_PACK_STAGE_HEADER;
> + ucontext->already_processed = 0;
> + ucontext->partial_bytes = 0;
> +
> + /* Copy undo record header. */
> + ucontext->urec_hd.urec_type = uur->uur_type;
> + ucontext->urec_hd.urec_info = uur->uur_info;
> +
> + /* Copy undo record transaction header if it is present. */
> + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
> + memcpy(>urec_txn, uur->uur_txn, 
> SizeOfUndoRecordTransaction);
> +
> + /* 

Re: Adding column "mem_usage" to view pg_prepared_statements

2019-08-06 Thread Konstantin Knizhnik




On 05.08.2019 22:35, Daniel Migowski wrote:

I think that including information in pg_prepared_statements about the
memory used by a statement is very useful.
The CachedPlanMemoryUsage function may be useful not only for this view;
for example, it is also needed in my autoprepare patch.
I would love to use your work if it's done, and would also love to
work together here. I am quite a novice in C though, I might take my
time to get things right.


Right now I reused your implementation of the CachedPlanMemoryUsage function :)
Before, I took into account only the memory used by plan->context, but not
plan->query_context and plan->gplan->context (although the query_context for
the raw parse tree seems to be much smaller).
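
For reference, here is a sketch of how such a function can look on top of the
v12 memory context API (just an illustration, not Daniel's actual code; it
relies on the stats() callback only accumulating into the totals when no
print function is passed):

    #include "postgres.h"

    #include "utils/memutils.h"
    #include "utils/plancache.h"

    /* Space used (allocated minus free) in one context, children excluded. */
    static Size
    ContextSpaceUsed(MemoryContext context)
    {
        MemoryContextCounters totals;

        memset(&totals, 0, sizeof(totals));
        context->methods->stats(context, NULL, NULL, &totals);

        return totals.totalspace - totals.freespace;
    }

    /* Rough memory footprint of one prepared statement. */
    Size
    CachedPlanMemoryUsage(CachedPlanSource *plansource)
    {
        Size        size = 0;

        if (plansource->context)
            size += ContextSpaceUsed(plansource->context);
        if (plansource->query_context)
            size += ContextSpaceUsed(plansource->query_context);
        if (plansource->gplan && plansource->gplan->context)
            size += ContextSpaceUsed(plansource->gplan->context);

        return size;
    }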



I wonder if you would consider going further and not only report but also
control the memory used by prepared statements?
For example, implement some LRU replacement discipline on top of the
prepared statement cache which can evict rarely used prepared statements
to avoid memory overflow.


THIS! Having some kind of safety net here would finally make sure that 
my precious processes will not grow endlessly until all mem is eaten 
up, even with prep statement count limits.


While working on stuff I noticed there are three things stored in a 
CachedPlanSource. The raw query tree (a relatively small thing), the 
query list (analyzed-and-rewritten query tree) which takes up the most 
memory (at least here, maybe different with your usecases), and (often 
after the 6th call) the CachedPlan, which is more optimized than the 
query list and often needs less memory (half of the query list here).


The query list seems to take the most time to create here, because I 
hit the GEQO engine here, but it could be recreated easily (up to 
500ms for some queries). Creating the CachedPlan afterwards takes 60ms 
in some use cases. If we could just invalidate them from time to time, 
that would be great.


Also, invalidating just the queries or the CachedPlan would not 
invalidate the whole prepared statement, which would break clients' 
expectations, but just make them a bit slower, adding much to the 
stability of the system. I would pay that price, because I don't 
use manually named prepared statements anyway and just autogenerate 
them as performance sugar without thinking about what really needs to 
be prepared. There is an option in the JDBC driver to use 
prepared statements automatically after you have used them a few times.


I have noticed that cached plans for implicitly prepared statements in 
stored procedures are not shown in the pg_prepared_statements view.
It may not be a problem in your case (if you are accessing Postgres 
through JDBC and not using prepared statements), but it can cause memory 
overflow in applications that actively use stored procedures, because 
unlike explicitly created prepared statements, statements implicitly 
prepared by plpgsql are very difficult to estimate and control.

I am not sure what the best solution would be in this case. Adding yet 
another view for implicitly prepared statements? Or including them in the 
pg_prepared_statements view?


We have such a patch for PgPro-EE, but it limits only the number of prepared 
statements and does not take into account the amount of memory used by them.
I think that a memory-based limit would be more accurate (although it 
adds more overhead).


Limiting them by number is already done automatically here and would 
really not be of much value, but having a mem limit would be great. We 
could have a combined memory limit for your autoprepared statements as 
well as the manually prepared ones, so clients can know for sure the 
server processes won't eat up more than e.g. 800MB for prepared 
statements. And also I would like to have this value spread across all 
client processes, e.g. specifying max_prepared_statement_total_mem=5G 
for the server, and maybe max_prepared_statement_mem=1G for client 
processes. Of course we would have to implement cross client process 
invalidation here, and I don't know if communicating client processes 
are even intended.


Anyway, a memory limit won't really add that much more overhead. At 
least not more than having no prepared statements at all because of 
the fear of server OOMs, or having just a small count of those 
statements. I was even thinking about a prepared statement reaper that 
checks pg_prepared_statements every few minutes to clean things up 
manually, but having this in the server would be of great value to me.


Right now a memory context has no field containing the amount of currently 
used memory.
This is why the context->methods->stats function implementation has to 
traverse all blocks to calculate the size of memory used by a context.
It may not be so fast for large contexts. But I do not expect that the 
contexts of prepared statements will be very large, although 
I have dealt with customers who issued queries with a query text length 
larger than a few megabytes. I am afraid to estimate the size of the plan 
for such queries.
This is the reason for my concern

Re: tableam vs. TOAST

2019-08-06 Thread Andres Freund
Hi,

On 2019-08-05 15:36:59 -0400, Robert Haas wrote:
> On Fri, Aug 2, 2019 at 6:42 PM Andres Freund  wrote:
> > Hm. This leaves toast_insert_or_update() as a name. That makes it sound
> > like it's generic toast code, rather than heap specific?
> 
> I'll rename it to heap_toast_insert_or_update().  But I think I'll put
> that in 0004 with the other renames.

Makes sense.


> > It's definitely nice how a lot of repetitive code has been deduplicated.
> > Also makes it easier to see how algorithmically expensive
> > toast_insert_or_update() is :(.
> 
> Yep.
> 
> > Shouldn't this condition be the other way round?
> 
> I had to fight pretty hard to stop myself from tinkering with the
> algorithm -- this can clearly be done better, but I wanted to make it
> match the existing structure as far as possible. It also only needs to
> be tested once, not on every loop iteration, so if we're going to
> start changing things, we should go further than just swapping the
> order of the tests.  For now I prefer to draw a line in the sand and
> change nothing.

Makes sense.


> > Couldn't most of these be handled via colflags, instead of having
> > numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
> > TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
> > size check ought to boil down to a single mask test?
> 
> I'm not really seeing how more flags would significantly simplify this
> logic, but I might be missing something.

Well, right now you have a number of ifs for each attribute. If you
encoded all the parameters into flags, you could change that to a single
flag test - as far as I can tell, all the parameters could be
represented as that, if you moved MAIN etc into flags.  A single if for
flags (and then the size check) is cheaper than several separate checks.
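
Something along these lines (sketch only - TOASTCOL_* follows the names
suggested above, and the arrays stand in for whatever per-column state the
patch keeps):

    /* hypothetical per-column flag bits */
    #define TOASTCOL_COMPRESSED     0x0001
    #define TOASTCOL_IGNORED        0x0002
    #define TOASTCOL_MAIN           0x0004
    #define TOASTCOL_EXTERNAL       0x0008

    #define TOASTCOL_INELIGIBLE \
        (TOASTCOL_COMPRESSED | TOASTCOL_IGNORED | TOASTCOL_MAIN | TOASTCOL_EXTERNAL)

    /*
     * Find the biggest attribute still eligible for compression: one mask
     * test plus the size comparison per column.
     */
    static int
    find_biggest_compressible_attribute(int natts, const uint16 *toast_flags,
                                        const int32 *toast_sizes,
                                        int32 skip_below)
    {
        int         biggest_attno = -1;
        int32       biggest_size = skip_below;
        int         i;

        for (i = 0; i < natts; i++)
        {
            if ((toast_flags[i] & TOASTCOL_INELIGIBLE) != 0)
                continue;
            if (toast_sizes[i] > biggest_size)
            {
                biggest_attno = i;
                biggest_size = toast_sizes[i];
            }
        }

        return biggest_attno;
    }

The loop body stays the same when another exclusion reason is added; only the
mask definition changes.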


> > I'm quite unconvinced that making the chunk size specified by the AM is
> > a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
> > pg_control etc. It seems a bit dangerous to let AMs provide the size,
> > without being very clear that any change of the value will make data
> > inaccessible. It'd be different if the max were only used during
> > toasting.
> 
> I was actually thinking about proposing that we rip
> TOAST_MAX_CHUNK_SIZE out of pg_control.  No real effort has been made
> here to make this something that users could configure, and I don't
> know of a good reason to configure it. It also seems pretty out of
> place in a world where there are multiple AMs floating around -- why
> should heap, and only heap, be checked there? Granted it does have
> some pride of place, but still.

> > I think the *size* checks should be weakened so we check:
> > 1) After each chunk, whether the already assembled chunks exceed the
> >expected size.
> > 2) After all chunks have been collected, check that the size is exactly
> >what we expect.
> >
> > I don't think that removes a meaningful amount of error
> > checking. Missing tuples etc get detected by the chunk_ids not being
> > consecutive. The overall size is still verified.
> >
> > The obvious problem with this is the slice fetching logic. For slices
> > with an offset of 0, it's obviously trivial to implement. For the higher
> > slice logic, I'd assume we'd have to fetch the first slice by estimating
> > where the start chunk is based on the current suggest chunk size, and
> > restarting the scan earlier/later if not accurate.  In most cases it'll
> > be accurate, so we'd not loose efficiency.
> 
> I don't feel entirely convinced that there's any rush to do all of
> this right now, and the more I change the harder it is to make sure
> that I haven't broken anything.  How strongly do you feel about this
> stuff?

I think we either should leave the hardcoded limit in place, or make it
actually not fixed. Ripping-it-out-but-not-actually just seems like a
trap for the unwary, without much point.

Greetings,

Andres Freund




Re: pglz performance

2019-08-06 Thread Andres Freund
Hi,

On 2019-08-05 16:04:46 +0900, Michael Paquier wrote:
> On Fri, Aug 02, 2019 at 07:52:39PM +0200, Tomas Vondra wrote:
> > On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote:
> >> Why would they be stuck continuing to *compress* with pglz? As we
> >> fully retoast on write anyway we can just gradually switch over to the
> >> better algorithm. Decompression speed is another story, of course.
> > 
> > Hmmm, I don't remember the details of those patches so I didn't realize
> > it allows incremental recompression. If that's possible, that would mean
> > existing systems can start using it. Which is good.
> 
> It may become a problem on some platforms though (Windows?), so
> patches to improve either the compression or decompression of pglz are
> not that much crazy as they are still likely going to be used, and for
> read-mostly switching to a new algo may not be worth the extra cost so
> it is not like we are going to drop it completely either.

What's the platform dependency that you're thinking of? And how's
compression speed relevant to "read mostly"?  Switching would just
happen whenever tuple fields are changed. And it'll not have an
additional cost, because all it does is reduce the cost of a toast write
that'd otherwise happened with pglz.


> Linking to system libraries would make our maintenance much easier,
> and when it comes to have a copy of something else in the tree we
> would be stuck with more maintenance around it.  These tend to rot
> easily.

I don't think it's really our experience that they "rot easily".


> After that comes the case where the compression algo is not
> in the binary across one server to another, in which case we have an
> automatic ERROR in case of a mismatching algo, or FATAL for
> deompression of FPWs at recovery when wal_compression is used.

Huh? That's a failure case that only exists if you don't include it in
the tree (with the option to use an out-of-tree lib)?

Greetings,

Andres Freund




Re: pgbench - implement strict TPC-B benchmark

2019-08-06 Thread Fabien COELHO


Hello Robert,


Ok, one thread cannot feed an N core server if enough client are executed
per thread and the server has few things to do.


Right ... where N is, uh, TWO.


Yes, two indeed… For low-work cpu-bound load, given libpq & system 
overheads, you cannot really hope for a better deal.


I think that the documentation could be clearer about thread/core 
recommendations, i.e. how many resources you should allocate to pgbench
so that the server is more likely to be the bottleneck, in the "Good 
Practices" section.



The point I'm clumsily trying to make is that pgbench-specific overheads
are quite small: Any benchmark driver would have pretty much at least the
same costs, because you have the cpu cost of the tool itself, then the
library it uses, eg lib{pq,c}, then syscalls. Even if the first costs are
reduced to zero, you still have to deal with the database through the
system, and this part will be the same.


I'm not convinced. Perhaps you're right; after all, it's not like
pgbench is doing any real work.


Yep, pgbench is not doing much beyond going from one libpq call to the 
next. It can be improved, but the overhead is already reasonably low.


On the other hand, I've repeatedly been annoyed by how inefficient 
pgbench is, so I'm not totally prepared to concede that any benchmark 
driver would have the same costs, or that it's a reasonably 
well-optimized client application. When I run the pgbench, I want to 
know how fast the server is, not how fast pgbench is.


Hmmm. You cannot fully isolate components: You can only basically learn 
how fast the server is when accessed through a libpq connection, which is 
quite reasonable because this is what a user can expect anyway.


--
Fabien.

Re: The unused_oids script should have a reminder to use the 8000-8999 OID range

2019-08-06 Thread Peter Geoghegan
On Mon, Aug 5, 2019 at 11:13 PM Tom Lane  wrote:
> > Right. Besides, adding something along the lines Michael described
> > necessitates fixing the problems that it creates. We'll run out of
> > blocks of 5 contiguous OIDs (or whatever) far sooner than we'll run
> > out of single OIDs.
>
> Well, if we ever get even close to that situation, this whole approach
> isn't really gonna work.

My point was that I don't see any reason to draw the line at or after
what Michael suggested, but before handling the exhaustion of
available blocks of 5 contiguous OIDs in the range 8000-. It's
just busy work.

-- 
Peter Geoghegan




Re: Recent failures in IsolationCheck deadlock-hard

2019-08-06 Thread Tom Lane
Thomas Munro  writes:
> On Sat, Aug 3, 2019 at 2:11 AM Tom Lane  wrote:
>> Also worth noting is that anole failed its first try at the new
>> deadlock-parallel isolation test:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole=2019-08-01%2015%3A48%3A16

> And friarbird (also CLOBBER_CACHE_ALWAYS) fails every time.

Yeah, there have been half a dozen failures since deadlock-parallel
went in, mostly on critters that are slowed by CLOBBER_CACHE_ALWAYS
or valgrind.  I've tried repeatedly to reproduce that here, without
success :-(.  It's unclear whether the failures represent a real
code bug or just a problem in the test case, so I don't really want
to speculate about fixes till I can reproduce it.

regards, tom lane




Re: The unused_oids script should have a reminder to use the 8000-8999 OID range

2019-08-06 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Aug 5, 2019 at 10:41 PM Tom Lane  wrote:
>> But as long as the script
>> tells you how many OIDs are available, what's the problem?  Just run
>> it again if you want a different suggestion, or make your own choice.

> Right. Besides, adding something along the lines Michael described
> necessitates fixing the problems that it creates. We'll run out of
> blocks of 5 contiguous OIDs (or whatever) far sooner than we'll run
> out of single OIDs.

Well, if we ever get even close to that situation, this whole approach
isn't really gonna work.  My estimate is that in any one development
cycle we'll commit order-of-a-couple-dozen patches that consume new OIDs.
In that context you'd be just unlucky to get an OID suggestion that
doesn't have dozens to hundreds of free OIDs after it.  (If the rate
of manually-assigned-OID consumption were any faster than that, we'd
have filled up the 1-10K space long since.)

regards, tom lane




Re: Recent failures in IsolationCheck deadlock-hard

2019-08-06 Thread Thomas Munro
On Sat, Aug 3, 2019 at 2:11 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > There have been five failures on three animals like this, over the
> > past couple of months:
>
> Also worth noting is that anole failed its first try at the new
> deadlock-parallel isolation test:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole=2019-08-01%2015%3A48%3A16

And friarbird (also CLOBBER_CACHE_ALWAYS) fails every time.

  animal   |      snapshot       | branch | commit  | result  |   fail_stage   |     fail_tests
-----------+---------------------+--------+---------+---------+----------------+---------------------
 lousyjack | 2019-08-05 11:33:02 | HEAD   | a76cfba | FAILURE | IsolationCheck | {deadlock-parallel}
 gharial   | 2019-08-05 10:30:37 | HEAD   | a76cfba | FAILURE | IsolationCheck | {deadlock-parallel}
 friarbird | 2019-08-05 05:20:01 | HEAD   | 8548ddc | FAILURE | IsolationCheck | {deadlock-parallel}
 friarbird | 2019-08-04 05:20:02 | HEAD   | 69edf4f | FAILURE | IsolationCheck | {deadlock-parallel}
 hyrax     | 2019-08-03 12:20:57 | HEAD   | 2abd7ae | FAILURE | IsolationCheck | {deadlock-parallel}
 friarbird | 2019-08-03 05:20:01 | HEAD   | 2abd7ae | FAILURE | IsolationCheck | {deadlock-parallel}
 friarbird | 2019-08-02 05:20:00 | HEAD   | a9f301d | FAILURE | IsolationCheck | {deadlock-parallel}
 anole     | 2019-08-01 15:48:16 | HEAD   | da9456d | FAILURE | IsolationCheck | {deadlock-parallel}

 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack=2019-08-05%2011:33:02
 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial=2019-08-05%2010:30:37
 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird=2019-08-05%2005:20:01
 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird=2019-08-04%2005:20:02
 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax=2019-08-03%2012:20:57
 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird=2019-08-03%2005:20:01
 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird=2019-08-02%2005:20:00
 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole=2019-08-01%2015:48:16

 1
 step d2a1: <... completed>
-sum
-
-1
+error in steps d1c e1l d2a1: ERROR:  canceling statement due to user request
 step e1c: COMMIT;
-step d2c: COMMIT;
 step e2l: <... completed>
 lock_excl

 1
+step d2c: COMMIT;
 step e2c: COMMIT;

-- 
Thomas Munro
https://enterprisedb.com