Re: pgbench rate limiting changes transaction latency computation

2019-06-12 Thread Heikki Linnakangas

On 12/06/2019 09:23, Fabien COELHO wrote:

But that was just because I was showing a simplistic example. E.g.
here's a log of a vacuum finishing, and then another starting a few
seconds later (both vacuums lasting a fair while):

progress: 139.0 s, 2438.4 tps, txtime 13.033 ms stddev 3.830, lat 17530.784 ms stddev 590.153, lag 17517.751 ms
progress: 140.0 s, 2489.0 tps, txtime 12.911 ms stddev 3.642, lat 17752.862 ms stddev 600.661, lag 17739.952 ms
progress: 141.0 s, 2270.0 tps, txtime 14.021 ms stddev 4.965, lat 17973.805 ms stddev 594.784, lag 17959.784 ms
progress: 142.0 s, 1408.0 tps, txtime 22.848 ms stddev 5.365, lat 18417.808 ms stddev 632.729, lag 18394.960 ms
progress: 143.0 s, 3001.0 tps, txtime 10.724 ms stddev 4.318, lat 18796.971 ms stddev 617.462, lag 18786.247 ms
progress: 144.0 s, 4678.0 tps, txtime 6.823 ms stddev 2.136, lat 18503.253 ms stddev 669.072, lag 18496.431 ms
progress: 145.0 s, 4577.0 tps, txtime 7.001 ms stddev 1.526, lat 18108.596 ms stddev 689.843, lag 18101.596 ms
progress: 146.0 s, 2596.0 tps, txtime 12.261 ms stddev 3.060, lat 17961.623 ms stddev 683.498, lag 17949.363 ms
progress: 147.0 s, 2654.0 tps, txtime 12.072 ms stddev 3.282, lat 18120.009 ms stddev 685.074, lag 18107.938 ms
progress: 148.0 s, 3471.0 tps, txtime 9.240 ms stddev 3.702, lat 18251.712 ms stddev 676.572, lag 18242.472 ms
progress: 149.0 s, 3056.0 tps, txtime 10.468 ms stddev 5.131, lat 18058.950 ms stddev 675.334, lag 18048.482 ms
progress: 150.0 s, 2319.0 tps, txtime 13.778 ms stddev 3.762, lat 18305.101 ms stddev 688.186, lag 18291.323 ms
progress: 151.0 s, 2355.0 tps, txtime 13.567 ms stddev 3.891, lat 18586.073 ms stddev 691.656, lag 18572.506 ms
progress: 152.0 s, 2321.0 tps, txtime 13.742 ms stddev 3.708, lat 18835.985 ms stddev 709.580, lag 18822.244 ms
progress: 153.0 s, 2360.0 tps, txtime 13.604 ms stddev 3.533, lat 19121.166 ms stddev 709.682, lag 19107.562 ms

The period in between, where no vacuum was running, is IMO considerably
harder to spot looking at 'lat'.


ISTM that the signal is pretty clear in whether the lag increases or
decreases. Basically the database is 18 seconds behind its load, which is
very bad if a user is waiting.


That was my thought too, when looking at this example. When there is 
already a long queue of transactions, the txtime of individual 
transactions doesn't matter much. The most important thing under that 
condition is how fast the system can dissolve the queue (or how fast it 
builds up even more). So the derivative of the lag or lat seems like the 
most important figure. We don't print exactly that, but it's roughly the 
same as the TPS. Jitter experienced by the user matters too, i.e. stddev 
of 'lat'.
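
To make "the derivative of the lag" concrete, here is a tiny illustrative C sketch (not pgbench code; the struct is invented) that computes the per-second change of 'lat' between two of the progress samples above. A positive delta means the backlog is still growing, a negative one means it is draining:

#include <stdio.h>

/* A pgbench --progress sample, reduced to the fields used here. */
typedef struct ProgressSample
{
	double	time_s;		/* "progress: 139.0 s" */
	double	tps;
	double	lat_ms;		/* average latency including schedule lag */
} ProgressSample;

/* Change of 'lat' per second of wall-clock time between two samples. */
static double
lat_derivative(const ProgressSample *prev, const ProgressSample *cur)
{
	return (cur->lat_ms - prev->lat_ms) / (cur->time_s - prev->time_s);
}

int
main(void)
{
	ProgressSample s139 = {139.0, 2438.4, 17530.784};
	ProgressSample s143 = {143.0, 3001.0, 18796.971};
	ProgressSample s144 = {144.0, 4678.0, 18503.253};

	printf("139 -> 143: %+.1f ms/s (queue growing)\n",
		   lat_derivative(&s139, &s143));
	printf("143 -> 144: %+.1f ms/s (queue draining)\n",
		   lat_derivative(&s143, &s144));
	return 0;
}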


To illustrate this, imagine that the server magically detected that 
there's a long queue of transactions. It would be beneficial to go into 
"batch mode", where it collects incoming transactions into larger 
batches. The effect of this imaginary batch mode is that the TPS rate 
increases by 50%, but the txtime also increases by 1000%, and becomes 
highly variable. Would that be a good tradeoff? I would say yes. The 
user is experiencing an 18 s delay anyway, and the increase in txtime 
would be insignificant compared to that, but the queue would be busted 
more quickly.


Of course, there is no such batch mode in PostgreSQL, and I wouldn't 
suggest trying to implement anything like that. In a different kind of 
application, you would rather maintain a steady txtime when the server 
is at full load, even if it means a lower overall TPS rate. And that 
feels like a more important goal than just TPS. I think we all agree on 
that. To simulate that kind of an application, though, you probably 
don't want to use -R, or you would use it with --latency-limit. Except 
clearly Andres is trying to do just that, which is why I'm still a bit 
confused :-).


- Heikki




Re: Adaptive query optimization

2019-06-12 Thread legrand legrand
>>I tried to create a much simpler version of AQO based on the auto_explain
>>extension.
>>This extension provides all the necessary infrastructure to analyze
>>statements with long execution time.
>>I have added two new modes to auto_explain:
>>1. Auto generation of multicolumn statistics for variables used in
>>clauses with large estimation error.

>Interesting! I probably wouldn't consider this part of adaptive query
>optimization, but it probably makes sense to make it part of this. I
>wonder if we might improve this to also suggest "missing" indexes? 

Shouldn't this be extended to adjust the default_statistics_target
configuration variable, or on a column-by-column basis by setting the
per-column statistics target with ALTER TABLE ... ALTER COLUMN ... SET
STATISTICS?

Regards
PAscal







Re: [PATCH] Speedup truncates of relation forks

2019-06-12 Thread Masahiko Sawada
On Wed, Jun 12, 2019 at 12:25 PM Tsunakawa, Takayuki
 wrote:
>
> From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> > Years ago I've implemented an optimization for many DROP TABLE commands
> > in a single transaction - instead of scanning buffers for each relation,
> > the code now accumulates a small number of relations into an array, and
> > then does a bsearch for each buffer.
> >
> > Would something like that be applicable/useful here? That is, if we do
> > multiple TRUNCATE commands in a single transaction, can we optimize it
> > like this?
>
> Unfortunately not.  VACUUM and autovacuum handle each table in a different
> transaction.

We also do RelationTruncate() when we truncate heaps that are created
in the current transaction or have new relfilenodes assigned in the current
transaction. So I think there is room for the optimization Tomas
suggested, although I'm not sure it's a popular use case.

I've not looked at this patch deeply, but in DropRelFileNodeBuffers I
think we can take the minimum of all firstDelBlock values and use it as the
lower bound of the block numbers we're interested in. That way we can
skip checking the array while scanning the buffer pool.
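
A rough sketch of that idea (assuming the patched DropRelFileNodeBuffers() receives the fork numbers and their firstDelBlock values as parallel arrays; variable names only loosely follow bufmgr.c):

BlockNumber	minBlock = InvalidBlockNumber;

/* Lower bound over all forks being truncated. */
for (int i = 0; i < nforks; i++)
{
	if (minBlock == InvalidBlockNumber || firstDelBlock[i] < minBlock)
		minBlock = firstDelBlock[i];
}

for (int i = 0; i < NBuffers; i++)
{
	BufferDesc *bufHdr = GetBufferDescriptor(i);

	/* Cheap pre-check: buffers of other relations, or of blocks below the
	 * smallest truncation point, can be skipped without consulting the
	 * per-fork array at all. */
	if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) ||
		bufHdr->tag.blockNum < minBlock)
		continue;

	/* ... otherwise do the usual locked check against forkNum[]/firstDelBlock[] ... */
}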

-extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber *forknum, bool isRedo, int nforks);
-extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks);
+extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks);

Don't we use each element of nblocks for each fork? That is, each
fork uses the element at its fork number in the nblocks array, setting
InvalidBlockNumber for invalid slots, instead of passing the valid
number of elements. That way the following code, which exists in many places,

blocks[nforks] = visibilitymap_mark_truncate(rel, nblocks);
if (BlockNumberIsValid(blocks[nforks]))
{
    forks[nforks] = VISIBILITYMAP_FORKNUM;
    nforks++;
}

would become

blocks[VISIBILITYMAP_FORKNUM] = visibilitymap_mark_truncate(rel, nblocks);
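
In other words (a hypothetical sketch of that convention, not the posted patch), the caller would fill one fixed-size array and the callee would simply skip the invalid slots:

BlockNumber	blocks[MAX_FORKNUM + 1];

for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
	blocks[forknum] = InvalidBlockNumber;	/* "leave this fork alone" */

blocks[MAIN_FORKNUM] = nblocks;
blocks[VISIBILITYMAP_FORKNUM] = visibilitymap_mark_truncate(rel, nblocks);

/* callee side: the array length is fixed, so no separate nforks argument */
for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
	if (!BlockNumberIsValid(blocks[forknum]))
		continue;
	/* ... truncate fork 'forknum' of the relation to blocks[forknum] blocks ... */
}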

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: BEFORE UPDATE trigger on postgres_fdw table not work

2019-06-12 Thread Etsuro Fujita
Amit-san,

On Wed, Jun 12, 2019 at 3:33 PM Amit Langote  wrote:
> On Wed, Jun 12, 2019 at 3:14 PM Etsuro Fujita  wrote:
> > * Reworded the comments a bit in postgresPlanForeignModify that the
> > original patch modified
>
> + * statement, and for UPDATE if BEFORE ROW UPDATE triggers since those
> + * triggers might change values for non-target columns, in which case we
>
> First line seems to be missing a word or two.  Maybe:
>
> + * statement, and for UPDATE if there are BEFORE ROW UPDATE triggers,
> + * since those triggers might change values for non-target columns, in

Actually, I omitted such words to shorten the comment, but I think
this improves the readability, so I'll update the comment that way.

Thanks for the review!

Best regards,
Etsuro Fujita




Re: fix psql \conninfo & \connect when using hostaddr

2019-06-12 Thread Noah Misch
On Tue, Jun 11, 2019 at 01:59:20PM +0200, Dmitry Dolgov wrote:
> > On Tue, Jun 11, 2019 at 6:46 AM Fabien COELHO  wrote:

> > > Although I guess it can be avoided by `-reuse-previous=off`, probably it
> > > makes sense to update the docs.
> >
> > Yep, that is one option. The other is to revert or alter the subtle
> > change, but ISTM that it made sense in some use case, so I wanted some
> > time to think about it and test.
> 
> Sure, no one argues that the behaviour should be changed; it's only about the
> documentation part.

No, I was arguing that a behavior should revert back to its v11 behavior:

\connect mydb myuser myhost
-- should resolve "myhost" again, like it did in v11
\connect

\connect "dbname=mydb host=myhost hostaddr=127.0.0.1"
-- ok to reuse hostaddr=127.0.0.1; I agree that's a feature
\connect




Re: fix psql \conninfo & \connect when using hostaddr

2019-06-12 Thread Dmitry Dolgov
> On Wed, Jun 12, 2019 at 9:45 AM Noah Misch  wrote:
>
> On Tue, Jun 11, 2019 at 01:59:20PM +0200, Dmitry Dolgov wrote:
> > > On Tue, Jun 11, 2019 at 6:46 AM Fabien COELHO  wrote:
>
> > > > Although I guess it can be avoided by `-reuse-previous=off`, probably it
> > > > makes sense to update the docs.
> > >
> > > Yep, that is one option. The other is to revert or alter the subtle
> > > change, but ISTM that it made sense in some use case, so I wanted some
> > > time to think about it and test.
> >
> > Sure, no one argues that the behaviour should be changed; it's only about the
> > documentation part.
>
> No, I was arguing that a behavior should revert back to its v11 behavior:
>
> \connect mydb myuser myhost
> -- should resolve "myhost" again, like it did in v11
> \connect
>
> \connect "dbname=mydb host=myhost hostaddr=127.0.0.1"
> -- ok to reuse hostaddr=127.0.0.1; I agree that's a feature
> \connect

Oh, ok, sorry for misunderstanding.





Re: openssl valgrind failures on skink are due to openssl issue

2019-06-12 Thread Michael Paquier
On Tue, Jun 11, 2019 at 02:07:29PM -0700, Andres Freund wrote:
> On 2019-06-11 16:55:28 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> I can't think of a better way to fix skink for now than just disabling
>>> openssl for skink, until 1.1.1d is released.

Thanks for digging into the details of that!  I was wondering if we
did something wrong on our side but the backtraces were weird.
--
Michael




set parameter for all existing session

2019-06-12 Thread alex lock
I checked the "ALTER DATABASE", "ALTER ROLE" and "SET" commands, but none of
them can set the parameters for all the existing sessions. Do we have a
way to do that? It looks like the "assign_hook" can be used to customize this;
is that the right way to do it?


Re: GiST limits on contrib/cube with dimension > 100?

2019-06-12 Thread Siarhei Siniak
I've added debug prints to cube extension. g_custom_cube_a_f8 and 
g_custom_cube_picksplit are the only called methods.
After that it prints:
 ERROR:  failed to add item to index page in "lc_idx" 
Cube extension modifications:
    #define MAX_DIM (512)
Python test source code has been attached to the letter.

P.S.
sorry for the previous letter, didn't configure plain text composition

import psycopg2
import logging
import numpy
import unittest


import python.utils.logging
import python.custom_db.backends
import python.custom_db.backends.postgresql


class TestPostgresql(unittest.TestCase):
    def test_gist(self):
        b = python.custom_db.backends.postgresql.Postgresql(
            databases=dict(
                test=dict(
                    minconn=1,
                    maxconn=1
                )
            )
        )

        b.connect()

        try:
            c = b.get_connection(use='test')

            c2 = c[0]

            with c2.cursor() as cur:
                cur.execute(r'''
                    drop table if exists test;
                    create table test(image_id integer primary key, latent_code custom_cube);
                    create index lc_idx on test using gist(latent_code);
                ''')
                c2.commit()

                with self.assertRaises(psycopg2.errors.InternalError_):
                    for k in range(10):
                        logging.info('test_postgresql.test_gist, k = %d' % k)
                        cur.execute(
                            r'''
                            insert into test (image_id, latent_code)
                            values (%s, custom_cube(%s))
                            ''',
                            [
                                k,
                                [float(x) for x in numpy.random.uniform(0, 1, 512)],
                            ]
                        )
                        c2.commit()
        finally:
            b.put_connection(c2, 'test')


Re: pg_basebackup failure after setting default_table_access_method option

2019-06-12 Thread Michael Paquier
On Mon, Jun 10, 2019 at 11:49:03PM -0700, Andres Freund wrote:
> Well, all four. Given it's just copied code I don't see much point in
> splitting the commit anymore.

Thanks for pushing the fix, the result looks fine.

> I noticed some other ugliness: check_timezone calls interval_in(),
> without any checks. Not a huge fan of doing all that in postmaster, even
> leaving the wrong error reporting aside :(.  But that seems like a
> plenty different enough issue to fix it separately, if we decide we want
> to do so.

Indeed, I have not noticed this one :(
--
Michael




Re: fix psql \conninfo & \connect when using hostaddr

2019-06-12 Thread Fabien COELHO



Hello Noah,


Although I guess it can be avoided by `-reuse-previous=off`, probably it
makes sense to update the docs.


Yep, that is one option. The other is to revert or alter the subtle
change, but ISTM that it made sense in some use case, so I wanted some
time to think about it and test.


Sure, no one argues that the behaviour should be changed; it's only about the
documentation part.


No, I was arguing that a behavior should revert back to its v11 behavior:


I got that. I'm working on it, and on the other issues you raised.

The issue I see is what do we want when a name resolves to multiple 
addresses. The answer is not fully obvious to me right now. I'll try to 
send a patch over the week-end.



\connect mydb myuser myhost
-- should resolve "myhost" again, like it did in v11
\connect

\connect "dbname=mydb host=myhost hostaddr=127.0.0.1"
-- ok to reuse hostaddr=127.0.0.1; I agree that's a feature
\connect


--
Fabien.




Re: set parameter for all existing session

2019-06-12 Thread Pavel Stehule
Hi

On Wed, 12 Jun 2019 at 9:58, alex lock wrote:

> I checked the "ALTER DATABASE", "ALTER ROLE" and "SET" commands, but none of
> them can set the parameters for all the existing sessions. Do we have a
> way to do that? It looks like the "assign_hook" can be used to customize this;
> is that the right way to do it?
>
>
Maybe you missed calling pg_reload_conf();

example:

alter system set work_mem to '10MB';
select pg_reload_conf();

in other session you can:

show work_mem;

Regards

Pavel


RE: [PATCH] Speedup truncates of relation forks

2019-06-12 Thread Jamison, Kirk
On Tuesday, June 11, 2019 7:23 PM (GMT+9), Adrien Nayrat wrote:

> > Attached is a patch to speed up the performance of truncates of relations.
> 
> Thanks for working on this!

Thank you also for taking a look at my thread. 

> > If you want to test with large number of relations,
> > you may use the stored functions I used here:
> > http://bit.ly/reltruncates
> 
> You should post these functions in this thread for the archives ;)
This is noted. Pasting it below: 

create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
query_string := 'create table tab_' || i::text || ' (a int);';
execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function delfrom_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
query_string := 'delete from tab_' || i::text;
execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function insert_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
query_string := 'insert into tab_' || i::text || ' VALUES (5);' ;
execute query_string;
  end loop;
end;
$$ language plpgsql;


> From a user POV, the main issue with relation truncation is that it can block
> queries on the standby server during truncation replay.
> 
> It could be interesting if you can test this case and give results of your
> patch.
> Maybe by performing read queries on standby server and counting wait_event
> with pg_wait_sampling?

Thanks for the suggestion. I tried using the pg_wait_sampling extension,
but I wasn't sure that I could replicate the problem of blocked queries on
the standby server. Could you advise?
Here's what I did for now, similar to my previous test with hot standby setup,
but with additional read queries of wait events on standby server.

128MB shared_buffers
SELECT create_tables(1);
SELECT insert_tables(1);
SELECT delfrom_tables(1);

[Before VACUUM]
Standby: SELECT the following view from pg_stat_waitaccum

wait_event_type |   wait_event| calls | microsec
-+-+---+--
 Client  | ClientRead  | 2 | 20887759
 IO  | DataFileRead|   175 | 2788
 IO  | RelationMapRead | 4 |   26
 IO  | SLRURead| 2 |   38

Primary: Execute VACUUM (induces relation truncates)

[After VACUUM]
Standby:
 wait_event_type |   wait_event| calls | microsec
-+-+---+--
 Client  | ClientRead  | 7 | 77662067
 IO  | DataFileRead|   284 | 4523
 IO  | RelationMapRead |10 |   51
 IO  | SLRURead| 3 |   57

Regards,
Kirk Jamison


Re: set parameter for all existing session

2019-06-12 Thread alex lock
On Wed, Jun 12, 2019 at 4:25 PM Pavel Stehule 
wrote:

> Hi
>
> On Wed, 12 Jun 2019 at 9:58, alex lock wrote:
>
>> I checked the "ALTER DATABASE", "ALTER ROLE" and "SET" commands, but none of
>> them can set the parameters for all the existing sessions. Do we have a
>> way to do that? It looks like the "assign_hook" can be used to customize this;
>> is that the right way to do it?
>>
>>
> Maybe you miss to call pg_reload_conf();
>
> example:
>
> alter system set work_mem to '10MB';
> select pg_reload_conf();
>

Thanks,  it works!

>
> in other session you can:
>
> show work_mem;
>
> Regards
>
> Pavel
>


[PATCH] vacuumlo: print the number of large objects going to be removed

2019-06-12 Thread Timur Birsh
Hello,

If tables have a lot of rows with large objects (>1,000,000) that are
removed throughout the day, it would be useful to know how many
LOs are going to be removed.

The first patch prints the number of large objects going to be removed;
the second patch prints the percentage of LOs removed so far.

Could anyone please review?

Please cc, I am not subscribed to the list.

Regards,
Timur

From 8465774c33b200c1a531465acaef85d2d261bb26 Mon Sep 17 00:00:00 2001
From: Timur Birsh 
Date: Wed, 12 Jun 2019 04:13:29 +
Subject: [PATCH 1/2] vacuumlo: print the number of large objects going to be
 removed

If tables has a lot of rows with large objects (>1_000_000) that
removed throughout the day, it would be useful to know how many
LOs going to be removed.
---
 contrib/vacuumlo/vacuumlo.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c
index 73c06a043e..beade1c9c0 100644
--- a/contrib/vacuumlo/vacuumlo.c
+++ b/contrib/vacuumlo/vacuumlo.c
@@ -64,6 +64,7 @@ vacuumlo(const char *database, const struct _param *param)
 	PGresult   *res,
 			   *res2;
 	char		buf[BUFSIZE];
+	long		to_delete = 0;
 	long		matched;
 	long		deleted;
 	int			i;
@@ -276,6 +277,25 @@ vacuumlo(const char *database, const struct _param *param)
 	}
 	PQclear(res);
 
+	if (param->verbose)
+	{
+		snprintf(buf, BUFSIZE, "SELECT count(*) FROM vacuum_l");
+		res = PQexec(conn, buf);
+		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		{
+			fprintf(stderr, "Failed to get number of large objects "
+	"going to be removed:\n");
+			fprintf(stderr, "%s", PQerrorMessage(conn));
+			PQclear(res);
+			PQfinish(conn);
+			return -1;
+		}
+		to_delete = strtol(PQgetvalue(res, 0, 0), NULL, 10);
+		PQclear(res);
+		fprintf(stdout, "%ld large objects will be removed\n",
+to_delete);
+	}
+
 	/*
 	 * Now, those entries remaining in vacuum_l are orphans.  Delete 'em.
 	 *
-- 
2.17.1

From 4be55b5b5e566faad441824910b4e49c6b5f5879 Mon Sep 17 00:00:00 2001
From: Timur Birsh 
Date: Wed, 12 Jun 2019 05:56:44 +
Subject: [PATCH 2/2] vacuumlo: print how many LOs removed in percent

---
 contrib/vacuumlo/vacuumlo.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c
index beade1c9c0..5d629264f1 100644
--- a/contrib/vacuumlo/vacuumlo.c
+++ b/contrib/vacuumlo/vacuumlo.c
@@ -359,7 +359,8 @@ vacuumlo(const char *database, const struct _param *param)
 
 			if (param->verbose)
 			{
-fprintf(stdout, "\rRemoving lo %6u   ", lo);
+fprintf(stdout, "\rRemoving lo %6u\t(%3.f%%)", lo,
+		((float) deleted / (float) to_delete) * 100);
 fflush(stdout);
 			}
 
-- 
2.17.1



Re: GiST limits on contrib/cube with dimension > 100?

2019-06-12 Thread Andrey Borodin
Hi!

> On 9 June 2019, at 23:05, Siarhei Siniak wrote:
> 
> I've been using cube extension recompiled with
> #define MAX_DIM 256.
> 
> But with a version 11.3 I'm getting the following error:
> failed to add item to index page in 

So, you have changed the source code (removing the dimension constraint) and got
a cryptic error after that. In some way this is expected...

Though, the reason why cube has this limit is not physical. R-trees
(cube+GiST) just do not work effectively with more than 10 non-correlated
dimensions.
If you have some correlated dimensions, you probably should invent something
more clever than just cube, which is a plain array of 2*D doubles for each tuple.

100 is an upper bound for sane data that can be indexed in the case of cube...

Nevertheless, we can improve the AddTuple messages. But there are no such strict
guidelines as with B-tree. Probably, tuples should not be bigger than half of
the usable page space.
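
To give a sense of the sizes involved (a back-of-the-envelope sketch; it assumes cube's usual layout of roughly a small header plus one array of dim doubles for a point cube, or two arrays for a general cube):

#include <stdio.h>

int
main(void)
{
	int		dim = 512;
	size_t	header = 8;		/* varlena length word + dimension/point flag, roughly */
	size_t	point_cube = header + (size_t) dim * sizeof(double);
	size_t	general_cube = header + 2 * (size_t) dim * sizeof(double);

	/* PostgreSQL's default block size is 8192 bytes. */
	printf("512-D point cube ~%zu bytes, general cube ~%zu bytes, page = 8192\n",
		   point_cube, general_cube);
	return 0;
}

A general 512-dimensional cube is already larger than a whole 8 kB page, and even a point cube exceeds the half-a-page rule of thumb above, which is consistent with the "failed to add item to index page" error.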

Best regards, Andrey Borodin.



RE: BEFORE UPDATE trigger on postgres_fdw table not work

2019-06-12 Thread shohei.mochizuki
Fujita-san,
> On Mon, Jun 10, 2019 at 9:04 PM Etsuro Fujita 
> wrote:
> > I'll look into the patch more closely tomorrow.
> 
> I did that, but couldn't find any issue about the patch.  Here is an updated
> version of the patch.  Changes are:
> 
> * Reworded the comments a bit in postgresPlanForeignModify that the original
> patch modified
> * Added the commit message

Thanks for the update.

I think your wording is more understandable than my original patch.

Regards,

-- 
Shohei Mochizuki
TOSHIBA CORPORATION


Re: GiST limits on contrib/cube with dimension > 100?

2019-06-12 Thread Siarhei Siniak
A uniform set of 128-dimensional points of type cube, 50 ** 3 points in total,
can be queried for a nearest neighbor at a speed of 10 queries per second with
the limit varying from 1 to 25.
It works better than when no index is used at all, so GiST does give a speedup here.
The documentation of PostgreSQL doesn't mention the complexity of such an index,
and I got confused as to its speed.
Does PostgreSQL allow for approximate nearest neighbor search?
 https://github.com/erikbern/ann-benchmarks has a lot of efficient 
implementations.

Re: check_recovery_target_lsn() does a PG_CATCH without a throw

2019-06-12 Thread Peter Eisentraut
On 2019-06-11 08:11, Andres Freund wrote:
> While working on fixing [1] I noticed that 2dedf4d9a899 "Integrate
> recovery.conf into postgresql.conf" added two non-rethrowing PG_CATCH
> uses. That's not OK.

Right.  Here is a patch that addresses this by copying the relevant code
from pg_lsn_in() and timestamptz_in() directly into the check hooks.
It's obviously a bit unfortunate not to be able to share that code, but
it's not actually that much.

I haven't figured out the time zone issue yet, but I guess the solution
might involve moving some of the code from check_recovery_target_time()
to assign_recovery_target_time().
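
For readers who don't want to open the patch, the shape of the fix is roughly the following sketch (not the committed code; parse_lsn_no_error() is a hypothetical stand-in for the parsing logic copied from pg_lsn_in()): the check hook validates the string with non-throwing code and reports problems via GUC_check_errdetail(), so no PG_TRY/PG_CATCH is needed at all.

#include "postgres.h"
#include "access/xlogdefs.h"
#include "utils/guc.h"

/* hypothetical non-throwing parser standing in for pg_lsn_in()'s guts */
static bool parse_lsn_no_error(const char *str, XLogRecPtr *lsn);

static bool
check_recovery_target_lsn(char **newval, void **extra, GucSource source)
{
	XLogRecPtr	lsn;

	if (**newval == '\0')
		return true;			/* an empty setting is always acceptable */

	if (!parse_lsn_no_error(*newval, &lsn))
	{
		GUC_check_errdetail("\"%s\" is not a valid WAL location.", *newval);
		return false;
	}

	/* the real hook also passes the parsed value on via *extra */
	return true;
}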

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 01f26abca6442b80a03359b9a6057610fd2068a4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 12 Jun 2019 11:29:53 +0200
Subject: [PATCH 1/2] Remove explicit error handling for obsolete date/time
 values

The date/time values 'current', 'invalid', and 'undefined' were
removed a long time ago, but the code still contains explicit error
handling for the transition.  To simplify the code and avoid having to
handle these values everywhere, just remove the recognition of these
tokens altogether now.
---
 src/backend/utils/adt/date.c   |  8 
 src/backend/utils/adt/datetime.c   | 20 
 src/backend/utils/adt/timestamp.c  | 22 --
 src/include/utils/datetime.h   |  2 --
 src/interfaces/ecpg/pgtypeslib/dt.h|  2 --
 src/interfaces/ecpg/pgtypeslib/dt_common.c |  5 -
 src/interfaces/ecpg/pgtypeslib/timestamp.c |  4 
 src/test/regress/expected/date.out |  3 ---
 src/test/regress/expected/timestamp.out| 13 -
 src/test/regress/expected/timestamptz.out  | 13 -
 src/test/regress/sql/date.sql  |  1 -
 src/test/regress/sql/timestamp.sql |  4 
 src/test/regress/sql/timestamptz.sql   |  4 
 13 files changed, 101 deletions(-)

diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 1ff3cfea8b..e440a4fedd 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -138,14 +138,6 @@ date_in(PG_FUNCTION_ARGS)
case DTK_DATE:
break;
 
-   case DTK_CURRENT:
-   ereport(ERROR,
-   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-errmsg("date/time value \"current\" is 
no longer supported")));
-
-   GetCurrentDateTime(tm);
-   break;
-
case DTK_EPOCH:
GetEpochTime(tm);
break;
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 9def318c57..e9add385ba 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -99,7 +99,6 @@ static const datetkn datetktbl[] = {
{"aug", MONTH, 8},
{"august", MONTH, 8},
{DB_C, ADBC, BC},   /* "bc" for years <= 0 */
-   {DCURRENT, RESERV, DTK_CURRENT},/* "current" is always now */
{"d", UNITS, DTK_DAY},  /* "day of month" for ISO input */
{"dec", MONTH, 12},
{"december", MONTH, 12},
@@ -113,7 +112,6 @@ static const datetkn datetktbl[] = {
{"friday", DOW, 5},
{"h", UNITS, DTK_HOUR}, /* "hour" */
{LATE, RESERV, DTK_LATE},   /* "infinity" reserved for "late time" 
*/
-   {INVALID, RESERV, DTK_INVALID}, /* "invalid" reserved for bad time */
{"isodow", UNITS, DTK_ISODOW},  /* ISO day of week, Sunday == 7 */
{"isoyear", UNITS, DTK_ISOYEAR},/* year in terms of the ISO 
week date */
{"j", UNITS, DTK_JULIAN},
@@ -157,7 +155,6 @@ static const datetkn datetktbl[] = {
{"tue", DOW, 2},
{"tues", DOW, 2},
{"tuesday", DOW, 2},
-   {"undefined", RESERV, DTK_INVALID}, /* pre-v6.1 invalid time */
{"wed", DOW, 3},
{"wednesday", DOW, 3},
{"weds", DOW, 3},
@@ -191,7 +188,6 @@ static const datetkn deltatktbl[] = {
{"hours", UNITS, DTK_HOUR}, /* "hours" relative */
{"hr", UNITS, DTK_HOUR},/* "hour" relative */
{"hrs", UNITS, DTK_HOUR},   /* "hours" relative */
-   {INVALID, RESERV, DTK_INVALID}, /* reserved for invalid time */
{"m", UNITS, DTK_MINUTE},   /* "minute" relative */
{"microsecon", UNITS, DTK_MICROSEC},/* "microsecond" relative */
{"mil", UNITS, DTK_MILLENNIUM}, /* "millennium" relative */
@@ -222,7 +218,6 @@ static const datetkn deltatktbl[] = {
{DTIMEZONE, UNITS, DTK_TZ}, /* "timezone" time offset */
{"timezone_h", UNITS, DTK_TZ_HOUR}, /* timezone hour units */
{"timezone_m", UNITS, DTK_TZ_MINUTE},   /* timezone minutes units */
-   {"undefine

Re: Adaptive query optimization

2019-06-12 Thread Konstantin Knizhnik



On 12.06.2019 0:43, Tomas Vondra wrote:



I don't think "learning phase" is an issue, in fact I think that's
something we need to do - it ensures we have enough data to make good
decisions.

What is wrong with the learning phase is that it requires some DBA
assistance: somebody should determine when to start learning,
provide a relevant workload and determine when learning is finished.
One of the most recent trends in DBMSes is autonomous databases with
zero administration effort.
It is especially important for clouds. And one of the main advantages of
AQO is that it allows optimizing queries without human interaction.


But unfortunately I really do not know how to avoid the learning phase,
especially if we want to run queries on a replica.



2. It saves collected data in Postgres tables, which makes a read-only
transaction executing only queries become a read-write transaction,
obtaining an XID...


Yeah, that's an issue because it makes it useless on standbys etc. I
think it'd be enough to do something similar to pg_stat_statements, i.e.
store it in memory and flush it to disk once in a while.


This is why my AQO implementation stores data in a file.

3. It doesn't take into account concrete values of literals used in
clauses, so it is not able to address data skew.


Yep. I don't think it's necessarily an issue with all approaches to
adaptive optimization, though. But I agree we should detect both
systemic estimation issues, and misestimates for particular parameter
values. I think that's doable.

4. Machine learning  can be quite expensive and seems to be overkill 
if we want just to adjust selectivities according to actual number of 
affected rows.




I think that depends - some machine learning approaches are not that
bad. But I think there's a more serious issue - explainability. We need
a solution where we can explain/justify why it makes some decisions. I
really don't want a black box that produces numbers that you just need
to take at face value.

The good thing is that the simpler the method, the less expensive and
more explainable it is.

I tried to create a much simpler version of AQO based on the auto_explain
extension.
This extension provides all the necessary infrastructure to analyze
statements with long execution time.

I have added two new modes to auto_explain:
1. Auto generation of multicolumn statistics for variables used in
clauses with large estimation error.


Interesting! I probably wouldn't consider this part of adaptive query
optimization, but it probably makes sense to make it part of this. I
wonder if we might improve this to also suggest "missing" indexes?


I think that it should be the next step of adaptive query optimization:
- autogeneration of indexes
- auto adjustment of optimizer cost parameters (CPU cost,
random/sequential page access cost, ...)


There is already the hypothetical index extension hypopg
(https://github.com/HypoPG/hypopg),
which can be used to estimate the effect of introducing new indexes.



Right. But I think I might have an idea how to address (some of) this.

As I already mentioned, I was experimenting with something similar,
maybe two or three years ago (I remember chatting about it with Teodor
at pgcon last week). I was facing the same issues, and my approach was
based on hooks too.

But my idea was to not to track stats for a plan as a whole, but instead
decompose it into individual nodes, categoried into three basic groups -
scans, joins and aggregations. And then use this extracted information
to other plans, with "matching" nodes.

For example, let's consider a simple "single-scan" query

   SELECT * FROM t1 WHERE a = ? AND b = ? AND c < ?;

Now, if you execute this enough times (say, 100x or 1000x), tracking
the estimates and actual row counts, you may then compute the average
misestimate (maybe a geometric mean would be more appropriate here?):

   AVG(actual/estimate)


Certainly stats should be collected for each plan node, not for the
whole plan, and that is done now in both Oleg's and my implementation.
Oleg is using a gradient descent method. I first tried to calculate an
average, but then found out that building something like a "histogram",
where the bin is determined as log10 of the estimated number of rows,
works better.
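
To make the two variants concrete, here is an illustrative C sketch (not taken from either implementation): a per-node correction coefficient computed as a geometric mean of actual/estimate, plus the log10-based binning:

#include <math.h>

/* Accumulated estimation feedback for one plan-node "class", e.g. a scan of
 * t1 with a particular set of quals. */
typedef struct NodeFeedback
{
	double	log_ratio_sum;	/* sum of ln(actual / estimate) */
	int		nsamples;
} NodeFeedback;

static void
feedback_add(NodeFeedback *fb, double estimated_rows, double actual_rows)
{
	if (estimated_rows > 0 && actual_rows > 0)
	{
		fb->log_ratio_sum += log(actual_rows / estimated_rows);
		fb->nsamples++;
	}
}

/* Geometric mean of actual/estimate: ~1.0 means estimates are fine; a value
 * far from 1.0 indicates a systemic misestimate and can be used as a
 * correction coefficient for this node class. */
static double
feedback_correction(const NodeFeedback *fb)
{
	return (fb->nsamples > 0) ? exp(fb->log_ratio_sum / fb->nsamples) : 1.0;
}

/* The "histogram" variant: keep one NodeFeedback per order of magnitude of
 * the estimated row count, so small and large scans are corrected separately. */
static int
feedback_bin(double estimated_rows)
{
	return (int) floor(log10(estimated_rows < 1.0 ? 1.0 : estimated_rows));
}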



and if this is significantly different from 1.0, then we can say there's
a systemic misestimate, and we can use this as a correction coefficient
when computing the scan estimate. (And we need to be careful about
collection new data, because the estimates will include this correction.
But that can be done by tracking "epoch" of the plan.)

Now, if someone uses this same scan in a join, like for example

   SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
    WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
  AND (t2.x = ? AND t2.y = ?)

then we can still apply the same correction to the t1 scan (I think).
But then we can also collect data for the t1-t2 join, and compute a
correction coefficient in a similar way. It requires a bit of care
because we need to compensate for mis

catalog files simplification

2019-06-12 Thread Peter Eisentraut
The current catalog files all do this:

CATALOG(pg_aggregate,2600,AggregateRelationId)
{
...
} FormData_pg_aggregate;

typedef FormData_pg_aggregate *Form_pg_aggregate;

The bottom part of this seems redundant.  With the attached patch, we
can generate that automatically, so this becomes just

CATALOG(pg_aggregate,2600,AggregateRelationId)
{
...
};
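
The trick is roughly the following (a simplified sketch of the extended macro, reusing the tree's CppConcat() token pasting; the actual definition in the patch may differ in details):

#define CATALOG(name,oid,oidmacro) \
	typedef struct CppConcat(FormData_,name) CppConcat(FormData_,name); \
	typedef CppConcat(FormData_,name) *CppConcat(Form_,name); \
	struct CppConcat(FormData_,name)

so that the example above expands to

typedef struct FormData_pg_aggregate FormData_pg_aggregate;
typedef FormData_pg_aggregate *Form_pg_aggregate;
struct FormData_pg_aggregate
{
...
};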

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 2b1b7e4c6d8b1f9f66bef0fd32992812425bb739 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 12 Jun 2019 13:17:55 +0200
Subject: [PATCH] Catalog files simplification

Extend the CATALOG() macro used in src/include/catalog/ to also create
the Form_pg_foo and FormData_pg_foo typedefs automatically.
---
 src/include/catalog/genbki.h  |  2 +-
 src/include/catalog/pg_aggregate.h|  9 +
 src/include/catalog/pg_am.h   |  9 +
 src/include/catalog/pg_amop.h |  9 +
 src/include/catalog/pg_amproc.h   |  9 +
 src/include/catalog/pg_attrdef.h  |  9 +
 src/include/catalog/pg_attribute.h|  9 +
 src/include/catalog/pg_auth_members.h |  9 +
 src/include/catalog/pg_authid.h   |  9 +
 src/include/catalog/pg_cast.h |  9 +
 src/include/catalog/pg_class.h|  9 +
 src/include/catalog/pg_collation.h|  9 +
 src/include/catalog/pg_constraint.h   |  9 +
 src/include/catalog/pg_conversion.h   |  9 +
 src/include/catalog/pg_database.h |  9 +
 src/include/catalog/pg_db_role_setting.h  |  4 +---
 src/include/catalog/pg_default_acl.h  |  9 +
 src/include/catalog/pg_depend.h   |  9 +
 src/include/catalog/pg_description.h  |  9 +
 src/include/catalog/pg_enum.h |  9 +
 src/include/catalog/pg_event_trigger.h|  9 +
 src/include/catalog/pg_extension.h|  9 +
 src/include/catalog/pg_foreign_data_wrapper.h |  9 +
 src/include/catalog/pg_foreign_server.h   |  9 +
 src/include/catalog/pg_foreign_table.h|  9 +
 src/include/catalog/pg_index.h|  9 +
 src/include/catalog/pg_inherits.h |  9 +
 src/include/catalog/pg_init_privs.h   |  9 +
 src/include/catalog/pg_language.h |  9 +
 src/include/catalog/pg_largeobject.h  |  9 +
 src/include/catalog/pg_largeobject_metadata.h |  9 +
 src/include/catalog/pg_namespace.h|  9 +
 src/include/catalog/pg_opclass.h  |  9 +
 src/include/catalog/pg_operator.h |  9 +
 src/include/catalog/pg_opfamily.h |  9 +
 src/include/catalog/pg_partitioned_table.h|  9 +
 src/include/catalog/pg_pltemplate.h   |  9 +
 src/include/catalog/pg_policy.h   |  9 +
 src/include/catalog/pg_proc.h |  9 +
 src/include/catalog/pg_publication.h  |  9 +
 src/include/catalog/pg_publication_rel.h  |  9 +
 src/include/catalog/pg_range.h|  9 +
 src/include/catalog/pg_replication_origin.h   |  4 +---
 src/include/catalog/pg_rewrite.h  |  9 +
 src/include/catalog/pg_seclabel.h |  2 +-
 src/include/catalog/pg_sequence.h |  9 +
 src/include/catalog/pg_shdepend.h |  9 +
 src/include/catalog/pg_shdescription.h|  9 +
 src/include/catalog/pg_shseclabel.h   |  4 +---
 src/include/catalog/pg_statistic.h| 10 +-
 src/include/catalog/pg_statistic_ext.h|  9 +
 src/include/catalog/pg_subscription.h |  4 +---
 src/include/catalog/pg_subscription_rel.h |  4 +---
 src/include/catalog/pg_tablespace.h   |  9 +
 src/include/catalog/pg_transform.h|  9 +
 src/include/catalog/pg_trigger.h  |  9 +
 src/include/catalog/pg_ts_config.h|  4 +---
 src/include/catalog/pg_ts_config_map.h|  4 +---
 src/include/catalog/pg_ts_dict.h  |  4 +---
 src/include/catalog/pg_ts_parser.h|  4 +---
 src/include/catalog/pg_ts_template.h  |  4 +---
 src/include/catalog/pg_type.h |  9 +
 src/include/catalog/pg_user_mapping.h |  9 +
 63 files changed, 63 insertions(+), 441 deletions(-)

diff --git a/src/include/catalog/genbki.h b/src/include/catalog/genbki.h
index 1b8e4e9e19..43e667af25 100644
--- a/src/include/catalog/genbki.h
+++ b/src/include/catalog/genbki.h
@@ -20,7 +20,7 @@
 #define GENBKI_H
 
 /* Introduces a catalog's structure defi

Re: Minimal logical decoding on standbys

2019-06-12 Thread Amit Khandekar
On Tue, 11 Jun 2019 at 12:24, Amit Khandekar  wrote:
>
> On Mon, 10 Jun 2019 at 10:37, Amit Khandekar  wrote:
> >
> > On Tue, 4 Jun 2019 at 21:28, Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
> > > > After giving more thought on this, I think it might make sense to
> > > > arrange for the xl_running_xact record to be sent from master to the
> > > > standby, when a logical slot is to be created on standby. How about
> > > > standby sending a new message type to the master, requesting for
> > > > xl_running_xact record ? Then on master, ProcessStandbyMessage() will
> > > > process this new message type and call LogStandbySnapshot().
> > >
> > > I think that should be a secondary feature. You don't necessarily know
> > > the upstream master, as the setup could be cascading one.
> > Oh yeah, cascading setup makes it more complicated.
> >
> > > I think for
> > > now just having to wait, perhaps with a comment to manually start a
> > > checkpoint, ought to suffice?
> >
> > Ok.
> >
> > Since this requires the test to handle the
> > fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
> > modifying the test file to do this. After doing that, I found that the
> > slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
> > This happens only when the restart_lsn is set to ReplayRecPtr.
> > Somehow, this does not happen when I manually create the logical slot.
> > It happens only while running testcase. Working on it ...
>
> Like I mentioned above, I get an assertion failure for
> Assert(XRecOffIsValid(RecPtr)) while reading WAL records looking for a
> start position (DecodingContextFindStartpoint()). This is because in
> CreateInitDecodingContext()=>ReplicationSlotReserveWal(), I now set
> the logical slot's restart_lsn to XLogCtl->lastReplayedEndRecPtr. And
> just after bringing up slave, lastReplayedEndRecPtr's initial values
> are in this order : 0/228, 0/260, 0/2D8, 0/2000100,
> 0/300, 0/360. You can see that 0/300 is not a valid value
> because it points to the start of a WAL block, meaning it points to
> the XLog page header (I think it's possible because it is 1 + endof
> last replayed record, which can be start of next block). So when we
> try to create a slot when it's in that position, then XRecOffIsValid()
> fails while looking for a starting point.
>
> One option I considered was : If lastReplayedEndRecPtr points to XLog
> page header, get a position of the first record on that WAL block,
> probably with XLogFindNextRecord(). But it is not trivial because
> while in ReplicationSlotReserveWal(), XLogReaderState is not created
> yet.

In the attached v6 version of the patch, I did the above. That is, I
used XLogFindNextRecord() to bump up the restart_lsn of the slot to
the first valid record. But since XLogReaderState is not available in
ReplicationSlotReserveWal(), I did this in
DecodingContextFindStartpoint(). And then updated the slot restart_lsn
with this corrected position.

Since XLogFindNextRecord() is currently disabled using #if 0, removed
this directive.
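
For those following along, the adjustment amounts to something like this sketch (not the v6 patch itself) inside DecodingContextFindStartpoint(), where ctx->reader is the slot's XLogReaderState:

XLogRecPtr	restart_lsn = MyReplicationSlot->data.restart_lsn;
XLogRecPtr	first_valid = XLogFindNextRecord(ctx->reader, restart_lsn);

if (!XLogRecPtrIsInvalid(first_valid) && first_valid != restart_lsn)
{
	/* restart_lsn pointed into a page header; move it to the first record */
	SpinLockAcquire(&MyReplicationSlot->mutex);
	MyReplicationSlot->data.restart_lsn = first_valid;
	SpinLockRelease(&MyReplicationSlot->mutex);
}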

> Or else, do you think we can just increment the record pointer by
> doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
> SizeOfXLogShortPHD() ?

I found out that we can't do this, because we don't know whether the
xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
our context, it is SizeOfXLogLongPHD. So we indeed need the
XLogReaderState handle.

>
> Do you think that we can solve this using some other approach ? I am
> not sure whether it's only the initial conditions that cause
> lastReplayedEndRecPtr value to *not* point to a valid record, or is it
> just a coincidence and that lastReplayedEndRecPtr can also have such a
> value any time afterwards. If it's only possible initially, we can
> just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
> lastReplayedEndRecPtr is invalid.

So now as the v6 patch stands, lastReplayedEndRecPtr is used to set
the restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint().

Also, modified the test to handle the requirement that the logical
slot creation on standby requires a checkpoint (or any other
transaction commit) to be given from master. For that, in
src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby(), which does the required steps.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


logical-decoding-on-standby_v6.patch
Description: Binary data


Re: fix psql \conninfo & \connect when using hostaddr

2019-06-12 Thread Fabien COELHO


Hello,


I got that. I'm working on it, and on the other issues you raised.

The issue I see is what do we want when a name resolves to multiple 
addresses. The answer is not fully obvious to me right now. I'll try to send 
a patch over the week-end.


At Alvaro's request, here is a quick WIP patch that does not do the right
thing, because there is no simple way to know whether hostaddr was set at
the libPQ level: either we set it always, which Noah complained about,
or we don't, which someone else will complain about quite easily, i.e.
with this patch


  \c "host=foo hostaddr=ip"

connects to ip, but then

  \c

will reconnect to foo but ignore ip. Well, ISTM that this is back to the 
previous doubtful behavior, so at least it is not a regression, just the 
same bug:-)


A solution could be to have a PQdoestheconnectionuseshostaddr(conn) 
function, but I cannot say I'd be thrilled.
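
For what it's worth, such an accessor could look roughly like this (purely hypothetical, it does not exist in libpq today; it just peeks at the same connhost fields the patch below touches):

/* hypothetical: report whether the active connection had an explicit hostaddr */
int
PQconnectionUsesHostaddr(const PGconn *conn)
{
	if (conn == NULL || conn->connhost == NULL)
		return 0;
	return (conn->connhost[conn->whichhost].hostaddr != NULL &&
			conn->connhost[conn->whichhost].hostaddr[0] != '\0');
}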


Another option would be to import PGconn full definition in 
"psql/command.c", but that would break the PQ interface, I cannot say I'd 
be thrilled either.


The patch returns host as defined by the user, but the regenerated
hostaddr (aka connip), which is not a homogeneous behavior. PQhost should
probably use connip if host was set as an IP, but that needs guessing.


The underlying issue is that the host/hostaddr stuff is not that easy to 
fix.


At least, after the patch the connection information (\conninfo) is still 
the right one, which is an improvement.


--
Fabien.diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 695d6ba9f1..4bf4726981 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -2937,16 +2937,16 @@ do_connect(enum trivalue reuse_previous_specification,
 		if (host && strcmp(host, PQhost(o_conn)) == 0)
 		{
 			/*
-			 * if we are targeting the same host, reuse its hostaddr for
-			 * consistency
+			 * if we are targeting the same host, we should reuse its hostaddr for
+			 * consistency if hostaddr was explicitely set. However, the library
+			 * does not allow to know that at the libPQ level.
 			 */
-			hostaddr = PQhostaddr(o_conn);
+			;
 		}
 		if (!host)
 		{
 			host = PQhost(o_conn);
-			/* also set hostaddr for consistency */
-			hostaddr = PQhostaddr(o_conn);
+			/* we should also set hostaddr for consistency, if hostaddr was set */
 		}
 		if (!port)
 			port = PQport(o_conn);
@@ -3129,7 +3129,10 @@ do_connect(enum trivalue reuse_previous_specification,
 			char	   *host = PQhost(pset.db);
 			char	   *hostaddr = PQhostaddr(pset.db);
 
-			/* If the host is an absolute path, the connection is via socket */
+			/*
+			 * If the host is an absolute path, the connection is via socket
+			 * unless overriden by hostaddr
+			 */
 			if (is_absolute_path(host))
 			{
 if (hostaddr && *hostaddr)
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index e58fa6742a..5d88c15cc5 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -1536,9 +1536,7 @@ getHostaddr(PGconn *conn, char *host_addr, int host_addr_len)
 {
 	struct sockaddr_storage *addr = &conn->raddr.addr;
 
-	if (conn->connhost[conn->whichhost].type == CHT_HOST_ADDRESS)
-		strlcpy(host_addr, conn->connhost[conn->whichhost].hostaddr, host_addr_len);
-	else if (addr->ss_family == AF_INET)
+	if (addr->ss_family == AF_INET)
 	{
 		if (inet_net_ntop(AF_INET,
 		  &((struct sockaddr_in *) addr)->sin_addr.s_addr,
@@ -6463,6 +6461,11 @@ PQhost(const PGconn *conn)
 
 	if (conn->connhost != NULL)
 	{
+		/*
+		 * note this return the host=... value provided by the user,
+		 * even if it is an IP, this it can include spaces and so,
+		 * whereas the next function uses the regenerated IP.
+		 */
 		if (conn->connhost[conn->whichhost].host != NULL &&
 			conn->connhost[conn->whichhost].host[0] != '\0')
 			return conn->connhost[conn->whichhost].host;
@@ -6480,15 +6483,8 @@ PQhostaddr(const PGconn *conn)
 	if (!conn)
 		return NULL;
 
-	if (conn->connhost != NULL)
-	{
-		if (conn->connhost[conn->whichhost].hostaddr != NULL &&
-			conn->connhost[conn->whichhost].hostaddr[0] != '\0')
-			return conn->connhost[conn->whichhost].hostaddr;
-
-		if (conn->connip != NULL)
-			return conn->connip;
-	}
+	if (conn->connhost != NULL && conn->connip != NULL)
+		return conn->connip;
 
 	return "";
 }


Re: Adaptive query optimization

2019-06-12 Thread Kuntal Ghosh
Hello,

On Wed, Jun 12, 2019 at 5:06 PM Konstantin Knizhnik
 wrote:
> On 12.06.2019 0:43, Tomas Vondra wrote:
> I don't think "learning phase" is an issue, in fact I think that's
> something we need to do - it ensures we have enough data to make good
> decisions.
>
> What is wrong with learning phase is that it requires some DBA assistance: 
> somebody should determine when to start learning,
> provide relevant workload and determine when learning is finished.
> One of the most recent trends in DBMSes is autonomous databases with zero 
> administration effort.
> It is especially important for clouds. And one of the main advantages of AQO
> is that it allows optimizing queries without human interaction.
>
> But unfortunately I really do not know how to avoid the learning phase,
> especially if we want to run queries on a replica.
>
Avoiding a learning phase in an AQO implementation sounds like an oxymoron. :-)
Perhaps you meant how we can minimize the effort in the learning phase. A
learning phase has its own complications - like
a. deciding the number of iterations needed to achieve a certain
kind of confidence
b. which parameters to tune (are the existing parameters enough?)
c. deciding the cost model
Coming up with answers for these things is pretty hard.

>
> I think that depends - some machine learning approaches are not that
> bad. But I think there's a more serious issue - explainability. We need
> a solution where we can explain/justify why it makes some decisions. I
> really don't want a black box that produces numbers that you just need
> to take at face value.
>
> The good thing is that the simpler the method, the less expensive and
> more explainable it is.
+1

>
> I tried to create much simpler version of AQO based on auto_explain extension.
> This extension provide all necessary infrastructure to analyze statements 
> with long execution time.
> I have added two new modes to auto_explain:
> 1. Auto generation of multicolumn statistics for variables used in clauses
> with large estimation error.
>
>
> Interesting! I probably wouldn't consider this part of adaptive query
> optimization, but it probably makes sense to make it part of this. I
> wonder if we might improve this to also suggest "missing" indexes?
>
I like this part of the implementation. I also agree that this can be
used to come up with good hypothetical index suggestions. But, it
needs some additional algorithms. For example, after analysing a set
of queries, we can come up with a minimal set of indexes that needs to
be created to minimize the total cost. I've not checked the internal
implementation of hypopg. Probably, I should do that.

>
> I think that it should be the next step of adaptive query optimization:
> - autogeneration of indexes
> - auto adjustment of optimizer cost parameters (cpu cost, random/sequential 
> page access cost,...)
AFAIK, the need for adjustment of cost parameters is highly dominated
by solving the selectivity estimation errors. But of course, you can
argue with that.

>
> Right. But I think I might have an idea how to address (some of) this.
>
> As I already mentioned, I was experimenting with something similar,
> maybe two or three years ago (I remember chatting about it with Teodor
> at pgcon last week). I was facing the same issues, and my approach was
> based on hooks too.
>
> But my idea was to not to track stats for a plan as a whole, but instead
> decompose it into individual nodes, categoried into three basic groups -
> scans, joins and aggregations. And then use this extracted information
> to other plans, with "matching" nodes.
>
> For example, let's consider a simple "single-scan" query
>
>SELECT * FROM t1 WHERE a = ? AND b = ? AND c < ?;
>
> Now, if you execute this enough times (say, 100x or 1000x), tracking
> the estimates and actual row counts, you may then compute the average
> misestimate (maybe a geometric mean would be more appropriate here?):
>
>AVG(actual/estimate)
>
>
> Certainly stats should be collected for each plan node, not for the whole 
> plan.
> And it is done now in Oleg's and my implementation.
> Oleg is using gradient descent method. I first tried to calculate average, 
> but then find out that building something like "histogram",
> where bin is determined as log10 of estimated number of rows.
>
I think maintaining a "histogram" sounds good. I've read a paper
called "Self-tuning Histograms: Building Histograms Without
Looking at Data" which tries to do something similar[1].

>
> and if this is significantly different from 1.0, then we can say there's
> a systemic misestimate, and we can use this as a correction coefficient
> when computing the scan estimate. (And we need to be careful about
> collection new data, because the estimates will include this correction.
> But that can be done by tracking "epoch" of the plan.)
>
> Now, if someone uses this same scan in a join, like for example
>
>SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
> WHERE (t1.a = ? AND t1.b = ? AN

Re: [pg_rewind] cp: cannot stat ‘pg_wal/RECOVERYHISTORY’: No such file or directory

2019-06-12 Thread tushar

On 06/10/2019 04:37 PM, Kuntal Ghosh wrote:

When we define a restore command, we tell the server to copy a
WAL file from the archive. So, it should be
restore_command='cp tmp/archive_dir1/%f %p'

This is the reason you're getting this following error.


Ohh. Mea Culpa.

Thanks for pointing out.

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: tableam: abstracting relation sizing code

2019-06-12 Thread Robert Haas
On Tue, Jun 11, 2019 at 7:17 PM Andres Freund  wrote:
> Just to understand: What version are you targeting? It seems pretty
> clearly v13 material to me?

My current plan is to commit this to v13 as soon as the tree opens.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: catalog files simplification

2019-06-12 Thread Robert Haas
On Wed, Jun 12, 2019 at 7:52 AM Peter Eisentraut
 wrote:
> The current catalog files all do this:
>
> CATALOG(pg_aggregate,2600,AggregateRelationId)
> {
> ...
> } FormData_pg_aggregate;
>
> typedef FormData_pg_aggregate *Form_pg_aggregate;
>
> The bottom part of this seems redundant.  With the attached patch, we
> can generate that automatically, so this becomes just
>
> CATALOG(pg_aggregate,2600,AggregateRelationId)
> {
> ...
> };

Maybe the macro definition could be split across several lines instead
of having one really long line?

Are some compilers going to be sad about typedef struct x x; preceding
any declaration or definition of struct x?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: catalog files simplification

2019-06-12 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jun 12, 2019 at 7:52 AM Peter Eisentraut
>  wrote:
>> The current catalog files all do this:
>> 
>> CATALOG(pg_aggregate,2600,AggregateRelationId)
>> {
>> ...
>> } FormData_pg_aggregate;
>> 
>> typedef FormData_pg_aggregate *Form_pg_aggregate;
>> 
>> The bottom part of this seems redundant.  With the attached patch, we
>> can generate that automatically, so this becomes just
>> 
>> CATALOG(pg_aggregate,2600,AggregateRelationId)
>> {
>> ...
>> };

> Maybe the macro definition could be split across several lines instead
> of having one really long line?

I think that would complicate Catalog.pm; not clear if it's worth it.

> Are some compilers going to be sad about typedef struct x x; preceding
> any declaration or definition of struct x?

Nope, we have lots of instances of that already, cf "opaque struct"
declarations in various headers.

A bigger objection might be that this would leave us with no obvious-
to-the-untrained-eye declaration point for either the struct name or
the two typedef names.  That might make tools like ctags sad.  Perhaps
it's not really any worse than today, but it bears investigation.

We should also check whether pgindent has any issue with this layout.

regards, tom lane




Are there still non-ELF BSD systems?

2019-06-12 Thread Peter Eisentraut
The shared library code has some support for non-ELF BSD systems.  I
suspect that this is long obsolete.  Could we remove it?  See attached.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 3f21b2e13226dc1e0f5ea2fc2f7844825e5610ce Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 12 Jun 2019 15:51:57 +0200
Subject: [PATCH] Remove support for non-ELF BSD systems

This is long obsolete.
---
 configure.in   | 24 
 src/Makefile.global.in |  4 --
 src/Makefile.shlib | 71 ++
 src/makefiles/Makefile.freebsd | 12 --
 src/makefiles/Makefile.netbsd  | 14 ---
 src/makefiles/Makefile.openbsd | 12 --
 6 files changed, 28 insertions(+), 109 deletions(-)

diff --git a/configure.in b/configure.in
index 4586a1716c..dff6a1ae50 100644
--- a/configure.in
+++ b/configure.in
@@ -964,30 +964,6 @@ PGAC_ARG_BOOL(with, zlib, yes,
   [do not use Zlib])
 AC_SUBST(with_zlib)
 
-#
-# Elf
-#
-
-# Assume system is ELF if it predefines __ELF__ as 1,
-# otherwise believe host_os based default.
-case $host_os in
-freebsd1*|freebsd2*) elf=no;;
-freebsd3*|freebsd4*) elf=yes;;
-esac
-
-AC_EGREP_CPP(yes,
-[#if __ELF__
-  yes
-#endif
-],
-[ELF_SYS=true],
-[if test "X$elf" = "Xyes" ; then
-  ELF_SYS=true
-else
-  ELF_SYS=
-fi])
-AC_SUBST(ELF_SYS)
-
 #
 # Assignments
 #
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index b9d86acaa9..321af38b0c 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -506,10 +506,6 @@ host_tuple = @host@
 host_os = @host_os@
 host_cpu = @host_cpu@
 
-# This is mainly for use on FreeBSD, where we have both a.out and elf
-# systems now.  May be applicable to other systems to?
-ELF_SYSTEM= @ELF_SYS@
-
 # Backend stack size limit has to be hard-wired on Windows (it's in bytes)
 WIN32_STACK_RLIMIT=4194304
 
diff --git a/src/Makefile.shlib b/src/Makefile.shlib
index 373d73caef..c4893edfc3 100644
--- a/src/Makefile.shlib
+++ b/src/Makefile.shlib
@@ -139,57 +139,42 @@ ifeq ($(PORTNAME), darwin)
 endif
 
 ifeq ($(PORTNAME), openbsd)
-  ifdef ELF_SYSTEM
-LINK.shared= $(COMPILER) -shared
-ifdef soname
-  LINK.shared  += -Wl,-x,-soname,$(soname)
-endif
-BUILD.exports  = ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf 
"%s;\n",$$1}' $<; echo ' local: *; };' ) >$@
-exports_file   = $(SHLIB_EXPORTS:%.txt=%.list)
-ifneq (,$(exports_file))
-  LINK.shared  += -Wl,--version-script=$(exports_file)
-endif
-SHLIB_LINK += -lc
-  else
-LINK.shared= $(LD) -x -Bshareable -Bforcearchive
+  LINK.shared  = $(COMPILER) -shared
+  ifdef soname
+LINK.shared+= -Wl,-x,-soname,$(soname)
   endif
+  BUILD.exports= ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf 
"%s;\n",$$1}' $<; echo ' local: *; };' ) >$@
+  exports_file = $(SHLIB_EXPORTS:%.txt=%.list)
+  ifneq (,$(exports_file))
+LINK.shared+= -Wl,--version-script=$(exports_file)
+  endif
+  SHLIB_LINK   += -lc
 endif
 
 ifeq ($(PORTNAME), freebsd)
-  ifdef ELF_SYSTEM
-ifdef SO_MAJOR_VERSION
-  shlib= lib$(NAME)$(DLSUFFIX).$(SO_MAJOR_VERSION)
-endif
-LINK.shared= $(COMPILER) -shared
-ifdef soname
-  LINK.shared  += -Wl,-x,-soname,$(soname)
-endif
-BUILD.exports  = ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf 
"%s;\n",$$1}' $<; echo ' local: *; };' ) >$@
-exports_file   = $(SHLIB_EXPORTS:%.txt=%.list)
-ifneq (,$(exports_file))
-  LINK.shared  += -Wl,--version-script=$(exports_file)
-endif
-  else
-ifdef SO_MAJOR_VERSION
-  shlib= 
lib$(NAME)$(DLSUFFIX).$(SO_MAJOR_VERSION).$(SO_MINOR_VERSION)
-endif
-LINK.shared= $(LD) -x -Bshareable -Bforcearchive
+  ifdef SO_MAJOR_VERSION
+shlib  = lib$(NAME)$(DLSUFFIX).$(SO_MAJOR_VERSION)
+  endif
+  LINK.shared  = $(COMPILER) -shared
+  ifdef soname
+LINK.shared+= -Wl,-x,-soname,$(soname)
+  endif
+  BUILD.exports= ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf 
"%s;\n",$$1}' $<; echo ' local: *; };' ) >$@
+  exports_file = $(SHLIB_EXPORTS:%.txt=%.list)
+  ifneq (,$(exports_file))
+LINK.shared+= -Wl,--version-script=$(exports_file)
   endif
 endif
 
 ifeq ($(PORTNAME), netbsd)
-  ifdef ELF_SYSTEM
-LINK.shared= $(COMPILER) -shared
-ifdef soname
-  LINK.shared  += -Wl,-x,-soname,$(soname)
-endif
-BUILD.exports  = ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf 
"%s;\n",$$1}' $<; echo ' local: *; };' ) >$@
-exports_file   = $(SHLIB_EXPORTS:%.txt=%.list)
-ifneq (,$(exports_file))
-  LINK.shared  += -Wl,--version-script=$(exports_file)
-endif
-  else
-LINK.shared=

Re: catalog files simplification

2019-06-12 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> Maybe the macro definition could be split across several lines instead
>> of having one really long line?

> I think that would complicate Catalog.pm; not clear if it's worth it.

Oh, cancel that --- in an uncaffeinated moment, I thought you were asking
about splitting the *call* sites of the CATALOG() macro.  I agree that
the *definition* could be laid out better than it is here.

regards, tom lane




Fix inconsistencies for v12 (pass 2)

2019-06-12 Thread Alexander Lakhin
Hello Amit,

Can you also review the following fixes?:
2.1. bt_binsrch_insert -> _bt_binsrch_insert (an internal inconsistency)
2.2. EWOULDBOCK -> EWOULDBLOCK (a typo)
2.3. FORGET_RELATION_FSYNC & FORGET_DATABASE_FSYNC ->
SYNC_FORGET_REQUEST (orphaned after 3eb77eba)
2.4. GetNewObjectIdWithIndex -> GetNewOidWithIndex (an internal
inconsistency)
2.5. get_opclass_family_and_input_type ->
get_opclass_opfamily_and_input_type (an internal inconsistency)
2.6. HAVE_BUILTIN_CLZ -> HAVE__BUILTIN_CLZ (missing underscore)
2.7. HAVE_BUILTIN_CTZ -> HAVE__BUILTIN_CTZ (missing underscore)
2.8. MultiInsertInfoNextFreeSlot -> CopyMultiInsertInfoNextFreeSlot (an
internal inconsistency)
2.9. targetIsArray -> targetIsSubscripting (an internal inconsistency)
2.10. tss_htup -> remove (orphaned after 2e3da03e)

I can't see any other inconsistencies for v12 for now, but there are some
that appeared earlier.
If this work can be performed more effectively or should be
postponed/canceled, please let me know.

Best regards,
Alexander
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1f809c24a1..c655dadb96 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -424,7 +424,7 @@ _bt_binsrch(Relation rel,
 
 /*
  *
- *	bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *	_bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
  *
  * Like _bt_binsrch(), but with support for caching the binary search
  * bounds.  Only used during insertion, and only on the leaf page that it
diff --git a/src/backend/libpq/be-secure-gssapi.c b/src/backend/libpq/be-secure-gssapi.c
index 1673b10315..ba8c0cd0f0 100644
--- a/src/backend/libpq/be-secure-gssapi.c
+++ b/src/backend/libpq/be-secure-gssapi.c
@@ -400,7 +400,7 @@ read_or_wait(Port *port, ssize_t len)
 		{
 			/*
 			 * If we got back less than zero, indicating an error, and that
-			 * wasn't just a EWOULDBOCK/EAGAIN, then give up.
+			 * wasn't just a EWOULDBLOCK/EAGAIN, then give up.
 			 */
 			if (ret < 0 && !(errno == EWOULDBLOCK || errno == EAGAIN))
 return -1;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 13f152b473..6ceff63ae1 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1213,8 +1213,8 @@ CompactCheckpointerRequestQueue(void)
 	 * backwards from the end of the queue and check whether a request is
 	 * *preceded* by an earlier, identical request, in the hopes of doing less
 	 * copying.  But that might change the semantics, if there's an
-	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
-	 * we do it this way.  It would be possible to be even smarter if we made
+	 * intervening SYNC_FORGET_REQUEST, so we do it this way.
+	 * It would be possible to be even smarter if we made
 	 * the code below understand the specific semantics of such requests (it
 	 * could blow away preceding entries that would end up being canceled
 	 * anyhow), but it's not clear that the extra complexity would buy us
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index b4f2d0f35a..c13c08a97b 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -1055,7 +1055,7 @@ get_opclass_input_type(Oid opclass)
 }
 
 /*
- * get_opclass_family_and_input_type
+ * get_opclass_opfamily_and_input_type
  *
  *		Returns the OID of the operator family the opclass belongs to,
  *the OID of the datatype the opclass indexes
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 11936a6571..a065419cdb 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -383,7 +383,7 @@ GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
  * is also an unused OID within pg_class.  If the result is to be used only
  * as a relfilenode for an existing relation, pass NULL for pg_class.
  *
- * As with GetNewObjectIdWithIndex(), there is some theoretical risk of a race
+ * As with GetNewOidWithIndex(), there is some theoretical risk of a race
  * condition, but it doesn't seem worth worrying about.
  *
  * Note: we don't support using this in bootstrap mode.  All relations
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 60fb55af53..589c2f1615 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -28,7 +28,7 @@
  * left-most the 7th bit.  The 0th entry of the array should not be used.
  *
  * Note: this is not used by the functions in pg_bitutils.h when
- * HAVE_BUILTIN_CLZ is defined, but we provide it anyway, so that
+ * HAVE__BUILTIN_CLZ is defined, but we provide it anyway, so that
  * extensions possibly compiled with a different compiler can use it.
  */
 const uint8 pg_leftmost_one_pos[256] = {
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 60fb55af53..8d9f5be32a 100644
--- a/src/port/pg_bitutils.

Re: Binary support for pgoutput plugin

2019-06-12 Thread Dave Cramer
On Mon, 10 Jun 2019 at 07:49, Petr Jelinek 
wrote:

> Hi,
>
> On 10/06/2019 13:27, Dave Cramer wrote:
> > So back to binary output.
> >
> > From what I can tell the place to specify binary options would be in the
> > create publication and or in replication slots?
> >
> > The challenge as I see it is that the subscriber would have to be able
> > to decode binary output.
> >
> > Any thoughts on how to handle this? At the moment I'm assuming that this
> > would only work for subscribers that knew how to handle binary.
> >
>
> Given that we don't need to write anything extra to WAL on upstream to
> support binary output, why not just have the request for binary data as
> an option for the pgoutput and have it chosen dynamically? Then it's the
> subscriber who asks for binary output via option(s) to START_REPLICATION.
>

If I understand this correctly, this would add something to the CREATE/ALTER
SUBSCRIPTION commands in the WITH clause.
Additionally, another column would be required in pg_subscription for the
binary option.
Does it make sense to add an options column which would just be a
comma-separated string?
Not that I have future options in mind, but it seems like something that
might come up in the future.
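
For concreteness, the WITH-clause approach might look something like this
(hypothetical syntax sketching the proposal; the "binary" option name and its
storage in pg_subscription are assumptions, not an implemented feature):

    -- subscriber asks the publisher's pgoutput for binary-format data
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (binary = true);

    -- and could switch back to text format later
    ALTER SUBSCRIPTION mysub SET (binary = false);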


Dave Cramer

>
>


Re: Are there still non-ELF BSD systems?

2019-06-12 Thread Tom Lane
Peter Eisentraut  writes:
> The shared library code has some support for non-ELF BSD systems.  I
> suspect that this is long obsolete.  Could we remove it?  See attached.

I checked around a bit ... all of the *BSD systems in the buildfarm
report ELF_SYS='true', so it's safe to say that the code you want to
remove is untested.

Excavation in the various BSDens' change logs suggests that the last
one to fully drop a.out was OpenBSD, which didn't do so until 5.5
(released 1 May 2014).  That's more recent than I'd have hoped for,
though it looks like the holdout architectures were ones we don't
support anyway (e.g., m68k, vax).

If we're considering this change for v13, it's hard to believe
there'd be any objections in practice.

regards, tom lane




Re: upgrades in row-level locks can deadlock

2019-06-12 Thread Oleksii Kliukin
Oleksii Kliukin  wrote:

> Hi,
> 
> Alvaro Herrera  wrote:
> 
>> On 2019-May-22, Oleksii Kliukin wrote:
>> 
>>> X1: select id from job where name = 'a' for key share;
>>> Y: select id from job where name = 'a' for update; -- starts waiting for X1
>>> X2: select id from job where name = 'a' for key share;
>>> X1: update job set name = 'b' where id = 1;
>>> X2: update job set name = 'c' where id = 1; -- starts waiting for X1
>>> X1: rollback;
>>> 
>>> At this point, Y is terminated by the deadlock detector:
>> 
>> Yeah, this seems undesirable in general terms.  Here's a quick
>> isolationtester spec that reproduces the problem:
>> 
>> setup {
>>  drop table if exists tlu_job;
>>  create table tlu_job (id integer primary key, name text);
>>  insert into tlu_job values (1, 'a');
>> }
>> 
>> teardown {
>>  drop table tlu_job;
>> }
>> 
>> session "s1"
>> setup{ begin; }
>> step "s1_keyshare"   { select id from tlu_job where name = 'a' for key 
>> share; }
>> step "s1_update" { update tlu_job set name = 'b' where id = 1; }
>> step "s1_rollback"   { rollback; }
>> 
>> session "s2"
>> setup{ begin; }
>> step "s2_keyshare"   { select id from tlu_job where name = 'a' for key 
>> share; }
>> step "s2_update" { update tlu_job set name = 'c' where id = 1; }
>> step "s2_commit" { commit; }
>> 
>> session "s3"
>> setup{ begin; }
>> step "s3_forupd" { select id from tlu_job where name = 'a' for update; }
>> teardown { commit; }
> 
> Thank you! I can make it even simpler;  s1 just acquires for share lock, s3
> gets for update one and s2 takes for share lock first, and then tries to
> acquire for update one; once s1 finishes, s3 deadlocks.
> 
>> But semantically, I wonder if your transactions are correct.  If you
>> intend to modify the row in s1 and s2, shouldn't you be acquiring FOR NO
>> KEY UPDATE lock instead?  I don't see how can s1 and s2 coexist
>> peacefully.  Also, can your Y transaction use FOR NO KEY UPDATE instead
>> .. unless you intend to delete the tuple in that transaction?
> 
> It is correct. I wanted to make sure jobs that acquire for key share lock
> can run concurrently most of the time; they only execute one update at the
> end of the job, and it is just to update the last run timestamp.
> 
>> I'm mulling over your proposed fix.  I don't much like the idea that
>> DoesMultiXactIdConflict() returns that new boolean -- seems pretty
>> ad-hoc -- but I don't see any way to do better than that ...  (If we get
>> down to details, DoesMultiXactIdConflict needn't initialize that
>> boolean: better let the callers do.)
> 
> I am also not happy about the new parameter to DoesMultiXactIdConflict, but
> calling a separate function to fetch the presence of the current transaction
> in the multixact would mean doing the job of DoesMultiXactIdConflict twice.
> I am open to suggestions on how to make it nicer.
> 
> Attached is a slightly modified patch that avoids initializing
> has_current_xid inside DoesMultiXactIdConflict and should apply cleanly to
> the current master.

And here is the v3 that also includes the isolation test I described above
(quoting my previous message in full as I accidentally sent it off-list,
sorry about that).

Cheers,
Oleksii


deadlock_on_row_lock_upgrades_v3.diff
Description: Binary data


Re: GiST limits on contrib/cube with dimension > 100?

2019-06-12 Thread Andrey Borodin



> On 12 June 2019, at 15:11, Siarhei Siniak  wrote:
> 
> A uniform set of 50 ** 3 points with dimension 128 and type cube can be
> queried for a nearest neighbor at a speed of 10 queries per second, with
> LIMIT varying from 1 to 25.
> That works better than when no index is used at all, so GiST does give a speedup here.

Then, I think, the data is correlated. I believe it is possible to implement 
something better than cube for high-dimensional KNN in GiST.


> 
> The PostgreSQL documentation doesn't mention the complexity of such an index, 
> so I was confused about its speed.
> 
> Does PostgreSQL allow for an approximate nearest neighbor search?
> https://github.com/erikbern/ann-benchmarks has a lot of efficient 
> implementations.

ANN is beyond the concepts of the SQL standard: a database with an index must 
return the same results as without the index.
I can add ANN to github.com/x4m/ags, which is a fork of GiST.

But PostgreSQL adds a lot of OLTP overhead:
1. it is crash-safe
2. it supports concurrent operations
2a. in a very unexpected way; for example, in serializable mode it guarantees 
that if you were looking for a nearest neighbor, no new closer neighbor will 
appear until the end of your transaction.
3. it allows extensibility and has abstraction layers
4. it has a declarative query language
etc., etc.

All this comes at the cost of a database that can do anything and be anything. 
It is very hard to compete with specialized indexes that only do ANN.

Yet, as far as I know, no one has really pursued the idea of fast 
high-dimensional ANN in PG.

Best regards, Andrey Borodin.



RE: proposal: pg_restore --convert-to-text

2019-06-12 Thread Daniel Verite
José Arthur Benetasso Villanova wrote:

> On Thu, 28 Feb 2019, Imai, Yoshikazu wrote:
> 
> > Is there no need to rewrite the Description in the docs to state that we 
> > should specify either the -d or the -f option?
> > (It might also be better to note that if the -l option is given, neither 
> > -d nor -f is needed.)
> 
> Since the default part of the text was removed, it looks OK to me.

[4 months later]

While testing pg_restore on v12, I'm stumbling on this too.
pg_restore without argument fails like that:

$ pg_restore
pg_restore: error: one of -d/--dbname and -f/--file must be specified

But that's not right since "pg_restore -l dumpfile" would work.
I don't understand why the discussion seems to conclude that it's okay.

Also, in the doc at https://www.postgresql.org/docs/devel/app-pgrestore.html
the synopsis is

  pg_restore [connection-option...] [option...] [filename]

so the invocation without argument seems possible while in fact
it's not.

On a subjective note, I must say that I don't find the change to be
helpful.
In particular saying that --file must be specified leaves me with
no idea what file it's talking about: the dumpfile or the output file?
My first inclination is to transform "pg_restore dumpfile" into
"pg_restore -f dumpfile", which does nothing but wait
(it waits for the dump on standard input) until I realize that it's
not working and hit ^C. Fortunately it doesn't overwrite the dumpfile
with an empty output.

So the change in v12 removes the standard output as default,
but does not remove the standard input as default.
Isn't that weird?


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: GiST limits on contrib/cube with dimension > 100?

2019-06-12 Thread Siarhei Siniak
 
>ANN is beyond the concepts of the SQL standard: a database with an index must 
>return the same results as without the index.
>I can add ANN to github.com/x4m/ags, which is a fork of GiST.

How can I recompile that extension without getting a name conflict with the 
standard one?
I've renamed everything for the cube extension when I needed to fork it, but 
it's impractical.

Re: upgrades in row-level locks can deadlock

2019-06-12 Thread Alvaro Herrera
On 2019-Jun-12, Oleksii Kliukin wrote:

> Thank you! I can make it even simpler;  s1 just acquires for share lock, s3
> gets for update one and s2 takes for share lock first, and then tries to
> acquire for update one; once s1 finishes, s3 deadlocks.

Cool.  I think it would be worthwhile to include a number of reasonable
permutations instead of just one, and make sure they all work correctly.
I don't think we need to include all possible permutations, just a few.
I think we need at least enough permutations to cover the two places of
the code that are modified by the patch, for both values of
have_current_xid (so there should be four permutations, I think).
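
For instance, continuing the spec quoted upthread, the permutation lines
could look something like this (just a sketch; which permutations are
actually needed is for the patch to settle, and the expected output would
still have to be generated):

    permutation "s1_keyshare" "s3_forupd" "s2_keyshare" "s1_update" "s2_update" "s1_rollback" "s2_commit"
    permutation "s1_keyshare" "s3_forupd" "s2_keyshare" "s2_update" "s1_update" "s1_rollback" "s2_commit"
    permutation "s1_keyshare" "s3_forupd" "s1_update" "s2_keyshare" "s2_update" "s1_rollback" "s2_commit"
    permutation "s2_keyshare" "s3_forupd" "s1_keyshare" "s1_update" "s2_update" "s1_rollback" "s2_commit"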

Please don't simplify the table name to just "t" -- the reason I used
another name is that we want these tests to be able to run concurrently
at some point; ref.
https://postgr.es/m/20180124231006.z7spaz5gkzbdvob5@alvherre.pgsql

> > But semantically, I wonder if your transactions are correct.  If you
> > intend to modify the row in s1 and s2, shouldn't you be acquiring FOR NO
> > KEY UPDATE lock instead?  I don't see how can s1 and s2 coexist
> > peacefully.  Also, can your Y transaction use FOR NO KEY UPDATE instead
> > .. unless you intend to delete the tuple in that transaction?
> 
> It is correct. I wanted to make sure jobs that acquire for key share lock
> can run concurrently most of the time; they only execute one update at the
> end of the job, and it is just to update the last run timestamp.

I see.  Under READ COMMITTED it works okay, I suppose.

> > I'm mulling over your proposed fix.  I don't much like the idea that
> > DoesMultiXactIdConflict() returns that new boolean -- seems pretty
> > ad-hoc -- but I don't see any way to do better than that ...  (If we get
> > down to details, DoesMultiXactIdConflict needn't initialize that
> > boolean: better let the callers do.)
> 
> I am also not happy about the new parameter to DoesMultiXactIdConflict, but
> calling a separate function to fetch the presence of the current transaction
> in the multixact would mean doing the job of DoesMultiXactIdConflict twice.
> I am open to suggestions on how to make it nicer.

Yeah, I didn't find anything better either.  We could make things more
complex such that we resolve the multixact once and then extract the two
separate bits of information that we need from that ... but it ends up
being uglier and messier for no real gain.  So let's go with your
original idea.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: proposal: pg_restore --convert-to-text

2019-06-12 Thread Andrew Gierth
> "Daniel" == Daniel Verite  writes:

 Daniel> While testing pg_restore on v12, I'm stumbling on this too.
 Daniel> pg_restore without argument fails like that:

 Daniel> $ pg_restore
 Daniel> pg_restore: error: one of -d/--dbname and -f/--file must be specified

Yeah, that's not good.

How about:

pg_restore: error: no destination specified for restore
pg_restore: error: use -d/--dbname to restore into a database, or -f/--file to 
output SQL text

-- 
Andrew (irc:RhodiumToad)




Re: proposal: pg_restore --convert-to-text

2019-06-12 Thread Alvaro Herrera
On 2019-Jun-12, Daniel Verite wrote:

> While testing pg_restore on v12, I'm stumbling on this too.

Thanks for testing.

> pg_restore without argument fails like that:
> 
> $ pg_restore
> pg_restore: error: one of -d/--dbname and -f/--file must be specified
> 
> But that's not right since "pg_restore -l dumpfile" would work.

So you suggest that it should be 

pg_restore: error: one of -d/--dbname, -f/--file and -l/--list must be specified
?

> Also, in the doc at https://www.postgresql.org/docs/devel/app-pgrestore.html
> the synopsis is
> 
>   pg_restore [connection-option...] [option...] [filename]
> 
> so the invocation without argument seems possible while in fact
> it's not.

So you suggest that it should be 
   pg_restore [connection-option...] { -d | -f | -l } [option...] [filename]
?

Maybe we should do that and split out the "output destination options"
from other options in the list of options, to make this clearer; see
a proposal after my sig.


> In particular saying that --file must be specified leaves me with
> no idea what file it's talking about: the dumpfile or the output file?

If you want to submit a patch (for pg13) to rename --file to
--output-file (and make --file an alias of that), you're welcome to, and
endure the resulting discussion and possible rejection.  I don't think
we're changing that at this point of pg12.

> My first inclination is to transform "pg_restore dumpfile" into
> "pg_restore -f dumpfile", which does nothing but wait
> (it waits for the dumpfile on the standard input, before
> I realize that it's not working and hit ^C. Fortunately it doesn't
> overwrite the dumpfile with an empty output).

Would you have it emit to stderr a message saying "reading standard
input" when it is?

> So the change in v12 removes the standard output as default,
> but does not remove the standard input as default.
> Isn't that weird?

I don't think they have the same surprise factor.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Usage:
   pg_restore [connection-option...] { -d | -f | -l } [option...] [filename]

General options:
  -F, --format=c|d|t   backup file format (should be automatic)
  -v, --verboseverbose mode
  -V, --versionoutput version information, then exit
  -?, --help   show this help, then exit

Output target options:
  -l, --list   print summarized TOC of the archive
  -d, --dbname=NAMEconnect to database name
  -f, --file=FILENAME  output file name (- for stdout)

Options controlling the restore:
  -a, --data-only  restore only the data, no schema
  -c, --clean  clean (drop) database objects before recreating
  ...




Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Alvaro Herrera
On 2019-Apr-29, Tom Lane wrote:

> Ashwin Agrawal  writes:
> > On Mon, Apr 29, 2019 at 10:36 AM Tom Lane  wrote:
> >> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> >> to see if it fixes the problem for you?
> 
> > Yes, will give it a try on greenplum and report back the result.
> 
> > Have we decided if this will be applied to back branches?

Hi Ashwin, did you have the chance to try this out?


> My feeling about it is "maybe eventually, but most definitely not
> the week before a set of minor releases".  Some positive experience
> with Greenplum would help increase confidence in the patch, for sure.

I looked at the buildfarm failures for the recoveryCheck stage.  It
looks like there is only one failure for branch master after this
commit, which was chipmunk saying:

  # poll_query_until timed out executing this query:
  # SELECT application_name, sync_priority, sync_state FROM pg_stat_replication 
ORDER BY application_name;
  # expecting this output:
  # standby1|1|sync
  # standby2|2|sync
  # standby3|2|potential
  # standby4|2|potential
  # last actual query output:
  # standby1|1|sync
  # standby2|2|potential
  # standby3|2|sync
  # standby4|2|potential
  # with stderr:
  not ok 6 - asterisk comes before another standby name

  #   Failed test 'asterisk comes before another standby name'
  #   at t/007_sync_rep.pl line 26.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
AFAICS this is wholly unrelated to the problem at hand.

No other animal failed recoveryCheck test; before the commit, the
failure was not terribly frequent, but rarely would 10 days go by
without it failing.  So I suggest that the bug has indeed been fixed.

Maybe now's a good time to get it back-patched?  In branch
REL_11_STABLE, it failed as recently as 11 days ago in gull,
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2019-06-01%2004%3A11%3A36

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Quitting the thes

2019-06-12 Thread Alvaro Herrera
On 2019-May-14, Michael Paquier wrote:

> Thanks, committed.  I have noticed an extra one in reorderbuffer.c.

Some grepping found a bit more; patch attached.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 9e00caf355ae9190ba88c12e969262da5c756864 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 12 Jun 2019 14:16:18 -0400
Subject: [PATCH] Fix double-word typos

---
 doc/src/sgml/backup.sgml   | 2 +-
 src/bin/pg_dump/pg_dump.c  | 4 ++--
 src/include/executor/tuptable.h| 2 +-
 src/include/fmgr.h | 2 +-
 src/include/utils/snapshot.h   | 2 +-
 src/interfaces/ecpg/preproc/type.c | 2 +-
 src/test/modules/test_integerset/test_integerset.c | 4 ++--
 src/test/recovery/t/013_crash_restart.pl   | 2 +-
 src/test/regress/expected/polygon.out  | 2 +-
 src/test/regress/sql/polygon.sql   | 2 +-
 10 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index b67da8916a..9d4c000df0 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -965,7 +965,7 @@ SELECT * FROM pg_stop_backup(false, true);
  backups.  Moreover, because it writes a backup_label file on the
  master, it can cause the master to fail to restart automatically after
  a crash.  On the other hand, the erroneous removal of a backup_label
- file from a backup or standby is a common mistake which can can result
+ file from a backup or standby is a common mistake which can result
  in serious data corruption.  If it is necessary to use this method,
  the following steps may be used.
 
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 36e8d27edd..8909a45d61 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4051,8 +4051,8 @@ dumpPublicationTable(Archive *fout, PublicationRelInfo *pubrinfo)
 	  fmtQualifiedDumpable(tbinfo));
 
 	/*
-	 * There is no point in creating drop query as drop query as the drop is
-	 * done by table drop.
+	 * There is no point in creating drop query as the drop is done by table
+	 * drop.
 	 */
 	ArchiveEntry(fout, pubrinfo->dobj.catId, pubrinfo->dobj.dumpId,
  ARCHIVE_OPTS(.tag = tag,
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index d912280386..0710a7dd38 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -59,7 +59,7 @@
  * also that a virtual tuple does not have any "system columns".
  *
  * The Datum/isnull arrays of a TupleTableSlot serve double duty.  For virtual
- * slots they they are the authoritative data.  For the other builtin slots,
+ * slots they are the authoritative data.  For the other builtin slots,
  * the arrays contain data extracted from the tuple.  (In this state, any
  * pass-by-reference Datums point into the physical tuple.)  The extracted
  * information is built "lazily", ie, only as needed.  This serves to avoid
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 1ac71a40f5..3ff099986b 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -96,7 +96,7 @@ typedef struct FunctionCallInfoBaseData
 } FunctionCallInfoBaseData;
 
 /*
- * Space needed for for a FunctionCallInfoBaseData struct with sufficient space
+ * Space needed for a FunctionCallInfoBaseData struct with sufficient space
  * for `nargs` arguments.
  */
 #define SizeForFunctionCallInfo(nargs) \
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 9f92ce011d..c00f1fe908 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -69,7 +69,7 @@ typedef enum SnapshotType
 	SNAPSHOT_ANY,
 
 	/*
-	 * A tuple is visible iff the tuple tuple is valid as a TOAST row.
+	 * A tuple is visible iff the tuple is valid as a TOAST row.
 	 */
 	SNAPSHOT_TOAST,
 
diff --git a/src/interfaces/ecpg/preproc/type.c b/src/interfaces/ecpg/preproc/type.c
index 165b5df24f..6a71766ccf 100644
--- a/src/interfaces/ecpg/preproc/type.c
+++ b/src/interfaces/ecpg/preproc/type.c
@@ -582,7 +582,7 @@ ECPGdump_a_struct(FILE *o, const char *name, const char *ind_name, char *arrsize
 {
 	/*
 	 * If offset is NULL, then this is the first recursive level. If not then
-	 * we are in a struct in a struct and the offset is used as offset.
+	 * we are in a struct and the offset is used as offset.
 	 */
 	struct ECPGstruct_member *p,
 			   *ind_p = NULL;
diff --git a/src/test/modules/test_integerset/test_integerset.c b/src/test/modules/test_integerset/test_integerset.c
index 15b9ce1ac3..713f4a69ef 100644
--- a/src/test/modules/test_integerset/test_integerset.c
+++ b/src/test/modules/test_integerset/test_integerset.c
@@ -246,7 +246,7 @@ test_pattern(const test_spec *spec)
 		 * last integer that we added to the set, plus an arbitrary consta

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-06-12 Thread Bruce Momjian
On Wed, Jun  5, 2019 at 11:54:04AM +0900, Masahiko Sawada wrote:
> On Fri, May 10, 2019 at 2:42 AM Bruce Momjian  wrote:
> > I think we need to step back and see what we want to do.  There are six
> > levels of possible encryption:
> >
> > 1.  client-side column encryption
> > 2.  server-side column encryption
> > 3.  table-level
> > 4.  database-level
> > 5.  tablespace-level
> > 6.  cluster-level
> >
> > 1 & 2 encrypt the data in the WAL automatically, and option 6 is
> > encrypting the entire WAL.  This leaves 3-5 as cases where there will be
> > mismatch between the object-level encryption and WAL.  I don't think it
> > is very valuable to use these options so reencryption will be easier.
> > In many cases, taking any object offline might cause the application to
> > fail, and having multiple encrypted data keys active will allow key
> > replacement to be done on an as-needed basis.
> >
> 
> Summarizing the design discussion so far and the discussion I had at
> PGCon, there are several basic design items here. Each of them is
> loosely related and there are trade-off.
> 
> 1. Encryption Levels.
> As Bruce suggested there are 6 levels.  The fine-grained control will
> help to suppress the performance overhead for tables that we don't
> actually need to encrypt. Even in terms of security it might help,
> since we don't give the key to users who don't or cannot access
> encrypted tables. But whichever level we choose, we can protect
> data from attacks that bypass PostgreSQL's ACLs, such as reading database
> files directly, as long as we encrypt data inside the database. The threats
> we want to protect against have already gotten consensus so far, I think.

I think level 6 is an obvious must-have.  I think the big question is
whether we gain enough by implementing levels 3-5 compared to the
complexity of the code and user interface.  

The big question is how many people will be mixing encrypted and
unencrypted data in the same cluster, and care about performance?  Just
because someone might care is not enough of a justification.  They can
certainly create separate encrypted and non-encrypted clusters. Can we
implement level 6 and then implement levels 3-5 later if desired?

> Among these levels, the tablespace level would be somewhat different
> from the others because it corresponds to physical directories rather than
> database objects. So in principle it's possible that tables are
> created on an encrypted tablespace while indexes are created on a
> non-encrypted tablespace, which does not make sense though. But having
> fewer encryption keys would be better for a simple architecture.

How would you configure the WAL to know which key to use if we did #5?
Wouldn't system tables and statistics, and perhaps referential integrity,
allow for information leakage?

> 2. Encryption Objects.
> Indexes, WAL, and TOAST tables pertaining to encrypted tables, and
> temporary files, must also be encrypted, but we need to discuss whether
> we encrypt non-user data as well, such as SLRU data, vm and fsm, and
> perhaps even other files such as 2PC state files, backup_label, etc.
> Encrypting everything is required by some use cases, but it's also true
> that there are users who wish to encrypt the database while minimizing
> performance overhead.

I don't think we need to encrypt the "status" files like SLRU data, vm
and fsm.

> 3. Encryption keys.
> The encryption level is relevant to the number of encryption keys
> we use. The database cluster level would use a single encryption key
> and can encrypt everything more easily, including non-user data such as
> xact WAL and SLRU data, with the same key. On the other hand, for instance,
> the table level would use multiple keys and can encrypt tables with
> different encryption keys. One advantage of having multiple keys in a
> database would be that encrypted database objects can be re-encrypted on
> an as-needed basis. For instance, in a multi-tenant architecture,
> stopping the database cluster would affect all services, but we can
> re-encrypt data one by one while minimizing the downtime of each service
> if we use multiple keys. Even in terms of security, having multiple
> keys helps diversify risk.

I agree we need a 2-tier key hierarchy.  See my pgcryptokey extension
as an example:

http://momjian.us/download/pgcryptokey/
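
To illustrate the general idea of a two-tier hierarchy (a minimal conceptual
sketch only; it uses pgcrypto for the wrapping and is not the pgcryptokey
extension's actual API, nor a proposal for the TDE key store -- names and
values are made up):

    -- a key-encryption key (KEK) wraps per-object data-encryption keys (DEKs);
    -- rotating the KEK only re-encrypts the small wrapped keys, not the data
    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    CREATE TABLE wrapped_keys (
        key_id  text PRIMARY KEY,
        wrapped bytea NOT NULL          -- DEK encrypted with the KEK
    );

    INSERT INTO wrapped_keys
    VALUES ('tablespace_1',
            pgp_sym_encrypt(encode(gen_random_bytes(32), 'hex'),
                            'kek-passphrase'));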

> Apart from the above discussion, there are assorted concerns about the
> fine-grained design. For WAL encryption, as a result of the discussion so
> far, I'm going to use the same encryption for WAL as that used for tables.
> Given that approach, utility commands that read WAL (pg_waldump and
> pg_rewind) would need to be able to obtain arbitrary encryption keys.
> pg_waldump might even require the encryption key of WAL whose table has
> already been dropped. As I discussed at PGCon[3], rearranging the WAL
> format would solve this issue, but it doesn't resolve the fundamental
> issue.

Good point about pg_waldump.  I am a little worried we m

Re: GiST limits on contrib/cube with dimension > 100?

2019-06-12 Thread Tom Lane
Andrey Borodin  writes:
>> On 9 June 2019, at 23:05, Siarhei Siniak  wrote:
>> I've been using the cube extension recompiled with
>> #define MAX_DIM 256.
>> But with version 11.3 I'm getting the following error:
>> failed to add item to index page in 

> So, you have changed the source code (removing the dimension constraint) and 
> get a cryptic error after that. In some way this is expected...

Yeah.  There is necessarily a limit on the size of index entries,
and it's getting exceeded.

> Nevertheless, we can improve the AddTuple messages. But there are no such 
> strict guidelines as with B-tree. Probably, tuples should not be bigger than 
> half of the usable page space.

I do not think "improve AddTuple messages" is going to be enough to fix
this.  Daniel was correct that this is the same problem seen in

https://www.postgresql.org/message-id/flat/AM6PR06MB57318C9882C021879DD4101EA3100%40AM6PR06MB5731.eurprd06.prod.outlook.com

See my reply there.  I think we should continue this discussion on that
thread, since it has seniority.

regards, tom lane




Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Apr-29, Tom Lane wrote:
>> Ashwin Agrawal  writes:
>>> Have we decided if this will be applied to back branches?

>> My feeling about it is "maybe eventually, but most definitely not
>> the week before a set of minor releases".  Some positive experience
>> with Greenplum would help increase confidence in the patch, for sure.

> I looked at the buildfarm failures for the recoveryCheck stage.  It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
> ...
> AFAICS this is wholly unrelated to the problem at hand.

Yeah, that seems unrelated.

> No other animal failed recoveryCheck test; before the commit, the
> failure was not terribly frequent, but rarely would 10 days go by
> without it failing.  So I suggest that the bug has indeed been fixed.

I feel pretty good about it too.

> Maybe now's a good time to get it back-patched?

Should we do that now, or wait till after next week's releases?

regards, tom lane




Re: hyrax vs. RelationBuildPartitionDesc

2019-06-12 Thread Robert Haas
On Tue, Jun 11, 2019 at 1:57 PM Tom Lane  wrote:
> I think the existing code is horribly ugly and this is even worse.
> It adds cycles to RelationDecrementReferenceCount which is a hotspot
> that has no business dealing with this; the invariants are unclear;
> and there's no strong reason to think there aren't still cases where
> we accumulate lots of copies of old partition descriptors during a
> sequence of operations.  Basically you're just doubling down on a
> wrong design.

I don't understand why a function that decrements a reference count
shouldn't be expected to free things if the reference count goes to
zero.

Under what workload do you think adding this to
RelationDecrementReferenceCount would cause a noticeable performance
regression?

I think the change is responsive to your previous complaint that the
timing of stuff getting freed is not very well pinned down.  With this
change, it's much more tightly pinned down: it happens when the
refcount goes to 0.  That is definitely not perfect, but I think that
it is a lot easier to come up with scenarios where the leak
accumulates because no cache flush happens while the refcount is 0
than it is to come up with scenarios where the refcount never reaches
0.  I agree that the latter type of scenario probably exists, but I
don't think we've come up with one yet.

> As I said upthread, my current inclination is to do nothing in this
> area for v12 and then try to replace the whole thing with proper
> reference counting in v13.  I think the cases where we have a major
> leak are corner-case-ish enough that we can leave it as-is for one
> release.

Is this something you're planning to work on yourself?  Do you have a
design in mind?  Is the idea to reference-count the PartitionDesc?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Alvaro Herrera
On 2019-Jun-12, Tom Lane wrote:

> Alvaro Herrera  writes:

> > Maybe now's a good time to get it back-patched?
> 
> Should we do that now, or wait till after next week's releases?

IMO this has been hammered enough in master, and the back branches will
still get a few days of buildfarm coverage before the releases, so it's
okay to do it now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: hyrax vs. RelationBuildPartitionDesc

2019-06-12 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 11, 2019 at 1:57 PM Tom Lane  wrote:

> I think the change is responsive to your previous complaint that the
> timing of stuff getting freed is not very well pinned down.  With this
> change, it's much more tightly pinned down: it happens when the
> refcount goes to 0.  That is definitely not perfect, but I think that
> it is a lot easier to come up with scenarios where the leak
> accumulates because no cache flush happens while the refcount is 0
> than it is to come up with scenarios where the refcount never reaches
> 0.  I agree that the latter type of scenario probably exists, but I
> don't think we've come up with one yet.

I don't know why you think that's improbable, given that the changes
around PartitionDirectory-s cause relcache entries to be held open much
longer than before (something I've also objected to on this thread).

>> As I said upthread, my current inclination is to do nothing in this
>> area for v12 and then try to replace the whole thing with proper
>> reference counting in v13.  I think the cases where we have a major
>> leak are corner-case-ish enough that we can leave it as-is for one
>> release.

> Is this something you're planning to work on yourself?

Well, I'd rather farm it out to somebody else, but ...

> Do you have a
> design in mind?  Is the idea to reference-count the PartitionDesc?

What we discussed upthread was refcounting each of the various
large sub-objects of relcache entries, not just the partdesc.
I think if we're going to go this way we should bite the bullet and fix
them all.  I really want to burn down RememberToFreeTupleDescAtEOX() in
particular ... it seems likely to me that that's also a source of
unpleasant memory leaks.

regards, tom lane




Re: Quitting the thes

2019-06-12 Thread Guillaume Lelarge
On Wed, 12 Jun 2019 at 20:45, Alvaro Herrera  wrote:

> On 2019-May-14, Michael Paquier wrote:
>
> > Thanks, committed.  I have noticed an extra one in reorderbuffer.c.
>
> Some grepping found a bit more; patch attached.
>
>
Thanks a lot, this is very good. I've got some fixes to do :)


-- 
Guillaume.


Re: PG 12 draft release notes

2019-06-12 Thread Bruce Momjian
On Tue, May 21, 2019 at 02:22:53PM -0700, Peter Geoghegan wrote:
> On Tue, May 21, 2019 at 1:51 PM Bruce Momjian  wrote:
> > > My concern here (which I believe Alexander shares) is that it doesn't
> > > make sense to group these two items together. They're two totally
> > > unrelated pieces of work. Alexander's work does more or less help with
> > > lock contention with writes, whereas the feature that that was merged
> > > with is about preventing index bloat, which is mostly helpful for
> > > reads (it helps writes to the extent that writes are also reads).
> > >
> > > The release notes go on to say that this item "gives better
> > > performance for UPDATEs and DELETEs on indexes with many duplicates",
> > > which is wrong. That is something that should have been listed below,
> > > under the "duplicate index entries in heap-storage order" item.
> >
> > OK, I understand how the lock stuff improves things, but I have
> > forgotten how indexes are made smaller.  Is it because of better page
> > split logic?
> 
> That is clearly the main reason, though suffix truncation (which
> represents that trailing/suffix columns in index tuples from branch
> pages have "negative infinity" sentinel values) also contributes to
> making indexes smaller.
> 
> The page split stuff was mostly added by commit fab250243 ("Consider
> secondary factors during nbtree splits"), but commit f21668f32 ("Add
> "split after new tuple" nbtree optimization") added to that in a way
> that really helped the TPC-C indexes. The TPC-C indexes are about 40%
> smaller now.

First, my apologies for getting to this so late.  Peter Geoghegan
supplied me with slides and a video to study, and I now understand how
complex the btree improvements are.  Here is a video of Peter's
presentation at PGCon:

https://www.youtube.com/watch?v=p5RaATILoiE

What I would like to do is to type them out here, and if I got it right,
I can then add these details to the release notes.

The over-arching improvement to btree in PG 12 is the ordering of index
entries by tid so all entries are unique.  As Peter has stated, many
optimizations come out of that:

1.  Since all index tuples are ordered, you can move from one leaf page
to the next without keeping a lock on the internal page that references
it, increasing concurrency.

2.  When inserting a duplicate value in the index, we used to try a few
pages to see if there was space, then "get tired" and just split a page
containing duplicates.  Now that there are no duplicates (because
duplicate key fields are sorted by tid) the system knows exactly what
page the index tuple belongs on, and inserts or splits the page as
necessary.

3.  Pivot tuples are used on internal pages and as min/max indicators on
leaf pages.  These pivot tuples are now trimmed if their trailing key
fields are not significant.  For example, if an index is
field1/field2/field3, and the page contains values for which field1==5
and none where field1==6, there is no need to include field2 and field3
in the pivot tuple --- it can just list the pivot as field1==5,
field2=infinity.  This is called suffix truncation.

Page splits used to just split the page in half, which minimizes the
number of page splits, but sometimes causes lots of wasted space.  The
new code tries to split to reduce the length of pivot tuples, which ends
up being more efficient in space savings because the pivot tuples are
shorter, and the leaf pages end up being more tightly packed.  This is
particularly true for ever-increasing keys because we often end up
creating a new empty page, rather than splitting an existing page.
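
In SQL terms, the field1/field2/field3 example corresponds to something like
this (illustration only; table and index names are made up):

    CREATE TABLE t (field1 int, field2 int, field3 int);
    CREATE INDEX t_idx ON t (field1, field2, field3);
    -- with many rows where field1 = 5 and none where field1 = 6, the pivot
    -- tuple separating those leaf pages can be truncated to field1 = 5 plus
    -- a sentinel value: field2 and field3 need not be stored in it, and the
    -- smarter split point choice keeps the field1 = 5 pages tightly packed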

4.  Vacuum's removal of index tuples in indexes with many duplicates is
faster since it can more quickly find desired tuples.

Did I cover everything?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: pgbench rate limiting changes transaction latency computation

2019-06-12 Thread Alvaro Herrera
On 2019-Jun-12, Heikki Linnakangas wrote:

> That was my thought too, when looking at this example. When there is already
> a long queue of transactions, the txtime of individual transactions doesn't
> matter much. The most important thing under that condition is how fast the
> system can dissolve the queue (or how fast it builds up even more). So the
> derivative of the lag or lat seems like the most important figure. We don't
> print exactly that, but it's roughly the same as the TPS. Jitter experienced
> by the user matters too, i.e. stddev of 'lat'.

It's funny that you mention taking the derivative of lat or lag, because
that suggests that these numbers should not be merely printed on the
screen but rather produced in a way that's easy for a database to
consume.  Then you can just write the raw numbers and provide a set of
pre-written queries that generate whatever numbers the user desires.
We already have that ... but we don't provide any help on actually using
those log files -- there aren't instructions on how to import that into
a table, or what queries could be useful.
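
As a sketch of what such help could look like (assuming the per-transaction
log format documented for pgbench -l, i.e. client_id, transaction_no, time,
script_no, time_epoch, time_us and, when --rate is in use, schedule_lag; the
table name and file path here are made up):

    CREATE TABLE pgbench_log (
        client_id      int,
        transaction_no bigint,
        latency_us     bigint,     -- the "time" column: transaction latency
        script_no      int,
        time_epoch     bigint,     -- Unix epoch second of completion
        time_us        bigint,     -- microsecond part of completion time
        schedule_lag   bigint      -- only written when --rate is used
    );

    COPY pgbench_log FROM '/tmp/pgbench_log.12345'
        WITH (FORMAT text, DELIMITER ' ');

    -- per-second aggregates, roughly what --progress prints, plus average lag
    SELECT time_epoch AS second,
           count(*) AS tps,
           round(avg(latency_us) / 1000.0, 3)   AS lat_ms,
           round(avg(schedule_lag) / 1000.0, 3) AS lag_ms
    FROM pgbench_log
    GROUP BY time_epoch
    ORDER BY second;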

Maybe that's a useful direction to move towards?  I think the console
output is good for getting a gut-feeling of the test, but not for actual
data analysis.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Jun-12, Tom Lane wrote:
>> Should we do that now, or wait till after next week's releases?

> IMO this has been hammered enough in master, and we still have a few
> days in the back-branches for buildfarm, that it's okay to do it now.

Poking at that, I find that a1a789eb5 back-patches reasonably painlessly
into v11 and v10, but trying to bring it back to 9.6 encounters a pile of
merge failures.  Also, looking at the git logs shows that we did a hell
of a lot of subtle work on that code (libpqwalreceiver.c in particular)
during the v10 cycle.  So I've got no confidence that successful
buildfarm/beta1 testing of the HEAD patch means much of anything for
putting it into pre-v10 branches.

Given that we've seen few if any field reports of this issue, my
inclination is to back-patch as far as v10, but not take the risk
and effort involved in going further.

regards, tom lane




Re: PG 12 draft release notes

2019-06-12 Thread Bruce Momjian
On Tue, May 21, 2019 at 12:57:56PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-21 15:47:34 -0400, Bruce Momjian wrote:
> > On Mon, May 20, 2019 at 03:17:19PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > Note that I've added a few questions to individuals involved with
> > > specific points. If you're in the To: list, please search for your name.
> > > 
> > > 
> > > On 2019-05-11 16:33:24 -0400, Bruce Momjian wrote:
> > > > I have posted a draft copy of the PG 12 release notes here:
> > > >
> > > > http://momjian.us/pgsql_docs/release-12.html
> > > > They are committed to git.
> > > 
> > > Thanks!
> > > 
> > >   Migration to Version 12
> > > 
> > > There's a number of features in the compat section that are more general
> > > improvements with a side of incompatibility. Won't it be confusing to
> > > e.g. have have the ryu floating point conversion speedups in the compat
> > > section, but not in the "General Performance" section?
> > 
> > Yes, it can be.  What I did with the btree item was to split out the max
> > length change with the larger changes.  We can do the same for other
> > items.  As you rightly stated, it is for cases where the incompatibility
> > is minor compared to the change.  Do you have a list of the ones that
> > need this treatment?
> 
> I was concretely thinking of:
> - floating point output changes, which are primarily about performance

If we split out the compatibility change, we don't have much left but
"performance", and that doesn't seem long enough for an entry.

> - recovery.conf changes where I'd merge:
>   - Do not allow multiple different recovery_target specifications
>   - Allow some recovery parameters to be changed with reload
>   - Cause recovery to advance to the latest timeline by default
>   - Add an explicit value of current for guc-recovery-target-time
>   into on entry on the feature side.
> 
> After having to move recovery settings to a different file, disallowing
> multiple targets isn't really a separate config break imo. And all the
> other changes are also fallout from the recovery.conf GUCification.

Even though we moved the recovery.conf values into postgresql.conf, I
think people will assume they just behave the same and copy them into
the new file.  If their behavior changes, they need to know that
explicitly.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: PG 12 draft release notes

2019-06-12 Thread Bruce Momjian
On Wed, May 22, 2019 at 04:50:14PM +0900, Ian Barwick wrote:
> On 5/22/19 4:26 PM, Michael Paquier wrote:
> > On Wed, May 22, 2019 at 09:19:53AM +0900, Ian Barwick wrote:
> > > the last two items are performance improvements not related to 
> > > authentication;
> > > presumably the VACUUM item would be better off in the "Utility Commands"
> > > section and the TRUNCATE item in "General Performance"?
> > 
> > I agree with removing them from authentication, but these are not
> > performance-related items.  Instead I think that "Utility commands" is
> > a place where they can live better.
> > 
> > I am wondering if we should insist on the DOS attacks on a server, as
> > non-authorized users are basically able to block any tables, and
> > authorization is only a part of it, one of the worst parts
> > actually...  Anyway, I think that "This prevents unauthorized locking
> > delays." does not provide enough details.  What about reusing the
> > first paragraph of the commits?  Here is an idea:
> > "A caller of TRUNCATE/VACUUM/ANALYZE could previously queue for an
> > access exclusive lock on a relation it may not have permission to
> > truncate/vacuum/analyze, potentially interfering with users authorized
> > to work on it.  This could prevent users from accessing some relations
> > they have access to, in some cases preventing authentication if a
> > critical catalog relation was blocked."
> 
> Ah, if that's the intent behind/use for those changes (I haven't looked at 
> them
> in any detail, was just scanning the release notes) then it certainly needs 
> some
> explanation along those lines.

Since we did not backpatch this fix, I am hesitant to spell out exactly
how to exploit this DOS attack.  Yes, people can read it in the email
archives, and commit messages, but I don't see the value in spelling it
out in the release notes too.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: PG 12 draft release notes

2019-06-12 Thread Bruce Momjian
On Tue, May 28, 2019 at 08:58:23AM -0700, Andres Freund wrote:
> 
> Add function pg_partition_root()
> to return top-most parent of a partition tree (Michaël Paquier)
> 
> Add function pg_partition_ancestors()
> to report all ancestors of a partition (Álvaro Herrera)
> 
> Add function pg_partition_tree()
> to display information about partitions (Amit Langote)
> 
> Can we combine these three?

Good idea, done.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: POC: Cleaning up orphaned files using undo logs

2019-06-12 Thread Robert Haas
On Mon, May 27, 2019 at 5:59 AM Amit Kapila  wrote:
> Apart from fixing the above comments, the patch is rebased on latest
> undo patchset.  As of now, I have split the binaryheap.c changes into
> a separate patch.  We are stilll enhancing the patch to compute
> oldestXidHavingUnappliedUndo which touches various parts of patch, so
> splitting further without completing that can make it a bit difficult
> to work on that.

Some review comments around execute_undo_actions:

The 'nopartial' argument to execute_undo_actions is confusing.  First,
it would probably be worth spelling it out instead of abbreviating it:
not_partial_transaction rather than nopartial.  Second, it is usually
better to phrase parameter names in terms of what they are rather than
in terms of what they are not: complete_transaction rather than
not_partial_transaction.  Third, it's unclear from these comments why
we'd be undoing something other than a complete transaction.  It looks
as though the answer is that this flag will be false when we're
undoing a subxact -- in which case, why not invert the sense of the
flag and call it 'bool subxact'?  I might be wrong, but it seems like
that would be a whole lot clearer. Fourth, the block at the top of
this function, guarded by nopartial, seems like it must be vulnerable
to race conditions.  If we're undoing the complete transaction, then
it checks whether UndoFetchRecord() still returns anything.  But that
could change not just at the beginning of the function, but also any
time in the middle, or so it seems to me.  I doubt that this is the
right level at which to deal with this sort of interlocking. I think
there should be some higher-level mechanism that prevents two
processes from trying to undo the same transaction at the same time,
like a heavyweight lock or some kind of flag in the shared memory data
structure that keeps track of pending undo, so that we never even
reach this code unless we know that this XID needs undo work and no
other process is already doing it.  If you're the only one undoing XID
123456, then there shouldn't be any chance of the undo disappearing
from underneath you.  And we definitely want to guarantee that only
one process is undoing any given XID at a time.

The 'blk_chain_complete' variable which is set in this function and
passed down to execute_undo_actions_page() and then to the rmgr's
rm_undo callback also looks problematic.  First, not every AM that
uses undo may even have the notion of a 'block chain'; zedstore for
example uses TIDs as a 48-bit integer, not a block + offset number, so
it's really not going to have a 'block chain.'  Second, even in
zheap's usage, it seems to me that the block chain could be complete
even when this gets set to false. It gets set to true when we're
undoing a toplevel transaction (not a subxact) and we were able to
fetch all of the undo for that toplevel transaction. But even if
that's not true, the chain for an individual block could still be
complete, because all the remaining undo for the block at issue
might've been in the chunk of undo we already read; the remaining undo
could be for other blocks.  For that reason, I can't see how the zheap
code that relies on this value can be correct; it uses this value to
decide whether to stick zeroes in the transaction slot, but if the
scenario described above happened, then I suppose the XID would not
get cleared from the slot during undo.  Maybe zheap is just relying on
that being harmless, since if all of the undo actions have been
correctly executed for the page, the fact that the transaction slot is
still bogusly used by an aborted xact won't matter; nothing will refer
to it. However, it seems to me that it would be better for zheap to
set things up so that the first undo record for a particular txn/page
combination is flagged in some way (in the payload!) so that undo can
zero the slot if the action being undone is the one that claimed the
slot.  That seems cleaner on principle, and it also avoids having
supposedly AM-independent code pass down details that are driven by
zheap's particular needs.

While it's probably moot since I think this code should go away
anyway, I find it poor style to write something like:

+ if (nopartial && !UndoRecPtrIsValid(urec_ptr))
+ blk_chain_complete = true;
+ else
+ blk_chain_complete = false;

"if (x) y = true; else y = false;" can be more compactly written as "y
= x;", like this:

blk_chain_complete = nopartial && !UndoRecPtrIsValid(urec_ptr);

I think that the signature for rm_undo can be simplified considerably.
I think blk_chain_complete should go away for the reasons discussed
above.  Also, based on our conversations with Heikki at PGCon, we
decided that we should not presume that the AM wants the records
grouped by block, so the blkno argument should go away.  In addition,
I don't see much reason to have a first_idx argument. Instead of
passing a pointer to the caller's entire array and telling the
callback where to start looking, couldn't w

Re: PG 12 draft release notes

2019-06-12 Thread Bruce Momjian
On Fri, Jun  7, 2019 at 12:04:33PM +0200, Adrien Nayrat wrote:
> On 5/11/19 10:33 PM, Bruce Momjian wrote:
> > I have posted a draft copy of the PG 12 release notes here:
> > 
> > http://momjian.us/pgsql_docs/release-12.html
> > 
> > They are committed to git.  It includes links to the main docs, where
> > appropriate.  Our official developer docs will rebuild in a few hours.
> > 
> 
> Hello,
> 
> Looking at a user request to add Greek to our FTS [1], I suggest mentioning
> which languages have been added in fd582317e.
> 
> Patch attached.
> 
> Should I also mention these changes?
> 
> > These all work in UTF8, and the indonesian and irish ones also work in 
> > LATIN1.
> 
> > The non-UTF8 version of the hungarian stemmer now works in LATIN2 not 
> > LATIN1.
> 
> 
> 1:
> https://www.postgresql.org/message-id/trinity-f09793a1-8c13-4b56-94fe-10779e96c87e-1559896268438%403c-app-mailcom-bs16

Good idea, done.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: PG 12 draft release notes

2019-06-12 Thread Peter Geoghegan
On Wed, Jun 12, 2019 at 1:16 PM Bruce Momjian  wrote:
> First, my apologies for getting to this so late.  Peter Geoghegan
> supplied me with slides and a video to study, and I now understand how
> complex the btree improvements are.  Here is a video of Peter's
> presentation at PGCon:
>
> https://www.youtube.com/watch?v=p5RaATILoiE

Thank you for going to the trouble of researching the B-Tree stuff in
detail! I wouldn't ask that of anybody in the position of writing
release notes, so it's appreciated. It is awkward to take the work
that I've done and make it into multiple bullet points; I have a hard
time with that myself.

> The over-arching improvement to btree in PG 12 is the ordering of index
> entries by tid so all entries are unique.

Right. Everything good happens as a direct or indirect result of the
TID-as-column thing. That is really the kernel of the whole thing,
because it means that the implementation now *fully* follows the
Lehman and Yao design.

> 1.  Since all index tuples are ordered, you can move from one leaf page
> to the next without keeping a lock on the internal page that references
> it, increasing concurrency.

I'm not sure what you mean here. We've never had to keep a lock on an
internal page while holding a lock on a leaf page -- such "lock
coupling" was necessary in earlier B-Tree designs, but Lehman and
Yao's algorithm avoids that. Of course, that isn't new.

I think that you're talking about the way that we now check the high
key during index scans, and find that we don't have to move to the
next leaf page, per this commit:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=29b64d1de7c77ffb5cb10696693e6ed8a6fc481c

All of the suffix truncation stuff is good because it makes separator
keys in internal pages smaller, but it's also good because the
separator keys are simultaneously more "discriminating". We tend to
get a nice "clean break" between leaf pages, so checking the high key
before moving right to find additional matches (once we've already
returned some tuples to the executor) is surprisingly effective. It
would have been possible to add this optimization even without the
suffix truncation stuff, but truncation makes it effective.

If you had to cut one thing from this list, then I would suggest that
it be this item. It's nice, but it's also very obvious, which makes it
hard to explain. I mean, why should there be any ambiguity at all?
Unless we have to return *hundreds* of items to the index scan, then a
simple "select * from foo where bar = ?" style query should only have
to visit one leaf page, even when the constant happens to be on the
boundary of a leaf page (maybe a concurrent page split could make this
happen, but that should be rare).

> 2.  When inserting a duplicate value in the index, we used to try a few
> pages to see if there was space, then "get tired" and just split a page
> containing duplicates.  Now that there are no duplicates (because
> duplicate key fields are sorted by tid) the system knows exactly what
> page the index tuple belongs on, and inserts or splits the page as
> necessary.

Right -- inserts will descend straight to the correct leaf page, and
the "getting tired" dance isn't used anymore. This makes insertions
faster, but more importantly is a better high level strategy for
storing lots of duplicates. We'll dirty far fewer pages, because
insertions automatically end up inserting around the same place as we
inserted to a moment ago. Insertions of duplicates behave like
serial/auto-incrementing insertions, which was already
fast/well-optimized in various ways.

It's easy to measure this by looking at index bloat when inserting
lots of duplicates -- I came up with the 16% figure in the talk based
on a simple insert-only test.
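
For anyone who wants to reproduce that kind of measurement, a minimal
sketch along these lines should work; the table and index names here are
invented for illustration:

-- Load a million rows spread over only ten distinct key values, then
-- check how large the index ends up.  Comparing the result on v11 and
-- v12 shows the space-utilization difference described above.
CREATE TABLE dup_test (id int);
CREATE INDEX dup_test_id_idx ON dup_test (id);

INSERT INTO dup_test
SELECT i % 10 FROM generate_series(1, 1000000) AS i;

SELECT pg_size_pretty(pg_relation_size('dup_test_id_idx'));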

> 3.  Pivot tuples are used on internal pages and as min/max indicators on
> leaf pages.  These pivot tuples are now trimmed if their trailing key
> fields are not significant.  For example, if an index is
> field1/field2/field3, and the page contains values for which field1==5
> and none that field1==6, there is no need to include field2 and field3
> in the pivot tuple --- it can just list the pivot as field1==5,
> field2=infinity.  This is called suffix truncation.

Right -- that's exactly how it works. Users may find that indexes with
lots of extra columns at the end won't get so bloated in Postgres 12.
Indexing many columns is typically seen when index-only scans are
important. Of course, you could have made those indexes INCLUDE
indexes on v11, which is actually a closely related idea, but that
makes it impossible to use the trailing/suffix columns in an ORDER BY.
And, you have to know about INCLUDE indexes, and remember to use them.

(This must be why Oracle can get away with not having INCLUDE indexes.)
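
To make the comparison concrete, here is a purely illustrative sketch
(the table and index names are invented) of the two alternatives being
contrasted:

-- Plain multi-column index: the trailing column can still drive ORDER BY,
-- and on v12 suffix truncation keeps the separator keys in internal
-- (branch) pages small.
CREATE TABLE orders (customer_id int, order_date date, amount numeric);
CREATE INDEX orders_key_idx ON orders (customer_id, order_date, amount);

-- INCLUDE index (available since v11): amount is stored only in leaf
-- tuples for index-only scans, but cannot be used for ordering.
CREATE INDEX orders_inc_idx ON orders (customer_id, order_date) INCLUDE (amount);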

> Page splits used to just split the page in half, which minimizes the
> number of page splits, but sometimes causes lots of wasted space.  The
> new code tries to split to reduce the lengt

Re: Question about some changes in 11.3

2019-06-12 Thread Tom Lane
Mat Arye  writes:
> On Mon, Jun 3, 2019 at 4:07 PM Tom Lane  wrote:
>> Hm.  I'd say this was already broken by the invention of
>> apply_scanjoin_target_to_paths; perhaps 11-before-11.3 managed to
>> still work for you, but it's not hard to envision applications of
>> set_rel_pathlist_hook for which it would not have worked.  The contract
>> for set_rel_pathlist_hook is supposed to be that it gets to editorialize
>> on all normal (non-Gather) paths created by the core code, and that's
>> no longer the case now that apply_scanjoin_target_to_paths can add more.
>> ...
>> I wonder whether we could fix matters by postponing the
>> set_rel_pathlist_hook call till later in the same cases where
>> we postpone generate_gather_paths, ie, it's the only baserel.

> Is it really guaranteed that `apply_scanjoin_target_to_paths` will not be
> called in other cases?

Well, apply_scanjoin_target_to_paths is called in *all* cases.  It only
throws away the original paths for partitioned rels, though.

I spent some more time looking at this, and I am afraid that my idea
of postponing set_rel_pathlist_hook into apply_scanjoin_target_to_paths
isn't going to work: there is not anyplace in that function where we
could call the hook without the API being noticeably different from
what it is at the current call site.  In particular, if we try to call
it down near the end so that it still has the property of being able
to remove any core-generated path, then there's a *big* difference for
queries involving SRFs: we've already plastered ProjectSetPath(s) atop
the original paths, and any user of the hook would have to know to do
likewise for freshly-generated paths.  That would certainly break
existing hook users.

I'm inclined to think that the safest course is to leave
set_rel_pathlist_hook as it stands, and invent a new hook that is called
in apply_scanjoin_target_to_paths just before the generate_gather_paths
call.  (Or, perhaps, just after that --- but the precedent of
set_rel_pathlist_hook suggests that "before" is more useful.)
For your use-case you'd have to get into both hooks, and they'd both have
to know that if they're dealing with a partitioned baserel that is the
only baserel in the query, the new hook is where to generate paths
rather than the old hook.  Maybe it'd be worth having the core code
export some simple test function for that, rather than having the details
of those semantics be wired into various extensions.

I think it'd be all right to put a patch done that way into the v11
branch.  It would not make anything any worse for code that uses
set_rel_pathlist_hook and is OK with the v11 behavior.  Code that
needs to use the new hook would fail to load into 11-before-11.whatever,
but that's probably better than loading and then doing the wrong thing.

regards, tom lane




Re: Should we warn against using too many partitions?

2019-06-12 Thread David Rowley
On Wed, 12 Jun 2019 at 17:49, Amit Langote  wrote:
> I noticed a typo:
>
> "...able to handle partition hierarchies up a few thousand partitions"
>
> s/up/up to/g
>
> I'm inclined to add one more word though, as:
>
> "...able to handle partition hierarchies with up to a few thousand partitions"
>
> or
>
> "...able to handle partition hierarchies containing up to a few
> thousand partitions"

Thanks for noticing that. I've pushed a fix.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PATCH] ruleutils: Fix subqueries with shadowed aliases

2019-06-12 Thread Tom Lane
Philip Dubé  writes:
> Discovered while looking into issue here: 
> https://github.com/citusdata/citus/pull/2733
> For completeness I'll quote the example code to demonstrate the issue:
> ...
> Where the 2nd join_alias should be renamed to join_alias_1

Good catch!  The proposed test case is less good though, because
it doesn't actually exercise the bug, ie the test case passes
with or without the code change.  (You also stuck it into the
middle of a bunch of not-very-related test cases.)  I adapted
your example into a better test case and pushed it.  Thanks
for the report and fix.

regards, tom lane




Re: Adaptive query optimization

2019-06-12 Thread Tomas Vondra

On Wed, Jun 12, 2019 at 06:14:41PM +0530, Kuntal Ghosh wrote:

Hello,

On Wed, Jun 12, 2019 at 5:06 PM Konstantin Knizhnik
 wrote:

On 12.06.2019 0:43, Tomas Vondra wrote:
I don't think "learning phase" is an issue, in fact I think that's
something we need to do - it ensures we have enough data to make good
decisions.

What is wrong with the learning phase is that it requires some DBA assistance:
somebody should determine when to start learning,
provide a relevant workload, and determine when learning is finished.
One of the most recent trends in DBMSes is autonomous databases with zero
administration effort.
It is especially important for clouds. And one of the main advantages of AQO is
that it allows queries to be optimized without human interaction.

But unfortunately I really do not know how to avoid the learning phase, especially
if we want to run queries on a replica.


Avoiding a learning phase in an AQO implementation sounds like an oxymoron. :-)
Perhaps you meant how we can minimize the effort in the learning phase. A
learning phase has its own complications - like
a. deciding the number of iterations needed to achieve a certain
kind of confidence
b. which parameters to tune (are the existing parameters enough?)
c. deciding the cost model
Coming up with answers for these things is pretty hard.



I kinda agree with both of you - the learning phase may be a significant
burden. But I don't think we can get rid of it entirely - we simply need
to collect the data to learn from somehow. But we should make it as
unobtrusive and easy to perform as possible.

My plan was to allow continuous learning during regular operation, i.e.
from workload generated by the application. So instead of requiring a
separate learning phase, we'd require a certain number of samples for a
given node before we start using it to correct estimates.

For example, we might require 1000 samples for a given node (say, scan
with some quals), before we start using it to tweak the estimates. Once
we get the number of estimates, we can continue collecting more data,
and once in a while update the correction. This would require some care,
of course, because we need to know what coefficient was used to compute
the estimate, but that's solvable by having some sort of epoch.

Of course, the question is what number should we use, but overall this
would be a much lower-overhead way to do the learning.

Unfortunately, the learning as implemented in the patch does not allow
this. It pretty much requires dedicated learning phase with generated
workload, in a single process.

But I think that's solvable, assuming we:

1) Store the data in shared memory, instead of a file. Collect data from
all backends, instead of just a single one, etc.

2) Make the decision for individual entries, depending on how many
samples we have for it.



I think that depends - some machine learning approaches are not that
bad. But I think there's a more serious issue - explainability. We need
a solution where we can explain/justify why it makes some decisions. I
really don't want a black box that produces numbers that you just need
to take at face value.

The good thing is that the simpler the method, the less expensive and
more explainable it is.

+1



I tried to create a much simpler version of AQO based on the auto_explain extension.
This extension provides all the necessary infrastructure to analyze statements with
long execution times.
I have added two new modes to auto_explain:
1. Auto generation of multicolumn statistics for variables used in clauses
with large estimation errors.


Interesting! I probably wouldn't consider this part of adaptive query
optimization, but it probably makes sense to make it part of this. I
wonder if we might improve this to also suggest "missing" indexes?
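
For context, what that auto_explain mode would automate is roughly what
can already be done by hand today; a hedged sketch with made-up table and
column names:

-- Strongly correlated columns typically lead to row-count underestimates
-- when both appear in a WHERE clause; extended statistics correct that.
CREATE TABLE cars (make text, model text);
-- ... load data ...
CREATE STATISTICS cars_make_model (ndistinct, dependencies)
  ON make, model FROM cars;
ANALYZE cars;

-- Compare the estimated and actual row counts before and after creating
-- the statistics object.
EXPLAIN ANALYZE SELECT * FROM cars WHERE make = 'Toyota' AND model = 'Corolla';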


I like this part of the implementation. I also agree that this can be
used to come up with good hypothetical index suggestions. But it
needs some additional algorithms. For example, after analysing a set
of queries, we can come up with a minimal set of indexes that need to
be created to minimize the total cost. I've not checked the internal
implementation of hypopg. Probably, I should do that.



I suggest we try to solve one issue at a time. I agree advising which
indexes to create is a very interesting (and valuable) thing, but I see
it as an extension of the AQO feature. That is, basic AQO (tweaking row
estimates) can work without it.



I think that it should be the next step of adaptive query optimization:
- autogeneration of indexes
- auto adjustment of optimizer cost parameters (CPU cost, random/sequential
page access cost, ...)

AFAIK, the need for adjustment of cost parameters is highly dominated
by solving the selectivity estimation errors. But of course, you can
argue with that.


That's probably true. But more to the point, it makes little sense to
tune cost parameters until the row estimates are fairly accurate. So I
think we should focus on getting that part working first, and then maybe
look in

Re: PG 12 draft release notes

2019-06-12 Thread Bruce Momjian
On Wed, Jun 12, 2019 at 03:06:34PM -0700, Peter Geoghegan wrote:
> On Wed, Jun 12, 2019 at 1:16 PM Bruce Momjian  wrote:
> > First, my apologies for getting to this so late.  Peter Geoghegan
> > supplied me with slides and a video to study, and I now understand how
> > complex the btree improvements are.  Here is a video of Peter's
> > presentation at PGCon:
> >
> > https://www.youtube.com/watch?v=p5RaATILoiE
> 
> Thank you for going to the trouble of researching the B-Tree stuff in
> detail! I wouldn't ask that of anybody in the position of writing
> release notes, so it's appreciated. It is awkward to take the work
> that I've done and make it into multiple bullet points; I have a hard
> time with that myself.

I had become so confused by this item that I needed a few weeks to
settle on what was actually going on.

> > The over-arching improvement to btree in PG 12 is the ordering of index
> > entries by tid so all entries are unique.
> 
> Right. Everything good happens as a direct or indirect result of the
> TID-as-column thing. That is really the kernel of the whole thing,
> because it means that the implementation now *fully* follows the
> Lehman and Yao design.

Right.

> > 1.  Since all index tuples are ordered, you can move from one leaf page
> > to the next without keeping a lock on the internal page that references
> > it, increasing concurrency.
> 
> I'm not sure what you mean here. We've never had to keep a lock on an
> internal page while holding a lock on a leaf page -- such "lock
> coupling" was necessary in earlier B-Tree designs, but Lehman and
> Yao's algorithm avoids that. Of course, that isn't new.
> 
> I think that you're talking about the way that we now check the high
> key during index scans, and find that we don't have to move to the
> next leaf page, per this commit:
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=29b64d1de7c77ffb5cb10696693e6ed8a6fc481c

I was wrong.  I was thinking of this commit:

commit d2086b08b0
Author: Alexander Korotkov 
Date:   Sat Jul 28 00:31:40 2018 +0300

Reduce path length for locking leaf B-tree pages during insertion

In our B-tree implementation appropriate leaf page for new tuple
insertion is acquired using _bt_search() function.  This function always
returns leaf page locked in shared mode.  In order to obtain exclusive
lock, caller have to relock the page.

This commit makes _bt_search() function lock leaf page immediately in
exclusive mode when needed.  That removes unnecessary relock and, in
turn reduces lock contention for B-tree leaf pages.  Our experiments
on multi-core systems showed acceleration up to 4.5 times in corner
case.

but I confused it with an optimization you mentioned in the video, where
you were talking about the need to perhaps recheck the internal page
when moving right.  We certainly don't keep the internal page locked.

> All of the suffix truncation stuff is good because it makes separator
> keys in internal pages smaller, but it's also good because the
> separator keys are simultaneously more "discriminating". We tend to
> get a nice "clean break" between leaf pages, so checking the high key
> before moving right to find additional matches (once we've already
> returned some tuples to the executor) is surprisingly effective. It
> would have been possible to add this optimization even without the
> suffix truncation stuff, but truncation makes it effective.
> 
> If you had to cut one thing from this list, then I would suggest that
> it be this item. It's nice, but it's also very obvious, which makes it
> hard to explain. I mean, why should there be any ambiguity at all?
> Unless we have to return *hundreds* of items to the index scan, then a
> simple "select * from foo where bar = ?" style query should only have
> to visit one leaf page, even when the constant happens to be on the
> boundary of a leaf page (maybe a concurrent page split could make this
> happen, but that should be rare).

Right.  The commit mentioned a 4.5x speedup in a rare benchmark, so I
added it lower on the list.

> > 2.  When inserting a duplicate value in the index, we used to try a few
> > pages to see if there was space, then "get tired" and just split a page
> > containing duplicates.  Now that there are no duplicates (because
> > duplicate key fields are sorted by tid) the system knows exactly what
> > page the index tuple belongs on, and inserts or splits the page as
> > necessary.
> 
> Right -- inserts will descend straight to the correct leaf page, and
> the "getting tired" dance isn't used anymore. This makes insertions
> faster, but more importantly is a better high level strategy for
> storing lots of duplicates. We'll dirty far fewer pages, because
> insertions automatically end up inserting around the same place as we
> inserted to a moment ago. Insertions of duplicates behave like
> serial/auto-incrementing insertions, which was already
> fast/well-optimize

Re: proposal: new polymorphic types - commontype and commontypearray

2019-06-12 Thread Tom Lane
Greg Stark  writes:
> The proposals I see above are "commontype", "supertype", "anycommontype",
> and various abbreviations of those. I would humbly add "compatibletype".
> Fwiw I kind of like commontype.
> Alternately an argument could be made that length and typing convenience
> isn't really a factor here since database users never have to type these
> types. The only place they get written is when defining polymorphic
> functions which is a pretty uncommon operation.
> In which case a very explicit "anycompatibletype" may be better.

I could go with "anycompatibletype".  That would lead us to needing
related names like "anycompatiblearraytype", which is getting annoyingly
long, but you might be right that people wouldn't have to type it that
often.

Also, given the precedent of "anyarray" and "anyrange", it might be
okay to make these just "anycompatible" and "anycompatiblearray".
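
To make the naming discussion concrete, here is a sketch of how a
polymorphic function might be declared if the shorter proposed names were
adopted; nothing below exists yet, it only illustrates the proposal:

-- "anycompatible" is the proposed pseudo-type: unlike anyelement, the
-- arguments would be resolved to a common type rather than required to
-- be identical, so mixing int and numeric would work.
CREATE FUNCTION common_max(a anycompatible, b anycompatible)
RETURNS anycompatible
LANGUAGE sql
AS 'SELECT greatest(a, b)';

SELECT common_max(1, 2.5);  -- both arguments coerced to numeric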

[ wanders away wondering if psql can tab-complete type names in
function definitions ... ]

regards, tom lane




Re: pg_dump: fail to restore partition table with serial type

2019-06-12 Thread Tom Lane
Alvaro Herrera  writes:
> There was indeed one more problem, that only the pg10 pg_upgrade test
> detected.  Namely, binary-upgrade dump didn't restore for locally
> defined constraints: they were dumped twice, first in the table
> definition and later by the ALTER TABLE ADD CONSTRAINT bit for binary
> upgrade that I had failed to notice.  Ooops.  The reason pg10 detected
> it and the other branches didn't, is that the only constraint of this
> ilk that remained after running regress was removed by 05bd889904e0 :-(

Seems like we'd better put back some coverage for that case, no?
But I'm confused by your reference to 05bd889904e0.  It looks like
that didn't change anything about tables that weren't getting dropped
anyhow.

regards, tom lane




Re: PG 12 draft release notes

2019-06-12 Thread Peter Geoghegan
On Wed, Jun 12, 2019 at 5:22 PM Bruce Momjian  wrote:
> I had become so confused by this item that I needed a few weeks to
> settle on what was actually going on.

I put a lot of time into my pgCon talk, especially on the diagrams.
Seems like that paid off. Even Heikki was confused by my explanations
at one point.

I should go add a similar diagram to our documentation, under "Chapter
63. B-Tree Indexes", because diagrams are the only sensible way to
explain the concepts.

> I was wrong.  I was thinking of this commit:
>
> commit d2086b08b0
> Author: Alexander Korotkov 
> Date:   Sat Jul 28 00:31:40 2018 +0300
>
> Reduce path length for locking leaf B-tree pages during insertion

> > If you had to cut one thing from this list, then I would suggest that
> > it be this item. It's nice, but it's also very obvious, which makes it
> > hard to explain.

> Right.  The commit mentioned a 4.5x speedup in a rare benchmark, so I
> added it lower on the list.

My remark about cutting an item referred to a lesser item that I
worked on (the 'Add nbtree high key "continuescan" optimization'
commit), not Alexander independent B-Tree work. I think that
Alexander's optimization is also quite effective. Though FWIW the 4.5x
improvement concerned a case involving lots of duplicates...cases with
a lot of duplicates will be far far better in Postgres 12. (I never
tested my patch without Alexander's commit, since it went in early in
the v12 cycle.)

> Yes, locality.

"Locality" is one of my favorite words.

> Attached is an updated patch.  I might have missed something, but I
> think it might be close.

This looks great. I do have a few things:

* I would put "Improve performance and space utilization of btree
indexes with many duplicates" first (before "Allow multi-column btree
indexes to be smaller"). I think that this is far more common than we
tend to assume, and is also where the biggest benefits are.

* The wording of the "many duplicates" item itself is almost perfect,
though the "...and inefficiency when VACUUM needed to find a row for
removal" part seems a bit off -- this is really about the
effectiveness of VACUUM, not the speed at which the VACUUM completes
(it's a bit faster, but that's not that important). Perhaps that part
should read: "...and often failed to efficiently recycle space made
available by VACUUM". Something like that.

* The "Allow multi-column btree indexes to be smaller" item is about
both suffix truncation, and about the "Split after new tuple"
optimization. I think that that makes it more complicated than it
needs to be. While the improvements that we saw with TPC-C on account
of the "Split after new tuple" optimization were nice, I doubt that
users will be looking out for it. I would be okay if you dropped any
mention of the "Split after new tuple" optimization, in the interest
of making the description more useful to users. We can just lose that.

* Once you simplify the item by making it all about suffix truncation,
it would make sense to change the single line summary to "Reduce the
number of branch blocks needed for multi-column indexes". Then go on
to talk about how we now only store those columns that are necessary
to guide index scans in tuples stored in branch pages (we tend to call
branch pages internal pages, but branch pages seems friendlier to me).
Note that the user docs of other database systems reference these
details, even in their introductory material on how B-Tree indexes
work. The term "suffix truncation" isn't something users have heard
of, and we shouldn't use it here, but the *idea* of suffix truncation
is very well established. As I mentioned, it matters for things like
covering indexes (indexes that are designed to be used by index-only
scans, which are not necessarily INCLUDE indexes).

Thanks!
--
Peter Geoghegan




Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-06-12 Thread David Rowley
On Mon, 10 Jun 2019 at 11:45, David Rowley  wrote:
>
> On Sat, 8 Jun 2019 at 04:51, Andres Freund  wrote:
> > David, any opinions on how to best fix this? It's not extremely obvious
> > how to do so best in the current setup of the partition actually being
> > hidden somewhere a few calls away (i.e. the table_open happening in
> > ExecInitPartitionInfo()).
>
> That's been overlooked. I agree it's not a bug with heap, since
> heapam_finish_bulk_insert() only does anything there when we're
> skipping WAL, which we don't do in copy.c for partitioned tables.
> However, who knows what other AMs will need, so we'd better fix that.
>
> My proposed patch is attached.
>
> I ended up moving the call to CopyMultiInsertInfoCleanup() down to
> after we call table_finish_bulk_insert for the main table. This might
> not be required but I lack imagination right now as to what AMs might put
> in the finish_bulk_insert call, so doing this seems safer.

Andres, do you want to look at this before I look again?

Do you see any issue with calling table_finish_bulk_insert() when the
partition's CopyMultiInsertBuffer is evicted from the
CopyMultiInsertInfo rather than at the end of the copy? It can mean
that we call the function multiple times per partition. I assume the
function is only really intended to flush bulk inserted tuple to the
storage, so calling it more than once would just mean an inefficiency
rather than a bug.

Let me know your thoughts.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: PG 12 draft release notes

2019-06-12 Thread Bruce Momjian
On Wed, Jun 12, 2019 at 06:34:27PM -0700, Peter Geoghegan wrote:
> > I was wrong.  I was thinking of this commit:
> >
> > commit d2086b08b0
> > Author: Alexander Korotkov 
> > Date:   Sat Jul 28 00:31:40 2018 +0300
> >
> > Reduce path length for locking leaf B-tree pages during insertion
> 
> > > If you had to cut one thing from this list, then I would suggest that
> > > it be this item. It's nice, but it's also very obvious, which makes it
> > > hard to explain.
> 
> > Right.  The commit mentioned a 4.5x speedup in a rare benchmark, so I
> > added it lower on the list.
> 
> My remark about cutting an item referred to a lesser item that I
> worked on (the 'Add nbtree high key "continuescan" optimization'
> commit), not Alexander's independent B-Tree work. I think that
> Alexander's optimization is also quite effective. Though FWIW the 4.5x
> improvement concerned a case involving lots of duplicates...cases with
> a lot of duplicates will be far far better in Postgres 12. (I never
> tested my patch without Alexander's commit, since it went in early in
> the v12 cycle.)

OK, good to know.

> > Attached is an updated patch.  I might have missed something, but I
> > think it might be close.
> 
> This looks great. I do have a few things:
> 
> * I would put "Improve performance and space utilization of btree
> indexes with many duplicates" first (before "Allow multi-column btree
> indexes to be smaller"). I think that this is far more common than we
> tend to assume, and is also where the biggest benefits are.

OK, done, I was wondering about that.

> * The wording of the "many duplicates" item itself is almost perfect,
> though the "...and inefficiency when VACUUM needed to find a row for
> removal" part seems a bit off -- this is really about the
> effectiveness of VACUUM, not the speed at which the VACUUM completes
> (it's a bit faster, but that's not that important). Perhaps that part
> should read: "...and often failed to efficiently recycle space made
> available by VACUUM". Something like that.

Ah, I see what you mean --- recycle entire pages.  I have updated the
patch.

> * The "Allow multi-column btree indexes to be smaller" item is about
> both suffix truncation, and about the "Split after new tuple"
> optimization. I think that that makes it more complicated than it
> needs to be. While the improvements that we saw with TPC-C on account
> of the "Split after new tuple" optimization were nice, I doubt that
> users will be looking out for it. I would be okay if you dropped any
> mention of the "Split after new tuple" optimization, in the interest
> of making the description more useful to users. We can just lose that.

OK, done.

> * Once you simplify the item by making it all about suffix truncation,
> it would make sense to change the single line summary to "Reduce the
> number of branch blocks needed for multi-column indexes". Then go on
> to talk about how we now only store those columns that are necessary
> to guide index scans in tuples stored in branch pages (we tend to call
> branch pages internal pages, but branch pages seems friendlier to me).
> Note that the user docs of other database systems reference these
> details, even in their introductory material on how B-Tree indexes
> work. The term "suffix truncation" isn't something users have heard
> of, and we shouldn't use it here, but the *idea* of suffix truncation
> is very well established. As I mentioned, it matters for things like
> covering indexes (indexes that are designed to be used by index-only
> scans, which are not necessarily INCLUDE indexes).

OK, I mentioned something about increased locality now.  Patch attached.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/doc/src/sgml/release-12.sgml b/doc/src/sgml/release-12.sgml
new file mode 100644
index 4a6c989..fcc49ff
*** a/doc/src/sgml/release-12.sgml
--- b/doc/src/sgml/release-12.sgml
*** Author: Tom Lane 
*** 606,627 
  

  
  
 
! Improve speed of btree index insertions (Peter Geoghegan,
! Alexander Korotkov)
 
  
 
! The new code improves the space-efficiency of page splits,
! reduces locking overhead, and gives better performance for
! UPDATEs and DELETEs on
! indexes with many duplicates.
 

  
--- 606,669 
  

  
  
 
! Improve performance and space utilization of btree indexes with
! many duplicates (Peter Geoghegan, Heikki Linnakangas)
 
  
 
! Previously, duplicate index entries were stored unordered within
! their duplicate groups.  This caused overhead during index
! inserts, wasted space due to excessive page splits, and reduced
! VACUUM's ability to recycle entire page

Re: PG 12 draft release notes

2019-06-12 Thread Peter Geoghegan
On Wed, Jun 12, 2019 at 7:29 PM Bruce Momjian  wrote:
> OK, I mentioned something about increased locality now.  Patch attached.

Looks good -- locality is a good catch-all term.

Thanks!
-- 
Peter Geoghegan




Re: PG 12 draft release notes

2019-06-12 Thread Bruce Momjian
On Wed, Jun 12, 2019 at 07:42:31PM -0700, Peter Geoghegan wrote:
> On Wed, Jun 12, 2019 at 7:29 PM Bruce Momjian  wrote:
> > OK, I mentioned something about increased locality now.  Patch attached.
> 
> Looks good -- locality is a good catch-all term.

Great, patch applied.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: pg_dump: fail to restore partition table with serial type

2019-06-12 Thread Alvaro Herrera
On 2019-Jun-12, Tom Lane wrote:

> Alvaro Herrera  writes:
> > There was indeed one more problem, that only the pg10 pg_upgrade test
> > detected.  Namely, binary-upgrade dump didn't restore for locally
> > defined constraints: they were dumped twice, first in the table
> > definition and later by the ALTER TABLE ADD CONSTRAINT bit for binary
> > upgrade that I had failed to notice.  Ooops.  The reason pg10 detected
> > it and the other branches didn't, is that the only constraint of this
> > ilk that remained after running regress was removed by 05bd889904e0 :-(
> 
> Seems like we'd better put back some coverage for that case, no?

I'll work on that.

> But I'm confused by your reference to 05bd889904e0.  It looks like
> that didn't change anything about tables that weren't getting dropped
> anyhow.

Ah ... yeah, I pasted the wrong commit ID.  That commit indeed removed
one occurrence of constraint check_b, but it wasn't the one that
detected the failure -- the one that did (also named check_b) was
removed by commit 6f6b99d1335b (pg11 only).

Commit aa56671836e6 (in pg10, two months after 05bd889904e0) changed
those tables afterwards so that they wouldn't be dropped. 

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Adaptive query optimization

2019-06-12 Thread Kuntal Ghosh
On Thu, Jun 13, 2019 at 5:49 AM Tomas Vondra
 wrote:
>
> For example, we might require 1000 samples for a given node (say, scan
> with some quals), before we start using it to tweak the estimates. Once
> we get the number of estimates, we can continue collecting more data,
> and once in a while update the correction. This would require some care,
> of course, because we need to know what coefficient was used to compute
> the estimate, but that's solvable by having some sort of epoch.
>
> Of course, the question is what number should we use, but overall this
> would be a much lower-overhead way to do the learning.
>
> Unfortunately, the learning as implemented in the patch does not allow
> this. It pretty much requires dedicated learning phase with generated
> workload, in a single process.
>
> But I think that's solvable, assuming we:
>
> 1) Store the data in shared memory, instead of a file. Collect data from
> all backends, instead of just a single one, etc.
>
> 2) Make the decision for individual entries, depending on how many
> samples we have for it.
>
Sounds good. I was trying to think whether we can maintain a running
coefficient. In that way, we don't have to store the samples. But,
calculating a running coefficient for more than two variables (with
some single pass algorithm) seems to be a hard problem. Moreover, it
can introduce significant misestimation. Your suggested approach works
better.

> I suggest we try to solve one issue at a time. I agree advising which
> indexes to create is a very interesting (and valuable) thing, but I see
> it as an extension of the AQO feature. That is, basic AQO (tweaking row
> estimates) can work without it.
>
+1

> >> Now, if someone uses this same scan in a join, like for example
> >>
> >>SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
> >> WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
> >>   AND (t2.x = ? AND t2.y = ?)
> >>
> >> then we can still apply the same correction to the t1 scan (I think).
> >> But then we can also collect data for the t1-t2 join, and compute a
> >> correction coefficient in a similar way. It requires a bit of care
> >> because we need to compensate for misestimates of inputs, but I think
> >> that's doable.
> >>
> >That'll be an interesting work. For the above query, we can definitely
> >calculate the correction coefficient of t1-t2 join given (t1.a = ? AND
> >t1.b = ? AND t1.c < ?) and
> >(t2.x = ? AND t2.y = ?) are true. But, I'm not sure how we can
> >extrapolate that value for t1-t2 join.
>
> I'm not sure I see the problem? Essentially, we need to know the sizes
> of the join inputs, i.e.
>
> t1 WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
>
> t2 WHERE (t2.x = ? AND t2.y = ?)
>
> (which we know, and we know how to correct the estimate), and then the
> selectivity of the join condition. Which we also know.
>
> Obviously, there's a chance those parts (clauses at the scan / join
> level) are correlated, which could make this less accurate.
This is exactly what my concern is. The base predicate selectivities
of t1 and t2 should have an impact on the calculation of the
correction coefficient. If those selectivities are low, the
misestimation (which is actual/estimate) should not affect the t1-t2
join correction coefficient much.
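
To keep the terminology concrete: the correction coefficient being
discussed is simply actual rows divided by estimated rows for a plan
node, which can be read off EXPLAIN ANALYZE output; the schema below is
purely illustrative:

-- Base scans: compare the planner's "rows=" estimate with the actual
-- row count for each input relation.
CREATE TABLE t1 (id int, a int, b int, c int);
CREATE TABLE t2 (id int, x int, y int);
-- ... load data and ANALYZE ...
EXPLAIN ANALYZE SELECT * FROM t1 WHERE a = 1 AND b = 2 AND c < 10;
EXPLAIN ANALYZE SELECT * FROM t2 WHERE x = 1 AND y = 2;

-- Join: its misestimate has to be compensated for the misestimates of
-- the two inputs before it can be attributed to the join clause itself.
EXPLAIN ANALYZE
SELECT * FROM t1 JOIN t2 ON t1.id = t2.id
WHERE a = 1 AND b = 2 AND c < 10 AND x = 1 AND y = 2;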


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel grouping sets

2019-06-12 Thread David Rowley
On Wed, 12 Jun 2019 at 14:59, Richard Guo  wrote:
> Implementation 1

> Parallel aggregation has already been supported in PostgreSQL and it is
> implemented by aggregating in two stages. First, each worker performs an
> aggregation step, producing a partial result for each group of which
> that process is aware. Second, the partial results are transferred to
> the leader via the Gather node. Finally, the leader merges the partial
> results and produces the final result for each group.
>
> We are implementing parallel grouping sets in the same way. The only
> difference is that in the final stage, the leader performs a grouping
> sets aggregation, rather than a normal aggregation.

Hi Richard,

I think it was you and I who discussed #1 at the unconference at PGCon 2
weeks ago. The good thing about #1 is that it can be implemented as
planner-only changes just by adding some additional paths and some
costing. #2 will be useful when we're unable to reduce the number of
inputs to the final aggregate node by doing the initial grouping.
However, since #1 is easier, then I'd suggest going with it first,
since it's the path of least resistance. #1 should be fine as long as
you properly cost the parallel agg and don't choose it when the number
of groups going into the final agg isn't reduced by the partial agg
node.  Which brings me to:

You'll need to do further work with the dNumGroups value. Since you're
grouping by all the columns/exprs in the grouping sets you'll need the
number of groups to be an estimate of that.

Here's a quick test I did that shows the problem:

create table abc(a int, b int, c int);
insert into abc select a,b,1 from generate_Series(1,1000)
a,generate_Series(1,1000) b;
create statistics abc_a_b_stats (ndistinct) on a,b from abc;
analyze abc;

-- Here the Partial HashAggregate really should estimate that there
will be 1 million rows.
explain analyze select a,b,sum(c) from abc group by grouping sets ((a),(b));
  QUERY PLAN
---
 Finalize HashAggregate  (cost=14137.67..14177.67 rows=2000 width=16)
(actual time=1482.746..1483.203 rows=2000 loops=1)
   Hash Key: a
   Hash Key: b
   ->  Gather  (cost=13697.67..14117.67 rows=4000 width=16) (actual
time=442.140..765.931 rows=100 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial HashAggregate  (cost=12697.67..12717.67 rows=2000
width=16) (actual time=402.917..526.045 rows=33 loops=3)
   Group Key: a, b
   ->  Parallel Seq Scan on abc  (cost=0.00..9572.67
rows=416667 width=12) (actual time=0.036..50.275 rows=33 loops=3)
 Planning Time: 0.140 ms
 Execution Time: 1489.734 ms
(11 rows)

but really, likely the parallel plan should not be chosen in this case
since we're not really reducing the number of groups going into the
finalize aggregate node.  That'll need to be factored into the costing
so that we don't choose the parallel plan when we're not going to
reduce the work in the finalize aggregate node. I'm unsure exactly how
that'll look. Logically, I think the choice to parallelize or not to
parallelize needs to be if (cost_partial_agg + cost_gather +
cost_final_agg < cost_agg) { do it in parallel } else { do it in
serial }.  If you build both a serial and parallel set of paths then
you should see which one is cheaper without actually constructing an
"if" test like the one above.

Here's a simple group by with the same group by clause items as you
have in the plan above that does get the estimated number of groups
perfectly.  The plan above should have the same estimate.

explain analyze select a,b,sum(c) from abc group by a,b;
 QUERY PLAN

 GroupAggregate  (cost=132154.34..152154.34 rows=100 width=16)
(actual time=404.304..1383.343 rows=100 loops=1)
   Group Key: a, b
   ->  Sort  (cost=132154.34..134654.34 rows=100 width=12) (actual
time=404.291..620.774 rows=100 loops=1)
 Sort Key: a, b
 Sort Method: external merge  Disk: 21584kB
 ->  Seq Scan on abc  (cost=0.00..15406.00 rows=100
width=12) (actual time=0.017..100.299 rows=100 loops=1)
 Planning Time: 0.115 ms
 Execution Time: 1412.034 ms
(8 rows)

Also, in the tests:

> insert into gstest select 1,10,100 from generate_series(1,100)i;
> insert into gstest select 1,10,200 from generate_series(1,100)i;
> insert into gstest select 1,20,30 from generate_series(1,100)i;
> insert into gstest select 2,30,40 from generate_series(1,100)i;
> insert into gstest select 2,40,50 from generate_series(1,100)i;
> insert into gstest select 3,50,60 from generate_series(1,100)i;
> insert into gstest select 1,NULL,000

Re: [PATCH] vacuumlo: print the number of large objects going to be removed

2019-06-12 Thread Timur Birsh
12.06.2019, 14:31, "Timur Birsh" :
> Please cc, I am not subscribed to the list.

I have subscribed to the mailing list, there is no need to cc me.

Thank you.




Re: fix psql \conninfo & \connect when using hostaddr

2019-06-12 Thread Noah Misch
On Wed, Jun 12, 2019 at 02:44:49PM +0200, Fabien COELHO wrote:
> there is no simple way to know whether hostaddr was set at
> the libPQ level

> A solution could be to have a PQdoestheconnectionuseshostaddr(conn)
> function, but I cannot say I'd be thrilled.

PQconninfo() is the official way to retrieve that.




Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Michael Paquier
On Wed, Jun 12, 2019 at 04:26:23PM -0400, Tom Lane wrote:
> Poking at that, I find that a1a789eb5 back-patches reasonably painlessly
> into v11 and v10, but trying to bring it back to 9.6 encounters a pile of
> merge failures.  Also, looking at the git logs shows that we did a hell
> of a lot of subtle work on that code (libpqwalreceiver.c in particular)
> during the v10 cycle.  So I've got no confidence that successful
> buildfarm/beta1 testing of the HEAD patch means much of anything for
> putting it into pre-v10 branches.
> 
> Given that we've seen few if any field reports of this issue, my
> inclination is to back-patch as far as v10, but not take the risk
> and effort involved in going further.

+1 for only a back-patch to v10 per the invasiveness argument.  I
think that you have made the right move here.
--
Michael


signature.asc
Description: PGP signature


Re: Cleaning up and speeding up string functions

2019-06-12 Thread David Rowley
On Thu, 6 Jun 2019 at 08:54, Alvaro Herrera  wrote:
>
> On 2019-May-26, David Rowley wrote:
>
> > On Sun, 26 May 2019 at 04:50, Tom Lane  wrote:
>
> > > Here the cost is code space rather than programmer-visible complexity,
> > > but I still doubt that it's worth it.
> >
> > I see on today's master the postgres binary did grow from 8633960
> > bytes to 8642504 on my machine using GCC 8.3, so you might be right.
> > pg_receivewal grew from 96376 to 96424 bytes.
>
> I suppose one place that could be affected visibly is JSON object
> construction (json.c, jsonfuncs.c) that could potentially deal with
> millions of stringinfo manipulations, but most of those calls don't
> actually use appendStringInfoString with constant values, so it's
> probably not worth bothering with.

We could probably get the best of both worlds by using a macro and
__builtin_constant_p() to detect if the string is a const, but I won't
be pushing for that unless I find something to make it worthwhile.

For patch 0004, I think it's likely worth revising it so that, instead of
adding a new function, it makes use of appendBinaryStringInfo() and passes
in the known length. That would likely mostly be for the xml.c calls.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




RE: libpq debug log

2019-06-12 Thread Iwata, Aya
Hi, 

I rebased my patch from head. Please find my attached patch.

Regards,
Aya Iwata


v5-libpq-PQtrace-output-one-line.patch
Description: v5-libpq-PQtrace-output-one-line.patch


RE: [PATCH] Speedup truncates of relation forks

2019-06-12 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> We do RelationTruncate() also when we truncate heaps that are created
> in the current transactions or has a new relfilenodes in the current
> transaction. So I think there is a room for optimization Thomas
> suggested, although I'm not sure it's a popular use case.

Right, and I can't think of a use case that motivates the optimization, either.


> I've not look at this patch deeply but in DropRelFileNodeBuffer I
> think we can get the min value of all firstDelBlock and use it as the
> lower bound of block number that we're interested in. That way we can
> skip checking the array during scanning the buffer pool.

That sounds reasonable, although I haven't examined the code, either.


> Don't we use each element of nblocks for each fork? That is, each
> fork uses an element at its fork number in the nblocks array and sets
> InvalidBlockNumber for invalid slots, instead of passing the valid
> number of elements. That way the following code that exists in many places,

I think the current patch tries to reduce the loop count in 
DropRelFileNodeBuffers() by passing the number of target forks.


Regards
Takayuki Tsunakawa


 


Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Michael Paquier
On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote:
> I looked at the buildfarm failures for the recoveryCheck stage.  It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
> AFAICS this is wholly unrelated to the problem at hand.

Yes, that's unrelated.

This failure is interesting; still, it has happened only once as far as
I can see.  I think that this points out a rare race condition in test
007_sync_rep.pl because of this sequence which reorders the standbys:
# Stop and start standbys to rearrange the order of standbys
# in WalSnd array. Now, if standbys have the same priority,
# standby2 is selected preferentially and standby3 is next.
$node_standby_1->stop;
$node_standby_2->stop;
$node_standby_3->stop;

$node_standby_2->start;
$node_standby_3->start;

The failure actually indicates that even if standby3 has started
after standby2, standby3 has registered back into the WAL sender array
of the primary before standby2 had the occasion to do it, but we have
no guarantee that the ordering is actually right.  So this has messed
up the expected set of sync standbys when these are not directly
listed.  I think that this can happen as well when initializing the
standbys at the top of the test, but the window is much, much
narrower and basically impossible to reach.

Using a combo of safe_psql('checkpoint') + wait_for_catchup() makes
the test faster, but that's much more costly than using just
poll_query_until() on pg_stat_replication to make sure that each
standby is registered on the primary's WAL sender array.  In short,
something like the attached should improve the stability of the test.

Thoughts?
--
Michael
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index bba47da17a..653a6bc842 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -27,6 +27,23 @@ sub test_sync_state
 	return;
 }
 
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print "### Waiting for standby \"$standby_name\" on \"$master_name\"\n";
+	$master->poll_query_until('postgres', $query);
+}
+
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
@@ -36,23 +53,27 @@ my $backup_name = 'master_backup';
 # Take backup
 $node_master->backup($backup_name);
 
+# Create all the standbys.  Their status on the primary is checked to
+# ensure the ordering of each one of them in the WAL sender array of the
+# primary.
+
 # Create standby1 linking to master
 my $node_standby_1 = get_new_node('standby1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_1->start;
+start_standby_and_wait($node_master, $node_standby_1);
 
 # Create standby2 linking to master
 my $node_standby_2 = get_new_node('standby2');
 $node_standby_2->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_2->start;
+start_standby_and_wait($node_master, $node_standby_2);
 
 # Create standby3 linking to master
 my $node_standby_3 = get_new_node('standby3');
 $node_standby_3->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_3->start;
+start_standby_and_wait($node_master, $node_standby_3);
 
 # Check that sync_state is determined correctly when
 # synchronous_standby_names is specified in old syntax.
@@ -82,8 +103,10 @@ $node_standby_1->stop;
 $node_standby_2->stop;
 $node_standby_3->stop;
 
-$node_standby_2->start;
-$node_standby_3->start;
+# Make sure that each standby reports back to the primary in
+# the wanted order.
+start_standby_and_wait($node_master, $node_standby_2);
+start_standby_and_wait($node_master, $node_standby_3);
 
 # Specify 2 as the number of sync standbys.
 # Check that two standbys are in 'sync' state.
@@ -94,7 +117,7 @@ standby3|3|sync),
 	'2(standby1,standby2,standby3)');
 
 # Start standby1
-$node_standby_1->start;
+start_standby_and_wait($node_master, $node_standby_1);
 
 # Create standby4 linking to master
 my $node_standby_4 = get_new_node('standby4');


signature.asc
Description: PGP signature


Re: [PATCH] vacuumlo: print the number of large objects going to be removed

2019-06-12 Thread Michael Paquier
Hi,

On Thu, Jun 13, 2019 at 10:49:46AM +0600, Timur Birsh wrote:
> 12.06.2019, 14:31, "Timur Birsh" :
>> Please cc, I am not subscribed to the list.
> 
> I have subscribed to the mailing list, there is no need to cc me.

Welcome.  Nice to see that you have subscribed to the lists.

Please note that we have some guidelines regarding the way patches are
submitted:
https://wiki.postgresql.org/wiki/Submitting_a_Patch
Based on what I can see with your patch, things are in good shape on
this side.

Now, if you want to get review for your patch, you should register it
in what we call the commit fest app, which is here: 
https://commitfest.postgresql.org/23/

Commit fests happen every two months for a duration of one month, and
the next one which will begin the development cycle of v13 begins on
the 1st of July.  As a basic rule, it is expected that for one patch
submitted, you should review another patch of equal difficulty to keep
some balance in the force.

Regarding the patch, there is an argument to be made for reporting a
rate as well as the actual numbers of deleted and to-delete items.

+   if (param->verbose)
+   {
+   snprintf(buf, BUFSIZE, "SELECT count(*) FROM vacuum_l");
+   res = PQexec(conn, buf);
That part is costly.

Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: PG 12 draft release notes

2019-06-12 Thread Michael Paquier
On Wed, Jun 12, 2019 at 05:25:37PM -0400, Bruce Momjian wrote:
> Since we did not backpatch this fix, I am hesitant to spell out exactly
> how to exploit this DOS attack.  Yes, people can read it in the email
> archives, and commit messages, but I don't see the value in spelling it
> out the release notes too.

We could go for a more general version of that, for the reason that it
involves all relations, like:
"A caller of TRUNCATE or VACUUM could previously queue for an access
exclusive lock on a relation it may not have permission to truncate or
vacuum, leading to relations being blocked from being accessed."
--
Michael


signature.asc
Description: PGP signature


Re: Quitting the thes

2019-06-12 Thread Michael Paquier
On Wed, Jun 12, 2019 at 02:45:27PM -0400, Alvaro Herrera wrote:
> Some grepping found a bit more; patch attached.

Indeed.  There were many more.  I just looked for them with something
like this:
find . -name "*.c" | xargs egrep "(\b[a-zA-Z]+) \1\b"

But I did not find any more spots.  Indentation is incorrect in
test_integerset.c.
--
Michael


signature.asc
Description: PGP signature


Re: check_recovery_target_lsn() does a PG_CATCH without a throw

2019-06-12 Thread Michael Paquier
On Wed, Jun 12, 2019 at 01:16:54PM +0200, Peter Eisentraut wrote:
> Right.  Here is a patch that addresses this by copying the relevant code
> from pg_lsn_in() and timestamptz_in() directly into the check hooks.
> It's obviously a bit unfortunate not to be able to share that code,
> but it's not actually that much.

+len1 = strspn(str, "0123456789abcdefABCDEF");
+if (len1 < 1 || len1 > MAXPG_LSNCOMPONENT || str[len1] != '/')
+return false;
+
+len2 = strspn(str + len1 + 1, "0123456789abcdefABCDEF");
+if (len2 < 1 || len2 > MAXPG_LSNCOMPONENT || str[len1 + 1 + len2] != '\0')
+return false;
Speaking about pg_lsn: we introduced it to reduce the amount of
duplication when mapping an LSN to text, so I am not much of a fan of
this patch, which adds duplication again.  You also lose some error
context, as you get the same type of error when parsing the first or
the second part of the LSN.  Couldn't you refactor the whole thing so
that an error string is available, as in GUC_check_errdetail()?  I would
just put a wrapper in pg_lsn.c, like pg_lsn_parse(), which returns a uint64.
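
For reference, the text format that both pg_lsn_in() and the check hook
have to parse is the familiar two-component hexadecimal LSN; a quick
illustration:

SELECT '16/B374D848'::pg_lsn;  -- valid: two hex components separated by a slash
SELECT 'garbage'::pg_lsn;      -- rejected with an invalid-input-syntax error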

The same remark applies to the timestamp_in portion..
--
Michael


signature.asc
Description: PGP signature