Re: pg_dump partitions can lead to inconsistent state after restore

2019-06-20 Thread Amit Langote
On Thu, Apr 25, 2019 at 4:01 AM Alvaro Herrera  wrote:
> So, while testing this I noticed that pg_restore fails with deadlocks if
> you do a parallel restore if the --load-via-partition-root switch was
> given to pg_dump.  Is that a known bug?

Was investigating --load-via-partition-root with a coworker and came
across the following note in the documentation:

https://www.postgresql.org/docs/11/app-pgdump.html

"It is best not to use parallelism when restoring from an archive made
with this option, because pg_restore will not know exactly which
partition(s) a given archive data item will load data into. This could
result in inefficiency due to lock conflicts between parallel jobs, or
perhaps even reload failures due to foreign key constraints being set
up before all the relevant data is loaded."

Apparently, this note was added as a result of the following discussion:

https://www.postgresql.org/message-id/flat/13624.1535486019%40sss.pgh.pa.us

So, while the documentation doesn't explicitly list deadlocks as a
possible risk, Tom hinted in the first email that they are possible.

I set out to reproduce one and was able to, although I'm not sure if
it's the same deadlock as seen by Alvaro.  Steps I used to reproduce:

# in the source database

create table foo (a int primary key);
insert into foo select generate_series(1, 100);
create table ht (a int) partition by hash (a);
select 'create table ht' || i || ' partition of ht for values with
(modulus 100, remainder ' || i -1 || ');' from generate_series(1, 100)
i;
\gexec
insert into ht select generate_series(1, 100);
alter table ht add foreign key (a) references foo (a);

# in shell
pg_dump --load-via-partition-root -Fd -f /tmp/dump
createdb targetdb
pg_restore -d targetdb -j 2 /tmp/dump

The last step reports deadlocks; in the server log:

ERROR:  deadlock detected
DETAIL:  Process 14213 waits for RowExclusiveLock on relation 17447 of
database 17443; blocked by process 14212.
Process 14212 waits for ShareRowExclusiveLock on relation
17507 of database 17443; blocked by process 14213.
Process 14213: COPY public.ht (a) FROM stdin;

Process 14212: ALTER TABLE public.ht
ADD CONSTRAINT ht_a_fkey FOREIGN KEY (a) REFERENCES public.foo(a);

Here, the process adding the foreign key has acquired the lock on the
parent and is trying to lock a partition to add the foreign key to it.
The process doing COPY (via the root) has apparently already locked the
partition and is waiting for the lock on the parent to do the actual
copying.  Looking into why the latter had acquired a lock on the
partition at all when it hadn't started copying yet, I noticed that the
partition was locked when TRUNCATE was executed on it earlier in the
same transaction as part of a WAL-related optimization, which is
something that only happens in parallel restore mode.  I was under the
impression that the TABLE
DATA archive item (its TocEntry) would have no trace of the partition
if it was dumped with --load-via-partition-root, but that's not the
case.  --load-via-partition-root only dictates that the command that
will be dumped for the item will use the root parent as COPY target,
even though the TocEntry itself is owned by the partition.

Maybe a way to prevent the deadlock would be for the process that will
copy into a given partition via the root parent to do a `LOCK TABLE
root_parent` before `TRUNCATE the_partition`, but we'll need to get
hold of the root parent from the TocEntry somehow.  It turns out it's
only present in TocEntry.copyStmt, from where it will have to be
parsed out.  Maybe that's the only thing we could do without breaking
the archive format, though.
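
For illustration, here's a rough standalone sketch (not pg_restore code;
the function name and the printed lock mode are only placeholders) of
pulling the root parent's name out of a copyStmt like the one in the
log above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* e.g. "COPY public.ht (a) FROM stdin;\n" -> "public.ht" */
static char *
copy_stmt_target(const char *copyStmt)
{
    const char *p;
    size_t      len;
    char       *result;

    if (strncmp(copyStmt, "COPY ", 5) != 0)
        return NULL;
    p = copyStmt + 5;
    len = strcspn(p, " (");     /* name ends at the first space or '(' */
    result = malloc(len + 1);
    if (result == NULL)
        return NULL;
    memcpy(result, p, len);
    result[len] = '\0';
    return result;
}

int
main(void)
{
    char       *root = copy_stmt_target("COPY public.ht (a) FROM stdin;\n");

    if (root != NULL)
        printf("LOCK TABLE %s;\n", root);   /* lock mode to be decided */
    free(root);
    return 0;
}

Real code would of course have to cope with quoted identifiers and so
on; this is only to show that the root parent is recoverable from the
copyStmt without changing the archive format.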

Thoughts?

By the way, I couldn't think of ways to reproduce any of the hazards
mentioned in the documentation of using parallel mode to restore an
archive written with pg_dump --load-via-partition-root, but maybe I just
haven't tried hard enough.

Thanks,
Amit




Re: clean up docs for v12

2019-06-20 Thread Michael Paquier
On Thu, Jun 20, 2019 at 07:34:10PM -0400, Alvaro Herrera wrote:
> This patch was applied as f73293aba4d4.  Thanks, Paul and Michael.

Thanks for the thread update, Alvaro.  I completely forgot to mention
the commit on this thread.
--
Michael


signature.asc
Description: PGP signature


Re: unlogged sequences

2019-06-20 Thread Michael Paquier
On Thu, Jun 20, 2019 at 09:30:34AM +0200, Peter Eisentraut wrote:
> The discussion in bug #15631 revealed that serial/identity sequences of
> temporary tables should really also be temporary (easy), and that
> serial/identity sequences of unlogged tables should also be unlogged.
> But there is no support for unlogged sequences, so I looked into that.

Thanks for doing so.

> If you copy the initial sequence relation file to the init fork, then
> this all seems to work out just fine.  Attached is a patch.  The
> low-level copying seems to be handled quite inconsistently across the
> code, so I'm not sure what the most appropriate way to do this would be.
>  I'm looking for feedback from those who have worked on tableam and
> storage manager to see what the right interfaces are or whether some new
> interfaces might perhaps be appropriate.

But the actual deal is that the sequence meta-data is now in
pg_sequences and not the init forks, no?  I have just done a small
test:
1) Some SQL queries:
create unlogged sequence popo;
alter sequence popo increment 2;
select nextval('popo');
select nextval('popo');
2) Then a hard crash:
pg_ctl stop -m immediate
pg_ctl start
3) Again, with a crash:
select nextval('popo'); 


#2  0x55ce60f3208d in ExceptionalCondition
(conditionName=0x55ce610f0570 "!(((PageHeader) (page))->pd_special >=
(__builtin_offsetof (PageHeaderData, pd_linp)))",
errorType=0x55ce610f0507 "FailedAssertion",
fileName=0x55ce610f04e0 "../../../src/include/storage/bufpage.h",
lineNumber=317) at assert.c:54
#3  0x55ce60b43200 in PageValidateSpecialPointer
(page=0x7ff7692b3d80 "") at
../../../src/include/storage/bufpage.h:317
#4  0x55ce60b459d4 in read_seq_tuple (rel=0x7ff768ad27e0,
buf=0x7ffc5707f0bc, seqdatatuple=0x7ffc5707f0a0) at
sequence.c:1213
#5  0x55ce60b447ee in nextval_internal (relid=16385,
check_permissions=true) at sequence.c:678
#6  0x55ce60b44533 in nextval_oid (fcinfo=0x55ce62537570) at sequence.c:607
--
Michael


signature.asc
Description: PGP signature


Re: clean up docs for v12

2019-06-20 Thread Alvaro Herrera
This patch was applied as f73293aba4d4.  Thanks, Paul and Michael.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: clean up docs for v12

2019-06-20 Thread Alvaro Herrera
On 2019-May-20, Paul A Jungwirth wrote:

> On Mon, May 20, 2019 at 9:44 PM Amit Langote
>  wrote:
> > This sounds more like an open item to me [1], not something that have to
> > be postponed until the next CF.
> >
> > [1] https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items
> 
> Oh sorry, I already created the CF entry. Should I withdraw it? I'll
> ask on -infra about getting editor permission for the wiki and add a
> note there instead.

You didn't actually ask, but I did it anyway.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Choosing values for multivariate MCV lists

2019-06-20 Thread Tomas Vondra

On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:

On Tue, 18 Jun 2019 at 21:59, Tomas Vondra  wrote:


The current implementation of multi-column MCV lists (added in this
cycle) uses a fairly simple algorithm to pick combinations to include in
the MCV list. We just compute a minimum number of occurrences, and then
include all entries sampled more often. See get_mincount_for_mcv_list().

[snip]

This however means that by choosing the MCV entries solely based on the
number of occurrences in the sample, we end up with MCV lists where the
vast majority of entries have a NULL street name.

That's why we got such a poor estimate in the example query, despite the
fact that the city/street combination is the most frequent in the data
set (with a non-NULL street name).



I think the fact that they're NULL is a bit of a red herring because
we're treating NULL just like any other value. The same thing would
happen if there were some other very common non-NULL value that
dominated the dataset.



I wasn't really suggesting the NULL is an issue - sorry if that wasn't
clear. It might be any other value, as long as it's very common (and
roughly uniform) in all cities. So yes, I agree with you here.


The other weird thing is that the frequency of NULL street names is fairly
uniform in the whole data set. In total about 50% of addresses match that,
and for individual cities it's generally between 25% and 100%, so the
estimate is less than 2x off in those cases.

But for addresses with non-NULL street names, the situation is very
different. Some street names are unique to a single city, etc.

Overall, this means we end up with an MCV list whose entries represent
the mostly-uniform part of the data set, instead of preferring the
entries that are truly skewed.

So I'm wondering if we could/should rethink the algorithm to look at
the frequency and base_frequency, and maybe pick entries based on their
ratio, or something like that.



Hmm, interesting. I think I would like to see a more rigorous
justification for changing the algorithm deciding which values to
keep.



Sure, I'm not going to pretend my proposals were particularly rigorous;
they were more a collection of random ideas.


If I've understood correctly, I think the problem is this: The
mincount calculation is a good way of identifying MCV candidates to
keep, because it ensures that we don't keep values that don't appear
sufficiently often to produce accurate estimates, and ideally we'd
keep everything with count >= mincount. However, in the case where
there are more than stats_target items with count >= mincount, simply
ordering by count and keeping the most commonly seen values isn't
necessarily the best strategy in the case of multivariate statistics.



Yes.


To identify what the best strategy might be, I think you need to
examine the errors that would occur as a result of *not* keeping a
value in the multivariate MCV list. Given a value that appears with
count >= mincount, N*freq ought to be a reasonable estimate for the
actual number of occurrences of that value in the table, and
N*base_freq ought to be a reasonable estimate for the
univariate-stats-based estimate that it would be given if it weren't
kept in the multivariate MCV list. So the absolute error resulting
from not keeping that value would be

 N * Abs(freq - base_freq)



Yeah. Considering N is the same for all groups in the sample, this
would mean the same thing as Abs(freq - base_freq).


But then I think we ought to take into account how often we're likely
to get that error. If we're simply picking values at random, the
likelihood of getting that value is just its frequency, so the average
absolute error would be

 Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )

which suggests that, if we wanted to reduce the average absolute error
of the estimates, we should order by freq*Abs(freq-base_freq) and keep
the top n in the MCV list.
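
For concreteness, here's a toy standalone sketch of that ordering (not
the extended-stats code itself; the numbers are made up):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct McvItem
{
    double      freq;           /* observed joint frequency in the sample */
    double      base_freq;      /* product of the per-column frequencies */
} McvItem;

static double
score(const McvItem *item)
{
    return item->freq * fabs(item->freq - item->base_freq);
}

static int
cmp_score_desc(const void *a, const void *b)
{
    double      sa = score((const McvItem *) a);
    double      sb = score((const McvItem *) b);

    if (sa < sb)
        return 1;
    if (sa > sb)
        return -1;
    return 0;
}

int
main(void)
{
    McvItem     items[] = {
        {0.50, 0.45},           /* common, but close to independence */
        {0.02, 0.0001},         /* rare, but badly misestimated without MCV */
        {0.10, 0.02},
    };
    int         n = sizeof(items) / sizeof(items[0]);

    qsort(items, n, sizeof(McvItem), cmp_score_desc);
    for (int i = 0; i < n; i++) /* keep the first stats_target entries */
        printf("freq=%.4f base_freq=%.4f score=%.6f\n",
               items[i].freq, items[i].base_freq, score(&items[i]));
    return 0;
}

Ordering by Abs(freq - base_freq) alone, or by the ratio discussed
below, would just be a different score() function.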



Interesting idea. But I'm not sure it makes much sense to assume the rows
are picked randomly - OTOH we don't really know anything about how the
data will be queried, so we may just as well do that.


On the other hand, if we wanted to reduce the average *relative* error
of the estimates, we might instead order by Abs(freq-base_freq).



Hmmm, yeah. I don't know what's the right choice here, TBH.


For example, we might sort the entries by

abs(freq - base_freq) / freq



I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
though, because that would seem likely to put too much weight on the
least commonly occurring values.



But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both

  ... (cost=... rows=1 ...) (actual=... rows=101 ...)

and

  ... (cost=... rows=100 ...) (actual=... rows=200 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much mor

Re: Multivariate MCV list vs. statistics target

2019-06-20 Thread Tomas Vondra

On Thu, Jun 20, 2019 at 08:08:44AM +0100, Dean Rasheed wrote:

On Tue, 18 Jun 2019 at 22:34, Tomas Vondra  wrote:


One slightly inconvenient thing I realized while playing with the
address data set is that it's somewhat difficult to set the desired size
of the multi-column MCV list.

At the moment, we simply use the maximum statistics target of the attributes
the MCV list is built on. But that does not allow keeping the default size
for per-column stats while only increasing the size of multi-column MCV lists.

So I'm thinking we should allow tweaking the statistics target for extended
stats, and serialize it in the pg_statistic_ext catalog. Any opinions on
why that would be a bad idea?



Seems reasonable to me. This might not be the only option we'll ever
want to add though, so perhaps a "stxoptions text[]" column along the
lines of a relation's reloptions would be the way to go.



I don't know - I kinda dislike the idea of stashing stuff like this into
text[] arrays unless there's a clear need for such flexibility (i.e. a
vision of having more such options), which I'm not sure is the case here.
And we kinda have a precedent in pg_attribute.attstattarget, so I'd use
the same approach here.


I suppose it should be part of the CREATE STATISTICS command, but I'm
not sure what the best syntax would be. We might also have something more
similar to ALTER COLUMN, but perhaps

 ALTER STATISTICS s SET STATISTICS 1000;

looks a bit too weird.



Yes it does look a bit weird, but that's the natural generalisation of
what we have for per-column statistics, so it's probably preferable to
do that rather than invent some other syntax that wouldn't be so
consistent.



Yeah, I agree.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Tom Lane
Andrew Gierth  writes:
> Tom's "fix" of backpatching 23bd3cec6 (which happened on Friday 14th)
> addressed only a subset of cases, as far as I know working only on Linux
> (the historical convention has always been for /etc/localtime to be a
> copy of a zonefile, not a symlink to one). I only decided to write (and
> if need be commit) my own followup fix after confirming that the bug was
> unfixed in a default FreeBSD install when set to UTC, and there was a
> good chance that a number of other less-popular platforms were affected
> too.

I think your info is out of date on that.

NetBSD uses a symlink, and has done for at least 5 years: see
set_timezone in
http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.sbin/sysinst/util.c?only_with_tag=MAIN

macOS seems to have done it like that for at least 10 years, too.
I didn't bother digging into their source repo, as it's likely that
System Preferences isn't open-source; but *all* of my macOS machines
have symlinks there, and some of those link files are > 10 years old.

I could not easily find OpenBSD's logic to set the zone during install,
if they have any; but at least their admin-facing documentation says to
create the file as a symlink:
https://www.openbsd.org/faq/faq8.html#TimeZone
and there are plenty of similar recommendations found by Mr. Google.

In short, I think FreeBSD are the holdouts, not the norm.  I note that
even their code will preserve /etc/localtime's symlink status if
it was a symlink to start with: see install_zoneinfo_file in
https://github.com/freebsd/freebsd/blob/master/usr.sbin/tzsetup/tzsetup.c

regards, tom lane




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Tom Lane
Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:
>  Tom> Keep in mind that dealing with whatever tzdb chooses to ship is
>  Tom> not optional from our standpoint. Even if we'd refused to import
>  Tom> 2019a, every installation using --with-system-tzdata (which, I
>  Tom> sincerely hope, includes most production installs) is going to
>  Tom> have to deal with it as soon as the respective platform vendor
>  Tom> gets around to shipping the tzdata update. So reverting that
>  Tom> commit was never on the table.

> Exactly. But that means that if the combination of our arbitrary rules
> and the data in the tzdb results in an undesirable result, then we have
> no real option but to fix our rules (we can't reasonably expect the tzdb
> upstream to choose zone names to make our alphabetical-order preference
> come out right).

My position is basically that having TimeZone come out as 'UCT' rather
than 'UTC' (affecting no visible behavior of the timestamp types, AFAIK)
was not such a grave problem as to require violating community norms
to get it fixed in this week's releases rather than the next batch.

I hadn't had time to consider your patch last week because I was (a)
busy with release prep and (b) sick as a dog.  I figured we could let
it slide and discuss it after the release work died down.  I imagine
the reason you got zero other responses was that nobody else thought
it was of life-and-death urgency either.

Anyway, as I said already, my beef is not with the substance of the
patch but with failing to follow community process.  One "yes" vote
and one "no" vote do not constitute consensus.  You had no business
assuming that I would reverse the "no" vote.

regards, tom lane




Re: Tweaking DSM and DSA limits

2019-06-20 Thread David Fetter
On Thu, Jun 20, 2019 at 02:20:27PM -0400, Robert Haas wrote:
> On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro  wrote:
> > It's currently set to 4, but I now think that was too cautious.  It
> > tries to avoid fragmentation by ramping up slowly (that is, memory
> > allocated and in some cases committed by the operating system that we
> > don't turn out to need), but it's pretty wasteful of slots.  Perhaps
> > it should be set to 2?
> 
> +1.  I think I said at the time that I thought that was too cautious...
> 
> > Perhaps also the number of slots per backend should be dynamic, so
> > that you have the option to increase it from the current hard-coded
> > value of 2 if you don't want to increase max_connections but find
> > yourself running out of slots (this GUC was a request from Andres but
> > the name was made up by me -- if someone has a better suggestion I'm
> > all ears).
> 
> I am not convinced that we really need to GUC-ify this.  How about
> just bumping the value up from 2 to say 5?  Between the preceding
> change and this one we ought to buy ourselves more than 4x, and if
> that is not enough then we can ask whether raising max_connections is
> a reasonable workaround,

Is there perhaps a way to make raising max_connections not require a
restart? There are plenty of situations out there where restarts
aren't something that can be done on a whim.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 >> I was referring to the fact that the regression was introduced by a,
 >> presumably important, tzdb update (2019a, as mentioned above). At
 >> least, I made the assumption that the commit of the import of 2019a
 >> had more than just the change that introduced the regression, but
 >> I'm happy to admit I'm no where near as close to the code here as
 >> you/Tom here.

 Tom> Keep in mind that dealing with whatever tzdb chooses to ship is
 Tom> not optional from our standpoint. Even if we'd refused to import
 Tom> 2019a, every installation using --with-system-tzdata (which, I
 Tom> sincerely hope, includes most production installs) is going to
 Tom> have to deal with it as soon as the respective platform vendor
 Tom> gets around to shipping the tzdata update. So reverting that
 Tom> commit was never on the table.

Exactly. But that means that if the combination of our arbitrary rules
and the data in the tzdb results in an undesirable result, then we have
no real option but to fix our rules (we can't reasonably expect the tzdb
upstream to choose zone names to make our alphabetical-order preference
come out right).

My commit was intended to be the minimum fix that would restore the
pre-2019a behavior on all systems.

-- 
Andrew (irc:RhodiumToad)




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Tom Lane
Stephen Frost  writes:
> * Andrew Gierth (and...@tao11.riddles.org.uk) wrote:
> "Stephen" == Stephen Frost  writes:
>> Stephen> Piling on to that, the regression was entwined with other
>> Stephen> important changes that we wanted to include in the release.
>> 
>> I'm not sure what you're referring to here?

I was confused by that too.

> I was referring to the fact that the regression was introduced by a,
> presumably important, tzdb update (2019a, as mentioned above).  At
> least, I made the assumption that the commit of the import of 2019a had
> more than just the change that introduced the regression, but I'm happy
> to admit I'm no where near as close to the code here as you/Tom here.

Keep in mind that dealing with whatever tzdb chooses to ship is not
optional from our standpoint.  Even if we'd refused to import 2019a,
every installation using --with-system-tzdata (which, I sincerely hope,
includes most production installs) is going to have to deal with it
as soon as the respective platform vendor gets around to shipping the
tzdata update.  So reverting that commit was never on the table.

regards, tom lane




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Stephen Frost
Greetings,

* Andrew Gierth (and...@tao11.riddles.org.uk) wrote:
> > "Stephen" == Stephen Frost  writes:
> 
>  Stephen> In the situation that started this discussion, a change had
>  Stephen> already been made and it was only later realized that it
>  Stephen> caused a regression.
> 
> Just to keep the facts straight:
> 
> The regression was introduced by importing tzdb 2019a (in late April)

Ah, thanks, I had misunderstood when that was committed then.

> into the previous round of point releases; the change in UTC behaviour
> was not mentioned in the commit and presumably didn't show up on
> anyone's radar until there were field complaints (which didn't reach our
> mailing lists until Jun 4 as far as I know).

Ok.

> Tom's "fix" of backpatching 23bd3cec6 (which happened on Friday 14th)
> addressed only a subset of cases, as far as I know working only on Linux
> (the historical convention has always been for /etc/localtime to be a
> copy of a zonefile, not a symlink to one). I only decided to write (and
> if need be commit) my own followup fix after confirming that the bug was
> unfixed in a default FreeBSD install when set to UTC, and there was a
> good chance that a number of other less-popular platforms were affected
> too.
> 
>  Stephen> Piling on to that, the regression was entwined with other
>  Stephen> important changes that we wanted to include in the release.
> 
> I'm not sure what you're referring to here?

I was referring to the fact that the regression was introduced by a,
presumably important, tzdb update (2019a, as mentioned above).  At
least, I made the assumption that the commit of the import of 2019a had
more than just the change that introduced the regression, but I'm happy
to admit I'm nowhere near as close to the code as you/Tom here.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Andrew Gierth
> "Stephen" == Stephen Frost  writes:

 Stephen> In the situation that started this discussion, a change had
 Stephen> already been made and it was only later realized that it
 Stephen> caused a regression.

Just to keep the facts straight:

The regression was introduced by importing tzdb 2019a (in late April)
into the previous round of point releases; the change in UTC behaviour
was not mentioned in the commit and presumably didn't show up on
anyone's radar until there were field complaints (which didn't reach our
mailing lists until Jun 4 as far as I know).

Tom's "fix" of backpatching 23bd3cec6 (which happened on Friday 14th)
addressed only a subset of cases, as far as I know working only on Linux
(the historical convention has always been for /etc/localtime to be a
copy of a zonefile, not a symlink to one). I only decided to write (and
if need be commit) my own followup fix after confirming that the bug was
unfixed in a default FreeBSD install when set to UTC, and there was a
good chance that a number of other less-popular platforms were affected
too.

 Stephen> Piling on to that, the regression was entwined with other
 Stephen> important changes that we wanted to include in the release.

I'm not sure what you're referring to here?

-- 
Andrew (irc:RhodiumToad)




Re: commitfest application - create patch doesn't work

2019-06-20 Thread Pavel Stehule
On Thu, 20 Jun 2019 at 20:27, Stephen Frost  wrote:

> Greetings,
>
> * Pavel Stehule (pavel.steh...@gmail.com) wrote:
> > Searching subject for "Specify thread msgid" field doesn't work. It
> returns
> > empty result set every time.
>
> Is this still not working?  I was chatting with Magnus and it seems
> possible this was broken and then fixed already.
>

It is working now.

Thank you

Pavel


> Thanks,
>
> Stephen
>


Re: Tweaking DSM and DSA limits

2019-06-20 Thread Andres Freund
Hi,

On 2019-06-20 14:20:27 -0400, Robert Haas wrote:
> On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro  wrote:
> > Perhaps also the number of slots per backend should be dynamic, so
> > that you have the option to increase it from the current hard-coded
> > value of 2 if you don't want to increase max_connections but find
> > yourself running out of slots (this GUC was a request from Andres but
> > the name was made up by me -- if someone has a better suggestion I'm
> > all ears).
> 
> I am not convinced that we really need to GUC-ify this.  How about
> just bumping the value up from 2 to say 5?

I'm not sure either. Although it's not great if the only way out for a
user hitting this is to increase max_connections... But we should really
increase the default.


> As Andres observed off-list, it would also be a good idea to allow
> things that are going to gobble memory like hash joins to have some
> input into how much memory gets allocated.  Maybe preallocating the
> expected size of the hash is too aggressive -- estimates can be wrong,
> and it could be much smaller.

At least for the case of the hashtable itself, we allocate that at the
predicted size immediately. So a mis-estimation wouldn't change
anything. For the entries, yeah, something like what you suggest would make
sense.

Greetings,

Andres Freund




Re: psql UPDATE field [tab] expands to DEFAULT?

2019-06-20 Thread Tom Lane
I wrote:
> Nosing around in tab-complete.c, I notice a fair number of other
> places where we're doing COMPLETE_WITH() with just a single possible
> completion.  Knowing what we know now, in each one of those places
> libreadline will suppose that that completion is the only syntactically
> legal continuation, and throw away anything else the user might've typed.
> We should probably inspect each of those places to see if that's really
> desirable behavior ... but I didn't muster the energy to do that this
> morning.

I took a closer look and realized that this isn't some magic behavior of
arcane parts of libreadline; it's more like self-inflicted damage.  It
happens because tab-complete.c's complete_from_const() is doing exactly
what its comment says it does:

/*
 * This function returns one fixed string the first time even if it doesn't
 * match what's there, and nothing the second time. This should be used if
 * there is only one possibility that can appear at a certain spot, so
 * misspellings will be overwritten.  The string to be passed must be in
 * completion_charp.
 */

This is unlike complete_from_list(), which will only return completions
that match the text-string-so-far.
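
A stripped-down standalone illustration of the difference (not the
actual tab-complete.c functions):

#include <stdio.h>
#include <string.h>

/* list-style: offer the candidate only if it extends what was typed */
static const char *
complete_like_list(const char *typed, const char *candidate)
{
    return strncmp(typed, candidate, strlen(typed)) == 0 ? candidate : NULL;
}

/* const-style: always return the fixed string, ignoring what was typed */
static const char *
complete_like_const(const char *typed, const char *candidate)
{
    (void) typed;
    return candidate;
}

int
main(void)
{
    const char *typed = "COLL";     /* user typed COLL; sole candidate is DEFAULT */
    const char *a = complete_like_list(typed, "DEFAULT");
    const char *b = complete_like_const(typed, "DEFAULT");

    printf("list-style : %s\n", a ? a : "(no completion, typed text kept)");
    printf("const-style: %s\n", b);     /* DEFAULT -- clobbers what was typed */
    return 0;
}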

I have to wonder whether complete_from_const()'s behavior is really
a good idea; I think there might be an argument for getting rid of it
and using complete_from_list() even for one-element lists.

We certainly didn't do anybody any favors in the refactoring we did in
4f3b38fe2, which removed the source-code difference between calling
complete_from_const() and calling complete_from_list() with just one list
item.  But even before that, I really doubt that many people hacking on
tab-complete.c had internalized the idea that COMPLETE_WITH_CONST()
implied a higher degree of certainty than COMPLETE_WITH_LIST() with one
list item.  I'm pretty sure I'd never understood that.

Both of those functions go back to the beginnings of tab-complete.c,
so there's not much available in the history to explain the difference
in behavior (and the discussion of the original patch, if any, is lost
to the mists of time --- our archives for pgsql-patches only go back to
2000).  But my own feeling about this is that there's no situation in
which I'd expect tab completion to wipe out text I'd typed.

Thoughts?

regards, tom lane




Re: commitfest application - create patch doesn't work

2019-06-20 Thread Stephen Frost
Greetings,

* Pavel Stehule (pavel.steh...@gmail.com) wrote:
> Searching subject for "Specify thread msgid" field doesn't work. It returns
> empty result set every time.

Is this still not working?  I was chatting with Magnus and it seems
possible this was broken and then fixed already.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Jun 20, 2019 at 1:28 PM Alvaro Herrera  
> wrote:
> > I suppose we could have a moratorium on commits starting from (say) EOB
> > Wednesday of the week prior to the release; patches can only be
> > committed after that if they have ample support (where "ample support"
> > might be defined as having +1 from, say, two other committers).  That
> > way there's time to discuss/revert/fix anything that is deemed
> > controversial.
> 
> Or we could have a moratorium on any change at any time that has a -1
> from a committer and a +1 from nobody.

What about a change that's already been committed but another committer
feels caused a regression?  If that gets a -1, does it get reverted
until things are sorted out, or...?

In the situation that started this discussion, a change had already been
made and it was only later realized that it caused a regression.  Piling
on to that, the regression was entwined with other important changes
that we wanted to include in the release.

Having a system where when the commit was made is a driving factor seems
like it would potentially reward people who pushed a change early by
giving them the upper hand in such a discussion as this.

Ultimately though, I still agree with Andres that this is something we
should act to avoid, and we shouldn't try to make a
policy to fit what's been a very rare occurrence.  If nothing else, I
feel like we'd probably re-litigate the policy every time since it would
likely have been a long time since the last discussion of it and the
specific circumstances will always be at least somewhat different.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Tweaking DSM and DSA limits

2019-06-20 Thread Robert Haas
On Tue, Jun 18, 2019 at 9:08 PM Thomas Munro  wrote:
> It's currently set to 4, but I now think that was too cautious.  It
> tries to avoid fragmentation by ramping up slowly (that is, memory
> allocated and in some cases committed by the operating system that we
> don't turn out to need), but it's pretty wasteful of slots.  Perhaps
> it should be set to 2?

+1.  I think I said at the time that I thought that was too cautious...

> Perhaps also the number of slots per backend should be dynamic, so
> that you have the option to increase it from the current hard-coded
> value of 2 if you don't want to increase max_connections but find
> yourself running out of slots (this GUC was a request from Andres but
> the name was made up by me -- if someone has a better suggestion I'm
> all ears).

I am not convinced that we really need to GUC-ify this.  How about
just bumping the value up from 2 to say 5?  Between the preceding
change and this one we ought to buy ourselves more than 4x, and if
that is not enough then we can ask whether raising max_connections is
a reasonable workaround, and if that's still not enough then we can
revisit this idea, or maybe come up with something better.  The
problem I have with a GUC here is that nobody without a PhD in
PostgreSQL-ology will have any clue how to set it, and while that's
good for your employment prospects and mine, it's not so great for
PostgreSQL users generally.

As Andres observed off-list, it would also be a good idea to allow
things that are going to gobble memory like hash joins to have some
input into how much memory gets allocated.  Maybe preallocating the
expected size of the hash is too aggressive -- estimates can be wrong,
and it could be much smaller.  But maybe we should allocate at least,
say, 1/64th of that amount, and act as if
DSA_NUM_SEGMENTS_AT_EACH_SIZE == 1 until the cumulative memory
allocation is more than 25% of that amount.  So if we think it's gonna
be 1GB, start by allocating 16MB and double the size of each
allocation thereafter until we get to at least 256MB allocated.  So
then we'd have 16MB + 32MB + 64MB + 128MB + 256MB + 256MB + 512MB = 7
segments instead of the 32 required currently or the 18 required with
DSA_NUM_SEGMENTS_AT_EACH_SIZE == 2.
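
Back-of-the-envelope sketch of that ramp (standalone arithmetic only,
not dsa.c), just to make the segment count above concrete:

#include <stdio.h>

int
main(void)
{
    const long long MB = 1024 * 1024;
    long long   expected = 1024 * MB;   /* hash expected to need 1GB */
    long long   seg = expected / 64;    /* start at 16MB */
    long long   total = 0;
    int         nsegs = 0;
    int         at_this_size = 0;

    while (total < expected)
    {
        total += seg;
        nsegs++;
        printf("segment %d: %lld MB (cumulative %lld MB)\n",
               nsegs, seg / MB, total / MB);

        if (total <= expected / 4)
            seg *= 2;                   /* ramp: one segment per size */
        else if (++at_this_size == 2)
        {
            seg *= 2;                   /* then two segments per size */
            at_this_size = 0;
        }
    }
    printf("segments: %d\n", nsegs);    /* prints 7 for the 1GB case */
    return 0;
}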

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Robert Haas
On Thu, Jun 20, 2019 at 1:28 PM Alvaro Herrera  wrote:
> I suppose we could have a moratorium on commits starting from (say) EOB
> Wednesday of the week prior to the release; patches can only be
> committed after that if they have ample support (where "ample support"
> might be defined as having +1 from, say, two other committers).  That
> way there's time to discuss/revert/fix anything that is deemed
> controversial.

Or we could have a moratorium on any change at any time that has a -1
from a committer and a +1 from nobody.

I mean, your idea is not bad either.  I'm just saying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Alvaro Herrera
On 2019-Jun-20, Andres Freund wrote:

> On 2019-06-20 12:02:30 -0400, Robert Haas wrote:

> > I agree that it's a difficult situation.  I do kind of wonder whether
> > we were altogether overreacting.  If we had shipped it as it was,
> > what's the worst thing that would have happened?
> 
> I think it's not good, but also nothing particularly bad came out of
> it. I don't think we should try to set up procedures for future
> occurances, and rather work/plan on that not happening very often.

I suppose we could have a moratorium on commits starting from (say) EOB
Wednesday of the week prior to the release; patches can only be
committed after that if they have ample support (where "ample support"
might be defined as having +1 from, say, two other committers).  That
way there's time to discuss/revert/fix anything that is deemed
controversial.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-06-20 12:02:30 -0400, Robert Haas wrote:
> > On Thu, Jun 20, 2019 at 8:52 AM Stephen Frost  wrote:
> > > I don't want to come across as implying that I'm saying what was done
> > > was 'fine', or that we shouldn't be having this conversation, I'm just
> > > trying to figure out how we can frame it in a way that we learn from it
> > > and work to improve on it for the future, should something like this
> > > happen again.
> > 
> > I agree that it's a difficult situation.  I do kind of wonder whether
> > we were altogether overreacting.  If we had shipped it as it was,
> > what's the worst thing that would have happened?
> 
> I think it's not good, but also nothing particularly bad came out of
> it. I don't think we should try to set up procedures for future
> occurances, and rather work/plan on that not happening very often.

Agreed.

Thanks,

Stephen


signature.asc
Description: PGP signature


commitfest application - create patch doesn't work

2019-06-20 Thread Pavel Stehule
Hi

Searching by subject in the "Specify thread msgid" field doesn't work. It
returns an empty result set every time.

Regards

Pavel


Do we need to do better for pg_ctl timeouts?

2019-06-20 Thread Andres Freund
Hi,

Right now using -w for shutting down clusters with a bit bigger shared
buffers will very frequently fail, because the shutdown checkpoint takes
much longer than 60s.  Obviously that can be addressed by manually
setting PGCTLTIMEOUT to something higher, but forcing many users to do
that doesn't seem right. And while many users probably don't want to
aggressively time-out on the shutdown checkpoint, I'd assume most do
want to time out aggressively if the server doesn't actually start the
checkpoint.

I wonder if we need to split the timeout into two: One value for
postmaster to acknowledge the action, one for that action to
complete. It seems to me that that'd be useful for all of starting,
restarting and stopping.

I think we have all the necessary information in the pid file, we would
just need to check for PM_STATUS_STARTING for start, PM_STATUS_STOPPING
for restart/stop.
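
Rough sketch of the two-timeout idea for the stop case (standalone toy,
not pg_ctl code; read_postmaster_status() below just simulates reading
the status line of postmaster.pid):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static const char *
read_postmaster_status(void)
{
    /* simulated: "ready" twice, "stopping" three times, then gone */
    static int  calls = 0;
    static const char *script[] = {"ready", "ready", "stopping",
                                   "stopping", "stopping", NULL};

    return calls < 6 ? script[calls++] : NULL;
}

static bool
wait_for_stop(int ack_timeout, int complete_timeout)
{
    time_t      phase_start = time(NULL);
    bool        acked = false;

    for (;;)
    {
        const char *status = read_postmaster_status();

        if (status == NULL)
            return true;                /* pid file gone: shutdown done */
        if (!acked && strcmp(status, "stopping") == 0)
        {
            acked = true;               /* postmaster acknowledged */
            phase_start = time(NULL);   /* start the long clock */
        }
        if (time(NULL) - phase_start > (acked ? complete_timeout : ack_timeout))
            return false;               /* timed out in this phase */
        sleep(1);
    }
}

int
main(void)
{
    printf("stopped cleanly: %s\n", wait_for_stop(5, 600) ? "yes" : "no");
    return 0;
}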

Comments?

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-06-20 Thread Robert Haas
On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila  wrote:
> This delay is for *not* choking the system by constantly performing
> undo requests that consume a lot of CPU and I/O as discussed in above
> point.  For holding off the same error request to be re-tried, we need
> next_retry_time type of method as discussed below.

Oh.  That's not what I thought we were talking about.  It's not
unreasonable to think about trying to rate limit undo application just
like we do for vacuum, but a fixed delay between requests would be a
completely inadequate way of attacking that problem.  If the
individual requests are short, it will create too much delay, and if
they are long, it will not create enough.  We would need delays within
a transaction, not just between transactions, similar to how the
vacuum cost delay stuff works.

I suggest that we leave that to one side for now.  It seems like
something that could be added later, maybe in a more general way, and
not something that needs to be or should be closely connected to the
logic for deciding the order in which we're going to process different
transactions in undo.

> > Meh.  Don't get stuck on one particular method of calculating the next
> > retry time.  We want to be able to change that easily if whatever we
> > try first doesn't work out well.  I am not convinced that we need
> > anything more complex than a fixed retry time, probably controlled by
> > a GUC (undo_failure_retry_time = 10s?).
>
> IIRC, then you only seem to have suggested that we need a kind of
> back-off algorithm that gradually increases the retry time up to some
> maximum [1].  I think that is a good way to de-prioritize requests
> that are repeatedly failing.  Say, there is a request that has already
> failed for 5 times and the worker queues it to get executed after 10s.
> Immediately after that, another new request has failed for the first
> time for the same database and it also got queued to get executed
> after 10s.  In this scheme the request that has already failed for 5
> times will get a chance before the request that has failed for the
> first time.

Sure, that's an advantage of increasing back-off times -- you can keep
the stuff that looks hopeless from interfering too much with the stuff
that is more likely to work out. However, I don't think we've actually
done enough testing to know for sure what algorithm will work out
best.  Do we want linear back-off (10s, 20s, 30s, ...)?  Exponential
back-off (1s, 2s, 4s, 8s, ...)?  No back-off (10s, 10s, 10s, 10s)?
Some algorithm that depends on the size of the failed transaction, so
that big things get retried less often? I think it's important to
design the code in such a way that the algorithm can be changed easily
later, because I don't think we can be confident that whatever we pick
for the first attempt will prove to be best.  I'm pretty sure that
storing the failure count INSTEAD OF the next retry time is going to
make it harder to experiment with different algorithms later.
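
As a sketch of what I mean by keeping the policy easy to change
(standalone toy, names invented here; which inputs the policy gets -- a
failure count, the previous delay, the size of the transaction -- is
exactly the part that would still need to be settled):

#include <stdio.h>
#include <time.h>

typedef enum RetryPolicy
{
    RETRY_FIXED,                /* 10s, 10s, 10s, ... */
    RETRY_LINEAR,               /* 10s, 20s, 30s, ... */
    RETRY_EXPONENTIAL           /* 1s, 2s, 4s, 8s, ... capped */
} RetryPolicy;

static time_t
compute_next_retry_time(RetryPolicy policy, int failures, time_t now)
{
    long        delay;

    switch (policy)
    {
        case RETRY_LINEAR:
            delay = 10L * failures;
            break;
        case RETRY_EXPONENTIAL:
            delay = 1L << (failures < 10 ? failures - 1 : 9);
            break;
        case RETRY_FIXED:
        default:
            delay = 10;
            break;
    }
    return now + delay;         /* store this timestamp with the request */
}

int
main(void)
{
    time_t      now = time(NULL);

    for (int f = 1; f <= 5; f++)
        printf("failure %d -> retry in %ld s\n", f,
               (long) (compute_next_retry_time(RETRY_EXPONENTIAL, f, now) - now));
    return 0;
}

The error queue itself would only ever see the resulting timestamp, so
swapping one curve for another wouldn't change the queue format.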

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Andres Freund
Hi,

On 2019-06-20 12:02:30 -0400, Robert Haas wrote:
> On Thu, Jun 20, 2019 at 8:52 AM Stephen Frost  wrote:
> > I don't want to come across as implying that I'm saying what was done
> > was 'fine', or that we shouldn't be having this conversation, I'm just
> > trying to figure out how we can frame it in a way that we learn from it
> > and work to improve on it for the future, should something like this
> > happen again.
> 
> I agree that it's a difficult situation.  I do kind of wonder whether
> we were altogether overreacting.  If we had shipped it as it was,
> what's the worst thing that would have happened?

I think it's not good, but also nothing particularly bad came out of
it. I don't think we should try to set up procedures for future
occurrences, and rather work/plan on that not happening very often.

Greetings,

Andres Freund




Re: check_recovery_target_lsn() does a PG_CATCH without a throw

2019-06-20 Thread Andres Freund
Hi,

On 2019-06-20 15:42:14 +0200, Peter Eisentraut wrote:
> On 2019-06-12 13:16, Peter Eisentraut wrote:
> > I haven't figured out the time zone issue yet, but I guess the solution
> > might involve moving some of the code from check_recovery_target_time()
> > to assign_recovery_target_time().
> 
> I think that won't work either.  What we need to do is postpone the
> interpretation of the timestamp string until after all the GUC
> processing is done.  So check_recovery_target_time() would just do some
> basic parsing checks, but stores the string.  Then when we need the
> recovery_target_time_value we do the final parsing.  Then we can be sure
> that the time zone is all set.

That sounds right to me.

Greetings,

Andres Freund




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Robert Haas
On Thu, Jun 20, 2019 at 8:52 AM Stephen Frost  wrote:
> I don't want to come across as implying that I'm saying what was done
> was 'fine', or that we shouldn't be having this conversation, I'm just
> trying to figure out how we can frame it in a way that we learn from it
> and work to improve on it for the future, should something like this
> happen again.

I agree that it's a difficult situation.  I do kind of wonder whether
we were altogether overreacting.  If we had shipped it as it was,
what's the worst thing that would have happened?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: benchmarking Flex practices

2019-06-20 Thread Andres Freund
Hi,

On 2019-06-20 10:52:54 -0400, Tom Lane wrote:
> John Naylor  writes:
> > It would be nice to have confirmation to make sure I didn't err
> > somewhere, and to try a more real-world benchmark.
> 
> I don't see much wrong with using information_schema.sql as a parser/lexer
> benchmark case.  We should try to confirm the results on other platforms
> though.

Might be worth also testing with a more repetitive testcase to measure
both cache locality and branch prediction. I assume that with
information_schema there's enough variability that these effects play a
smaller role. And there are plenty of real-world cases where there's a *lot*
of very similar statements being parsed over and over. I'd probably just
measure the statements pgbench generates or such.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-06-20 Thread Amit Kapila
On Thu, Jun 20, 2019 at 8:01 PM Robert Haas  wrote:
>
> On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila  wrote:
> > Okay, one reason that comes to mind is we don't want to choke the
> > system as applying undo can consume CPU and generate a lot of I/O.  Is
> > that you have in mind or something else?
>
> Yeah, mainly that, but also things like log spam, and even pressure on
> the lock table.  If we are trying over and over again to take useless
> locks, it can affect other things on the system. The main thing,
> however, is the CPU and I/O consumption.
>
> > I see an advantage in having some sort of throttling here, so we can
> > have some wait time (say 100ms) between processing requests.  Do we
> > see any need of guc here?
>
> I don't think that is the right approach. As I said in my previous
> reply, we need a way of holding off the retry of the same error for a
> certain amount of time, probably measured in seconds or tens of
> seconds. Introducing a delay before processing every request is an
> inferior alternative:
>

This delay is for *not* choking the system by constantly performing
undo requests that consume a lot of CPU and I/O, as discussed in the above
point.  For holding off retries of the same failed request, we need a
next_retry_time type of method, as discussed below.

>  if there are a lot of rollbacks, it can cause
> the system to lag; and in the case where there's just one rollback
> that's failing, it will still be way too much log spam (and probably
> CPU time too).  Nobody wants 10 failure messages per second in the
> log.
>
> > > It seems to me that thinking of this in terms of what the undo worker
> > > does and what the undo launcher does is probably not the right
> > > approach. We need to think of it more as an integrated system. Instead
> > > of storing a failure_count with each request in the error queue, how
> > > about storing a next retry time?
> >
> > I think both failure_count and next_retry_time can work in a similar way.
> >
> > I think incrementing next retry time in multiples will be a bit
> > tricky.  Say first-time error occurs at X hours.  We can say that
> > next_retry_time will X+10s=Y and error_occured_at will be X.  The
> > second time it again failed, how will we know that we need set
> > next_retry_time as Y+20s, maybe we can do something like Y-X and then
> > add 10s to it and add the result to the current time.  Now whenever
> > the worker or launcher finds this request, they can check if the
> > current_time is greater than or equal to next_retry_time, if so they
> > can pick that request, otherwise, they check request in next queue.
> >
> > The failure_count can also work in a somewhat similar fashion.
> > Basically, we can use error_occurred at and failure_count to compute
> > the required time.  So, if error is occurred at say X hours and
> > failure count is 3, then we can check if current_time is greater than
> > X+(3 * 10s), then we will allow the entry to be processed, otherwise,
> > it will check other queues for work.
>
> Meh.  Don't get stuck on one particular method of calculating the next
> retry time.  We want to be able to change that easily if whatever we
> try first doesn't work out well.  I am not convinced that we need
> anything more complex than a fixed retry time, probably controlled by
> a GUC (undo_failure_retry_time = 10s?).
>

IIRC, you only seemed to have suggested that we need a kind of
back-off algorithm that gradually increases the retry time up to some
maximum [1].  I think that is a good way to de-prioritize requests
that are repeatedly failing.  Say there is a request that has already
failed 5 times and the worker queues it to get executed after 10s.
Immediately after that, another new request fails for the first
time for the same database and it also gets queued to be executed
after 10s.  In this scheme, the request that has already failed 5
times will get a chance before the request that has failed for the
first time.


[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoYHBkm7M8tNk6Z9G_aEOiw3Bjdux7v9%2BUzmdNTdFmFzjA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Disconnect from SPI manager on error

2019-06-20 Thread Tom Lane
RekGRpth  writes:
> A patch fixing this bug
> https://www.postgresql.org/message-id/flat/15738-21723084f3009ceb%40postgresql.org

I do not think this code change is necessary or appropriate.
It is not plpgsql's job to clean up after other backend subsystems
during a transaction abort.  Maybe if plpgsql were the only thing
that invokes spi.c, it would be sane to factorize the responsibility
this way --- but of course it is not.

The complaint in bug #15738 is 100% bogus, which is probably why
it was roundly ignored.  The quoted C code is just plain wrong
about how to handle errors inside the backend.  In particular,
SPI_rollback is not even approximately the right thing to do to
clean up after catching a thrown exception.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-06-20 Thread Robert Haas
On Thu, Jun 20, 2019 at 6:44 AM Amit Kapila  wrote:
> BTW, while looking at the code of UndoFetchRecord, I see some problem.
> There is a coding pattern like
> if()
> {
> }
> else
> {
>LWLockAcquire()
>   ..
>   ..
> }
>
> LWLockRelease().
>
> I think this is not correct.

Independently of that problem, I think it's probably bad that we're
not maintaining the same shared memory state on the master and the
standby.  Doing the same check in one way on the master and in a
different way on the standby is a recipe for surprising and probably
bad behavior differences between master and standby servers.  Those
could be simple things like lock acquire/release not matching, but
they could also be things like performance or correctness differences
that only materialize under certain scenarios.

This is not the only place in the patch set where we have this kind of
thing, and I hate them all.  I don't exactly know what the solution
is, either, but I suspect it will involve either having the recovery
process do a more thorough job updating the shared memory state when
it does undo-related stuff, or running some of the undo-specific
processes on the standby just for the purpose of getting these updates
done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: improve transparency of bitmap-only heap scans

2019-06-20 Thread Emre Hasegeli
> Looking at the discussion where the feature was added, I think changing the
> EXPLAIN just wasn't considered.

I think this is an oversight.  It is very useful to have this on
EXPLAIN.

> The attached patch adds "avoided" to "exact" and "lossy" as a category
> under "Heap Blocks".

It took me a while to figure out what those names mean.  "unfetched",
as you call it in the code, may be more descriptive than "avoided" for
the new label.  However, I think the other two are more confusing.  It
may be a good idea to change them together with this.

> I think the name of the node should also be changed to "Bitmap Only Heap
> Scan", but I didn't implement that as adding another NodeTag looks like a
> lot of tedious error prone work to do before getting feedback on whether
> the change is desirable in the first place, or the correct approach.

I am not sure about this part.  In my opinion, it may have been easier
to explain to users if "Index Only Scan" had not been a separate node
but an "Index Scan" optimization.




Re: benchmarking Flex practices

2019-06-20 Thread Tom Lane
John Naylor  writes:
> I decided to do some experiments with how we use Flex. The main
> takeaway is that backtracking, which we removed in 2005, doesn't seem
> to matter anymore for the core scanner. Also, state table size is of
> marginal importance.

Huh.  That's really interesting, because removing backtracking was a
demonstrable, significant win when we did it [1].  I wonder what has
changed?  I'd be prepared to believe that today's machines are more
sensitive to the amount of cache space eaten by the tables --- but that
idea seems contradicted by your result that the table size isn't
important.  (I'm wishing I'd documented the test case I used in 2005...)

> The size difference is because the size of the elements of the
> yy_transition array is constrained by the number of elements in the
> array. Since there are now fewer than INT16_MAX state transitions, the
> struct members go from 32 bit:
> static yyconst struct yy_trans_info yy_transition[37045] = ...
> to 16 bit:
> static yyconst struct yy_trans_info yy_transition[31763] = ...

Hm.  Smaller binary is definitely nice, but 31763 is close enough to
32768 that I'd have little faith in the optimization surviving for long.
Is there any way we could buy back some more transitions?

> It would be nice to have confirmation to make sure I didn't err
> somewhere, and to try a more real-world benchmark.

I don't see much wrong with using information_schema.sql as a parser/lexer
benchmark case.  We should try to confirm the results on other platforms
though.

regards, tom lane

[1] https://www.postgresql.org/message-id/8652.1116865...@sss.pgh.pa.us




Re: POC: Cleaning up orphaned files using undo logs

2019-06-20 Thread Robert Haas
On Thu, Jun 20, 2019 at 6:48 AM Amit Kapila  wrote:
> > for (;;)
> > {
> > UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> > if (i like this one)
> > break;
> > urp = uur->uur_blkprev; // should be renamed, since zedstore +
> > probably others will have tuple chains not block chains
> ..
>
> +1 for renaming this variable.  How about uur_prev_ver or uur_prevver
> or uur_verprev?  Any other suggestions?

Maybe just uur_previous or uur_prevundo or something like that.  We've
already got a uur_prevurp, but that's really pretty misnamed and IMHO
it doesn't belong in this structure anyway.  (uur_next is also a bad
name and also doesn't belong in this structure.)

I don't think we want to use 'ver' because that supposes that undo is
being used to track tuple versions, which is a likely use but perhaps
not the only one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




benchmarking Flex practices

2019-06-20 Thread John Naylor
I decided to do some experiments with how we use Flex. The main
takeaway is that backtracking, which we removed in 2005, doesn't seem
to matter anymore for the core scanner. Also, state table size is of
marginal importance.

Using the information_schema Flex+Bison microbenchmark from Tom [1], I
tested removing most of the "fail" rules designed to avoid
backtracking ("decimalfail" is needed by PL/pgSQL). Below are the best
times (most runs within 1%), followed by postgres binary size. The
numbers are with Flex 2.5.35 on MacOS, no asserts or debugging
symbols.

HEAD:
1.53s
7139132 bytes

HEAD minus "fail" rules (patch attached):
1.53s
6971204 bytes

Surprisingly, it has the same performance and a much smaller binary.
The size difference is because the size of the elements of the
yy_transition array is constrained by the number of elements in the
array. Since there are now fewer than INT16_MAX state transitions, the
struct members go from 32 bit:

struct yy_trans_info
{
flex_int32_t yy_verify;
flex_int32_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[37045] = ...

to 16 bit:

struct yy_trans_info
{
flex_int16_t yy_verify;
flex_int16_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[31763] = ...

To test if array size was the deciding factor, I tried bloating it by
essentially undoing commit a5ff502fcea. Doing so produced an array
with 62583 elements and 32-bit members, so nearly quadruple in size,
and it was still not much slower than HEAD:

HEAD minus "fail" rules, minus %xusend/%xuiend:
1.56s
7343932 bytes

While at it, I repeated the benchmark with different Flex flags:

HEAD, plus -Cf:
1.60s
6995788 bytes

HEAD, minus "fail" rules, plus -Cf:
1.59s
6979396 bytes

HEAD, plus -Cfe:
1.65s
6868804 bytes

So this recommendation of the Flex manual (-CF) still holds true. It's
worth noting that using perfect hashing for keyword lookup (20%
faster) had a much bigger effect than switching from -Cfe to -CF (7%
faster).

It would be nice to have confirmation to make sure I didn't err
somewhere, and to try a more real-world benchmark. (Also for the
moment I only have Linux on a virtual machine.) The regression tests
pass, but some comments are now wrong. If it's confirmed that
backtracking doesn't matter for recent Flex/hardware, disregarding it
would make maintenance of our scanners a bit easier.

[1] https://www.postgresql.org/message-id/14616.1558560331%40sss.pgh.pa.us

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


remove-scanner-fail-rules.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-06-20 Thread Robert Haas
On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila  wrote:
> Okay, one reason that comes to mind is we don't want to choke the
> system as applying undo can consume CPU and generate a lot of I/O.  Is
> that what you have in mind, or something else?

Yeah, mainly that, but also things like log spam, and even pressure on
the lock table.  If we are trying over and over again to take useless
locks, it can affect other things on the system. The main thing,
however, is the CPU and I/O consumption.

> I see an advantage in having some sort of throttling here, so we can
> have some wait time (say 100ms) between processing requests.  Do we
> see any need of guc here?

I don't think that is the right approach. As I said in my previous
reply, we need a way of holding off the retry of the same error for a
certain amount of time, probably measured in seconds or tens of
seconds. Introducing a delay before processing every request is an
inferior alternative: if there are a lot of rollbacks, it can cause
the system to lag; and in the case where there's just one rollback
that's failing, it will still be way too much log spam (and probably
CPU time too).  Nobody wants 10 failure messages per second in the
log.

> > It seems to me that thinking of this in terms of what the undo worker
> > does and what the undo launcher does is probably not the right
> > approach. We need to think of it more as an integrated system. Instead
> > of storing a failure_count with each request in the error queue, how
> > about storing a next retry time?
>
> I think both failure_count and next_retry_time can work in a similar way.
>
> I think incrementing next retry time in multiples will be a bit
> tricky.  Say first-time error occurs at X hours.  We can say that
> next_retry_time will be X+10s=Y and error_occurred_at will be X.  The
> second time it fails, how will we know that we need to set
> next_retry_time to Y+20s?  Maybe we can do something like Y-X and then
> add 10s to it and add the result to the current time.  Now whenever
> the worker or launcher finds this request, they can check if the
> current_time is greater than or equal to next_retry_time, if so they
> can pick that request, otherwise, they check request in next queue.
>
> The failure_count can also work in a somewhat similar fashion.
> Basically, we can use error_occurred_at and failure_count to compute
> the required time.  So, if the error occurred at say X hours and the
> failure count is 3, then we can check if current_time is greater than
> X+(3 * 10s); if so, we will allow the entry to be processed, otherwise,
> it will check other queues for work.

Meh.  Don't get stuck on one particular method of calculating the next
retry time.  We want to be able to change that easily if whatever we
try first doesn't work out well.  I am not convinced that we need
anything more complex than a fixed retry time, probably controlled by
a GUC (undo_failure_retry_time = 10s?).  An escalating time between
retries would be important and advantageous if we expected the sizes
of these queues to grow into the millions, but the current design
seems to be contemplating something more in the tens-of-thousands
range and I am not sure we're going to need it at that level.  We
should try simple things first and then see where we need to make it
more complex.

At some basic level, the queue needs to be ordered by increasing retry
time.  You can do that with your design, but you have to recompute the
next retry time from the error_occurred_at and failure_count values
every time you examine an entry.  It's almost certainly better to
store the next_retry_time explicitly.  That way, if for example we
change the logic for computing the next_retry_time to something really
complicated, it doesn't have any effect on the code that keeps the
queue in order -- it just looks at the computed value.  If we end up
with something very simple, like error_occurred_at + constant, it may
end up seeming a little silly, but I think that's a price well worth
paying for code maintainability.  If we end up with error_occurred_at
+ Min(failure_count * 10, 100) or something of that sort, then we can
also store failure_count in each record, but it will just be part of
the payload, not the sort key, so adding it or removing it won't
affect the code that maintains the queue ordering.
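
To make that concrete, here is a rough sketch of the sort of thing I
have in mind.  All of these names are placeholders, not a proposal for
the actual structures in the patch:

/* Hypothetical shape of an entry in the error (retry) queue. */
typedef struct UndoRetryQueueEntry
{
    Oid             dbid;               /* database of the request */
    TransactionId   xid;                /* transaction whose undo failed */
    TimestampTz     next_retry_time;    /* the sort key */
    TimestampTz     error_occurred_at;  /* payload only */
    int             failure_count;      /* payload only */
} UndoRetryQueueEntry;

/*
 * Only this function knows how the retry time is derived; the code that
 * keeps the queue ordered just compares next_retry_time values.
 * (Assumes utils/timestamp.h for TimestampTzPlusMilliseconds.)
 */
static TimestampTz
compute_next_retry_time(UndoRetryQueueEntry *entry)
{
    int         delay_secs = Min(entry->failure_count * 10, 100);

    return TimestampTzPlusMilliseconds(entry->error_occurred_at,
                                       delay_secs * 1000);
}

Whether we end up with a fixed undo_failure_retry_time GUC or an
escalating delay, only compute_next_retry_time() would have to change.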

> >  I think the error queue needs to be
> > ordered by database_id, then by next_retry_time, and then by order of
> > insertion.  (The last part is important because next_retry_time is
> > going to be prone to having ties, and we need to break those ties in
> > the right way.)
>
> I think it makes sense to order requests by next_retry_time,
> error_occurred_at (this will ensure the order of insertion).  However,
> I am not sure if there is a need to club the requests w.r.t database
> id.  It can starve the error requests from other databases.  Moreover,
> we already have a functionality wherein if the undo worker doesn't
> encounter the next request from the same data

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-06-20 Thread Joe Conway
On 6/20/19 8:34 AM, Masahiko Sawada wrote:
> I think even if we provide per-table encryption we can have
> encryption keys per tablespace. That is, all tables in the same
> tablespace use the same encryption key, but the user can control
> which objects are encrypted per table.
> 
>> Will we add the table-level TDE in the first version?
> 
> I hope so, but it's under discussion now.

+1

>> And I have two questions.
>> 1. Will we add hooks to support replacing the encryption algorithms?
>> 2. Will we add some encryption algorithm or use some in some libraries?
> 
> Currently the WIP patch uses openssl and supports only AES-256 and I
> don't have a plan to have such extensibility for now. But it might be
> a good idea in the future. I think it would not be hard to support
> symmetric encryption algorithms supported by openssl, but would you
> like to support other encryption algorithms?

Supporting other symmetric encryption algorithms would be nice but I
don't think that is required for the first version. It would also be
nice but not initially required to support different encryption
libraries. The implementation should be written with both of these
eventualities in mind though IMHO.

I would like to see strategically placed hooks in the key management so
that an extension could, for example, layer another key in between the
master key and the per-tablespace key. This would allow extensions to
provide additional control over when decryption is allowed.
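
For illustration only (none of these names exist in the WIP patch, as
far as I know), even a single hook around the point where the
per-tablespace key is derived from the master key would go a long way:

/* Hypothetical hook in the key manager. */
typedef bool (*derive_tablespace_key_hook_type) (Oid tablespaceoid,
                                                 const uint8 *master_key,
                                                 int master_key_len,
                                                 uint8 *tablespace_key,
                                                 int tablespace_key_len);

extern PGDLLIMPORT derive_tablespace_key_hook_type derive_tablespace_key_hook;

/* and in core, roughly (derive_tablespace_key_default is made up too): */
if (derive_tablespace_key_hook &&
    derive_tablespace_key_hook(tsoid, master_key, master_key_len,
                               ts_key, ts_key_len))
{
    /* extension supplied or transformed the key, e.g. by inserting an
     * intermediate key or consulting its own policy */
}
else
    derive_tablespace_key_default(tsoid, master_key, master_key_len,
                                  ts_key, ts_key_len);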

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: check_recovery_target_lsn() does a PG_CATCH without a throw

2019-06-20 Thread Peter Eisentraut
On 2019-06-12 13:16, Peter Eisentraut wrote:
> I haven't figured out the time zone issue yet, but I guess the solution
> might involve moving some of the code from check_recovery_target_time()
> to assign_recovery_target_time().

I think that won't work either.  What we need to do is postpone the
interpretation of the timestamp string until after all the GUC
processing is done.  So check_recovery_target_time() would just do some
basic parsing checks, but stores the string.  Then when we need the
recovery_target_time_value we do the final parsing.  Then we can be sure
that the time zone is all set.
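
Roughly along these lines -- just a sketch of the direction, and the
variable name recovery_target_time_string is made up here:

/* GUC check hook: keep the string, defer the real interpretation */
bool
check_recovery_target_time(char **newval, void **extra, GucSource source)
{
    /*
     * Only light syntax checks here; binding the value to a time zone
     * has to wait until all GUCs (including TimeZone) are settled.
     */
    return true;
}

/* called at the point where recovery actually needs the value */
static TimestampTz
get_recovery_target_time(void)
{
    return DatumGetTimestampTz(
        DirectFunctionCall3(timestamptz_in,
                            CStringGetDatum(recovery_target_time_string),
                            ObjectIdGetDatum(InvalidOid),
                            Int32GetDatum(-1)));
}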

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




old_snapshot_threshold vs indexes

2019-06-20 Thread Thomas Munro
Hello,

I ran into someone with a system where big queries scanning 8GB+ of
all-in-cache data took consistently ~2.5x longer on a primary server
than on a replica.  Both servers had concurrent activity on them but
plenty of spare capacity and similar specs.  After some investigation
it turned out that on the primary there were (1) some select()
syscalls waiting for 1ms, which might indicate contended
SpinLockAcquire() back-offs, and (2) a huge amount of time spent in:

+ 93,31% 0,00% postgres postgres [.] index_getnext
+ 93,30% 0,00% postgres postgres [.] index_fetch_heap
+ 81,66% 0,01% postgres postgres [.] heap_page_prune_opt
+ 75,85% 0,00% postgres postgres [.] TransactionIdLimitedForOldSnapshots
+ 75,83% 0,01% postgres postgres [.] RelationHasUnloggedIndex
+ 75,79% 0,00% postgres postgres [.] RelationGetIndexList
+ 75,79% 75,78% postgres postgres [.] list_copy

The large tables in question have around 30 indexes.  I see that
heap_page_prune_opt()'s call to TransactionIdLimitedForOldSnapshots()
acquires a couple of system-wide spinlocks, and also tests
RelationAllowsEarlyPruning() which calls RelationHasUnloggedIndex()
which says:

 * Tells whether any index for the relation is unlogged.
 *
 * Note: There doesn't seem to be any way to have an unlogged index attached
 * to a permanent table, but it seems best to keep this general so that it
 * returns sensible results even when they seem obvious (like for an unlogged
 * table) and to handle possible future unlogged indexes on permanent tables.

It calls RelationGetIndexList() which conses up a new copy of the list
every time, so that we can spin through it looking for unlogged
indexes (and in this user's case there are none).  I didn't try to
poke at this in lab conditions, but from a glance at the code, I guess
heap_page_prune_opt() is running for every index tuple except those
that reference the same heap page as the one before, so I guess it
happens a lot if the heap is not physically correlated with the index
keys.  Ouch.
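
The obvious band-aid would be to remember the answer in the relcache
entry so the hot path doesn't have to cons up and walk the index list
every time.  Untested sketch; rd_hasunloggedindex and
rd_hasunloggedindex_valid are made-up fields that would need to be
reset on relcache invalidation:

bool
RelationHasUnloggedIndex(Relation rel)
{
    List       *indexoidlist;
    ListCell   *lc;
    bool        result = false;

    if (rel->rd_hasunloggedindex_valid)
        return rel->rd_hasunloggedindex;

    indexoidlist = RelationGetIndexList(rel);
    foreach(lc, indexoidlist)
    {
        if (get_rel_persistence(lfirst_oid(lc)) == RELPERSISTENCE_UNLOGGED)
        {
            result = true;
            break;
        }
    }
    list_free(indexoidlist);

    rel->rd_hasunloggedindex = result;
    rel->rd_hasunloggedindex_valid = true;
    return result;
}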

-- 
Thomas Munro
https://enterprisedb.com




Re: Index Skip Scan

2019-06-20 Thread Jesper Pedersen

Hi,

On 6/19/19 9:57 AM, Dmitry Dolgov wrote:

Here is what I was talking about: a POC for an integration with index scan.
As for the use of create_skipscan_unique_path and the suggested planner
improvements, I hope that together with Jesper we can come up with something soon.



I made some minor changes, but I did move all the code in 
create_distinct_paths() under enable_indexskipscan to limit the overhead 
if skip scan isn't enabled.


Attached is v20, since the last patch should have been v19.

Best regards,
 Jesper
From 4fd4bd601f510ccce858196c0e93d37aa2d0f20f Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Thu, 20 Jun 2019 07:42:24 -0400
Subject: [PATCH 1/2] Implementation of Index Skip Scan (see Loose Index Scan
 in the wiki [1]) on top of IndexScan and IndexOnlyScan. To make it suitable
 for both situations when there are small number of distinct values and
 significant amount of distinct values the following approach is taken -
 instead of searching from the root for every value we're searching for then
 first on the current page, and then if not found continue searching from the
 root.

Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
 contrib/bloom/blutils.c   |   1 +
 doc/src/sgml/config.sgml  |  16 ++
 doc/src/sgml/indexam.sgml |  10 +
 doc/src/sgml/indices.sgml |  24 ++
 src/backend/access/brin/brin.c|   1 +
 src/backend/access/gin/ginutil.c  |   1 +
 src/backend/access/gist/gist.c|   1 +
 src/backend/access/hash/hash.c|   1 +
 src/backend/access/index/indexam.c|  16 ++
 src/backend/access/nbtree/nbtree.c|  12 +
 src/backend/access/nbtree/nbtsearch.c | 224 +-
 src/backend/access/spgist/spgutils.c  |   1 +
 src/backend/commands/explain.c|  24 ++
 src/backend/executor/nodeIndexonlyscan.c  |  22 ++
 src/backend/executor/nodeIndexscan.c  |  22 ++
 src/backend/nodes/copyfuncs.c |   2 +
 src/backend/nodes/outfuncs.c  |   3 +
 src/backend/nodes/readfuncs.c |   2 +
 src/backend/optimizer/path/costsize.c |   1 +
 src/backend/optimizer/path/pathkeys.c |  84 ++-
 src/backend/optimizer/plan/createplan.c   |  20 +-
 src/backend/optimizer/plan/planagg.c  |   1 +
 src/backend/optimizer/plan/planner.c  |  79 +-
 src/backend/optimizer/util/pathnode.c |  40 
 src/backend/optimizer/util/plancat.c  |   1 +
 src/backend/utils/misc/guc.c  |   9 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/amapi.h|   5 +
 src/include/access/genam.h|   1 +
 src/include/access/nbtree.h   |   5 +
 src/include/nodes/execnodes.h |   7 +
 src/include/nodes/pathnodes.h |   8 +
 src/include/nodes/plannodes.h |   2 +
 src/include/optimizer/cost.h  |   1 +
 src/include/optimizer/pathnode.h  |   5 +
 src/include/optimizer/paths.h |   4 +
 src/test/regress/expected/create_index.out|   1 +
 src/test/regress/expected/select_distinct.out | 209 
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/create_index.sql |   2 +
 src/test/regress/sql/select_distinct.sql  |  68 ++
 41 files changed, 918 insertions(+), 22 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
 	amroutine->amcostestimate = blcostestimate;
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..a1c8a1ea27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,22 @@ ANY num_sync ( 
+  enable_indexskipscan (boolean)
+  
+   enable_indexskipscan configuration parameter
+  
+  
+  
+   
+Enables or disables the query planner's use of index-skip-scan plan
+types (see ). This parameter requires
+that enable_indexonlyscan is on.
+The default is on.
+   
+  
+ 
+
  
   enable_material (boolean)
   
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..c2eb296306 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/

Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-20 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Also on the topic of process: 48 hours before a wrap deadline is
> *particularly* not the time to play fast and loose with this sort of
> thing.  It'd have been better to wait till after this week's releases,
> so there'd at least be time to reconsider if the patch turned out to
> have unexpected side-effects.

Our typical process for changes that actually end up breaking other
things is to put things back the way they were and come up with a
better answer.

Should we have reverted the code change that caused the issue in the
first place, namely, as I understand it at least, the tz code update, to
give us time to come up with a better solution and to fix it properly?

I'll admit that I wasn't following the thread very closely initially,
but I don't recall seeing that even discussed as an option, even though
we do it routinely and even had another such case for this set of
releases.  Possibly a bad assumption on my part, but I did assume that
the lack of such a discussion meant that reverting wasn't really an
option due to the nature of the changes, leading us into an atypical
case already where our usual processes weren't able to be followed.

That doesn't mean we should throw the whole thing out the window either,
certainly, but I'm not sure that between the 3 options of 'revert',
'live with things being arguably broken', and 'push a contentious
commit' that I'd have seen a better option either.

I do agree that it would have been better if intentions had been made
clearer, such as announcing the plan to push the changes, so that we
didn't end up with an issue during this set of releases (either from
out-of-date zone information, or from having the wrong timezone alias
used).  With feelings running high on both sides, it would also have
helped if there had been a more explicit "hey, we really need input
from someone else on which way they think this should go", ideally
with the options spelled out.

I don't want to come across as implying that I'm saying what was done
was 'fine', or that we shouldn't be having this conversation, I'm just
trying to figure out how we can frame it in a way that we learn from it
and work to improve on it for the future, should something like this
happen again.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-06-20 Thread Masahiko Sawada
On Tue, Jun 18, 2019 at 5:07 PM shawn wang  wrote:
>
> Masahiko Sawada  于2019年6月17日周一 下午8:30写道:
>>
>> On Fri, Jun 14, 2019 at 7:41 AM Tomas Vondra
>>  wrote:
>> > I personally find the idea of encrypting tablespaces rather strange.
>> > Tablespaces are meant to define physical location for objects, but this
>> > would also use them to "mark" objects as encrypted or not. That just
>> > seems misguided and would make the life harder for many users.
>> >
>> > For example, what if I don't have any tablespaces (except for the
>> > default one), but I want to encrypt only some objects? Suddenly I have
>> > to create a tablespace, which will however cause various difficulties
>> > down the road (during pg_basebackup, etc.).
>>
>> I guess that we can have an encrypted tablespace by default (e.g.
>> pg_default_enc). Or we could encrypt per table while having encryption
>> keys per tablespace.
>
>
> Hi Sawada-san,
> I do agree with it.
>>
>>
>> On Mon, Jun 17, 2019 at 6:54 AM Tomas Vondra
>>  wrote:
>> >
>> > On Sun, Jun 16, 2019 at 02:10:23PM -0400, Stephen Frost wrote:
>> > >Greetings,
>> > >
>> > >* Joe Conway (m...@joeconway.com) wrote:
>> > >> On 6/16/19 9:45 AM, Bruce Momjian wrote:
>> > >> > On Sun, Jun 16, 2019 at 07:07:20AM -0400, Joe Conway wrote:
>> > >> >> In any case it doesn't address my first point, which is limiting the
>> > >> >> volume encrypted with the same key. Another valid reason is you might
>> > >> >> have data at varying sensitivity levels and prefer different keys be
>> > >> >> used for each level.
>> > >> >
>> > >> > That seems quite complex.
>> > >>
>> > >> How? It is no more complex than encrypting at the tablespace level
>> > >> already gives you - in that case you get this property for free if you
>> > >> care to use it.
>> > >
>> > >Perhaps not surprising, but I'm definitely in agreement with Joe
>> > >regarding having multiple keys when possible and (reasonably)
>> > >straight-forward to do so.  I also don't buy off on the OpenSSL
>> > >argument; their more severe issues certainly haven't been due to key
>> > >management issues such as what we're discussing here, so I don't think
>> > >the argument applies.
>> > >
>> >
>> > I'm not sure what exactly is the "OpenSSL argument" you're disagreeing
>> > with? IMHO Bruce is quite right that the risk of vulnerabilities grows
>> > with the complexity of the system (both due to implementation bugs and
>> > general design weaknesses). I don't think it's tied to the key
>> > management specifically, except that it's one of the parts that may
>> > contribute to the complexity.
>> >
>> > (It's often claimed that key management is one of the weakest points of
>> > current crypto systems - we have safe (a)symmetric algorithms, but safe
>> > handling of keys is an issue. I don't have data / papers supporting this
>> > claim, I kinda believe it.)
>> >
>> > Now, I'm not opposed to eventually implementing something more
>> > elaborate, but I also think just encrypting the whole cluster (not
>> > necessarily with a single key, but with one master key) would be enough
>> > for vast majority of users. Plus it's less error prone and easier to
>> > operate (backups, replicas, crash recovery, ...).
>> >
>> > But there's about 0% chance we'll get that in v1, of course, so we need
>> > s "minimum viable product" to build on anyway.
>> >
>>
>> I agree that we need minimum viable product first. But I'm not sure
>> it's true that the implementing the cluster-wide TDE first could be
>> the first step of per-tablespace/table TDE.
>
>
> Yes, we could complete the per-tablespace/table TDE in version 13.
> And we could do cluster-wide TDE in the next version.
> But I remember you said there are so many keys to manage in the table-level.

I think even if we provide per-table encryption we can have
encryption keys per tablespace. That is, all tables in the same
tablespace use the same encryption key, but the user can control
which objects are encrypted per table.

> Will we add the table-level TDE in the first version?

I hope so, but it's under discussion now.

> And I have two questions.
> 1. Will we add hooks to support replacing the encryption algorithms?
> 2. Will we add some encryption algorithm or use some in some libraries?

Currently the WIP patch uses openssl and supports only AES-256 and I
don't have a plan to have such extensibility for now. But it might be
a good idea in the future. I think it would not be hard to support
symmetric encryption algorithms supported by openssl, but would you
like to support other encryption algorithms?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: POC: Cleaning up orphaned files using undo logs

2019-06-20 Thread Amit Kapila
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas  wrote:
>
> On Tue, Jun 18, 2019 at 2:07 PM Robert Haas  wrote:
> > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila  wrote:
> > > [ new patches ]
> >
> > I tried writing some code [ to use these patches ].
>
>
> for (;;)
> {
> UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> if (i like this one)
> break;
> urp = uur->uur_blkprev; // should be renamed, since zedstore +
> probably others will have tuple chains not block chains
..

+1 for renaming this variable.  How about uur_prev_ver or uur_prevver
or uur_verprev?  Any other suggestions?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-06-20 Thread Amit Kapila
On Thu, Jun 20, 2019 at 2:24 PM Dilip Kumar  wrote:
>
> On Wed, Jun 19, 2019 at 11:35 PM Robert Haas  wrote:
> >
> > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas  wrote:
> > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila  
> > > wrote:
> > > > [ new patches ]
> > >
> > > I tried writing some code [ to use these patches ].
> >
> > I spent some more time experimenting with this patch set today and I
> > think that the UndoFetchRecord interface is far too zheap-centric.  I
> > expected that I would be able to do this:
> >
> > UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> > // do stuff with uur
> > UndoRecordRelease(uur);
> >
> > But I can't, because the UndoFetchRecord API requires me to pass not
> > only an undo record but also a block number, an offset number, an XID,
> > and a callback.  I think I could get the effect that I want by
> > defining a callback that always returns true.  Then I could do:
> >
> > UndoRecPtr junk;
> > UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber,
> > InvalidOffsetNumber, &junk, always_returns_true);
> > // do stuff with uur
> > UndoRecordRelease(uur);
> >
> > That seems ridiculously baroque.  I think the most common thing that
> > an AM will want to do with an UndoRecPtr is look up that exact record;
> > that is, for example, what zedstore will want to do. However, even if
> > some AMs, like zheap, want to search backward through a chain of
> > records, there's no real reason to suppose that all of them will want
> > to search by block number + offset.  They might want to search by some
> > bit of data buried in the payload, for example.
> >
> > I think the basic question here is whether we really need anything
> > more complicated than:
> >
> > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp);
> >
> > I mean, if you had that, the caller can implement looping easily
> > enough, and insert any test they want:
> >
> > for (;;)
> > {
> > UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> > if (i like this one)
> > break;
> > urp = uur->uur_blkprev; // should be renamed, since zedstore +
> > probably others will have tuple chains not block chains
> > UndoRecordRelease(uur);
> > }
>
> The idea behind having the loop inside the undo machinery was that
> while traversing the blkprev chain, we can read all the undo records
> on the same undo page under one buffer lock.
>

I think if we want we can hold this buffer and allow it to be released
in UndoRecordRelease.  However, this buffer needs to be stored in some
common structure which can be then handed over to UndoRecordRelease.
Another thing is that as of now the API allocates the memory just once
for UnpackedUndoRecord whereas in the new scheme it needs to be
allocated again and again.  I think this is a relatively minor thing,
but it might be better if we can avoid palloc again and again.
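
Purely as an illustration (these names are not in the patch), the
common structure could look something like this, so that the pinned
buffer travels with the record and the record memory can be reused
across fetches:

typedef struct UndoRecordFetchContext
{
    Buffer              buffer;     /* undo page currently pinned, if any */
    UnpackedUndoRecord  record;     /* reused across fetches */
} UndoRecordFetchContext;

extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
                                           UndoRecordFetchContext *ctx);
extern void UndoRecordRelease(UndoRecordFetchContext *ctx);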

BTW, while looking at the code of UndoFetchRecord, I see a problem.
There is a coding pattern like
if()
{
}
else
{
   LWLockAcquire()
  ..
  ..
}

LWLockRelease().

I think this is not correct.
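
In other words (generic illustration, not the actual code), the release
has to mirror the acquire:

/* broken: releases a lock that may never have been acquired */
if (have_it_in_memory)
{
    /* read without the lock */
}
else
{
    LWLockAcquire(lock, LW_SHARED);
    /* read from the shared buffer */
}
LWLockRelease(lock);

/* correct: acquire and release stay on the same path */
if (have_it_in_memory)
{
    /* read without the lock */
}
else
{
    LWLockAcquire(lock, LW_SHARED);
    /* read from the shared buffer */
    LWLockRelease(lock);
}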

> >
> > BTW, an actually generic iterator interface would probably look more like 
> > this:
> >
> > typedef bool (*SatisfyUndoRecordCallback)(void *callback_data,
> > UnpackedUndoRecord *uur);
> Right, it should be this way.
>
> > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr
> > *found, SatisfyUndoRecordCallback callback, void *callback_data);
> >
> > Now we're not assuming anything about what parts of the record the
> > callback wants to examine.  It can do whatever it likes.  Typically
> > with this sort of interface a caller will define a file-private struct
> > that is known to both the callback and the caller of UndoFetchRecord,
> > but not elsewhere.
> >
> > If we decide we need an iterator within the undo machinery itself,
> > then I think it should look like the above, and I think it should
> > accept NULL for found, callback, and callback_data, so that somebody
> > who wants to just look up a record, full stop, can do just:
> >
> > UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL);
> >
> > which seems tolerable.
> >
> I agree with this.
>

+1.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Minimal logical decoding on standbys

2019-06-20 Thread Amit Khandekar
I am yet to work on Andres's latest detailed review comments, but
before that I thought I should submit a patch for the issue reported
below, because I was almost ready with the fix. Now I will start to
work on Andres's comments, for which I will reply separately.

On Fri, 1 Mar 2019 at 13:33, tushar  wrote:
>
> Hi,
>
> While testing this feature I found that if lots of inserts happen on
> the master cluster, then pg_recvlogical does not show the DATA
> information on a logical replication slot created on the SLAVE.
>
> Please refer this scenario -
>
> 1)
> Create a Master cluster with wal_level=logcal and create logical
> replication slot -
>   SELECT * FROM pg_create_logical_replication_slot('master_slot',
> 'test_decoding');
>
> 2)
> Create a Standby  cluster using pg_basebackup ( ./pg_basebackup -D
> slave/ -v -R)  and create logical replication slot -
> SELECT * FROM pg_create_logical_replication_slot('standby_slot',
> 'test_decoding');
>
> 3)
> X terminal - start  pg_recvlogical  , provide port= ( slave
> cluster)  and specify slot=standby_slot
> ./pg_recvlogical -d postgres  -p  -s 1 -F 1  -v --slot=standby_slot
> --start -f -
>
> Y terminal - start  pg_recvlogical  , provide port=5432 ( master
> cluster)  and specify slot=master_slot
> ./pg_recvlogical -d postgres  -p 5432 -s 1 -F 1  -v --slot=master_slot
> --start -f -
>
> Z terminal - run pg_bench  against Master cluster ( ./pg_bench -i -s 10
> postgres)
>
> Able to see DATA information on Y terminal  but not on X.
>
> but the same can be seen by firing the below query on the SLAVE cluster -
>
> SELECT * FROM pg_logical_slot_get_changes('standby_slot', NULL, NULL);
>
> Is it expected ?

Actually it does show the records, but only after quite a long time.
In general, the walsender on the standby sends each record after a
significant delay (1 sec), and pg_recvlogical shows all the inserted
records only after the commit, so for huge inserts it looks like it is
hanging forever.

In XLogSendLogical(), GetFlushRecPtr() was used to get the flush
point. On a standby, GetFlushRecPtr() does not give a valid value, so
it was wrongly determined that the record just sent is beyond the
flush point, as a result of which WalSndCaughtUp was set to true,
causing WalSndLoop() to sleep for some duration after every record.
This is why pg_recvlogical appears to hang forever when a huge number
of rows is inserted.

Fix: Use GetStandbyFlushRecPtr() if am_cascading_walsender.
Attached patch v8.
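
The gist of the change in XLogSendLogical() is roughly the following
(the attached patch is authoritative):

XLogRecPtr  flushPtr;

/*
 * On a standby (cascading walsender), GetFlushRecPtr() is not
 * meaningful; use the standby's own flush/replay position instead.
 */
if (am_cascading_walsender)
    flushPtr = GetStandbyFlushRecPtr();
else
    flushPtr = GetFlushRecPtr();

if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
    WalSndCaughtUp = true;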


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


logical-decoding-on-standby_v8.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-06-20 Thread Dilip Kumar
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas  wrote:
>
> On Tue, Jun 18, 2019 at 2:07 PM Robert Haas  wrote:
> > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila  wrote:
> > > [ new patches ]
> >
> > I tried writing some code [ to use these patches ].
>
> I spent some more time experimenting with this patch set today and I
> think that the UndoFetchRecord interface is far too zheap-centric.  I
> expected that I would be able to do this:
>
> UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> // do stuff with uur
> UndoRecordRelease(uur);
>
> But I can't, because the UndoFetchRecord API requires me to pass not
> only an undo record but also a block number, an offset number, an XID,
> and a callback.  I think I could get the effect that I want by
> defining a callback that always returns true.  Then I could do:
>
> UndoRecPtr junk;
> UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber,
> InvalidOffsetNumber, &junk, always_returns_true);
> // do stuff with uur
> UndoRecordRelease(uur);
>
> That seems ridiculously baroque.  I think the most common thing that
> an AM will want to do with an UndoRecPtr is look up that exact record;
> that is, for example, what zedstore will want to do. However, even if
> some AMs, like zheap, want to search backward through a chain of
> records, there's no real reason to suppose that all of them will want
> to search by block number + offset.  They might want to search by some
> bit of data buried in the payload, for example.
>
> I think the basic question here is whether we really need anything
> more complicated than:
>
> extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp);
>
> I mean, if you had that, the caller can implement looping easily
> enough, and insert any test they want:
>
> for (;;)
> {
> UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> if (i like this one)
> break;
> urp = uur->uur_blkprev; // should be renamed, since zedstore +
> probably others will have tuple chains not block chains
> UndoRecordRelease(uur);
> }

The idea behind having the loop inside the undo machinery was that
while traversing the blkprev chain, we can read all the undo records
on the same undo page under one buffer lock.

>
> The question in my mind is whether there's some performance advantage
> of having the undo layer manage the looping rather than the caller do
> it.  If there is, then there's a lot of zheap code that ought to be
> changed to use it, because it's just using the same satisfies-callback
> everywhere.  If there's not, we should just simplify the undo record
> lookup along the lines mentioned above and put all the looping into
> the callers.  zheap could provide a wrapper around UndoFetchRecord
> that does a search by block and offset, so that we don't have to
> repeat that logic in multiple places.
>
> BTW, an actually generic iterator interface would probably look more like 
> this:
>
> typedef bool (*SatisfyUndoRecordCallback)(void *callback_data,
> UnpackedUndoRecord *uur);
Right, it should be this way.

> extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr
> *found, SatisfyUndoRecordCallback callback, void *callback_data);
>
> Now we're not assuming anything about what parts of the record the
> callback wants to examine.  It can do whatever it likes.  Typically
> with this sort of interface a caller will define a file-private struct
> that is known to both the callback and the caller of UndoFetchRecord,
> but not elsewhere.
>
> If we decide we need an iterator within the undo machinery itself,
> then I think it should look like the above, and I think it should
> accept NULL for found, callback, and callback_data, so that somebody
> who wants to just look up a record, full stop, can do just:
>
> UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL);
>
> which seems tolerable.
>
I agree with this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Implementing Incremental View Maintenance

2019-06-20 Thread Yugo Nagata
Hi hackers,

Thank you for your many questions and feedbacks at PGCon 2019.
Attached is the patch rebased for the current master branch.

Regards,
Yugo Nagata

On Tue, 14 May 2019 15:46:48 +0900
Yugo Nagata  wrote:

> On Mon, 1 Apr 2019 12:11:22 +0900
> Yugo Nagata  wrote:
> 
> > On Thu, 27 Dec 2018 21:57:26 +0900
> > Yugo Nagata  wrote:
> > 
> > > Hi,
> > > 
> > > I would like to implement Incremental View Maintenance (IVM) on 
> > > PostgreSQL.  
> > 
> > I am now working on an initial patch for implementing IVM on PostgreSQL.
> > This enables materialized views to be updated incrementally after one
> > of their base tables is modified.
> 
> Attached is a WIP patch of Incremental View Maintenance (IVM).
> Major part is written by me, and changes in syntax and pg_class
> are Hoshiai-san's work.
> 
> Although this is still a work-in-progress draft patch, any
> suggestions or thoughts would be appreciated.
>  
> * What it is
> 
> This allows a kind of Immediate Maintenance of materialized views. If a
> materialized view is created by the CREATE INCREMENTAL MATERIALIZED VIEW
> command, the contents of the matview are updated automatically and
> incrementally after its base tables are updated. Note that this syntax is
> just tentative, so it may be changed.
> 
> == Example 1 ==
> postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m AS SELECT * FROM t0;
> SELECT 3
> postgres=# SELECT * FROM m;
>  i 
> ---
>  3
>  2
>  1
> (3 rows)
> 
> postgres=# INSERT INTO t0 VALUES (4);
> INSERT 0 1
> postgres=# SELECt * FROM m; -- automatically updated
>  i 
> ---
>  3
>  2
>  1
>  4
> (4 rows)
> =
> 
> This implementation also supports matviews including duplicate tuples or
> DISTINCT clause in its view definition query.  For example, even if a matview
> is defined with DISTINCT to remove duplication of tuples in a base table, this
> can perform incremental updates of the matview properly. That is, the contents
> of the matview don't change when existing tuples are inserted into the base
> tables, and a tuple in the matview is deleted only when duplicity of the
> corresponding tuple in the base table becomes zero.
> 
> This is due to a "counting algorithm" in which the number of each tuple is
> stored in matviews as a special column value.
> 
> == Example 2 ==
> postgres=# SELECT * FROM t1;
>  id | t 
> +---
>   1 | A
>   2 | B
>   3 | C
>   4 | A
> (4 rows)
> 
> postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m1 AS SELECT t FROM t1;
> SELECT 3
> postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m2 AS SELECT DISTINCT t FROM 
> t1;
> SELECT 3
> postgres=# SELECT * FROM m1; -- with duplicity
>  t 
> ---
>  A
>  A
>  C
>  B
> (4 rows)
> 
> postgres=# SELECT * FROM m2;
>  t 
> ---
>  A
>  B
>  C
> (3 rows)
> 
> postgres=# INSERT INTO t1 VALUES (5, 'B');
> INSERT 0 1
> postgres=# DELETE FROM t1 WHERE id IN (1,3); -- delete (1,A),(3,C)
> DELETE 2
> postgres=# SELECT * FROM m1; -- one A left and one more B
>  t 
> ---
>  B
>  B
>  A
> (3 rows)
> 
> postgres=# SELECT * FROM m2; -- only C is removed
>  t 
> ---
>  B
>  A
> (2 rows)
> =
> 
> * How it works
> 
> 1. Creating matview
> 
> When a matview is created, AFTER triggers are internally created
> on its base tables. When the base tables is modified (INSERT, DELETE,
> UPDATE), the matview is updated incrementally in the trigger function.
> 
> When populating the matview, GROUP BY and count(*) are added to the
> view definition query before this is executed for counting duplicity
> of tuples in the matview. The result of count is stored in the matview
> as a special column named "__ivm_count__". 
> 
> 2. Maintenance of matview
> 
> When base tables are modified, the change set of a table can be
> referred to as an Ephemeral Named Relation (ENR) thanks to Transition
> Tables (a trigger feature implemented since PG10). We can calculate the
> diff set of the matview by replacing the base table in the view
> definition query with the ENR (at least if it is a
> Selection-Projection-Join view).  As at view definition time, GROUP BY
> and count(*) are added in order to count the duplicity of tuples in the
> diff set. As a result, two diff sets (to be deleted from and to be
> inserted into the matview) are calculated, and the results are stored
> into temporary tables respectively.
> 
> The matview is updated by merging these change sets. Instead of simply
> executing DELETE or INSERT, the value of the __ivm_count__ column in the
> matview is decreased or increased. When the value becomes zero, the
> corresponding tuple is deleted from the matview.
> 
> 3. Access to matview
> 
> When SELECT is issued for IVM matviews defined with DISTINCT, all columns
> except __ivm_count__ of each tuple in the matview are returned.  This is
> correct because duplicity of tuples is eliminated by GROUP BY.
> 
> When DISTINCT is not used, SELECT for the IVM matviews returns each tuple
> __ivm_count__ times. Currently, this is implemented by

unlogged sequences

2019-06-20 Thread Peter Eisentraut
The discussion in bug #15631 revealed that serial/identity sequences of
temporary tables should really also be temporary (easy), and that
serial/identity sequences of unlogged tables should also be unlogged.
But there is no support for unlogged sequences, so I looked into that.

If you copy the initial sequence relation file to the init fork, then
this all seems to work out just fine.  Attached is a patch.  The
low-level copying seems to be handled quite inconsistently across the
code, so I'm not sure what the most appropriate way to do this would be.
 I'm looking for feedback from those who have worked on tableam and
storage manager to see what the right interfaces are or whether some new
interfaces might perhaps be appropriate.
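
For reference, the low-level part amounts to little more than this
(a hand-written sketch of the idea, not the code in the attached patch;
error handling omitted):

/* after the sequence's single data page has been written */
if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
{
    PGAlignedBlock buf;
    Buffer         srcbuf;

    RelationOpenSmgr(rel);
    smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
    log_smgrcreate(&rel->rd_node, INIT_FORKNUM);

    /* a sequence always has exactly one block */
    srcbuf = ReadBuffer(rel, 0);
    LockBuffer(srcbuf, BUFFER_LOCK_SHARE);
    memcpy(buf.data, BufferGetPage(srcbuf), BLCKSZ);
    UnlockReleaseBuffer(srcbuf);

    PageSetChecksumInplace((Page) buf.data, 0);
    smgrextend(rel->rd_smgr, INIT_FORKNUM, 0, buf.data, false);
    log_newpage(&rel->rd_node, INIT_FORKNUM, 0, (Page) buf.data, true);
    smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
}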

(What's still missing in this patch is ALTER SEQUENCE SET
{LOGGED|UNLOGGED} as well as propagating the analogous ALTER TABLE
command to owned sequences.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c15d00d102a7dc2e13d7cc239b27cdcd189bac19 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 4 May 2019 17:04:45 +0200
Subject: [PATCH v1 1/4] Make command order in test more sensible

---
 src/test/regress/expected/sequence.out | 2 +-
 src/test/regress/sql/sequence.sql  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/test/regress/expected/sequence.out 
b/src/test/regress/expected/sequence.out
index a0d2b22d3c..75eb5015cf 100644
--- a/src/test/regress/expected/sequence.out
+++ b/src/test/regress/expected/sequence.out
@@ -599,7 +599,6 @@ DROP SEQUENCE seq2;
 -- should fail
 SELECT lastval();
 ERROR:  lastval is not yet defined in this session
-CREATE USER regress_seq_user;
 -- Test sequences in read-only transactions
 CREATE TEMPORARY SEQUENCE sequence_test_temp1;
 START TRANSACTION READ ONLY;
@@ -623,6 +622,7 @@ SELECT setval('sequence_test2', 1);  -- error
 ERROR:  cannot execute setval() in a read-only transaction
 ROLLBACK;
 -- privileges tests
+CREATE USER regress_seq_user;
 -- nextval
 BEGIN;
 SET LOCAL SESSION AUTHORIZATION regress_seq_user;
diff --git a/src/test/regress/sql/sequence.sql 
b/src/test/regress/sql/sequence.sql
index a7b9e63372..7928ee23ee 100644
--- a/src/test/regress/sql/sequence.sql
+++ b/src/test/regress/sql/sequence.sql
@@ -272,8 +272,6 @@ CREATE SEQUENCE seq2;
 -- should fail
 SELECT lastval();
 
-CREATE USER regress_seq_user;
-
 -- Test sequences in read-only transactions
 CREATE TEMPORARY SEQUENCE sequence_test_temp1;
 START TRANSACTION READ ONLY;
@@ -287,6 +285,8 @@ CREATE TEMPORARY SEQUENCE sequence_test_temp1;
 
 -- privileges tests
 
+CREATE USER regress_seq_user;
+
 -- nextval
 BEGIN;
 SET LOCAL SESSION AUTHORIZATION regress_seq_user;

base-commit: 414cca40d506dd3f17b49ae3139853139192c2ba
-- 
2.22.0

From a66fbef1fc05bd4d5907851bf93c4e268cd477ab Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 11 Jun 2019 16:40:41 +0200
Subject: [PATCH v1 2/4] Fix comment

The last argument of smgrextend() was renamed from isTemp to skipFsync
in debcec7dc31a992703911a9953e299c8d730c778, but the comments at two
call sites were not updated.
---
 src/backend/access/heap/rewriteheap.c | 7 +++
 src/backend/catalog/storage.c | 6 +++---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c 
b/src/backend/access/heap/rewriteheap.c
index 369694fa2e..916231e2c4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -703,10 +703,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
true);
 
/*
-* Now write the page. We say isTemp = true even if 
it's not a
-* temp table, because there's no need for smgr to 
schedule an
-* fsync for this write; we'll do it ourselves in
-* end_heap_rewrite.
+* Now write the page. We say skipFsync = true because 
there's no
+* need for smgr to schedule an fsync for this write; 
we'll do it
+* ourselves in end_heap_rewrite.
 */
RelationOpenSmgr(state->rs_new_rel);
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..d6112fa535 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -358,9 +358,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
PageSetChecksumInplace(page, blkno);
 
/*
-* Now write the page.  We say isTemp = true even if it's not a 
temp
-* rel, because there's no need for smgr to schedule an fsync 
for this
-* write; we'll do it ourselves below.
+* Now write the page.  We say skipFsync = true because there's 
no
+   

Disconnect from SPI manager on error

2019-06-20 Thread RekGRpth
A patch fixing this bug
https://www.postgresql.org/message-id/flat/15738-21723084f3009ceb%40postgresql.org
From 0144733c9f128108670f3654605f274928d83096 Mon Sep 17 00:00:00 2001
From: RekGRpth 
Date: Fri, 26 Apr 2019 15:35:30 +0500
Subject: Disconnect from SPI manager on error


diff --git a/src/pl/plpgsql/src/pl_handler.c b/src/pl/plpgsql/src/pl_handler.c
index 61452d9f7f..387b283b03 100644
--- a/src/pl/plpgsql/src/pl_handler.c
+++ b/src/pl/plpgsql/src/pl_handler.c
@@ -267,6 +267,13 @@ plpgsql_call_handler(PG_FUNCTION_ARGS)
 		/* Decrement use-count, restore cur_estate, and propagate error */
 		func->use_count--;
 		func->cur_estate = save_cur_estate;
+
+		/*
+		 * Disconnect from SPI manager
+		 */
+		if ((rc = SPI_finish()) != SPI_OK_FINISH)
+			elog(ERROR, "SPI_finish failed: %s", SPI_result_code_string(rc));
+
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
@@ -364,6 +371,12 @@ plpgsql_inline_handler(PG_FUNCTION_ARGS)
 		/* ... so we can free subsidiary storage */
 		plpgsql_free_function_memory(func);
 
+		/*
+		 * Disconnect from SPI manager
+		 */
+		if ((rc = SPI_finish()) != SPI_OK_FINISH)
+			elog(ERROR, "SPI_finish failed: %s", SPI_result_code_string(rc));
+
 		/* And propagate the error */
 		PG_RE_THROW();
 	}


Re: Multivariate MCV list vs. statistics target

2019-06-20 Thread Dean Rasheed
On Tue, 18 Jun 2019 at 22:34, Tomas Vondra  wrote:
>
> One slightly inconvenient thing I realized while playing with the
> address data set is that it's somewhat difficult to set the desired size
> of the multi-column MCV list.
>
> At the moment, we simply use the maximum statistics target of the attributes
> the MCV list is built on. But that does not allow keeping the default size
> for per-column stats while only increasing the size of multi-column MCV lists.
>
> So I'm thinking we should allow tweaking the statistics for extended
> stats, and serialize it in the pg_statistic_ext catalog. Any opinions
> why that would be a bad idea?
>

Seems reasonable to me. This might not be the only option we'll ever
want to add though, so perhaps a "stxoptions text[]" column along the
lines of a relation's reloptions would be the way to go.

> I suppose it should be part of the CREATE STATISTICS command, but I'm
> not sure what'd be the best syntax. We might also have something more
> similar to ALTER COLUMN, but perhaps
>
>  ALTER STATISTICS s SET STATISTICS 1000;
>
> looks a bit too weird.
>

Yes it does look a bit weird, but that's the natural generalisation of
what we have for per-column statistics, so it's probably preferable to
do that rather than invent some other syntax that wouldn't be so
consistent.

Regards,
Dean