Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

2019-03-08 Thread Noah Misch
On Wed, Mar 06, 2019 at 07:24:22PM -0800, Noah Misch wrote:
> On Mon, Mar 04, 2019 at 06:04:20PM +0900, Kyotaro HORIGUCHI wrote:
> > PGSharedMemoryCreate was changed to choose a usable shmem id using
> > IpcMemoryAnalyze().  But some of the statuses from IpcMemoryAnalyze()
> > are concealed by failure of PGSharedMemoryAttach() and silently
> > ignored, contrary to what the code intends to do.
> 
> SHMSTATE_FOREIGN is ignored silently.  The patch modifies the
> PGSharedMemoryAttach() header comment to direct callers to treat
> PGSharedMemoryAttach() failure like SHMSTATE_FOREIGN.  I don't think the code
> had an unintentional outcome due to the situation you describe.  Perhaps I
> could have made the situation clearer.

I think v3, attached, avoids this appearance of unintended behavior.

> > (By the way, SHMSTATE_EEXISTS seems to suggest the opposite of
> > EEXIST, which would be confusing.)
> 
> Good catch.  Is SHMSTATE_ENOENT better?

I did s/SHMSTATE_EEXISTS/SHMSTATE_ENOENT/.

> > PGSharedMemoryCreate() repeats shmat/shmdt twice in every
> > iteration.  It does little harm, but it would be better if we
> > could get rid of that.
> 
> I'll try to achieve that and see if the code turns out cleaner.

I renamed IpcMemoryAnalyze() to PGSharedMemoryAttach() and deleted the old
function of that name.  Now, this function never calls shmdt(); the caller is
responsible for that.  I do like things better this way.  What do you think?
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1070,14 +1070,10 @@ CreateLockFile(const char *filename, bool amPostmaster,
 		if (PGSharedMemoryIsInUse(id1, id2))
 			ereport(FATAL,
 					(errcode(ERRCODE_LOCK_FILE_EXISTS),
-					 errmsg("pre-existing shared memory block "
-							"(key %lu, ID %lu) is still in use",
+					 errmsg("pre-existing shared memory block (key %lu, ID %lu) is still in use",
 							id1, id2),
-					 errhint("If you're sure there are no old "
-							 "server processes still running, remove "
-							 "the shared memory block "
-							 "or just delete the file \"%s\".",
-							 filename)));
+					 errhint("Terminate any old server processes associated with data directory \"%s\".",
+							 refName)));
}
}
 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index c118f64..e8f6050 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -430,13 +430,13 @@ ifeq ($(enable_tap_tests),yes)
 define prove_installcheck
 rm -rf '$(CURDIR)'/tmp_check
 $(MKDIR_P) '$(CURDIR)'/tmp_check
-cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
+cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' REGRESS_SHLIB='$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
 endef
 
 define prove_check
 rm -rf '$(CURDIR)'/tmp_check
 $(MKDIR_P) '$(CURDIR)'/tmp_check
-cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
+cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' REGRESS_SHLIB='$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
 endef
 
 else
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 1511a61..c9f6e43 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -72,6 +72,26 @@
typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
typedef int IpcMemoryId;		/* shared memory ID returned by shmget(2) */

Re: Is it too soon for a PG12 open items wiki page?

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 13:03, Michael Paquier  wrote:
> It would be a good timing as some issues are already showing up.  And
> abracadabra:
> https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items

Thanks.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Ordered Partitioned Table Scans

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 09:14, Tom Lane  wrote:
> I took a quick look through this and I'm not very happy with it.
> It seems to me that the premise ought to be "just use an Append
> if we can prove the output would be ordered anyway", but that's not
> what we actually have here: instead you're adding more infrastructure
> onto Append, which notably involves invasive changes to the API of
> create_append_path, which is the main reason why the patch keeps breaking.

Can you suggest how else we could teach higher paths that an Append is
ordered by some path keys without giving the append some pathkeys?
That's what pathkeys are for, so I struggle to imagine how else this
could work.  If we don't do this, then how is a MergeJoin going to
know it does not need to sort before joining?

As for the "the patch keeps breaking"...  those are just conflicts
with other changes that have been made in master. That seems fairly
normal to me.

> (It's broken again as of HEAD, though the cfbot doesn't seem to have
> noticed yet.)

I think it's not been updating itself for a few days.

>  Likewise there's a bunch of added complication in
> cost_append, create_append_plan, etc.  I think you should remove all that
> and restrict this optimization to the case where all the subpaths are
> natively ordered --- if we have to insert Sorts, it's hardly going to move
> the needle to worry about simplifying the parent MergeAppend to Append.

I think the patch would be less than half as useful if we do that.
Can you explain why you think that fast startup plans are less
important for partitioned tables?

I could perhaps understand an argument against this if the patch added
masses of complex code to achieve the goal, but in my opinion, the
code is fairly easy to understand and there's not very much extra code
added.

> There also seem to be bits that duplicate functionality of the
> drop-single-child-[Merge]Append patch (specifically I'm looking
> at get_singleton_append_subpath).  Why do we need that?

Hmm, that patch is separate functionality. The patch you're talking
about, as you know, just removes Appends/MergeAppends that have a
single subpath.  Over here we add smarts to allow conversion of
MergeAppends into Appends when the order of the partitions is defined
the same as the required order of the would-be MergeAppend path.

get_singleton_append_subpath() allows a sub-partitioned table's
MergeAppend or Append subpath to be pulled into the top-level Append
when it contains just a single subpath.

An example from the tests:

 Append
   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
   ->  Merge Append
 Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
 ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
 ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def

If the nested MergeAppend path had just a single subpath then
get_singleton_append_subpath() would have pulled that subpath into the
top-level Append.  However, in this example, since there are multiple
MergeAppend subpaths, the pull-up would be invalid, because the top-level
Append can't guarantee the sort order of those MergeAppend subpaths.
In fact, the test directly after that one drops the mcrparted5_def
table, which turns the plan into:

 Append
   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a


> The logic in build_partition_pathkeys is noticeably stupider than
> build_index_pathkeys, in particular it's not bright about boolean columns.
> Maybe that's fine, but if so it deserves a comment explaining why we're
> not bothering.

Good point.  That's required to allow cases like:

SELECT * FROM parttable WHERE boolcol = true ORDER BY boolcol, ordercol;

I've fixed that in the attached.

>  Also, the comment for build_index_pathkeys specifies that
> callers should do truncate_useless_pathkeys, which they do; why is that
> not relevant here?

I've neglected to explain that in the comments.  The problem is that
doing so would break cases where we use an Append when the partition
keys are a prefix of the query's pathkeys.  Say we had a
range-partitioned table on (a, b) and an index on (a, b, c):

SELECT * FROM range_ab ORDER BY a, b, c;

With the current patch, we can use an Append for that, as no earlier
value of (a, b) can come in a later partition.  If we
truncate_useless_pathkeys() then it'll strip out all pathkeys as the
partition pathkeys are not contained 

proposal: type info support functions for functions that use "any" type

2019-03-08 Thread Pavel Stehule
Hi,

Tom introduced support functions for calculating a function's selectivity.
I have a similar idea: use a support function to calculate a function's
parameter types and its return type.

Motivation:

Reduce the need for function overloading. My motivation is related
primarily to Orafce, but this feature should be helpful for anybody with
similar goals. Function overloading is great functionality, but it is
hard to maintain.

My idea is to enhance the CREATE FUNCTION command to be able to do

CREATE FUNCTION foo("any")
RETURNS "any" AS ...
TYPEINFO foo_typeinfo

CREATE FUNCTION decode(VARIADIC "any")
RETURNS "any" AS ...
TYPEINFO decode_typeinfo.

The typeinfo function returns a pointer to a structure with the parameter
types and the result type. Only functions with "any" parameters or an "any"
result can use a TYPEINFO support function. This functionality should not
be allowed for common functions.
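
Just to illustrate the idea (the names here are invented for this sketch;
nothing like this exists today), the support function could hand back
something like:

#include "postgres.h"

/* hypothetical result of a TYPEINFO support function */
typedef struct TypeInfo
{
	Oid			rettype;		/* resolved result type */
	int			nargs;			/* number of input arguments */
	Oid		   *argtypes;		/* resolved input argument types */
} TypeInfo;

/* hypothetical signature: resolve the signature from the actual argument types */
extern TypeInfo *decode_typeinfo(int nargs, Oid *actual_argtypes);

The parser would consult this whenever it needs the concrete signature of
such a function.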

This functionality is limited to C coders, but I expect that the typical
application coder doesn't need it. It doesn't replace my proposal to
introduce another polymorphic type - currently named "commontype" (it can
be named differently). The commontype is a good enough solution for
application coders and developers.

Comments, notes?

Regards

Pavel


Re: \describe*

2019-03-08 Thread Pavel Stehule
Hi


> Since this is now waiting for v13, there's a bit more time to entertain
> the question of whether we'd rather have these in psql or in a new server
> command DESCRIBE [verbose] [system], and if so, whether the output of that
> would itself be query-able or not.
>

Including this feature in core could be nice. If these commands are on the
server side, then they should produce their results via the API, like
EXPLAIN does. That's all.

Regards

Pavel


Re: Hash index initial size is too large given NULLs or partial indexes

2019-03-08 Thread Amit Kapila
On Fri, Mar 8, 2019 at 11:57 PM Tomas Vondra
 wrote:
> On 3/8/19 7:14 PM, Jeff Janes wrote:
>
> > This goes back to when the pre-sizing was implemented in 2008
> > (c9a1cc694abef737548a2a).  It seems to be an oversight, rather than
> > something that was considered.
> >
> > Is this a bug that should be fixed?  Or if getting a more accurate
> > estimate is not possible or not worthwhile, add a code comment about that?
> >
>
> I'd agree this smells like a bug (or perhaps two). The sizing probably
> should consider both null_frac and selectivity of the index predicate.
>

Like you guys, I also think this area needs improvement.  I am not
sure how easy it is to get the selectivity of the predicate in this
code path. If we see how we do it in set_plain_rel_size() during path
generation in the planner, we can get some idea.

Another idea could be that we don't create the buckets till we know
the exact tuples returned by IndexBuildHeapScan.  Basically, I think
we need to spool the tuples, create the appropriate buckets and then
insert the tuples.  We might want to do this only when some index
predicate is present.

If somebody is interested in doing the leg work, I can help in
reviewing the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Pavel Stehule
On Fri, 8 Mar 2019 at 19:48, Pavel Stehule  wrote:

>
>
> On Fri, 8 Mar 2019 at 19:44, Alvaro Herrera  wrote:
>
>> On 2019-Mar-08, Alvaro Herrera wrote:
>>
>> > > Maybe we can explicitly call xmlFreeDoc instead of xmlFreeNode
>> > >
>> > > some like
>> > >
>> > > if (cur_copy->type == XML_DOCUMENT_NODE)
>> > >   xmlFreeDoc((xmlDocPtr) cur_copy);
>> > > else
>> > >   xmlFreeNode(cur_copy);
>> > >
>> > > This looks like the most correct fix to me. What do you think?
>> >
>> > Seems like that should work, yeah ...
>>
>> Something like this perhaps?  Less repetitive ...
>>
>
Thank you for commit.

The commit message is not correct. xmlCopyNode does its work well, but
the node is already broken, and because we have to call xmlFreeNode, we
have a problem.

regards

Pavel


> +1
>
> Pavel
>
>
>> --
>> Álvaro Herrera                https://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>


Re: Pluggable Storage - Andres's take

2019-03-08 Thread Andres Freund
Hi,

On 2019-03-09 11:03:21 +1100, Haribabu Kommi wrote:
> Here I attached the rebased patches that I shared earlier. I am adding the
> comments to explain the API's in the code, will share those patches later.

I've started to add those for the callbacks in the first commit. I'd
appreciate a look!

I think I'll include the docs patches, sans the callback documentation,
in the next version. I'll probably merge them into one commit, if that's
OK with you?


> I observed a crash with the latest patch series in the COPY command.

Hm, which version was this? I'd at some point accidentally posted a
'tmp' commit that was just a performance hack.


Btw, your patches are always attached out of order:
https://www.postgresql.org/message-id/CAJrrPGd%2Brkz54wE-oXRojg4XwC3bcF6bjjRziD%2BXhFup9Q3n2w%40mail.gmail.com
10, 1, 3, 4, 2 ...

Greetings,

Andres Freund



Re: Should we increase the default vacuum_cost_limit?

2019-03-08 Thread Tom Lane
I wrote:
> [ worries about overflow with VacuumCostLimit approaching INT_MAX ]

Actually, now that I think a bit harder, that disquisition was silly.
In fact, I'm inclined to argue that the already-committed patch
is taking the wrong approach, and we should revert it in favor of a
different idea.

The reason is this: what we want to do is throttle VACUUM's I/O demand,
and by "throttle" I mean "gradually reduce".  There is nothing gradual
about issuing a few million I/Os and then sleeping for many milliseconds;
that'll just produce spikes and valleys in the I/O demand.  Ideally,
what we'd have it do is sleep for a very short interval after each I/O.
But that's not too practical, both for code-structure reasons and because
most platforms don't give us a way to so finely control the length of a
sleep.  Hence the design of sleeping for awhile after every so many I/Os.

However, the current settings are predicated on the assumption that
you can't get the kernel to give you a sleep of less than circa 10ms.
That assumption is way outdated, I believe; poking around on systems
I have here, the minimum delay time using pg_usleep(1) seems to be
generally less than 100us, and frequently less than 10us, on anything
released in the last decade.

I propose therefore that instead of increasing vacuum_cost_limit,
what we ought to be doing is reducing vacuum_cost_delay by a similar
factor.  And, to provide some daylight for people to reduce it even
more, we ought to arrange for it to be specifiable in microseconds
not milliseconds.  There's no GUC_UNIT_US right now, but it's time.
(Perhaps we should also look into using other delay APIs, such as
nanosleep(2), where available, for sub-millisecond sleeps.)
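
(For illustration only --- a sub-millisecond sleep is easy enough where
nanosleep(2) exists; this is just a sketch, not a concrete proposal for
what pg_usleep should become:

#include <time.h>

static void
sleep_microseconds(long usec)
{
	struct timespec delay;

	delay.tv_sec = usec / 1000000L;
	delay.tv_nsec = (usec % 1000000L) * 1000L;
	(void) nanosleep(&delay, NULL);		/* ignoring EINTR for this sketch */
}

EINTR handling and a Windows fallback are hand-waved away here.)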

I don't have any particular objection to kicking up the maximum
value of vacuum_cost_limit by 10X or so, if anyone's hot to do that.
But that's not where we ought to be focusing our concern.  And there
really is a good reason, not just nannyism, not to make that
setting huge --- it's just the wrong thing to do, as compared to
reducing vacuum_cost_delay.

regards, tom lane



Re: Re: proposal: make NOTIFY list de-duplication optional

2019-03-08 Thread Thomas Munro
On Fri, Mar 8, 2019 at 1:37 PM Filip Rembiałkowski
 wrote:
> See attached patch... I'm ready to work on it so it can get merged in the next 
> CF.

Hi Filip,

Seen on Travis:

 async... FAILED  126 ms

Looks like the new error isn't being raised for invalid send mode?
(What kind of error message is "?" anyway? :-))

 ERROR:  channel name too long
 -- Should fail. Invalid 3rd parameter
 NOTIFY notify_async2, 'test', 'invalid';
-ERROR:  ?
 NOTIFY notify_async2, 'test', true;
-ERROR:  ?
 --Should work. Valid NOTIFY/LISTEN/UNLISTEN commands
 NOTIFY notify_async2;
 NOTIFY notify_async2, '';

-- 
Thomas Munro
https://enterprisedb.com



Re: Should we increase the default vacuum_cost_limit?

2019-03-08 Thread Tom Lane
Andrew Dunstan  writes:
> Increase it to what?

Jeff's opinion that it could be INT_MAX without causing trouble is
a bit over-optimistic, see vacuum_delay_point():

if (VacuumCostActive && !InterruptPending &&
VacuumCostBalance >= VacuumCostLimit)
{
int			msec;

msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;

In the first place, if VacuumCostLimit is too close to INT_MAX,
it'd be possible for VacuumCostBalance (also an int) to overflow
between visits to vacuum_delay_point, wrapping around to negative
and thus missing the need to nap altogether.

In the second place, since large VacuumCostLimit implies large
VacuumCostBalance when we do get through this if-check, there's
a hazard of integer overflow in the VacuumCostDelay * VacuumCostBalance
multiplication.  The final value of the msec calculation should be
easily within integer range, since VacuumCostDelay is constrained to
not be very large, but that's no help if we have intermediate overflow.

Possibly we could forestall both of those problems by changing
VacuumCostBalance to double, but that would make the cost
bookkeeping noticeably more expensive than it used to be.
I think it'd be better to keep VacuumCostBalance as int,
which would then mean we'd better limit VacuumCostLimit to no
more than say INT_MAX/2 --- call it 1 billion for the sake of
a round number.

That'd still leave us at risk of an integer overflow in the
msec-to-sleep calculation, but that calculation could be changed
to double at little price, since once we get here we're going
to sleep awhile anyway.
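
(Roughly, and untested, using the variable names already visible in the
vacuum_delay_point() snippet above --- note the existing cap at
VacuumCostDelay * 4 is kept, only the intermediate arithmetic changes:

double		msec;

msec = (double) VacuumCostDelay *
	(double) VacuumCostBalance / (double) VacuumCostLimit;
if (msec > VacuumCostDelay * 4)
	msec = VacuumCostDelay * 4;

so the per-I/O bookkeeping stays in plain ints.)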

BTW, don't forget autovacuum_cost_limit should have the same maximum.

regards, tom lane



Re: dropdb --force

2019-03-08 Thread Thomas Munro
On Wed, Mar 6, 2019 at 1:39 PM Filip Rembiałkowski
 wrote:
> Here is Pavel's patch rebased to master branch, added the dropdb
> --force option, a test case & documentation.

Hello,

cfbot.cputube.org says this fails on Windows, due to a missing semicolon here:

#ifdef HAVE_SETSID
kill(-(proc->pid), SIGTERM);
#else
kill(proc->pid, SIGTERM)
#endif

The test case failed on Linux, I didn't check why exactly:

Test Summary Report
---
t/050_dropdb.pl (Wstat: 65280 Tests: 13 Failed: 2)
Failed tests: 12-13
Non-zero exit status: 255
Parse errors: Bad plan. You planned 11 tests but ran 13.

+/* Time to sleep after isuing SIGTERM to backends */
+#define TERMINATE_SLEEP_TIME 1

s/isuing/issuing/

But, hmm, this macro doesn't actually seem to be used in the patch.
Wait, is that because the retry loop forgot to actually include the
sleep?

+/* without "force" flag raise exception immediately, or after
5 minutes */

Normally we call it an "error", not an "exception".

-- 
Thomas Munro
https://enterprisedb.com



Re: Should we increase the default vacuum_cost_limit?

2019-03-08 Thread Andrew Dunstan


On 3/8/19 6:47 PM, David Rowley wrote:
> On Sat, 9 Mar 2019 at 07:10, Tom Lane  wrote:
>> Jeff Janes  writes:
>>> Now that this is done, the default value is only 5x below the hard-coded
>>> maximum of 10,000.
>>> This seems a bit odd, and not very future-proof.  Especially since the
>>> hard-coded maximum appears to have no logic to it anyway, at least none
>>> that is documented.  Is it just mindless nannyism?
>> Hm.  I think the idea was that rather than setting it to "something very
>> large", you'd want to just disable the feature via vacuum_cost_delay.
>> But I agree that the threshold for what is ridiculously large probably
>> ought to be well more than 5x the default, and maybe it is just mindless
>> nannyism to have a limit less than what the implementation can handle.
> Yeah, +1 to increasing it.  I imagine that the 10,000 limit would not
> allow people to explore the upper limits of a modern PCI-E SSD with
> the standard delay time and dirty/miss scores.  Also, it doesn't seem
> entirely unreasonable that someone somewhere might also want to
> fine-tune the hit/miss/dirty scores so that they're some larger factor
> apart from each other than the standard scores are. The 10,000 limit does
> not allow much wiggle room for that.
>


Increase it to what?


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Reaping Temp tables to avoid XID wraparound

2019-03-08 Thread Michael Paquier
On Fri, Mar 08, 2019 at 11:14:46AM -0800, Magnus Hagander wrote:
> On Mon, Feb 25, 2019 at 10:45 PM Michael Paquier 
> wrote:
>> One problem that I can see with your patch is that you would set the
>> XID once any temporary object created, including when objects other
>> than tables are created in pg_temp, including functions, etc.  And it
>> does not really matter for wraparound monitoring.  Still, the patch is
>> simple..
> 
> I'm not entirely sure what you mean here. Sure, it will log it even when a
> temp function is created, but the namespace is still created then is it
> not?

What I mean here is: imagine the case of a session which creates a
temporary function, creating the temporary schema as well, but creates
no other temporary objects.  In this case we don't really care about
the wraparound issue because, even if we have a temporary schema, we
do not have temporary relations.  And this could confuse the user?
Perhaps that's not worth bothering about; still, not all temporary
objects are tables.
--
Michael




Re: pg_basebackup ignores the existing data directory permissions

2019-03-08 Thread Haribabu Kommi
On Fri, Mar 8, 2019 at 11:59 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> pg_basebackup copies the data directory permission mode from the
> upstream server.  But it doesn't copy the ownership.  So if say the
> upstream server allows group access and things are owned by
> postgres:postgres, and I want to make a copy for local development and
> make a backup into a directory owned by peter:staff without group
> access, then it would be inappropriate for pg_basebackup to change the
> permissions on that directory.
>

Yes, I agree that it may be a problem if the existing data directory
permissions are 0700 and we change them to 0750. But it may not be a
problem for scenarios where the existing data directory permissions are
>= 0750 and we change them to the upstream permissions. The user must
change them anyway to start the server (otherwise server start fails),
and the files inside the data folder already follow the permissions of
the upstream data directory.

Usually production systems follow the same permissions as the upstream;
I don't see a problem in following the same for the development
environment also.

Comments?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2019-03-08 Thread Michael Paquier
On Fri, Mar 08, 2019 at 05:05:53PM -0500, Andrew Dunstan wrote:
> I notice that this seems never to have been acted on. I think we should
> apply this and remove the (confusing) message setting for the case we'll
> now be avoiding. If not we should at least comment there that this is a
> case we only expect to see in pathological cases.

Sorry for dropping the ball; I had assumed that Robert would handle it,
as he is at the origin of the introduction of the aggressive option via
fd31cd26.

+elog(DEBUG1, "relation %d has been vacuumd ocncurrently, skip",
The proposed patch has two typos in two words.

I am adding an open item about that.  I think I could commit the
patch, but I need to study it a bit more first.
--
Michael




Re: Ordered Partitioned Table Scans

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 10:52, Tom Lane  wrote:
>
> Julien Rouhaud  writes:
> > On Fri, Mar 8, 2019 at 9:15 PM Tom Lane  wrote:
> >> I think you should remove all that
> >> and restrict this optimization to the case where all the subpaths are
> >> natively ordered --- if we have to insert Sorts, it's hardly going to move
> >> the needle to worry about simplifying the parent MergeAppend to Append.
>
> > This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> > x".  Even if the first subpath(s) aren't natively ordered, not all of
> > the sorts should actually be performed.
>
> [ shrug... ] We've got no realistic chance of estimating such situations
> properly, so I'd have no confidence in a plan choice based on such a
> thing.

With all due respect, I'd say that's not even close to being true.

A MergeAppend's startup cost ends up set to the sum of all of its
subplans' startup costs, plus the cost of any Sort that will be required
if a subpath is not already sufficiently ordered.  An Append's startup
cost will just be the startup cost of the first subpath. That's possible
because, unlike MergeAppend, we don't need to pull the first tuple out
of every subnode to find the lowest one.  In Julien's case, such an
Append plan has the potential of weighing in massively cheaper than a
MergeAppend plan.  Just imagine some large sorts in some later
subpath.

Can you explain why you think that's not properly being estimated in the patch?

> Nor do I believe that this case is all that important.

Can you explain why you believe that?

I see you were the author of b1577a7c78d which was committed over 19
years ago, so I'm surprised to hear you say cheap startup plans are
not important.  Or is it that you just don't think they're important
for partitioned tables?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: \describe*

2019-03-08 Thread Corey Huinker
On Mon, Mar 4, 2019 at 1:45 PM Corey Huinker 
wrote:

>
>>> - Tab completion for \describe-verbose.
>>> I know that \d+ tab completion is also not there, but I think we must
>>> have tab completion for \describe-verbose.
>>>
>>> postgres=# \describe-
>>> \describe-extension
>>>  \describe-replication-publication \describe-user-mapping
>>> \describe-foreign-data-wrapper
>>> \describe-replication-subscription\describe-view
>>> \describe-foreign-server  \describe-role
>>> \describe-window-function
>>> \describe-foreign-table   \describe-rule
>>>  ...
>>>
>>
> I just confirmed that there isn't tab completion for the existing S/+
> options, so it's hard to justify them for the equivalent verbose suffixes.
>

We can add completions for describe[-thing-]-verbose, but the
auto-completions start to run into combinatoric complexity, and the
original short-codes don't do that completion, probably for the same reason.

+   success =
>>> listTables("tvmsE", NULL, show_verbose, show_system);
>>> +   }
>>> +   status =
>>> PSQL_CMD_UNKNOWN;
>>>
>>>
> I'll look into this, thanks!
>

This was fixed, good find.



> - Confusion about \desc and \desC
>>> There is confusion while running the \desc command. I know the problem,
>>> but the user may be confused by this.
>>> postgres=# \desC
>>>List of foreign servers
>>>  Name | Owner | Foreign-data wrapper
>>> --+---+--
>>> (0 rows)
>>>
>>> postgres=# \desc
>>> Invalid command \desc. Try \? for help.
>>>
>>
I've changed the code to first strip out 0-1 instances of "-verbose" and
"-system"; the remaining string must then be an exact match of a describe
command, or it's an error. This same system could be applied to the short
commands to strip out 'S' and '+', and it might clean up the original code
a bit.
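
For what it's worth, the stripping itself is nothing clever; roughly
something like this (a sketch, not the exact patch code):

#include <stdbool.h>
#include <string.h>

/* If cmd ends with suffix, chop it off and return true. */
static bool
strip_suffix(char *cmd, const char *suffix)
{
	size_t		cmdlen = strlen(cmd);
	size_t		suflen = strlen(suffix);

	if (cmdlen > suflen && strcmp(cmd + cmdlen - suflen, suffix) == 0)
	{
		cmd[cmdlen - suflen] = '\0';
		return true;
	}
	return false;
}

called at most once each for "-verbose" and "-system" before the exact
match is attempted.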

>>> This command shows a list of relations, like "\d"
>>> postgres=# \describe-aggregatE-function
>>> List of relations
>>>  Schema | Name | Type  |  Owner
>>> +--+---+-
>>>  public | foo  | table | vagrant
>>> (1 row)
>>>
>>
Same issue, same fix.


>>> I have done a brief code review except for the documentation code. I
>>> don't like this code
>>>
>>> if (cmd_match(cmd,"describe-aggregate-function"))
>>>
>>>  success = describeAggregates(pattern, show_verbose, show_system);
>>>  else if (cmd_match(cmd, "describe-access-method"))
>>>  success = describeAccessMethods(pattern, show_verbose);
>>>  else if (cmd_match(cmd, "describe-tablespace"))
>>>  success = describeTablespaces(pattern, show_verbose);
>>>  else if (cmd_match(cmd, "describe-conversion"))
>>>  success = listConversions(pattern, show_verbose, show_system);
>>>  else if (cmd_match(cmd, "describe-cast"))
>>>  success = listCasts(pattern, show_verbose
>>>
>>>
>>> This can be achieved with the list/array/hash table, so I have changed
>>> that code in the attached patch just as a sample; if you want, I can do
>>> that for the whole code.
>>>
>>
> There's some problems with a hash table. The function signatures vary
> quite a lot, and some require additional psql_scan_slash_options to be
> called. The hash option, if implemented, probably should be expanded to all
> slash commands, at which point maybe it belongs in psqlscanslash.l...
>

As I suspected, there's a lot of variance in the function signatures of
the various listSomething()/describeSomething() commands, and
listDbRoleSettings requires a second pattern to be scanned, and as far
as I know PsqlScanState isn't known inside describe.h, so building and
using a hash table would be a lot of work for uncertain gain. The original
code just plows through strings in alphabetical order, breaking things up
by comparing leading characters, so I largely did the same at the
des/describe levels.

Instead of a hash table, it might be fun to write something that takes a
list of alphabetized strings and builds a binary search tree at compile
time, but that would only work for the long-form commands; the short forms
that allow filters like df[anptw]+ and d[tvmisE]+ effectively defeat any
attempt at hashing or btree-ing that I can presently imagine.

Having said that, here's v3 of the patch.

Since this is now waiting for v13, there's a bit more time to entertain the
question of whether we'd rather have these in psql or in a new server
command DESCRIBE [verbose] [system], and if so, whether the output of that
would itself be query-able or not.
From 6ff5ebd39a5bda3c2b398ac0b0062bd629f3c877 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Fri, 8 Mar 2019 23:38:10 +

Re: Performance issue in foreign-key-aware join estimation

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 08:05, Tom Lane  wrote:
> Setting the CF entry to WOA for now.  I wonder though if we should
> just push it out to v13 immediately --- are you intending to do more
> with it in the near future?

Thanks a lot for going over this and providing feedback.  I put the
patch out there mostly to see if such a thing is something we might
want, and it's good to see no objections. I didn't want to waste too
much time if there were going to be serious objections to the idea.

This is something I'd like to look into for v13.  I think it'll be
much easier to do if we can get your List reimplementation patch in
first. That's going to allow list_nth on PlannerInfo->eq_classes to
work quickly and will remove the need for having to build an array to
store the classes, and as you mention, RelOptInfo could store a
Bitmapset to store which ECs have members for this rel.   I've
modified the CF entry to target v13.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Fwd: Add tablespace tap test to pg_rewind

2019-03-08 Thread Michael Paquier
On Fri, Mar 08, 2019 at 09:42:29PM +0800, Shaoqi Bai wrote:
> There are already database TAP tests in pg_rewind; I wonder if there is a
> need for tablespace TAP tests in pg_rewind.  Attached is an initial
> patch from me.

When Heikki and I were working on the first version of pg_rewind for
VMware, tablespace support was added only afterwards, so what you
propose seems sensible to me.

+# Run the test in both modes.
+run_test('local');
Small nit: we should test for the remote mode here as well.
--
Michael




Re: Is it too soon for a PG12 open items wiki page?

2019-03-08 Thread Michael Paquier
On Fri, Mar 08, 2019 at 10:06:43PM +0900, Amit Langote wrote:
> On Fri, Mar 8, 2019 at 7:52 PM Julien Rouhaud  wrote:
>> It seems like a good timing to me.

It would be a good timing as some issues are already showing up.  And
abracadabra:
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items
--
Michael




Re: Pluggable Storage - Andres's take

2019-03-08 Thread Haribabu Kommi
On Thu, Mar 7, 2019 at 6:33 AM Andres Freund  wrote:

> Hi,
>
> On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> > My next steps are:
> > - final polish & push the basic DDL and pg_dump patches
>
> Done and pushed. Some collation dependent fallout, I'm hoping I've just
> fixed that.
>

Thanks for the corrections that I missed, also for the extra changes.

Here I attached the rebased patches that I shared earlier. I am adding the
comments to explain the API's in the code, will share those patches later.

I observed a crash with the latest patch series in the COPY command.
I am not sure whether the problem is with the reduce-the-use-of-tableOid
patch; I will check it and correct the problem.

Regards,
Haribabu Kommi
Fujitsu Australia


0010-Table-access-method-API-explanation.patch
Description: Binary data


0001-Reduce-the-use-of-HeapTuple-t_tableOid.patch
Description: Binary data


0003-Docs-of-default_table_access_method-GUC.patch
Description: Binary data


0004-Rename-indexam.sgml-to-am.sgml.patch
Description: Binary data


0002-Removal-of-scan_update_snapshot-callback.patch
Description: Binary data


0005-Reorganize-am-as-both-table-and-index.patch
Description: Binary data


0006-Doc-update-of-Create-access-method-type-table.patch
Description: Binary data


0009-Doc-of-CREATE-TABLE-AS-.-USING-syntax.patch
Description: Binary data


0007-Doc-update-of-create-materialized-view-.-USING-synta.patch
Description: Binary data


0008-Doc-update-of-CREATE-TABLE-.-USING-syntax.patch
Description: Binary data


Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-08 Thread Michael Paquier
On Fri, Mar 08, 2019 at 10:23:24AM +0900, Michael Paquier wrote:
> Argh...  Thanks for the lookup, Alvaro.

And committed, after an extra pass to beautify the whole experience.
--
Michael




Re: Should we increase the default vacuum_cost_limit?

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 07:10, Tom Lane  wrote:
>
> Jeff Janes  writes:
> > Now that this is done, the default value is only 5x below the hard-coded
> > maximum of 10,000.
> > This seems a bit odd, and not very future-proof.  Especially since the
> > hard-coded maximum appears to have no logic to it anyway, at least none
> > that is documented.  Is it just mindless nannyism?
>
> Hm.  I think the idea was that rather than setting it to "something very
> large", you'd want to just disable the feature via vacuum_cost_delay.
> But I agree that the threshold for what is ridiculously large probably
> ought to be well more than 5x the default, and maybe it is just mindless
> nannyism to have a limit less than what the implementation can handle.

Yeah, +1 to increasing it.  I imagine that the 10,000 limit would not
allow people to explore the upper limits of a modern PCI-E SSD with
the standard delay time and dirty/miss scores.  Also, it doesn't seem
entirely unreasonable that someone somewhere might also want to
fine-tune the hit/miss/dirty scores so that they're some larger factor
apart from each other than the standard scores are. The 10,000 limit does
not allow much wiggle room for that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Checksum errors in pg_stat_database

2019-03-08 Thread Magnus Hagander
On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud  wrote:

> On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander 
> wrote:
> >
> > It tracks things that happen in the general backends. Possibly we should
> also consider counting the errors actually found when running base backups?
> OTOH, that part of the code doesn't really track things like databases (as
> it operates just on the raw data directory underneath), so that
> implementation would definitely not be as clean...
>
> Sorry I just realized that I totally forgot this part of the thread.
>
> While it's true that we operate on the raw directory, I see that sendDir()
> already sets up an isDbDir var, and if it is true lastDir should
> contain the OID of the underlying database.  Wouldn't it be enough to
> call sendFile() using this, something like (untested):
>
> if (!sizeonly)
> - sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
> + sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
> isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);
>
> and accordingly report any checksum error from sendFile()?
>

That was easy enough, it seems. PFA an updated patch that does this, and
also rebased so it doesn't conflict on oid.

(Yes, while moving this from draft to publish after lunch, I realized that
you put a patch together for about the same thing. So let's merge them.)
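
For illustration, the reporting end of it in sendFile() boils down to
something like this (untested; dboid being the new parameter discussed
above, and pgstat_report_checksum_failures_in_db() the function the
attached patch adds to pgstat.c):

	if (checksum_failures > 0 && OidIsValid(dboid))
		pgstat_report_checksum_failures_in_db(dboid, checksum_failures);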

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0e73cdcdda..e2630fd368 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2509,6 +2509,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
  Number of deadlocks detected in this database
 
 
+ checksum_failures
+ bigint
+ Number of data page checksum failures detected in this database
+
+
  blk_read_time
  double precision
  Time spent reading data file blocks by backends in this database,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..7723f01327 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -823,6 +823,7 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_temp_files(D.oid) AS temp_files,
 pg_stat_get_db_temp_bytes(D.oid) AS temp_bytes,
 pg_stat_get_db_deadlocks(D.oid) AS deadlocks,
+pg_stat_get_db_checksum_failures(D.oid) AS checksum_failures,
 pg_stat_get_db_blk_read_time(D.oid) AS blk_read_time,
 pg_stat_get_db_blk_write_time(D.oid) AS blk_write_time,
 pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..43ec33834b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -334,6 +334,7 @@ static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
+static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 
 /* 
@@ -1518,6 +1519,40 @@ pgstat_report_deadlock(void)
	pgstat_send(&msg, sizeof(msg));
 }
 
+
+
+/* 
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
+ *
+ *	Tell the collector about one or more checksum failures.
+ * 
+ */
+void
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+{
+	PgStat_MsgChecksumFailure msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
+	msg.m_databaseid = dboid;
+	msg.m_failurecount = failurecount;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+/* 
+ * pgstat_report_checksum_failure() -
+ *
+ *	Tell the collector about a checksum failure.
+ * 
+ */
+void
+pgstat_report_checksum_failure(void)
+{
+	pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+}
+
 /* 
  * pgstat_report_tempfile() -
  *
@@ -4455,6 +4490,10 @@ PgstatCollectorMain(int argc, char *argv[])
	pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
 	break;
 
+case PGSTAT_MTYPE_CHECKSUMFAILURE:
+	pgstat_recv_checksum_failure((PgStat_MsgChecksumFailure *) &msg, len);
+	break;
+
 default:
 	break;
 			}
@@ -4554,6 +4593,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 	dbentry->n_temp_files = 0;
 	dbentry->n_temp_bytes = 0;
 	dbentry->n_deadlocks = 0;
+	dbentry->n_checksum_failures = 0;
 	dbentry->n_block_read_time = 0;
 	dbentry->n_block_write_time = 0;
 
@@ -6197,6 +6237,22 @@ 

Re: Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-08 Thread Andres Freund
Hi,

On 2019-03-08 20:18:27 -0300, Alvaro Herrera wrote:
> On 2019-Mar-07, Robert Haas wrote:
> 
> > On Wed, Mar 6, 2019 at 3:41 PM Andres Freund  wrote:
> > > I think we probably should have pg_dump suppress emitting information
> > > about the toast table of partitioned tables?
> > 
> > +1.  That seems like the right fix.
> 
> This patch fixes the upgrade problem for me.

Thanks!


> -- 
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

> diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
> index e962ae7e913..1de8da59361 100644
> --- a/src/bin/pg_dump/pg_dump.c
> +++ b/src/bin/pg_dump/pg_dump.c
> @@ -4359,9 +4359,9 @@ binary_upgrade_set_type_oids_by_rel_oid(Archive *fout,
> "SELECT c.reltype AS crel, t.reltype 
> AS trel "
> "FROM pg_catalog.pg_class c "
> "LEFT JOIN pg_catalog.pg_class t ON "
> -   "  (c.reltoastrelid = t.oid) "
> +   "  (c.reltoastrelid = t.oid AND 
> c.relkind <> '%c') "
> "WHERE c.oid = '%u'::pg_catalog.oid;",
> -   pg_rel_oid);
> +   RELKIND_PARTITIONED_TABLE, 
> pg_rel_oid);

Hm, I know this code isn't generally well documented, but perhaps we
could add a comment as to why we're excluding partitioned tables?
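Something along these lines, perhaps (wording only a suggestion):

/*
 * A pre-v12 partitioned table may still have a toast table attached to
 * it.  Partitioned tables have no storage, so make sure we don't
 * propagate such a toast table's OID into the new cluster.
 */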

Greetings,

Andres Freund



Re: Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-08 Thread Alvaro Herrera
On 2019-Mar-07, Robert Haas wrote:

> On Wed, Mar 6, 2019 at 3:41 PM Andres Freund  wrote:
> > I think we probably should have pg_dump suppress emitting information
> > about the toast table of partitioned tables?
> 
> +1.  That seems like the right fix.

This patch fixes the upgrade problem for me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e962ae7e913..1de8da59361 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4359,9 +4359,9 @@ binary_upgrade_set_type_oids_by_rel_oid(Archive *fout,
 	  "SELECT c.reltype AS crel, t.reltype AS trel "
 	  "FROM pg_catalog.pg_class c "
 	  "LEFT JOIN pg_catalog.pg_class t ON "
-	  "  (c.reltoastrelid = t.oid) "
+	  "  (c.reltoastrelid = t.oid AND c.relkind <> '%c') "
 	  "WHERE c.oid = '%u'::pg_catalog.oid;",
-	  pg_rel_oid);
+	  RELKIND_PARTITIONED_TABLE, pg_rel_oid);
 
 	upgrade_res = ExecuteSqlQueryForSingleRow(fout, upgrade_query->data);
 
@@ -6049,7 +6049,7 @@ getTables(Archive *fout, int *numTables)
 		  "d.classid = c.tableoid AND d.objid = c.oid AND "
 		  "d.objsubid = 0 AND "
 		  "d.refclassid = c.tableoid AND d.deptype IN ('a', 'i')) "
-		  "LEFT JOIN pg_class tc ON (c.reltoastrelid = tc.oid) "
+		  "LEFT JOIN pg_class tc ON (c.reltoastrelid = tc.oid AND c.relkind <> '%c') "
 		  "LEFT JOIN pg_am am ON (c.relam = am.oid) "
 		  "LEFT JOIN pg_init_privs pip ON "
 		  "(c.oid = pip.objoid "
@@ -6072,6 +6072,7 @@ getTables(Archive *fout, int *numTables)
 		  ispartition,
 		  partbound,
 		  RELKIND_SEQUENCE,
+		  RELKIND_PARTITIONED_TABLE,
 		  RELKIND_RELATION, RELKIND_SEQUENCE,
 		  RELKIND_VIEW, RELKIND_COMPOSITE_TYPE,
 		  RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE,


Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2019-03-08 Thread Andrew Dunstan


On 10/9/18 5:15 AM, Kyotaro HORIGUCHI wrote:
> At Fri, 5 Oct 2018 15:35:04 +0900, Michael Paquier  
> wrote in <20181005063504.gb14...@paquier.xyz>
>> On Fri, Oct 05, 2018 at 12:16:03PM +0900, Michael Paquier wrote:
>>> So, I have come back to this stuff, and finished with the attached
>>> instead, so as the assertion is in a single place.  I find that
>>> clearer.  The comments have also been improved.  Thoughts?
>> And so...  I have been looking at committing this thing, and while
>> testing in-depth I have been able to trigger a case where an autovacuum
>> has been able to be not aggressive but anti-wraparound, which is exactly
>> what should not be possible, no?  I have simply created an instance with
>> autovacuum_freeze_max_age = 20, then ran pgbench with
>> autovacuum_freeze_table_age=20 set for each table, and also ran
>> installcheck-world in parallel.  This has been able to trigger the
>> assertion pretty quickly.
> I investigated it and in short, it can happen.
>
> It is a kind of race condition between two autovacuum
> processes. do_autovacuum() looks into pg_class (using a snapshot)
> and vacuum_set_xid_limits() looks into relcache. If concurrent
> vacuum happens and one has finished the relation, another gets
> relcache invalidation and relfrozenxid is updated. If this
> happens between do_autovacuum() and vacuum_set_xid_limits(), the
> latter sees newer relfrozenxid than the former. The problem
> happens when it moves by more than 5% of
> autovacuum_freeze_max_age.
>
> If lazy_vacuum_rel() sees the situation, the relation has already been
> aggressively vacuumed by a concurrent worker. We can just ignore
> the state safely, but we also know that the vacuum is useless.
>
> 1. Just allow the case there (and add comment).
>Causes redundant anti-wraparound vacuum.
>
> 2. Skip the relation by the condition.
>
>I think that we can safely skip the relation in the
>case. (attached)
>
> 3. Ensure that do_autovacuum always sees the same relfrozenxid
>with vacuum_set_xid_limits().
>
> 4. Prevent concurrent vacuuming of the same relation rigorously,
>   somehow.
>
> Thoughts?
>


I notice that this seems never to have been acted on. I think we should
apply this and remove the (confusing) message setting for the case we'll
now be avoiding. If not we should at least comment there that this is a
case we only expect to see in pathological cases.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Ordered Partitioned Table Scans

2019-03-08 Thread Tom Lane
Julien Rouhaud  writes:
> On Fri, Mar 8, 2019 at 9:15 PM Tom Lane  wrote:
>> I think you should remove all that
>> and restrict this optimization to the case where all the subpaths are
>> natively ordered --- if we have to insert Sorts, it's hardly going to move
>> the needle to worry about simplifying the parent MergeAppend to Append.

> This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> x".  Even if the first subpath(s) aren't natively ordered, not all of
> the sorts should actually be performed.

[ shrug... ] We've got no realistic chance of estimating such situations
properly, so I'd have no confidence in a plan choice based on such a
thing.  Nor do I believe that this case is all that important.

regards, tom lane



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:
> All of that said, maybe it would be clearer if page deletion was not
> the special case that has to opt in to minusinfkey semantics. Perhaps
> it would make more sense for everyone else to opt out of minusinfkey
> semantics, and to get the !minusinfkey optimization as a result of
> that. I only did it the other way around because that meant that only
> nbtpage.c had to acknowledge the special case.

This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?

Thanks
-- 
Peter Geoghegan



Re: Ordered Partitioned Table Scans

2019-03-08 Thread Julien Rouhaud
On Fri, Mar 8, 2019 at 9:15 PM Tom Lane  wrote:
>
> David Rowley  writes:
> > [ v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch ]
>
> I think you should remove all that
> and restrict this optimization to the case where all the subpaths are
> natively ordered --- if we have to insert Sorts, it's hardly going to move
> the needle to worry about simplifying the parent MergeAppend to Append.

This can be a huge win for queries of the form "ORDER BY partkey LIMIT
x".  Even if the first subpath(s) aren't natively ordered, not all of
the sorts should actually be performed.



Re: Checksum errors in pg_stat_database

2019-03-08 Thread Julien Rouhaud
On Mon, Mar 4, 2019 at 8:31 PM Julien Rouhaud  wrote:
>
> On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander  wrote:
> >
> > It tracks things that happen in the general backends. Possibly we should 
> > also consider counting the errors actually found when running base backups? 
> > OTOH, that part of the code doesn't really track things like databases (as 
> > it operates just on the raw data directory underneath), so that 
> > implementation would definitely not be as clean...
>
> Sorry I just realized that I totally forgot this part of the thread.
>
> While it's true that we operate on the raw directory, I see that sendDir()
> already sets up an isDbDir var, and if it is true lastDir should
> contain the OID of the underlying database.  Wouldn't it be enough to
> call sendFile() using this, something like (untested):
>
> if (!sizeonly)
> - sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
> + sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
> isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);
>
> and accordingly report any checksum error from sendFile()?

So this seems to work just fine without adding much code.  PFA v3 of
Magnus' patch including error reporting for BASE_BACKUP.


stat_checksums_v3.diff
Description: Binary data


Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-03-08 Thread Tom Lane
Etsuro Fujita  writes:
> (2019/02/28 0:52), Robert Haas wrote:
>> On Tue, Feb 26, 2019 at 7:26 AM Etsuro Fujita
>>   wrote:
>>> The parallel safety of the final scan/join target is determined from the
>>> grouping target, not that target, which [ is wrong ]

>> Your patch looks right to me.

> I think it would be better for you to take this one.  Could you?

I concur with Robert that this is a brown-paper-bag-grade bug in
960df2a97.  Please feel free to push (and don't forget to back-patch).

The only really interesting question is whether it's worth adding
a regression test for.  I doubt it; the issue seems much too narrow.
Usually the point of a regression test is to prevent re-introduction
of the same/similar bug, but what class of bugs would you argue
we'd be protecting against?

regards, tom lane



Re: Ordered Partitioned Table Scans

2019-03-08 Thread Tom Lane
David Rowley  writes:
> [ v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch ]

I took a quick look through this and I'm not very happy with it.
It seems to me that the premise ought to be "just use an Append
if we can prove the output would be ordered anyway", but that's not
what we actually have here: instead you're adding more infrastructure
onto Append, which notably involves invasive changes to the API of
create_append_path, which is the main reason why the patch keeps breaking.
(It's broken again as of HEAD, though the cfbot doesn't seem to have
noticed yet.)  Likewise there's a bunch of added complication in
cost_append, create_append_plan, etc.  I think you should remove all that
and restrict this optimization to the case where all the subpaths are
natively ordered --- if we have to insert Sorts, it's hardly going to move
the needle to worry about simplifying the parent MergeAppend to Append.

There also seem to be bits that duplicate functionality of the
drop-single-child-[Merge]Append patch (specifically I'm looking
at get_singleton_append_subpath).  Why do we need that?

The logic in build_partition_pathkeys is noticeably stupider than
build_index_pathkeys, in particular it's not bright about boolean columns.
Maybe that's fine, but if so it deserves a comment explaining why we're
not bothering.  Also, the comment for build_index_pathkeys specifies that
callers should do truncate_useless_pathkeys, which they do; why is that
not relevant here?

regards, tom lane



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:
> > Question: Wouldn't it be more straightforward to use "1 +inf" as page
> > 1's high key? I.e treat any missing columns as positive infinity.
>
> That might also work, but it wouldn't be more straightforward on
> balance. This is because:

I thought of another reason:

* The 'Add high key "continuescan" optimization' is effective because
the high key of a leaf page tends to look relatively dissimilar to
other items on the page. The optimization would almost never help if
the high key was derived from the lastleft item at the time of a split
-- that's no more informative than the lastleft item itself.

As things stand with the patch, a high key usually has a value for its
last untruncated attribute that can only appear on the page to the
right, and never the current page. We'd quite like to be able to
conclude that the page to the right can't be interesting there and
then, without needing to visit it. Making new leaf high keys "as close
as possible to items on the right, without actually touching them"
makes the optimization quite likely to work out with the TPC-C
indexes, when we search for orderline items for an order that is
rightmost of a leaf page in the orderlines primary key.

And another reason:

* This makes it likely that any new items that would go between the
original lastleft and firstright items end up on the right page when
they're inserted after the lastleft/firstright split. This is
generally a good thing, because we've optimized WAL-logging for new
pages that go on the right. (You pointed this out to me in Lisbon, in
fact.)

-- 
Peter Geoghegan



Re: Reaping Temp tables to avoid XID wraparound

2019-03-08 Thread Magnus Hagander
On Mon, Feb 25, 2019 at 10:45 PM Michael Paquier 
wrote:

> On Fri, Feb 22, 2019 at 04:01:02PM +0100, Magnus Hagander wrote:
> > I did the "insert column in the middle of pg_stat_get_activity", I'm not
> > sure that is right -- how do we treate that one? Do we just append at the
> > end because people are expected to use the pg_stat_activity view? It's a
> > nontrivial part of the patch.
>
> I think that it would be more confusing to add the new column at the
> tail, after all the SSL fields.
>
> > That one aside, does the general way to track it appear reasonable? (docs
> > excluded until we have agreement on that)
>
> It does.  A temp table is associated to a session so it's not like
> autovacuum can work on it.  With this information it is at least
> possible to take actions.  We could even get autovacuum to kill such
> sessions. /me hides
>
> > And should we also expose the oid in pg_stat_activity in this case, since
> > we have it?
>
> For the case reported here, just knowing the XID where the temporary
> namespace has been created is enough so as the goal is to kill the
> session associated with it.  Still, it seems to me that knowing the
> temporary schema name used by a given session is useful, and that's
> cheap to get as the information is already there.
>

It should be, since it's in pgproc.

> One problem that I can see with your patch is that you would set the
> XID once any temporary object created, including when objects other
> than tables are created in pg_temp, including functions, etc.  And it
> does not really matter for wraparound monitoring.  Still, the patch is
> simple..
>

I'm not entirely sure what you mean here. Sure, it will log it even when a
temp function is created, but the namespace is still created then, is it
not?
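
For illustration only -- this is not the posted patch, and the PGPROC field
below is invented -- the bookkeeping being discussed amounts to stamping the
backend's PGPROC entry when its temporary namespace gets created, so that
monitoring views can expose it:

    #include "access/xact.h"
    #include "storage/proc.h"

    /* hypothetical sketch; tempNamespaceXid is an invented PGPROC field */
    static void
    RecordTempNamespaceXid(void)
    {
        /* remember which transaction created this session's pg_temp_N */
        MyProc->tempNamespaceXid = GetCurrentTransactionId();
    }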

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Performance issue in foreign-key-aware join estimation

2019-03-08 Thread Tom Lane
David Rowley  writes:
> On Thu, 21 Feb 2019 at 15:00, Tom Lane  wrote:
>> Anyway, I rebased the POC patch up to HEAD, just in case anyone
>> still wants to play with it.

> Cool. Thanks.

I haven't done any of the performance testing that this patch
clearly needs, but just in a quick look through the code:

* I seriously dislike the usage of root->eq_classes, primarily the
underdocumented way that it means one thing in some phases of the
planner and something else in other phases.  You seem to be doing that
in hopes of saving some memory, but is the index data structure really
large enough to matter?  This scheme is certainly unlikely to continue
to work if we add any additional uses of EquivalenceClassIndexes.
So I think it might be better to pass them around as explicit
arguments and/or attach them someplace else than PlannerInfo.

* I'm also not very excited about having both a fast and slow path
in places like has_relevant_eclass_joinclause() depending on whether
the index exists or not.  IMO, if we're going to do this at all,
we should just go over to requiring the index to exist when needed,
and get rid of the slow paths.  (A possible variant to that is "build
the index structure on-demand", though you'd have to be careful about
GEQO memory management.)  Otherwise we'll forever be fighting hidden
planner-performance bugs of the form "if you call function xyz from
here, it'll be way slower than you expected".

* There's not much point in making EquivalenceClassIndex a Node type
if you're not going to wire it up to any of the Node infrastructure
(particularly outfuncs.c, which might be useful for debug purposes).
But I'm not really sure that it ought to be a distinct data structure
at all --- maybe this requirement should be more tightly integrated
with the ECs themselves?  One idea of what that might look like is to
let each base RelOptInfo have a list of associated EquivalenceClasses,
so that you'd only have to search through directly-relevant ECs when
trying to prove something.  But I'm just handwaving here.

* Spell check please, particularly EQUIVALANCE.

* Documentation of the data structure is pretty weak, eg what is
"this relation" in the comment about ei_index?  And how do you
know how long the arrays are, or what they're indexed by?  And
there's no explicit statement that ei_flags is a bitmask of those
other symbols, much less any real statement of what each flag means.

Setting the CF entry to WOA for now.  I wonder though if we should
just push it out to v13 immediately --- are you intending to do more
with it in the near future?

regards, tom lane



Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Pavel Stehule
On Fri, Mar 8, 2019 at 7:44 PM Alvaro Herrera wrote:

> On 2019-Mar-08, Alvaro Herrera wrote:
>
> > > Maybe we can call explicitly xmlFreeDoc instead xmlFreeNode
> > >
> > > some like
> > >
> > > if (cur_copy->type == XML_DOCUMENT_NODE)
> > >   xmlFreeDoc((xmlDocPtr) cur_copy);
> > > else
> > >   xmlFreeNode(cur_copy);
> > >
> > > This looks most correct fix for me. What do you think?
> >
> > Seems like that should work, yeah ...
>
> Something like this perhaps?  Less repetitive ...
>

+1

Pavel


> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2019 at 2:12 AM Heikki Linnakangas  wrote:
> Now, what do we have as the high key of page 1? Answer: "2 -inf". The
> "-inf" is not stored in the key itself, the second key column is just
> omitted, and the search code knows to treat it implicitly as a value
> that's lower than any real value. Hence, "minus infinity".

Right.

> However, during page deletion, we need to perform a search to find the
> downlink pointing to a leaf page. We do that by using the leaf page's
> high key as the search key. But the search needs to treat it slightly
> differently in that case. Normally, searching with a single key value,
> "2", we would land on page 2, because any real value beginning with "2"
> would be on that page, but in the page deletion case, we want to find
> page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
> to do that.

Right.

> Question: Wouldn't it be more straightforward to use "1 +inf" as page
> 1's high key? I.e treat any missing columns as positive infinity.

That might also work, but it wouldn't be more straightforward on
balance. This is because:

* We have always taken the new high key from the firstright item, and
we also continue to do that on internal pages -- same as before. It
would certainly complicate the nbtsplitloc.c code to have to deal with
this new special case now (leaf and internal pages would have to have
far different handling, not just slightly different handling).

* We have always had "-inf" values as the first item on an internal
page, which is explicitly truncated to zero attributes as of Postgres
v11. It seems ugly to me to make truncated attributes mean negative
infinity in that context, but positive infinity in every other
context.

* Another reason that I prefer "-inf" to "+inf" is that you can
imagine an implementation that makes pivot tuples into normalized
binary keys, that are truncated using generic/opclass-agnostic logic,
and compared using strcmp(). If the scankey binary string is longer
than the pivot tuple, then it's greater according to strcmp() -- that
just works. And, you can truncate the original binary strings built
using opclass infrastructure without having to understand where
attributes begin and end (though this relies on encoding things like
NULL-ness a certain way). If we define truncation to be "+inf" now,
then none of this works.
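
To make that comparison rule concrete, here is a minimal sketch (not nbtree
code; names invented) of how suffix-truncated normalized keys behave under
plain byte-wise comparison, with truncation acting as negative infinity:

    #include <stddef.h>
    #include <string.h>

    /*
     * Compare a probe key against a possibly suffix-truncated pivot, both
     * stored as normalized binary strings.  If the pivot is a strict prefix
     * of the probe, the probe sorts after it -- truncation behaves as -inf.
     */
    static int
    normkey_cmp(const unsigned char *probe, size_t probelen,
                const unsigned char *pivot, size_t pivotlen)
    {
        size_t      n = probelen < pivotlen ? probelen : pivotlen;
        int         c = memcmp(probe, pivot, n);

        if (c != 0)
            return c;
        if (probelen == pivotlen)
            return 0;
        return (probelen > pivotlen) ? 1 : -1;
    }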

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.

Even calling it minusinfkey is misleading in one way, because we're
not so much searching for "-inf" values as we are searching for the
first page that could have tuples for the untruncated attributes. But
isn't that how this has always worked, given that we've had to deal
with duplicate pivot tuples on the same level before now? As I said,
we're not doing an extra thing when minusinfkey is true (during page
deletion) -- it's the other way around. Saying that we're searching
for minus infinity values for the truncated attributes is kind of a
lie, although the search does behave that way.

> That way, the search for page deletion wouldn't need to be treated
> differently. That's also how this used to work: all tuples on a page
> used to be <= high key, not strictly < high key.

That isn't accurate -- it still works that way on the leaf level. The
alternative that you've described is possible, I think, but the key
space works just the same with either of our approaches. You've merely
thought of an alternative way of generating new high keys that satisfy
the same invariants as my own scheme. Provided the new separator for
high key is >= last item on the left and < first item on the right,
everything works.

As you point out, the original Lehman and Yao rule for leaf pages
(which Postgres kinda followed before) is that the high key is <=
items on the leaf level. But this patch makes nbtree follow that rule
fully and properly.

Maybe you noticed that amcheck tests < on internal pages, and only
checks <= on leaf pages. Perhaps it led you to believe that I did
things differently. Actually, this is classic Lehman and Yao. The keys
in internal pages are all "separators" as far as Lehman and Yao are
concerned, so the high key is less of a special case on internal
pages. We check < on internal pages because all separators are
supposed to be unique on a level. But, as I said, we do check <= on
the leaf level.

Take a look at "Fig. 7 A B-Link Tree" in the Lehman and Yao paper if
this is unclear. That shows that internal pages have unique keys -- we
can therefore expect the high key to be < items in internal pages. It
also shows that leaf pages copy the high key from the last item on the
left page -- we can expect 

Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Alvaro Herrera
On 2019-Mar-08, Alvaro Herrera wrote:

> > Maybe we can call explicitly xmlFreeDoc instead xmlFreeNode
> > 
> > some like
> > 
> > if (cur_copy->type == XML_DOCUMENT_NODE)
> >   xmlFreeDoc((xmlDocPtr) cur_copy);
> > else
> >   xmlFreeNode(cur_copy);
> > 
> > This looks most correct fix for me. What do you think?
> 
> Seems like that should work, yeah ...

Something like this perhaps?  Less repetitive ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 28b3eaaa201..9204d6e9cf6 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3720,35 +3720,57 @@ xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
 
 	if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)
 	{
-		xmlBufferPtr buf;
-		xmlNodePtr	cur_copy;
-
-		buf = xmlBufferCreate();
-
-		/*
-		 * The result of xmlNodeDump() won't contain namespace definitions
-		 * from parent nodes, but xmlCopyNode() duplicates a node along with
-		 * its required namespace definitions.
-		 */
-		cur_copy = xmlCopyNode(cur, 1);
-
-		if (cur_copy == NULL)
-			xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
-		"could not copy node");
+		volatile void (*nodefree) (xmlNodePtr) = NULL;
+		volatile xmlBufferPtr buf = NULL;
+		volatile xmlNodePtr cur_copy = NULL;
 
 		PG_TRY();
 		{
-			xmlNodeDump(buf, NULL, cur_copy, 0, 1);
+			int			bytes;
+
+			buf = xmlBufferCreate();
+			if (buf == NULL || xmlerrcxt->err_occurred)
+xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
+			"could not allocate xmlBuffer");
+
+			/*
+			 * Produce a dump of the node that we can serialize.  xmlNodeDump
+			 * does that, but the result of that function won't contain
+			 * namespace definitions from ancestor nodes, so we first do a
+			 * xmlCopyNode() which duplicates the node along with its required
+			 * namespace definitions.
+			 *
+			 * Some old libxml2 versions such as 2.7.6 produce partially
+			 * broken XML_DOCUMENT_NODE nodes (unset content field) when
+			 * copying them.  xmlNodeDump of such a node works fine, but
+			 * xmlFreeNode crashes; set us up to call xmlFreeDoc instead.
+			 */
+			cur_copy = xmlCopyNode(cur, 1);
+			if (cur_copy == NULL || xmlerrcxt->err_occurred)
+xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
+			"could not copy node");
+			nodefree = cur_copy->type == XML_DOCUMENT_NODE ?
+(void (*) (xmlNodePtr)) xmlFreeDoc : xmlFreeNode;
+
+			bytes = xmlNodeDump(buf, NULL, cur_copy, 0, 1);
+			if (bytes == -1 || xmlerrcxt->err_occurred)
+xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
+			"could not dump node");
+
 			result = xmlBuffer_to_xmltype(buf);
 		}
 		PG_CATCH();
 		{
-			xmlFreeNode(cur_copy);
-			xmlBufferFree(buf);
+			if (nodefree)
+nodefree(cur_copy);
+			if (buf)
+xmlBufferFree(buf);
 			PG_RE_THROW();
 		}
 		PG_END_TRY();
-		xmlFreeNode(cur_copy);
+
+		if (nodefree)
+			nodefree(cur_copy);
 		xmlBufferFree(buf);
 	}
 	else


Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Alvaro Herrera
(I spent some time trying to reproduce the original bug, but was
interrupted for lunch before getting a useful installation.  I find it a
bit strange that it doesn't crash in x86_64, mind ...)

On 2019-Mar-08, Pavel Stehule wrote:

> It fixes current issue, but I afraid so these two routines are not
> replaceable. xmlFreeNodeList doesn't release xmlFreeDtd, XML_ATTRIBUTE_NODE
> is not checked.

:-(

> Maybe we can call explicitly xmlFreeDoc instead xmlFreeNode
> 
> some like
> 
> if (cur_copy->type == XML_DOCUMENT_NODE)
>   xmlFreeDoc((xmlDocPtr) cur_copy);
> else
>   xmlFreeNode(cur_copy);
> 
> This looks most correct fix for me. What do you think?

Seems like that should work, yeah ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Hash index initial size is too large given NULLs or partial indexes

2019-03-08 Thread Tomas Vondra



On 3/8/19 7:14 PM, Jeff Janes wrote:
> Referring to this thread:
> 
> https://dba.stackexchange.com/questions/231647/why-are-partial-postgresql-hash-indices-not-smaller-than-full-indices
> 
> When a hash index is created on a populated table, it estimates the
> number of buckets to start out with based on the number of tuples
> returned by estimate_rel_size.  But this number ignores both the fact
> that NULLs are not stored in hash indexes, and that partial indexes
> exist.  This can lead to much too large hash indexes.  Doing a re-index
> just repeats the logic, so doesn't fix anything.  Fill factor also can't
> fix it, as you are not allowed to increase that beyond 100.
> 

Hmmm :-(

> This goes back to when the pre-sizing was implemented in 2008
> (c9a1cc694abef737548a2a).  It seems to be an oversight, rather than
> something that was considered.
> 
> Is this a bug that should be fixed?  Or if getting a more accurate
> estimate is not possible or not worthwhile, add a code comment about that?
> 

I'd agree this smells like a bug (or perhaps two). The sizing probably
should consider both null_frac and the selectivity of the index predicate.
When those two are redundant (i.e. when there's an IS NOT NULL condition on
the indexed column), this will result in an under-estimate. That means the
index build will do an extra split, but that's probably better than
having a permanently bloated index.
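
As a rough sketch of that adjustment (function and parameter names are made
up, this is not a patch), the bucket pre-sizing would start from something
like:

    /*
     * Hypothetical helper: scale the raw tuple count from estimate_rel_size()
     * down to what the hash index will actually store.
     */
    static double
    hash_index_tuple_estimate(double reltuples, double null_frac,
                              double partial_pred_selectivity)
    {
        double      est = reltuples;

        est *= (1.0 - null_frac);           /* NULLs are not stored */
        est *= partial_pred_selectivity;    /* partial-index predicate */
        return est;
    }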

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hash index initial size is too large given NULLs or partial indexes

2019-03-08 Thread Jeff Janes
Referring to this thread:

https://dba.stackexchange.com/questions/231647/why-are-partial-postgresql-hash-indices-not-smaller-than-full-indices

When a hash index is created on a populated table, it estimates the number
of buckets to start out with based on the number of tuples returned
by estimate_rel_size.  But this number ignores both the fact that NULLs are
not stored in hash indexes, and that partial indexes exist.  This can lead
to much too large hash indexes.  Doing a re-index just repeats the logic,
so doesn't fix anything.  Fill factor also can't fix it, as you are not
allowed to increase that beyond 100.

This goes back to when the pre-sizing was implemented in 2008
(c9a1cc694abef737548a2a).  It seems to be an oversight, rather than
something that was considered.

Is this a bug that should be fixed?  Or if getting a more accurate estimate
is not possible or not worthwhile, add a code comment about that?

Cheers,

Jeff


Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Pavel Stehule
On Fri, Mar 8, 2019 at 3:31 PM Alvaro Herrera wrote:

> On 2019-Mar-08, Pavel Stehule wrote:
>
> > looks like error in xmlXPathCompiledEval function, that produce little
> bit
> > broken result for XML_DOCUMENT_NODE type. I hadn't this problem with
> > libxml2 2.7.6 64bit, but I seen this issue on same version on 32bit.
> >
> > Currently I had not fresh 32 bit system to check it.
> >
> > I found a workaround - in this case copy (and release xmlNode) is not
> > necessary.
>
> Hmm ... going over the libxml2 2.7.6 source, I noticed that
> xmlFreeNodeList seem to get this right -- it uses xmlFreeDoc for
> XML_DOCUMENT_NODE.  Maybe a sufficient answer is to change the
> xmlFreeNode there to xmlFreeNodeList.
>

It fixes the current issue, but I'm afraid these two routines are not
interchangeable: xmlFreeNodeList doesn't release DTDs via xmlFreeDtd, and
XML_ATTRIBUTE_NODE is not checked.

You can see from xmlNodeGetContent that the XML_DOCUMENT_NODE type should
ignore the content value and never return it.  Another interesting one is
xmlXPathOrderDocElems, where content is used for counting; that is probably
where the -1 comes from.

Maybe we can explicitly call xmlFreeDoc instead of xmlFreeNode,

something like

if (cur_copy->type == XML_DOCUMENT_NODE)
  xmlFreeDoc((xmlDocPtr) cur_copy);
else
  xmlFreeNode(cur_copy);

This looks like the most correct fix to me. What do you think?

Pavel



> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Should we increase the default vacuum_cost_limit?

2019-03-08 Thread Tom Lane
Jeff Janes  writes:
> Now that this is done, the default value is only 5x below the hard-coded
> maximum of 10,000.
> This seems a bit odd, and not very future-proof.  Especially since the
> hard-coded maximum appears to have no logic to it anyway, at least none
> that is documented.  Is it just mindless nannyism?

Hm.  I think the idea was that rather than setting it to "something very
large", you'd want to just disable the feature via vacuum_cost_delay.
But I agree that the threshold for what is ridiculously large probably
ought to be well more than 5x the default, and maybe it is just mindless
nannyism to have a limit less than what the implementation can handle.

regards, tom lane



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-08 Thread Tomas Vondra
On 3/8/19 5:38 PM, Antonin Houska wrote:
> Antonin Houska  wrote:
> 
>> Masahiko Sawada  wrote:
>>
>>> Agreed.
>>>
>>> For the WAL encryption, I wonder if we can have a encryption key
>>> dedicated for WAL. Regardless of keys of tables and indexes all WAL
>>> are encrypted with the WAL key. During the recovery the startup
>>> process decrypts WAL and applies it, and then the table data will be
>>> encrypted with its table key when flushing. So we just control the
>>> scope of encryption object: WAL of tables and indexes etc or
>>> everything.
>>
>> My point of view is that different key usually means different user. The user
>> who can decrypt WAL can effectively see all the data, even though another 
>> user
>> put them (encrypted with another key) into tables. So in this case, different
>> keys don't really separate users in terms of data access.
> 
> Please ignore what I said here. You probably meant that the WAL is both
> encrypted and decrypted using the same (dedicated) key.
> 

I think this very much depends on the threat model. If the encryption is
supposed to serve as a second access control layer (orthogonal to the
ACL stuff we already have), then a single WAL key may not be sufficient.

I may be misunderstanding the whole scheme, but it seems to me features
like logical decoding do require knowledge of the WAL key. So sessions
performing logical decoding (which are regular user sessions) would know
the WAL key, which gives them the ability to decode everything.

So if the threat model includes an insider threat (someone with access to a
subset of data gaining unauthorized access to everything), then this
would be an issue. Such a bad actor might obtain access to the WAL archive, or
possibly just copy the WAL segments on his own ...

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-08 Thread Julien Rouhaud
On Fri, Mar 8, 2019 at 6:50 PM Tomas Vondra
 wrote:
>
> On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund  wrote:
> >>
> >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >>>
> >>> But then again, we could just
> >>> hack a special version of ReadBuffer_common() which would just
> >>
> >>> (a) check if a page is in shared buffers, and if it is then consider the
> >>> checksum correct (because in memory it may be stale, and it was read
> >>> successfully so it was OK at that moment)
> >>>
> >>> (b) if it's not in shared buffers already, try reading it and verify the
> >>> checksum, and then just evict it right away (not to spoil sb)
> >>
> >> This'd also make sense and make the whole process more efficient. OTOH,
> >> it might actually be worthwhile to check the on-disk page even if
> >> there's in-memory state. Unless IO is in progress the on-disk page
> >> always should be valid.
> >
> > Definitely.  I already saw servers with all-frozen-read-only blocks
> > popular enough to never get evicted in months, and then a minor
> > upgrade / restart having catastrophic consequences.
> >
>
> Do I understand correctly the "catastrophic consequences" here are due
> to data corruption / broken checksums on those on-disk pages?

Ah, yes, sorry, I should have been clearer.  Indeed, there were silent
data corruptions (no checksums though) that were revealed by the
restart.  So a routine minor update resulted in a massive outage.
Such a scenario can't be avoided if we always bypass the checksum check
for pages already in shared_buffers.
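
For what it's worth, a minimal sketch of the other direction mentioned above
-- verifying the on-disk copy of a block even when a (possibly newer) copy
sits in shared buffers -- could look roughly like this; it deliberately leaves
out any coordination with in-progress I/O and is not from any posted patch:

    #include "postgres.h"
    #include "storage/bufpage.h"
    #include "storage/smgr.h"

    /* read the on-disk version of a block and verify its checksum */
    static bool
    verify_block_on_disk(SMgrRelation reln, ForkNumber forknum,
                         BlockNumber blkno)
    {
        PGAlignedBlock page;

        smgrread(reln, forknum, blkno, page.data);
        if (PageIsNew((Page) page.data))
            return true;        /* all-zero pages carry no checksum */
        return PageIsVerified((Page) page.data, blkno);
    }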



Re: Online verification of checksums

2019-03-08 Thread Tomas Vondra
On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> On Thu, Mar 7, 2019 at 7:00 PM Andres Freund  wrote:
>>
>> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
>>>
>>> But then again, we could just
>>> hack a special version of ReadBuffer_common() which would just
>>
>>> (a) check if a page is in shared buffers, and if it is then consider the
>>> checksum correct (because in memory it may be stale, and it was read
>>> successfully so it was OK at that moment)
>>>
>>> (b) if it's not in shared buffers already, try reading it and verify the
>>> checksum, and then just evict it right away (not to spoil sb)
>>
>> This'd also make sense and make the whole process more efficient. OTOH,
>> it might actually be worthwhile to check the on-disk page even if
>> there's in-memory state. Unless IO is in progress the on-disk page
>> always should be valid.
> 
> Definitely.  I already saw servers with all-frozen-read-only blocks
> popular enough to never get evicted in months, and then a minor
> upgrade / restart having catastrophic consequences.
> 

Do I understand correctly the "catastrophic consequences" here are due
to data corruption / broken checksums on those on-disk pages?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-08 Thread Tom Lane
John Naylor  writes:
>> On 2/8/19, Tom Lane  wrote:
>>> A script such as you suggest might be a good way to reduce the temptation
>>> to get lazy at the last minute.  Now that the catalog data is pretty
>>> machine-readable, I suspect it wouldn't be very hard --- though I'm
>>> not volunteering either.  I'm envisioning something simple like "renumber
>>> all OIDs in range - into range -", perhaps with the
>>> ability to skip any already-used OIDs in the target range.

>> This might be something that can be done inside reformat_dat_files.pl.
>> It's a little outside it's scope, but better than the alternatives.

> Along those lines, here's a draft patch to do just that. It handles
> array type oids as well. Run it like this:

> perl  reformat_dat_file.pl  --map-from 9000  --map-to 2000  *.dat

I took a quick look at this.  I went ahead and pushed the parts that
were just code cleanup in reformat_dat_file.pl, since that seemed
pretty uncontroversial.  As far as the rest of it goes:

* I'm really not terribly happy with sticking this functionality into
reformat_dat_file.pl.  First, there's an issue of discoverability:
it's not obvious that a script named that way would have such an
ability.  Second, it clutters the script in a way that seems to me
to hinder its usefulness as a basis for one-off hacks.  So I'd really
rather have a separate script named something like "renumber_oids.pl",
even if there's a good deal of code duplication between it and
reformat_dat_file.pl.

* In my vision of what this might be good for, I think it's important
that it be possible to specify a range of input OIDs to renumber, not
just "everything above N".  I agree the output range only needs a
starting OID.

BTW, I changed the CF entry's target back to v12; I don't see a
reason not to get this done this month, and indeed kind of wish
it was available right now ;-)

regards, tom lane



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-08 Thread Antonin Houska
Antonin Houska  wrote:

> Masahiko Sawada  wrote:
> 
> > Agreed.
> > 
> > For the WAL encryption, I wonder if we can have a encryption key
> > dedicated for WAL. Regardless of keys of tables and indexes all WAL
> > are encrypted with the WAL key. During the recovery the startup
> > process decrypts WAL and applies it, and then the table data will be
> > encrypted with its table key when flushing. So we just control the
> > scope of encryption object: WAL of tables and indexes etc or
> > everything.
> 
> My point of view is that different key usually means different user. The user
> who can decrypt WAL can effectively see all the data, even though another user
> put them (encrypted with another key) into tables. So in this case, different
> keys don't really separate users in terms of data access.

Please ignore what I said here. You probably meant that the WAL is both
encrypted and decrypted using the same (dedicated) key.

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: Problems with plan estimates in postgres_fdw

2019-03-08 Thread Antonin Houska
Etsuro Fujita  wrote:

> (2019/03/01 20:16), Antonin Houska wrote:
> > Etsuro Fujita  wrote:
> 
> >> Conversely, it appears that add_foreign_ordered_paths() added by the 
> >> patchset
> >> would generate such pre-sorted paths *redundantly* when the input_rel is 
> >> the
> >> final scan/join relation.  Will look into that.
> >
> > Currently I have no idea how to check the plan created by FDW at the
> > UPPERREL_ORDERED stage, except for removing the sort from the
> > UPPERREL_GROUP_AGG stage as I proposed here:
> >
> > https://www.postgresql.org/message-id/11807.1549564431%40localhost
> 
> I don't understand your words "how to check the plan created by FDW". Could
> you elaborate on that a bit more?

I meant that I don't know how to verify that the plan that sends the ORDER BY
clause to the remote server was created at the UPPERREL_ORDERED stage and not
at UPPERREL_GROUP_AGG.  However, if the ORDER BY clause really should not be
added at the UPPERREL_GROUP_AGG stage and we ensure that it no longer happens,
then the mere presence of ORDER BY in the (remote) plan means that the
UPPERREL_ORDERED stage works fine.

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: Update does not move row across foreign partitions in v11

2019-03-08 Thread Amit Langote
On Sat, Mar 9, 2019 at 12:03 AM Alvaro Herrera  wrote:
>
> On 2019-Mar-08, Amit Langote wrote:
>
> > On Fri, Mar 8, 2019 at 11:09 PM Alvaro Herrera  
> > wrote:
>
> > > I'm not sure about copying the same to ddl.sgml.  Why is that needed?
> > > Update is not DDL.
> >
> > Hmm, maybe because there's already a huge block of text describing
> > certain limitations of UPDATE row movement under concurrency?
>
> Uh, you're right, there is.  That seems misplaced :-(  I'm not sure it
> even counts as a "limitation"; it seems to belong to the NOTES section
> of UPDATE rather than where it is now.
>
> > Actually, I remember commenting *against* having that text in
> > ddl.sgml, but it got in there anyway.
>
> We can move it now ...
>
> > > ddl.sgml does say this: "Partitions can also be
> > > foreign tables, although they have some limitations that normal tables
> > > do not; see CREATE FOREIGN TABLE for more information." which suggests
> > > that the limitation might need to be added to create_foreign_table.sgml.
> >
> > Actually, that "more information" never got added to
> > create_foreign_table.sgml.  There should've been some text about the
> > lack for tuple routing at least in PG 10's docs, but I guess that
> > never happened.
>
> Sigh.
>
> Since version 10 is going to be supported for a few years still, maybe
> we should add it there.
>
> > Should we start now by listing this UPDATE row movement limitation?
>
> I think we should, yes.

Attached find 3 patches -- for PG 10, 11, and HEAD.  I also realized
that a description of the PARTITION OF clause was missing from the
Parameters section of CREATE FOREIGN TABLE, which is fixed too.

Thanks,
Amit


HEAD-foreign-table-limitations.patch
Description: Binary data


pg11-foreign-table-limitations.patch
Description: Binary data


pg10-foreign-table-limitations.patch
Description: Binary data


Re: Minimal logical decoding on standbys

2019-03-08 Thread Amit Khandekar
On Mon, 4 Mar 2019 at 14:09, Amit Khandekar  wrote:
>
> On Fri, 14 Dec 2018 at 06:25, Andres Freund  wrote:
> > I've a prototype attached, but let's discuss the details in a separate
> > thread. This also needs to be changed for pluggable storage, as we don't
> > know about table access methods in the startup process, so we can't call
> > can't determine which AM the heap is from during
> > btree_xlog_delete_get_latestRemovedXid() (and sibling routines).
>
> Attached is a WIP test patch
> 0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
> modified version of Craig Ringer's test cases

Hi Andres,

I am trying to come up with new testcases to test the recovery
conflict handling. Before that, I have some questions:

With Craig Ringer's approach, the way to reproduce the recovery
conflict was, I believe, easy : Do a checkpoint, which will log the
global-catalog-xmin-advance WAL record, due to which the standby -
while replaying the message - may find out that it's a recovery
conflict. But with your approach, the latestRemovedXid is passed only
during specific vacuum-related WAL records, so to reproduce the
recovery conflict error, we need to make sure some specific WAL
records are logged, such as XLOG_BTREE_DELETE. So we need to create a
testcase such that while creating an index tuple, it erases dead
tuples from a page, so that it eventually calls
_bt_vacuum_one_page()=>_bt_delitems_delete(), thus logging a
XLOG_BTREE_DELETE record.

I tried to come up with such a reproducible testcase, without success.
This seems difficult. Do you have an easier option? Maybe we can use
some other WAL record that allows an easier, more reliable test case
for provoking a recovery conflict?

Further, with your patch, ResolveRecoveryConflictWithSlots() just
emits a WARNING; so the WAL receiver would not make the backends throw
an error, and hence the test case won't catch the error. Is that
right?


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Should we increase the default vacuum_cost_limit?

2019-03-08 Thread Jeff Janes
On Wed, Mar 6, 2019 at 2:54 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

>
> On 3/6/19 1:38 PM, Jeremy Schneider wrote:
> > On 3/5/19 14:14, Andrew Dunstan wrote:
> >> This patch is tiny, seems perfectly reasonable, and has plenty of
> >> support. I'm going to commit it shortly unless there are last minute
> >> objections.
> > +1
> >
>
> done.
>

Now that this is done, the default value is only 5x below the hard-coded
maximum of 10,000.

This seems a bit odd, and not very future-proof.  Especially since the
hard-coded maximum appears to have no logic to it anyway, at least none
that is documented.  Is it just mindless nannyism?

Any reason not to increase by at least a factor of 10, but preferably the
largest value that does not cause computational problems (which I think
would be INT_MAX)?
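
For reference, the ceiling lives in the guc.c entry for the setting, so
raising it would be a small change along these lines (just a sketch; the new
maximum is illustrative and the surrounding fields are shown only for
context):

    /* in guc.c's ConfigureNamesInt[]; only the max value changes */
    {
        {"vacuum_cost_limit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
            gettext_noop("Vacuum cost amount available before napping."),
            NULL
        },
        &VacuumCostLimit,
        2000, 1, INT_MAX,       /* default, min, max (max was 10000) */
        NULL, NULL, NULL
    },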

Cheers,

Jeff


Re: Online verification of checksums

2019-03-08 Thread Julien Rouhaud
On Thu, Mar 7, 2019 at 7:00 PM Andres Freund  wrote:
>
> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >
> > But then again, we could just
> > hack a special version of ReadBuffer_common() which would just
>
> > (a) check if a page is in shared buffers, and if it is then consider the
> > checksum correct (because in memory it may be stale, and it was read
> > successfully so it was OK at that moment)
> >
> > (b) if it's not in shared buffers already, try reading it and verify the
> > checksum, and then just evict it right away (not to spoil sb)
>
> This'd also make sense and make the whole process more efficient. OTOH,
> it might actually be worthwhile to check the on-disk page even if
> there's in-memory state. Unless IO is in progress the on-disk page
> always should be valid.

Definitely.  I already saw servers with all-frozen-read-only blocks
popular enough to never get evicted in months, and then a minor
upgrade / restart having catastrophic consequences.



Re: psql show URL with help

2019-03-08 Thread David Fetter
On Fri, Mar 08, 2019 at 01:45:03PM +0100, Peter Eisentraut wrote:
> On 2019-03-07 23:02, David Fetter wrote:
> >> if (psql_version_is_numeric)
> >>  return /docs/psql_version/
> >> else if (psql_version ends with 'devel')
> >>   return /docs/devel/
> >> else
> >>   return /docs/{psql_version but with text stripped}/
> >>
> >> So that e.g. 12beta would return "12", as would 12rc or 12alpha. But
> >> 12devel would return "devel".
> > 
> > That's exactly what I had in mind :)
> 
> The outcome of that is exactly what my patch does, but the inputs are
> different.  We have PG_MAJORVERSION, which is always a single integer,
> and PG_VERSION, which could be 10.9.8 or 11beta5 or 12devel.  The patch does
> 
> if (PG_VERSION ends with 'devel')
>   return /docs/devel/
> else
>  return /docs/$PG_MAJORVERSION/
> 
> There is no third case.  Your third case of not-numeric-and-not-devel is
> correctly covered by the else branch.

Thanks for helping me understand.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Update does not move row across foreign partitions in v11

2019-03-08 Thread Alvaro Herrera
On 2019-Mar-08, Amit Langote wrote:

> On Fri, Mar 8, 2019 at 11:09 PM Alvaro Herrera  
> wrote:

> > I'm not sure about copying the same to ddl.sgml.  Why is that needed?
> > Update is not DDL.
> 
> Hmm, maybe because there's already a huge block of text describing
> certain limitations of UPDATE row movement under concurrency?

Uh, you're right, there is.  That seems misplaced :-(  I'm not sure it
even counts as a "limitation"; it seems to belong to the NOTES section
of UPDATE rather than where it is now.

> Actually, I remember commenting *against* having that text in
> ddl.sgml, but it got in there anyway.

We can move it now ...

> > ddl.sgml does say this: "Partitions can also be
> > foreign tables, although they have some limitations that normal tables
> > do not; see CREATE FOREIGN TABLE for more information." which suggests
> > that the limitation might need to be added to create_foreign_table.sgml.
> 
> Actually, that "more information" never got added to
> create_foreign_table.sgml.  There should've been some text about the
> lack for tuple routing at least in PG 10's docs, but I guess that
> never happened.

Sigh.

Since version 10 is going to be supported for a few years still, maybe
we should add it there.

> Should we start now by listing this UPDATE row movement limitation?

I think we should, yes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Update does not move row across foreign partitions in v11

2019-03-08 Thread Amit Langote
On Fri, Mar 8, 2019 at 11:09 PM Alvaro Herrera  wrote:
>
> On 2019-Mar-08, Amit Langote wrote:
>
> > diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
> > index 77430a586c..f5cf8eab85 100644
> > --- a/doc/src/sgml/ref/update.sgml
> > +++ b/doc/src/sgml/ref/update.sgml
> > @@ -291,9 +291,9 @@ UPDATE  > class="parameter">count
> > concurrent UPDATE or DELETE on the
> > same row may miss this row. For details see the section
> > .
> > -   Currently, rows cannot be moved from a partition that is a
> > -   foreign table to some other partition, but they can be moved into a 
> > foreign
> > -   table if the foreign data wrapper supports it.
> > +   While rows can be moved from local partitions to a foreign-table 
> > partition
> > +   partition (provided the foreign data wrapper supports tuple routing), 
> > they
> > +   cannot be moved from a foreign-table partition to some other partition.
> >
> >   
>
> LGTM.  Maybe I'd change "some other" to "another", but maybe on a
> different phase of the moon I'd leave it alone.

Done.

> I'm not sure about copying the same to ddl.sgml.  Why is that needed?
> Update is not DDL.

Hmm, maybe because there's already a huge block of text describing
certain limitations of UPDATE row movement under concurrency?

Actually, I remember commenting *against* having that text in
ddl.sgml, but it got in there anyway.

> ddl.sgml does say this: "Partitions can also be
> foreign tables, although they have some limitations that normal tables
> do not; see CREATE FOREIGN TABLE for more information." which suggests
> that the limitation might need to be added to create_foreign_table.sgml.

Actually, that "more information" never got added to
create_foreign_table.sgml.  There should've been some text about the
lack for tuple routing at least in PG 10's docs, but I guess that
never happened.  Should we start now by listing this UPDATE row
movement limitation?

Anyway, I've only attached the patch for update.sgml this time.

Thanks,
Amit


document-update-row-movement-limitation-v5.patch
Description: Binary data


Re: House style for DocBook documentation?

2019-03-08 Thread Chapman Flack
On 3/8/19 7:38 AM, Peter Eisentraut wrote:
> On 2019-03-08 05:04, Chapman Flack wrote:
> -o  If you want to supply text, use <link>, else <xref>
> +o  If you want to supply text, use <link> or <ulink>, else <xref>
> 
> The choice of <link> vs <xref> is for internal links.  For external
> links you have to use <ulink>.

Understood, but I was thinking of the unintended message possibly
taken by a fast reader who wants to create an external link but also
wants to supply text. "I want to supply text! Is ulink not an option?"

Perhaps:

o  For an internal link, use <link> if you will supply text, else <xref>
o  For an external link, use <ulink> with or without link text

-Chap



Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Alvaro Herrera
On 2019-Mar-08, Pavel Stehule wrote:

> looks like error in xmlXPathCompiledEval function, that produce little bit
> broken result for XML_DOCUMENT_NODE type. I hadn't this problem with
> libxml2 2.7.6 64bit, but I seen this issue on same version on 32bit.
> 
> Currently I had not fresh 32 bit system to check it.
> 
> I found a workaround - in this case copy (and release xmlNode) is not
> necessary.

Hmm ... going over the libxml2 2.7.6 source, I noticed that
xmlFreeNodeList seem to get this right -- it uses xmlFreeDoc for
XML_DOCUMENT_NODE.  Maybe a sufficient answer is to change the
xmlFreeNode there to xmlFreeNodeList.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Update does not move row across foreign partitions in v11

2019-03-08 Thread Alvaro Herrera
On 2019-Mar-08, Amit Langote wrote:

> diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
> index 77430a586c..f5cf8eab85 100644
> --- a/doc/src/sgml/ref/update.sgml
> +++ b/doc/src/sgml/ref/update.sgml
> @@ -291,9 +291,9 @@ UPDATE count
> concurrent UPDATE or DELETE on the
> same row may miss this row. For details see the section
> .
> -   Currently, rows cannot be moved from a partition that is a
> -   foreign table to some other partition, but they can be moved into a 
> foreign
> -   table if the foreign data wrapper supports it.
> +   While rows can be moved from local partitions to a foreign-table partition
> +   partition (provided the foreign data wrapper supports tuple routing), they
> +   cannot be moved from a foreign-table partition to some other partition.
>
>   

LGTM.  Maybe I'd change "some other" to "another", but maybe on a
different phase of the moon I'd leave it alone.

I'm not sure about copying the same to ddl.sgml.  Why is that needed?
Update is not DDL.  ddl.sgml does say this: "Partitions can also be
foreign tables, although they have some limitations that normal tables
do not; see CREATE FOREIGN TABLE for more information." which suggests
that the limitation might need to be added to create_foreign_table.sgml.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Update does not move row across foreign partitions in v11

2019-03-08 Thread Amit Langote
On Fri, Mar 8, 2019 at 8:21 PM David Rowley
 wrote:
>
> On Sat, 9 Mar 2019 at 00:09, Etsuro Fujita  
> wrote:
> > IMO I think it's better that we also mention that the UPDATE can move
> > rows into a foreign partition if the FDW supports it.  No?
>
> In my opinion, there's not much need to talk about what the
> limitations are not when you're mentioning what the limitations are.

In this case, I think it might be OK to contrast the two cases so that
it's clear exactly which side doesn't work and which works.  We also
have the following text in limitations:

 
  
   While primary keys are supported on partitioned tables, foreign
   keys referencing partitioned tables are not supported.  (Foreign key
   references from a partitioned table to some other table are supported.)
  
 

> Maybe it would be worth it if the text was slightly unclear on what's
> affected, but I thought my version was fairly clear.
>
> If you think that it's still unclear, then I wouldn't object to adding
> "  There is no such restriction on UPDATE row
> movements out of native partitions into foreign ones.".  Obviously,
> it's got to be clear for everyone, not just the person who wrote it.

OK, how about this:

--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3877,6 +3877,15 @@ ALTER TABLE measurement ATTACH PARTITION
measurement_y2008m02
   
  

+ 
+  
+   While rows can be moved from local partitions to a foreign-table
+   partition (provided the foreign data wrapper supports tuple routing),
+   they cannot be moved from a foreign-table partition to some
+   other partition.
+  
+ 
+
  
   
BEFORE ROW triggers, if necessary, must be defined

diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 77430a586c..f5cf8eab85 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -291,9 +291,9 @@ UPDATE count
concurrent UPDATE or DELETE on the
same row may miss this row. For details see the section
.
-   Currently, rows cannot be moved from a partition that is a
-   foreign table to some other partition, but they can be moved into a foreign
-   table if the foreign data wrapper supports it.
+   While rows can be moved from local partitions to a foreign-table
+   partition (provided the foreign data wrapper supports tuple routing), they
+   cannot be moved from a foreign-table partition to some other partition.
   
  

At least in update.sgml, we describe that row movement is implemented
as DELETE+INSERT, so I think maybe it's OK to mention tuple routing
when mentioning why that 1-way movement works.  If someone's using an
FDW that doesn't support tuple routing to begin with, they'll get
an error when trying to move rows from a local partition to a foreign
table partition using that FDW, which is this:

ERROR:  cannot route inserted tuples to a foreign table

Then maybe they will come back and read the limitations to see why the
tuple movement didn't work.

Thanks,
Amit


document-update-row-movement-limitation-v4.patch
Description: Binary data


Fwd: Add tablespace tap test to pg_rewind

2019-03-08 Thread Shaoqi Bai
Hi hackers,

There is already a databases TAP test in pg_rewind; I wonder if there is a
need for a tablespace TAP test in pg_rewind as well.
Attached is an initial patch from me.

Here is a patch for running pg_rewind.  It is very similar to
src/bin/pg_rewind/t/002_databases.pl, but there is no master_psql("CREATE
TABLESPACE beforepromotion LOCATION '$tempdir/beforepromotion'"); after
create_standby() and before promote_standby(), because pg_rewind will error
out with:
could not create directory
"/Users/sbai/work/postgres/src/bin/pg_rewind/tmp_check/t_006_tablespace_master_local_data/pgdata/pg_tblspc/24576/PG_12_201903063":
File exists
Failure, exiting

The patch is created on top of the
commit e1e0e8d58c5c70da92e36cb9d59c2f7ecf839e00 (origin/master, origin/HEAD)
Author: Michael Paquier 
Date:   Fri Mar 8 15:10:14 2019 +0900

Fix function signatures of pageinspect in documentation

tuple_data_split() lacked the type of the first argument, and
heap_page_item_attrs() has reversed the first and second argument,
with the bytea argument using an incorrect name.

Author: Laurenz Albe
Discussion:
https://postgr.es/m/8f9ab7b16daf623e87eeef5203a4ffc0dece8dfd.ca...@cybertec.at
Backpatch-through: 9.6
From 4280335c4b81aa04b967d93348ccc6874bc9c287 Mon Sep 17 00:00:00 2001
From: Shaoqi Bai 
Date: Fri, 8 Mar 2019 18:04:20 +0800
Subject: [PATCH] Add tablespace tap test for pg_rewind

---
 src/bin/pg_rewind/t/006_tablespace.pl | 51 +++
 1 file changed, 51 insertions(+)
 create mode 100644 src/bin/pg_rewind/t/006_tablespace.pl

diff --git a/src/bin/pg_rewind/t/006_tablespace.pl b/src/bin/pg_rewind/t/006_tablespace.pl
new file mode 100644
index 00..094041dd89
--- /dev/null
+++ b/src/bin/pg_rewind/t/006_tablespace.pl
@@ -0,0 +1,51 @@
+use strict;
+use warnings;
+use TestLib;
+use Test::More tests => 2;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use RewindTest;
+
+my $tempdir = TestLib::tempdir;
+
+sub run_test
+{
+	my $test_mode = shift;
+
+	RewindTest::setup_cluster($test_mode, ['-g']);
+	RewindTest::start_master();
+
+	RewindTest::create_standby($test_mode);
+
+	RewindTest::promote_standby();
+
+	mkdir "$tempdir/master_afterpromotion";
+	mkdir "$tempdir/standby_afterpromotion";
+
+	# Create tablespaces in the old master and the new promoted standby.
+	master_psql("CREATE TABLESPACE master_afterpromotion LOCATION '$tempdir/master_afterpromotion'");
+	standby_psql("CREATE TABLESPACE standby_afterpromotion LOCATION '$tempdir/standby_afterpromotion'");
+
+	# The clusters are now diverged.
+
+	RewindTest::run_pg_rewind($test_mode);
+
+	# Check that the correct databases are present after pg_rewind.
+	check_query(
+		'SELECT spcname FROM pg_tablespace ORDER BY spcname',
+		qq(pg_default
+pg_global
+standby_afterpromotion
+),
+		'tablespace names');
+
+	RewindTest::clean_rewind_test();
+	return;
+}
+
+# Run the test in both modes.
+run_test('local');
+
+exit(0);
-- 
2.19.1



Re: Reporting script runtimes in pg_regress

2019-03-08 Thread Alvaro Herrera
On 2019-Mar-08, Christoph Berg wrote:

> Re: Peter Eisentraut 2019-03-08 
> <3eb194cf-b878-1f63-8623-6d6add0ed...@2ndquadrant.com>
> > On 2019-02-21 10:37, Christoph Berg wrote:
> > > diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
> > > index a18a6f6c45..8080626e94 100644
> > > --- a/src/test/regress/pg_regress.c
> > > +++ b/src/test/regress/pg_regress.c
> > > @@ -1794,12 +1794,14 @@ run_schedule(const char *schedule, test_function 
> > > tfunc)
> > >   else
> > >   {
> > >   status(_("FAILED"));
> > > + status("  "); /* align with 
> > > failed (ignored) */
> > >   fail_count++;
> > >   }
> > 
> > So an issue here is that in theory "FAILED" etc. are marked for
> > translation but your spacers do not take that into account.  Personally,
> > I have no ambition to translate pg_regress, so we could remove all that.
> >  But it should be done consistently in either case.
> 
> Oh, right. So the way to go would be to use _("FAILED   "), and
> ask translators to use the same length.

Note there's no translation for pg_regress.  All these _() markers are
currently dead code.  It seems hard to become motivated to translate
that kind of program.  I don't think it has much value, myself.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Is it too soon for a PG12 open items wiki page?

2019-03-08 Thread Amit Langote
On Fri, Mar 8, 2019 at 7:52 PM Julien Rouhaud  wrote:
>
> On Fri, Mar 8, 2019 at 11:49 AM David Rowley
>  wrote:
> >
> > Is it too soon for a PG12 open items wiki page?
>
> It seems like a good timing to me.
>
> > I've got a few things I need to keep track of.  Normally writing a
> > patch and putting on the next open CF is a good move, but since the
> > next one is not for PG12, it seems like not the best place.
>
> Agreed, I had the same thought a couple of days ago.

+1

Thanks,
Amit



Re: Reporting script runtimes in pg_regress

2019-03-08 Thread Christoph Berg
Re: Peter Eisentraut 2019-03-08 
<3eb194cf-b878-1f63-8623-6d6add0ed...@2ndquadrant.com>
> On 2019-02-21 10:37, Christoph Berg wrote:
> > diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
> > index a18a6f6c45..8080626e94 100644
> > --- a/src/test/regress/pg_regress.c
> > +++ b/src/test/regress/pg_regress.c
> > @@ -1794,12 +1794,14 @@ run_schedule(const char *schedule, test_function 
> > tfunc)
> > else
> > {
> > status(_("FAILED"));
> > +   status("  "); /* align with 
> > failed (ignored) */
> > fail_count++;
> > }
> 
> So an issue here is that in theory "FAILED" etc. are marked for
> translation but your spacers do not take that into account.  Personally,
> I have no ambition to translate pg_regress, so we could remove all that.
>  But it should be done consistently in either case.

Oh, right. So the way to go would be to use _("FAILED   "), and
ask translators to use the same length.
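
As a sketch only -- assuming status() keeps its printf-style signature -- the
padding could also be expressed as a field width, so translated strings line
up without anyone counting spaces:

    status("%-18s", _("FAILED"));   /* pad to cover "failed (ignored)" */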

> I also think we shouldn't worry about the "failed (ignored)" case.  That
> never happens, and I don't want to mess up the spacing we have now for
> that.  I'd consider removing support for it altogether.

You mean removing that case from pg_regress, or removing the alignment
"support"?

Christoph



Re: pg_basebackup ignores the existing data directory permissions

2019-03-08 Thread Peter Eisentraut
pg_basebackup copies the data directory permission mode from the
upstream server.  But it doesn't copy the ownership.  So if, say, the
upstream server allows group access and things are owned by
postgres:postgres, and I want to make a copy for local development and
make a backup into a directory owned by peter:staff without group
access, then it would be inappropriate for pg_basebackup to change the
permissions on that directory.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable Storage - Andres's take

2019-03-08 Thread Dagfinn Ilmari Mannsåker
Andres Freund  writes:

> On 2019-03-07 08:52:21 -0500, Robert Haas wrote:
>> On Wed, Mar 6, 2019 at 6:11 PM Andres Freund  wrote:
>> > slot that's compatible with the "target" table. You can get compatible
>> > slot callbakcs by calling table_slot_callbacks(), or directly create one
>> > by calling table_gimmegimmeslot() (likely to be renamed :)).
>>
>> Hmm.  I assume the issue is that table_createslot() was already taken
>> for another purpose, so then when you needed another callback you went
>> with table_givemeslot(), and then when you needed a third API to do
>> something in the same area the best thing available was
>> table_gimmeslot(), which meant that the fourth API could only be
>> table_gimmegimmeslot().
>>
>> Does that sound about right?
>
> It was 3 AM, and I thought it was hilarious...

♫ Gimme! Gimme! Gimme! A slot after midnight ♫

- ilmari (SCNR) 
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen



Re: psql show URL with help

2019-03-08 Thread Peter Eisentraut
On 2019-03-07 23:02, David Fetter wrote:
>> if (psql_version_is_numeric)
>>  return /docs/psql_version/
>> else if (psql_version ends with 'devel')
>>   return /docs/devel/
>> else
>>   return /docs/{psql_version but with text stripped}/
>>
>> So that e.g. 12beta would return "12", as would 12rc or 12alpha. But
>> 12devel would return "devel".
> 
> That's exactly what I had in mind :)

The outcome of that is exactly what my patch does, but the inputs are
different.  We have PG_MAJORVERSION, which is always a single integer,
and PG_VERSION, which could be 10.9.8 or 11beta5 or 12devel.  The patch does

if (PG_VERSION ends with 'devel')
  return /docs/devel/
else
 return /docs/$PG_MAJORVERSION/

There is no third case.  Your third case of not-numeric-and-not-devel is
correctly covered by the else branch.
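
Spelled out as a minimal sketch (not the actual patch), with PG_VERSION and
PG_MAJORVERSION being the string macros from pg_config.h:

    #include "postgres_fe.h"    /* for PG_VERSION / PG_MAJORVERSION */
    #include <string.h>

    static bool
    ends_with(const char *s, const char *suffix)
    {
        size_t      slen = strlen(s);
        size_t      xlen = strlen(suffix);

        return slen >= xlen && strcmp(s + slen - xlen, suffix) == 0;
    }

    static const char *
    docs_version_dir(void)
    {
        return ends_with(PG_VERSION, "devel") ? "devel" : PG_MAJORVERSION;
    }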

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: House style for DocBook documentation?

2019-03-08 Thread Peter Eisentraut
On 2019-03-08 05:04, Chapman Flack wrote:
> I only now noticed that probably this should also have been changed:
> 
> -o  If you want to supply text, use <link>, else <xref>
> +o  If you want to supply text, use <link> or <ulink>, else <xref>

The choice of <link> vs <xref> is for internal links.  For external
links you have to use <ulink>.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Pavel Stehule
On Fri, Mar 8, 2019 at 13:20, Alvaro Herrera wrote:

> On 2019-Mar-08, Pavel Stehule wrote:
>
> > It looks like an error in the xmlXPathCompiledEval function, which
> > produces a slightly broken result for the XML_DOCUMENT_NODE type. I
> > didn't have this problem with libxml2 2.7.6 64-bit, but I have seen this
> > issue with the same version on 32-bit.
> >
> > Currently I don't have a fresh 32-bit system to check it.
> >
> > I found a workaround - in this case the copy (and release of the
> > xmlNode) is not necessary.
> >
> > Please apply the attached patch.
>
> Wow :-(  At this point I'm wondering if this should be put in back
> branches as well ... I mean, distill part of commit 251cf2e27bec that
> doesn't affect the behavior of text nodes, and put it on all branches
> together with your fix?
>

The problem is just for the case where the result is XML_DOCUMENT_TYPE and
the target is XML. For this case the previously used transformation doesn't
work.

It is not a problem to detect this situation: the content field has -1
instead of 0.

Originally the logic was inverted, so xmlCopyNode and xmlFreeNode were not
used for XML_DOCUMENT_TYPE, and then we didn't hit this bug.


> Another thought: should we refuse to work on known-broken libxml2
> versions?  Seems like this bug could affect other parts of code too -- I
> see that xmlXPathCompiledEval() is called in five places (including two
> in contrib/xml2).
>
> Third thought: an alternative might be to create a wrapper for
> xmlXPathCompiledEval that detects NULL content and fills in a pointer
> that xmlFreeNode can free.
>



>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Reporting script runtimes in pg_regress

2019-03-08 Thread Peter Eisentraut
On 2019-02-21 10:37, Christoph Berg wrote:
> diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
> index a18a6f6c45..8080626e94 100644
> --- a/src/test/regress/pg_regress.c
> +++ b/src/test/regress/pg_regress.c
> @@ -1794,12 +1794,14 @@ run_schedule(const char *schedule, test_function 
> tfunc)
>   else
>   {
>   status(_("FAILED"));
> + status("  "); /* align with 
> failed (ignored) */
>   fail_count++;
>   }

So an issue here is that in theory "FAILED" etc. are marked for
translation but your spacers do not take that into account.  Personally,
I have no ambition to translate pg_regress, so we could remove all that.
 But it should be done consistently in either case.

I also think we shouldn't worry about the "failed (ignored)" case.  That
never happens, and I don't want to mess up the spacing we have now for
that.  I'd consider removing support for it altogether.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Alvaro Herrera
On 2019-Mar-08, Pavel Stehule wrote:

> It looks like an error in the xmlXPathCompiledEval function, which produces
> a slightly broken result for the XML_DOCUMENT_NODE type. I didn't have this
> problem with libxml2 2.7.6 64-bit, but I have seen this issue with the same
> version on 32-bit.
> 
> Currently I don't have a fresh 32-bit system to check it.
> 
> I found a workaround - in this case the copy (and release of the xmlNode)
> is not necessary.
> 
> Please apply the attached patch.

Wow :-(  At this point I'm wondering if this should be put in back
branches as well ... I mean, distill part of commit 251cf2e27bec that
doesn't affect the behavior of text nodes, and put it on all branches
together with your fix?

Another thought: should we refuse to work on known-broken libxml2
versions?  Seems like this bug could affect other parts of code too -- I
see that xmlXPathCompiledEval() is called in five places (including two
in contrib/xml2).

Third thought: an alternative might be to create a wrapper for
xmlXPathCompiledEval that detects NULL content and fills in a pointer
that xmlFreeNode can free.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-08 Thread Michael Banck
Hi,

Am Sonntag, den 03.03.2019, 11:51 +0100 schrieb Michael Banck:
> Am Samstag, den 02.03.2019, 11:08 -0500 schrieb Stephen Frost:
> > I'm not necessairly against skipping to the next file, to be clear,
> > but I think I'd be happier if we kept reading the file until we
> > actually get EOF.
> 
> So if we read half a block twice we should seek() to the next block and
> continue till EOF, ok. I think in most cases those pages will be new
> anyway and there will be no checksum check, but it sounds like a cleaner
> approach. I've seen one or two examples where we did successfully verify
> the checksum of a page after a half-read, so it might be worth it.

I've done that now, i.e. it seeks to the next block and continues to
read there (possibly getting an EOF).

I don't issue a warning for this skipped block anymore as it is somewhat
to be expected that we see some half-reads. If the seek fails for some
reason, that is still a warning.
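
For readers skimming the patch below, here is a condensed sketch of that
behaviour, under the assumption of an 8 kB block size; it is not the patch
itself, and the actual checksum verification is elided:

#include <stdbool.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed block size */

static void
scan_file_sketch(int fd)
{
    char    buf[BLCKSZ];
    bool    block_retry = false;

    for (;;)
    {
        ssize_t r = read(fd, buf, BLCKSZ);

        if (r == 0)
            break;              /* EOF */
        if (r < 0)
            return;             /* read error: give up on this file */

        if (r == BLCKSZ)
        {
            /* ... verify the page checksum here ... */
            block_retry = false;
            continue;
        }

        if (!block_retry)
        {
            /* short read: rewind and retry this block once */
            if (lseek(fd, -r, SEEK_CUR) == -1)
                return;
            block_retry = true;
            continue;
        }

        /* second short read: silently skip to the next block boundary */
        if (lseek(fd, BLCKSZ - r, SEEK_CUR) == -1)
            return;             /* a failed seek would still warrant a warning */
        block_retry = false;
    }
}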

> I still think that an external checksum verification tool has some
> merit, given that basebackup does it and the current offline requirement
> is really not useful in practise.

I've read the rest of the thread, and it seems several people prefer a
solution that interacts with the server. I won't be able to work on that
for v12 and I guess it would be too late in the cycle anyway.

I thought about I/O throttling in online mode, but it seems to be most
easily tied in with the progress reporting (that already keeps track of
everything or most of what we'd need), so I will work on it in that
context.



Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/doc/src/sgml/ref/pg_verify_checksums.sgml b/doc/src/sgml/ref/pg_verify_checksums.sgml
index 905b8f1222..4ad6edcde6 100644
--- a/doc/src/sgml/ref/pg_verify_checksums.sgml
+++ b/doc/src/sgml/ref/pg_verify_checksums.sgml
@@ -37,9 +37,8 @@ PostgreSQL documentation
   Description
   
pg_verify_checksums verifies data checksums in a
-   PostgreSQL cluster.  The server must be shut
-   down cleanly before running pg_verify_checksums.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   PostgreSQL cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   
  
 
diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 511262ab5f..c61bd19bf9 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -1,7 +1,7 @@
 /*
  * pg_verify_checksums
  *
- * Verifies page level checksums in an offline cluster
+ * Verifies page level checksums in a cluster
  *
  *	Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -24,12 +24,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -86,10 +90,17 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 progname, fn, strerror(errno));
 		exit(1);
@@ -104,26 +115,130 @@ scan_file(const char *fn, BlockNumber segmentno)
 
 		if (r == 0)
 			break;
+		if (r < 0)
+		{
+			skippedfiles++;
+			fprintf(stderr, _("%s: could not read block %u in file \"%s\": %s\n"),
+	progname, blockno, fn, strerror(errno));
+			return;
+		}
 		if (r != BLCKSZ)
 		{
-			fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
-	progname, blockno, fn, r, BLCKSZ);
-			exit(1);
+			if (online)
+			{
+if (block_retry)
+{
+	/* We already tried once to reread the block, skip to the next block */
+	skippedblocks++;
+	if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
+	{
+		skippedfiles++;
+		fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
+progname, fn);
+		return;
+	}
+	continue;
+}
+
+/*
+ * Retry the block. It's possible that we read the block while it
+ * was extended or shrunk, so it ends up looking torn to us.
+ */
+
+/*
+ * Seek back by the amount of bytes we read to the 

Re: WIP: Avoid creation of the free space map for small tables

2019-03-08 Thread Amit Kapila
On Fri, Mar 8, 2019 at 5:13 PM Amit Kapila  wrote:
>
>
> Few minor comments:
..
>
> 2.
> +
> + /* Transfer any VM files if we can trust their
> contents. */
>   if (vm_crashsafe_match)
>
> 3. Can we add a note about this in the Notes section of pg_upgrade
> documentation [1]?
>
> This comment line doesn't seem to be related to this patch.  If so, I
> think we can avoid having any additional change which is not related
> to the functionality of this patch.  Feel free to submit it
> separately, if you think it is an improvement.
>

oops, I have messed up the above comments.  The paragraph starting
with "This comment line doesn't ..." is for comment number-2.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WIP: Avoid creation of the free space map for small tables

2019-03-08 Thread Amit Kapila
On Wed, Mar 6, 2019 at 5:19 PM John Naylor  wrote:
>
> On Fri, Jan 25, 2019 at 9:50 AM Amit Kapila  wrote:
> > Once we agree on the code, we need to test below scenarios:
> > (a) upgrade from all supported versions to the latest version
> > (b) upgrade standby with and without using rsync.
>
> Although the code hasn't been reviewed yet, I went ahead and tested
> (a) on v21 of the pg_upgrade patch [1]. To do this I dumped out a 9.4
> instance with the regression database and restored it to all supported
> versions. To make it work with pg_upgrade, I first had to drop tables
> with oids, drop functions referring to C libraries, and drop the
> later-removed '=>' operator. Then I pg_upgrade'd in copy mode from all
> versions to HEAD with the patch applied. pg_upgrade worked without
> error, and the following query returned 0 bytes on all the new
> clusters:
>
> select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size
> from pg_class where relkind in ('r', 't')
> and pg_relation_size(oid, 'main') <= 4 * 8192;
>
> The complementary query (> 4 * 8192) returned the same number of bytes
> as in the original 9.4 instance.
>

Thanks, the tests done by you are quite useful.  I have given a few
comments as a response to your previous email.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Pavel Stehule
On Fri, Mar 8, 2019 at 7:41, Pavel Stehule wrote:

> Hi
>
> On Fri, Mar 8, 2019 at 3:44, Alvaro Herrera wrote:
>
>> On 2019-Mar-07, Alvaro Herrera wrote:
>>
>> > On 2019-Feb-11, Chapman Flack wrote:
>> >
>> > > xmltable-xpath-result-processing-bugfix-6.patch includes a
>> regress/expected
>> > > output for the no-libxml case that was left out of -5.
>> >
>> > Pushed this one, with some trivial changes: I renamed and relocated
>> > Pavel's function to strdup-and-free and fused the new test cases to use
>> > less queries for the same functionality.  Naturally I had to adjust the
>> > expected files ... I tried to do my best but there's always a little
>> > something that sneaks under one's nose.
>>
>> So we now have a double-free bug here or something ...  Too tired right
>> now to do anything about it.
>>
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison=2019-03-08%2002%3A00%3A02
>>
>> == stack trace:
>> pgsql.build/src/test/regress/tmp_check/data/core ==
>> [New LWP 20275]
>>
>> warning: Can't read pathname for load map: Input/output error.
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library
>> "/lib/arm-linux-gnueabihf/libthread_db.so.1".
>> Core was generated by `postgres: pgbuildfarm regression [local] SELECT
>>'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x76a32e04 in free () from /lib/arm-linux-gnueabihf/libc.so.6
>> #0  0x76a32e04 in free () from /lib/arm-linux-gnueabihf/libc.so.6
>> #1  0x76e28380 in xmlFreeNode () from
>> /usr/lib/arm-linux-gnueabihf/libxml2.so.2
>> #2  0x00481f94 in xml_xmlnodetoxmltype (cur=,
>> xmlerrcxt=) at xml.c:3751
>> #3  0x004823dc in XmlTableGetValue (state=0x148c370, colnum=1994818404,
>> typid=2124685192, typmod=0, isnull=0x7ea42117) at xml.c:4540
>> #4  0x0026df60 in tfuncLoadRows (econtext=0x2, tstate=0x14a8578) at
>> nodeTableFuncscan.c:489
>> #5  tfuncFetchRows (tstate=0x14a8578, econtext=0x2) at
>> nodeTableFuncscan.c:318
>> #6  0x0026e248 in TableFuncNext (node=0x14a83e0) at nodeTableFuncscan.c:65
>> #7  0x0023e640 in ExecScanFetch (recheckMtd=0x23e640 ,
>> accessMtd=0x26db1c , node=0x14a83e0) at execScan.c:93
>> #8  ExecScan (node=0x14a83e0, accessMtd=0x26db1c ,
>> recheckMtd=0x23e640 ) at execScan.c:143
>> #9  0x0023c638 in ExecProcNodeFirst (node=0x14a83e0) at execProcnode.c:445
>> #10 0x00235630 in ExecProcNode (node=0x14a83e0) at
>> ../../../src/include/executor/executor.h:241
>> #11 ExecutePlan (execute_once=, dest=0x14b8d08,
>> direction=, numberTuples=0, sendTuples=,
>> operation=CMD_SELECT, use_parallel_mode=,
>> planstate=0x14a83e0, estate=0x14a82a0) at execMain.c:1643
>> #12 standard_ExecutorRun (queryDesc=0x142c0e0, direction=,
>> count=0, execute_once=true) at execMain.c:362
>> #13 0x003955f4 in PortalRunSelect (portal=0x13e3f48, forward=> out>, count=0, dest=) at pquery.c:929
>> #14 0x00396a0c in PortalRun (portal=0x0, count=0, isTopLevel=> out>, run_once=, dest=0x14b8d08, altdest=0x14b8d08,
>> completionTag=0x7ea42414 "") at pquery.c:770
>> #15 0x003923c8 in exec_simple_query (query_string=0x7ea42414 "") at
>> postgres.c:1215
>> #16 0x00393e8c in PostgresMain (argc=, argv=> out>, dbname=0x0, username=) at postgres.c:4256
>> #17 0x000849a8 in BackendRun (port=0x13967b8) at postmaster.c:4399
>> #18 BackendStartup (port=0x13967b8) at postmaster.c:4090
>> #19 ServerLoop () at postmaster.c:1703
>> #20 0x0031607c in PostmasterMain (argc=, argv=> out>) at postmaster.c:1376
>> #21 0x000864d4 in main (argc=7301080, argv=0x8) at main.c:228
>>
>
>
It looks like an error in the xmlXPathCompiledEval function, which produces a
slightly broken result for the XML_DOCUMENT_NODE type. I didn't have this
problem with libxml2 2.7.6 64-bit, but I have seen this issue with the same
version on 32-bit.

Currently I don't have a fresh 32-bit system to check it.

I found a workaround - in this case the copy (and release of the xmlNode) is
not necessary.

Please apply the attached patch.


> Pavel
>
>
>>
>> --
>> Álvaro Herrera                https://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 28b3eaaa20..41145e697a 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3720,35 +3720,58 @@ xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
 
 	if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)
 	{
-		xmlBufferPtr buf;
-		xmlNodePtr	cur_copy;
+		volatile xmlBufferPtr buf = NULL;
+		volatile xmlNodePtr	cur_copy = NULL;
 
-		buf = xmlBufferCreate();
+		PG_TRY();
+		{
+			int		bytes;
 
-		/*
-		 * The result of xmlNodeDump() won't contain namespace definitions
-		 * from parent nodes, but xmlCopyNode() duplicates a node along with
-		 * its required namespace definitions.
-		 */
-		cur_copy = xmlCopyNode(cur, 1);
+			buf = xmlBufferCreate();
 
-		if (cur_copy == NULL)
-			xml_ereport(xmlerrcxt, ERROR, 

Re: WIP: Avoid creation of the free space map for small tables

2019-03-08 Thread Amit Kapila
On Mon, Jan 28, 2019 at 2:33 AM John Naylor  wrote:
>
> On Sat, Jan 26, 2019 at 2:14 PM Amit Kapila  wrote:
> >
> > I think there is some value in using the information from
> > this function to skip fsm files, but the code doesn't appear to fit
> > well, how about moving this check to new function
> > new_cluster_needs_fsm()?
>
> For v21, new_cluster_needs_fsm() has all responsibility for obtaining
> the info it needs. I think this is much cleaner,
>

Right, now the code looks much better.

> but there is a small
> bit of code duplication since it now has to form the file name. One
> thing we could do is form the base old/new file names in
> transfer_single_new_db() and pass those to transfer_relfile(), which
> will only add suffixes and segment numbers. We could then pass the
> base old file name to new_cluster_needs_fsm() and use it as is. Not
> sure if that's worthwhile, though.
>

I don't think it is worth it.

Few minor comments:
1.
warning C4715: 'new_cluster_needs_fsm': not all control paths return a value

Getting this new warning in the patch.

2.
+
+ /* Transfer any VM files if we can trust their
contents. */
  if (vm_crashsafe_match)

3. Can we add a note about this in the Notes section of pg_upgrade
documentation [1]?

This comment line doesn't seem to be related to this patch.  If so, I
think we can avoid having any additional change which is not related
to the functionality of this patch.  Feel free to submit it
separately, if you think it is an improvement.

Have you done any performance testing of this patch?  I mean to say
now that we added a new stat call for each table, we should see if
that has any impact.  Ideally, that should be compensated by the fact
that we are now not transferring *fsm files for small relations.  How
about constructing a test where all relations are greater than 4 pages
and then try to upgrade them.  We can check for a cluster with a
different number of relations say 10K, 20K, 50K, 100K.

In general, the patch looks okay to me.  I would like to know if
anybody else has an opinion on whether pg_upgrade should skip
transferring fsm files for small relations or not.  I think both John
and I think it is a good feature to have, and now that the patch
turns out to be simpler, I feel we can go ahead with this optimization
in pg_upgrade.

[1] - https://www.postgresql.org/docs/devel/pgupgrade.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Update does not move row across foreign partitions in v11

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 00:09, Etsuro Fujita  wrote:
> IMO I think it's better that we also mention that the UPDATE can move
> rows into a foreign partition if the FDW supports it.  No?

In my opinion, there's not much need to talk about what the
limitations are not when you're mentioning what the limitations are.
Maybe it would be worth it if the text was slightly unclear on what's
affected, but I thought my version was fairly clear.

If you think that it's still unclear, then I wouldn't object to adding
"  There is no such restriction on UPDATE row
movements out of native partitions into foreign ones.".  Obviously,
it's got to be clear for everyone, not just the person who wrote it.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Update does not move row across foreign partitions in v11

2019-03-08 Thread Etsuro Fujita

(2019/03/08 19:29), David Rowley wrote:

On Fri, 8 Mar 2019 at 15:07, Amit Langote  wrote:

David, can you confirm if the rewritten text reads unambiguous or perhaps
suggest a better wording?


So this is the text:

+   Currently, rows cannot be moved from a foreign-table partition to some
+   other partition, but they can be moved into a foreign-table partition if
+   the foreign data wrapper supports tuple routing.

I read this to mean that rows cannot normally be moved out of a
foreign-table partition unless the new partition is a foreign one that
uses an FDW that supports tuple routing.

So let's test that:

create extension postgres_fdw ;
do $$ begin execute 'create server loopback foreign data wrapper
postgres_fdw options (dbname ''' || current_database() || ''');'; end;
$$;
create user mapping for current_user server loopback;

create table listp (a int) partition by list (a);
create table listp1 (a int, check (a = 1));
create table listp2 (a int, check (a = 2));

create foreign table listpf1 partition of listp for values in (1)
server loopback options (table_name 'listp1');
create foreign table listpf2 partition of listp for values in (2)
server loopback options (table_name 'listp2');

insert into listp values (1);

update listp set a = 2 where a = 1;
ERROR:  new row for relation "listp1" violates check constraint "listp1_a_check"
DETAIL:  Failing row contains (2).
CONTEXT:  remote SQL command: UPDATE public.listp1 SET a = 2 WHERE ((a = 1))

I'd be filing a bug report for that as I'm moving a row into a foreign
table with an FDW that supports tuple routing.


Fair enough.


Where I think you're going wrong is, in one part of the sentence
you're talking about UPDATE, then in the next part you seem to
magically jump to talking about INSERT.  Since the entire paragraph is
talking about UPDATE, why is it relevant to talk about INSERT?

I thought my doc_confirm_foreign_partition_limitations.patch had this
pretty clear with:

+
+
+   Currently, an UPDATE of a partitioned table cannot
+   move rows out of a foreign partition into another partition.
+
+

Or is my understanding of this incorrect?


IMO I think it's better that we also mention that the UPDATE can move 
rows into a foreign partition if the FDW supports it.  No?



I also think the new
paragraph is a good move as it's a pretty restrictive limitation for
anyone that wants to set up a partition hierarchy with foreign
partitions.


+1

Best regards,
Etsuro Fujita




Re: Is it too soon for a PG12 open items wiki page?

2019-03-08 Thread Julien Rouhaud
On Fri, Mar 8, 2019 at 11:49 AM David Rowley
 wrote:
>
> Is it too soon for a PG12 open items wiki page?

It seems like good timing to me.

> I've got a few things I need to keep track of.  Normally writing a
> patch and putting on the next open CF is a good move, but since the
> next one is not for PG12, it seems like not the best place.

Agreed, I had the same thought a couple of days ago.



Is it too soon for a PG12 open items wiki page?

2019-03-08 Thread David Rowley
Is it too soon for a PG12 open items wiki page?

I've got a few things I need to keep track of.  Normally writing a
patch and putting on the next open CF is a good move, but since the
next one is not for PG12, it seems like not the best place.

Any objections to me making one now?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Update does not move row across foreign partitions in v11

2019-03-08 Thread David Rowley
On Fri, 8 Mar 2019 at 15:07, Amit Langote  wrote:
> David, can you confirm if the rewritten text reads unambiguous or perhaps
> suggest a better wording?

So this is the text:

+   Currently, rows cannot be moved from a foreign-table partition to some
+   other partition, but they can be moved into a foreign-table partition if
+   the foreign data wrapper supports tuple routing.

I read this to mean that rows cannot normally be moved out of a
foreign-table partition unless the new partition is a foreign one that
uses an FDW that supports tuple routing.

So let's test that:

create extension postgres_fdw ;
do $$ begin execute 'create server loopback foreign data wrapper
postgres_fdw options (dbname ''' || current_database() || ''');'; end;
$$;
create user mapping for current_user server loopback;

create table listp (a int) partition by list (a);
create table listp1 (a int, check (a = 1));
create table listp2 (a int, check (a = 2));

create foreign table listpf1 partition of listp for values in (1)
server loopback options (table_name 'listp1');
create foreign table listpf2 partition of listp for values in (2)
server loopback options (table_name 'listp2');

insert into listp values (1);

update listp set a = 2 where a = 1;
ERROR:  new row for relation "listp1" violates check constraint "listp1_a_check"
DETAIL:  Failing row contains (2).
CONTEXT:  remote SQL command: UPDATE public.listp1 SET a = 2 WHERE ((a = 1))

I'd be filing a bug report for that as I'm moving a row into a foreign
table with an FDW that supports tuple routing.

Where I think you're going wrong is, in one part of the sentence
you're talking about UPDATE, then in the next part you seem to
magically jump to talking about INSERT.  Since the entire paragraph is
talking about UPDATE, why is it relevant to talk about INSERT?

I thought my doc_confirm_foreign_partition_limitations.patch had this
pretty clear with:

+ 
+  
+   Currently, an UPDATE of a partitioned table cannot
+   move rows out of a foreign partition into another partition.
+ 
+

Or is my understanding of this incorrect?  I also think the new
paragraph is a good move as it's a pretty restrictive limitation for
anyone that wants to set up a partition hierarchy with foreign
partitions.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Add tablespace tap test to pg_rewind

2019-03-08 Thread Shaoqi Bai
Hi hackers,

There is already a databases tap test in pg_rewind; I wonder if there is a
need for a tablespace tap test in pg_rewind.
Attached is an initial patch from me.

Here is a patch for running pg_rewind; it is very similar to
src/bin/pg_rewind/t/002_databases.pl, but there is no master_psql("CREATE
TABLESPACE beforepromotion LOCATION '$tempdir/beforepromotion'"); after
create_standby() and before promote_standby(), because pg_rewind will error
out :
could not create directory
"/Users/sbai/work/postgres/src/bin/pg_rewind/tmp_check/t_006_tablespace_master_local_data/pgdata/pg_tblspc/24576/PG_12_201903063":
File exists
Failure, exiting

The patch is created on top of the
commit e1e0e8d58c5c70da92e36cb9d59c2f7ecf839e00 (origin/master, origin/HEAD)
Author: Michael Paquier 
Date:   Fri Mar 8 15:10:14 2019 +0900

Fix function signatures of pageinspect in documentation

tuple_data_split() lacked the type of the first argument, and
heap_page_item_attrs() has reversed the first and second argument,
with the bytea argument using an incorrect name.

Author: Laurenz Albe
Discussion:
https://postgr.es/m/8f9ab7b16daf623e87eeef5203a4ffc0dece8dfd.ca...@cybertec.at
Backpatch-through: 9.6


0001-Add-tablespace-tap-test-for-pg_rewind.patch
Description: Binary data


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-08 Thread Heikki Linnakangas

On 08/03/2019 12:22, Peter Geoghegan wrote:

I would like to work through these other items with you
(_bt_binsrch_insert() and so on), but I think that it would be helpful
if you made an effort to understand the minusinfkey stuff first. I
spent a lot of time improving the explanation of that within
_bt_compare(). It's important.


Ok, after thinking about it for a while, I think I understand the minus 
infinity stuff now. Let me try to explain it in my own words:


Imagine that you have an index with two key columns, A and B. The index 
has two leaf pages, with the following items:


+--------+   +--------+
| Page 1 |   | Page 2 |
|        |   |        |
| 1 1    |   | 2 1    |
| 1 2    |   | 2 2    |
| 1 3    |   | 2 3    |
| 1 4    |   | 2 4    |
| 1 5    |   | 2 5    |
+--------+   +--------+

The key space is neatly split on the first key column - probably thanks 
to the new magic in the page split code.


Now, what do we have as the high key of page 1? Answer: "2 -inf". The 
"-inf" is not stored in the key itself, the second key column is just 
omitted, and the search code knows to treat it implicitly as a value 
that's lower than any real value. Hence, "minus infinity".


However, during page deletion, we need to perform a search to find the 
downlink pointing to a leaf page. We do that by using the leaf page's 
high key as the search key. But the search needs to treat it slightly 
differently in that case. Normally, searching with a single key value, 
"2", we would land on page 2, because any real value beginning with "2" 
would be on that page, but in the page deletion case, we want to find 
page 1. Setting the BTScanInsert.minusinfkey flag tells the search code 
to do that.


Question: Wouldn't it be more straightforward to use "1 +inf" as page 
1's high key? I.e treat any missing columns as positive infinity. That 
way, the search for page deletion wouldn't need to be treated 
differently. That's also how this used to work: all tuples on a page 
used to be <= high key, not strictly < high key. And it would also make 
the rightmost page less of a special case: you could pretend the 
rightmost page has a pivot tuple with all columns truncated away, which 
means positive infinity.


You have this comment in _bt_split() which touches the subject:


/*
 * The "high key" for the new left page will be the first key that's going
 * to go into the new right page, or possibly a truncated version if this
 * is a leaf page split.  This might be either the existing data item at
 * position firstright, or the incoming tuple.
 *
 * The high key for the left page is formed using the first item on the
 * right page, which may seem to be contrary to Lehman & Yao's approach of
 * using the left page's last item as its new high key when splitting on
 * the leaf level.  It isn't, though: suffix truncation will leave the
 * left page's high key fully equal to the last item on the left page when
 * two tuples with equal key values (excluding heap TID) enclose the split
 * point.  It isn't actually necessary for a new leaf high key to be equal
 * to the last item on the left for the L&Y "subtree" invariant to hold.
 * It's sufficient to make sure that the new leaf high key is strictly
 * less than the first item on the right leaf page, and greater than or
 * equal to (not necessarily equal to) the last item on the left leaf
 * page.
 *
 * In other words, when suffix truncation isn't possible, L&Y's exact
 * approach to leaf splits is taken.  (Actually, even that is slightly
 * inaccurate.  A tuple with all the keys from firstright but the heap TID
 * from lastleft will be used as the new high key, since the last left
 * tuple could be physically larger despite being opclass-equal in respect
 * of all attributes prior to the heap TID attribute.)
 */


But it doesn't explain why it's done like that.

- Heikki



Re: insensitive collations

2019-03-08 Thread Peter Eisentraut
On 2019-03-07 20:04, Daniel Verite wrote:
> With previous versions, we'd need to call ucol_setAttribute(),
> with the attributes and values defined here:
> http://icu-project.org/apiref/icu4c/ucol_8h.html
> for instance to get colStrength=secondary:
>   ucol_setAttribute(coll, UCOL_STRENGTH, UCOL_SECONDARY, &status);
> which I've just checked gives the expected result with ICU-4.2.

I see.  I'm thinking about adding some ad hoc code to
pg_newlocale_from_collation() to parse these keywords ourselves, so we
can provide the same interface for old ICU versions.  I'll send a
separate patch for that.
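
For illustration, a minimal sketch of what such ad hoc parsing could look
like for the colStrength keyword, using the long-standing ucol_setAttribute()
API; the helper name and the limited set of handled values are assumptions,
not the actual patch:

#include <string.h>
#include <unicode/ucol.h>

/* Hypothetical helper: map a "colStrength" value onto the pre-ICU-54 API */
static void
apply_colstrength(UCollator *coll, const char *value)
{
    UErrorCode  status = U_ZERO_ERROR;

    if (strcmp(value, "primary") == 0)
        ucol_setAttribute(coll, UCOL_STRENGTH, UCOL_PRIMARY, &status);
    else if (strcmp(value, "secondary") == 0)
        ucol_setAttribute(coll, UCOL_STRENGTH, UCOL_SECONDARY, &status);
    /* ... remaining strengths and other attributes handled similarly ... */
}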

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Small doc fix for pageinspect

2019-03-08 Thread Laurenz Albe
Michael Paquier wrote:
> On Thu, Mar 07, 2019 at 09:00:24PM +0100, Laurenz Albe wrote:
> > This should be backpatched down to 9.6 where the functions have been
> > added.
> 
> Thanks, applied.  The second argument name of heap_page_item_attrs is
> actually "page", and not "t_data", so both your patch and the docs
> were wrong on this point.

Thanks, and pardon the sloppiness.

Yours,
Laurenz Albe




Re: Covering GiST indexes

2019-03-08 Thread Alexander Korotkov
On Wed, Jan 30, 2019 at 4:16 AM Andreas Karlsson  wrote:
>
> On 29/01/2019 19.58, Andrey Borodin wrote:
> > PFA patch without redundant checks.
>
> I hope it is ok that I fixed a typo and some wording in the comment,
> plus re-added the btree query which I accidentally removed in my last mail.
>
> I think the current state of the patch is fine, assuming the solution
> with truncTupdesc is deemed to be good enough by the committer, so I
> will mark this as ready for committer.
>
> Thanks for the patch and the speedy fixes!

I made some adjustments for this patch:
 * Rename tupledesc and truncTupdesc to leafTupdesc and
nonLeafTupdesc correspondingly.  I think this makes things more clear.
 * Some improvements to docs and comments.
 * Revise commit message.

I'm going to push this if there are no objections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-Covering-GiST-v11.patch
Description: Binary data


Re: PostgreSQL vs SQL/XML Standards

2019-03-08 Thread Pavel Stehule
On Fri, Mar 8, 2019 at 3:44, Alvaro Herrera wrote:

> On 2019-Mar-07, Alvaro Herrera wrote:
>
> > On 2019-Feb-11, Chapman Flack wrote:
> >
> > > xmltable-xpath-result-processing-bugfix-6.patch includes a
> regress/expected
> > > output for the no-libxml case that was left out of -5.
> >
> > Pushed this one, with some trivial changes: I renamed and relocated
> > Pavel's function to strdup-and-free and fused the new test cases to use
> > less queries for the same functionality.  Naturally I had to adjust the
> > expected files ... I tried to do my best but there's always a little
> > something that sneaks under one's nose.
>
> So we now have a double-free bug here or something ...  Too tired right
> now to do anything about it.
>
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison=2019-03-08%2002%3A00%3A02
>
> == stack trace:
> pgsql.build/src/test/regress/tmp_check/data/core ==
> [New LWP 20275]
>
> warning: Can't read pathname for load map: Input/output error.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library
> "/lib/arm-linux-gnueabihf/libthread_db.so.1".
> Core was generated by `postgres: pgbuildfarm regression [local] SELECT
>'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x76a32e04 in free () from /lib/arm-linux-gnueabihf/libc.so.6
> #0  0x76a32e04 in free () from /lib/arm-linux-gnueabihf/libc.so.6
> #1  0x76e28380 in xmlFreeNode () from
> /usr/lib/arm-linux-gnueabihf/libxml2.so.2
> #2  0x00481f94 in xml_xmlnodetoxmltype (cur=,
> xmlerrcxt=) at xml.c:3751
> #3  0x004823dc in XmlTableGetValue (state=0x148c370, colnum=1994818404,
> typid=2124685192, typmod=0, isnull=0x7ea42117) at xml.c:4540
> #4  0x0026df60 in tfuncLoadRows (econtext=0x2, tstate=0x14a8578) at
> nodeTableFuncscan.c:489
> #5  tfuncFetchRows (tstate=0x14a8578, econtext=0x2) at
> nodeTableFuncscan.c:318
> #6  0x0026e248 in TableFuncNext (node=0x14a83e0) at nodeTableFuncscan.c:65
> #7  0x0023e640 in ExecScanFetch (recheckMtd=0x23e640 ,
> accessMtd=0x26db1c , node=0x14a83e0) at execScan.c:93
> #8  ExecScan (node=0x14a83e0, accessMtd=0x26db1c ,
> recheckMtd=0x23e640 ) at execScan.c:143
> #9  0x0023c638 in ExecProcNodeFirst (node=0x14a83e0) at execProcnode.c:445
> #10 0x00235630 in ExecProcNode (node=0x14a83e0) at
> ../../../src/include/executor/executor.h:241
> #11 ExecutePlan (execute_once=, dest=0x14b8d08,
> direction=, numberTuples=0, sendTuples=,
> operation=CMD_SELECT, use_parallel_mode=,
> planstate=0x14a83e0, estate=0x14a82a0) at execMain.c:1643
> #12 standard_ExecutorRun (queryDesc=0x142c0e0, direction=,
> count=0, execute_once=true) at execMain.c:362
> #13 0x003955f4 in PortalRunSelect (portal=0x13e3f48, forward= out>, count=0, dest=) at pquery.c:929
> #14 0x00396a0c in PortalRun (portal=0x0, count=0, isTopLevel= out>, run_once=, dest=0x14b8d08, altdest=0x14b8d08,
> completionTag=0x7ea42414 "") at pquery.c:770
> #15 0x003923c8 in exec_simple_query (query_string=0x7ea42414 "") at
> postgres.c:1215
> #16 0x00393e8c in PostgresMain (argc=, argv= out>, dbname=0x0, username=) at postgres.c:4256
> #17 0x000849a8 in BackendRun (port=0x13967b8) at postmaster.c:4399
> #18 BackendStartup (port=0x13967b8) at postmaster.c:4090
> #19 ServerLoop () at postmaster.c:1703
> #20 0x0031607c in PostmasterMain (argc=, argv= out>) at postmaster.c:1376
> #21 0x000864d4 in main (argc=7301080, argv=0x8) at main.c:228
>

I am able to emulate it on an old 32-bit Scientific Linux system, so I hope I will find a fix.

>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: pg_basebackup against older server versions

2019-03-08 Thread Sergei Kornilov
Hi

Great, thank you!

regards, Sergei



Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-03-08 Thread Rajkumar Raghuwanshi
On Thu, Mar 7, 2019 at 8:20 PM amul sul  wrote:

>
>
> On Thu, Mar 7, 2019 at 1:02 PM amul sul  wrote:
>
>> Thanks Rajkumar,
>>
>> I am looking into this.
>>
>>
> The crash happens when none of the if-else branches of
> handle_missing_partition() is taken, so merged_index is returned
> unassigned.
>
> Let me explain: in Rajkumar's test case, the join type is JOIN_INNER.
> When only the outer rel has a null partition, the merge_null_partitions()
> function calls handle_missing_partition() with missing_side_inner = false
> and missing_side_outer = false argument values, which fails to set
> merged_index.
>
> In the attached patch, I tried to fix this case by setting merged_index
> explicitly, which fixes the reported crash.
>
Thanks Amul, with the v20 patches, the crash is fixed.


>
> Regards,
> Amul
>
>
>
>> On Thu, Mar 7, 2019 at 11:54 AM Rajkumar Raghuwanshi <
>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>
>>>
>>>
>>> On Tue, Mar 5, 2019 at 3:45 PM amul sul  wrote:
>>>
 Attached is the rebased atop of the latest master head(35bc0ec7c8).

>>> thanks Amul, patches applied cleanly on PG head.
>>>
>>> While testing this I got a server crash with below test case.
>>>
>>> CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
>>> CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN
>>> ('0001','0002','0003');
>>> CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN
>>> ('0004','0005','0006');
>>> CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN
>>> (NULL,'0008','0009');
>>> CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('','0010');
>>> INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM') FROM
>>> generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
>>> INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else
>>> to_char(i % 17, 'FM') end FROM generate_series(0, 500) i WHERE i % 17
>>> IN (7,8,9);
>>> ANALYSE plt1;
>>>
>>> CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
>>> CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
>>> CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN
>>> ('0004','0005','0006');
>>> CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN
>>> ('0007','0008','0009');
>>> CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN
>>> ('',NULL,'0012');
>>> INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM') FROM
>>> generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
>>> INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else
>>> to_char(i % 17, 'FM') end FROM generate_series(0, 500) i WHERE i % 17
>>> IN (0,11,12);
>>> ANALYZE plt2;
>>>
>>> CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c,
>>> 'A'));
>>> CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002',
>>> '0003');
>>> CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004',
>>> '0005', '0006');
>>> CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008',
>>> '0009');
>>> CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('');
>>> INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM') FROM
>>> generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14,
>>> 15, 16);
>>> ANALYZE plt1_e;
>>>
>>> EXPLAIN (COSTS OFF)
>>> SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
>>> plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') =
>>> t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
>>> server closed the connection unexpectedly
>>> This probably means the server terminated abnormally
>>> before or while processing the request.
>>> The connection to the server was lost. Attempting reset: Failed.
>>>
>>> below is the stack trace; it looks like some indexes got corrupted, please
>>> take a look.
>>>
>>> Core was generated by `postgres: edb postgres [local]
>>> EXPLAIN   '.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x00821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8,
>>> partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c)
>>> at partbounds.c:4162
>>> 4162if (partmap1->from < 0 && partmap2->from < 0)
>>> Missing separate debuginfos, use: debuginfo-install
>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>> openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
>>> (gdb) bt
>>> #0  0x00821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8,
>>> partmaps2=0x2c1c8e0, *index1=45540240*, index2=0,
>>> next_index=0x7ffeebd43d3c) at partbounds.c:4162
>>> #1  0x008226c3 in merge_null_partitions (outer_bi=0x2b6e338,
>>> inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0,
>>> jointype=JOIN_INNER,
>>> next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38,
>>> default_index=0x7ffeebd43d34) at partbounds.c:4610
>>> #2  0x00821726 in partition_list_bounds_merge
>>> 

Re: [HACKERS] CLUSTER command progress monitor

2019-03-08 Thread Tatsuro Yamada

On 2019/03/06 15:38, Tatsuro Yamada wrote:

On 2019/03/05 17:56, Tatsuro Yamada wrote:

On 2019/03/05 11:35, Robert Haas wrote:

On Mon, Mar 4, 2019 at 5:38 AM Tatsuro Yamada
 wrote:

=== Current design ===

CLUSTER command uses Index Scan or Seq Scan when scanning the heap.
Depending on which one is chosen, the command will proceed in the
following sequence of phases:

    * Scan method: Seq Scan
  0. initializing (*2)
  1. seq scanning heap    (*1)
  3. sorting tuples   (*2)
  4. writing new heap (*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

    * Scan method: Index Scan
  0. initializing (*2)
  2. index scanning heap  (*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

VACUUM FULL command will proceed in the following sequence of phases:

  1. seq scanning heap    (*1)
  5. swapping relation files  (*2)
  6. rebuilding index (*2)
  7. performing final cleanup (*2)

(*1): increasing the value in heap_tuples_scanned column
(*2): only shows the phase in the phase column


All of that sounds good.


The view provides the information of CLUSTER command progress details as follows
# \d pg_stat_progress_cluster
    View "pg_catalog.pg_stat_progress_cluster"
    Column   |  Type   | Collation | Nullable | Default
---+-+---+--+-
   pid   | integer |   |  |
   datid | oid |   |  |
   datname   | name    |   |  |
   relid | oid |   |  |
   command   | text    |   |  |
   phase | text    |   |  |
   cluster_index_relid   | bigint  |   |  |
   heap_tuples_scanned   | bigint  |   |  |
   heap_tuples_vacuumed  | bigint  |   |  |


Still not sure if we need heap_tuples_vacuumed.  We could try to
report heap_blks_scanned and heap_blks_total like we do for VACUUM, if
we're using a Seq Scan.


I have no strong opinion about adding heap_tuples_vacuumed, so I'll remove it
in the next patch.

Regarding heap_blks_scanned and heap_blks_total, I suppose it is possible to
get those from initscan().  I'll investigate it more.

cluster.c
   copy_heap_data()
 heap_beginscan()
   heap_beginscan_internal()
 initscan()




=== Discussion points ===

   - Progress counter for "3. sorting tuples" phase
  - Should we add pgstat_progress_update_param() in tuplesort.c like a
    "trace_sort"?
    Thanks to Peter Geoghegan for the useful advice!


How would we avoid an abstraction violation?


Hmm... What do you mean by an abstraction violation?
If it is difficult to solve, I'd rather not add the progress counter for the
sorting tuples phase.



   - Progress counter for "6. rebuilding index" phase
  - Should we add "index_vacuum_count" in the view like a vacuum progress 
monitor?
    If yes, I'll add pgstat_progress_update_param() to reindex_relation() 
of index.c.
    However, I'm not sure whether it is okay or not.


Doesn't seem unreasonable to me.


I see, I'll add it later.



The attached file is a revised WIP patch including:

   - Remove heap_tuples_vacuumed
   - Add heap_blks_scanned and heap_blks_total
   - Add index_vacuum_count

I tried to "add heap_blks_scanned and heap_blks_total" columns and I realized 
that
"heap_tuples_scanned" column is suitable as a counter when a scan method is
both index-scan and seq-scan because CLUSTER is on a tuple basis.
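
As a sketch of what I have in mind for the reporting (not the attached patch;
the PROGRESS_CLUSTER_* indexes below are placeholder names), it boils down to
calls like:

#include "postgres.h"
#include "pgstat.h"

/* Placeholder parameter indexes, for illustration only */
#define PROGRESS_CLUSTER_PHASE                  0
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED    1
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP    1

static void
report_seq_scan_progress(int64 tuples_scanned)
{
    /* switch the phase column, then bump the per-tuple counter */
    pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
                                 PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
    pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
                                 tuples_scanned);
}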



The attached file is a rebased patch on current HEAD.
I changed the status. :)


Regards,
Tatsuro Yamada



diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1ee1ed2894..acda12bf52 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -51,6 +51,7 @@
 #include "catalog/storage.h"
 #include "commands/tablecmds.h"
 #include "commands/event_trigger.h"
+#include "commands/progress.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -58,6 +59,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
 #include "parser/parser.h"
+#include "pgstat.h"
 #include "rewrite/rewriteManip.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -3851,6 +3853,7 @@ reindex_relation(Oid relid, int flags, int options)
 	List	   *indexIds;
 	bool		is_pg_class;
 	bool		result;
+	int			i;
 
 	/*
 	 * Open and lock the relation.  ShareLock is sufficient since we only need
@@ -3938,6 +3941,7 @@ reindex_relation(Oid relid, int flags, int options)
 
 		/* Reindex all the indexes. */
 		doneIndexes = NIL;
+		i = 1;
 		

RE: Timeout parameters

2019-03-08 Thread Nagaura, Ryohei
Hi, Fabien-san.

About TCP_USER_TIMEOUT:
> From: Fabien COELHO 
> I could not really test the feature, i.e. I could not trigger a timeout.
> Do you have a suggestion on how to test it?
Please see the previous mail [1] or Kirk-san's current mail.
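
For reference, the underlying mechanism is the Linux-specific TCP_USER_TIMEOUT
socket option; a minimal illustration (not the proposed libpq patch) of how it
is applied to a connected socket:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Abort the connection if transmitted data stays unacknowledged this long */
static int
set_tcp_user_timeout(int sockfd, unsigned int timeout_ms)
{
    return setsockopt(sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}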

About socket_timeout:
> From: Fabien COELHO 
> are actually finished. I cannot say that I'm thrilled by that behavior. ISTM 
> that libpq
> should at least attempt to cancel the query 
Would you please tell me why you think so?
If it were implemented that way, the timeout couldn't work when the server is
so busy that it can't process a cancel request.
If the client encounters such a case, how does the client stop waiting for
the server?
How does the client release its resources?
socket_timeout is helpful for such clients.

I'll send patches after next week.

[1] 
https://www.postgresql.org/message-id/EDA4195584F5064680D8130B1CA91C453A32DB@G01JPEXMBYT04
Best regards,
-
Ryohei Nagaura