Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-08-10 Thread Maksim Milyutin

10.08.17 23:01, Robert Haas wrote:


On Wed, Apr 19, 2017 at 5:25 AM, Maksim Milyutin
 wrote:

Ok, thanks for the feedback. Then I'll use a new relkind for local
partitioned index in further development.

Any update on this?



I'll continue this work soon.  Sorry for such a long delay.

--
Regards,
Maksim Milyutin





Re: [HACKERS] Page Scan Mode in Hash Index

2017-08-10 Thread Amit Kapila
On Wed, Aug 9, 2017 at 2:58 PM, Amit Kapila  wrote:
> On Mon, Aug 7, 2017 at 5:50 PM, Ashutosh Sharma  wrote:
>
> 7.
> _hash_kill_items(IndexScanDesc scan)
> {
> ..
> + /*
> + * If page LSN differs it means that the page was modified since the
> + * last read.  killedItems could no longer be valid, so applying
> + * LP_DEAD hints is not safe.
> + */
> + page = BufferGetPage(buf);
> + if (PageGetLSN(page) != so->currPos.lsn)
> + {
> + _hash_relbuf(rel, buf);
> + return;
> + }
> ..
> }
>
> How does this check cover the case of unlogged tables?
>

I have thought about it and I think we can't use the technique btree
uses (not releasing the pin on the page) to handle unlogged or
temporary relations.  It works for btree because it takes a cleanup
lock on each page before removing items from it, whereas in hash
indexes we take a cleanup lock only on the primary bucket page.  Now,
one thing we could do is start taking a cleanup lock on each bucket
page (which includes the primary bucket page and overflow pages), but I
think that could turn out to be worse than the current locking strategy.
Another idea could be to preserve the current locking strategy (take
the lock on the next bucket page and then release the lock on the
current bucket page) during vacuum of an unlogged hash index.  That
would ensure vacuum can't remove the TIDs which we are going to
mark as dead.
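
For illustration only, a sketch of one conservative variant of the
quoted check (not an actual patch; rel, so and buf as in
_hash_kill_items above): skip hint-setting whenever the relation is not
WAL-logged, since the page LSN cannot be trusted to advance there:

/*
 * Sketch: unlogged and temporary relations do not maintain the page
 * LSN, so the LSN comparison cannot detect concurrent modification.
 * Conservatively give up on setting LP_DEAD hints for them.
 */
page = BufferGetPage(buf);
if (!RelationNeedsWAL(rel) ||
	PageGetLSN(page) != so->currPos.lsn)
{
	_hash_relbuf(rel, buf);
	return;
}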

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] coverage analysis improvements

2017-08-10 Thread Peter Eisentraut
Here is a patch series with some significant reworking and adjusting of
how the coverage analysis tools are run.  The result should be that the
"make coverage" runs are faster and more robust, the results are more
accurate, and the code is simpler.  Please see the individual patches
for details.

I have replaced "make coverage-html" with "make coverage" (but kept the
previous name for compatibility), since the old meaning of the latter
has gone away and was never very useful.  Other than that, the usage of
everything stays the same.

I will add it to the commit fest.  Testing with different versions of
tools would be especially valuable.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 3c8a2c6afb3a8873a70c61c427128bb4d1cade2b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 10 Aug 2017 23:33:47 -0400
Subject: [PATCH 1/9] Run only top-level recursive lcov

This is the way lcov was intended to be used.  It is much faster and
more robust and makes the makefiles simpler than running it in each
subdirectory.

This also removes the direct gcov calls and lets lcov do it instead.
The direct gcov calls are useless because lcov/geninfo call gcov
internally and use that information.
---
 .gitignore                |  3 +--
 GNUmakefile.in            |  2 +-
 doc/src/sgml/regress.sgml |  2 +-
 src/Makefile.global.in    | 33 +++--
 4 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/.gitignore b/.gitignore
index 4976fd9119..2052f719d1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -19,10 +19,9 @@ objfiles.txt
 .deps/
 *.gcno
 *.gcda
-*.gcov
-*.gcov.out
 lcov.info
 coverage/
+coverage-stamp
 *.vcproj
 *.vcxproj
 win32ver.rc
diff --git a/GNUmakefile.in b/GNUmakefile.in
index dc76a5d11d..8d77b01eea 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -41,7 +41,7 @@ install-world-contrib-recurse: install-world-src-recurse
 
 $(call recurse,installdirs uninstall init-po update-po,doc src config)
 
-$(call recurse,distprep coverage,doc src config contrib)
+$(call recurse,distprep,doc src config contrib)
 
 # clean, distclean, etc should apply to contrib too, even though
 # it's not built by default
diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index 7c2b1029c2..796cdc26ff 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -699,7 +699,7 @@ Test Coverage Examination
 ./configure --enable-coverage ... OTHER OPTIONS ...
 make
 make check # or other test suite
-make coverage-html
+make coverage
 
 Then point your HTML browser
 to coverage/index.html.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 0d3f8ca950..b01e578238 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -19,7 +19,7 @@
 #
 # Meta configuration
 
-standard_targets = all install installdirs uninstall distprep clean distclean maintainer-clean coverage check installcheck init-po update-po
+standard_targets = all install installdirs uninstall distprep clean distclean maintainer-clean check installcheck init-po update-po
 # these targets should recurse even into subdirectories not being built:
 standard_always_targets = distprep clean distclean maintainer-clean
 
@@ -863,34 +863,23 @@ endif # enable_nls
 #  (by gcc -ftest-coverage)
 #   foo.gcda   gcov data file, created when the program is run (for
 #  programs compiled with gcc -fprofile-arcs)
-#   foo.c.gcov gcov output file with coverage information, created by
-#  gcov from foo.gcda (by "make coverage")
-#   foo.c.gcov.out  stdout captured when foo.c.gcov is created, mildly
-#  interesting
 #   lcov.info  lcov tracefile, built from gcda files in one directory,
 #  later collected by "make coverage-html"
 
 ifeq ($(enable_coverage), yes)
 
-# There is a strange interaction between lcov and existing .gcov
-# output files.  Hence the rm command and the ordering dependency.
-
-gcda_files := $(wildcard *.gcda)
+gcda_files = $(shell find . -name '*.gcda' -print)
 
 lcov.info: $(gcda_files)
-   rm -f *.gcov
-   $(if $^,$(LCOV) -d . -c -o $@ $(LCOVFLAGS) --gcov-tool $(GCOV))
-
-%.c.gcov: %.gcda | lcov.info
-   $(GCOV) -b -f -p -o . $(GCOVFLAGS) $*.c >$*.c.gcov.out
-
-coverage: $(gcda_files:.gcda=.c.gcov) lcov.info
+   $(LCOV) -d . -c -o $@ $(LCOVFLAGS) --gcov-tool $(GCOV)
 
-.PHONY: coverage-html
-coverage-html: coverage
+coverage-stamp: lcov.info
 	rm -rf coverage
-	mkdir coverage
-	$(GENHTML) --show-details --legend --output-directory=coverage --title=PostgreSQL --num-spaces=4 --prefix=$(abs_top_srcdir) `find . -name lcov.info -print`
+	$(GENHTML) --show-details --legend --output-directory=coverage --title=PostgreSQL --num-spaces=4 --prefix=$(abs_top_srcdir) $<
+	touch $@
+
+.PHONY: coverage coverage-html
+coverage coverage-html: coverage-stamp
 
 
 # hook for clean-up
@@ 

Re: [HACKERS] dubious error message from partition.c

2017-08-10 Thread Amit Langote
(Sorry Robert for the duplicate message, I mistakenly didn't hit Reply All)

On Fri, Aug 11, 2017 at 2:54 Robert Haas  wrote:

> On Wed, Aug 9, 2017 at 10:14 PM, Amit Langote
>  wrote:
> >> That seems like overkill.  I'm good with "greater than or equal to".
> >
> > Attached updated patch has "greater than or equal to".
>
> Committed, with one small change which I hope will be considered an
> improvement.  Your proposed primary error message was:
>
> invalid range bound specification for partition \"%s\"
>
> I changed it to:
>
> empty range bound specified for partition \"%s\"
>
> Thanks for working on this.


Thanks for committing.  The new message is certainly an improvement.

Thanks,
Amit


Re: [HACKERS] WIP: Failover Slots

2017-08-10 Thread Craig Ringer
On 11 August 2017 at 01:02, Robert Haas  wrote:


> Well,
> anybody's welcome to write code without discussion and drop it to the
> list, but if people don't like it, that's the risk you took by not
> discussing it first.
>

Agreed, a patch materializing doesn't mean it should be committed, and
there wasn't prior design discussion on this one.

It can be hard to elicit such discussion without a patch, but clearly not
always; we're doing a good job of it here.


> > When a replica connects to an upstream it asks via a new walsender msg
> > "send me the state of all your failover slots". Any local mirror slots
> > are updated. If they are not listed by the upstream they are known
> > deleted, and the mirror slots are deleted on the downstream.
>
> What about slots not listed by the upstream that are currently in use?
>

Yes, it'll also need to send a list of its locally owned and up-mirrored
failover slots to the upstream so the upstream can create them or update
their state.

> > There's one big hole left here. When we create a slot on a cascading
> > leaf or inner node, it takes time for hot_standby_feedback to propagate
> > the needed catalog_xmin "up" the chain. Until the master has set the
> > needed catalog_xmin on the physical slot for the closest branch, the
> > inner node's slot's catalog_xmin can only be tentative pending
> > confirmation. That's what a whole bunch of gruesomeness in the decoding
> > on standby patch was about.
> >
> > One possible solution to this is to also mirror slots "up", as you
> > alluded to: when you create an "owned" slot on a replica, it tells the
> > master at connect time / slot creation time "I have this slot X, please
> > copy it up the tree". The slot gets copied "up" to the master via
> > cascading layers with a different failover slot type indicating it's an
> > up-mirror. Decoding clients aren't allowed to replay from an up-mirror
> > slot and it cannot be promoted like a down-mirror slot can, it's only
> > there for resource retention. A node knows its owned slot is safe to
> > actually use, and is fully created, when it sees the walsender report
> > it in the list of failover slots from the master during a slot state
> > update.
>
> I'm not sure that this actually prevents the problem you describe.  It
> also seems really complicated.  Maybe you can explain further; perhaps
> there is a simpler solution (or perhaps this isn't as complicated as I
> currently think it is).



It probably sounds more complex than it is. A slot is created tentatively
and marked not ready to actually use yet when created on a standby. It
flows "up" to the master where it's created as permanent/ready. The
permanent/ready state flows back down to the creator.

When we see a temp slot become permanent we copy the
restart_lsn/catalog_xmin/confirmed_flush_lsn from the upstream slot in case
the master had to advance them from our tentative values when it created
the slot. After that, slot state updates only flow "out" from the owner: up
the tree for up-mirror slots, down the tree for down-mirror slots.
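
For concreteness, the state could be tracked with something as simple as
this (a sketch with hypothetical names matching the states described
above, not code from any posted patch):

/* Possible mirroring state of a slot, kept in its persistent data and
 * exposed in pg_replication_slots. */
typedef enum FailoverSlotState
{
	FS_NONE,			/* ordinary slot, not mirrored anywhere */
	FS_OWNED_TEMP,		/* created here, awaiting confirmation from master */
	FS_OWNED,			/* created here and confirmed; mirrored downstream */
	FS_UP_MIRROR,		/* mirror of a downstream slot; resource retention
						 * only, no replay, no promotion */
	FS_DOWN_MIRROR		/* mirror of an upstream slot; no replay until
						 * promoted after failover */
} FailoverSlotState;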

Diagram may help. I focused only on the logical slot created on standby
case, since I think we're happy with the rest already and I don't want to
complicate it.

GMail will probably HTMLize this, sorry:


  Three-node cascade (time runs downward); B streams from A using
  physical slot "B", C streams from B using physical slot "C":

    +---+      +---+      +---+
    | A | <--- | B | <--- | C |
    +---+      +---+      +---+

  1. C CREATEs logical slot "X" ("owned"), initially as a temp slot.
  2. B creates up-mirror slot "X" linked to phys slot "C", marked temp.
  3. A creates up-mirror slot "X" linked to phys slot "B", marked
     permanent.
  4. Meanwhile, a CLIENT attempting to decode from "X" on C gets
     "ERROR: slot X still being created".
  5. C sees the up-mirror slot reported back as permanent in the
     failover-slot list from its upstream and marks "X" ready for use.

Re: [HACKERS] Thoughts on unit testing?

2017-08-10 Thread Craig Ringer
On 11 August 2017 at 07:13, Thomas Munro 
wrote:

> On Fri, Aug 11, 2017 at 10:35 AM, Andres Freund 
> wrote:
> > On 2017-08-11 09:53:23 +1200, Thomas Munro wrote:
> >> One idea that keeps coming back to me is that we could probably extend
> >> our existing regression tests to cover C tests with automatic
> >> discovery/minimal boilerplate.
> >
> > What's your definition of boilerplate here? All the "expected" state
> > tests in C unit tests is plenty boilerplate...
>
> I mean close to zero effort required to create and run new tests for
> primitive C code.  Just create a .test.c file and type "TEST(my_math,
> factorial) { EXPECT_EQ(6, factorial(3)); }" and it'll run when you
> "make check" and print out nicely tabulated results and every build
> farm member will run it.


The closest we come now is an extension under src/test/modules/ that
exposes SQL functions which in turn run the C tests, one SQL function
per C test.

Definitely not brief and simple, but that's what I've been doing.
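
For concreteness, the C side of such a module looks roughly like this
(module name, function and test body are all hypothetical):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/bitmapset.h"

PG_MODULE_MAGIC;

/* One SQL-callable function per C test; the module's .sql file wraps it
 * in CREATE FUNCTION and a regression script SELECTs it. */
PG_FUNCTION_INFO_V1(test_bitmapset_basic);

Datum
test_bitmapset_basic(PG_FUNCTION_ARGS)
{
	Bitmapset  *bms = NULL;

	bms = bms_add_member(bms, 42);
	if (!bms_is_member(42, bms))
		elog(ERROR, "added member not found in bitmapset");

	PG_RETURN_VOID();
}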

Lots of those unit frameworks support auto-generation of harness code. I
wonder how practical it'd be to teach them to generate an extension like
this and the .sql to run it?

It won't work if you want to test anything static or file-private, but
pretty much nothing will.

(Side note: I'd really like to get away from relying so heavily on 'static'
and move toward higher level visibility (https://gcc.gnu.org/wiki/Visibility:
-fvisibility=hidden and __attribute__((dllexport)) ). It'd make it easier
not to forget needed PGDLLEXPORTs, let us hide stuff we consider really
internal but still share across a few files, etc.)
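
As a sketch of what that could look like (PG_EXPORT is a hypothetical
macro, not an existing one):

/* With -fvisibility=hidden as the default, only explicitly marked
 * symbols are exported; everything else stays module-private without
 * needing 'static'. */
#ifdef _WIN32
#define PG_EXPORT __declspec(dllexport)
#else
#define PG_EXPORT __attribute__((visibility("default")))
#endif

PG_EXPORT void public_entry_point(void);	/* visible to other modules */
void		shared_internal_helper(void);	/* usable across files, but hidden */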
-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Getting server crash on Windows when using ICU collation

2017-08-10 Thread Peter Eisentraut
On 8/10/17 22:21, Peter Geoghegan wrote:
> On Thu, Aug 10, 2017 at 3:57 PM, Peter Geoghegan  wrote:
>> On Thu, Aug 10, 2017 at 11:02 AM, Robert Haas  wrote:
>>> On Fri, Jun 23, 2017 at 1:14 AM, Ashutosh Sharma  
>>> wrote:
 Okay, attached is the patch which first detects the platform type and
 runs the uconv commands accordingly  from the corresponding icu bin
 path. Please have a look and let me know for any issues. Thanks.
>>>
>>> Should this be on the open items list?
>>
>> Maybe. Peter E has a patch to fix the other issue (ICU 52 has a buggy
>> ucol_strcollUTF8()) by never using it on ICU versions prior to 53. He
>> posted this just yesterday.
> 
> He committed it just now, so that's the end of this Windows issue, I suppose.

Yes, I think so.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Getting server crash on Windows when using ICU collation

2017-08-10 Thread Peter Geoghegan
On Thu, Aug 10, 2017 at 3:57 PM, Peter Geoghegan  wrote:
> On Thu, Aug 10, 2017 at 11:02 AM, Robert Haas  wrote:
>> On Fri, Jun 23, 2017 at 1:14 AM, Ashutosh Sharma  
>> wrote:
>>> Okay, attached is the patch which first detects the platform type and
>>> runs the uconv commands accordingly  from the corresponding icu bin
>>> path. Please have a look and let me know for any issues. Thanks.
>>
>> Should this be on the open items list?
>
> Maybe. Peter E has a patch to fix the other issue (ICU 52 has a buggy
> ucol_strcollUTF8()) by never using it on ICU versions prior to 53. He
> posted this just yesterday.

He committed it just now, so that's the end of this Windows issue, I suppose.

-- 
Peter Geoghegan




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-10 Thread Noah Misch
On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote:
> On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch  wrote:
> > postmaster algorithms rely on the PG_SETMASK() calls preventing that.  Without
> > such protection, duplicate bgworkers are an understandable result.  I caught
> > several other assertions; the PMChildFlags failure is another case of
> > duplicate postmaster children:
> >
> >   6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871)
> >   3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229)
> >  20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
> >  21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221)
> >  Also, got a few "select() failed in postmaster: Bad address"
> >
> > I suspect a Cygwin signals bug.  I'll try to distill a self-contained test
> > case for the Cygwin hackers.  The lack of failures on buildfarm member
> > brolga argues that older Cygwin is not affected.
> 
> Nice detective work.

Thanks.  http://marc.info/?t=15018329641 has my upstream report.  The
Cygwin project lead reproduced this, but a fix remained elusive.

I guess we'll ignore weird postmaster-associated lorikeet failures for the
foreseeable future.




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-10 Thread Amit Kapila
On Fri, Aug 11, 2017 at 5:01 AM, AP  wrote:
> On Thu, Aug 10, 2017 at 01:12:25PM -0400, Robert Haas wrote:
>> On Thu, Aug 10, 2017 at 6:41 AM, AP  wrote:
>> > The index is 135GB rather than 900GB (from memory/give or take).
>>
>> Whoa.  Big improvement.
>
>
> As an aside, btree for the above is around 2.5x bigger than hash v4 so
> chances are much better that a hash index will fit into ram which has
> its own benefits. :)
>

Yeah, that's exactly one of the benefits hash indexes can provide over
btree indexes.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-10 Thread Amit Kapila
On Thu, Aug 10, 2017 at 4:11 PM, AP  wrote:
> On Sun, Aug 06, 2017 at 04:32:29PM +1000, AP wrote:
>> On Sat, Aug 05, 2017 at 04:41:24PM +0530, Amit Kapila wrote:
>> > > (On another note, I committed these patches.)
>> >
>> > Thanks.
>>
>> Seconded. :)
>>
>> Now uploading data with fillfactor of 90. I'll know in 2-3 days
>> if the new patches are successful (earlier if they did not help).
>>
>> I compiled (as apt.postgresql.org does not provide the latest
>> beta3 version for stretch; only sid which has a different perl
>> version) 10~beta3~20170805.2225-1~593.git0d1f98b.
>
> Have gotten success with a dataset that failed before with ff 10.
>
> End result:
>
> mdstash=# select * from pgstathashindex('link_datum_id_idx');
>  version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | 
> live_items | dead_items |   free_percent
> -+--++--+--+++--
>4 | 12391325 |5148912 |  158 |   191643 | 
> 4560007478 |  0 | 1894.29056075982
> (1 row)
>

The free_percent calculation seems to be wrong.  Can you please try
once more after recent commit 0b7ba3d6474b8f58e74dba548886df3250805cdf?
I feel this should be fixed by that commit.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-10 Thread Tom Lane
Noah Misch  writes:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
>> I don't think I can usefully contribute to this.  Could someone else
>> take it?

> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I think you're blaming the victim.  Our current theory about the cause
of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
completion of a nonblocking connect() call.  That seems pretty broken
independently of whether libpqwalreceiver needs the capability.

In any case, we have a draft patch, so what we should be pressing for
is for somebody to test it.  Peter's not in a position to do that
(and neither am I), but anyone who can build from source on Windows
could do so.
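
For anyone who does: the pattern at issue is essentially libpq's
documented nonblocking-connect loop, sketched here with select()
standing in for the wait primitive (libpqwalreceiver waits with
WaitLatchOrSocket at the marked spot, which is where the Windows
failure is suspected):

#include <sys/select.h>
#include "libpq-fe.h"

/* Sketch of a PQconnectStart()/PQconnectPoll() loop per the libpq docs. */
static PGconn *
connect_nonblocking(const char *conninfo)
{
	PGconn	   *conn = PQconnectStart(conninfo);
	PostgresPollingStatusType st = PGRES_POLLING_WRITING;

	if (conn == NULL || PQstatus(conn) == CONNECTION_BAD)
		return conn;

	while (st != PGRES_POLLING_OK && st != PGRES_POLLING_FAILED)
	{
		int			sock = PQsocket(conn);
		fd_set		rfds,
					wfds;

		FD_ZERO(&rfds);
		FD_ZERO(&wfds);
		if (st == PGRES_POLLING_READING)
			FD_SET(sock, &rfds);
		else
			FD_SET(sock, &wfds);

		/* libpqwalreceiver waits here via WaitLatchOrSocket instead;
		 * the report is that on Windows that wait never sees the
		 * nonblocking connect() complete. */
		if (select(sock + 1, &rfds, &wfds, NULL, NULL) < 0)
			break;

		st = PQconnectPoll(conn);
	}
	return conn;
}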

regards, tom lane




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-10 Thread Augustine, Jobin
I am working on an independent build environment on Windows so I can
test the patch from Andres.
I shall come back with results, possibly within another 24 hours.

-Jobin

On Fri, Aug 11, 2017 at 7:25 AM, Andres Freund  wrote:

> On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> > On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > > On 8/7/17 21:06, Noah Misch wrote:
> > > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had
> > > >> no in-tree callers outside of libpq itself.
> > > > [Action required within three days.  This is a generic notification.]
> > > >
> > > > The above-described topic is currently a PostgreSQL 10 open item.
> > > > Peter, since you committed the patch believed to have created it,
> > > > you own this open item.
> > >
> > > I don't think I can usefully contribute to this.  Could someone else
> > > take it?
> >
> > If nobody volunteers, you could always resolve this by reverting
> > 1e8a850 and successors.
>
> I've volunteered a fix nearby, just can't test it. I don't think we can
> require committers to work on windows.
>
> - Andres
>



-- 

Jobin Augustine
Architect : Production Database Operations
OpenSCG
phone : +91 9989932600


[HACKERS] SCRAM protocol documentation

2017-08-10 Thread Peter Eisentraut
The SCRAM protocol documentation
(https://www.postgresql.org/docs/devel/static/sasl-authentication.html)
states

"To avoid confusion, the client should use pg_same_as_startup_message as
the username in the client-first-message."

However, the client implementation in libpq doesn't actually do that; it
sends an empty string for the user name.  I find no other reference to
"pg_same_as_startup_message" in the sources.  Should the documentation
be updated?

Relatedly, the SCRAM specification doesn't appear to allow omitting the
user name in this manner.  Why don't we just send the actual user name,
even though it's redundant with the startup message?
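
For reference, the client-first-message libpq currently sends looks like
this (nonce borrowed from RFC 5802's example), with the n= attribute
simply left empty:

    n,,n=,r=fyko+d2lbbFgONRv9qkxdawL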

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-10 Thread Andres Freund
On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > On 8/7/17 21:06, Noah Misch wrote:
> > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had
> > >> no in-tree callers outside of libpq itself.
> > > [Action required within three days.  This is a generic notification.]
> > > 
> > > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > > since you committed the patch believed to have created it, you own this
> > > open item.
> > 
> > I don't think I can usefully contribute to this.  Could someone else
> > take it?
> 
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can
require committers to work on windows.

- Andres




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-10 Thread AP
On Thu, Aug 10, 2017 at 01:12:25PM -0400, Robert Haas wrote:
> On Thu, Aug 10, 2017 at 6:41 AM, AP  wrote:
> > The index is 135GB rather than 900GB (from memory/give or take).
> 
> Whoa.  Big improvement.

Not a good direct comparison in general, but it fits my workload.

The 900GB was fillfactor 10 and the 135GB was ff 90, BUT ff 90 on
v3 failed early in the import. Even ff 10 on v3 didn't succeed (came just
short). So for my usage I was facing having indexes with fillfactor 10
just to be able to put a more reasonable amount of data in them. Almost.

Now I don't have to, as v4 copes with the load and more, and in less disk
space, so for me the above is just lovely. :) This is even more so
given that the hash index v4 upload actually finished, unlike the v3 one. :)

As I said in my last email, this weekend I'll be adding more to that table
so I'll see how far that takes me but the last two patches have given
me a great deal of confidence that the end result will be good news. :)

As an aside, btree for the above is around 2.5x bigger than hash v4 so 
chances are much better that a hash index will fit into ram which has
its own benefits. :)

AP




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-10 Thread Noah Misch
On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> On 8/7/17 21:06, Noah Misch wrote:
> >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no
> >> in-tree callers outside of libpq itself.
> > [Action required within three days.  This is a generic notification.]
> > 
> > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > since you committed the patch believed to have created it, you own this open
> > item.
> 
> I don't think I can usefully contribute to this.  Could someone else
> take it?

If nobody volunteers, you could always resolve this by reverting 1e8a850 and
successors.




Re: [HACKERS] Thoughts on unit testing?

2017-08-10 Thread Thomas Munro
On Fri, Aug 11, 2017 at 11:13 AM, Thomas Munro
 wrote:
> Just create a .test.c file and type "TEST(my_math,
> factorial) { EXPECT_EQ(6, factorial(3)); }" ...

Of course that would really need to #include "something/test_macros.h"
and "something/factorial.h", and EXPECT_EQ probably doesn't make much
sense in C, but you get the point: very low barrier to use it.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Thoughts on unit testing?

2017-08-10 Thread Thomas Munro
On Fri, Aug 11, 2017 at 10:35 AM, Andres Freund  wrote:
> On 2017-08-11 09:53:23 +1200, Thomas Munro wrote:
>> One idea that keeps coming back to me is that we could probably extend
>> our existing regression tests to cover C tests with automatic
>> discovery/minimal boilerplate.
>
> What's your definition of boilerplate here? All the "expected" state
> tests in C unit tests is plenty boilerplate...

I mean close to zero effort required to create and run new tests for
primitive C code.  Just create a .test.c file and type "TEST(my_math,
factorial) { EXPECT_EQ(6, factorial(3)); }" and it'll run when you
"make check" and print out nicely tabulated results and every build
farm member will run it.

>> Imagine if you just had to create a
>> file bitmapset.test.c that sits beside bitmapset.c (and perhaps add it
>> to TEST_OBJS), and in it write tests using a tiny set of macros a bit
>> like Google Test's[2].  It could get automagically sucked into a test
>> driver shlib module, perhaps one per source directory/subsystem, that
>> is somehow discovered, loaded and run inside PostgreSQL as part of the
>> regression suite, or perhaps it's just explicitly listed in the
>> regression schedule with a .sql file that loads the module and runs an
>> entry point function.
>>
>> One problem is that if this was happening inside an FMGR function it'd
>> be always in a transaction, which has implications.  There are
>> probably better ways to do it.
>
> You can already kinda avoid that in various ways, some more some less
> hacky. I think it depends a bit on which scenarios we want to test.  How
> much infrastructure do you want around? Do you want to be able to start
> transactions? Write to the WAL, etc?

I've mostly wanted to do this when working on code that doesn't need
too much of that stuff (dsa.c, freepage.c, parallel hash, ...) but am
certainly very interested in testing transactional stuff related to
skunkworks new storage projects and SSI.  I haven't thought too much
about WAL interaction, that's a good question.

> One relatively easy way would be
> to simply have a 'postgres' commandline option (like we already kinda
> have for EXEC_BACKEND style subprocesses), that loads a shared library
> and invokes an entry point in it. That'd then be invoked at some defined
> point in time. Alternatively you could just have a bgworker that does
> its thing at startup and then shuts down the cluster...

Hmm.  Interesting idea.

I guess you could also just use _PG_init() as your entrypoint, or
register some callback from there, and LOAD?  Keeping it inside the
existing regression framework seems nice because it already knows how
to parallelise stuff, and it'd be a small incremental change to start
adding stuff like this.  I suppose we could run most modules that way,
but some extra ones by postgres --run-test-module foo if it turns out
to be necessary.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-10 Thread Michael Paquier
On Thu, Aug 10, 2017 at 9:00 PM, Robert Haas  wrote:
> On Thu, Aug 10, 2017 at 2:50 AM, Andreas Seltenreich  
> wrote:
>> Will do.  Won't miss this chance to try out discostu's extension
>> pg_rage_terminator[1] :-)
>> [1]  https://github.com/disco-stu/pg_rage_terminator
>
> Oh, that's *awesome*.

You can do that at query level using the planner hook:
https://github.com/michaelpq/pg_plugins/tree/master/pg_panic
The PANIC should be lowered to a FATAL though.
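
A stripped-down sketch in the spirit of that approach, for anyone
curious (not the module's actual code; see the URL above for that):

#include "postgres.h"
#include "fmgr.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

static planner_hook_type prev_planner_hook = NULL;

/* Fail roughly 1% of planned queries, as a crude fault injector. */
static PlannedStmt *
panic_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
	if (random() % 100 == 0)
		elog(FATAL, "unlucky query");

	if (prev_planner_hook)
		return prev_planner_hook(parse, cursorOptions, boundParams);
	return standard_planner(parse, cursorOptions, boundParams);
}

void
_PG_init(void)
{
	prev_planner_hook = planner_hook;
	planner_hook = panic_planner;
}
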
-- 
Michael




Re: [HACKERS] Getting server crash on Windows when using ICU collation

2017-08-10 Thread Peter Geoghegan
On Thu, Aug 10, 2017 at 11:02 AM, Robert Haas  wrote:
> On Fri, Jun 23, 2017 at 1:14 AM, Ashutosh Sharma  
> wrote:
>> Okay, attached is the patch which first detects the platform type and
>> runs the uconv commands accordingly  from the corresponding icu bin
>> path. Please have a look and let me know for any issues. Thanks.
>
> Should this be on the open items list?

Maybe. Peter E has a patch to fix the other issue (ICU 52 has a buggy
ucol_strcollUTF8()) by never using it on ICU versions prior to 53. He
posted this just yesterday.

This removes the existing configure test, replacing it with something
that will work just the same on Windows, presuming Peter also reverts
the commit that had ICU never use ucol_strcollUTF8() on Windows.

-- 
Peter Geoghegan




Re: [HACKERS] Thoughts on unit testing?

2017-08-10 Thread Andres Freund
Hi,

On 2017-08-11 09:53:23 +1200, Thomas Munro wrote:
> The current regression tests, isolation tests and TAP tests are very
> good (though I admit my experience with TAP is limited), but IMHO we
> are lacking support for C-level unit testing.  Complicated, fiddly
> things with many states, interactions, edge cases etc can be hard to
> get full test coverage on from the outside.  Consider
> src/backend/utils/mmgr/freepage.c as a case in point.
> 
> I guess we could consider the various xUnit systems for C[1] and have
> an entirely new kind of test that runs independently of PostgreSQL.  I
> guess that'd be difficult politically (choosing external project,
> cognitive load) and technically (global state problems as soon as you
> get away from completely stand-alone components).

Yea, I think there's little chance for something like that. The amount
of infrastructure you need to link in and initialize is prohibitive imo.


> One idea that keeps coming back to me is that we could probably extend
> our existing regression tests to cover C tests with automatic
> discovery/minimal boilerplate.

What's your definition of boilerplate here? All the "expected" state
tests in C unit tests is plenty boilerplate...


> Imagine if you just had to create a
> file bitmapset.test.c that sits beside bitmapset.c (and perhaps add it
> to TEST_OBJS), and in it write tests using a tiny set of macros a bit
> like Google Test's[2].  It could get automagically sucked into a test
> driver shlib module, perhaps one per source directory/subsystem, that
> is somehow discovered, loaded and run inside PostgreSQL as part of the
> regression suite, or perhaps it's just explicitly listed in the
> regression schedule with a .sql file that loads the module and runs an
> entry point function.
> 
> One problem is that if this was happening inside an FMGR function it'd
> be always in a transaction, which has implications.  There are
> probably better ways to do it.

You can already kinda avoid that in various ways, some more some less
hacky. I think it depends a bit on which scenarios we want to test.  How
much infrastructure do you want around? Do you want to be able to start
transactions? Write to the WAL, etc?  One relatively easy way would be
to simply have a 'postgres' commandline option (like we already kinda
have for EXEC_BACKEND style subprocesses), that loads a shared library
and invokes an entry point in it. That'd then be invoked at some defined
point in time. Alternatively you could just have a bgworker that does
its thing at startup and then shuts down the cluster...
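
The bgworker variant is already expressible with the existing API; a
minimal sketch (test-runner module hypothetical, BackgroundWorker API
real):

#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"

PG_MODULE_MAGIC;

PGDLLEXPORT void test_runner_main(Datum main_arg);

void
test_runner_main(Datum main_arg)
{
	/* ... run the C tests here, report failures via elog() ... */
	proc_exit(0);
}

void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "test_runner");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "test_runner_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "C unit test runner");
	RegisterBackgroundWorker(&worker);
}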


Greetings,

Andres Freund




Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select

2017-08-10 Thread Mark Dilger

> On Aug 10, 2017, at 11:20 AM, Robert Haas  wrote:
> 
> On Wed, Jul 5, 2017 at 12:14 AM, Mark Dilger  wrote:
>> I can understand this, but wonder if I could use something like
>> 
>> FOR I TOTALLY PROMISE TO USE ALL ROWS rec IN EXECUTE sql LOOP
>> ...
>> END LOOP;
> 
> Actually, what you'd need is:
> 
> FOR I TOTALLY PROMISE TO USE ALL ROWS AND IT IS OK TO BUFFER THEM ALL
> IN MEMORY INSTEAD OF FETCHING THEM ONE AT A TIME FROM THE QUERY rec IN
> EXECUTE sql LOOP
> 
> Similarly, RETURN QUERY could be made to work with parallelism if we
> had RETURN QUERY AND IT IS OK TO BUFFER ALL THE ROWS IN MEMORY TWICE
> INSTEAD OF ONCE.
> 
> I've thought a bit about trying to make parallel query support partial
> execution, but it seems wicked hard.  The problem is that you can't
> let the main process do anything parallel-unsafe (e.g., currently,
> write any data) while there are workers in existence, or things
> may blow up badly.  You could think about fixing that problem by
> having all of the workers exit cleanly when the query is suspended,
> and then firing up new ones when the query is resumed.  However, that
> presents two further problems: (1) having the workers exit cleanly
> when the query is suspended would cause wrong answers unless any
> tuples that the worker has implicitly claimed e.g. by taking a page
> from a parallel scan and returning only some of the tuples on it were
> somehow accounted for and (2) starting and stopping workers over and
> over would be bad for performance.  The second problem could be solved
> by having a persistent pool of workers that attach and detach instead
> of firing up new ones all the time, but that has a host of problems
> all of its own.  The first one would be a desirable change for a bunch
> of reasons but is not easy for reasons that are a little longer than I
> feel like explaining right now.

That misses the point I was making.  I was suggesting a syntax where
the caller promises to use all rows without stopping short, and the
database performance problems of having a bunch of parallel workers
suspended in mid-query are simply the caller's problem if the caller does
not honor the contract.  Maybe the ability to execute such queries would
be limited to users who are granted a privilege for doing so, and the DBA
can decide not to go around granting that privilege to anybody.
Certainly if this is being used from within a stored procedure, the DBA
can make certain it is only used in cases where there is no execution
path exiting the loop before completion, either because everything is
wrapped up with try/catch syntax and/or because the underlying query
does not call anything that might throw exceptions.

I'm not advocating that currently, as you responded to a somewhat old
email, so really I'm just making clear what I intended at the time.

mark





[HACKERS] Thoughts on unit testing?

2017-08-10 Thread Thomas Munro
Hi hackers,

The current regression tests, isolation tests and TAP tests are very
good (though I admit my experience with TAP is limited), but IMHO we
are lacking support for C-level unit testing.  Complicated, fiddly
things with many states, interactions, edge cases etc can be hard to
get full test coverage on from the outside.  Consider
src/backend/utils/mmgr/freepage.c as a case in point.

I guess we could consider the various xUnit systems for C[1] and have
an entirely new kind of test that runs independently of PostgreSQL.  I
guess that'd be difficult politically (choosing external project,
cognitive load) and technically (global state problems as soon as you
get away from completely stand-alone components).

One idea that keeps coming back to me is that we could probably extend
our existing regression tests to cover C tests with automatic
discovery/minimal boilerplate.  Imagine if you just had to create a
file bitmapset.test.c that sits beside bitmapset.c (and perhaps add it
to TEST_OBJS), and in it write tests using a tiny set of macros a bit
like Google Test's[2].  It could get automagically sucked into a test
driver shlib module, perhaps one per source directory/subsystem, that
is somehow discovered, loaded and run inside PostgreSQL as part of the
regression suite, or perhaps it's just explicitly listed in the
regression schedule with a .sql file that loads the module and runs an
entry point function.

One problem is that if this was happening inside an FMGR function it'd
be always in a transaction, which has implications.  There are
probably better ways to do it.
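
To make the "tiny set of macros" idea concrete, here is a rough
standalone sketch of the flavour I mean (all names hypothetical;
GCC-style constructor attributes assumed for auto-registration):

#include <stdio.h>

typedef struct PgTest
{
	const char *suite;
	const char *name;
	void		(*fn) (int *failures);
	struct PgTest *next;
} PgTest;

static PgTest *pg_tests = NULL;

/* TEST(suite, name) { body } defines one test and registers it at load
 * time so the driver below can discover it. */
#define TEST(suite_, name_) \
	static void run_##suite_##_##name_(int *failures_); \
	static PgTest tc_##suite_##_##name_ = { \
		#suite_, #name_, run_##suite_##_##name_, NULL \
	}; \
	__attribute__((constructor)) \
	static void reg_##suite_##_##name_(void) \
	{ \
		tc_##suite_##_##name_.next = pg_tests; \
		pg_tests = &tc_##suite_##_##name_; \
	} \
	static void run_##suite_##_##name_(int *failures_)

#define EXPECT_EQ(expected, actual) \
	do { \
		if ((long) (expected) != (long) (actual)) \
		{ \
			printf("FAIL %s at line %d: expected %ld, got %ld\n", \
				   __FILE__, __LINE__, (long) (expected), (long) (actual)); \
			(*failures_)++; \
		} \
	} while (0)

static long
factorial(long n)
{
	return n <= 1 ? 1 : n * factorial(n - 1);
}

TEST(my_math, factorial)
{
	EXPECT_EQ(6, factorial(3));
}

int
main(void)
{
	int			failures = 0;
	int			ntests = 0;

	for (PgTest *t = pg_tests; t != NULL; t = t->next)
	{
		t->fn(&failures);
		ntests++;
	}
	printf("%d test(s), %d failure(s)\n", ntests, failures);
	return failures != 0;
}

In the real thing the driver would of course live in the shared harness
rather than in each .test.c file.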

Thoughts, better ideas?

[1] https://en.wikipedia.org/wiki/List_of_unit_testing_frameworks#C
[2] 
https://github.com/google/googletest/blob/master/googletest/samples/sample1_unittest.cc

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] emergency outage requiring database restart

2017-08-10 Thread Merlin Moncure
On Thu, Aug 10, 2017 at 12:01 PM, Ants Aasma  wrote:
> On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure  wrote:
>> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma  wrote:
>>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure  wrote:
 Still getting checksum failures.   Over the last 30 days, I see the
 following.  Since enabling checksums FWICT none of the damage is
 permanent and rolls back with the transaction.   So creepy!
>>>
>>> The checksums still only differ in least significant digits which
>>> pretty much means that there is a block number mismatch. So if you
>>> rule out filesystem not doing its job correctly and transposing
>>> blocks, it could be something else that is resulting in blocks getting
>>> read from a location that happens to differ by a small multiple of
>>> page size. Maybe somebody is racily mucking with table fd's between
>>> seeking and reading. That would explain the issue disappearing after a
>>> retry.
>>>
>>> Maybe you can arrange for the RelFileNode and block number to be
>>> logged for the checksum failures and check what the actual checksums
>>> are in data files surrounding the failed page. If the requested block
>>> number contains something completely else, but the page that follows
>>> contains the expected checksum value, then it would support this
>>> theory.
>>
>> will do.   Main challenge is getting hand compiled server to swap in
>> so that libdir continues to work.  Getting access to the server is
>> difficult as is getting a maintenance window.  I'll post back ASAP.
>
> As a new datapoint, we just had a customer with an issue that I think
> might be related. The issue was reasonably repeatable by running a
> report on the standby system. Issue manifested itself by first "could
> not open relation" and/or "column is not in index" errors, followed a
> few minutes later by a PANIC from startup process due to "specified
> item offset is too large", "invalid max offset number" or "page X of
> relation base/16384/1259 is uninitialized". I took a look at the xlog
> dump and it was completely fine. For instance in the "specified item
> offset is too large" case there was a INSERT_LEAF redo record
> inserting the preceding offset just a couple hundred kilobytes back.
> Restarting the server sometimes successfully applied the offending
> WAL, sometimes it failed with other corruption errors. The offending
> relations were always pg_class or pg_class_oid_index. Replacing plsh
> functions with dummy plpgsql functions made the problem go away,
> reintroducing plsh functions made it reappear.

Fantastic.  I was never able to attempt to apply the O_CLOEXEC patch (see
upthread) due to the fact that access to the system is highly limited
and compiling a replacement binary was a bit of a headache.  IIRC this
was the best theory on the table as to the underlying cause, and we
ought to try that first, right?

Reminder; I was able to completely eliminate all damage (but had to
handle occasional unexpected rollback) via enabling checksums.

merlin




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-08-10 Thread Robert Haas
On Wed, Apr 19, 2017 at 5:25 AM, Maksim Milyutin
 wrote:
> Ok, thanks for the feedback. Then I'll use a new relkind for local
> partitioned index in further development.

Any update on this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select

2017-08-10 Thread Robert Haas
On Wed, Jul 5, 2017 at 12:14 AM, Mark Dilger  wrote:
> I can understand this, but wonder if I could use something like
>
> FOR I TOTALLY PROMISE TO USE ALL ROWS rec IN EXECUTE sql LOOP
> ...
> END LOOP;

Actually, what you'd need is:

FOR I TOTALLY PROMISE TO USE ALL ROWS AND IT IS OK TO BUFFER THEM ALL
IN MEMORY INSTEAD OF FETCHING THEM ONE AT A TIME FROM THE QUERY rec IN
EXECUTE sql LOOP

Similarly, RETURN QUERY could be made to work with parallelism if we
had RETURN QUERY AND IT IS OK TO BUFFER ALL THE ROWS IN MEMORY TWICE
INSTEAD OF ONCE.

I've thought a bit about trying to make parallel query support partial
execution, but it seems wicked hard.  The problem is that you can't
let the main process do anything parallel-unsafe (e.g., currently,
write any data) while there are workers in existence, or things
may blow up badly.  You could think about fixing that problem by
having all of the workers exit cleanly when the query is suspended,
and then firing up new ones when the query is resumed.  However, that
presents two further problems: (1) having the workers exit cleanly
when the query is suspended would cause wrong answers unless any
tuples that the worker has implicitly claimed e.g. by taking a page
from a parallel scan and returning only some of the tuples on it were
somehow accounted for and (2) starting and stopping workers over and
over would be bad for performance.  The second problem could be solved
by having a persistent pool of workers that attach and detach instead
of firing up new ones all the time, but that has a host of problems
all of its own.  The first one would be a desirable change for a bunch
of reasons but is not easy for reasons that are a little longer than I
feel like explaining right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Getting server crash on Windows when using ICU collation

2017-08-10 Thread Robert Haas
On Fri, Jun 23, 2017 at 1:14 AM, Ashutosh Sharma  wrote:
> Okay, attached is the patch which first detects the platform type and
> runs the uconv commands accordingly  from the corresponding icu bin
> path. Please have a look and let me know for any issues. Thanks.

Should this be on the open items list?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Default Partition for Range

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 6:56 AM, Jeevan Ladhe
 wrote:
>> This looks like a problem with the default list partitioning patch.  I
>> think "true" is what we expect to see here in both cases.
>
> In the case of list partitioning, if there is only a default partition,
> then there is no specific value set that can be excluded from the default
> partition. In such a case the DEFAULT partition for a list-partitioned
> table simply will not have any constraint on it, and hence the describe
> output has no constraints to print.
>
> Whereas, in the case of a default partition for a range-partitioned
> table, get_qual_for_range() creates a true boolean expression if default
> is the only partition. Here is the relevant code from Beena's patch:
>
> + else /* default is the only partition */
> + result = list_make1(makeBoolConst(true, false));
>
> I think it is unnecessary to create a constraint that is going to be
> always true. Instead, if we have no constraints we can avoid some extra
> CPU cycles which would otherwise be wasted in evaluating an expression
> that is always going to be true.

That's a reasonable argument.  I'm not sure whether having no
partition constraint at all is likely to be a source of bugs, but
certainly, not needing to check it is appealing.

> Having said this, I see a point that in such a case we should have
> some neat, meaningful content printed for "Partition constraint: ".
> I think we can address this when we construct the describe output string.
>
> Some ideas that come to my mind are:
> Partition constraint: NIL
> Partition constraint: no constraints
> No partition constraint.
> Partition constraint: true
>
> Please let me know your thoughts.

I like "No partition constraint." of those options.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-10 Thread Robert Haas
On Tue, Aug 8, 2017 at 10:48 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane  wrote:
>>> In the meantime, I think my vote would be to remove AtEOXact_CatCache.
>
>> In all supported branches?
>
> Whatever we do about this issue, I don't feel a need to do it further
> back than HEAD.  It's a non-problem except in an assert-enabled build,
> and we don't recommend running those for production, only development.

Sure, but people still do testing and development against older
branches - bug fixes, for example.  It doesn't make much sense to me
to leave code that we know does the wrong thing in the back branches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] dubious error message from partition.c

2017-08-10 Thread Robert Haas
On Wed, Aug 9, 2017 at 10:14 PM, Amit Langote
 wrote:
>> That seems like overkill.  I'm good with "greater than or equal to".
>
> Attached updated patch has "greater than or equal to".

Committed, with one small change which I hope will be considered an
improvement.  Your proposed primary error message was:

invalid range bound specification for partition \"%s\"

I changed it to:

empty range bound specified for partition \"%s\"

Thanks for working on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel documentation improvements

2017-08-10 Thread Robert Haas
On Tue, Aug 1, 2017 at 2:43 PM, Robert Haas  wrote:
> In commit 054637d2e08cda6a096f48cc99696136a06f4ef5, I updated the
> parallel query documentation to reflect recently-committed parallel
> query features.  However, a few more things got committed after that.
> Most of the attached patch consists of generalizing references to
> Gather to also include Gather Merge, but also included a tweak to note
> that uncorrelated subplans are no longer parallel-restricted and made
> a few other minor improvements and clarifications.
>
> Barring objections, I'd like to commit this in the next couple of days
> so that it is included in beta3.  I probably should have gotten to do
> this sooner; apologies for any inconvenience.

Well, that didn't happen.   Oops.  But better late than never -- committed now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Comment typo in plannodes.h

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 8:25 AM, Etsuro Fujita
 wrote:
> Here is a small patch for fixing a comment typo in plannodes.h: s/all the
> partitioned table/all the partitioned tables/.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 6:41 AM, AP  wrote:
> The index is 135GB rather than 900GB (from memory/give or take).

Whoa.  Big improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Failover Slots

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 2:38 AM, Craig Ringer  wrote:
> Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
> correspondence, and using globally unique slot names to manage state.

Yes, that's what I'm imagining.  (Whether I should instead be
imagining something else is the important question.)

> I'm quite happy to find a better one. But I cannot spend a lot of time
> writing something to have it completely knocked back because the scope just
> got increased again and now it has to do more, so it needs another rewrite.

Well, I can't guarantee anything about that.  I don't tend to argue
against designs to which I myself previously agreed, but other people
may, and there's not a lot I can do about that (although sometimes I
try to persuade them that they're wrong, if I think they are).  Of
course, sometimes you implement something and it doesn't look as good
as you thought it would; that's a risk of software development
generally.  I'd like to push back a bit on the underlying assumption,
though: I don't think that there was ever an agreed-upon design on
this list for failover slots before the first patch showed up.  Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

> A "failover slot" is identified by a  field in the slot struct and exposed
> in pg_replication_slots. It can be null (not a failover slots). It can
> indicate that the slot was created locally and is "owned" by this node; all
> downstreams should mirror it. It can also indicate that it is a mirror of an
> upstream, in which case clients may not replay from it until it's promoted
> to an owned slot and ceases to be mirrored. Attempts to replay from a
> mirrored slot just ERROR and will do so even once decoding on standby is
> supported.

+1

> This promotion happens automatically if a standby is promoted to a master,
> and can also be done manually via sql function call or walsender command to
> allow for an internal promotion within a cascading replica chain.

+1.

> When a replica connects to an upstream it asks via a new walsender msg "send
> me the state of all your failover slots". Any local mirror slots are
> updated. If they are not listed by the upstream they are known deleted, and
> the mirror slots are deleted on the downstream.

What about slots not listed by the upstream that are currently in use?

> The upstream walsender then sends periodic slot state updates while
> connected, so replicas can advance their mirror slots, and in turn send
> hot_standby_feedback that gets applied to the physical replication slot used
> by the standby, freeing resources held for the slots on the master.

+1.

> There's one big hole left here. When we create a slot on a cascading leaf or
> inner node, it takes time for hot_standby_feedback to propagate the needed
> catalog_xmin "up" the chain. Until the master has set the needed
> catalog_xmin on the physical slot for the closest branch, the inner node's
> slot's catalog_xmin can only be tentative pending confirmation. That's what
> a whole bunch of gruesomeness in the decoding on standby patch was about.
>
> One possible solution to this is to also mirror slots "up", as you alluded
> to: when you create an "owned" slot on a replica, it tells the master at
> connect time / slot creation time "I have this slot X, please copy it up the
> tree". The slot gets copied "up" to the master via cascading layers with a
> different failover slot type indicating it's an up-mirror. Decoding clients
> aren't allowed to replay from an up-mirror slot and it cannot be promoted
> like a down-mirror slot can, it's only there for resource retention. A node
> knows its owned slot is safe to actually use, and is fully created, when it
> sees the walsender report it in the list of failover slots from the master
> during a slot state update.

I'm not sure that this actually prevents the problem you describe.  It
also seems really complicated.  Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2017-08-10 Thread Ants Aasma
On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure  wrote:
> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma  wrote:
>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure  wrote:
>>> Still getting checksum failures.   Over the last 30 days, I see the
>>> following.  Since enabling checksums FWICT none of the damage is
>>> permanent and rolls back with the transaction.   So creepy!
>>
>> The checksums still only differ in least significant digits which
>> pretty much means that there is a block number mismatch. So if you
>> rule out filesystem not doing its job correctly and transposing
>> blocks, it could be something else that is resulting in blocks getting
>> read from a location that happens to differ by a small multiple of
>> page size. Maybe somebody is racily mucking with table fd's between
>> seeking and reading. That would explain the issue disappearing after a
>> retry.
>>
>> Maybe you can arrange for the RelFileNode and block number to be
>> logged for the checksum failures and check what the actual checksums
>> are in data files surrounding the failed page. If the requested block
>> number contains something completely else, but the page that follows
>> contains the expected checksum value, then it would support this
>> theory.
>
> will do.   Main challenge is getting hand compiled server to swap in
> so that libdir continues to work.  Getting access to the server is
> difficult as is getting a maintenance window.  I'll post back ASAP.
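
The least-significant-digits pattern in the quoted observation follows from
how pg_checksum_page() in src/include/storage/checksum_impl.h folds the
block number into the result.  A simplified sketch, with plain FNV-1a
standing in for the real vectorized block hash:

#include <stdint.h>
#include <stddef.h>

/*
 * Simplified from pg_checksum_page(): the block number is XORed into the
 * final 16-bit value, so checksumming the contents of a nearby wrong
 * block yields a result that differs mostly in the low-order bits --
 * consistent with the symptom described above.
 */
static uint16_t
page_checksum_sketch(const unsigned char *page, size_t len, uint32_t blkno)
{
	uint32_t	hash = 2166136261u;	/* stand-in block hash (FNV-1a) */

	for (size_t i = 0; i < len; i++)
		hash = (hash ^ page[i]) * 16777619u;

	return (uint16_t) ((hash ^ blkno) % 65535 + 1);
}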

As a new datapoint, we just had a customer with an issue that I think
might be related. The issue was reasonably repeatable by running a
report on the standby system. Issue manifested itself by first "could
not open relation" and/or "column is not in index" errors, followed a
few minutes later by a PANIC from startup process due to "specified
item offset is too large", "invalid max offset number" or "page X of
relation base/16384/1259 is uninitialized". I took a look at the xlog
dump and it was completely fine. For instance in the "specified item
offset is too large" case there was a INSERT_LEAF redo record
inserting the preceding offset just a couple hundred kilobytes back.
Restarting the server sometimes successfully applied the offending
WAL, sometimes it failed with other corruption errors. The offending
relations were always pg_class or pg_class_oid_index. Replacing plsh
functions with dummy plpgsql functions made the problem go away,
reintroducing plsh functions made it reappear.

The only thing I came up with that is consistent with the symptoms is
that a page got thrown out of shared_buffers between the two xlog
records referencing it (shared_buffers was default 128MB), and then
read back by a backend process, where in the time between FileSeek and
FileRead calls in mdread a subprocess mucked with the fd's offset so
that a different page than intended got read in. Or basically the same
race condition, but on the write side. Maybe somebody else has a
better imagination than me...
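
A minimal illustration of that race, assuming the file descriptor's offset
is shared (for example across a fork(), as with a plsh subprocess): the
seek-then-read pair is not atomic, while pread() is.

#include <unistd.h>
#include <sys/types.h>

#define BLCKSZ 8192

/*
 * Racy: if anything sharing this fd moves the file offset between the
 * lseek() and the read(), a different block than intended is returned.
 */
static ssize_t
read_block_racy(int fd, off_t blockno, char *buf)
{
	if (lseek(fd, blockno * BLCKSZ, SEEK_SET) < 0)
		return -1;
	return read(fd, buf, BLCKSZ);
}

/*
 * Safe: pread() takes the offset explicitly and neither uses nor changes
 * the fd's shared file offset.
 */
static ssize_t
read_block_safe(int fd, off_t blockno, char *buf)
{
	return pread(fd, buf, BLCKSZ, blockno * BLCKSZ);
}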

Regards,
Ants Aasma


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2017-08-10 Thread Josh Berkus
On 08/09/2017 10:49 PM, Michael Paquier wrote:
> On Fri, Aug 4, 2017 at 8:19 AM, Masahiko Sawada  wrote:
>> On Fri, Jul 28, 2017 at 2:24 PM, Noah Misch  wrote:
>>> This item appears under "decisions to recheck mid-beta".  If anyone is going
>>> to push for a change here, now is the time.
>>
>> It has been 1 week since the previous mail. I thought that there were
>> others who argued to change the behavior of the old-style setting so that
>> a quorum commit is chosen. If nobody is going to push for a change, can
>> we live with the current behavior?
> 
> FWIW, I still see no harm in keeping backward-compatibility here, so I
> am in favor of the status quo.
> 

I am vaguely in favor of making quorum the default over "ordered".
However, given that anybody using sync commit without
understanding/customizing the setup is going to be sorry regardless,
keeping backwards compatibility is acceptable.
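
For reference, the settings at issue, in PostgreSQL 10 syntax (the last
form is the old-style syntax whose backward-compatible meaning is being
rechecked here):

# priority-based: wait for the two highest-priority connected standbys
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'

# quorum-based: wait for any two of the listed standbys
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'

# old-style syntax, kept backward-compatible, i.e. treated as FIRST
synchronous_standby_names = '2 (s1, s2, s3)'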

-- 
Josh Berkus
Containers & Databases Oh My!


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-10 Thread Tom Lane
Alexander Korotkov  writes:
> ...
> You have a random mix of tabs and spaces here.

It's worth running pgindent over your code before submitting.  It should
be pretty easy to set that up nowadays, see src/tools/pgindent/README.
(If you find any portability problems while trying to install pgindent,
please let me know.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/perl extension fails on Windows

2017-08-10 Thread Tom Lane
Ashutosh Sharma  writes:
> On Thu, Aug 10, 2017 at 8:06 PM, Tom Lane  wrote:
>> Yeah ... however, if that's there, then there's something wrong with
>> Ashutosh's explanation, because that means we *are* building with
>> _USE_32BIT_TIME_T in 32-bit builds.  It's just getting there in a
>> roundabout way.  (Or, alternatively, this code is somehow not doing
>> anything at all.)

> I am extremely sorry if I have communicated things wrongly. What I meant
> was that we are always considering the _USE_32BIT_TIME_T flag to build
> the plperl module on the Windows 32-bit platform, but unfortunately it is
> not being considered/defined in the perl code when we use a VC++ compiler
> version greater than 8.0, and that's the reason for the binary mismatch
> error on the 32-bit platform.

Got it.  So in short, it seems like the attached patch ought to fix it
for MSVC builds.  (We'd also need to teach PGAC_CHECK_PERL_EMBED_CCFLAGS
to let _USE_32BIT_TIME_T through on Windows, but let's confirm the theory
first.)

regards, tom lane

diff --git a/src/tools/msvc/MSBuildProject.pm b/src/tools/msvc/MSBuildProject.pm
index d7638b4..27329f9 100644
*** a/src/tools/msvc/MSBuildProject.pm
--- b/src/tools/msvc/MSBuildProject.pm
*** EOF
*** 63,83 

  EOF
  
- 	# We have to use this flag on 32 bit targets because the 32bit perls
- 	# are built with it and sometimes crash if we don't.
- 	my $use_32bit_time_t =
- 	  $self->{platform} eq 'Win32' ? '_USE_32BIT_TIME_T;' : '';
- 
  	$self->WriteItemDefinitionGroup(
  		$f, 'Debug',
! 		{   defs=> "_DEBUG;DEBUG=1;$use_32bit_time_t",
  			opt => 'Disabled',
  			strpool => 'false',
  			runtime => 'MultiThreadedDebugDLL' });
  	$self->WriteItemDefinitionGroup(
  		$f,
  		'Release',
! 		{   defs=> "$use_32bit_time_t",
  			opt => 'Full',
  			strpool => 'true',
  			runtime => 'MultiThreadedDLL' });
--- 63,78 

  EOF
  
  	$self->WriteItemDefinitionGroup(
  		$f, 'Debug',
! 		{   defs=> "_DEBUG;DEBUG=1",
  			opt => 'Disabled',
  			strpool => 'false',
  			runtime => 'MultiThreadedDebugDLL' });
  	$self->WriteItemDefinitionGroup(
  		$f,
  		'Release',
! 		{   defs=> "",
  			opt => 'Full',
  			strpool => 'true',
  			runtime => 'MultiThreadedDLL' });
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index a7e3a01..940bef6 100644
*** a/src/tools/msvc/Mkvcbuild.pm
--- b/src/tools/msvc/Mkvcbuild.pm
*** sub mkvcbuild
*** 522,528 
  		my @perl_embed_ccflags;
  		foreach my $f (split(" ",$Config{ccflags}))
  		{
! 			if ($f =~ /^-D[^_]/)
  			{
  $f =~ s/\-D//;
  push(@perl_embed_ccflags, $f);
--- 522,529 
  		my @perl_embed_ccflags;
  		foreach my $f (split(" ",$Config{ccflags}))
  		{
! 			if ($f =~ /^-D[^_]/ ||
! 			$f =~ /^-D_USE_32BIT_TIME_T/)
  			{
  $f =~ s/\-D//;
  push(@perl_embed_ccflags, $f);
diff --git a/src/tools/msvc/VCBuildProject.pm b/src/tools/msvc/VCBuildProject.pm
index a8d75d8..669ba17 100644
*** a/src/tools/msvc/VCBuildProject.pm
--- b/src/tools/msvc/VCBuildProject.pm
*** sub WriteHeader
*** 33,47 
   
  EOF
  
- 	# We have to use this flag on 32 bit targets because the 32bit perls
- 	# are built with it and sometimes crash if we don't.
- 	my $use_32bit_time_t =
- 	  $self->{platform} eq 'Win32' ? '_USE_32BIT_TIME_T;' : '';
- 
- 
  	$self->WriteConfiguration(
  		$f, 'Debug',
! 		{   defs => "_DEBUG;DEBUG=1;$use_32bit_time_t",
  			wholeopt => 0,
  			opt  => 0,
  			strpool  => 'false',
--- 33,41 
   
  EOF
  
  	$self->WriteConfiguration(
  		$f, 'Debug',
! 		{   defs => "_DEBUG;DEBUG=1",
  			wholeopt => 0,
  			opt  => 0,
  			strpool  => 'false',
*** EOF
*** 49,55 
  	$self->WriteConfiguration(
  		$f,
  		'Release',
! 		{   defs => "$use_32bit_time_t",
  			wholeopt => 0,
  			opt  => 3,
  			strpool  => 'true',
--- 43,49 
  	$self->WriteConfiguration(
  		$f,
  		'Release',
! 		{   defs => "",
  			wholeopt => 0,
  			opt  => 3,
  			strpool  => 'true',

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-10 Thread Craig Ringer
On 10 August 2017 at 23:25, Robert Haas  wrote:

> On Mon, Aug 7, 2017 at 2:06 AM, Craig Ringer 
> wrote:
> > I think so - specifically, that it's a leftover from a revision where the
> > xid limit was advanced before clog truncation.
> >
> > I'll be finding time in the next couple of days to look more closely and
> > ensure that's all it is.
>
> A couple of days having gone by without hearing further on this, I'm
> going to assume for the moment that my analysis is correct and just
> push a commit to remove this assertion.  That's not intended to
> discourage you from spending further time on this - it would be bad to
> be wrong about this - but doing something we all agree is wrong can't
> be better than doing something that I, at least, think is correct.
>


Thanks.

Keeps getting bumped down by other things, but it's on the stack to not get
lost.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-10 Thread Alexander Korotkov
On Thu, Aug 10, 2017 at 7:37 AM, Michael Paquier 
wrote:

> On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas  wrote:
> > The patch doesn't really conform to our coding standards, though, so
> > you need to clean it up (or, if you're not sure what you need to do,
> > you need to have someone who knows how PostgreSQL code needs to look
> > review it for you).
>
> The documentation has a couple of rules for coding conventions:
> https://www.postgresql.org/docs/9.6/static/source.html


+1

Ildus, from the first glance I see at least following violations of
PostgreSQL coding standards in your code.

+/*
+ * Converts tsvector with the old structure to current.
+ * @orig - tsvector to convert,
+ * @copy - return copy of tsvector, it has a meaning when tsvector doensn't
+ * need to be converted.
+ */
This comment will be reflowed by pgindent.  Also, we don't use '@' for
parameter descriptions in comments.
https://www.postgresql.org/docs/9.6/static/source-format.html
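
For illustration, the same comment in the project's usual style might look
roughly like this (the wording is only a sketch):

/*
 * tsvector_upgrade
 *		Convert a tsvector from the old on-disk format to the current one.
 *
 * 'orig' is the tsvector to convert.  If 'copy' is true, a copy is
 * returned even when the input doesn't need to be converted.
 */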

+TSVector
+tsvector_upgrade(Datum orig, bool copy)
+{
+ int   i,
+   dataoff = 0,
+   datalen = 0,
+   totallen;
+ TSVector   in,
+   out;

You have a random mix of tabs and spaces here.

+ {
+ stroff = SHORTALIGN(stroff); \
+ entry->hasoff = 0;
+ entry->len = lexeme_len;
+ entry->npos = npos;
+ }

What is this backslash doing here?
There are other similar (and probably different) violations of the coding
standard throughout the code.  Ildus, please check your patches carefully
before publishing.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-10 Thread Robert Haas
On Mon, Aug 7, 2017 at 2:06 AM, Craig Ringer  wrote:
> I think so - specifically, that it's a leftover from a revision where the
> xid limit was advanced before clog truncation.
>
> I'll be finding time in the next couple of days to look more closely and
> ensure that's all it is.

A couple of days having gone by without hearing further on this, I'm
going to assume for the moment that my analysis is correct and just
push a commit to remove this assertion.  That's not intended to
discourage you from spending further time on this - it would be bad to
be wrong about this - but doing something we all agree is wrong can't
be better than doing something that I, at least, think is correct.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/perl extension fails on Windows

2017-08-10 Thread Ashutosh Sharma
On Thu, Aug 10, 2017 at 8:06 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Aug 10, 2017 at 8:30 AM, Ashutosh Sharma  
>> wrote:
>>> So, the question is whether we should allow the preprocessor option
>>> '_USE_32BIT_TIME_T' to be defined implicitly or not in postgresql,
>>> provided we are using a Visual C compiler version > 8.0. I think this
>>> should make plperl compatible with the perl binaries.
>
>> We've got this code in MSBuildProject.pm and VCBuildProject.pm:
>
>> # We have to use this flag on 32 bit targets because the 32bit perls
>> # are built with it and sometimes crash if we don't.
>> my $use_32bit_time_t =
>>   $self->{platform} eq 'Win32' ? '_USE_32BIT_TIME_T;' : '';
>
>> Based on the discussion upthread, I think we now know that this was
>> not really the right approach.
>
> Yeah ... however, if that's there, then there's something wrong with
> Ashutosh's explanation, because that means we *are* building with
> _USE_32BIT_TIME_T in 32-bit builds.  It's just getting there in a
> roundabout way.  (Or, alternatively, this code is somehow not doing
> anything at all.)

I am extremely sorry if I have communicated things wrongly. What I meant
was that we are always considering the _USE_32BIT_TIME_T flag to build
the plperl module on the Windows 32-bit platform, but unfortunately it is
not being considered/defined in the perl code when we use a VC++ compiler
version greater than 8.0, and that's the reason for the binary mismatch
error on the 32-bit platform.

Here, is what has been mentioned in 'win32/GNUMakefile' in perl source
for version 5.24.1

# In VS 2005 (VC++ 8.0) Microsoft changes time_t from 32-bit to
# 64-bit, even in 32-bit mode.  It also provides the _USE_32BIT_TIME_T
# preprocessor option to revert back to the old functionality for
# backward compatibility.  We define this symbol here for older 32-bit
# compilers only (which aren't using it at all) for the sole purpose
# of getting it into $Config{ccflags}.  That way if someone builds
# Perl itself with e.g. VC6 but later installs an XS module using VC8
# the time_t types will still be compatible.
ifeq ($(WIN64),undef)
ifeq ($(PREMSVC80),define)
BUILDOPT+= -D_USE_32BIT_TIME_T
endif
endif


>
>> The trouble with that is that _USE_32BIT_TIME_T also affects how
>> PostgreSQL code compiles.
>
> Really?  We try to avoid touching "time_t" at all in most of the code.
> I bet that we could drop the above-cited code, and compile only plperl
> with _USE_32BIT_TIME_T, taken (if present) from the Perl flags, and
> it'd be fine.  At least, that's my first instinct for what to try.
>


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/perl extension fails on Windows

2017-08-10 Thread Tom Lane
Robert Haas  writes:
> On Thu, Aug 10, 2017 at 10:36 AM, Tom Lane  wrote:
>> Yeah ... however, if that's there, then there's something wrong with
>> Ashutosh's explanation, because that means we *are* building with
>> _USE_32BIT_TIME_T in 32-bit builds.  It's just getting there in a
>> roundabout way.  (Or, alternatively, this code is somehow not doing
>> anything at all.)

> I don't follow.

The stanzas you pointed to in the MSVC build scripts should mean that
a 32-bit PG build is using _USE_32BIT_TIME_T, no?  And Ashutosh stated
that he saw _USE_32BIT_TIME_T in "perl -V" output.  So how are they
not ending up compatible?

Now, if that statement was wrong and his 32-bit Perl actually *isn't*
built with _USE_32BIT_TIME_T, then this is clearly what's causing the
problem.

>> Really?  We try to avoid touching "time_t" at all in most of the code.
>> I bet that we could drop the above-cited code, and compile only plperl
>> with _USE_32BIT_TIME_T, taken (if present) from the Perl flags, and
>> it'd be fine.  At least, that's my first instinct for what to try.

> Oh.  Well, if that's an OK thing to do, then sure, wfm.  I guess we've
> got pg_time_t plastered all over the backend but that's not actually
> time_t under the hood, so it's fine.  I do see time_t being used in
> frontend code, but that won't matter for this.

Yeah.  I think this should work as long as plperl itself doesn't use
time_t, or at least doesn't exchange time_t with any other part of the
system, and since we don't use that type in any common APIs that seems
like an OK assumption.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallelize queries containing initplans

2017-08-10 Thread Amit Kapila
On Wed, Aug 9, 2017 at 6:51 PM, Haribabu Kommi  wrote:
>
> On Wed, Aug 9, 2017 at 9:26 PM, Amit Kapila  wrote:
>>
>
> By the way, I tested the patch with my DML-support-for-parallel patch to
> check the RETURNING clause of INSERT, and all the RETURNING-clause
> initplans are parallel plans with this patch.
>

Good to know.

>>
>> By the way, the patch doesn't apply on HEAD, so I attached a rebased patch.
>
>
>
> Thanks for the updated patch. I have some comments and I am yet to finish
> the review.
>
> + /*
> + * We don't want to generate gather or gather merge node if there are
> + * initplans at some query level below the current query level as those
> + * plans could be parallel-unsafe or undirect correlated plans.  Ensure to
> + * mark all the partial and non-partial paths for a relation at this level
> + * to be parallel-unsafe.
> + */
> + if (is_initplan_below_current_query_level(root))
> + {
> + foreach(lc, rel->partial_pathlist)
> + {
> + Path   *subpath = (Path *) lfirst(lc);
> +
> + subpath->parallel_safe = false;
> + }
> +
> + foreach(lc, rel->pathlist)
> + {
> + Path   *subpath = (Path *) lfirst(lc);
> +
> + subpath->parallel_safe = false;
> + }
> + return;
> + }
> +
>
> The above part of the code is almost the same in multiple places; is it
> possible to change it into a function?
>

Sure, we can do that and I think that makes sense as well, so changed
accordingly in the attached patch.
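
Such a helper might look roughly like this (a sketch based on the quoted
hunk; the function name is illustrative):

/* Mark every partial and non-partial path of 'rel' parallel-unsafe. */
static void
mark_paths_parallel_unsafe(RelOptInfo *rel)
{
	ListCell   *lc;

	foreach(lc, rel->partial_pathlist)
		((Path *) lfirst(lc))->parallel_safe = false;

	foreach(lc, rel->pathlist)
		((Path *) lfirst(lc))->parallel_safe = false;
}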

>
> + node->initParam = NULL;
>
> This new parameter is set to NULL in the make_gather function, and the
> same parameter is added to the GatherMerge structure also. But this
> parameter is already set to NULL by the makeNode macro, so why set it
> only here and not in the other place?
>
> Do we need to set a parameter to its default value, such as NULL or
> false, if it already has that value? This question is not specific to the
> above parameter; it applies to all existing parameters too.
>

Strictly speaking, it is not required, but I have initialised it in only
one of the places to make it consistent with nearby code.  We already do
similar stuff in some other functions like make_seqscan,
make_samplescan, etc.

I don't see any value in trying to initialize it for GatherMerge as
well, so left it as it is.

>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else if (IsA(plan, GatherMerge))
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
> + else
> + elog(ERROR, "unrecognized node type: %d", nodeTag(plan));
>
> The else case is not possible, because it is already validated for Gather or
> GatherMerge.
> Can we change it to a simple if and else?
>

As we already have an assert in this function to protect from any
other node type (nodes other than Gather and Gather Merge), it makes
sense to change the code to just if...else, so changed accordingly.
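
The resulting shape would be roughly this (a sketch of the change as
described, relying on the existing assertion):

Assert(IsA(plan, Gather) || IsA(plan, GatherMerge));

if (IsA(plan, Gather))
	((Gather *) plan)->initParam =
		bms_intersect(plan->lefttree->extParam, initSetParam);
else
	((GatherMerge *) plan)->initParam =
		bms_intersect(plan->lefttree->extParam, initSetParam);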


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pq_pushdown_initplan_v.6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2017-08-10 Thread Alexander Korotkov
On Thu, Aug 10, 2017 at 3:30 PM, Sokolov Yura 
wrote:

> On 2017-07-31 18:56, Sokolov Yura wrote:
>
>> Good day, everyone.
>>
>> In an attempt to improve the performance of YCSB on a zipfian
>> distribution, it was found that significant time is spent in
>> XidInMVCCSnapshot scanning the snapshot->xip array. While the overall
>> CPU time is not too noticeable, it has a measurable impact on
>> scalability.
>>
>> First I tried to sort snapshot->xip in GetSnapshotData, and search in a
>> sorted array. But since snapshot->xip is not touched if no transaction
>> contention occurs, always sorting xip is not the best option.
>>
>> Then I sorted the xip array on demand in XidInMVCCSnapshot, only if a
>> search in snapshot->xip occurs (i.e. lazy sorting). It performs much
>> better, but since it is O(N log N), the sort's execution time is
>> noticeable for a large number of clients.
>>
>> The third approach (present in the attached patch) is building a hash
>> table lazily on the first search in the xip array.
>>
>> Note: the hash table is not built if the number of "in-progress" xids
>> is less than 60. Tests show there is no big benefit from building it
>> below that (at least on Intel Xeon).
>>
>> With regards,
>>
>
> Here is a new, more correct patch version, and results for pgbench with
> zipfian distribution (50% read, 50% write) (same as Alik's tests at
> https://www.postgresql.org/message-id/BF3B6F54-68C3-417A-BFAb-fb4d66f2b...@postgrespro.ru )
>
>
> clients | master | hashsnap2 | hashsnap2_lwlock
> ++---+--
>  10 | 203384 |203813 |   204852
>  20 | 334344 |334268 |   363510
>  40 | 228496 |231777 |   383820
>  70 | 146892 |148173 |   221326
> 110 |  99741 |111580 |   157327
> 160 |  65257 | 81230 |   112028
> 230 |  38344 | 56790 |77514
> 310 |  22355 | 39249 |55907
> 400 |  13402 | 26899 |39742
> 500 |   8382 | 17855 |28362
> 650 |   5313 | 11450 |17497
> 800 |   3352 |  7816 |11030
>
> (Note: I'm not quite sure why my numbers for master are lower than
> Alik's. Probably our test methodology differs a bit. Perhaps it is
> because I don't recreate tables between client-count changes, only
> between branch switches.)
>

These results look very cool!
I think CSN is eventually inevitable, but it's a long-distance feature.
Thus, this optimization could serve us well before we have CSN.
Do you expect there are cases when this patch causes a slowdown?  What
about the case when we scan each xip array only once (not sure how to
simulate that using pgbench)?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-08-10 Thread Alexander Korotkov
On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas  wrote:

> On Tue, Aug 1, 2017 at 4:00 PM, Ildus K 
> wrote:
> > It's a workaround. DatumGetTSVector and
> > DatumGetTSVectorCopy will upgrade tsvector on the fly if it
> > has old format.
>
> Hmm, that seems like a real fix, not just a workaround.  If you can
> transparently read the old format, there's no problem.  Not sure about
> performance, though.
>

+1
Ildus, I think we need to benchmark reading of the old format.  There
would be a tradeoff between the performance of old-format reading and the
amount of extra code needed.  Once we have benchmarks we can consider
whether this is the solution we would like to buy.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] pl/perl extension fails on Windows

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 10:36 AM, Tom Lane  wrote:
> Yeah ... however, if that's there, then there's something wrong with
> Ashutosh's explanation, because that means we *are* building with
> _USE_32BIT_TIME_T in 32-bit builds.  It's just getting there in a
> roundabout way.  (Or, alternatively, this code is somehow not doing
> anything at all.)

I don't follow.

>> The trouble with that is that _USE_32BIT_TIME_T also affects how
>> PostgreSQL code compiles.
>
> Really?  We try to avoid touching "time_t" at all in most of the code.
> I bet that we could drop the above-cited code, and compile only plperl
> with _USE_32BIT_TIME_T, taken (if present) from the Perl flags, and
> it'd be fine.  At least, that's my first instinct for what to try.

Oh.  Well, if that's an OK thing to do, then sure, wfm.  I guess we've
got pg_time_t plastered all over the backend but that's not actually
time_t under the hood, so it's fine.  I do see time_t being used in
frontend code, but that won't matter for this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] reload-through-the-top-parent switch the partition table

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 3:47 AM, Rushabh Lathia
 wrote:
>> (1) seems like a pretty arbitrary restriction, so I don't like that
>> option.  (2) would hurt performance in some use cases.  Do we have an
>> option (3)?
>
> How about protecting option 2) with the load-via-partition-root check?
> That means loading the parents information, even when dump is not set,
> only when load-via-partition-root is set and the table is a partition.
>
> This won't hurt performance for the normal cases.

Yes, that seems like the right approach.

+Dump data via the top-most partitioned table (rather than partition
+table) when dumping data for the partition table.

I think we should phrase this a bit more clearly, something like this:
When dumping a COPY or INSERT statement for a partitioned table,
target the root of the partitioning hierarchy which contains it rather
than the partition itself.  This may be useful when reloading data on
a server where rows do not always fall into the same partitions as
they did on the original server.  This could happen, for example, if
the partitioning column is of type text and the two systems have
different definitions of the collation used to partition the data.

+printf(_("  --load-via-partition-rootload partition table via
the root relation\n"));

"relation" seems odd to me here.  root table?

 /* Don't bother computing anything for non-target tables, either */
 if (!tblinfo[i].dobj.dump)
+{
+/*
+ * Load the parents information for the partition table when
+ * the load-via-partition-root option is set. As we need the
+ * parents information to get the partition root.
+ */
+if (dopt->load_via_partition_root &&
+tblinfo[i].ispartition)
+findParentsByOid(&tblinfo[i], inhinfo, numInherits);
 continue;
+}

Duplicating the call to findParentsByOid seems less than ideal.  How
about doing something like this:

if (tblinfo[i].dobj.dump)
{
find_parents = true;
mark_parents = true;
}
else if (dopt->load_via_partition_root && tblinfo[i].ispartition)
find_parents = true;

if (find_parents)
findParentsByOid(&tblinfo[i], inhinfo, numInherits);

etc.

The comments for this function also need some work - e.g. the function
header comment deserves some kind of update for these changes.

+static TableInfo *
+getRootTableInfo(TableInfo *tbinfo)
+{
+Assert(tbinfo->ispartition);
+Assert(tbinfo->numParents == 1);
+if (tbinfo->parents[0]->ispartition)
+return getRootTableInfo(tbinfo->parents[0]);
+
+return tbinfo->parents[0];
+}

This code should iterate, not recurse, to avoid any risk of blowing
out the stack.
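
An iterative version might look like this (same logic as the quoted
function, with a loop instead of recursion):

static TableInfo *
getRootTableInfo(TableInfo *tbinfo)
{
	TableInfo  *parent;

	Assert(tbinfo->ispartition);
	Assert(tbinfo->numParents == 1);

	parent = tbinfo->parents[0];
	while (parent->ispartition)
	{
		Assert(parent->numParents == 1);
		parent = parent->parents[0];
	}

	return parent;
}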

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/perl extension fails on Windows

2017-08-10 Thread Tom Lane
Robert Haas  writes:
> On Thu, Aug 10, 2017 at 8:30 AM, Ashutosh Sharma  
> wrote:
>> So, the question is whether we should allow the preprocessor option
>> '_USE_32BIT_TIME_T' to be defined implicitly or not in postgresql,
>> provided we are using a Visual C compiler version > 8.0. I think this
>> should make plperl compatible with the perl binaries.

> We've got this code in MSBuildProject.pm and VCBuildProject.pm:

> # We have to use this flag on 32 bit targets because the 32bit perls
> # are built with it and sometimes crash if we don't.
> my $use_32bit_time_t =
>   $self->{platform} eq 'Win32' ? '_USE_32BIT_TIME_T;' : '';

> Based on the discussion upthread, I think we now know that this was
> not really the right approach.

Yeah ... however, if that's there, then there's something wrong with
Ashutosh's explanation, because that means we *are* building with
_USE_32BIT_TIME_T in 32-bit builds.  It's just getting there in a
roundabout way.  (Or, alternatively, this code is somehow not doing
anything at all.)

> The trouble with that is that _USE_32BIT_TIME_T also affects how
> PostgreSQL code compiles.

Really?  We try to avoid touching "time_t" at all in most of the code.
I bet that we could drop the above-cited code, and compile only plperl
with _USE_32BIT_TIME_T, taken (if present) from the Perl flags, and
it'd be fine.  At least, that's my first instinct for what to try.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Chris Travers
On Thu, Aug 10, 2017 at 3:17 PM, Vladimir Rusinov 
wrote:

>
>
> On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <
> a.aleks...@postgrespro.ru> wrote:
>
>> I just wanted to point out that a hardware issue or third-party software
>> issues (bugs in the FS, software RAID, ...) cannot be fully excluded from
>> the list of suspects. According to the talk by Christophe Pettus [1],
>> it's not as uncommon as most people think.
>
>
> This still might be the case of hardware corruption, but it does not look
> like one.
>

Yeah, I don't think so either.  The systems were not restarted, only the
service, so I don't think this is a lie-on-write case.  We have ECC with
full checks, etc.  It really looks like something I initiated caused it,
but I'm not sure what, and I'm really not interested in trying to
reproduce it on a db of this size.

>
> The likelihood of two different persons seeing a similar error message
> just a year apart is low. In our experience, hardware corruption usually
> looks like a random single bit flip (most common - bad CPU or memory), a
> bunch of zeroes (bad storage), or a bunch of complete garbage (usually
> indicating in-memory pointer corruption).
>
> Chris, if you still have the original WAL segment from the master and its
> corrupt copy from the standby, can you do a bit-by-bit comparison to see
> how they are different? Also, if you can, please share some hardware
> details. Specifically, do you use ECC? If so, are there any ECC errors
> logged? Do you use physical disks/ssd or some form of storage
> virtualization?
>

Straight on bare metal, ECC with no errors logged.  SSD for both data and
WAL.

The bitwise comparison is interesting.  Remember the error was:

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
1146880


Starting with the good segment:

Good WAL segment; I think the record starts at 003b:


0117fb0     003b   

0117fc0 7f28 e111 e39c 0001 0940  cb88 db01

0117fd0 0200  067f  4000  2249 0195

0117fe0 0001  8001  b5c3  05ff 

0117ff0  0003   008c   

0118000 d093 0005 0001  8000 e111 e39c 0001

0118010 0084    7fb8 e111 e39c 0001

0118020 0910  ccac 2eba 2000 0056 067f 

0118030 4000  2249 0195 b5c4  08ff 0001

0118040 0002 0003 0004 0005 0006 0007 0008 0009

0118050 000a 000b 000c 000d 000e 000f 0010 0011

0118060 0012 0013 0014 0015 0016 0017 0018 0019

0118070 001a 001b 001c 001d 001e 001f 0020 0021


And the same region in the corrupt segment:

0117fb0     003b   

0117fc0 7f28 e111 e39c 0001 0940  cb88 db01

0117fd0 0200  067f  4000  2249 0195

0117fe0 0001  8001  b5c3  05ff 

0117ff0  0003   4079 ce05 1cce ecf9

0118000 d093 0005 0001  8000 6111 e375 0001

0118010 119d    cfd4 00cc ca00 0410

0118020 1800 7c00 5923 544b dc20 914c 7a5c afec

0118030 db45 0060 b700 1910 1800 7c00 791f 2ede

0118040 c573 a110 5a88 e1e6 ab48 0034 9c00 2210

0118050 1800 7c00 4415 400d 2c7e b5e3 7c88 bcef

0118060 4666 00db 9900 0a10 1800 7c00 7d1d b355
0118070 d432 8365 de99 4dba 87c7 00ed 6200 2210

I think the divergence is interesting here.  Up through 0117ff8, they are
identical.  Then the last half of the line differs.
The first half of the next line is the same (but up through 011800a this
time), but the last 6 bytes differ (those six hold what appear to be the
memory address causing the problem), and we only have a few bits different
in the rest of the line.

It looks like some data and some flags were overwritten, perhaps while the
process exited.  Very interesting.


> Also, in the absolute majority of cases corruption is caught by
> checksums. I am not familiar with the WAL protocol - do we have enough
> checksums when writing it out and on the wire? I suspect there are many
> more things PostgreSQL can do to be more resilient, and at least detect
> corruption earlier.
>

Since this didn't throw a checksum error (we have data checksums disabled,
but WAL records ISTR have a separate CRC check), would this perhaps
indicate that the checksum operated over incorrect data?
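
For reference, WAL record verification is roughly the following
(simplified from ValidXLogRecord() in xlogreader.c): the CRC covers the
record body plus the header up to xl_crc, so it only proves the bytes
match what was checksummed when the record was inserted.  The "unexpected
pageaddr" failure above comes from the page-header sanity check, which
runs as pages are read, before any record CRC is verified.

#include "access/xlogrecord.h"
#include "port/pg_crc32c.h"

static bool
record_crc_ok(XLogRecord *record)
{
	pg_crc32c	crc;

	INIT_CRC32C(crc);
	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord,
				record->xl_tot_len - SizeOfXLogRecord);
	/* include the record header last */
	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
	FIN_CRC32C(crc);

	return EQ_CRC32C(record->xl_crc, crc);
}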

>
> --
> Vladimir Rusinov
> PostgreSQL SRE, Google Ireland
>
> Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
> Registered in Dublin, Ireland
> Registration Number: 368047
>



-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Comment in snapbuild.c file

2017-08-10 Thread Alvaro Herrera
Masahiko Sawada wrote:
> Hi all,
> 
> In snapbuild.c file, there is a comment as follows.
> 
>* NB: Because of that xmax can be lower than xmin, because we only
>* increase xmax when a catalog modifying transaction commits. While odd
>* looking, it's correct and actually more efficient this way since we hit
>* fast paths in tqual.c.
>*/
> 
> Maybe we can get rid of the second "because" in the first sentence?

I think the whole para needs to be rethought.  I propose this:

 * NB: We only increase xmax when a catalog-modifying transaction 
commits
 * (see SnapBuildCommitTxn).  Because of this, xmax can be lower than 
xmin,
 * which looks odd but is correct and actually more efficient, since we 
hit
 * fast paths in tqual.c.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/perl extension fails on Windows

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 8:30 AM, Ashutosh Sharma  wrote:
> So, the question is whether we should allow the preprocessor option
> '_USE_32BIT_TIME_T' to be defined implicitly or not in postgresql,
> provided we are using a Visual C compiler version > 8.0. I think this
> should make plperl compatible with the perl binaries.

We've got this code in MSBuildProject.pm and VCBuildProject.pm:

# We have to use this flag on 32 bit targets because the 32bit perls
# are built with it and sometimes crash if we don't.
my $use_32bit_time_t =
  $self->{platform} eq 'Win32' ? '_USE_32BIT_TIME_T;' : '';

Based on the discussion upthread, I think we now know that this was
not really the right approach.  IIUC, a given 32-bit Perl might've
been built with _USE_32BIT_TIME_T, or not; in fact, given the comments
you found on the Perl side, it's more than likely that most modern
32-bit Perls are NOT built with this option.  What we actually need to
do here is use _USE_32BIT_TIME_T if and only if it was used when Perl
was built (otherwise we'll end up with a different interpreter size).

The trouble with that is that _USE_32BIT_TIME_T also affects how
PostgreSQL code compiles.  Apparently, given that according to Perl
this was changed by Microsoft in 2005, we're forcing the Microsoft
compilers into a non-default backward compatibility mode by defining
this symbol, and that affects how our entire code base compiles -- and
it just so happens that the result is compatible with older Perl
builds that used _USE_32BIT_TIME_T and not compatible with newer ones
that don't.

Maybe we need to make _USE_32BIT_TIME_T a compile-time configuration
option on Windows, and then cross-check that our setting is compatible
with Perl's setting.
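
A minimal sketch of such a cross-check, assuming a hypothetical
PLPERL_TIME_T_SIZE macro that the build scripts would derive from Perl's
$Config{ccflags} (4 if -D_USE_32BIT_TIME_T is present, 8 otherwise):

#include <time.h>

#ifndef PLPERL_TIME_T_SIZE
#define PLPERL_TIME_T_SIZE 8	/* hypothetical; a configure step would set it */
#endif

/* Fails the build if plperl's time_t width doesn't match Perl's. */
typedef char plperl_time_t_size_check[
	sizeof(time_t) == PLPERL_TIME_T_SIZE ? 1 : -1];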

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 5:43 AM, Thomas Munro
 wrote:
>> Do you think we solving this problem is a prerequisite for
>> partition-wise join? Or should we propose that patch as a separate
>> enhancement?
>
> No, I'm not proposing anything yet.  For now I just wanted to share
> this observation about where hot CPU time goes in simple tests, and
> since it turned out to be a loop in a loop that I could see an easy
> way to fix for singleton sets and sets with a small range, I couldn't
> help trying it...  But I'm still trying to understand the bigger
> picture.  I'll be interested to compare profiles with the ordered
> append_rel_list version you have mentioned, to see how that moves the
> hot spots.

Perhaps this is stating the obvious, but it's often better to optimize
things like this at a higher level, rather than by tinkering with
stuff like Bitmapset.  On the other hand, sometimes
micro-optimizations are the way to go, because optimizing
find_ec_member_for_tle(), for example, might involve a much broader
rethink of the planner code than we want to undertake right now.

> I guess one very practical question to ask is: can we plan queries
> with realistic numbers of partitioned tables and partitions in
> reasonable times?  Well, it certainly looks very good for hundreds of
> partitions so far...  My own experience of partitioning with other
> RDBMSs has been on that order, 'monthly partitions covering the past
> 10 years' and similar, but on the other hand it wouldn't be surprising
> to learn that people want to go to many thousands, especially for
> schemas which just keep adding partitions over time and don't want to
> drop them.

I've been thinking that it would be good if this feature - and other
new partitioning features - could scale to about 1000 partitions
without too much trouble.  Eventually, it might be nice to scale
higher, but there's not much point in making partition-wise join scale
to 100,000 partitions if we've got some other part of the system that
runs into trouble beyond 250.

> Curious: would you consider joins between partitioned tables and
> non-partitioned tables where the join is pushed down to be a kind of
> "partition-wise join", or something else?  If so, would that be a
> special case, or just the logical extreme case for
> 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
> one single "partition" on the non-partitioned side maps to all the
> partitions on the partitioned size?

I think this is actually a really important case which we've just
excluded from the initial scope because the problem is hard enough
already.  But it's quite possible that if you are joining partitioned
tables A and B with unpartitioned table X, the right join order could
be A-X-B; the A-X join might knock out a lot of rows.  It's not great
to have to pick between doing the A-B join partitionwise and doing the
A-X join first; you want to get both things.  But we can't do
everything at once.

Further down the road, there's more than one way of doing the A-X
join.  You could join each partition of A to all of X, which is likely
optimal if for example you are doing a nested loop with an inner index
scan on X.  But you could also partition X on the fly using A's
partitioning scheme and then join partitions of A against the
on-the-fly-partitioned version of X.  That's likely to be a lot better
for a merge join with an underlying sort on X.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Vladimir Rusinov
On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <
a.aleks...@postgrespro.ru> wrote:

> I just wanted to point out that a hardware issue or third-party software
> issues (bugs in the FS, software RAID, ...) cannot be fully excluded from
> the list of suspects. According to the talk by Christophe Pettus [1],
> it's not as uncommon as most people think.


This still might be the case of hardware corruption, but it does not look
like one.

The likelihood of two different persons seeing a similar error message
just a year apart is low. In our experience, hardware corruption usually
looks like a random single bit flip (most common - bad CPU or memory), a
bunch of zeroes (bad storage), or a bunch of complete garbage (usually
indicating in-memory pointer corruption).

Chris, if you still have the original WAL segment from the master and its
corrupt copy from the standby, can you do a bit-by-bit comparison to see
how they are different? Also, if you can, please share some hardware
details. Specifically, do you use ECC? If so, are there any ECC errors
logged? Do you use physical disks/ssd or some form of storage
virtualization?

Also, in the absolute majority of cases corruption is caught by checksums.
I am not familiar with the WAL protocol - do we have enough checksums when
writing it out and on the wire? I suspect there are many more things
PostgreSQL can do to be more resilient, and at least detect corruption
earlier.

-- 
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047




[HACKERS] Foreign tables privileges not shown in information_schema.table_privileges

2017-08-10 Thread Nicolas Thauvin
Hello,

The information_schema.table_privileges view filters on regular tables
and views. Foreign tables are not shown in this view but they are in
other views of the information_schema like tables or column_privileges.

Is it intentional? A patch is attached if not.

Thanks
-- 
Nicolas Thauvin
http://dalibo.com - http://dalibo.org
diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index 98bcfa0..5398271 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -1868,7 +1868,7 @@ CREATE VIEW table_privileges AS
  ) AS grantee (oid, rolname)
 
 WHERE c.relnamespace = nc.oid
-  AND c.relkind IN ('r', 'v', 'p')
+  AND c.relkind IN ('r', 'v', 'f', 'p')
   AND c.grantee = grantee.oid
   AND c.grantor = u_grantor.oid
   AND c.prtype IN ('INSERT', 'SELECT', 'UPDATE', 'DELETE', 'TRUNCATE', 'REFERENCES', 'TRIGGER')

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-08-10 Thread Sokolov Yura

On 2017-07-18 20:20, Sokolov Yura wrote:

On 2017-06-05 16:22, Sokolov Yura wrote:

Good day, everyone.

This patch improves performance of contended LWLock.
The patch makes lock acquiring a single CAS loop:
1. LWLock->state is read, and the ability to acquire the lock is detected.
   If there is a possibility to take the lock, a CAS is tried.
   If the CAS is successful, the lock is acquired (same as the original
   version).
2. But if the lock is currently held by another backend, we check the
   ability to take the WaitList lock. If the wait list lock is not held
   by anyone, a CAS is performed to take the WaitList lock and set
   LW_FLAG_HAS_WAITERS at once. If the CAS is successful, then the LWLock
   was still held at the moment the wait list lock was taken - this
   proves the correctness of the new algorithm. The Proc is then queued
   to the wait list.
3. Otherwise spin_delay is performed, and the loop returns to step 1.



I'm sending a rebased version with a couple of one-line tweaks
(less skip_wait_list on shared lock, and don't update spin stats on
acquiring).


With regards,


Here are results for zipfian distribution (50/50 r/w) in conjunction
with "lazy hash table for XidInMVCCSnapshot":
(https://www.postgresql.org/message-id/642da34694800dab801f04c62950ce8a%40postgrespro.ru)

clients | master | hashsnap2 | hashsnap2_lwlock
++---+--
 10 | 203384 |203813 |   204852
 20 | 334344 |334268 |   363510
 40 | 228496 |231777 |   383820
 70 | 146892 |148173 |   221326
110 |  99741 |111580 |   157327
160 |  65257 | 81230 |   112028
230 |  38344 | 56790 |77514
310 |  22355 | 39249 |55907
400 |  13402 | 26899 |39742
500 |   8382 | 17855 |28362
650 |   5313 | 11450 |17497
800 |   3352 |  7816 |11030
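
For readers skimming the thread, a minimal sketch of the single-CAS-loop
acquire described in the quoted text above (C11 atomics and invented
constants; not the patch itself):

#include <stdatomic.h>
#include <stdbool.h>

#define LW_VAL_EXCLUSIVE	1u
#define LW_FLAG_LOCKED		(1u << 30)	/* wait-list spinlock bit */
#define LW_FLAG_HAS_WAITERS	(1u << 31)

/*
 * Returns true if the lock was acquired; false if the caller now holds
 * the wait-list lock with HAS_WAITERS set and must queue itself.
 */
static bool
lwlock_acquire_or_queue(_Atomic unsigned int *state)
{
	unsigned int old = atomic_load_explicit(state, memory_order_relaxed);

	for (;;)
	{
		if ((old & LW_VAL_EXCLUSIVE) == 0)
		{
			/* looks free: one CAS takes the lock */
			if (atomic_compare_exchange_weak(state, &old,
											 old | LW_VAL_EXCLUSIVE))
				return true;
		}
		else if ((old & LW_FLAG_LOCKED) == 0)
		{
			/*
			 * Held by someone else: one CAS takes the wait-list lock and
			 * sets HAS_WAITERS.  Success proves the LWLock was still held
			 * at that instant, so queueing is safe.
			 */
			if (atomic_compare_exchange_weak(state, &old,
						old | LW_FLAG_LOCKED | LW_FLAG_HAS_WAITERS))
				return false;
		}
		else
		{
			/* both busy: spin, then re-read and retry */
			old = atomic_load_explicit(state, memory_order_relaxed);
		}
	}
}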

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Aleksander Alekseev
Hi Chris,

> I ran into a funny situation today regarding PostgreSQL replication and
> wal corruption and wanted to go over what I think happened and what I
> wonder about as a possible solution.

Sad story. Unfortunately I have no idea what the reason could be, nor can
I suggest a good way to find it unless there is an already known sequence
of steps that reproduces the issue.

I just wanted to point out that a hardware issue or third-party software
issues (bugs in the FS, software RAID, ...) cannot be fully excluded from
the list of suspects. According to the talk by Christophe Pettus [1],
it's not as uncommon as most people think.

If the issue reproduces from time to time on one replica and doesn't on
the second, identical replica, there is a good chance that you've hit a
hardware issue. Another thing worth checking is the SMART status of the
hard drive.

[1]: 
http://www.pgcon.org/2017/schedule/attachments/453_corruption-pgcon-2017.pdf

-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Chris Travers
> Yes.  Exactly the same output until a certain point where pg_xlogdump dies
> with an error.  That is the same LSN where PostgreSQL died with an error on
> restart.
>

One other thing that is possibly relevant after reading through the last
bug report is the error pg_xlogdump gives:

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
1146880



> --
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2017-08-10 Thread Sokolov Yura

On 2017-07-31 18:56, Sokolov Yura wrote:

Good day, everyone.

In an attempt to improve the performance of YCSB on a zipfian
distribution, it was found that significant time is spent in
XidInMVCCSnapshot scanning the snapshot->xip array. While the overall
CPU time is not too noticeable, it has a measurable impact on
scalability.

First I tried to sort snapshot->xip in GetSnapshotData, and search in a
sorted array. But since snapshot->xip is not touched if no transaction
contention occurs, always sorting xip is not the best option.

Then I sorted the xip array on demand in XidInMVCCSnapshot, only if a
search in snapshot->xip occurs (i.e. lazy sorting). It performs much
better, but since it is O(N log N), the sort's execution time is
noticeable for a large number of clients.

The third approach (present in the attached patch) is building a hash
table lazily on the first search in the xip array.

Note: the hash table is not built if the number of "in-progress" xids
is less than 60. Tests show there is no big benefit from building it
below that (at least on Intel Xeon).

With regards,


Here is a new, more correct patch version, and results for pgbench with
zipfian distribution (50% read, 50% write) (same as Alik's tests at
https://www.postgresql.org/message-id/bf3b6f54-68c3-417a-bfab-fb4d66f2b...@postgrespro.ru)



clients | master | hashsnap2 | hashsnap2_lwlock
++---+--
 10 | 203384 |203813 |   204852
 20 | 334344 |334268 |   363510
 40 | 228496 |231777 |   383820
 70 | 146892 |148173 |   221326
110 |  99741 |111580 |   157327
160 |  65257 | 81230 |   112028
230 |  38344 | 56790 |77514
310 |  22355 | 39249 |55907
400 |  13402 | 26899 |39742
500 |   8382 | 17855 |28362
650 |   5313 | 11450 |17497
800 |   3352 |  7816 |11030

(Note: I'm not quite sure why my numbers for master are lower than
Alik's. Probably our test methodology differs a bit. Perhaps it is
because I don't recreate tables between client-count changes, only
between branch switches.)

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

From 80787aee44b7f9b5389476e45f2241073b6c89ee Mon Sep 17 00:00:00 2001
From: Sokolov Yura 
Date: Mon, 10 Jul 2017 12:34:48 +
Subject: [PATCH] Make hash table for xip in XidInMVCCSnapshot

When a lot of concurrent transactions attempt to update a single
row, the linear scan through the running list in XidInMVCCSnapshot
becomes a noticeable bottleneck.

With this change, a hash table is built on the first search of an xid
in the snapshot->xip and snapshot->subxip arrays.

If the size of the array is smaller than 60, a linear scan is still
used, since there is not much benefit from building a hash then
(at least on Intel Xeon).
---
 src/backend/storage/ipc/procarray.c | 64 +++--
 src/backend/utils/time/snapmgr.c| 24 +++
 src/backend/utils/time/tqual.c  | 81 +++--
 src/include/utils/snapshot.h| 11 +
 4 files changed, 138 insertions(+), 42 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index a7e8cf2..18cc233 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1469,6 +1469,49 @@ GetMaxSnapshotSubxidCount(void)
 }
 
 /*
+ * ExtendXipSizeForHash - calculate xip array size with space for hash table
+ */
+int
+ExtendXipSizeForHash(int xipcnt, uint8* plog)
+{
+	int size;
+	uint8 log = 0;
+
+	size = xipcnt;
+	if (xipcnt >= SnapshotMinHash)
+	{
+		log = 1;
+		while (xipcnt) {
+			log++;
+			xipcnt >>= 1;
+		}
+		size += 1<xip = (TransactionId *)
-			malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
-		if (snapshot->xip == NULL)
-			ereport(ERROR,
-	(errcode(ERRCODE_OUT_OF_MEMORY),
-	 errmsg("out of memory")));
-		Assert(snapshot->subxip == NULL);
-		snapshot->subxip = (TransactionId *)
-			

Re: [HACKERS] pl/perl extension fails on Windows

2017-08-10 Thread Ashutosh Sharma
On Thu, Aug 10, 2017 at 1:55 AM, Robert Haas  wrote:
> On Tue, Aug 8, 2017 at 12:15 PM, Sandeep Thakkar
>  wrote:
>> I copied and pasted that portion of the build log into file build.log
>> (attached) for Windows 32bit and Windows 64bit.
>
> Can you also share the output of 'perl -V' on each system?
>
> Comparing the 32-bit log and the 64-bit log, I see the following differences:
>
> 32-bit has extra options /IC:\pgBuild32\uuid\include /Oy- /analyze- /D
> _USE_32BIT_TIME_T
> 64-bit has extra options /D WIN64 /D CONSERVATIVE /D USE_SITECUSTOMIZE
> Both builds have several copies of /IC:\pgBuild32\include or
> /IC:\pgBuild64\include, but the 64-bit build has an extra one
>
> (I wrote that command on Linux, might need different quoting to work
> on Windows.)
>
> I'm suspicious about _USE_32BIT_TIME_T.  The contents of a
> PerlInterpreter are defined in Perl's intrpvar.h, and one of those
> lines is:
>
> PERLVAR(I, basetime,Time_t) /* $^T */
>
> Now, Time_t is defined as time_t elsewhere in the Perl headers, so if
> the plperl build and the Perl interpreter build had different ideas
> about whether that flag was set, the interpreter sizes would be
> different.  Unfortunately for this theory, if I'm interpreting the
> screenshot you posted correctly, the sizes are different by exactly 16
> bytes, and I can't see how a difference in the size of time_t could
> account for more than 8 bytes (4 bytes of actual size change + 4 bytes
> of padding).
>

Okay. I too had a look into this issue and my findings are as follows,

The size of the PerlInterpreter structure in the plperl and perl modules
is not the same. The size of PerlInterpreter in plperl is 2744 bytes
whereas in perl it is 2760 bytes, i.e. there is a difference of 16
bytes, and therefore the handshake keys generated in the two modules
are not the same, resulting in a binary mismatch error. To analyse this
problem, I generated the preprocessed output of the header file 'perl.h'
for the plperl and perl modules, then filtered the contents of "struct
interpreter" from the preprocessed output files and compared them
using a diff tool. The diff report is attached to this email.

As per the diff report, in the case of the plperl module the structure
Stat_t gets mapped to "_stat32i64", whereas in the case of the perl
module the same structure gets mapped to "_stat64", and this happens for
a couple of variables (statbuf & statcache) in the intrpvar.h file.
However, if the preprocessor option '_USE_32BIT_TIME_T' is explicitly
passed to the VC compiler when compiling the perl source, there is no
diff observed, as in both cases 'Stat_t' gets mapped to "_stat32i64".
So now the question is why the '_USE_32BIT_TIME_T' flag is not
implicitly defined when compiling the perl source on a 32-bit platform,
while for the postgresql/plperl build it is being defined implicitly on
the 32-bit platform. I got a few hints from the Makefile in the perl
source code, where it is mentioned that the modern Visual C compilers
(VC++ 8.0 onwards) change time_t from 32-bit to 64-bit, even in 32-bit
mode; hence '_USE_32BIT_TIME_T' is not being added to $Config{ccflags}.
Here is what I could see in GNUMakefile/Makefile.mk in the perl source
code:

# In VS 2005 (VC++ 8.0) Microsoft changes time_t from 32-bit to
# 64-bit, even in 32-bit mode.  It also provides the _USE_32BIT_TIME_T
# preprocessor option to revert back to the old functionality for
# backward compatibility.  We define this symbol here for older 32-bit
# compilers only (which aren't using it at all) for the sole purpose
# of getting it into $Config{ccflags}.  That way if someone builds
# Perl itself with e.g. VC6 but later installs an XS module using VC8
# the time_t types will still be compatible.

ifeq ($(WIN64),undef)
ifeq ($(PREMSVC80),define)
BUILDOPT += -D_USE_32BIT_TIME_T
endif
endif

So, the question is whether we should allow the preprocessor option
'_USE_32BIT_TIME_T' to be defined implicitly or not in postgresql,
provided we are using a Visual C compiler version > 8.0. I think this
should make plperl compatible with the perl binaries.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
[Attachment: "Text Compare" diff report, produced 8/10/2017 1:44:37 PM]
Mode: Differences
Left file:  C:\Users\ashu\perl-5.24.1\perl_interpreter.out
Right file: C:\Users\ashu\git-clone-postgresql\postgresql\plper_intrpreter.out

struct _stat64 Istatbuf;     <>  struct _stat32i64 Istatbuf;
struct _stat64 Istatcache;   <>  struct _stat32i64 Istatcache;






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Chris Travers
Sorry, meant to reply all.

On Thu, Aug 10, 2017 at 2:19 PM, Vladimir Borodin  wrote:

> Hi, Chris.
>
> On 10 Aug 2017, at 15:09, Chris Travers wrote:
>
> Hi;
>
> I ran into a funny situation today regarding PostgreSQL replication and
> wal corruption and wanted to go over what I think happened and what I
> wonder about as a possible solution.
>
> Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB
> database with variable load.  Master database has two slaves and generates
> 10-20MB of WAL traffic a second.  The data_checksum option is off.
>
>
> The problem occurred when I attempted to restart the service on the slave
> using pg_ctl (I believe the service had been started with sys V init
> scripts).  On trying to restart, it gave me a nice "Invalid memory
> allocation request" error and promptly stopped.
>
> The main logs showed a lot of messages like before the restart:
> 2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match
> any process
>
> On restart, the following was logged to stderr:
> LOG:  entering standby mode
> LOG:  redo starts at 1E39C/8B77B458
> LOG:  consistent recovery state reached at 1E39C/E1117FF8
> FATAL:  invalid memory alloc request size 3456458752
> LOG:  startup process (PID 18167) exited with exit code 1
> LOG:  terminating any other active server processes
>
> LOG:  database system is shut down
>
> After some troubleshooting I found that the wal segment had become
> corrupt, I copied the correct one from the master and everything came up to
> present.
>
> So it seems like somewhere something crashed big time on the back-end and
> when we tried to restart, the wal ended in an invalid way.
>
>
> We have reported the same thing [1] nearly a year ago. Could you please
> check with pg_xlogdump that both WALs (normal from master and corrupted)
> are exactly the same up to a certain LSN?
>
> [1] https://www.postgresql.org/message-id/20160614103415.
> 5796.6885%40wrigleys.postgresql.org
>

Yes.  Exactly the same output until a certain point where pg_xlogdump dies
with an error.  That is the same LSN where PostgreSQL died with an error on
restart.

>
>
> I am wondering what can be done to prevent these sorts of things from
> happening in the future if, for example, a replica dies in the middle of a
> wal fsync.
> --
> Best Wishes,
> Chris Travers
>
> Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
> lock-in.
> http://www.efficito.com/learn_more
>
>
>
> --
> May the force be with you…
> https://simply.name
>
>


-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


[HACKERS] Comment typo in plannodes.h

2017-08-10 Thread Etsuro Fujita

Hi,

Here is a small patch for fixing a comment typo in plannodes.h: s/all 
the partitioned table/all the partitioned tables/.


Best regards,
Etsuro Fujita
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f1a1b24..7c51e7f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -67,7 +67,7 @@ typedef struct PlannedStmt
 
/*
 * rtable indexes of non-leaf target relations for UPDATE/DELETE on all
-* the partitioned table mentioned in the query.
+* the partitioned tables mentioned in the query.
 */
List   *nonleafResultRelations;
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Vladimir Borodin
Hi, Chris.

> On 10 Aug 2017, at 15:09, Chris Travers wrote:
> 
> Hi;
> 
> I ran into a funny situation today regarding PostgreSQL replication and wal 
> corruption and wanted to go over what I think happened and what I wonder 
> about as a possible solution.
> 
> Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB 
> database with variable load.  Master database has two slaves and generates 
> 10-20MB of WAL traffic a second.  The data_checksum option is off.
> 
> 
> The problem occurred when I attempted to restart the service on the slave 
> using pg_ctl (I believe the service had been started with sys V init 
> scripts).  On trying to restart, it gave me a nice "Invalid memory allocation 
> request" error and promptly stopped.
> 
> The main logs showed a lot of messages like before the restart:
> 2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any 
> process
> 
> On restart, the following was logged to stderr:
> LOG:  entering standby mode
> LOG:  redo starts at 1E39C/8B77B458
> LOG:  consistent recovery state reached at 1E39C/E1117FF8
> FATAL:  invalid memory alloc request size 3456458752
> LOG:  startup process (PID 18167) exited with exit code 1
> LOG:  terminating any other active server processes
> LOG:  database system is shut down
> 
> After some troubleshooting I found that the wal segment had become corrupt, I 
> copied the correct one from the master and everything came up to present.
> 
> So it seems like somewhere something crashed big time on the back-end and 
> when we tried to restart, the wal ended in an invalid way.

We have reported the same thing [1] nearly a year ago. Could you please check 
with pg_xlogdump that both WALs (normal from master and corrupted) are exactly 
the same up to a certain LSN?

[1] 
https://www.postgresql.org/message-id/20160614103415.5796.6885%40wrigleys.postgresql.org

> 
> I am wondering what can be done to prevent these sorts of things from 
> happening in the future if, for example, a replica dies in the middle of a 
> wal fsync. 
> -- 
> Best Wishes,
> Chris Travers
> 
> Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor 
> lock-in.
> http://www.efficito.com/learn_more 

--
May the force be with you…
https://simply.name



Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-10 Thread Ashutosh Bapat
On Thu, Aug 10, 2017 at 3:13 PM, Thomas Munro
 wrote:
> On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat
>  wrote:
>> Your patch didn't improve planning time without partition-wise join,
>> so it's something good to have along with partition-wise join. Given
>> that Bitmapsets are used in other parts of code as well, the
>> optimization may affect those parts as well, esp. the overhead of
>> maintaining first_non_empty_wordnum.
>
> Maybe, but if you consider that this container already deals with the
> upper bound moving up by reallocating and copying the whole thing,
> adjusting an int when the lower bound moves down doesn't seem like
> anything to worry about...

Yeah. Maybe we should check whether that makes any difference to the
planning times of TPC-H queries.

>
>> Do you think solving this problem is a prerequisite for
>> partition-wise join? Or should we propose that patch as a separate
>> enhancement?
>
> No, I'm not proposing anything yet.  For now I just wanted to share
> this observation about where hot CPU time goes in simple tests, and
> since it turned out to be a loop in a loop that I could see an easy to
> way to fix for singleton sets and sets with a small range, I couldn't
> help trying it...  But I'm still trying to understand the bigger
> picture.  I'll be interested to compare profiles with the ordered
> append_rel_list version you have mentioned, to see how that moves the
> hot spots.

build_simple_rel(), which contains that loop, takes only 0.23% of
planning time, so I doubt that's going to change much:

+   0.23%  postgres  postgres  [.] build_simple_rel

>
> I guess one very practical question to ask is: can we plan queries
> with realistic numbers of partitioned tables and partitions in
> reasonable times?  Well, it certainly looks very good for hundreds of
> partitions so far...  My own experience of partitioning with other
> RDBMSs has been on that order, 'monthly partitions covering the past
> 10 years' and similar, but on the other hand it wouldn't be surprising
> to learn that people want to go to many thousands, especially for
> schemas which just keep adding partitions over time and don't want to
> drop them.  As for hash partitioning, that seems to be typically done
> with numbers like 16, 32 or 64 in other products from what I can
> glean.  Speculation: perhaps hash partitioning is more motivated by
> parallelism than data maintenance and thus somehow anchored to the
> ground by core counts; if so no planning performance worries there I
> guess (until core counts double quite a few more times).

Agreed.

>
> One nice thing about the planning time is that restrictions on the
> partition key cut down planning time; so where I measure ~7 seconds to
> plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions,
> if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so
> if someone has a keep-adding-more-partitions-over-time model then at
> least their prunable current day/week/whatever queries will not suffer
> quite so badly.  (Yeah my computer seems to be a lot slower than yours
> for these tests; clang -O2 no asserts on a mid 2014 MBP with i7 @
> 2.2Ghz).

That's an interesting observation. Thanks for sharing it.

>
> Curious: would you consider joins between partitioned tables and
> non-partitioned tables where the join is pushed down to be a kind of
> "partition-wise join", or something else?  If so, would that be a
> special case, or just the logical extreme case for
> 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
> one single "partition" on the non-partitioned side maps to all the
> partitions on the partitioned size?
>

Parameterized nested loop joins with the partition key as a parameter
simulate something like that. Apart from that case, I don't see any
case where such a join would be more efficient compared to the current
method of ganging all partitions together and joining them to the
unpartitioned table. But oh wait, that could be useful in sharding,
when the unpartitioned table is replicated and the partitioned table
is distributed across shards. So, yes, that's a useful case. I am not
sure whether it's some kind of partition-wise join; it doesn't matter,
it looks useful. That said, I am not planning to handle it in the near
future.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Funny WAL corruption issue

2017-08-10 Thread Chris Travers
Hi;

I ran into a funny situation today regarding PostgreSQL replication and wal
corruption and wanted to go over what I think happened and what I wonder
about as a possible solution.

Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB
database with variable load.  Master database has two slaves and generates
10-20MB of WAL traffic a second.  The data_checksum option is off.


The problem occurred when I attempted to restart the service on the slave
using pg_ctl (I believe the service had been started with sys V init
scripts).  On trying to restart, it gave me a nice "Invalid memory
allocation request" error and promptly stopped.

The main logs showed a lot of messages like before the restart:

2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any
process


On restart, the following was logged to stderr:

LOG:  entering standby mode

LOG:  redo starts at 1E39C/8B77B458

LOG:  consistent recovery state reached at 1E39C/E1117FF8

FATAL:  invalid memory alloc request size 3456458752

LOG:  startup process (PID 18167) exited with exit code 1

LOG:  terminating any other active server processes

LOG:  database system is shut down

After some troubleshooting I found that the wal segment had become corrupt,
I copied the correct one from the master and everything came up to present.

So it seems like somewhere something crashed big time on the back-end and
when we tried to restart, the wal ended in an invalid way.

I am wondering what can be done to prevent these sorts of things from
happening in the future if, for example, a replica dies in the middle of a
wal fsync.
-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 2:50 AM, Andreas Seltenreich  wrote:
> Will do.  Won't miss this chance to try out discostu's extension
> pg_rage_terminator[1] :-)
> [1]  https://github.com/disco-stu/pg_rage_terminator

Oh, that's *awesome*.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Walsender timeouts and large transactions

2017-08-10 Thread Sokolov Yura

On 2017-08-09 16:23, Petr Jelinek wrote:

On 02/08/17 19:35, Yura Sokolov wrote:
The following review has been posted through the commitfest 
application:

make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

There is no check for (last_reply_timestamp <= 0 || wal_sender_timeout
<= 0) as in other places (in WalSndKeepaliveIfNecessary, for example).

I don't think moving the update of 'now' down to the end of the loop
body is correct: there are calls to ProcessConfigFile (with
SyncRepInitConfig) and ProcessRepliesIfAny that can last a
non-negligible time. It could lead to oversleeping due to a larger
computed sleeptime.

Though I could be mistaken.

I'm not sure about moving `if (!pq_is_send_pending())` in the loop body
to after WalSndKeepaliveIfNecessary.

Is it necessary? But it looks harmless at least.



We also need to do actual timeout handling so that the timeout is not
deferred to the end of the transaction (which is why I moved `if
(!pq_is_send_pending())` under the WalSndCheckTimeOut() and
WalSndKeepaliveIfNecessary() calls).



If the standby has really stalled, then it will not read from the
socket, and then `pq_is_send_pending` eventually will return false, and
the timeout will be checked.

If the standby doesn't stall, then `last_reply_timestamp` will be
updated in `ProcessRepliesIfAny`, and so the timeout will not be
triggered. Do I understand correctly, or did I miss something?

Could the patch be reduced to a check after the first
`if (!pq_is_send_pending())`? Like:


if (!pq_is_send_pending())
-   return;
+   {
+   if (last_reply_timestamp <= 0 || wal_sender_timeout <= 0)
+   {
+   CHECK_FOR_INTERRUPTS();
+   return;
+   }
+		if (now <= TimestampTzPlusMilliseconds(last_reply_timestamp,
+											   wal_sender_timeout / 2))
+			return;
+   }

If not, what problem prevents it?


We should do CHECK_FOR_INTERRUPTS() independently of pq_is_send_pending
so that it's possible to stop walsender while it's processing large
transaction.


In this case CHECK_FOR_INTERRUPTS will be called after
wal_sender_timeout/2.
(This diff is for the first appearance of `pq_is_send_pending` in the
function).


If it should be called more often, then the patch could be simpler:

if (!pq_is_send_pending())
-   return;
+   {
+   CHECK_FOR_INTERRUPTS();
+   if (last_reply_timestamp <= 0 || wal_sender_timeout <= 0 ||
+       now <= TimestampTzPlusMilliseconds(last_reply_timestamp,
+                                          wal_sender_timeout / 2))

+   return;
+   }

(Still, I could be mistaken, it is just suggestion).

Is it hard to add a test for the case this patch fixes?

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql 10: hash indexes testing

2017-08-10 Thread AP
On Sun, Aug 06, 2017 at 04:32:29PM +1000, AP wrote:
> On Sat, Aug 05, 2017 at 04:41:24PM +0530, Amit Kapila wrote:
> > > (On another note, I committed these patches.)
> > 
> > Thanks.
> 
> Seconded. :)
> 
> Now uploading data with fillfactor of 90. I'll know in 2-3 days
> if the new patches are successful (earlier if they did not help).
> 
> I compiled (as apt.postgresql.org does not provide the latest
> beta3 version for stretch; only sid which has a different perl
> version) 10~beta3~20170805.2225-1~593.git0d1f98b.

I've had success with a dataset that failed before with ff 10.

End result:

mdstash=# select * from pgstathashindex('link_datum_id_idx');
 version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
---------+--------------+----------------+--------------+--------------+------------+------------+------------------
       4 |     12391325 |        5148912 |          158 |       191643 | 4560007478 |          0 | 1894.29056075982
(1 row)

mdstash=# select 5148912.0/158/8;
   ?column?
---
 4073.506329113924
(1 row)

The index is 135GB rather than 900GB (from memory/give or take).

I'm happy so far. I'll be seeing how much more I can dump into it this weekend. 
:)

Thanks muchly to the both of you. :)

AP


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default Partition for Range

2017-08-10 Thread Jeevan Ladhe
Hi Robert, Beena,

On Wed, Aug 9, 2017 at 5:53 PM, Robert Haas  wrote:

> On Wed, Aug 9, 2017 at 8:18 AM, Rajkumar Raghuwanshi
>  wrote:
> > --difference in the description of default partition in case of list vs
> > range
> >
> > create table lp (a int) partition by list(a);
> > create table lp_d partition of lp DEFAULT;
> > postgres=# \d+ lp_d
> >                                    Table "public.lp_d"
> >  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> > --------+---------+-----------+----------+---------+---------+--------------+-------------
> >  a      | integer |           |          |         | plain   |              |
> > Partition of: lp DEFAULT
> > Partition constraint:
> >
> > create table rp (a int) partition by range(a);
> > create table rp_d partition of rp DEFAULT;
> > postgres=# \d+ rp_d
> >                                    Table "public.rp_d"
> >  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> > --------+---------+-----------+----------+---------+---------+--------------+-------------
> >  a      | integer |           |          |         | plain   |              |
> > Partition of: rp DEFAULT
> > Partition constraint: true
>
> This looks like a problem with the default list partitioning patch.  I
> think "true" is what we expect to see here in both cases.
>

In the case of list partitioning, if there is only a default partition,
then there is no specific value set that can be excluded from the
default partition. In such a case the DEFAULT partition for a
list-partitioned table simply will not have any constraint on it, and
hence the describe output has no constraint to print.

Whereas in the case of a default partition for a range-partitioned
table, get_qual_for_range() creates a constant-true boolean expression
if the default is the only partition. Here is the relevant code from
Beena's patch:

+ else /* default is the only partition */
+ result = list_make1(makeBoolConst(true, false));

I think it is unnecessary to create a constraint that is always going
to be true. Instead, if we have no constraint at all, we can avoid some
extra CPU cycles which would otherwise be wasted evaluating an
expression that is always true, as sketched below.
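
A minimal sketch of that alternative (only `result` and the else branch
are from the quoted patch; treating an empty qual list as "no
constraint" is my assumption here):

+ else /* default is the only partition */
+ result = NIL; /* no constraint at all, nothing to evaluate */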

Having said this, I see the point that in such a case we should have
some neat, meaningful content printed for "Partition constraint: ".
I think we can address this when we construct the describe output
string.

Some ideas that come to my mind are:
Partition constraint: NIL
Partition constraint: no constraints
No partition constraint.
Partition constraint: true

Please let me know your thoughts.

Regards,
Jeevan Ladhe


[HACKERS] Fix number skipping in to_number

2017-08-10 Thread Oliver Ford
Prevents an issue where numbers can be skipped in the to_number()
function when the format mask contains a "G" or a "," but the input
string doesn't contain a separator. This resolves the TODO item "Fix
to_number() handling for values not matching the format string".

== Change ==

Currently, if the format mask in to_number() has a "G" or a ",", it
will assume that the input string contains a separator character in
the same place. If, however, a digit is there instead, that digit will
be silently skipped and not appear in the output. So we get:

select to_number('34,50','999,99');
 to_number
---
   340
(1 row)

This patch checks the input string when it encounters a "G" or "," in
the format mask. If the separator character is found, the code moves
over it as normal. If it's not found, then the code no longer
increments the relevant pointer so as not to skip the character. After
the patch, we get the correct result:

select to_number('34,50','999,99');
 to_number
---
  3450
(1 row)

This is in line with Oracle's result.
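
The shape of the change, as an illustrative standalone helper (the real
patch edits NUM_processor() in src/backend/utils/adt/formatting.c
inline; eat_separator and its signature are made up for this sketch):

#include <stdbool.h>

/*
 * Consume a group separator from the input only if it is actually
 * there; otherwise leave the input pointer alone so the digit is read
 * by the next format node instead of being silently skipped.
 */
static bool
eat_separator(const char **inout_p, char sep)
{
	if (**inout_p == sep)
	{
		(*inout_p)++;			/* separator present: move past it */
		return true;
	}
	return false;				/* digit (or anything else): stay put */
}

The old code advanced the pointer unconditionally, which is where the
skipped digit came from.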

== Rationale ==

This patch is a small change, which leaves PostgreSQL behavior
different from Oracle behavior in related cases. Oracle's
implementation seems to read from right-to-left, and raises an
"ORA-01722: invalid number" error if there are digits in the input
string which don't have corresponding characters in the format mask. I
have chosen not to throw such errors, because there are use cases for
only returning part of a number string. For example, the following is
an error on Oracle:

select to_number('123,000', '999G') from dual;
Error report -
SQL Error: ORA-01722: invalid number

But if you wanted to only take the characters before the comma, and
discard the thousands part, you can do so on PostgreSQL with:

select to_number('123,000', '999G');
 to_number
---
   123
(1 row)

This is the current behavior. Which is why I think it makes more sense
to do what PostgreSQL currently does and read from left-to-right. The
only change, as mentioned above, is that the current behavior can skip
a digit:

select to_number('123456', '999G999');
 to_number
---
   12356
(1 row)

After the patch, this returns all the digits:

select to_number('123456', '999G999');
 to_number
---
  123456
(1 row)

== Testing ==

Tested on Windows with MinGW using the latest checkout from master.
Added regression tests to check for this new behavior. All existing
tests pass.


0001-apply-number-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanding inheritance in partition bound order

2017-08-10 Thread Beena Emerson
Hi Amit,

On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote
 wrote:
> On 2017/08/05 2:25, Robert Haas wrote:
>> Concretely, my proposal is:
>>
>> P.S. While I haven't reviewed 0002 in detail, I think the concept of
>> minimizing what needs to be built in RelationGetPartitionDispatchInfo
>> is a very good idea.
>
> I put this patch ahead in the list and so it's now 0001.
>

FYI, the 0001 patch throws this warning:

execMain.c: In function ‘ExecSetupPartitionTupleRouting’:
execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
 next_ptinfo->parentid != ptinfo->parentid)

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-10 Thread Thomas Munro
On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat
 wrote:
> Your patch didn't improve planning time without partition-wise join,
> so it's something good to have along with partition-wise join. Given
> that Bitmapsets are used in other parts of code as well, the
> optimization may affect those parts as well, esp. the overhead of
> maintaining first_non_empty_wordnum.

Maybe, but if you consider that this container already deals with the
upper bound moving up by reallocating and copying the whole thing,
adjusting an int when the lower bound moves down doesn't seem like
anything to worry about...
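
For concreteness, here is a standalone sketch of the experiment (only
the field name first_non_empty_wordnum matches what I tried; the rest,
including the fixed-size uint64_t array, is illustrative -- the real
Bitmapset uses bitmapword and a flexible array member):

#include <stdbool.h>
#include <stdint.h>

typedef struct SketchBitmapset
{
	int			nwords;					/* words in use */
	int			first_non_empty_wordnum;	/* all words below this are 0 */
	uint64_t	words[64];
} SketchBitmapset;

/* Subset test that skips the leading all-zero words of 'a'. */
static bool
sketch_is_subset(const SketchBitmapset *a, const SketchBitmapset *b)
{
	for (int w = a->first_non_empty_wordnum; w < a->nwords; w++)
	{
		uint64_t	aw = a->words[w];

		if (aw == 0)
			continue;			/* nothing set in this word */
		if (w >= b->nwords || (aw & ~b->words[w]) != 0)
			return false;		/* 'a' has a bit that 'b' lacks */
	}
	return true;
}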

> The comment at the beginning of the file bitmapset.c says
>  * bitmapset.c
>  *	  PostgreSQL generic bitmap set package
>  *
>  * A bitmap set can represent any set of nonnegative integers, although
>  * it is mainly intended for sets where the maximum value is not large,
>  * say at most a few hundred.
>
> When we create thousands of children, we have certainly crossed the
> few-hundred threshold. So there may be other optimizations possible
> there. Probably we should leave that out of partition-wise join
> patches.

+1

> Do you think solving this problem is a prerequisite for
> partition-wise join? Or should we propose that patch as a separate
> enhancement?

No, I'm not proposing anything yet.  For now I just wanted to share
this observation about where hot CPU time goes in simple tests, and
since it turned out to be a loop in a loop that I could see an easy
way to fix for singleton sets and sets with a small range, I couldn't
help trying it...  But I'm still trying to understand the bigger
picture.  I'll be interested to compare profiles with the ordered
append_rel_list version you have mentioned, to see how that moves the
hot spots.

I guess one very practical question to ask is: can we plan queries
with realistic numbers of partitioned tables and partitions in
reasonable times?  Well, it certainly looks very good for hundreds of
partitions so far...  My own experience of partitioning with other
RDBMSs has been on that order, 'monthly partitions covering the past
10 years' and similar, but on the other hand it wouldn't be surprising
to learn that people want to go to many thousands, especially for
schemas which just keep adding partitions over time and don't want to
drop them.  As for hash partitioning, that seems to be typically done
with numbers like 16, 32 or 64 in other products from what I can
glean.  Speculation: perhaps hash partitioning is more motivated by
parallelism than data maintenance and thus somehow anchored to the
ground by core counts; if so no planning performance worries there I
guess (until core counts double quite a few more times).

One nice thing about the planning time is that restrictions on the
partition key cut down planning time; so where I measure ~7 seconds to
plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions,
if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so
if someone has a keep-adding-more-partitions-over-time model then at
least their prunable current day/week/whatever queries will not suffer
quite so badly.  (Yeah my computer seems to be a lot slower than yours
for these tests; clang -O2 no asserts on a mid 2014 MBP with i7 @
2.2Ghz).

Curious: would you consider joins between partitioned tables and
non-partitioned tables where the join is pushed down to be a kind of
"partition-wise join", or something else?  If so, would that be a
special case, or just the logical extreme case for
0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
one single "partition" on the non-partitioned side maps to all the
partitions on the partitioned side?

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Creating backup history files for backups taken from standbys

2017-08-10 Thread Masahiko Sawada
On Thu, Aug 10, 2017 at 4:50 PM, Michael Paquier
 wrote:
> On Thu, Aug 10, 2017 at 9:45 AM, Masahiko Sawada  
> wrote:
>> Thank you for the patch. Regarding creating the backup history file
>> on standbys, is there any difference from the patch posted on the
>> earlier thread?
>
> That's a rebased version on top of what has been applied, except that
> I have fixed an incorrect comment and shuffled the code building the
> WAL segment name for the stop position.

Understood, I'll review this patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Creating backup history files for backups taken from standbys

2017-08-10 Thread Michael Paquier
On Thu, Aug 10, 2017 at 9:45 AM, Masahiko Sawada  wrote:
> Thank you for the patch. Regarding creating the backup history file
> on standbys, is there any difference from the patch posted on the
> earlier thread?

That's a rebased version on top of what has been applied, except that
I have fixed an incorrect comment and shuffled the code building the
WAL segment name for the stop position.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] reload-through-the-top-parent switch the partition table

2017-08-10 Thread Rushabh Lathia
On Wed, Aug 9, 2017 at 1:20 AM, Robert Haas  wrote:

> On Tue, Aug 8, 2017 at 8:48 AM, Rushabh Lathia 
> wrote:
> > It seems like we set the numParents and parents only for the
> > dumpable objects (flagInhTables()). The current patch relies on the
> > numParents and parents to get the root partition TableInfo, but when
> > --schema has been specified it doesn't load that information for
> > objects which are not dumpable.
> >
> > Now one option is:
> >
> > 1) restrict --load-via-partition-root from being used with --schema,
> > or maybe some other options where we restrict the objects.
> >
> > Consider this: the partition root is in schema 'a' and the partition
> > table is in schema 'b'. If someone specifies --schema b with
> > --load-via-partition-root, I think we should not do "INSERT INTO
> > a.tab" to load the data (because the user specified --schema b).
> >
> > 2) fix flagInhTables() to load the numParents and parents information
> > for the partition table (can be checked using ispartition),
> > irrespective of whether the object is dumpable or not.
>
> (1) seems like a pretty arbitrary restriction, so I don't like that
> option.  (2) would hurt performance in some use cases.  Do we have an
> option (3)?
>

How about guarding option 2) with the load-via-partition-root flag?
That is, load the parents information even when dump is not set, but
only when load-via-partition-root is used and ispartition is true.

This won't hurt performance in the normal cases.

Attaching the patch.



>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Rushabh Lathia
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index bafa031..65f7e98 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -889,6 +889,16 @@ PostgreSQL documentation
  
 
  
+  --load-via-partition-root
+  
+   
+Dump data via the top-most partitioned table (rather than partition
+table) when dumping data for the partition table.
+   
+  
+ 
+
+ 
--section=sectionname

  
diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
index aa944a2..54b8e7e 100644
--- a/doc/src/sgml/ref/pg_dumpall.sgml
+++ b/doc/src/sgml/ref/pg_dumpall.sgml
@@ -431,6 +431,16 @@ PostgreSQL documentation
  
 
  
+  --load-via-partition-root
+  
+   
+Dump data via the top-most partitioned table (rather than partition
+table) when dumping data for the partition table.
+   
+  
+ 
+
+ 
   --use-set-session-authorization
   

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 47191be..65c2d7f 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -66,7 +66,7 @@ static int	numExtensions;
 static ExtensionMemberId *extmembers;
 static int	numextmembers;
 
-static void flagInhTables(TableInfo *tbinfo, int numTables,
+static void flagInhTables(Archive *fout, TableInfo *tbinfo, int numTables,
 			  InhInfo *inhinfo, int numInherits);
 static void flagInhAttrs(DumpOptions *dopt, TableInfo *tblinfo, int numTables);
 static DumpableObject **buildIndexArray(void *objArray, int numObjs,
@@ -243,7 +243,7 @@ getSchemaData(Archive *fout, int *numTablesPtr)
 	/* Link tables to parents, mark parents of target tables interesting */
 	if (g_verbose)
 		write_msg(NULL, "finding inheritance relationships\n");
-	flagInhTables(tblinfo, numTables, inhinfo, numInherits);
+	flagInhTables(fout, tblinfo, numTables, inhinfo, numInherits);
 
 	if (g_verbose)
 		write_msg(NULL, "reading column info for interesting tables\n");
@@ -304,9 +304,10 @@ getSchemaData(Archive *fout, int *numTablesPtr)
  * modifies tblinfo
  */
 static void
-flagInhTables(TableInfo *tblinfo, int numTables,
+flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
 			  InhInfo *inhinfo, int numInherits)
 {
+	DumpOptions *dopt = fout->dopt;
 	int			i,
 j;
 	int			numParents;
@@ -322,7 +323,17 @@ flagInhTables(TableInfo *tblinfo, int numTables,
 
 		/* Don't bother computing anything for non-target tables, either */
 		if (!tblinfo[i].dobj.dump)
+		{
+			/*
+			 * Load the parents information for the partition table when
+			 * the load-via-partition-root option is set. As we need the
+			 * parents information to get the partition root.
+			 */
+			if (dopt->load_via_partition_root &&
+				tblinfo[i].ispartition)
+				findParentsByOid(&tblinfo[i], inhinfo, numInherits);
 			continue;
+		}
 
 		/* Find all the immediate parent tables */
		findParentsByOid(&tblinfo[i], inhinfo, numInherits);
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 144068a..ce3100f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -157,6 +157,7 @@ typedef struct _dumpOptions
 	int			outputNoTablespaces;
 	int			use_setsessauth;
 	int			

Re: [HACKERS] Creating backup history files for backups taken from standbys

2017-08-10 Thread Masahiko Sawada
On Thu, Aug 10, 2017 at 3:56 PM, Michael Paquier
 wrote:
> Hi all,
>
> In the recent thread related to a bug in pg_stop_backup when waiting
> for WAL segments to be archived, it has been mentioned that it would
> be nice to get backup history files generated as well on standbys:
> https://www.postgresql.org/message-id/CA+Tgmobh3=quCzugFP5yA3-z5uzz=YKZ7arnoApJoPN7=0h...@mail.gmail.com
>
> Backup history files are not essential to backups, but they are for
> debugging purposes, and it has bugged me hard that we should have a
> consistent behavior in this area, particularly because with
> archive_mode = always they can be archived by a standby.
>
> The previous thread has ended with 52f8a59d, which addressed the
> original problem but discarded the backup history file generation
> during recovery. Attached is a patch to fix this inconsistency, parked
> in next CF. I am not arguing for this change as a bug fix, but as an
> improvement for PG11.
>
> Thoughts?

Thank you for the patch. Regarding creating the backup history file
on standbys, is there any difference from the patch posted on the
earlier thread?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Fix inadequacies in recently added wait events

2017-08-10 Thread Michael Paquier
On Thu, Aug 10, 2017 at 2:52 AM, Michael Paquier
 wrote:
> With a minimal maintenance effort we can be careful enough. I think
> that a comment, for example in pgstat.c, about the usage uniqueness
> would be an appropriate answer.

By the way, let's discuss that on a new thread. I'll try to come up
with a patch that can be used for a base discussion soon.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] More race conditions in logical replication

2017-08-10 Thread Michael Paquier
On Tue, Aug 8, 2017 at 8:11 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Mon, Aug 7, 2017 at 8:14 PM, Alvaro Herrera  
>> wrote:
>> > BTW, I noticed that the PG_WAIT_LOCK value that we're using for wait
>> > event here (and in the replication slot case) is bogus.  We probably
>> > need something new here.
>>
>> Yeah, if you're adding a new wait point, you should add document a new
>> constant in the appropriate section, probably something under
>> PG_WAIT_IPC in this case.
>
> Here's a patch.  It turned to be a bit larger than I initially expected.

Álvaro, 030273b7 did not get things completely right. A couple of wait
events have been added in the docs for the Client, IPC and Activity
sections, but it forgot to update the morerows counters. Attached is a
patch to fix all that.
-- 
Michael


docfix-wait-event-rows.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Creating backup history files for backups taken from standbys

2017-08-10 Thread Michael Paquier
Hi all,

In the recent thread related to a bug in pg_stop_backup when waiting
for WAL segments to be archived, it has been mentioned that it would
be nice to get backup history files generated as well on standbys:
https://www.postgresql.org/message-id/CA+Tgmobh3=quCzugFP5yA3-z5uzz=YKZ7arnoApJoPN7=0h...@mail.gmail.com

Backup history files are not essential to backups, but they are for
debugging purposes, and it has bugged me that we do not have
consistent behavior in this area, particularly because with
archive_mode = always they can be archived by a standby.

The previous thread has ended with 52f8a59d, which addressed the
original problem but discarded the backup history file generation
during recovery. Attached is a patch to fix this inconsistency, parked
in next CF. I am not arguing for this change as a bug fix, but as an
improvement for PG11.

Thoughts?
-- 
Michael


standby-backup-history.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

2017-08-10 Thread Andreas Seltenreich
Tom Lane writes:

> I wonder if Andreas would be interested in trying the randomly-timed-
> SIGTERM thing with sqlsmith.

Will do.  Won't miss this chance to try out discostu's extension
pg_rage_terminator[1] :-)

regards,
Andreas

Footnotes: 
[1]  https://github.com/disco-stu/pg_rage_terminator


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Failover Slots

2017-08-10 Thread Craig Ringer
On 9 August 2017 at 23:42, Robert Haas  wrote:

> On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer 
> wrote:
> >> - When a standby connects to a master, it can optionally supply a list
> >> of slot names that it cares about.
> >
> > Wouldn't that immediately exclude use for PITR and snapshot recovery? I
> have
> > people right now who want the ability to promote a PITR-recovered
> snapshot
> > into place of a logical replication master and have downstream peers
> replay
> > from it. It's more complex than that, as there's a resync process
> required
> > to recover changes the failed node had sent to other peers but isn't
> > available in the WAL archive, but that's the gist.
> >
> > If you have a 5TB database do you want to run an extra replica or two
> > because PostgreSQL can't preserve slots without a running, live replica?
> > Your SAN snapshots + WAL archiving have been fine for everything else so
> > far.
>
> OK, so what you're basically saying here is that you want to encode
> the failover information in the write-ahead log rather than passing it
> at the protocol level, so that if you replay the write-ahead log on a
> time delay you get the same final state that you would have gotten if
> you had replayed it immediately.  I hadn't thought about that
> potential advantage, and I can see that it might be an advantage for
> some reason, but I don't yet understand what the reason is.  How would
> you imagine using any version of this feature in a PITR scenario?  If
> you PITR the master back to an earlier point in time, I don't see how
> you're going to manage without resyncing the replicas, at which point
> you may as well just drop the old slot and create a new one anyway.
>


I've realised that it's possible to work around it in app-space anyway. You
create a new slot on a node before you snapshot it, and you don't drop this
slot until you discard the snapshot. The existence of this slot ensures
that any WAL generated by the node (and replayed by PITR after restore)
cannot clobber needed catalog_xmin. If we xlog catalog_xmin advances or
have some other safeguard in place, which we need for logical decoding on
standby to be safe anyway, then we can fail gracefully if the user does
something dumb.

So no need to care about this.
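
For concreteness, the app-space workaround amounts to something like
this (hypothetical libpq sketch; the connection string, slot name and
output plugin are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
		return 1;

	/* create the guard slot *before* snapshotting the node; it pins
	 * catalog_xmin so WAL replayed by PITR can't clobber it */
	res = PQexec(conn,
				 "SELECT pg_create_logical_replication_slot("
				 "'snapshot_guard', 'test_decoding')");
	fprintf(stderr, "create slot: %s\n",
			PQresStatus(PQresultStatus(res)));
	PQclear(res);

	/* ... take the SAN snapshot / base backup here ... */

	/* drop it only once the snapshot has been discarded */
	res = PQexec(conn,
				 "SELECT pg_drop_replication_slot('snapshot_guard')");
	PQclear(res);
	PQfinish(conn);
	return 0;
}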

(What I wrote previously on this was):

You definitely can't just PITR restore and pick up where you left off.

You need a higher level protocol between replicas to recover. For example,
in a multi-master configuration, this can be something like (simplified):

* Use the timeline history file to find the lsn at which we diverged from
our "future self", the failed node
* Connect to the peer and do logical decoding, with a replication origin
filter for "originating from me", for xacts from the divergence lsn up to
the peer's current end-of-wal.
* Reset peer's replication origin for us to our new end-of-wal, and resume
replication

To enable that to be possible, since we can't rewind slots once confirmed
advanced, maintain a backup slot on the peer corresponding to the
point-in-time at which a snapshot was taken.

For most other situations there is little benefit vs just re-creating
the slot before you permit user-initiated write xacts to begin on the
restored node.

I can accept an argument that "we" as pgsql-hackers do not consider this
something worth caring about, should that be the case. It's niche enough
that you could argue it doesn't have to be supportable in stock postgres.


> Maybe you're thinking of a scenario where we PITR the master and also
> use PITR to rewind the replica to a slightly earlier point?


That can work, but must be done in lock-step. You have to pause apply on
both ends for long enough to snapshot both, otherwise the replication
origins on one end get out of sync with the slots on the other.

Interesting, but I really hope nobody's going to need to do it.


> But I
> can't quite follow what you're thinking about.  Can you explain
> further?
>

Gladly.

I've been up to my eyeballs in this for years now, and sometimes it becomes
quite hard to see the outside perspective, so thanks for your patience.


>
> > Requiring live replication connections could also be an issue for
> > service interruptions, surely?  Unless you persist needed knowledge
> > in the physical replication slot used by the standby-to-master
> > connection, so the master can tell the difference between "downstream
> > went away for a while but will come back" and "downstream is gone
> > forever, toss out its resources."
>
> I don't think the master needs to retain any resources on behalf of
> the failover slot.  If the slot has been updated by feedback from the
> associated standby, then the master can toss those resources
> immediately.  When the standby comes back on line, it will find out
> via a protocol message that it can fast-forward the slot to whatever
> the new LSN is, and any WAL files before that point are irrelevant on
> both the 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-10 Thread Ashutosh Bapat
On Thu, Aug 10, 2017 at 9:28 AM, Thomas Munro
 wrote:
> On Thu, Aug 10, 2017 at 1:39 AM, Thomas Munro
>  wrote:
>> On my computer it took ~1.5 seconds to plan a 1000 partition join,
>> ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
>> 4000 partition join.  I poked around in a profiler a bit and saw that
>> for the 2000 partition case I spent almost half the time in
>> create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
>> and about half of that was in bms_is_subset.  The other half the time
>> was in query_planner->make_one_rel which spent 2/3 of its time in
>> set_rel_size->add_child_rel_equivalences->bms_overlap and the other
>> 1/3 in standard_join_search.
>
> Ashutosh asked me how I did that.  Please see attached. I was
> explaining simple joins like SELECT * FROM foofoo JOIN barbar USING
> (a, b).  Here also is the experimental hack I tried when I saw
> bitmapset.c eating my CPU.
>

On my machine I observed the following planning times:
1000 partitions: without partition-wise join, 100ms; with
partition-wise join, 500ms
2000 partitions: without partition-wise join, 320ms; with
partition-wise join, 2.2s
4000 partitions: without partition-wise join, 1.3s; with
partition-wise join, 17s

So, even without partition-wise join the planning time increases at a
superlinear rate with the number of partitions.

Your patch didn't improve planning time without partition-wise join,
so it's something good to have along with partition-wise join. Given
that Bitmapsets are used in other parts of code as well, the
optimization may affect those parts as well, esp. the overhead of
maintaining first_non_empty_wordnum.

The comment at the beginning of the file bitmapset.c says
 * bitmapset.c
 *	  PostgreSQL generic bitmap set package
 *
 * A bitmap set can represent any set of nonnegative integers, although
 * it is mainly intended for sets where the maximum value is not large,
 * say at most a few hundred.

When we create thousands of children, we have certainly crossed the
few-hundred threshold. So there may be other optimizations possible
there. Probably we should leave that out of the partition-wise join
patches. Do you think solving this problem is a prerequisite for
partition-wise join? Or should we propose that patch as a separate
enhancement?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

2017-08-10 Thread Michael Paquier
On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane  wrote:
>>> Michael Paquier  writes:
 I got the same thought, wondering as well if get_slot_xmins should be
 renamed check_slot_xmins with the is() tests moved inside it as well.
 Not sure if that's worth the API ugliness though.
>
>>> Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
>>> wait_slot_xmins into get_slot_xmins so you can't skip it ...
>
>> Let's do that please. Merging both was my first feeling when
>> refactoring this test upthread. Should I send a patch?
>
> Sure, have at it.

And here you go.
-- 
Michael


tap-simplify-repslot.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers