Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-07 Thread Amit Kapila
On Fri, Oct 7, 2016 at 3:02 PM, Tomas Vondra wrote:
>
> I got access to a large machine with 72/144 cores (thanks to Oleg and
> Alexander from Postgres Professional), and I'm running the tests on that
> machine too.
>
> Results from Dilip's workload (with scale 300, unlogged tables) look like
> this:
>
>                        32      64     128     192     224     256     288
>   master           104943  128579   72167  100967   66631   97088   63767
>   granular-locking 103415  141689   83780  120480   71847  115201   67240
>   group-update     105343  144322   92229  130149   81247  126629   76638
>   no-content-lock  103153  140568   80101  119185   70004  115386   66199
>
> So there's some 20-30% improvement for >= 128 clients.
>

So here we see a performance improvement starting at 64 clients, which is
somewhat similar to what Dilip saw in his tests.

> But what I find much more intriguing is the zig-zag behavior. I mean, 64
> clients give ~130k tps, 128 clients only give ~70k but 192 clients jump up
> to >100k tps again, etc.
>

I don't have a clear explanation for that behavior.

> FWIW I don't see any such behavior on pgbench, and all those tests were done
> on the same cluster.
>
>>> With 4.5.5, results for the same benchmark look like this:
>>>
>>>                      64      128     192
>>>  ---------------------------------------
>>>  master            35693   39822   42151
>>>  granular-locking  35370   39409   41353
>>>  no-content-lock   36201   39848   42407
>>>  group-update      35697   39893   42667
>>>
>>> That seems like a fairly bad regression in the kernel, although I have not
>>> identified the feature/commit causing it (and it's also possible the
>>> issue
>>> lies somewhere else, of course).
>>>
>>> With regular pgbench, I see no improvement on any kernel version. For
>>> example on 3.19 the results look like this:
>>>
>>>                      64      128     192
>>>  ---------------------------------------
>>>  master            54661   61014   59484
>>>  granular-locking  55904   62481   60711
>>>  no-content-lock   56182   62442   61234
>>>  group-update      55019   61587   60485
>>>
>>
>> Are the above results with synchronous_commit=off?
>>
>
> No, but I can do that.
>
>>> I haven't done much more testing (e.g. with -N to eliminate
>>> collisions on branches) yet, let's see if it changes anything.
>>>
>>
>> Yeah, let us see how it behaves with -N. Also, I think we could try
>> a higher scale factor?
>>
>
> Yes, I plan to do that. In total, I plan to test combinations of:
>
> (a) Dilip's workload and pgbench (regular and -N)
> (b) logged and unlogged tables
> (c) scale 300 and scale 3000 (both fit into RAM)
> (d) sync_commit=on/off
>

sounds sensible.
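For reference, one cell of that matrix could be run roughly like this (a
sketch only; the exact flags, durations and config settings here are my
assumptions, not the actual invocation):

    pgbench -i -s 300 --unlogged-tables bench
    pgbench -c 128 -j 128 -T 300 bench        # regular pgbench run
    pgbench -c 128 -j 128 -T 300 -N bench     # -N skips updates to branches/tellers
    # synchronous_commit on/off would be toggled in postgresql.conf for (d)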

Thanks for doing the tests.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-07 Thread Peter Geoghegan
On Sun, Sep 11, 2016 at 11:05 AM, Peter Geoghegan  wrote:
> So, while there are still a few loose ends with this revision (it
> should still certainly be considered WIP), I wanted to get a revision
> out quickly because V1 has been left to bitrot for too long now, and
> my schedule is very full for the next week, ahead of my leaving to go
> on vacation (which is long overdue). Hopefully, I'll be able to get
> out a third revision next Saturday, on top of the
> by-then-presumably-committed new tape batch memory patch from Heikki,
> just before I leave. I'd rather leave with a patch available that can
> be cleanly applied, to make review as easy as possible, since it
> wouldn't be great to have this V2 with bitrot for 10 days or more.

Heikki committed his preload memory patch a little later than
originally expected, 4 days ago. I attach V3 of my own parallel CREATE
INDEX patch, which should be applied on top of a today's git master
(there is a bugfix that reviewers won't want to miss -- commit
b56fb691). I have my own test suite, and have to some extent used TDD
for this patch, so rebasing was not so bad. My tests are rather rough
and ready, so I'm not going to post them here. (Changes in the
WaitLatch() API also caused bitrot, which is now fixed.)

Changes from V2:

* Since Heikki eliminated the need for any extra memtuple "slots"
(memtuples is now only exactly big enough for the initial merge heap),
an awful lot of code could be thrown out that managed sizing memtuples
in the context of the parallel leader (based on trends seen in
parallel workers). I was able to follow Heikki's example by
eliminating code for parallel sorting memtuples sizing. Throwing this
code out let me streamline a lot of stuff within tuplesort.c, which is
cleaned up quite a bit.

* Since this revision was mostly focused on fixing up logtape.c
(rebasing on top of Heikki's work), I also took the time to clarify
some things about how a block-based offset might need to be applied
within the leader. Essentially, outlining how and where that happens,
and where it doesn't and shouldn't happen. (An offset must sometimes
be applied to compensate for differences in logical BufFile positioning
(leader vs. worker) following the leader's unification of worker
tapesets into one big tapeset of its own.)

* max_parallel_workers_maintenance now supersedes the use of the new
parallel_workers index storage parameter. This matches existing heap
storage parameter behavior, and allows the implementation to add
practically no cycles as compared to master branch when the use of
parallelism is disabled by setting max_parallel_workers_maintenance to
0.

* New additions to the chapter in the documentation that Robert added
a little while back, "Chapter 15. Parallel Query". It's perhaps a bit
of a stretch to call this feature part of parallel query, but I think
that it works reasonably well. The optimizer does determine the number
of workers needed here, so while it doesn't formally produce a query
plan, I think the implication that it does is acceptable for
user-facing documentation. (Actually, it would be nice if you really
could EXPLAIN utility commands -- that would be a handy place to show
information about how they were executed.)

Maybe this new documentation describes things in what some would
consider to be excessive detail for users. The relatively detailed
information added on parallel sorting seemed to be in the pragmatic
spirit of the new chapter 15, so I thought I'd see what people
thought.
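Returning to the max_parallel_workers_maintenance item above, usage under
this revision might look roughly like this (a sketch based only on the
description here; the relation and index names are placeholders):

    SET max_parallel_workers_maintenance = 4;   -- allow parallel workers for CREATE INDEX
    CREATE INDEX sort_test_idx ON sort_test (payload);
    SET max_parallel_workers_maintenance = 0;   -- fall back to a serial build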

Work is still needed on:

* Cost model. Should probably attempt to guess final index size, and
derive calculation of number of workers from that. Also, I'm concerned
that I haven't given enough thought to the low end, where with default
settings most CREATE INDEX statements will use at least one parallel
worker.

* The whole way that I teach nbtsort.c to disallow catalog tables for
parallel CREATE INDEX due to concerns about parallel safety is in need
of expert review, preferably from Robert. It's complicated in a way
that relies on things happening or not happening from a distance.

* Heikki seems to want to change more about logtape.c, and its use of
indirection blocks. That may evolve, but for now I can only target the
master branch.

* More extensive performance testing. I think that this V3 is probably
the fastest version yet, what with Heikki's improvements, but I
haven't really verified that.

-- 
Peter Geoghegan


0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz
Description: GNU Zip compressed data


0002-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data


0003-Add-force_btree_randomaccess-GUC-for-testing.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] [COMMITTERS] pgsql: Remove -Wl,-undefined,dynamic_lookup in macOS build.

2016-10-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Oct 7, 2016 at 7:14 PM, Tom Lane  wrote:
>> BTW, OS X hasn't got libintl AFAICT:
>> What are you using to get past that?

> MacPorts.

OK, I'll take a look.

regards, tom lane




Re: [HACKERS] Our "fallback" atomics implementation doesn't actually work

2016-10-07 Thread Andres Freund
On 2016-10-07 17:12:45 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > It's not quite there yet, unfortunately. At the moment
> > pg_atomic_write_u32() is used for local buffers - and we explicitly
> > don't want that to be locking for temp buffers
> > (c.f. 6b93fcd149329d4ee7319561b30fc15a573c6307).
> 
> Hm.
> 
> > Don't really have a great idea about addressing this, besides either
> > just living with the lock for temp buffers on fallback platforms (which
> > don't have much of a practical relevance), or introduce
> > pg_atomic_unlocked_write_u32() or something. Neither seems great.
> 
> Maybe we could hack it with some macro magic that would cause
> pg_atomic_write_u32() to be expanded into a simple assignment in
> localbuf.c only?

I think it's just as well to add a variant that's globally documented to
have no locking; there might be further uses of it. It's already in two
files (bufmgr.c/localbuf.c), and I don't think it's impossible that
further usages will crop up.

Patch that I intend to push soon-ish attached.

Andres
>From b0779abb3add11d4dd745779dd81ea8ecdd00a1d Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 7 Oct 2016 16:55:15 -0700
Subject: [PATCH] Fix fallback implementation of pg_atomic_write_u32().

I somehow had assumed that in the spinlock (in turn possibly using
semaphores) based fallback atomics implementation 32 bit writes could be
done without a lock. As far as the write goes that's correct, since
postgres supports only platforms with single-copy atomicity for aligned
32bit writes.  But writing without holding the spinlock breaks
read-modify-write operations like pg_atomic_compare_exchange_u32(),
since they'll potentially "miss" a concurrent write, which can't happen
in actual hardware implementations.

In 9.6+ when using the fallback atomics implementation this could lead
to buffer header locks not being properly marked as released, and
potentially some related state corruption.  I don't see a related danger
in 9.5 (earliest release with the API), because pg_atomic_write_u32()
wasn't used in a concurrent manner there.

The state variable of local buffers, before this change, was
manipulated using pg_atomic_write_u32(), to avoid unnecessary
synchronization overhead. As that'd not be the case anymore, introduce
and use pg_atomic_unlocked_write_u32(), which does not correctly
interact with RMW operations.

This bug only caused issues when postgres is compiled on platforms
without atomics support (i.e. no common new platform), or when compiled
with --disable-atomics, which explains why this wasn't noticed in
testing.

Reported-By: Tom Lane
Discussion: <14947.1475690...@sss.pgh.pa.us>
Backpatch: 9.5-, where the atomic operations API was introduced.
---
 src/backend/port/atomics.c| 13 +
 src/backend/storage/buffer/bufmgr.c   |  6 +++---
 src/backend/storage/buffer/localbuf.c | 16 
 src/include/port/atomics.h| 25 +++--
 src/include/port/atomics/fallback.h   |  3 +++
 src/include/port/atomics/generic.h|  9 +
 src/include/storage/buf_internals.h   |  3 ++-
 7 files changed, 61 insertions(+), 14 deletions(-)

diff --git a/src/backend/port/atomics.c b/src/backend/port/atomics.c
index 42169a3..d5970e4 100644
--- a/src/backend/port/atomics.c
+++ b/src/backend/port/atomics.c
@@ -104,6 +104,19 @@ pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
 	ptr->value = val_;
 }
 
+void
+pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	/*
+	 * One might think that an unlocked write doesn't need to acquire the
+	 * spinlock, but one would be wrong. Even an unlocked write has to cause a
+	 * concurrent pg_atomic_compare_exchange_u32() (et al) to fail.
+	 */
+	SpinLockAcquire((slock_t *) &ptr->sema);
+	ptr->value = val;
+	SpinLockRelease((slock_t *) &ptr->sema);
+}
+
 bool
 pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 	uint32 *expected, uint32 newval)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2b63cd3..df4c9d7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -821,7 +821,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 			Assert(buf_state & BM_VALID);
 			buf_state &= ~BM_VALID;
-			pg_atomic_write_u32(&bufHdr->state, buf_state);
+			pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 		}
 		else
 		{
@@ -941,7 +941,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
 		buf_state |= BM_VALID;
-		pg_atomic_write_u32(&bufHdr->state, buf_state);
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 	}
 	else
 	{
@@ -3167,7 +3167,7 @@ FlushRelationBuffers(Relation rel)
 		  false);
 
 buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
-pg_atomic_write_u32(&bufHdr->state, buf_state);
+

Re: [HACKERS] [COMMITTERS] pgsql: Remove -Wl,-undefined,dynamic_lookup in macOS build.

2016-10-07 Thread Robert Haas
On Fri, Oct 7, 2016 at 7:14 PM, Tom Lane  wrote:
> I wrote:
>> Robert Haas  writes:
>>> This broke the build for me.  OS X Yosemite, 10.10.5.
>
>> Hm, probably means we need an explicit reference to -lintl when
>> linking libpqwalreceiver.so.
>
> BTW, OS X hasn't got libintl AFAICT:
>
> # ./configure --enable-nls
> ...
> checking for library containing bind_textdomain_codeset... no
> configure: error: a gettext implementation is required for NLS
>
> What are you using to get past that?

MacPorts.

[rhaas pgsql]$ port contents gettext | grep libintl
  /opt/local/include/libintl.h
  /opt/local/lib/libintl.8.dylib
  /opt/local/lib/libintl.a
  /opt/local/lib/libintl.dylib
  /opt/local/share/gettext/intl/libintl.rc
  /opt/local/share/gettext/libintl.jar
[rhaas pgsql]$

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [COMMITTERS] pgsql: Remove -Wl,-undefined,dynamic_lookup in macOS build.

2016-10-07 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> This broke the build for me.  OS X Yosemite, 10.10.5.

> Hm, probably means we need an explicit reference to -lintl when
> linking libpqwalreceiver.so.

BTW, OS X hasn't got libintl AFAICT:

# ./configure --enable-nls
...
checking for library containing bind_textdomain_codeset... no
configure: error: a gettext implementation is required for NLS

What are you using to get past that?

regards, tom lane




Re: [HACKERS] [COMMITTERS] pgsql: Remove -Wl,-undefined,dynamic_lookup in macOS build.

2016-10-07 Thread Tom Lane
Robert Haas  writes:
> On Wed, Oct 5, 2016 at 11:04 PM, Tom Lane  wrote:
>> Remove -Wl,-undefined,dynamic_lookup in macOS build.

> This broke the build for me.  OS X Yosemite, 10.10.5.

Hm, probably means we need an explicit reference to -lintl when
linking libpqwalreceiver.so.

regards, tom lane




Re: [HACKERS] [COMMITTERS] pgsql: Remove -Wl,-undefined,dynamic_lookup in macOS build.

2016-10-07 Thread Robert Haas
On Wed, Oct 5, 2016 at 11:04 PM, Tom Lane  wrote:
> Remove -Wl,-undefined,dynamic_lookup in macOS build.
>
> We don't need this anymore, and it prevents build-time error checking
> that's usually good to have, so remove it.  Undoes one change of commit
> cac765820.
>
> Unfortunately, it's much harder to get a similar effect on other common
> platforms, because we don't want the linker to throw errors for symbols
> that will be resolved in the core backend.  Only macOS and AIX expect the
> core backend executable to be available while linking loadable modules,
> so only these platforms can usefully throw errors for unresolved symbols
> at link time.
>
> Discussion: <2652.1475512...@sss.pgh.pa.us>

This broke the build for me.  OS X Yosemite, 10.10.5.

$ ./configure --enable-nls --with-libraries=/opt/local/lib
--with-includes=/opt/local/include
$ make
...
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -O2  -bundle
-multiply_defined suppress -o libpqwalreceiver.so libpqwalreceiver.o
-L../../../../src/port -L../../../../src/common -L/opt/local/lib
-Wl,-dead_strip_dylibs   -L../../../../src/interfaces/libpq -lpq
-bundle_loader ../../../../src/backend/postgres
Undefined symbols for architecture x86_64:
  "_libintl_gettext", referenced from:
  _libpqrcv_get_conninfo in libpqwalreceiver.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [libpqwalreceiver.so] Error 1
make[1]: *** [all-backend/replication/libpqwalreceiver-recurse] Error 2
make: *** [all-src-recurse] Error 2

Without --enable-nls, or if I back up one commit, it works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fixing inheritance merge behavior in ALTER TABLE ADD CONSTRAINT

2016-10-07 Thread Corey Huinker
On Fri, Oct 7, 2016 at 5:09 PM, Tom Lane  wrote:

> What seems like a saner answer to me is to change the backend so that it
> will accept these ALTER commands in either order, with the same end state.
> The reason it throws an error now, IMO, is just so that blindly issuing
> the same ALTER ADD CONSTRAINT twice will fail.  But we could deal with
> that by saying that it's okay as long as the initially-targeted constraint
> doesn't already have conislocal = true.
>


+1. Been bitten by this one myself.


Re: [HACKERS] Radix tree for character conversion

2016-10-07 Thread Heikki Linnakangas

On 10/07/2016 06:55 PM, Robert Haas wrote:

On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangas  wrote:

Ouch. We should find and document an authoritative source for all the
mappings we have...

I think the next steps here are:

1. Find an authoritative source for all the existing mappings.
2. Generate the radix tree files directly from the authoritative sources,
instead of the existing *.map files.
3. Completely replace the existing binary-search code with this.


It might be best to convert using the existing map files, and then
update the mappings later.  Otherwise, when things break, you won't
know what to blame.


I was thinking that we keep the mappings unchanged, but figure out where 
we got the mappings we have. An authoritative source may well be "file X 
from unicode, with the following tweaks: ...". As long as we have some 
way of representing that, in text files, or in perl code, that's OK.


What I don't want is that the current *.map files are turned into the 
authoritative source files, that we modify by hand. There are no 
comments in them, for starters, which makes hand-editing cumbersome. It 
seems that we have edited some of them by hand already, but we should 
rectify that.


- Heikki





Re: [HACKERS] Our "fallback" atomics implementation doesn't actually work

2016-10-07 Thread Tom Lane
Andres Freund  writes:
> It's not quite there yet, unfortunately. At the moment
> pg_atomic_write_u32() is used for local buffers - and we explicitly
> don't want that to be locking for temp buffers
> (c.f. 6b93fcd149329d4ee7319561b30fc15a573c6307).

Hm.

> Don't really have a great idea about addressing this, besides either
> just living with the lock for temp buffers on fallback platforms (which
> don't have much of a practical relevance), or introduce
> pg_atomic_unlocked_write_u32() or something. Neither seems great.

Maybe we could hack it with some macro magic that would cause
pg_atomic_write_u32() to be expanded into a simple assignment in
localbuf.c only?
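Something along these lines, presumably (a hypothetical sketch, not a
worked-out patch; a macro defined ahead of the call sites would shadow the
inline function):

    /* in localbuf.c only, before any use of pg_atomic_write_u32(): */
    #define pg_atomic_write_u32(ptr, val)	((ptr)->value = (val))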

regards, tom lane




[HACKERS] Fixing inheritance merge behavior in ALTER TABLE ADD CONSTRAINT

2016-10-07 Thread Tom Lane
I've been looking into the misbehavior complained of here:
https://www.postgresql.org/message-id/CADbMkNPT-Jz5PRSQ4RbUASYAjocV_KHUWapR%2Bg8fNvhUAyRpxA%40mail.gmail.com

I originally thought this was pg_dump's fault, but I now think it's not,
or at least that changing the backend's behavior would be a superior
solution.  The problem can be described thus:

create table parent(f1 int);
create table child() inherits(parent);
alter table parent add constraint inh_check check(f1 > 0) not valid;
alter table child add constraint inh_check check(f1 > 0) not valid;

The above sequence fails with
ERROR:  constraint "inh_check" for relation "child" already exists
However, if you swap the order of the two ALTER ADD commands, it succeeds,
with the second ALTER instead saying
NOTICE:  merging constraint "inh_check" with inherited definition
In that case you end up with a child constraint marked as having both
local origin (conislocal=true) and inherited origin (coninhcount=1).
This is the situation that Benedikt's example has, and the problem is
for pg_dump to restore this state.

Now, pg_dump needs to issue separate commands along these lines to restore
NOT VALID constraints, because it has to copy data into the tables before
adding the constraints.  (Otherwise, if any rows not satisfying the
constraint are still in the table, the restore would fail.)  The problem
from pg_dump's viewpoint is that there's nothing particularly guaranteeing
that the two ALTERs will be issued in the "right" order.  Absent other
considerations, it'll tend to issue the ALTERs in alphabetical order,
which means dumping this example exactly as given will work, but not so
much if the table names are like "foo" and "foo_child".

We could possibly try to fix this by adding dependencies, but the
dependencies would have to say that the parent table's constraint depends
on the child table's constraint, which seems pretty weird.

What seems like a saner answer to me is to change the backend so that it
will accept these ALTER commands in either order, with the same end state.
The reason it throws an error now, IMO, is just so that blindly issuing
the same ALTER ADD CONSTRAINT twice will fail.  But we could deal with
that by saying that it's okay as long as the initially-targeted constraint
doesn't already have conislocal = true.
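For illustration, using the example tables above, the end state can be
checked directly:

    SELECT conrelid::regclass AS rel, conislocal, coninhcount
      FROM pg_constraint
     WHERE conname = 'inh_check';
    -- with the proposed behavior, the row for "child" ends up with
    -- conislocal = true and coninhcount = 1 regardless of the ALTER order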

While I was poking at this, I came across a second problem in the same
area, which is that this succeeds:

alter table child add constraint inh_check check(f1 > 0) not valid;
alter table parent add constraint inh_check check(f1 > 0);

After that, you have a parent constraint that claims to be valid and
inheritable, but nonetheless there might be rows in a child table
that don't meet the constraint.  That is BAD.  It would break planner
deductions that depend on believing that check constraints hold for
all rows emitted by an inherited table scan.  (I'm not certain whether
there are any affected cases right now, but it's certainly plausible
that there might be such in future.)  Whatever you think of the other
situation, we need to disallow this.

The attached proposed patch (sans test cases as yet) deals with both
of these things, and doesn't change the results of any existing regression
test cases.

I'm inclined to treat this as a bug and back-patch it as far as we
allow NOT VALID check constraints, which seems to be 9.2.

Comments?

regards, tom lane

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index dbd6094..ea06a57 100644
*** a/src/backend/catalog/heap.c
--- b/src/backend/catalog/heap.c
*** static void StoreConstraints(Relation re
*** 105,110 
--- 105,111 
   bool is_internal);
  static bool MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
  			bool allow_merge, bool is_local,
+ 			bool is_initially_valid,
  			bool is_no_inherit);
  static void SetRelationNumChecks(Relation rel, int numchecks);
  static Node *cookConstraint(ParseState *pstate,
*** AddRelationNewConstraints(Relation rel,
*** 2301,2306 
--- 2302,2308 
  			 */
  			if (MergeWithExistingConstraint(rel, ccname, expr,
  			allow_merge, is_local,
+ 			cdef->initially_valid,
  			cdef->is_no_inherit))
  continue;
  		}
*** AddRelationNewConstraints(Relation rel,
*** 2389,2394 
--- 2391,2397 
  static bool
  MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
  			bool allow_merge, bool is_local,
+ 			bool is_initially_valid,
  			bool is_no_inherit)
  {
  	bool		found;
*** MergeWithExistingConstraint(Relation rel
*** 2436,2457 
  if (equal(expr, stringToNode(TextDatumGetCString(val
  	found = true;
  			}
  			if (!found || !allow_merge)
  ereport(ERROR,
  		(errcode(ERRCODE_DUPLICATE_OBJECT),
  errmsg("constraint \"%s\" for relation \"%s\" already exists",
  	   ccname, RelationGetRelationName(rel;
  
! 			tup = 

Re: [HACKERS] Illegal SJIS mapping

2016-10-07 Thread Heikki Linnakangas

On 09/07/2016 09:50 AM, Kyotaro HORIGUCHI wrote:

Hi,

I found a useless entry in utf8_to_sjis.map


 {0xc19c, 0x815f},


which is apparently an illegal UTF-8 sequence that PostgreSQL
deliberately refuses, so it should be removed, and the attached
patch does that. 0x815f (SJIS) is also mapped from 0xefbcbc (U+FF3C
FULLWIDTH REVERSE SOLIDUS), and that is the right mapping.


Yes, I think you're right. Committed, thanks!


By the way, the file comment at the beginning of UCS_to_SJIS.pl
is the following.

# Generate UTF-8 <--> SJIS code conversion tables from
# map files provided by Unicode organization.
# Unfortunately it is prohibited by the organization
# to distribute the map files. So if you try to use this script,
# you have to obtain SHIFTJIS.TXT from
# the organization's ftp site.

The file was found at the following place thanks to google.

ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/

As the URL is showing, or as written in the file
Public/MAPPINGS/EASTASIA/ReadMe.txt, it is already obsolete and
the *live* definition *may* be found in Unicode Character
Database. But I haven't found SJIS-related informatin there.


If I'm not missing anything, the only available authority would
be JIS X 0208/0213, but what should be implemented seems to be a
maybe-modified MS932, for which I don't know the authority.

Anyway, I ran UCS_to_SJIS.pl with the SHIFTJIS.TXT above and got
quite different mapping files from the current ones.

So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
maintained. If no authoritative information is available, the
generating script is no longer usable. If any other authority is
chosen, the script will have to be modified according to whatever the
new source format is.


The script is clearly intended to read CP932.TXT, rather than 
SHIFTJIS.TXT, despite the comments in it. CP932.TXT can be found at


ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

However, running the script with that doesn't produce exactly what we 
have in utf8_to_sjis.map, either. It's otherwise same, but we have some 
extra mappings:


-  {0xc2a5, 0x5c},
-  {0xc2ac, 0x81ca},
-  {0xe28096, 0x8161},
-  {0xe280be, 0x7e},
-  {0xe28892, 0x817c},
-  {0xe3809c, 0x8160},

Those mappings were added in commit 
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus 
mapping for the invalid 0xc19c UTF-8 byte sequence was also added by 
that commit, as well as a few valid mappings that UCS_to_SJIS.pl also produces.


I can't judge if those mappings make sense. If we can't find an 
authoritative source for them, I suggest that we leave them as they are, 
but also hard-code them to UCS_to_SJIS.pl, so that running that script 
produces those mappings in utf8_to_sjis.map, even though they are not 
present in the CP932.TXT source file.
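To make that concrete, the hard-coding could look roughly like this (a
hypothetical sketch; the real script's variable names and data structures
likely differ, and %utf8_to_sjis / %extra_utf8_to_sjis are made-up names):

    # Extra mappings not present in CP932.TXT, kept for backward compatibility
    # (values taken from the existing utf8_to_sjis.map entries shown above).
    my %extra_utf8_to_sjis = (
        0xc2a5   => 0x5c,
        0xc2ac   => 0x81ca,
        0xe28096 => 0x8161,
        0xe280be => 0x7e,
        0xe28892 => 0x817c,
        0xe3809c => 0x8160,
    );
    # After reading CP932.TXT into %utf8_to_sjis, merge these in before the
    # map file is written out.
    while (my ($utf8, $sjis) = each %extra_utf8_to_sjis)
    {
        $utf8_to_sjis{$utf8} = $sjis unless exists $utf8_to_sjis{$utf8};
    }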


- Heikki





Re: [HACKERS] pgbench vs. wait events

2016-10-07 Thread Robert Haas
On Fri, Oct 7, 2016 at 11:51 AM, Jeff Janes  wrote:
> What happens if you turn fsync off?  Once an xlog file is fully written, it
> is immediately fsynced, even if the backend is holding WALWriteLock or
> wal_insert (or both) at the time, and even if synchrounous_commit is off.
> Assuming this machine has a BBU so that it doesn't have to wait for disk
> rotation, still fsyncs are expensive because the kernel has to find all the
> data and get it sent over to the BBU, while holding locks.

Scale factor 300, 32 clients, fsync=off:

  5  Lock| tuple
 18  LWLockTranche   | lock_manager
 24  LWLockNamed | WALWriteLock
 88  LWLockTranche   | buffer_content
265  LWLockTranche   | wal_insert
373  LWLockNamed | CLogControlLock
496  LWLockNamed | ProcArrayLock
532  Lock| extend
540  LWLockNamed | XidGenLock
545  Lock| transactionid
  27067  Client  | ClientRead
  85364  |

But I'm not sure you're right about the way the fsync=off code works.
I think pg_fsync()/pg_fsync_writethrough()/pg_fsync_no_writethrough()
look at enableFsync and just do nothing if it's false.
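For reference, the no-writethrough variant essentially boils down to this
when fsync is off (a simplified sketch of the fd.c logic, not the verbatim
source):

    int
    pg_fsync_no_writethrough(int fd)
    {
        if (enableFsync)
            return fsync(fd);
        else
            return 0;       /* fsync=off: skip the syscall and report success */
    }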

>> Second, ClientRead becomes a bigger and bigger issue as the number of
>> clients increases; by 192 clients, it appears in 45% of the samples.
>> That basically means that pgbench is increasingly unable to keep up
>> with the server; for whatever reason, it suffers more than the server
>> does from the increasing lack of CPU resources.
>
> I would be careful about that interpretation.  If you asked pgbench, it
> would probably have the opposite opinion.

Yeah, that's possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgbench vs. wait events

2016-10-07 Thread Robert Haas
On Fri, Oct 7, 2016 at 1:39 PM, Andres Freund  wrote:
> On 2016-10-06 14:38:56 -0400, Robert Haas wrote:
>> Obviously, there's a vast increase in TPS, and the backends seem to
>> spend most of their time actually doing work.  ClientRead is now the
>> overwhelmingly dominant wait event, although wal_insert and
>> WALWriteLock contention is clearly still a significant problem.
>> Contention on other locks is apparently quite rare.  Notice that
>> client reads are really significant here - more than 20% of the time
>> we sample what a backend is doing, it's waiting for the next query.
>> It seems unlikely that performance on this workload can be improved
>> very much by optimizing anything other than WAL writing, because no
>> other wait event appears in more than 1% of the samples.
>
> I don't think you meant that, but this sounds a bit like that there's
> nothing to do performance-wise - I don't think that's true.  There's no
> significant lock-contention, yes. But that obviously doesn't mean we can
> use resources more efficiently.

Yeah, right.  Doing the same stuff faster will surely help.  It just
doesn't look like we can get anywhere by improving the locking, unless
it's something related to WAL writing.

>> Second, ClientRead becomes a bigger and bigger issue as the number of
>> clients increases; by 192 clients, it appears in 45% of the samples.
>> That basically means that pgbench is increasingly unable to keep up
>> with the server; for whatever reason, it suffers more than the server
>> does from the increasing lack of CPU resources.
>
> Isn't that more a question of pgbench threads being preempted, so server
> processes can run? Since there's not enough hardware threads for both
> servers and clients to always run, you *have* to switch between them.
> Usually a read from the client after processing a query will cause a
> directed wakeup of the other thread (which is why it's not a very
> frequently observed state), but if the queue of to-be-run tasks is very
> long that'll not happen.

Yeah, maybe I shouldn't have said it suffers more than the server, but
rather along with the server.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-10-07 Thread Heikki Linnakangas

On 09/12/2016 11:08 AM, Michael Paquier wrote:

On Fri, Sep 9, 2016 at 6:49 AM, Andres Freund  wrote:

On 2016-09-08 18:13:06 -0300, Alvaro Herrera wrote:

I can't vouch for the windows stuff, and
the invocations indeed look vulnerable. I'm not sure if the fix actually
matters on windows, given . is the default for pretty much everything
there.


Well, maybe it doesn't matter now but as I understand the fix is going
to enter the next stable upstream perl, so it'll fail eventually.  It'd
be saner to just fix the thing completely so that we can forget about
it.


Yea, it'd need input from somebody on windows. Michael? What happens if
you put a line remove . from INC (like upthread) in the msvc stuff?


I haven't tested that directly because I am not sure how to enforce
INC when running the prove command through system(), and there is no
point to use pop on @INC directly in vcregress.pl as that would just
be ignored in the system() call. But to be short: this will blow up as
@INC is part of the default per perl -V if one uses ActivePerl. So it
looks fair to me to append the local source directory as well to what
is included. You can actually do that by just adding $dir to
$ENV{PERL5LIB} in vcregress.pl and that would be fine.


I committed a fix for the unix Makefile, per Andres' original 
suggestion, because this started to bother me. I tried reproducing this 
on Windows by hacking Prove.pm to remove '.' from @INC, but I wasn't 
able to make that to fail. So I didn't do anything about MSVC for now. 
Let's fix vcregress.pl, when someone reports an actual problem on Windows.
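For reference, Michael's PERL5LIB suggestion would amount to something like
the following in vcregress.pl (a sketch only, not committed code; $dir here
stands for the relevant test source directory):

    $ENV{PERL5LIB} = defined $ENV{PERL5LIB} ? "$dir;$ENV{PERL5LIB}" : $dir;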


- Heikki





Re: [HACKERS] Our "fallback" atomics implementation doesn't actually work

2016-10-07 Thread Andres Freund
On 2016-10-06 00:06:33 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Hm. After a long battle of head vs. wall I think I see what the problem
> > is.  For the fallback atomics implementation I somehow had assumed that
> > pg_atomic_write_u32() doesn't need to lock, as it's just an unlocked
> > write.  But that's not true, because it has to cause
> > pg_atomic_compare_exchange_u32 to fail.
> 
> Hah ... obvious once you see it.
> 
> > For me the problem often takes a lot longer to reproduce (once only
> > after 40min), could you run with the attached patch, and see whether
> > that fixes things for you?
> 
> For me, with the described test case, HEAD fails within a minute,
> two times out of three or so.  I've not reproduced it after half an
> hour of beating on this patch.  Looks good.

It's not quite there yet, unfortunately. At the moment
pg_atomic_write_u32() is used for local buffers - and we explicitly
don't want that to be locking for temp buffers
(c.f. 6b93fcd149329d4ee7319561b30fc15a573c6307).

Don't really have a great idea about addressing this, besides either
just living with the lock for temp buffers on fallback platforms (which
don't have much of a practical relevance), or introduce
pg_atomic_unlocked_write_u32() or something. Neither seems great.

Regards,

Andres




Re: [HACKERS] Is it time to kill support for very old servers?

2016-10-07 Thread Steve Crawford
This thread gets me thinking about the definition of "support." While
support in practice seems to primarily relate to fixes/updates to the
supported version itself it could just as well apply to interoperability
support by newer versions.

Given that the standard PostgreSQL upgrade process involves upgrading
clients first and using pg_dump from the newer version, it is reasonable to
assume that the clients/utilities for a given version would support
interacting with any prior version that was not EOL at the time the new
major version is released.

In other words, 9.6 was released last month, the same month that 9.1 was
EOL, so 9.6 clients should work with 9.1 through 9.6 servers but from my
perspective there is no need to *guarantee* that 10 would do so. The
standard caveats apply. A new version *might* work for an unsupported older
version but no assurance is offered.

This is effectively a 5-year upgrade "grace period" *after* the EOL date of
a given version, which seems plenty generous.

Defining the term of backward compatibility support might be useful in the
future when these types of questions arise.

Cheers,
Steve






On Fri, Oct 7, 2016 at 9:06 AM, Tom Lane  wrote:

> Robert Haas  writes:
> > On Fri, Oct 7, 2016 at 11:34 AM, Tom Lane  wrote:
> >> Greg Stark  writes:
> >>> For another there may be binary-only applications or drivers out there
> >>> that are using V2 for whatever reason.
>
> >> The problem with letting it just sit there is that we're not, in fact,
> >> testing it.  If we take the above argument seriously then we should
> >> provide some way to configure libpq to prefer V2 and run regression
> >> tests in that mode.  Otherwise, if/when we break it, we'll never know it
> >> till we get field reports.
>
> > I agree with that.  I think it would be fine to keep V2 support if
> > somebody wants to do the work to let us have adequate test coverage,
> > but if nobody volunteers I think we might as well rip it out.  I don't
> > particularly enjoy committing things only to be told that they've
> > broken something I can't test without unreasonable effort.
>
> When I wrote the above I was thinking of an essentially user-facing
> libpq feature, similar to the one JDBC has, to force use of V2 protocol.
> But actually, for testing purposes, I don't think that's what we want.
> Any such feature would fail to exercise libpq's logic for falling back
> from V3 to V2 when it connects to an old server, which is surely something
> we'd like to test without actually having a pre-7.4 server at hand.
>
> So what I'm thinking is it'd be sufficient to do something like
> this in pqcomm.h:
>
> +#ifndef FORCE_OLD_PROTOCOL
>  #define PG_PROTOCOL_LATEST PG_PROTOCOL(3,0)
> +#else /* make like a pre-7.4 server for testing purposes */
> +#define PG_PROTOCOL_LATEST PG_PROTOCOL(2,0)
> +#endif
>
> which would cause the server to reject 3.0 requests just as if it were
> ancient.  Then we could test with that #define, maybe have a buildfarm
> critter doing it.  (This might break pqmq.c though, so we might need
> to work slightly harder than this.)
>
> Also, I realized while perusing this that the server still has support
> for protocol 1.0 (!).  That's *definitely* dead code.  There's not much
> of it, but still, I'd rather rip it out than continue to pretend it's
> supported.
>
> regards, tom lane
>
>


Re: [HACKERS] pgbench vs. wait events

2016-10-07 Thread Alfred Perlstein



On 10/7/16 10:42 AM, Andres Freund wrote:

Hi,

On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote:

This contention on WAL reminds me of another scenario I've heard about that
was similar.

To fix things, what happened was that the first person to block
would be responsible for writing out all buffers for anyone blocked behind
"him".

We pretty much do that already. But while that's happening, the other
would-be-writers show up as blocking on the lock.  We do use kind of
an odd locking model for the waiters (LWLockAcquireOrWait()), which
waits for the lock to be released, but doesn't try to acquire it
afterwards. Instead the wal position is rechecked, and in many cases
we'll be done afterwards, because enough has been written out.

Greetings,

Andres Freund



Are the batched writes all done before fsync is called?

Are you sure that A only calls fsync after flushing all the buffers from 
B, C, and D?  Or will it fsync twice?  Is there instrumentation to show 
that?


I know there's a tremendous level of skill involved in this code, but 
simply asking in case there's some tricks.


Another strategy that may work is intentionally waiting/buffering a few 
ms between flushes/fsyncs: for example, make sure that the number of 
flushes per second doesn't exceed some configurable amount, because each 
flush likely eats at least one IOP from the disk and there is a maximum 
IOPS per disk, so you might as well buffer more if you're exceeding that 
count.  You'll trade some latency, but gain throughput for doing that.


Does this make sense?  Again apologies if this has been covered.  Is 
there a whitepaper or blog post or clear way I can examine the algorithm 
wrt locks/buffering for flushing WAL logs?


-Alfred








Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Realistically though, how much would code coverage info have helped?
> Code coverage on a back branch would not have told you about whether
> it leaves blobs behind in the final regression DB state.  Code coverage
> on HEAD might have helped you notice some aspects of this failure, but
> it would not have told you about the same query failing before 7.4
> that worked later.

The code coverage report is exactly what I was using to figure out what
was being tested in pg_dump and what wasn't.  Many of the tests that are
included in the new TAP testing framework that I wrote for pg_dump were
specifically to provide code coverage and did improve the report.

If the regression tests in older versions were updated to make sure that
all the capabilities of pg_dump in those versions were tested then my
testing with the regression test databases would have shown that the
newer version of pg_dump wasn't handling those cases correctly.  That
would require more comprehensive testing to be done in those
back-branches though, which would require more than just the code
coverage tool being included, that's true.

Another approach to this would be to figure out a way for the newer
testing framework in HEAD to be run against older versions, though we'd
need to have a field which indicates which version of PG a given test
should be run against as there are certainly tests of newer capabilities
than older versions supported.

Ultimately, I'm afraid we may have to just punt on the idea of this kind
of testing being done using the same testing structure that exists in
HEAD and is used in the buildfarm.  That would be unfortunate, but I'm
not quite sure how you could have a buildfarm member that runs every
major version between 8.0 and HEAD and knows how to tell the HEAD
build-system what all the ports are for all those versions to connect to
and run tests against.

Thanks!

Stephen




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> Stephen Frost wrote:
> > * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> > > Stephen Frost wrote:
> > > 
> > > > What would be really nice would be code coverage information for the
> > > > back-branches also, as that would allow us to figure out what we're
> > > > missing coverage for.  I realize that we don't like adding new things to
> > > > back-branches as those changes could impact packagers, but that might
> > > > not impact them since that only runs when you run 'make coverage'.
> > > 
> > > Hmm?  9.1 already has "make coverage", so there's nothing to backpatch.
> > > Do you mean to backpatch that infrastructure even further back than
> > > that?
> > 
> > I wasn't sure how far back it went, but if it's only to 9.1, then yes,
> > farther than that.  Specifically, to as far back as we wish to provide
> > support for pg_dump, assuming it's reasonable to do so.
> 
> I said 9.1 because that's the oldest we support, but it was added in
> 8.4.
> 
> Do you really want to go back to applying patches back to 7.0?  That's
> brave.

Hrm.  My thought had actually been "back to whatever we decide we want
pg_dump to support."  The discussion about that seems to be trending
towards 8.0 rather than 7.0, but you bring up an interesting point about
if we actually want to back-patch things that far.

I guess my thinking is that if we decide that 8.0 is the answer then we
should at least be open to back-patching things that allow us to test
that we are actually still supporting 8.0 and maybe that even means
having a buildfarm member or two which checks back that far.

> > > Or perhaps you are saying that coverage.pg.org should report results for
> > > each branch separately?  We could do that ...
> > 
> > This would certainly be nice to have, but the first is more important.
> > coverage.pg.org is nice to tell people "hey, here's where you can look
> > to find what we aren't covering", but when you're actually hacking on
> > code, you really want a much faster turn-around
> 
> True.  We could update things in coverage.postgresql.org much
> faster, actually.  Right now it's twice a day, but if we enlarge the
> machine I'm sure we can do better (yes, we can do that pretty easily).
> Also, to make it faster, we could install ccache 3.10 in that machine,
> although that would be against our regular pginfra policy.
> 
> At some point I thought about providing reports for each day, so that we
> can see how it has improved over time, but that may be too much :-)
> 
> > and you'd like that pre-commit too.
> 
> Yeah, that's a good point.

This is the real issue, imv, with coverage.pg.org.  I still like having
it, and having stats kept over time which allow us to see how we're
doing over time when it comes to our code coverage would be nice, but
the coverage.pg.org site isn't as useful for active development.

Thanks!

Stephen




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Tom Lane
Alvaro Herrera  writes:
> Stephen Frost wrote:
>> I wasn't sure how far back it went, but if it's only to 9.1, then yes,
>> farther than that.  Specifically, to as far back as we wish to provide
>> support for pg_dump, assuming it's reasonable to do so.

> Do you really want to go back to applying patches back to 7.0?  That's
> brave.

Branches before about 7.3 or 7.4 don't build cleanly on modern tools.
In fact, they don't even build cleanly on my old HPUX 10.20 box ...
I just tried, and they have problems with the bison and flex I have
installed there now.  As a data point, that bison executable bears
a file date of Jan 31 2003.  Andres reported something similar in
the year-or-two-ago thread that was mentioned earlier.

This doesn't even consider optional features; I wasn't trying to build
SSL support for instance, but I'm pretty sure OpenSSL has been a moving
target over that kind of time span.

So I think trying to collect code coverage info on those branches is nuts.
Maybe we could sanely do it for the later 8.x releases.

Realistically though, how much would code coverage info have helped?
Code coverage on a back branch would not have told you about whether
it leaves blobs behind in the final regression DB state.  Code coverage
on HEAD might have helped you notice some aspects of this failure, but
it would not have told you about the same query failing before 7.4
that worked later.

regards, tom lane




Re: [HACKERS] pgbench vs. wait events

2016-10-07 Thread Andres Freund
Hi,

On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote:
> This contention on WAL reminds me of another scenario I've heard about that
> was similar.
> 
> To fix things, what happened was that the first person to block
> would be responsible for writing out all buffers for anyone blocked behind
> "him".

We pretty much do that already. But while that's happening, the other
would-be-writers show up as blocking on the lock.  We do use kind of
an odd locking model for the waiters (LWLockAcquireOrWait()), which
waits for the lock to be released, but doesn't try to acquire it
afterwards. Instead the wal position is rechecked, and in many cases
we'll be done afterwards, because enough has been written out.
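To illustrate the pattern (a rough sketch of the idea, not the actual
XLogFlush() code; flushed_up_to, my_record_lsn and the two helper calls are
hypothetical names standing in for the real xlog.c machinery):

    while (flushed_up_to < my_record_lsn)
    {
        if (LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
        {
            /* We got the lock: write/fsync WAL up to (at least) our LSN. */
            write_and_fsync_wal(my_record_lsn);   /* hypothetical helper */
            LWLockRelease(WALWriteLock);
            break;
        }

        /*
         * We didn't get the lock; we only slept until the current holder
         * released it.  Re-read how far WAL has been flushed -- very often
         * the holder already flushed past our LSN and we're done without
         * ever taking the lock ourselves.
         */
        flushed_up_to = read_shared_flush_pointer();   /* hypothetical helper */
    }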

Greetings,

Andres Freund




Re: [HACKERS] pgbench vs. wait events

2016-10-07 Thread Andres Freund
Hi,

On 2016-10-06 14:38:56 -0400, Robert Haas wrote:
> Obviously, there's a vast increase in TPS, and the backends seem to
> spend most of their time actually doing work.  ClientRead is now the
> overwhelmingly dominant wait event, although wal_insert and
> WALWriteLock contention is clearly still a significant problem.
> Contention on other locks is apparently quite rare.  Notice that
> client reads are really significant here - more than 20% of the time
> we sample what a backend is doing, it's waiting for the next query.
> It seems unlikely that performance on this workload can be improved
> very much by optimizing anything other than WAL writing, because no
> other wait event appears in more than 1% of the samples.

I don't think you meant that, but this sounds a bit like that there's
nothing to do performance-wise - I don't think that's true.  There's no
significant lock-contention, yes. But that obviously doesn't mean we can
use resources more efficiently.


> Second, ClientRead becomes a bigger and bigger issue as the number of
> clients increases; by 192 clients, it appears in 45% of the samples.
> That basically means that pgbench is increasingly unable to keep up
> with the server; for whatever reason, it suffers more than the server
> does from the increasing lack of CPU resources.

Isn't that more a question of pgbench threads being preempted, so server
processes can run? Since there's not enough hardware threads for both
servers and clients to always run, you *have* to switch between them.
Usually a read from the client after processing a query will cause a
directed wakeup of the other thread (which is why it's not a very
frequently observed state), but if the queue of to-be-run tasks is very
long that'll not happen.

Greetings,

Andres Freund




Re: [HACKERS] pgbench vs. wait events

2016-10-07 Thread Andres Freund
Hi,

On 2016-10-06 18:15:58 -0400, Robert Haas wrote:
> That's pretty tight, especially since I now notice Andres also left a
> postmaster running on this machine back in April, with
> shared_buffers=8GB.

Oops, sorry for that.

- Andres




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> It might be a good idea to retroactively modify 9.1-9.3 so that there
>> are some blobs in the final state, for purposes of testing pg_dump and
>> pg_upgrade.

> I certainly think that would be a good idea.  I thought we had been
> insisting on coverage via the regression tests for a lot farther back
> than 9.4. though perhaps that was only for newer features and we never
> went back and added it for existing capabilities.

Well, there were regression tests for blobs for a long time, but they
carefully cleaned up their mess.  It was only in 70ad7ed4e that we
made them leave some blobs behind.

I took a quick look at back-patching that commit, but the test would
need to be rewritten to not depend on features that don't exist
further back (like \gset), which likely explains why I didn't do it
at the time.
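For illustration, a 9.1-compatible way to leave a blob behind without \gset
might look like this (a sketch, not the actual test change; the OID 3001 is
arbitrary, and 131072 is the INV_WRITE mode bit):

    BEGIN;
    SELECT lo_create(3001);
    SELECT lowrite(lo_open(3001, 131072), 'pg_dump test payload');
    COMMIT;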

regards, tom lane




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Alvaro Herrera
Stephen Frost wrote:
> * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> > Stephen Frost wrote:
> > 
> > > What would be really nice would be code coverage information for the
> > > back-branches also, as that would allow us to figure out what we're
> > > missing coverage for.  I realize that we don't like adding new things to
> > > back-branches as those changes could impact packagers, but that might
> > > not impact them since that only runs when you run 'make coverage'.
> > 
> > Hmm?  9.1 already has "make coverage", so there's nothing to backpatch.
> > Do you mean to backpatch that infrastructure even further back than
> > that?
> 
> I wasn't sure how far back it went, but if it's only to 9.1, then yes,
> farther than that.  Specifically, to as far back as we wish to provide
> support for pg_dump, assuming it's reasonable to do so.

I said 9.1 because that's the oldest we support, but it was added in
8.4.

Do you really want to go back to applying patches back to 7.0?  That's
brave.

> > Or perhaps you are saying that coverage.pg.org should report results for
> > each branch separately?  We could do that ...
> 
> This would certainly be nice to have, but the first is more important.
> coverage.pg.org is nice to tell people "hey, here's where you can look
> to find what we aren't covering", but when you're actually hacking on
> code, you really want a much faster turn-around

True.  We could update things in coverage.postgresql.org much
faster, actually.  Right now it's twice a day, but if we enlarge the
machine I'm sure we can do better (yes, we can do that pretty easily).
Also, to make it faster, we could install ccache 3.10 in that machine,
although that would be against our regular pginfra policy.

At some point I thought about providing reports for each day, so that we
can see how it has improved over time, but that may be too much :-)

> and you'd like that pre-commit too.

Yeah, that's a good point.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> Stephen Frost wrote:
> 
> > What would be really nice would be code coverage information for the
> > back-branches also, as that would allow us to figure out what we're
> > missing coverage for.  I realize that we don't like adding new things to
> > back-branches as those changes could impact packagers, but that might
> > not impact them since that only runs when you run 'make coverage'.
> 
> Hmm?  9.1 already has "make coverage", so there's nothing to backpatch.
> Do you mean to backpatch that infrastructure even further back than
> that?

I wasn't sure how far back it went, but if it's only to 9.1, then yes,
farther than that.  Specifically, to as far back as we wish to provide
support for pg_dump, assuming it's reasonable to do so.

> Or perhaps you are saying that coverage.pg.org should report results for
> each branch separately?  We could do that ...

This would certainly be nice to have, but the first is more important.
coverage.pg.org is nice to tell people "hey, here's where you can look
to find what we aren't covering", but when you're actually hacking on
code, you really want a much faster turn-around and you'd like that
pre-commit too.

Thanks!

Stephen




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Alvaro Herrera
Stephen Frost wrote:

> What would be really nice would be code coverage information for the
> back-branches also, as that would allow us to figure out what we're
> missing coverage for.  I realize that we don't like adding new things to
> back-branches as those changes could impact packagers, but that might
> not impact them since that only runs when you run 'make coverage'.

Hmm?  9.1 already has "make coverage", so there's nothing to backpatch.
Do you mean to backpatch that infrastructure even further back than
that?

Or perhaps you are saying that coverage.pg.org should report results for
each branch separately?  We could do that ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Ugh.  Thanks for fixing.  I had tested back to 7.4 with the regression
> > tests but either those didn't include blobs or something got changed
> > after my testing and I didn't re-test all the way back when I should
> > have.
> 
> It looks like the final state of the regression tests doesn't include
> any blobs before about 9.4.  You wouldn't have seen any results worse
> than a warning message in 7.4-8.4, unless there were some blobs so that
> the data extraction loop got iterated.
> 
> It might be a good idea to retroactively modify 9.1-9.3 so that there
> are some blobs in the final state, for purposes of testing pg_dump and
> pg_upgrade.

I certainly think that would be a good idea.  I thought we had been
insisting on coverage via the regression tests for a lot farther back
than 9.4, though perhaps that was only for newer features and we never
went back and added it for existing capabilities.

What would be really nice would be code coverage information for the
back-branches also, as that would allow us to figure out what we're
missing coverage for.  I realize that we don't like adding new things to
back-branches as those changes could impact packagers, but that might
not impact them since that only runs when you run 'make coverage'.

Thanks!

Stephen




Re: [HACKERS] Question / requests.

2016-10-07 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Oct 5, 2016 at 10:58 AM, Francisco Olarte
>  wrote:
> > On Tue, Oct 4, 2016 at 7:50 PM, Robert Haas  wrote:
> >> On Mon, Oct 3, 2016 at 5:44 PM, Alvaro Herrera  
> >> wrote:
> > ...
> >>> I wonder if the real answer isn't just to disallow -f with parallel
> >>> vacuuming.
> >> Seems like we should figure out which catalog tables are needed in
> >> order to perform a VACUUM, and force those to be done last and one at
> >> a time.
> >
> > Is the system catalog a bottleneck for people who has real use for
> > paralell vacuum?
> 
> I don't know, but it seems like the documentation for vacuumdb
> currently says, more or less, "Hey, if you use -j with -f, it may not
> work!", which seems unacceptable to me.  It should be the job of the
> person writing the feature to make it work in all cases, not the job
> of the person using the feature to work around the problem when it
> doesn't.

The most interesting use case of vacuumdb is lazy vacuuming, I think, so
committing that patch as it was submitted previously was a good step
forward even if it didn't handle VACUUM FULL 100%.

I agree that it's better to have both modes Just Work in parallel, which
is the point of this subsequent patch.  So let's move forward.  I
support Francisco's effort to make -f work with -j.  I don't have a
strong opinion on which of the various proposals presented so far is the
best way to implement it, but let's figure that out and get it done.

If you want to argue that vacuumdb -f -j not working properly is a bug
in the vacuumdb -j commit, ISTM you're arguing that we should backpatch
whatever we come up with as a bug fix, but I would disagree with that.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-10-07 Thread Gilles Darold
On 03/10/2016 at 16:09, Christoph Berg wrote:
> Hi Gilles,
>
> I've just tried v4 of the patch. The OID you picked for
> pg_current_logfile doesn't work anymore, but after increasing it
> randomly by 1, it compiles. I like the added functionality,
> especially that "select pg_read_file(pg_current_logfile());" just
> works.

I've changed the OID and some other things in this v5 of the patch; see
below.

> What bugs me is the new file "pg_log_file" in PGDATA. It clutters the
> directory listing. I wouldn't know where else to put it, but you might
> want to cross-check that with the thread that is trying to reshuffle
> the directory layout to make it easier to exclude files from backups.
> (Should this file be part of backups?)
>
> It's probably correct to leave the file around on shutdown (given it's
> still a correct pointer). But there might be a case for removing it on
> startup if logging_collector isn't active anymore.

The file has been renamed to current_logfile and is now removed at
startup if logging_collector is not active. The file can be excluded
from a backup, but otherwise, if it is restored, it will be removed or
overwritten at startup. Perhaps the file can provide useful information in
a backup by recording the last log file active at backup time, but I'm not
sure that is of any interest.

I'm not sure which thread is talking about reshuffling the directory
layout; please give me a pointer if it is not the thread about renaming
pg_xlog and pg_clog. In the future, if we have a directory to store status
files or files that must be excluded from backups, it will be easy to move
this file there. I will follow such a change.

> Also, pg_log_file is tab-completion-unfriendly, it conflicts with
> pg_log/. Maybe name it current_logfile?
Right, done.

> Another thing that might possibly be improved is csv logging:
>
> # select pg_read_file(pg_current_logfile());
>  pg_read_file  
> ───
>  LOG:  ending log output to stderr↵
>  HINT:  Future log output will go to log destination "csvlog".↵
>
> -rw---  1 cbe staff 1011 Okt  3 15:06 postgresql-2016-10-03_150602.csv
> -rw---  1 cbe staff   96 Okt  3 15:06 postgresql-2016-10-03_150602.log
>
> ... though it's unclear what to do if both stderr and csvlog are
> selected.
>
> Possibly NULL should be returned if only "syslog" is selected.
> (Maybe remove pg_log_file once 'HINT:  Future log output will go to
> log destination "syslog".' is logged?)

I had totally missed that we can have log_destination set to stderr and
csvlog at the same time, so pg_current_logfile() might return two
filenames in this case. I've changed the function to return a setof
record to report the last stderr or csv log file, or both. Another
major change is that the current log filename list is also updated after
a configuration reload, not just after a startup or a log rotation.
So if you switch from stderr to csvlog or both, you will immediately see
the change in current_logfile instead of waiting for the next log rotation.

* log_destination set to csvlog only:

postgres=# select * from pg_current_logfile();
  pg_current_logfile  
---
 pg_log/postgresql-2016-10-07_1646.csv
(1 row)

* log_destination set to stderr only:

postgres=# select pg_reload_conf();
 pg_reload_conf

 t
(1 row)

postgres=# select * from pg_current_logfile();
  pg_current_logfile  
---
 pg_log/postgresql-2016-10-07_1647.log
(1 row)

* log_destination set to both stderr,csvlog:

postgres=# select pg_reload_conf();
 pg_reload_conf

 t
(1 row)

postgres=# select * from pg_current_logfile();
  pg_current_logfile  
---
 pg_log/postgresql-2016-10-07_1648.log
 pg_log/postgresql-2016-10-07_1648.csv
(2 rows)

When logging_collector is disabled, this function returns null.

As the return type has changed to a setof, the query used to read the file
needs to change too:

postgres=# SELECT pg_read_file(log) FROM pg_current_logfile() b(log);
                                     pg_read_file
---------------------------------------------------------------------------------------
 LOG:  duration: 0.182 ms  statement: select pg_read_file(log) from pg_current_logfile() file(log);+
 
(1 row)

I can change the return type to a single text[] if that looks better.

Thanks

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a588350..69a74af 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15480,6 +15480,12 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
   
 
   

Re: [HACKERS] Radix tree for character conversion

2016-10-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangas  wrote:
>> Ouch. We should find and document an authoritative source for all the
>> mappings we have...
>> 
>> I think the next steps here are:
>> 
>> 1. Find an authoritative source for all the existing mappings.
>> 2. Generate the radix tree files directly from the authoritative sources,
>> instead of the existing *.map files.
>> 3. Completely replace the existing binary-search code with this.

> It might be best to convert using the existing map files, and then
> update the mappings later.  Otherwise, when things break, you won't
> know what to blame.

I think I went through this exercise last year or so, and updated the
notes about the authoritative sources where I was able to find one.
In the remaining cases, I believe that the maps have been intentionally
tweaked and we should be cautious about undoing that.  Tatsuo-san might
remember more about why they are the way they are.

regards, tom lane




Re: [HACKERS] Is it time to kill support for very old servers?

2016-10-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Oct 7, 2016 at 11:34 AM, Tom Lane  wrote:
>> Greg Stark  writes:
>>> For another there may be binary-only applications or drivers out there
>>> that are using V2 for whatever reason.

>> The problem with letting it just sit there is that we're not, in fact,
>> testing it.  If we take the above argument seriously then we should
>> provide some way to configure libpq to prefer V2 and run regression
>> tests in that mode.  Otherwise, if/when we break it, we'll never know it
>> till we get field reports.

> I agree with that.  I think it would be fine to keep V2 support if
> somebody wants to do the work to let us have adequate test coverage,
> but if nobody volunteers I think we might as well rip it out.  I don't
> particularly enjoy committing things only to be told that they've
> broken something I can't test without unreasonable effort.

When I wrote the above I was thinking of an essentially user-facing
libpq feature, similar to the one JDBC has, to force use of V2 protocol.
But actually, for testing purposes, I don't think that's what we want.
Any such feature would fail to exercise libpq's logic for falling back
from V3 to V2 when it connects to an old server, which is surely something
we'd like to test without actually having a pre-7.4 server at hand.

So what I'm thinking is it'd be sufficient to do something like
this in pqcomm.h:

+#ifndef FORCE_OLD_PROTOCOL
 #define PG_PROTOCOL_LATEST PG_PROTOCOL(3,0)
+#else /* make like a pre-7.4 server for testing purposes */
+#define PG_PROTOCOL_LATEST PG_PROTOCOL(2,0)
+#endif

which would cause the server to reject 3.0 requests just as if it were
ancient.  Then we could test with that #define, maybe have a buildfarm
critter doing it.  (This might break pqmq.c though, so we might need
to work slightly harder than this.)

Also, I realized while perusing this that the server still has support
for protocol 1.0 (!).  That's *definitely* dead code.  There's not much
of it, but still, I'd rather rip it out than continue to pretend it's
supported.

regards, tom lane




Re: [HACKERS] Question about pg_control usage

2016-10-07 Thread Jeff Janes
On Fri, Oct 7, 2016 at 8:24 AM, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> Hi, hackers!
>
> I am examining various global variables in ShmemVariableCache,
> pg_control and pg_controldata.  To force and debug xid wraparound,
> I've implemented a function that allows setting the nextXid value in the control
> file manually, but it doesn't work as expected.
>

Why not use pg_resetxlog -x ?

> After an investigation I found that, while loading global variables in
> StartupXLOG() we read the checkPoint location from ControlFile, and then
> read all the information from the checkPoint record, and never ever use
> the nextXid stored in ControlFile.
>
> What is the purpose of having the CheckPoint stored in the control file then?
> I thought that it allows us to start up the server after a clean shutdown
> without any xlog, as the comment in pg_control.h says:
>
>  * Body of CheckPoint XLOG records.  This is declared here because we keep
>  * a copy of the latest one in pg_control for possible disaster recovery.
>  * Changing this struct requires a PG_CONTROL_VERSION bump.
>
> But as far as I see, we never use this data.
>
> Could you please explain why we don't use the information from the control file
> in StartupXLOG()?



I believe pg_resetxlog uses the data from the control file to create an
artificial checkpoint record, and injects that record into an artificial
xlog file, which will allow you to start the system.  So it is a small use,
but a sometimes very important one.

Cheers,

Jeff


Re: [HACKERS] Radix tree for character conversion

2016-10-07 Thread Robert Haas
On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangas  wrote:
> Ouch. We should find and document an authoritative source for all the
> mappings we have...
>
> I think the next steps here are:
>
> 1. Find an authoritative source for all the existing mappings.
> 2. Generate the radix tree files directly from the authoritative sources,
> instead of the existing *.map files.
> 3. Completely replace the existing binary-search code with this.

It might be best to convert using the existing map files, and then
update the mappings later.  Otherwise, when things break, you won't
know what to blame.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_rewind and ssl test suites don't work if '.' is not in @INC

2016-10-07 Thread Heikki Linnakangas

On 10/07/2016 03:05 PM, Michael Paquier wrote:

On Fri, Oct 7, 2016 at 8:54 PM, Heikki Linnakangas  wrote:

I'm a bit surprised no-one else has reported this yet. Have I missed a
report?


Andres has reported that a couple of weeks back, on Debian testing like you:
https://www.postgresql.org/message-id/20160908204529.flg6nivjuwp5v...@alap3.anarazel.de


Ah, thanks, I'll reply on that thread.

- Heikki





Re: [HACKERS] pgbench vs. wait events

2016-10-07 Thread Jeff Janes
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas  wrote:

>
> Next, I tried lowering the scale factor to something that fits in
> shared buffers.  Here are the results at scale factor 300:
>
>  14  Lock| tuple
>  22  LWLockTranche   | lock_manager
>  39  LWLockNamed | WALBufMappingLock
> 331  LWLockNamed | CLogControlLock
> 461  LWLockNamed | ProcArrayLock
> 536  Lock| transactionid
> 552  Lock| extend
> 716  LWLockTranche   | buffer_content
> 763  LWLockNamed | XidGenLock
>2113  LWLockNamed | WALWriteLock
>6190  LWLockTranche   | wal_insert
>   25002  Client  | ClientRead
>   78466  |
>
> tps = 27651.562835 (including connections establishing)
>
> Obviously, there's a vast increase in TPS, and the backends seem to
> spend most of their time actually doing work.  ClientRead is now the
> overwhelmingly dominant wait event, although wal_insert and
> WALWriteLock contention is clearly still a significant problem.
> Contention on other locks is apparently quite rare.  Notice that
> client reads are really significant here - more than 20% of the time
> we sample what a backend is doing, it's waiting for the next query.
> It seems unlikely that performance on this workload can be improved
> very much by optimizing anything other than WAL writing, because no
> other wait event appears in more than 1% of the samples.  It's not
> clear how much of the WAL-related stuff is artificial lock contention
> and how much of it is the limited speed at which the disk can rotate.
>

What happens if you turn fsync off?  Once an xlog file is fully written, it
is immediately fsynced, even if the backend is holding WALWriteLock or
wal_insert (or both) at the time, and even if synchronous_commit is off.
Even assuming this machine has a BBU so that it doesn't have to wait for disk
rotation, fsyncs are still expensive because the kernel has to find all the
data and get it sent over to the BBU, while holding locks.




Second, ClientRead becomes a bigger and bigger issue as the number of
> clients increases; by 192 clients, it appears in 45% of the samples.
> That basically means that pgbench is increasingly unable to keep up
> with the server; for whatever reason, it suffers more than the server
> does from the increasing lack of CPU resources.


I would be careful about that interpretation.  If you asked pgbench, it
would probably have the opposite opinion.

The backend tosses its response at the kernel (which will never block,
because the pgbench responses are all small and the kernel will buffer
them) and then goes into ClientRead.  After the backend goes into ClientRead,
the kernel needs to find and wake up pgbench and deliver the response, and
pgbench has to receive and process the response.  Only then does it create
a new query (I've toyed before with having pgbench construct the next query
while it is waiting for a response on the previous one, but that didn't
seem promising, and much of pgbench has been rewritten since then) and pass
the query back to the kernel.  Then the kernel has to find and wake up the
backend and deliver the new query.  So for a reasonable chunk of the time
that the server thinks it is waiting for the client, the client also thinks
it is waiting for the server.

I think we need to come up with some benchmarking queries which get more
work done per round-trip to the database, and build them into the binary,
because otherwise people won't use them as much as they should if they have
to pass "-f" files around mailing lists and blog postings.  For example,
we could enclose the five statements of the TPC-B-like transaction in a single
function which takes aid, bid, tid, and delta as arguments.  And presumably we
could drop the other two statements (BEGIN and COMMIT) as well, and rely on
autocommit to get that job done.  So we could go from 7 statements to 1.



> Third,
> Lock/transactionid and Lock/tuple become more and more common as the
> number of clients increases; these happen when two different pgbench
> threads deciding to hit the same row at the same time.  Due to the
> birthday paradox this increases pretty quickly as the number of
> clients ramps up, but it's not really a server issue: there's nothing
> the server can do about the possibility that two or more clients pick
> the same random number at the same time.
>


What I have done in the past is chop a zero off from:

#define naccounts   100000

and recompile pgbench.  Then you can increase the scale factor so that you
have less contention on pgbench_branches while still fitting the data in
shared_buffers, or in RAM.
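
(To put rough numbers on that, if I have pgbench's defaults right: stock
pgbench creates one branch and 100000 accounts per scale unit, so scale 300
means 30 million accounts but only 300 pgbench_branches rows to fight over.
With naccounts cut to 10000, scale 3000 gives you the same 30 million
accounts, i.e. about the same data volume, but ten times as many branch
rows and correspondingly less branch contention.)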

Cheers,

Jeff


Re: [HACKERS] Question about pg_control usage

2016-10-07 Thread Tom Lane
Anastasia Lubennikova  writes:
> What is the purpose of having CheckPoint stored in control file then?

IIRC, we use some but not all of the fields.  It seemed prudent (and
simpler) to keep a copy of the whole thing.

regards, tom lane




Re: [HACKERS] Is it time to kill support for very old servers?

2016-10-07 Thread Robert Haas
On Fri, Oct 7, 2016 at 11:34 AM, Tom Lane  wrote:
> Greg Stark  writes:
>> On Fri, Oct 7, 2016 at 3:06 PM, Tom Lane  wrote:
>>> In the same line, maybe we should kill libpq's support for V2 protocol
>>> (which would make the cutoff 7.4).  And maybe the server's support too,
>>> though that wouldn't save very much code.  The argument for cutting this
>>> isn't so much that we would remove lots of code as that we're removing
>>> code that never gets tested, at least not by us.
>
>> Somehow removing the whole protocol support seems a bit different to
>> me than removing pg_dump logic. For one it's nice to be able to run a
>> modern psql against old servers so you can run a benchmark script.
>
> Yeah, but surely pre-7.4 servers are no longer of much interest for that;
> or if you want historical results you should also use a matching libpq.
>
>> For another there may be binary-only applications or drivers out there
>> that are using V2 for whatever reason.
>
> The problem with letting it just sit there is that we're not, in fact,
> testing it.  If we take the above argument seriously then we should
> provide some way to configure libpq to prefer V2 and run regression
> tests in that mode.  Otherwise, if/when we break it, we'll never know it
> till we get field reports.

I agree with that.  I think it would be fine to keep V2 support if
somebody wants to do the work to let us have adequate test coverage,
but if nobody volunteers I think we might as well rip it out.  I don't
particularly enjoy committing things only to be told that they've
broken something I can't test without unreasonable effort.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Tom Lane
Stephen Frost  writes:
> Ugh.  Thanks for fixing.  I had tested back to 7.4 with the regression
> tests but either those didn't include blobs or something got changed
> after my testing and I didn't re-test all the way back when I should
> have.

It looks like the final state of the regression tests doesn't include
any blobs before about 9.4.  You wouldn't have seen any results worse
than a warning message in 7.4-8.4, unless there were some blobs so that
the data extraction loop got iterated.

It might be a good idea to retroactively modify 9.1-9.3 so that there
are some blobs in the final state, for purposes of testing pg_dump and
pg_upgrade.

regards, tom lane




Re: [HACKERS] Question / requests.

2016-10-07 Thread Francisco Olarte
Robert:

On Fri, Oct 7, 2016 at 3:20 PM, Robert Haas  wrote:
> On Wed, Oct 5, 2016 at 10:58 AM, Francisco Olarte
>  wrote:
>> On Tue, Oct 4, 2016 at 7:50 PM, Robert Haas  wrote:
>>> On Mon, Oct 3, 2016 at 5:44 PM, Alvaro Herrera  
>>> wrote:
>> ...
>>>> I wonder if the real answer isn't just to disallow -f with parallel
>>>> vacuuming.
>>> Seems like we should figure out which catalog tables are needed in
>>> order to perform a VACUUM, and force those to be done last and one at
>>> a time.
>>
>> Is the system catalog a bottleneck for people who has real use for
>> paralell vacuum?

> I don't know, but it seems like the documentation for vacuumdb
> currently says, more or less, "Hey, if you use -j with -f, it may not
> work!", which seems unacceptable to me.  It should be the job of the
> person writing the feature to make it work in all cases, not the job
> of the person using the feature to work around the problem when it
> doesn't.

That may be the case, but the only way to solve it seems to be to
disallow full parallel, as suggested.

OTOH, what I was asking was just whether people think the time gained by
minimizing the part of pg_catalog processed serially in the
full-parallel case would be enough to warrant the increased code
complexity and bug surface.

Anyway, I'll stick to my original plan even if someone decides to fix
or disallow full parallel, as I think it has its uses.

Francisco Olarte.




Re: [HACKERS] Is it time to kill support for very old servers?

2016-10-07 Thread Tom Lane
Greg Stark  writes:
> On Fri, Oct 7, 2016 at 3:06 PM, Tom Lane  wrote:
>> In the same line, maybe we should kill libpq's support for V2 protocol
>> (which would make the cutoff 7.4).  And maybe the server's support too,
>> though that wouldn't save very much code.  The argument for cutting this
>> isn't so much that we would remove lots of code as that we're removing
>> code that never gets tested, at least not by us.

> Somehow removing the whole protocol support seems a bit different to
> me than removing pg_dump logic. For one it's nice to be able to run a
> modern psql against old servers so you can run a benchmark script.

Yeah, but surely pre-7.4 servers are no longer of much interest for that;
or if you want historical results you should also use a matching libpq.

> For another there may be binary-only applications or drivers out there
> that are using V2 for whatever reason.

The problem with letting it just sit there is that we're not, in fact,
testing it.  If we take the above argument seriously then we should
provide some way to configure libpq to prefer V2 and run regression
tests in that mode.  Otherwise, if/when we break it, we'll never know it
till we get field reports.

regards, tom lane




[HACKERS] Question about pg_control usage

2016-10-07 Thread Anastasia Lubennikova

Hi, hackers!

I am examining various global variables in ShmemVariableCache,
pg_control and pg_controldata.  To force and debug xid wraparound,
I've implemented a function that allows setting the nextXid value in the control
file manually, but it doesn't work as expected.

After an investigation I found that, while loading global variables in
StartupXLOG() we read the checkPoint location from ControlFile, and then
read all the information from the checkPoint record, and never ever use
the nextXid stored in ControlFile.


What is the purpose of having the CheckPoint stored in the control file then?
I thought that it allows us to start up the server after a clean shutdown
without any xlog, as the comment in pg_control.h says:

 * Body of CheckPoint XLOG records.  This is declared here because we keep
 * a copy of the latest one in pg_control for possible disaster recovery.
 * Changing this struct requires a PG_CONTROL_VERSION bump.

But as far as I see, we never use this data.

Could you please explain why we don't use the information from the control file
in StartupXLOG()?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Amit Langote  writes:
> > Just noticed that the getBlobs() query does not work for a 7.3 server
> > (maybe <= 7.3) due to the following change in commit 23f34fa4 [1]:
> 
> Ugh.
> 
> > I could fix that using the attached patch.
> 
> There's more wrong than that, as you'd notice if you tried dumping
> a DB that actually had some LOs in it :-(.  This obviously wasn't
> tested on anything older than 9.0.
> 
> Will push a fix in a bit, as soon as I can boot up my dinosaur with
> a working 7.0 server to test that branch.

Ugh.  Thanks for fixing.  I had tested back to 7.4 with the regression
tests but either those didn't include blobs or something got changed
after my testing and I didn't re-test all the way back when I should
have.

I wasn't able to (easily) get anything older than 7.4 to compile on my
box, which is why I had stopped there.

In any case, thanks again for the fix.

Stephen




Re: [HACKERS] Is it time to kill support for very old servers?

2016-10-07 Thread Greg Stark
On Fri, Oct 7, 2016 at 3:06 PM, Tom Lane  wrote:
> pg_dump alleges support for dumping from servers back to 7.0.  Would v10
> be a good time to remove some of that code?  It's getting harder and
> harder to even compile those ancient branches, let alone get people to
> test against them (cf. 4806f26f9).  My initial thought is to cut support
> for pre-7.3 or maybe pre-7.4 servers, as that would allow removal of
> support for cases where the server lacks schemas or pg_depend, each of
> which requires a fair deal of klugery in pg_dump.

I might be expected to be the holdout here but it seems sensible to
me. Removing code in pg_dump to deal with lacking schemas and
pg_depend seems like a major simplification.


> In the same line, maybe we should kill libpq's support for V2 protocol
> (which would make the cutoff 7.4).  And maybe the server's support too,
> though that wouldn't save very much code.  The argument for cutting this
> isn't so much that we would remove lots of code as that we're removing
> code that never gets tested, at least not by us.

If it's testing we're concerned about IIRC the current servers can be
arm-twisted into speaking V2 protocol. So it should be possible to
test both modern servers and modern pg_dump using V2 protocol with a
simple tweak.

Somehow removing the whole protocol support seems a bit different to
me than removing pg_dump logic. For one it's nice to be able to run a
modern psql against old servers so you can run a benchmark script. For
another there may be binary-only applications or drivers out there
that are using V2 for whatever reason.


-- 
greg




Re: [HACKERS] Is it time to kill support for very old servers?

2016-10-07 Thread Robert Haas
On Fri, Oct 7, 2016 at 10:06 AM, Tom Lane  wrote:
> pg_dump alleges support for dumping from servers back to 7.0.  Would v10
> be a good time to remove some of that code?  It's getting harder and
> harder to even compile those ancient branches, let alone get people to
> test against them (cf. 4806f26f9).  My initial thought is to cut support
> for pre-7.3 or maybe pre-7.4 servers, as that would allow removal of
> support for cases where the server lacks schemas or pg_depend, each of
> which requires a fair deal of klugery in pg_dump.

It's been a long time since I've seen any 7.x versions in the wild,
and the pg_dump stuff is a nuisance to maintain.  +1 for dropping
support for anything pre-7.4.  Anybody who is upgrading from 7.3 to 10
isn't likely to want to do it in one swing anyway.  In the real world,
few upgrades span 14 major versions.

> In the same line, maybe we should kill libpq's support for V2 protocol
> (which would make the cutoff 7.4).  And maybe the server's support too,
> though that wouldn't save very much code.  The argument for cutting this
> isn't so much that we would remove lots of code as that we're removing
> code that never gets tested, at least not by us.
>
> One small problem with cutting libpq's V2 support is that the server's
> report_fork_failure_to_client() function still sends a V2-style message.
> We could change that in HEAD, certainly, but we don't really want modern
> libpq unable to parse such a message from an older server.  Possibly we
> could handle that specific case with a little special-purpose code and
> still be able to take out most of fe-protocol2.c.

I definitely think we can't remove client support for anything that
modern servers still send, even if we change 10devel so that it no
longer does so.   I think we have to at least wait until all versions
that might send a message are EOL before removing client-side support
for it -- and maybe a bit longer than that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] FSM corruption leading to errors

2016-10-07 Thread Anastasia Lubennikova

06.10.2016 20:59, Pavan Deolasee:
I investigated a bug report from one of our customers and it looked 
very similar to previous bug reports here [1], [2], [3] (and probably 
more). In these reports, the error looks something like this:


ERROR:  could not read block 28991 in file "base/16390/572026": read 
only 0 of 8192 bytes


I traced it to the following code in MarkBufferDirtyHint(). The 
function returns without setting the DIRTY bit on the standby:


    /*
     * If we're in recovery we cannot dirty a page because of a hint.
     * We can set the hint, just not dirty the page as a result so the
     * hint is lost when we evict the page or shutdown.
     *
     * See src/backend/storage/page/README for longer discussion.
     */
    if (RecoveryInProgress())
        return;

freespace.c freely uses MarkBufferDirtyHint() whenever changes are
made to the FSM. I think that's usually alright because FSM changes
are not WAL logged, and if the FSM ever returns a block with less free
space than the caller needs, the caller is usually prepared to update
the FSM and request a new block. But if it returns a block that is
outside the size of the relation, then we're in trouble. The very next
ReadBuffer() fails to handle such a block and throws the error.


When a relation is truncated, the FSM is truncated too to remove 
references to the heap blocks that are being truncated. But since the 
FSM buffer may not be marked DIRTY on the standby, if the buffer gets 
evicted from the buffer cache, the on-disk copy of the FSM page may be 
left with references to the truncated heap pages. When the standby is 
later promoted to be the master, and an insert/update is attempted to 
the table, the FSM may return a block that is outside the valid range 
of the relation. That results in the said error.


Once this was clear, it was easy to put together a fully reproducible 
test case. See the attached script; you'll need to adjust to your 
environment. This affects all releases starting 9.3 and the script can 
reproduce the problem on all these releases.


I believe the fix is very simple. The FSM change during truncation is
critical, and the buffer must be marked by MarkBufferDirty(), i.e. those
changes must make it to disk. I think it's alright not to WAL log
them because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. 
But it must not be lost across a checkpoint. Also, since it happens 
only during relation truncation, I don't see any problem from 
performance perspective.
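
If I'm reading freespace.c right, that amounts to a one-line change in
FreeSpaceMapTruncateRel(), along the lines of:

-        MarkBufferDirtyHint(buf, false);
+        MarkBufferDirty(buf);

assuming that is indeed the MarkBufferDirtyHint() call in question.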


What bothers me is how to fix the problem for already affected 
standbys. If the FSM for some table is already corrupted at the 
standby, users won't notice it until the standby is promoted to be the 
new master. If the standby starts throwing errors suddenly after 
failover, it will be a very bad situation for the users, like we 
noticed with our customers. The fix is simple and users can just 
delete the FSM (and VACUUM the table), but that doesn't sound nice and 
they would not know until they see the problem.


One idea is to always check if the block returned by the FSM is
outside the range and discard such blocks after setting the FSM
(the attached patch does that). The problem with that approach is that
RelationGetNumberOfBlocks() is not really cheap, and invoking it
every time the FSM is consulted may not be a bright idea. Can we cache that
value in the RelationData or some such place (BulkInsertState?) and
use that as a hint for quickly checking if the block is (potentially)
outside the range and discard it? Any other ideas?
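
To make that concrete, a minimal sketch of the range check, assuming it is
placed wherever the caller consults the FSM (rel and spaceNeeded being
whatever that caller already has); this is not the attached patch:

    BlockNumber targetBlock = GetPageWithFreeSpace(rel, spaceNeeded);

    if (targetBlock != InvalidBlockNumber &&
        targetBlock >= RelationGetNumberOfBlocks(rel))
    {
        /*
         * Stale FSM entry pointing beyond the end of the relation: ignore
         * it and fall back to extending the relation, rather than letting
         * ReadBuffer() fail on a block that does not exist.
         */
        targetBlock = InvalidBlockNumber;
    }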


The other concern I have (and TBH that's what I initially thought was the
real problem, until I saw the RecoveryInProgress()-specific code) is: can
this also affect stand-alone masters? The comments at
MarkBufferDirtyHint() made me think so:


 * 3. This function does not guarantee that the buffer is always marked dirty
 *    (due to a race condition), so it cannot be used for important changes.


So I was working with a theory that somehow updates to the FSM page 
are lost because the race mentioned in the comment actually kicks in. 
But I'm not sure if the race is only possible when the caller is 
holding a SHARE lock on the buffer. When the FSM is truncated, the 
caller holds an EXCLUSIVE lock on the FSM buffer. So probably we're 
safe. I could not reproduce the issue on a stand-alone master. But 
probably worth checking.


It might also be a good idea to inspect other callers of 
MarkBufferDirtyHint() and see if any of them is vulnerable, especially 
from a standby perspective. I did one round, and couldn't see another 
problem.


Thanks,
Pavan

[1] https://www.postgresql.org/message-id/CAJakt-8%3DaXa-F7uFeLAeSYhQ4wFuaX3%2BytDuDj9c8Gx6S_ou%3Dw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20160601134819.30392.85...@wrigleys.postgresql.org
[3]

Re: [HACKERS] Is it time to kill support for very old servers?

2016-10-07 Thread Magnus Hagander
On Fri, Oct 7, 2016 at 4:06 PM, Tom Lane  wrote:

> pg_dump alleges support for dumping from servers back to 7.0.  Would v10
> be a good time to remove some of that code?  It's getting harder and
> harder to even compile those ancient branches, let alone get people to
> test against them (cf. 4806f26f9).  My initial thought is to cut support
> for pre-7.3 or maybe pre-7.4 servers, as that would allow removal of
> support for cases where the server lacks schemas or pg_depend, each of
> which requires a fair deal of klugery in pg_dump.
>

+1 for doing it, and also picking a version that's "easily explainable". If
we can say "now we support back to 8.0" that's a lot better than saying 8.1
or 7.4.



> In the same line, maybe we should kill libpq's support for V2 protocol
> (which would make the cutoff 7.4).  And maybe the server's support too,
> though that wouldn't save very much code.  The argument for cutting this
> isn't so much that we would remove lots of code as that we're removing
> code that never gets tested, at least not by us.
>

Which is a pretty strong argument in itself.



> One small problem with cutting libpq's V2 support is that the server's
> report_fork_failure_to_client() function still sends a V2-style message.
> We could change that in HEAD, certainly, but we don't really want modern
> libpq unable to parse such a message from an older server.  Possibly we
> could handle that specific case with a little special-purpose code and
> still be able to take out most of fe-protocol2.c.
>
> Thoughts?
>

I agree we want the new libpq to be able to deal with that one from old
versions, because 9.6 isn't really old yet. We *should* probably change it
in head for the future, but that means we have to keep it around in
libpq for quite a long while anyway.

Unless the special purpose code becomes long and complicated, I think
that's the best way to do it.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Oct 7, 2016 at 9:39 AM, Tom Lane  wrote:
>> There's more wrong than that, as you'd notice if you tried dumping
>> a DB that actually had some LOs in it :-(.  This obviously wasn't
>> tested on anything older than 9.0.

> Back in 2014, we talked about removing support for some older server versions:
> https://www.postgresql.org/message-id/24529.1415921...@sss.pgh.pa.us

> I think there have been other discussions, too, but I can't find them
> at the moment.

I just re-raised the subject:
https://www.postgresql.org/message-id/2661.1475849...@sss.pgh.pa.us

regards, tom lane




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Robert Haas
On Fri, Oct 7, 2016 at 9:39 AM, Tom Lane  wrote:
> Amit Langote  writes:
>> Just noticed that the getBlobs() query does not work for a 7.3 server
>> (maybe <= 7.3) due to the following change in commit 23f34fa4 [1]:
>
> Ugh.
>
>> I could fix that using the attached patch.
>
> There's more wrong than that, as you'd notice if you tried dumping
> a DB that actually had some LOs in it :-(.  This obviously wasn't
> tested on anything older than 9.0.
>
> Will push a fix in a bit, as soon as I can boot up my dinosaur with
> a working 7.0 server to test that branch.

Back in 2014, we talked about removing support for some older server versions:

https://www.postgresql.org/message-id/24529.1415921...@sss.pgh.pa.us

I think there have been other discussions, too, but I can't find them
at the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Is it time to kill support for very old servers?

2016-10-07 Thread Tom Lane
pg_dump alleges support for dumping from servers back to 7.0.  Would v10
be a good time to remove some of that code?  It's getting harder and
harder to even compile those ancient branches, let alone get people to
test against them (cf. 4806f26f9).  My initial thought is to cut support
for pre-7.3 or maybe pre-7.4 servers, as that would allow removal of
support for cases where the server lacks schemas or pg_depend, each of
which requires a fair deal of klugery in pg_dump.

In the same line, maybe we should kill libpq's support for V2 protocol
(which would make the cutoff 7.4).  And maybe the server's support too,
though that wouldn't save very much code.  The argument for cutting this
isn't so much that we would remove lots of code as that we're removing
code that never gets tested, at least not by us.

One small problem with cutting libpq's V2 support is that the server's
report_fork_failure_to_client() function still sends a V2-style message.
We could change that in HEAD, certainly, but we don't really want modern
libpq unable to parse such a message from an older server.  Possibly we
could handle that specific case with a little special-purpose code and
still be able to take out most of fe-protocol2.c.

Thoughts?

regards, tom lane




Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?

2016-10-07 Thread Robert Haas
On Thu, Oct 6, 2016 at 5:07 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Alternatively, get a bigger box.  :-)
>
> So what's it take to get access to hydra?

Send me a private email with your .ssh key.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-07 Thread Tom Lane
Amit Langote  writes:
> Just noticed that the getBlobs() query does not work for a 7.3 server
> (maybe <= 7.3) due to the following change in commit 23f34fa4 [1]:

Ugh.

> I could fix that using the attached patch.

There's more wrong than that, as you'd notice if you tried dumping
a DB that actually had some LOs in it :-(.  This obviously wasn't
tested on anything older than 9.0.

Will push a fix in a bit, as soon as I can boot up my dinosaur with
a working 7.0 server to test that branch.

regards, tom lane




Re: [HACKERS] Move allocation size overflow handling to MemoryContextAllocExtended()?

2016-10-07 Thread Robert Haas
On Wed, Oct 5, 2016 at 2:09 PM, Andres Freund  wrote:
> On 2016-10-04 21:40:29 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>> > On 2016-10-05 09:38:15 +0900, Michael Paquier wrote:
>> >> The existing interface of MemoryContextAlloc do not care much about
>> >> anything except Size, so I'd just give the responsability to the
>> >> caller to do checks like queue != (Size) queue when queue is a uint64
>> >> for example.
>>
>> > Well, that duplicates the check and error message everywhere.
>>
>> It seems like you're on the edge of reinventing calloc(), which is not an
>> API that anybody especially likes.
>
> I'm not sure how doing an s/Size/uint64/ in a bunch of APIs does
> that. Because that'd allow us to to throw an error in a number of useful
> cases where we currently can't.
>
> I'm mostly concerned that there's a bunch of cases on 32bit platforms
> where size_t is trivially overflowed. And being a bit more defensive
> against that seems like a good idea. It took about a minute (10s of that
> due to a typo) to find something that looks borked to me:
> bool
> spi_printtup(TupleTableSlot *slot, DestReceiver *self)
> {
>     if (tuptable->free == 0)
>     {
>         /* Double the size of the pointer array */
>         tuptable->free = tuptable->alloced;
>         tuptable->alloced += tuptable->free;
>         tuptable->vals = (HeapTuple *) repalloc_huge(tuptable->vals,
>                                      tuptable->alloced * sizeof(HeapTuple));
>     }
> seems like it could overflow quite easily on a 32bit system.
>
>
> People don't think about 32bit size_t a whole lot anymore, so I think by
> defaulting to 64bit calculations for this kind of thing, we'll probably
> prevent a number of future bugs.

I think you're right, but I also think that if we start using uint64
rather than Size for byte-lengths, it will spread like kudzu through
the whole system and we'll lose the documentation benefit of having
sizes be called "Size".  Since descriptive type names are a good
thing, I don't like that very much.  One crazy idea is to change Size
to always be 64 bits and fix all the places where we translate between
Size and size_t.  But I'm not sure that's a good idea, either.  This
might be one of those cases where it's best to just accept that we're
going to miss some things and fix them as we find them.
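
For what it's worth, a minimal sketch of the caller-side variant discussed
above, written against the spi_printtup() snippet Andres quoted
(illustrative only, not a proposal for the MemoryContextAllocExtended()
API itself):

    /* Refuse to double the array if the result could not be allocated anyway. */
    if (tuptable->alloced > MaxAllocHugeSize / sizeof(HeapTuple) / 2)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("SPI result set is too large")));

    tuptable->free = tuptable->alloced;
    tuptable->alloced += tuptable->free;
    tuptable->vals = (HeapTuple *) repalloc_huge(tuptable->vals,
                                                 tuptable->alloced * sizeof(HeapTuple));

On a 32-bit build that turns a silent size_t wraparound into an error; on a
64-bit build the extra comparison costs nothing measurable.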

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Question / requests.

2016-10-07 Thread Robert Haas
On Wed, Oct 5, 2016 at 10:58 AM, Francisco Olarte
 wrote:
> On Tue, Oct 4, 2016 at 7:50 PM, Robert Haas  wrote:
>> On Mon, Oct 3, 2016 at 5:44 PM, Alvaro Herrera  
>> wrote:
> ...
>>> I wonder if the real answer isn't just to disallow -f with parallel
>>> vacuuming.
>> Seems like we should figure out which catalog tables are needed in
>> order to perform a VACUUM, and force those to be done last and one at
>> a time.
>
> Is the system catalog a bottleneck for people who has real use for
> paralell vacuum?

I don't know, but it seems like the documentation for vacuumdb
currently says, more or less, "Hey, if you use -j with -f, it may not
work!", which seems unacceptable to me.  It should be the job of the
person writing the feature to make it work in all cases, not the job
of the person using the feature to work around the problem when it
doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] VACUUM's ancillary tasks

2016-10-07 Thread Robert Haas
On Thu, Oct 6, 2016 at 8:40 PM, Jeff Janes  wrote:
> In commit 37484ad2aacef5ec7, you changed the way that frozen tuples were
> represented, so that we could make freezing more aggressive without losing
> forensic evidence.  But I don't think we ever did anything to actually make
> the freezing more aggressive.

See 3cff1879f8d03cb729368722ca823a4bf74c0cac.  The objection to doing
it in other cases is that it adds write-ahead log volume which might
cause its own share of problems.  There might be some way of getting
ahead of that, though.  I think if we piggyback on an existing WAL
record like XLOG_HEAP2_CLEAN or XLOG_HEAP2_VISIBLE the impact might be
minimal, but I haven't been dedicated enough to try to write the
patch.

> When I applied the up-thread patch so that pgbench_history gets autovac,
> those autovacs don't actually cause any pages to get frozen until the wrap
> around kicks in, even when all the tuples on the early pages should be well
> beyond vacuum_freeze_min_age.

If the pages are already all-visible, they'll be skipped until
vacuum_freeze_table_age is reached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgbench vs. wait events

2016-10-07 Thread Alfred Perlstein

Robert,

This contention on WAL reminds me of another scenario I've heard about 
that was similar.


To fix things, what happened was that the first person to block would be
responsible for writing out the buffers for anyone blocked behind "him".


For example, suppose you have many threads: A, B, C, D.

While A is writing to WAL and holding the lock, B arrives and of course
blocks; then C comes along and blocks as well, then D.


Finally A finishes its write.

Now you have two options for resolving this, either

1) A drops its lock, B picks up the lock... B writes its buffer and then 
drops the lock.  Then C gets the lock, writes its buffer, drops the 
lock, then finally D gets the lock, writes its buffer and then drops the 
lock.


2) A then writes out B's, C's, and D's buffers, then A drops the lock;
B, C and D wake up, note that their respective buffers are already written,
and just return.  This greatly speeds up the system.  (Just be careful to
make sure A doesn't do "too much work", otherwise you can get a sort of
livelock if too many threads are blocked behind it; generally, only issue
one additional flush on behalf of other threads, do not "loop until the
queue is empty".)


I'm not sure if this is actually possible with the way WAL is
implemented (or perhaps this strategy is already implemented), but it's
definitely worth doing if not done already, as it can speed things up
enormously.
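
For what it's worth, here is a self-contained sketch of option 2 in plain C
with pthreads.  It is not PostgreSQL code, and every name in it is made up
for illustration:

#include <pthread.h>
#include <stdbool.h>

typedef struct FlushGroup
{
    pthread_mutex_t lock;
    pthread_cond_t  done;
    long            flushed_upto;    /* highest position known to be durable */
    long            requested_upto;  /* highest position anyone has asked for */
    bool            leader_active;   /* true while somebody is flushing */
} FlushGroup;

static void
flush_to(long upto)
{
    /* stand-in for "write out and fsync everything up to this position" */
    (void) upto;
}

void
wait_for_flush(FlushGroup *g, long my_pos)
{
    pthread_mutex_lock(&g->lock);

    if (my_pos > g->requested_upto)
        g->requested_upto = my_pos;

    while (g->flushed_upto < my_pos)
    {
        if (!g->leader_active)
        {
            /* Become the leader: one flush covering everything queued so far. */
            long    upto = g->requested_upto;

            g->leader_active = true;
            pthread_mutex_unlock(&g->lock);

            flush_to(upto);

            pthread_mutex_lock(&g->lock);
            g->leader_active = false;
            if (upto > g->flushed_upto)
                g->flushed_upto = upto;
            pthread_cond_broadcast(&g->done);
        }
        else
        {
            /* Somebody else is flushing; sleep until they are done. */
            pthread_cond_wait(&g->done, &g->lock);
        }
    }

    pthread_mutex_unlock(&g->lock);
}

If I read xlog.c correctly, XLogFlush() already works somewhat along these
lines: the backend that gets WALWriteLock flushes up to a point that can
cover requests queued behind it, and waiters recheck the flushed position
before trying to flush themselves.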


On 10/6/16 11:38 AM, Robert Haas wrote:

Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench.  I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on.  During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output.  First, I tried scale factor
3000 with 32 clients and got these results:

   1  LWLockTranche   | buffer_mapping
   9  LWLockNamed | CLogControlLock
  14  LWLockNamed | ProcArrayLock
  16  Lock| tuple
  25  LWLockNamed | CheckpointerCommLock
  49  LWLockNamed | WALBufMappingLock
 122  LWLockTranche   | clog
 182  Lock| transactionid
 287  LWLockNamed | XidGenLock
1300  Client  | ClientRead
1375  LWLockTranche   | buffer_content
3990  Lock| extend
   21014  LWLockNamed | WALWriteLock
   28497  |
   58279  LWLockTranche   | wal_insert

tps = 1150.803133 (including connections establishing)

What jumps out here is, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload.   Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test.  The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:

   1  BufferPin   | BufferPin
   1  LWLockTranche   | lock_manager
   2  LWLockTranche   | buffer_mapping
   8  LWLockNamed | ProcArrayLock
   9  LWLockNamed | CheckpointerCommLock
   9  LWLockNamed | CLogControlLock
  11  LWLockTranche   | buffer_content
  37  LWLockTranche   | clog
 153  Lock| tuple
 388  LWLockNamed | XidGenLock
 827  Lock| transactionid
1267  Client  | ClientRead
   20631  Lock| extend
   91767  |

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast
improvement.  The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture.  However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O.  This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:

  14  Lock| 

Re: [HACKERS] Complete LOCK TABLE ... IN ACCESS|ROW|SHARE

2016-10-07 Thread marllius ribeiro
This was my first test which had help Gerdan.

I did some tests and found nothing special. The stated resource is implemented 
correctly.
He passes all regression tests and enables the use of the new features 
specified.


The test was initially performed to verify that the features exist, however not 
effected, it follows the evidence:


postgres=# lock teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock teste IN ACCESS

postgres=# lock teste IN ACCESS

postgres=# lock t
TABLE  teste
postgres=# lock teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock TABLE
information_schema.  pg_temp_1.   pg_toast_temp_1. teste
pg_catalog.  pg_toast.public.
postgres=# lock TABLE teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock TABLE teste IN ACCESS

postgres=# lock TABLE teste IN ACCESS

postgres=# lock TABLE teste IN ACCESS

postgres=# lock TABLE teste IN SHARE

postgres=# lock TABLE teste IN ROW

postgres=# lock TABLE teste IN ROW

postgres=# lock TABLE teste IN ROW


-

After applying the patch, the new completions appear:


postgres=# lock table teste IN ACCESS
EXCLUSIVE MODE  SHARE MODE
postgres=# lock table teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
   SHARE ROW EXCLUSIVE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
   SHARE UPDATE EXCLUSIVE MODE
postgres=# lock table teste IN ACCESS
EXCLUSIVE MODE  SHARE MODE
postgres=# lock table teste IN ACCESS
EXCLUSIVE MODE  SHARE MODE
postgres=# lock table teste IN ROW
EXCLUSIVE MODE  SHARE MODE
postgres=# lock table teste IN SHARE
MODE   ROW EXCLUSIVE MODE UPDATE EXCLUSIVE MODE
postgres=# lock table teste IN SHARE
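
For completeness, the statements these new completions lead to are the two
lock modes that were previously missing from the list; a quick sketch
(assuming the test table teste exists):

    BEGIN;
    LOCK TABLE teste IN SHARE UPDATE EXCLUSIVE MODE;
    LOCK TABLE teste IN SHARE ROW EXCLUSIVE MODE;
    COMMIT;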

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Complete LOCK TABLE ... IN ACCESS|ROW|SHARE

2016-10-07 Thread marllius ribeiro
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

This was my first review; I had help from Gerdan.

I did some tests and found nothing unusual. The stated feature is implemented
correctly.
The patch passes all regression tests and enables the use of the new
completions as specified.


The test was first run without the patch, to verify that the completions were
not yet present; the evidence follows:


postgres=# lock teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock teste IN ACCESS

postgres=# lock teste IN ACCESS

postgres=# lock t
TABLE  teste
postgres=# lock teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock TABLE
information_schema.  pg_temp_1.   pg_toast_temp_1. teste
pg_catalog.  pg_toast.public.
postgres=# lock TABLE teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
postgres=# lock TABLE teste IN ACCESS

postgres=# lock TABLE teste IN ACCESS

postgres=# lock TABLE teste IN ACCESS

postgres=# lock TABLE teste IN SHARE

postgres=# lock TABLE teste IN ROW

postgres=# lock TABLE teste IN ROW

postgres=# lock TABLE teste IN ROW


-

After applying the patch, the new completions appear:


postgres=# lock table teste IN ACCESS
EXCLUSIVE MODE  SHARE MODE
postgres=# lock table teste IN
ACCESS EXCLUSIVE MODEEXCLUSIVE MODE   ROW SHARE MODE
   SHARE ROW EXCLUSIVE MODE
ACCESS SHARE MODEROW EXCLUSIVE MODE   SHARE MODE
   SHARE UPDATE EXCLUSIVE MODE
postgres=# lock table teste IN ACCESS
EXCLUSIVE MODE  SHARE MODE
postgres=# lock table teste IN ACCESS
EXCLUSIVE MODE  SHARE MODE
postgres=# lock table teste IN ROW
EXCLUSIVE MODE  SHARE MODE
postgres=# lock table teste IN SHARE
MODE   ROW EXCLUSIVE MODE UPDATE EXCLUSIVE MODE
postgres=# lock table teste IN SHARE


The new status of this patch is: Ready for Committer

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind and ssl test suites don't work if '.' is not in @INC

2016-10-07 Thread Michael Paquier
On Fri, Oct 7, 2016 at 8:54 PM, Heikki Linnakangas  wrote:
> I think we should fix this by the attached. Any better ideas?

It would be better to update PG_PROVE_FLAGS in src/Makefile.global.in,
because any test suite having its own .pm file will otherwise fail. It is
also necessary to refresh $ENV{PERL5LIB} in vcregress.pl.

> I'm a bit surprised no-one else has reported this yet. Have I missed a
> report?

Andres has reported that a couple of weeks back, on Debian testing like you:
https://www.postgresql.org/message-id/20160908204529.flg6nivjuwp5v...@alap3.anarazel.de
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_rewind and ssl test suites don't work if '.' is not in @INC

2016-10-07 Thread Heikki Linnakangas

On my system running Debian stretch (= testing):

~/git-sandbox-pgsql/96stable/src/bin/pg_rewind (REL9_6_STABLE)$ make check
rm -rf '/home/heikki/git-sandbox-pgsql/96stable'/tmp_install
/bin/mkdir -p '/home/heikki/git-sandbox-pgsql/96stable'/tmp_install/log
make -C '../../..' 
DESTDIR='/home/heikki/git-sandbox-pgsql/96stable'/tmp_install install 
>'/home/heikki/git-sandbox-pgsql/96stable'/tmp_install/log/install.log 2>&1
rm -rf 
/home/heikki/git-sandbox-pgsql/96stable/src/bin/pg_rewind/tmp_check/log
cd . && 
TESTDIR='/home/heikki/git-sandbox-pgsql/96stable/src/bin/pg_rewind' 
PATH="/home/heikki/git-sandbox-pgsql/96stable/tmp_install/home/heikki/pgsql.96stable/bin:$PATH" 
LD_LIBRARY_PATH="/home/heikki/git-sandbox-pgsql/96stable/tmp_install/home/heikki/pgsql.96stable/lib" 
PGPORT='65432' 
PG_REGRESS='/home/heikki/git-sandbox-pgsql/96stable/src/bin/pg_rewind/../../../src/test/regress/pg_regress' 
prove -I ../../../src/test/perl/ --verbose t/*.pl

t/001_basic.pl 
1..8
Can't locate RewindTest.pm in @INC (you may need to install the 
RewindTest module) (@INC contains: 
/home/heikki/git-sandbox-pgsql/96stable/src/bin/pg_rewind/../../../src/test/perl 
/etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.24.1 
/usr/local/share/perl/5.24.1 /usr/lib/x86_64-linux-gnu/perl5/5.24 
/usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.24 
/usr/share/perl/5.24 /usr/local/lib/site_perl 
/usr/lib/x86_64-linux-gnu/perl-base) at t/001_basic.pl line 6.

BEGIN failed--compilation aborted at t/001_basic.pl line 6.

That's not nice. Perl's @INC usually contains the current directory, and 
those test suites rely on that, but that changed recently on Debian 
stretch, and other distributions are likely to follow. There's more 
information on this at https://www.debian.org/security/2016/dsa-3628.


I think we should fix this by the attached. Any better ideas?

I'm a bit surprised no-one else has reported this yet. Have I missed a 
report?


- Heikki
>From d6f4eea7ee41a1d2ab0335bd126bf2cca7017cdd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 7 Oct 2016 14:40:43 +0300
Subject: [PATCH 1/1] Make TAP test suites to work, when @INC does not contain
 current dir.

Perl and/or new Linux distributions are starting to remove "." from
the @INC list by default. That breaks the pg_rewind and ssl test suites,
which use helper perl modules that reside in the same directory. To fix,
add the correct directory explicitly to prove's include dir.
---
 src/bin/pg_rewind/Makefile | 2 ++
 src/test/ssl/Makefile  | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/src/bin/pg_rewind/Makefile b/src/bin/pg_rewind/Makefile
index d03a0a2..af4a800 100644
--- a/src/bin/pg_rewind/Makefile
+++ b/src/bin/pg_rewind/Makefile
@@ -18,6 +18,8 @@ include $(top_builddir)/src/Makefile.global
 PG_CPPFLAGS = -I$(libpq_srcdir)
 PG_LIBS = $(libpq_pgport)
 
+PG_PROVE_FLAGS += -I$(top_srcdir)/$(subdir)
+
 override CPPFLAGS := -I$(libpq_srcdir) -DFRONTEND $(CPPFLAGS)
 
 OBJS	= pg_rewind.o parsexlog.o xlogreader.o datapagemap.o timeline.o \
diff --git a/src/test/ssl/Makefile b/src/test/ssl/Makefile
index 2b04d82..c65791c 100644
--- a/src/test/ssl/Makefile
+++ b/src/test/ssl/Makefile
@@ -13,6 +13,8 @@ subdir = src/test/ssl
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
+PG_PROVE_FLAGS += -I$(top_srcdir)/$(subdir)
+
 CERTIFICATES := server_ca server-cn-and-alt-names \
 	server-cn-only server-single-alt-name server-multiple-alt-names \
 	server-no-names server-revoked server-ss \
-- 
2.9.3


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-10-07 Thread Amit Langote
On 2016/10/07 18:27, Ashutosh Bapat wrote:
> It's allowed to specify a non-default opclass in the partition by clause,
> but I do not see any testcase testing the same. Can you please add
> one.

OK, I will add some tests related to that.
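
(For illustration, such a test might be along these lines; the table names
are hypothetical, text_pattern_ops is just one example of a non-default
opclass, and the syntax follows the examples used elsewhere in this thread:

    CREATE TABLE oc_parted (a text) PARTITION BY RANGE (a text_pattern_ops);
    CREATE TABLE oc_parted_p1 PARTITION OF oc_parted
        FOR VALUES START ('a') END ('m');
)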

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind: Should abort if both --source-pgdata and --source-server are specified

2016-10-07 Thread Heikki Linnakangas

On 10/07/2016 01:34 PM, Michael Banck wrote:

Hi,

ISTM that pg_rewind's --source-pgdata and --source-server options are
mutually exclusive, and the synopsis in the documentation seems to
indicate that as well:

|pg_rewind [option...] {-D | --target-pgdata} directory
|{--source-pgdata=directory | --source-server=connstr}

However, there is no such check in the code.

I've seen people assume --source-pgdata is supposed to be the data
directory location on the remote server if they specify --source-server
as well, and are then confused by error messages.

So I think pg_rewind should abort in this case.  Patch for that attached.


Agreed. Committed, thanks!

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Radix tree for character conversion

2016-10-07 Thread Heikki Linnakangas

On 10/07/2016 11:36 AM, Kyotaro HORIGUCHI wrote:

The radix conversion function and map conversion script became
more generic than before, so I could easily add radix conversion
for EUC_JP in addition to Shift JIS.

nm -S said that the size of the radix tree data for sjis->utf8
conversion is 34kB and that for utf8->sjis is 46kB (eucjp->utf8
57kB, utf8->eucjp 93kB). LUmapSJIS and ULmapSJIS were 62kB and
59kB, and LUmapEUC_JP and ULmapEUC_JP were 106kB and 105kB. If I'm
not missing something, the radix tree is faster and requires less
memory.


Cool!


Currently the tree structure is divided into several elements:
one for 2-byte codes, others for 3-byte and 4-byte codes, and an
output table. Everything except the last one could logically and
technically be merged into a single table, but that would make the
generator script far more complex than it already is. I no longer
want to play hide-and-seek with complex Perl objects.


I think that's OK. There isn't really anything to gain by merging them.


It might be better to make this a native feature of the core.
Currently the helper function is in core, but it is passed as
conv_func when calling LocalToUtf.


Yeah, I think we want to completely replace the current binary-search 
based code with this. I would rather maintain just one mechanism.



The current implementation uses the *.map files of pg_utf_to_local
as input. That doesn't seem great, but the radix tree files are
completely uneditable by hand. Providing custom loading functions
for every source, instead of load_chartable(), would be the way to go.

# However, utf8_to_sjis.map, for example, does not seem to have been
# generated from the source mentioned in UCS_to_SJIS.pl


Ouch. We should find and document an authoritative source for all the 
mappings we have...


I think the next steps here are:

1. Find an authoritative source for all the existing mappings.
2. Generate the radix tree files directly from the authoritative 
sources, instead of the existing *.map files.

3. Completely replace the existing binary-search code with this.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_rewind: Should abort if both --source-pgdata and --source-server are specified

2016-10-07 Thread Michael Banck
Hi,

ISTM that pg_rewind's --source-pgdata and --source-server options are
mutually exclusive, and the synopsis in the documentation seems to
indicate that as well: 

|pg_rewind [option...] {-D | --target-pgdata} directory
|{--source-pgdata=directory | --source-server=connstr}

However, there is no such check in the code.

I've seen people assume --source-pgdata is supposed to be the data
directory location on the remote server if they specify --source-server
as well, and are then confused by error messages.

So I think pg_rewind should abort in this case.  Patch for that attached.

Thoughts?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 5fdd4c5..dd62dd0 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -162,6 +162,13 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (datadir_source != NULL && connstr_source != NULL)
+	{
+		fprintf(stderr, _("%s: only one of --source-pgdata or --source-server can be specified\n"), progname);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+		exit(1);
+	}
+
 	if (datadir_target == NULL)
 	{
 		fprintf(stderr, _("%s: no target data directory specified (--target-pgdata)\n"), progname);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_stat_statements and non default search_path

2016-10-07 Thread Julien Rouhaud
Hello,

I've faced multiple times a painful limitation with pg_stat_statements.
Queries are normalized based on the textual SQL statements, and the
queryid is computed using objects' oids. So for instance with different
search_path settings, we can end up with the same normalized query but
different queryids, and there's no way to know which actual
relations each query is using. It also means that it can be almost
impossible to use the normalized query to replay a query.
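
A minimal way to reproduce the problem (assuming pg_stat_statements is
installed and tracking is enabled):

    CREATE SCHEMA s1;
    CREATE SCHEMA s2;
    CREATE TABLE s1.t (a int);
    CREATE TABLE s2.t (a int);

    SET search_path = s1;
    SELECT count(*) FROM t;
    SET search_path = s2;
    SELECT count(*) FROM t;

    -- two entries with the same normalized text but different queryids,
    -- and nothing telling us which "t" each of them refers to
    SELECT queryid, query, calls
      FROM pg_stat_statements
     WHERE query LIKE '%FROM t%';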

I see two ways of fixing this. The first would be to store normalized
queries with fully qualified relation names. After a quick look,
this can't be done without storing at least a token location in
RangeTblEntry nodes, which sounds like a bad idea.

The other way would be to store the value of search_path with each pgss
entry (either in shared_memory or in the external file).

Is it something that we should fix, and if yes is any of this
acceptable, or does someone see another way to solve this problem?

Regards.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-07 Thread Tomas Vondra

On 10/05/2016 10:03 AM, Amit Kapila wrote:

On Wed, Oct 5, 2016 at 12:05 PM, Tomas Vondra
 wrote:

Hi,

After collecting a lot more results from multiple kernel versions, I can
confirm that I see a significant improvement with 128 and 192 clients,
roughly by 30%:

                        64     128     192

 master             62482   43181   50985
 granular-locking   61701   59611   47483
 no-content-lock    62650   59819   47895
 group-update       63702   64758   62596

But I only see this with Dilip's workload, and only with pre-4.3.0 kernels
(the results above are from kernel 3.19).



That appears positive.



I got access to a large machine with 72/144 cores (thanks to Oleg and 
Alexander from Postgres Professional), and I'm running the tests on that 
machine too.


Results from Dilip's workload (with scale 300, unlogged tables) look 
like this:


                        32      64     128     192     224     256     288
  master            104943  128579   72167  100967   66631   97088   63767
  granular-locking  103415  141689   83780  120480   71847  115201   67240
  group-update      105343  144322   92229  130149   81247  126629   76638
  no-content-lock   103153  140568   80101  119185   70004  115386   66199

So there's some 20-30% improvement for >= 128 clients.

But what I find much more intriguing is the zig-zag behavior. I mean, 64 
clients give ~130k tps, 128 clients only give ~70k but 192 clients jump 
up to >100k tps again, etc.


FWIW I don't see any such behavior on pgbench, and all those tests were 
done on the same cluster.



With 4.5.5, results for the same benchmark look like this:

                        64     128     192

 master             35693   39822   42151
 granular-locking   35370   39409   41353
 no-content-lock    36201   39848   42407
 group-update       35697   39893   42667

That seems like a fairly bad regression in kernel, although I have not
identified the feature/commit causing it (and it's also possible the issue
lies somewhere else, of course).

With regular pgbench, I see no improvement on any kernel version. For
example on 3.19 the results look like this:

                        64     128     192

 master             54661   61014   59484
 granular-locking   55904   62481   60711
 no-content-lock    56182   62442   61234
 group-update       55019   61587   60485



Are the above results with synchronous_commit=off?



No, but I can do that.


I haven't done much more testing (e.g. with -N to eliminate
collisions on branches) yet, let's see if it changes anything.



Yeah, let us see how it behaves with -N. Also, I think we could try
at higher scale factor?



Yes, I plan to do that. In total, I plan to test combinations of:

(a) Dilip's workload and pgbench (regular and -N)
(b) logged and unlogged tables
(c) scale 300 and scale 3000 (both fit into RAM)
(d) sync_commit=on/off
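
For (d), synchronous_commit can be toggled between runs without a restart,
for example:

    ALTER SYSTEM SET synchronous_commit = off;
    SELECT pg_reload_conf();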

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-10-07 Thread Ashutosh Bapat
It's allowed to specify a non-default opclass in the partition by clause,
but I do not see any testcase testing the same. Can you please add
one.

On Fri, Oct 7, 2016 at 2:50 PM, Amit Langote
 wrote:
> On 2016/10/07 17:33, Rajkumar Raghuwanshi wrote:
>> On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote wrote:
>> I have applied latest patches, getting some error and crash, please check
>> if you are also able to reproduce the same.
>>
>> Observation1 : Not able to create index on partition table.
>> --
>> CREATE TABLE rp (c1 int, c2 int) PARTITION BY RANGE(c1);
>> CREATE TABLE rp_p1 PARTITION OF rp FOR VALUES START (1) END (10);
>> CREATE TABLE rp_p2 PARTITION OF rp FOR VALUES START (10) END (20);
>>
>> CREATE INDEX idx_rp_c1 on rp(c1);
>> ERROR:  cannot create index on partitioned table "rp"
>
> This one is a new behavior as I mentioned earlier in my previous email.
>
>> Observation2 : Getting cache lookup failed error for multiple column range
>> partition
>> --
>> CREATE TABLE rp1_m (c1 int, c2 int) PARTITION BY RANGE(c1, ((c1 + c2)/2));
>>
>> CREATE TABLE rp1_m_p1 PARTITION OF rp1_m FOR VALUES START (1, 1) END (10,
>> 10);
>> ERROR:  cache lookup failed for attribute 0 of relation 16429
>>
>> Observation3 : Getting server crash with multiple column range partition
>> --
>> CREATE TABLE rp2_m (c1 int, c2 int) PARTITION BY RANGE(((c2 + c1)/2), c2);
>> CREATE TABLE rp2_m_p1 PARTITION OF rp2_m FOR VALUES START (1, 1) END (10,
>> 10);
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> !>
>
> Fixed.  Should have been caught when running the regression tests after I
> made changes to 0001 to address some review comments (apparently there was
> no test in 0003 that would've caught this, so added a new one, thanks!).
>
>
> Attached updated patches.
>
> Thanks,
> Amit



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-10-07 Thread Tomas Vondra


On 10/06/2016 07:36 AM, Pavan Deolasee wrote:



On Wed, Oct 5, 2016 at 1:43 PM, Tomas Vondra
> wrote:


...

I can confirm the significant speedup, often by more than 75%
(depending on number of indexes, whether the data set fits into RAM,
etc.). Similarly for the amount of WAL generated, although that's a
bit more difficult to evaluate due to full_page_writes.

I'm not going to send detailed results, as that probably does not
make much sense at this stage of the development - I can repeat the
tests once the open questions get resolved.


Sure. Anything that stands out? Any regression that you see? I'm not
sure if your benchmarks exercise the paths which might show overheads
without any tangible benefits. For example, I wonder if a test with many
indexes where most of them get updated and then querying the table via
those updated indexes could be one such test case.



No, nothing that would stand out. Let me explain what benchmark(s) I've 
done. I've made some minor mistakes when running the benchmarks, so I 
plan to rerun them and post the results after that. So let's take the 
data with a grain of salt.


My goal was to compare current non-HOT behavior (updating all indexes) 
with the WARM (updating only indexes on modified columns), and I've 
taken two approaches:


1) fixed number of indexes, update variable number of columns

Create a table with 8 secondary indexes and then run a bunch of 
benchmarks updating an increasing number of columns. So the first run did


UPDATE t SET c1 = c1+1 WHERE id = :id;

while the second did

UPDATE t SET c1 = c1+1, c2 = c2+1 WHERE id = :id;

and so on, up to updating all the columns in the last run. I've used 
multiple scripts to update all the columns / indexes uniformly 
(essentially using multiple "-f" flags with pgbench). The runs were 
fairly long (2h, enough to get stable behavior).
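
The exact schema isn't shown here, but the setup was along these lines (a
sketch, not the actual script):

    CREATE TABLE t (
        id bigint PRIMARY KEY,
        c1 int, c2 int, c3 int, c4 int,
        c5 int, c6 int, c7 int, c8 int
    );
    CREATE INDEX ON t (c1);
    CREATE INDEX ON t (c2);
    CREATE INDEX ON t (c3);
    CREATE INDEX ON t (c4);
    CREATE INDEX ON t (c5);
    CREATE INDEX ON t (c6);
    CREATE INDEX ON t (c7);
    CREATE INDEX ON t (c8);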


For a small data set (fits into RAM), the results look like this:

         master  patched   diff
   1      5994     8490   +42%
   2      4347     7903   +81%
   3      4340     7400   +70%
   4      4324     6929   +60%
   5      4256     6495   +52%
   6      4253     5059   +19%
   7      4235     4534    +7%
   8      4194     4237    +1%

and the amount of WAL generated (after correction for tps difference) 
looks like this (numbers are MBs)


         master  patched   diff
   1     27257    18508   -32%
   2     21753    14599   -33%
   3     21912    15864   -28%
   4     22021    17135   -22%
   5     21819    18258   -16%
   6     21929    20659    -6%
   7     21994    22234    +1%
   8     21851    23267    +6%

So this is quite significant difference. I'm pretty sure the minor WAL 
increase for the last two runs is due to full page writes (which also 
affects the preceding runs, making the WAL reduction smaller than the 
tps increase).


I do have results for larger data sets (>RAM), the results are very 
similar although the speedup seems a bit smaller. But I need to rerun those.


2) single-row update, adding indexes between runs

This is kinda the opposite of the previous approach, i.e. transactions 
always update a single column (multiple scripts to update the columns 
uniformly), but there are new indexes added between runs. The results 
(for a large data set, exceeding RAM) look like this:


     master  patched   diff
0   954 1404 +47%
1   701 1045 +49%
2   484  816 +70%
3   346  683 +97%
4   248  608 +145%
5   190  525 +176%
6   152  397 +161%
7   123  315 +156%
8   123  270 +119%

So this looks really interesting.



There's a lot of useful and important feedback in the thread(s) so
far, particularly the descriptions of various failure cases. I think
it'd be very useful to collect those examples and turn them into
regression tests - that's something the patch should include anyway.


Sure. I added only a handful test cases which I knew regression isn't
covering. But I'll write more of them. One good thing is that the code
gets heavily exercised even during regression. I caught and fixed
multiple bugs running regression. I'm not saying that's enough, but it
certainly gives some confidence.



I don't see any changes to src/test in the patch, so I'm not sure what 
you mean when you say you added a handful of test cases?





and update:

update t set a = a+1, b=b+1;

which has to update all indexes on the table, but:

select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables

 n_tup_upd | n_tup_hot_upd
---+---
  1000 |  1000

So it's still counted as "WARM" - does it make sense?


No, it does not. The code currently just marks any update as a WARM
update if the 

Re: [HACKERS] VACUUM's ancillary tasks

2016-10-07 Thread Masahiko Sawada
On Fri, Oct 7, 2016 at 6:46 AM, Robert Haas  wrote:
> On Thu, Oct 6, 2016 at 3:56 PM, Jeff Janes  wrote:
>>> Sure, I could handle each case separately, but the goal of this patch
>>> (as hinted at in the Subject) is to generalize all the different tasks
>>> we've been giving to VACUUM.  The only missing piece is what the first
>>> patch addresses; which is insert-mostly tables never getting vacuumed.
>>
>> I don't buy the argument that we should do this in order to be "general".
>> Those other pieces are present to achieve a specific job, not out of
>> generality.
>
> +1.
>
>> If we want to have something to vacuum insert-mostly tables for the sake of
>> the index-only-scans, then I don't think we can ignore the fact that the
>> visibility map is page based, not tuple based.  If we insert 10,000 tuples
>> into a full table and they all go into 100 pages at the end, that is very
>> different than inserting 10,000 tuples which each go into a separate page
>> (because each page has that amount of freespace).  In one case you have
>> 10,000 tuples not marked as all visible, in the other case you have
>> 1,000,000 not marked as all visible.
>
> +1.
>
>> Also, I think that doing more counts which get amalgamated into the same
>> threshold, rather than introducing another parameter, is a bad thing.  I
>> have insert-mostly, most of the time, tables which are never going to
>> benefit from index-only-scans, and I don't want to pay the cost of them
>> getting vacuumed just because of some inserts, when I will never get a
>> benefit out of it.  I could turn autovacuum off for those tables, but then I
>> would have to remember to manually intervene on the rare occasion they do
>> get updates or deletes.  I want to be able to manually pick which tables I
>> sign up for this feature (and then forget it), not manually pick which ones
>> to exempt from it and have to constantly review that.
>
> Of course, if you do that, then what will happen is eventually it will
> be time to advance relfrozenxid for that relation, and you'll get a
> giant soul-crushing VACUUM of doom, rather than a bunch of small,
> hopefully-innocuous VACUUMs.  I've been wondering if would make sense
> to trigger vacuums based on inserts based on a *fixed* number of
> pages, rather than a percentage of the table.  Say, for example,
> whenever we have 64MB of not-all-visible pages, we VACUUM.

+1
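
(For what it's worth, the amount of not-all-visible data can already be
estimated from pg_class, although relallvisible is only refreshed by
VACUUM/ANALYZE; a rough sketch, assuming the default 8kB block size:

    SELECT relname,
           relpages - relallvisible AS not_all_visible_pages,
           pg_size_pretty((relpages - relallvisible) * 8192::bigint) AS approx_size
      FROM pg_class
     WHERE relkind = 'r'
     ORDER BY 2 DESC
     LIMIT 10;
)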

>
> There are some difficulties:
>
> 1. We don't want to vacuum too early.  For example, a bulk load may
> vastly inflate the size of a table, but vacuuming it while the load is
> in progress will be useless.  You can maybe avoid that problem by
> basing this on statistics only reported at the end of the transaction,
> but there's another, closely-related issue: vacuuming right after the
> transaction completes may be useless, too, if there are old snapshots
> still in existence that render the newly-inserted tuples
> not-all-visible.

If a dummy xid could be generated before starting a vacuum for
visibility-map maintenance (one triggered by the amount of cleared VM
pages), that vacuum could wait until old transactions finish.


> 2. We don't want to trigger index vacuuming for a handful of dead
> tuples, or at least not too often.  I've previously suggested
> requiring a certain minimum number of dead index tuples that would be
> required before index vacuuming occurs; prior to that, we'd just
> truncate to dead line pointers and leave those for the next vacuum
> cycle.  This might need some refinement - should it be page-based? -
> but the basic idea still seems sound.
>

Where do we leave the dead line pointers?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-10-07 Thread Rajkumar Raghuwanshi
On Thu, Oct 6, 2016 at 12:44 PM, Amit Langote  wrote:

>
> Attached revised patches.  Also, includes a fix for an issue reported by
> Rajkumar Raghuwanshi [1] which turned out to be a bug in one of the later
> patches.  I will now move on to addressing the comments on patch 0003.
>
> Thanks a lot for the review!
>
> Thanks,
> Amit
>
> [1]
> https://www.postgresql.org/message-id/5dded2f1-c7f6-e7fc-56b
> 5-23ab59495...@lab.ntt.co.jp


Hi,

I have applied latest patches, getting some error and crash, please check
if you are also able to reproduce the same.

Observation1 : Not able to create index on partition table.
--
CREATE TABLE rp (c1 int, c2 int) PARTITION BY RANGE(c1);
CREATE TABLE rp_p1 PARTITION OF rp FOR VALUES START (1) END (10);
CREATE TABLE rp_p2 PARTITION OF rp FOR VALUES START (10) END (20);

CREATE INDEX idx_rp_c1 on rp(c1);
ERROR:  cannot create index on partitioned table "rp"

Observation2 : Getting cache lookup failed error for multiple column range
partition
--
CREATE TABLE rp1_m (c1 int, c2 int) PARTITION BY RANGE(c1, ((c1 + c2)/2));

CREATE TABLE rp1_m_p1 PARTITION OF rp1_m FOR VALUES START (1, 1) END (10,
10);
ERROR:  cache lookup failed for attribute 0 of relation 16429

Observation3 : Getting server crash with multiple column range partition
--
CREATE TABLE rp2_m (c1 int, c2 int) PARTITION BY RANGE(((c2 + c1)/2), c2);
CREATE TABLE rp2_m_p1 PARTITION OF rp2_m FOR VALUES START (1, 1) END (10,
10);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [HACKERS] VACUUM's ancillary tasks

2016-10-07 Thread Masahiko Sawada
On Fri, Oct 7, 2016 at 9:40 AM, Jeff Janes  wrote:
> On Thu, Oct 6, 2016 at 2:46 PM, Robert Haas  wrote:
>>
>>
>> > Also, I think that doing more counts which get amalgamated into the same
>> > threshold, rather than introducing another parameter, is a bad thing.  I
>> > have insert-mostly, most of the time, tables which are never going to
>> > benefit from index-only-scans, and I don't want to pay the cost of them
>> > getting vacuumed just because of some inserts, when I will never get a
>> > benefit out of it.  I could turn autovacuum off for those tables, but
>> > then I
>> > would have to remember to manually intervene on the rare occasion they
>> > do
>> > get updates or deletes.  I want to be able to manually pick which tables
>> > I
>> > sign up for this feature (and then forget it), not manually pick which
>> > ones
>> > to exempt from it and have to constantly review that.
>>
>> Of course, if you do that, then what will happen is eventually it will
>> be time to advance relfrozenxid for that relation, and you'll get a
>> giant soul-crushing VACUUM of doom, rather than a bunch of small,
>> hopefully-innocuous VACUUMs.
>
>
> I think I will get the soul-crushing vacuum of doom anyway, if the database
> lasts that long.  For one reason, in my case while updates and deletes are
> rare, they are common enough that the frozen vm bits from early vacuums will
> be mostly cleared before the vacuum of doom comes around.
>
> For a second reason, I don't think the frozen bit in the vm actually gets
> set by much of anything other than an autovacuum-for-wraparound or a manual
> VACUUM FREEZE.

Yeah, the freeze map would be especially effective for static tables.
Since we can skip vacuuming frozen pages and freezing tuples is usually
not a big pain, we might want to make autovacuum more aggressive about
freezing pages by reducing the default value of vacuum_freeze_min_age.
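
For example, that could be done per table via the storage parameter, or
globally (the table name below is just a placeholder):

    ALTER TABLE some_static_table SET (autovacuum_freeze_min_age = 0);
    -- or globally:
    ALTER SYSTEM SET vacuum_freeze_min_age = 1000000;
    SELECT pg_reload_conf();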

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-07 Thread Ashutosh Bapat
On Thu, Oct 6, 2016 at 2:52 PM, Amit Langote
 wrote:
> On 2016/10/06 17:45, Ashutosh Bapat wrote:
>> On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada  
>> wrote:
>>> On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat 
>>>  wrote:
> My understanding is that basically the local server can not return
> COMMIT to the client until 2nd phase is completed.

 If we do that, the local server may not return to the client at all,
 if the foreign server crashes and never comes up. Practically, it may
 take much longer to finish a COMMIT, depending upon how long it takes
 for the foreign server to reply to a COMMIT message.
>>>
>>> Yes, I think 2PC behaves so, please refer to [1].
>>> To prevent the local server from being stuck forever due to a communication
>>> failure, we could provide a timeout on the coordinator side or on the
>>> participant side.
>>
>> This too, looks like a heuristic and shouldn't be the default
>> behaviour and hence not part of the first version of this feature.
>
> At any rate, the coordinator should not return to the client until after
> the 2nd phase is completed, which was the original point.  If COMMIT
> taking longer is an issue, then it could be handled with one of the
> approaches mentioned so far (even if not in the first version), but no
> version of this feature should really return COMMIT to the client after
> finishing only the first phase.  Am I missing something?

There is a small time window between the actual COMMIT and the commit
message being returned. The actual commit happens when we insert a WAL
record saying transaction X committed, and then we return to the client
saying a COMMIT happened. Note that a transaction may be committed yet we
never return to the client with a commit message, because the connection
was lost or the server crashed. I hope we agree on this.

COMMITTING the foreign prepared transactions happens after we COMMIT
the local transaction. If we do it before COMMITTING the local transaction
and the local server crashes, we will roll back the local transaction
during subsequent recovery while the foreign segments have committed,
resulting in an inconsistent state.
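
(For reference, the per-foreign-server primitives involved here are the
existing two-phase commands; a minimal sketch, assuming
max_prepared_transactions > 0 on the foreign server:

    BEGIN;
    -- work done on the foreign server on behalf of the distributed transaction
    PREPARE TRANSACTION 'fdw_xact_1';   -- phase 1, before the local COMMIT

    COMMIT PREPARED 'fdw_xact_1';       -- phase 2, after the local COMMIT succeeds
    -- or, if the local transaction aborted instead:
    -- ROLLBACK PREPARED 'fdw_xact_1';
)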

If we are successful in COMMITTING the foreign transactions during the
post-commit phase, the COMMIT message will be returned after we have
committed all foreign transactions. But if we cannot reach a foreign
server and the request times out, we cannot revert our decision that we
are going to commit the transaction. That's my answer to the
timeout-based heuristic.

I don't see much point in holding up post-commit processing for a
non-responsive foreign server, which may not respond for days
together. Can you please elaborate on a use case? Which commercial
transaction manager does that?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers