Re: WIP: Covering + unique indexes.

2018-04-18 Thread Teodor Sigaev

Thank you, pushed.


Actually, I see one tiny issue with extra '*' characters here:


+* The number of attributes won't be explicitly represented if the
+* negative infinity tuple was generated during a page split that
+* occurred with a version of Postgres before v11.  There must be a
+* problem when there is an explicit representation that is
+* non-zero, * or when there is no explicit representation and the
+* tuple is * evidently not a pre-pg_upgrade tuple.


I also suggest fixing this indentation before commit:


+   /*
+*Cannot leak memory here, TupleDescCopy() doesn't allocate any
+* inner structure, so, plain pfree() should clean all allocated memory
+*/


fixed

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



RE: Speedup of relation deletes during recovery

2018-04-18 Thread Tsunakawa, Takayuki
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> Yeah, it's worth working on this problem. To decrease the number of scans
> of shared_buffers, you would need to change the order of truncations of
> files and WAL logging. In RelationTruncate(), currently WAL is logged
> after FSM and VM are truncated. IOW, with the patch, FSM and VM would need
> to be truncated after WAL logging. You would need to check whether this
> reordering is valid.

Sure, thank you for the advice.

Takayuki Tsunakawa



Re: [HACKERS] Runtime Partition Pruning

2018-04-18 Thread Amit Langote
Hi David.

On 2018/04/19 9:04, David Rowley wrote:
> On 19 April 2018 at 03:13, Robert Haas  wrote:
>> On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
>>  wrote:
>>> The patch does happen to improve performance slightly, but that is
>>> most likely due to the caching of the ExprStates rather than the
>>> change of memory management. It's not really possible to do that with
>>> the reset unless we stored the executor's memory context in
>>> PartitionPruneContext and did a context switch back inside
>>> partkey_datum_from_expr before calling ExecInitExpr.
>>
>> 10% is more than a "slight" improvement, I'd say!  It's certainly got
>> to be worth avoiding the repeated calls to ExecInitExpr, whatever we
>> do about the memory contexts.
> 
> I've attached a patch which does just this. On benchmarking again with
> this single change performance has improved 15% over master.
> 
> Also, out of curiosity, I also checked what this performed like before
> the run-time pruning patch was committed (5c0675215). Taking the
> average of the times below, it seems without this patch the
> performance of this case has improved about 356% and about 410% with
> this patch. So, I agree, it might be worth considering.
> 
> create table p (a int, value int) partition by hash (a);
> select 'create table p'||x|| ' partition of p for values with (modulus
> 10, remainder '||x||');' from generate_series(0,9) x;
> \gexec
> create table t1 (a int);
> 
> insert into p select x,x from generate_Series(1,1000) x;
> insert into t1 select x from generate_series(1,1000) x;
> 
> create index on p(a);
> 
> set enable_hashjoin = 0;
> set enable_mergejoin = 0;
> explain analyze select count(*) from t1 inner join p on t1.a=p.a;
> 
> -- Unpatched
> Execution Time: 20413.975 ms
> Execution Time: 20232.050 ms
> Execution Time: 20229.116 ms
> 
> -- Patched
> Execution Time: 17758.111 ms
> Execution Time: 17645.151 ms
> Execution Time: 17492.260 ms
> 
> -- 5c0675215e153ba1297fd494b34af2fdebd645d1
> Execution Time: 72875.161 ms
> Execution Time: 71817.757 ms
> Execution Time: 72411.730 ms

That's neat!  Definitely agree that we should call ExecInitExpr just once
here.  The patch looks good too, except the long line.  Maybe:

@@ -1514,13 +1514,15 @@ ExecSetupPartitionPruneState(PlanState *planstate,
List *partitionpruneinfo)
 foreach(lc3, step->exprs)
 {
 Expr   *expr = (Expr *) lfirst(lc3);
+int step_id = step->step.step_id;

 /*
  * partkey_datum_from_expr does not need an expression state
  * to evaluate a Const.
  */
 if (!IsA(expr, Const))
-context->exprstates[step->step.step_id * partnatts +
keyno] = ExecInitExpr(expr, context->planstate);
+context->exprstates[step_id * partnatts + keyno] =
+ExecInitExpr(expr, context->planstate);

Thanks,
Amit




RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Tsunakawa, Takayuki
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> a very long time before accessing to the relation. Which would cause the
> response-time spikes, for example, I observed such spikes several times on
> the server with shared_buffers = 300GB while running the benchmark.

FYI, a long transaction took about 900 ms, while the average transaction 
response time was 150 ms or so.  (I'm working with Fujii-san in this 
performance benchmark.)


> Therefore, I'm thinking to propose $SUBJECT and enable it to avoid such
> spikes for that relation.

How about an integer variable to replace the following?

#define REL_TRUNCATE_FRACTION   16


> Also, first of all, if other transactions need to extend the relation
> (i.e., need new pages) as soon as VACUUM truncates the empty pages at the
> end, that truncation would not be so helpful for performance. In this case,
> the truncation and extension of the relation are unnecessarily repeated,
> which would decrease the performance. So, to alleviate this situation,
> $SUBJECT is useful, I think.

I wonder if fillfactor=50 would alleviate this situation.
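
For instance, something like this (just a sketch, the table name is made up;
note that a lower fillfactor only affects newly written pages, so an existing
table needs a rewrite such as VACUUM FULL before existing pages honor it):

create table history (id bigint, payload text) with (fillfactor = 50);
-- or, for an existing table:
alter table history set (fillfactor = 50);
vacuum full history;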

Regards
Takayuki Tsunakawa



Is a modern build system acceptable for older platforms

2018-04-18 Thread Catalin Iacob
There have been several discussions of replacing PG's autoconf +
src/tools/msvc system. The last example is happening now at the bottom of
the Setting rpath on llvmjit.so thread.

I see potentially big advantages to moving, but also to PG's conservative
approach that keeps it running on edge and old platforms, so I set out to look
more carefully at what could be problematic or a showstopper for a more modern
build system. Here are my findings; I hope they help.

Unlike autoconf, all newer alternatives that I know of (certainly CMake and
Meson which were floated as alternatives so far) require themselves to be
present on the build machine when building. I know they have good reasons
to do this but that means they impose new dependencies for building PG.
Let's see what those are for CMake and Meson to get an idea if that's
acceptable and a feeling for how much friction they will introduce.

CMake
=

* needs a C++11 compiler (since 3.10, before it used to only need C++98)
* needs libuv (since 3.10 apparently, I know that some years ago it had no
library dependencies besides the C++ standard library)
* has a make backend so no new dependency (it may even work with non-GNU make,
so maybe it lowers one dependency)
* can bootstrap on a number of Unix systems, see
https://gitlab.kitware.com/cmake/cmake/blob/master/bootstrap

For the platforms in "CMake's buildfarm" see
https://open.cdash.org/index.php?project=CMake

The C++11 requirement caused 3.10 and higher to not build anymore for HP-UX:
https://gitlab.kitware.com/cmake/cmake/issues/17137

Meson
=

* needs Python >= 3.4
* needs ninja
** meson has no make backend see
http://mesonbuild.com/FAQ.html#why-is-there-not-a-make-backend for rationale
** as a small positive, this would mean not having to explain "you need
GNU make, BSD make is not enough"

Ninja:
* needs C++
** I think C++98 is enough but am not 100% sure; with a quick look at the
code I noticed no newer C++ features, and the bootstrap script does not pass
any -std arguments to the C++ compiler, so it should be 98
* https://github.com/ninja-build/ninja/pull/1007 talks about adding AIX
support and is in a release already
* https://github.com/ninja-build/ninja/blob/master/configure.py is the
bootstrap script which lists these as known platforms: 'linux', 'darwin',
'freebsd', 'openbsd', 'solaris', 'sunos5', 'mingw', 'msvc', 'gnukfreebsd',
'bitrig', 'netbsd', 'aix', 'dragonfly'

Python 3:
* points to ActivePython for HP-UX: https://www.python.org/download/other/
* some googling suggests Python > 3.2 works well on AIX and there are some
links to binaries

If I look at the requirements above versus what Postgres has in
src/template and in the build farm it seems like HP-UX and AIX could be the
more problematic or at least fiddly ones.

A related issue is that future versions of CMake or Meson could move their
baseline dependencies and desupport old platforms faster than PG might want
to, but there one could make the case to just use an older Meson or CMake.

So before the discussion of whether the gains from switching build systems
would offset the pain, I think the project needs to decide whether a newer
build system is acceptable in the first place, as it has a chance of
desupporting a platform altogether, or at least making it more painful for
some platforms by adding the bootstrap step for the build system with
potentially cascading dependencies (get Python 3 working, get ninja
bootstrapped, get PG built; or get libuv built, get CMake built, get PG
built).

The above is all about getting the build system to work at all. If that
isn't a showstopper, there's a subsequent discussion to be had about older
platforms where one could get the build system to work but convenient
packages are missing. For example, not even RHEL7 has any Python 3 packages
in the base system (it does in Software Collections, though), which means
some extra hoops to get Meson running there. And RHEL5 is in an even
worse spot, as it has no Software Collections, and who knows if Python 3
builds on it out of the box, etc.


Re: Corrupted btree index on HEAD because of covering indexes

2018-04-18 Thread Teodor Sigaev

Will see...

Michael Paquier wrote:

Hi all,

I was just testing the VACUUM truncation logic, and bumped into what
looks like a corrupted btree index.  Here is a reproducer:
create table aa (a int primary key, b bool);
insert into aa values (generate_series(1,100), false);
checkpoint;
update aa set b = false where a > 50; -- Dirties a set of shared buffers
delete from aa where a > 75; -- Delete a set of rows
vacuum aa;
delete from aa where a > 10;
vacuum aa; -- error on btree with right sibling

And here is the actual failure during the second vacuum:
ERROR:  XX000: right sibling 4132 of block 2128 is not next child 5396 of
block 412 in index "aa_pkey"
LOCATION:  _bt_mark_page_halfdead, nbtpage.c:1564

This works on REL_10_STABLE, so I am adding an open item.  I have not
investigated the exact problem yet, but bisect is showing me covering
indexes as the culprit (8224de4).

Thanks,
--
Michael



--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Corrupted btree index on HEAD because of covering indexes

2018-04-18 Thread Michael Paquier
Hi all,

I was just testing the VACUUM truncation logic, and bumped into what
looks like a corrupted btree index.  Here is a reproducer:
create table aa (a int primary key, b bool);
insert into aa values (generate_series(1,100), false);
checkpoint;
update aa set b = false where a > 50; -- Dirties a set of shared buffers
delete from aa where a > 75; -- Delete a set of rows
vacuum aa;
delete from aa where a > 10;
vacuum aa; -- error on btree with right sibling

And here is the actual failure during the second vacuum:
ERROR:  XX000: right sibling 4132 of block 2128 is not next child 5396 of
block 412 in index "aa_pkey"
LOCATION:  _bt_mark_page_halfdead, nbtpage.c:1564

This works on REL_10_STABLE, so I am adding an open item.  I have not
investigated the exact problem yet, but bisect is showing me covering
indexes as the culprit (8224de4).
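
For what it's worth, contrib/amcheck can also be pointed at the index to
look for the inconsistency directly (a sketch, assuming the extension is
installed; whether it flags this particular corruption is not guaranteed):
create extension amcheck;
select bt_index_check('aa_pkey');
select bt_index_parent_check('aa_pkey'); -- stronger parent/child checks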

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-04-18 Thread Michael Paquier
On Wed, Apr 18, 2018 at 04:49:17PM -0400, Tom Lane wrote:
> Just to throw out a possibly-crazy idea: maybe we could fix this by
> PANIC'ing if truncation fails, so that we replay the row deletions from
> WAL.  Obviously this would be intolerable if the case were frequent,
> but we've had only two such complaints in the last nine years, so maybe
> it's tolerable.  It seems more attractive than taking a large performance
> hit on truncation speed in normal cases, anyway.

It can take some time to go through the whole thread...

And that was my first intuition when looking at those things.

So one case is where the truncation of the main relation happens first,
and succeeds.  After that potentially comes the truncation of index pages,
which can refer to tuples on the pages that have been truncated previously,
and then that part fails.  This leaves index references broken, which is
what the 2010 report is originally about.

> A gotcha to be concerned about is what happens if we replay from WAL,
> come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
> failure again, which is surely not unlikely.  PANIC'ing again will not
> do.  I think we could probably handle that by having the replay code
> path zero out all the pages it was unable to delete; as long as that
> succeeds, we can call it good and move on.

The complaint we are discussing here is Windows antivirus software meddling
with PostgreSQL by randomly preventing access to the file to be truncated.
Would a PANIC in this one code path be sufficient, though?  It seems to me
that any error could cause an inconsistency, which could justify the use of
a critical section instead to force WAL replay to clean things up?
--
Michael


signature.asc
Description: PGP signature


RE: Built-in connection pooling

2018-04-18 Thread Tsunakawa, Takayuki
From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> In Oracle, for example, you can create dedicated and non-dedicated backends.
> I wonder why we do not want to have something similar in Postgres.

Yes, I want it, too.  In addition to dedicated and shared server processes, 
Oracle provides Database Resident Connection Pooling (DRCP).  I guessed you 
were inspired by this.

https://docs.oracle.com/cd/B28359_01/server.111/b28310/manproc002.htm#ADMIN12348

BTW, you are doing various great work -- autoprepare, multithreaded Postgres, 
built-in connection pooling, etc. etc., aren't you?  Are you doing all of these 
alone?

Regards
Takayuki Tsunakawa





Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-18 Thread Ashutosh Bapat
On Thu, Apr 19, 2018 at 2:54 AM, David Rowley
 wrote:
> If we just did it at plan time then
> pre-PREPAREd queries might still prune.  That does not seem very
> useful if it's being disabled due to the discovery of some bug.
>

As you have pointed out upthread, that's a problem with every enable_*
GUC. After seeing a bug, users would usually re-prepare their
statements with pruning turned off. So, I don't see this as a reason
for introducing two GUCs.
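
For example, with an enable_* GUC along the lines discussed here (the name
enable_partition_pruning is only the proposal under discussion, and the
statement below is made up), that would look like:
set enable_partition_pruning = off;
deallocate q;   -- assuming q was prepared before the GUC change
prepare q as select count(*) from t1 inner join p on t1.a = p.a;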

> The more I think about this the more undecided I am as to whether we
> need to add a GUC for this at all, so I'm keen to hear more people
> voice their opinion about this.  If bugs are the only true reason to
> add it, then the need for the GUC should diminish every day that
> nobody reports any bugs.
>

Apart from bugs, I think this GUC can be used to avoid the extra planning
time/memory/CPU incurred by pruning when users know for sure that
pruning is not going to happen, e.g. cases like no qual on the
partition key, or no equality qual on a hash partition key, etc. Do we
know how much planning time can be saved this way?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Postgres 10 problem with UNION ALL of null value in "subselect"

2018-04-18 Thread Pavel Stehule
2018-04-19 5:01 GMT+02:00 Kyotaro HORIGUCHI :

> At Mon, 16 Apr 2018 18:39:24 +0530, Ashutosh Bapat <
> ashutosh.ba...@enterprisedb.com> wrote in 

Is there a memory leak in commit 8561e48?

2018-04-18 Thread jian.l...@i-soft.com.cn
In commit 8561e48, _SPI_stack is allocated from TopMemoryContext, but
AtEOXact_SPI just sets _SPI_stack = NULL. Is this a memory leak?






Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process

2018-04-18 Thread Thomas Munro
On Wed, Apr 18, 2018 at 6:55 PM, Thomas Munro
 wrote:
> Here's a draft patch that does that.

Here's a better one (the previous version could read past the end of
the occurred_events array).

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Exit-by-default-if-postmaster-dies-v2.patch
Description: Binary data


Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-18 Thread Thomas Munro
On Wed, Apr 18, 2018 at 5:04 PM, Thomas Munro
 wrote:
> On Wed, Apr 11, 2018 at 10:22 PM, Heikki Linnakangas  wrote:
>>> On Tue, Apr 10, 2018 at 12:53 PM, Andres Freund 
>>> wrote:
 That person said he'd work on adding an equivalent of linux'
 prctl(PR_SET_PDEATHSIG) to FreeBSD.
>
> Here is an implementation of Andres's idea for Linux, and also for
> patched FreeBSD (for later if/when that lands).  Do you think this
> makes sense Heikki?  I am planning to add this to the next CF.

Here's a new version with a stupid bug fixed (I accidentally posted a
testing version that returned false instead of true, as cfbot quickly
pointed out -- d'oh).

By the way, these patches only use the death signal to make
PostmasterIsAlive() fast, for use by busy loops like recovery.  The
postmaster pipe is still used for IO/timeout loops to detect
postmaster death.  In theory you could get rid of the postmaster pipe
completely when USE_POSTMASTER_DEATH_SIGNAL is defined and make it
like the latch code, using the same self-pipe.  I'm not sure if there
is anything to be gained by that (that wasn't already gained by using
epoll/kqueue) so I'm not proposing it.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Use-signals-for-postmaster-death-on-Linux-v2.patch
Description: Binary data


0002-Use-signals-for-postmaster-death-on-FreeBSD-v2.patch
Description: Binary data


Re: Postgres 10 problem with UNION ALL of null value in "subselect"

2018-04-18 Thread Kyotaro HORIGUCHI
At Mon, 16 Apr 2018 18:39:24 +0530, Ashutosh Bapat 
 wrote in 

Re: Adding an LWLockHeldByMe()-like function that reports if any buffer content lock is held

2018-04-18 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 6:53 PM, Michael Paquier  wrote:
>> I'm curious about what we'll find by just by adding
>> Assert(!AnyBufferLockHeldByMe()) to the top of
>> heap_tuple_fetch_attr(). AssertNotInCriticalSection() certainly found
>> several bugs when it was first added.
>
> Yep.

I wrote a simple prototype of this. No assertion failed during a "make
installcheck". Assertions were added to heap_tuple_fetch_attr(), and a
couple of other places.

-- 
Peter Geoghegan



Re: VM map freeze corruption

2018-04-18 Thread Masahiko Sawada
On Wed, Apr 18, 2018 at 10:36 PM, Alvaro Herrera
 wrote:
> Pavan Deolasee wrote:
>> On Wed, Apr 18, 2018 at 7:37 AM, Wood, Dan  wrote:
>
>> > My analysis is that heap_prepare_freeze_tuple->FreezeMultiXactId()
>> > returns FRM_NOOP if the MultiXACT locked rows haven't committed.  This
>> > results in changed=false and totally_frozen=true(as initialized).  When
>> > this returns to lazy_scan_heap(), no rows are added to the frozen[] array.
>> > Yet, tuple_totally_frozen is true.  This means the page is marked frozen in
>> > the VM, even though the MultiXACT row was left untouched.
>> >
>> > A fix to heap_prepare_freeze_tuple() that seems to do the trick is:
>> > else
>> > {
>> > Assert(flags & FRM_NOOP);
>> > +  totally_frozen = false;
>> > }
>> >
>>
>> That's a great find!
>
> Indeed.
>
> This family of bugs (introduced by freeze map changes in 9.6) was
> initially fixed in 38e9f90a227d but this spot was missed in that fix.
>
> IMO the cause is the totally_frozen variable, which starts life in a
> bogus state (true) and the different code paths can set it to the right
> state, or by inaction end up deciding that the initial bogus state was
> correct in the first place.  Errors of omission are far too easy in that
> kind of model, ISTM, so I propose this slightly different patch, which
> albeit yet untested seems easier to verify and easier to get right.
>

Thank you for the patch!
Agreed. The patch looks like it fixes this issue correctly.
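
As a side note, for anyone reproducing this, the incorrectly set all-frozen
bits can be inspected with contrib/pg_visibility (a sketch; the table name
is made up):
create extension pg_visibility;
select blkno, all_visible, all_frozen from pg_visibility_map('some_table');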

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Speedup of relation deletes during recovery

2018-04-18 Thread Michael Paquier
On Thu, Apr 19, 2018 at 01:52:26AM +0900, Fujii Masao wrote:
> No. But after my colleague truncated more than one hundred tables on
> the server with shared_buffers = 300GB, the recovery could not finish
> even after 10 minutes since the startup of the recovery. So I had to
> shutdown the server immediately, set shared_buffers to very small
> temporarily and start the server to cause the recovery to finish soon.

Hm, OK.  Here are simple functions to create and drop many relations in
a single transaction:
create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
query_string := 'create table tab_' || i::text || ' (a int);';
execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function drop_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
query_string := 'drop table tab_' || i::text;
execute query_string;
  end loop;
end;
$$ language plpgsql;
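
For reference, a sketch of how these might be driven for the 10k-relation
case described below (the explicit transaction around the drops is where
max_locks_per_transaction matters):
select create_tables(10000);
begin;
select drop_tables(10000);
commit;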

With a server running 8GB of shared buffers (you likely need to increase
max_locks_per_transaction) and 10k relations created and dropped, it
took 50 seconds to finish redo of roughly 4 segments:
2018-04-19 11:17:15 JST [7472]: [14-1] db=,user=,app=,client= LOG:  redo
starts at 0/15DF2E8
2018-04-19 11:18:05 JST [7472]: [15-1] db=,user=,app=,client= LOG:
invalid record length at 0/4E46160: wanted 24, got 0
2018-04-19 11:18:05 JST [7472]: [16-1] db=,user=,app=,client= LOG:  redo
done at 0/4A7C6E

Then with your patch it took... Barely 1 second.
2018-04-19 11:20:33 JST [11690]: [14-1] db=,user=,app=,client= LOG:
redo starts at 0/15DF2E8
2018-04-19 11:20:34 JST [11690]: [15-1] db=,user=,app=,client= LOG:
invalid record length at 0/4E299B0: wanted 24, got 0
2018-04-19 11:20:34 JST [11690]: [16-1] db=,user=,app=,client= LOG:
redo done at 0/4E29978

Looking at profiles with perf I can also see that 95% of the time is
spent in DropRelFileNodesAllBuffers which is where the startup process
spends most of its CPU.  So HEAD is booh.  And your patch is excellent
in this context.
--
Michael


signature.asc
Description: PGP signature


Re: Adding an LWLockHeldByMe()-like function that reports if any buffer content lock is held

2018-04-18 Thread Michael Paquier
On Wed, Apr 18, 2018 at 06:44:00PM -0700, Peter Geoghegan wrote:
> What I have in mind here is something that's a bit like
> AssertNotInCriticalSection(). We don't need to pepper
> AssertNotInCriticalSection() everywhere in practice, because calling
> palloc() is a pretty good proxy for "function should not be called in
> a critical section" -- palloc() calls AssertNotInCriticalSection(),
> which probably catches most unsafe code in critical sections
> immediately.

In this case, the prospect of limiting unnecessary PANIC exits on OOM
was the deal breaker.

> We could probably also get decent
> Assert(!AnyBufferLockHeldByMe()) coverage without adding many new
> asserts.
> 
> I'm curious about what we'll find by just by adding
> Assert(!AnyBufferLockHeldByMe()) to the top of
> heap_tuple_fetch_attr(). AssertNotInCriticalSection() certainly found
> several bugs when it was first added.

Yep.
--
Michael


signature.asc
Description: PGP signature


Re: ON CONFLICT DO UPDATE for partitioned tables

2018-04-18 Thread Amit Langote
On 2018/04/18 22:40, Alvaro Herrera wrote:
> Amit Langote wrote:
>> On 2018/04/18 0:04, Alvaro Herrera wrote:
>>> Amit Langote wrote:
>>>
 I just confirmed my hunch that this wouldn't somehow do the right thing
 when the OID system column is involved.  Like this case:
>>>
>>> This looks too big a patch to pursue now.  I'm inclined to just remove
>>> the equalTupdesc changes.
>>
>> OK.  Here is the patch that removes equalTupdesc optimization.
> 
> Hmm.  If we modify (during pg12, of course -- not now) partition tables
> that are created identical to their parent table so that they share the
> pg_type row, this would become useful.  Unless there's a reason why that
> change is completely unworkable, I'd just leave it there.  (I claim that
> it works like that only because it used to work like that, not because
> it's impossible to make it work the other way.)

Yeah, I too have wondered in the past what it would take to make
equalTupDescs() return true for parent and partitions.  Maybe we can make
it work by looking a bit harder than I did then.

Although, just leaving it there now would mean we're adding a few cycles
needlessly in the PG 11 code.  Why not add that optimization when we
surely know it can work?

Thanks,
Amit




Re: Problem while setting the fpw with SIGHUP

2018-04-18 Thread Michael Paquier
On Wed, Apr 18, 2018 at 10:52:51AM -0400, Robert Haas wrote:
> I would just document the risks.  If the documentation says that you
> can't rely on the value until after the next checkpoint, or whatever
> the rule is, then I think we're fine.  I don't think that we really
> have the infrastructure to do any better; if we try, we'll just end up
> with odd warts.  Documenting the current set of warts is less churn
> and less work.

The last version of the patch proposed has eaten this diff which was
part of one of the past versions (v2-0001-Change-FPW-handling.patch from
https://www.postgresql.org/message-id/20180412.103430.133595350.horiguchi.kyotaro%40lab.ntt.co.jp):
+The default is on. The change of the parameter takes
+effect at the next checkpoint time.
So there was some documentation about the behavior change, for what it's
worth.

And, er, actually, I was thinking again about the case where a user
wants to disable full_page_writes temporarily to do some bulk load and
then re-enable it.  With the proposed patch, which applies the FPW change
at checkpoint time, a user would need to issue a manual checkpoint after
updating the configuration and reloading, which may create more I/O than
they'd want to pay for, and then a second checkpoint would need to be
issued after the configuration comes back again.  That would cause a
regression which could surprise a class of users.  WAL and FPW overhead
is a problem which shows up a lot when doing bulk-loading of data.
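
To be concrete, the workflow I have in mind would look roughly like this
(a sketch only, using existing commands):
alter system set full_page_writes = off;
select pg_reload_conf();
checkpoint;  -- with the proposed behavior, needed for the change to apply
-- ... run the bulk load ...
alter system set full_page_writes = on;
select pg_reload_conf();
checkpoint;  -- and a second manual checkpoint to get FPWs back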
--
Michael


signature.asc
Description: PGP signature


Re: Adding an LWLockHeldByMe()-like function that reports if any buffer content lock is held

2018-04-18 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 5:46 PM, Michael Paquier  wrote:
> Personally, I favor approaches like that, because it allows catching
> problems in the use of some APIs when people working on a patch miss any
> kind of warning comment at the top of the function or within it which
> summarizes the conditions under which something needs to be used.

Right. Imagine how long it would take to figure out when there is a
bug without something like this assertion. It's fairly difficult to
debug LWLock deadlocks in production, even for experts.

What I have in mind here is something that's a bit like
AssertNotInCriticalSection(). We don't need to pepper
AssertNotInCriticalSection() everywhere in practice, because calling
palloc() is a pretty good proxy for "function should not be called in
a critical section" -- palloc() calls AssertNotInCriticalSection(),
which probably catches most unsafe code in critical sections
immediately. We could probably also get decent
Assert(!AnyBufferLockHeldByMe()) coverage without adding many new
asserts.

I'm curious about what we'll find by just by adding
Assert(!AnyBufferLockHeldByMe()) to the top of
heap_tuple_fetch_attr(). AssertNotInCriticalSection() certainly found
several bugs when it was first added.

-- 
Peter Geoghegan



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Michael Paquier
On Wed, Apr 18, 2018 at 07:41:44PM +0530, Amit Kapila wrote:
> I think it makes sense to pursue that approach, but it might be worth
> considering some alternative till we have it.  I remember last time
> (in 2015) we have discussed some another solution [1] to this problem
> (or similar) and we have left it unattended in the hope that we will
> get a better solution, but we are still in the same situation.  I
> think in general it is better to go with the approach which can fix
> the root cause of the problem, but if that is going to take a long
> time, it is not terrible to provide some workable solution which can
> help users.

Yeah, I can understand that feeling.  When we talked about the
compression of FPWs back in 9.5, we discussed that if we had
double-writes then this would not be necessary, and we are still with
wal_compression but without double writes (actually, it happens that
compression of pages can also be used with double writes, but that's
enough hijacking for this thread..).

Then, let's consider the beginning of the first commit fest of v12 as
judgement time.  Implementing a radix tree for shared buffers is a long-term
project, which has no guarantee of getting merged, while a visibly-simple
reloption helps in some cases...
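
Seen from the user side, the simple option would be something like this
(purely a sketch; the reloption name vacuum_truncate is hypothetical here):
alter table history set (vacuum_truncate = off);
vacuum history;  -- end-of-relation truncation would then be skipped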
--
Michael


signature.asc
Description: PGP signature


Re: Adding an LWLockHeldByMe()-like function that reports if any buffer content lock is held

2018-04-18 Thread Michael Paquier
On Wed, Apr 18, 2018 at 04:53:29PM -0700, Peter Geoghegan wrote:
> It occurred to me that it would be nice to be able to
> Assert(!AnyBufferLockHeldByMe()) at a certain point within
> index_form_tuple(), to make sure that our assumptions hold. If
> index_truncate_tuple() (or any other function) ever called
> index_form_tuple(), and ended up actually performing table access with
> an exclusive buffer lock held, we'd at least be able to catch the bug
> when assertions are enabled. A function that lets code assert that no
> buffer locks are held, for the rare cases where external table access
> is required seems like good general infrastructure.

Personally, I favor approaches like that, because it allows catching
problems in the use of some APIs when people working on a patch miss any
kind of warning comment at the top of the function or within it which
summarizes the conditions under which something needs to be used.

> Does this seem like a good idea? This could get pretty expensive if it
> was overused, even by the standards of what we expect from
> assertion-enabled builds, but we could make it optional if the
> overhead got out of hand.

There are other places which could be improved, mentioned one here:
https://www.postgresql.org/message-id/CAB7nPqRJtKO5NLZZis3xxzH05sdAodG39qFdLQRg371Pi94PzQ%40mail.gmail.com
And if you see downthread this is a topic where it seems hard to reach a
consensus.
--
Michael


signature.asc
Description: PGP signature


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Craig Ringer
On 19 April 2018 at 07:31, Mark Kirkwood  wrote:
> On 19/04/18 00:45, Craig Ringer wrote:
>
>>
>> I guarantee you that when you create a 100GB EBS volume on AWS EC2,
>> you don't get 100GB of storage preallocated. AWS are probably pretty
>> good about not running out of backing store, though.
>>
>>
>
> Some db folks (used to anyway) advise dd'ing to your freshly attached
> devices on AWS (for performance mainly IIRC), but that would help prevent
> some failure scenarios for any thin provisioned storage (but probably really
> annoy the admins thereof).

This still makes a lot of sense on AWS EBS, particularly when using a
volume created from a non-empty snapshot. Performance of S3-snapshot
based EBS volumes is spectacularly awful, since they're copy-on-read.
Reading the whole volume helps a lot.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Proposal: Adding json logging

2018-04-18 Thread Michael Paquier
On Wed, Apr 18, 2018 at 12:10:47PM -0700, Christophe Pettus wrote:
> On Apr 18, 2018, at 11:59, Robert Haas  wrote:
>> For the record, I'm tentatively in favor of including something like
>> this in contrib.
> 
> I'm much less fussed by this in contrib/ (with the same concern you
> noted), at minimum as an example of how to do logging in other
> formats. 

Using a contrib module for the logging format also has a side effect. When
the logging collector is enabled, the log entries which are created
by the postmaster have junk data, as it is sort of impossible to make the
loaded module know that the logging collector is enabled in the
configuration but that the log entries cannot use the pipe protocol
yet.  In short, you end up with a couple of entries which are formatted
for the pipe protocol used by the syslogger but are redirected to
stderr.  There are only a couple of entries which fall into this
category, like a misconfiguration of the server, or the ports the server
is listening on (look for redirection_done in elog.c).  One simple fix
would be to pass down the value of redirection_done to emit_log_hook,
and this requires patching the server.
--
Michael


signature.asc
Description: PGP signature


Re: Proposal: Adding json logging

2018-04-18 Thread Michael Paquier
On Wed, Apr 18, 2018 at 02:59:26PM -0400, Robert Haas wrote:
> 
> Note that logging_collector should be enabled in postgresql.conf to
> ensure consistent log outputs.  As JSON strings are longer than normal
> logs generated by PostgreSQL, this module increases the odds of malformed
> log entries.
> 
> 
> I'm not sure I understand the issue, but I don't like malformed log entries.

If logging_collector is disabled, then all the log entries just go to
stderr (that's mentioned in the docs).  While that may be fine for
low volumes of logs, with many concurrent processes generating logs
one process can overwrite another process's log entry.  The logs generated
in the JSON format are longer, mainly because of the repetitive use of the
key names in the blob, which in turn increases the volume generated.

So in this case this can cause JSON blobs to look broken.  I had a
report on github not long ago about that:
https://github.com/michaelpq/pg_plugins/issues/17
--
Michael


signature.asc
Description: PGP signature


Re: Fix for documentation of Covering Indexes

2018-04-18 Thread Michael Paquier
On Wed, Apr 18, 2018 at 05:52:01AM -0400, Heikki Linnakangas wrote:
> Committed, thanks!

Thanks for the commit.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Runtime Partition Pruning

2018-04-18 Thread David Rowley
On 19 April 2018 at 12:04, David Rowley  wrote:
> insert into p select x,x from generate_Series(1,1000) x;
> insert into t1 select x from generate_series(1,1000) x;

Correction. These were meant to read:

insert into p select x,x from generate_Series(1,1000) x;
insert into t1 select x from generate_series(1,1000) x;

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-18 Thread David Rowley
On 19 April 2018 at 03:13, Robert Haas  wrote:
> On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
>  wrote:
>> I did go and start working on a patch to test how possible this would
>> be and came up with the attached. I've left a stray
>> MemoryContextStatsDetail call in there which does indicate that
>> something is not being freed. I'm just not sure what it is yet.
>>
>> The patch does happen to improve performance slightly, but that is
>> most likely due to the caching of the ExprStates rather than the
>> change of memory management. It's not really possible to do that with
>> the reset unless we stored the executor's memory context in
>> PartitionPruneContext and did a context switch back inside
>> partkey_datum_from_expr before calling ExecInitExpr.
>
> 10% is more than a "slight" improvement, I'd say!  It's certainly got
> to be worth avoiding the repeated calls to ExecInitExpr, whatever we
> do about the memory contexts.

I've attached a patch which does just this. On benchmarking again with
this single change performance has improved 15% over master.

Also, out of curiosity, I also checked what this performed like before
the run-time pruning patch was committed (5c0675215). Taking the
average of the times below, it seems without this patch the
performance of this case has improved about 356% and about 410% with
this patch. So, I agree, it might be worth considering.

create table p (a int, value int) partition by hash (a);
select 'create table p'||x|| ' partition of p for values with (modulus
10, remainder '||x||');' from generate_series(0,9) x;
\gexec
create table t1 (a int);

insert into p select x,x from generate_Series(1,1000) x;
insert into t1 select x from generate_series(1,1000) x;

create index on p(a);

set enable_hashjoin = 0;
set enable_mergejoin = 0;
explain analyze select count(*) from t1 inner join p on t1.a=p.a;

-- Unpatched
Execution Time: 20413.975 ms
Execution Time: 20232.050 ms
Execution Time: 20229.116 ms

-- Patched
Execution Time: 17758.111 ms
Execution Time: 17645.151 ms
Execution Time: 17492.260 ms

-- 5c0675215e153ba1297fd494b34af2fdebd645d1
Execution Time: 72875.161 ms
Execution Time: 71817.757 ms
Execution Time: 72411.730 ms

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Initialize-expr-states-once-in-run-time-partition-pr.patch
Description: Binary data


Adding an LWLockHeldByMe()-like function that reports if any buffer content lock is held

2018-04-18 Thread Peter Geoghegan
During recent review of the INCLUDE covering index patch, I pushed to
formalize the slightly delicate assumptions that we make around how
index_truncate_tuple() is called. It's natural to call
index_truncate_tuple() during a page split, when a buffer lock is
held. This is what we actually do in most cases.

It occurred to me that it would be nice to be able to
Assert(!AnyBufferLockHeldByMe()) at a certain point within
index_form_tuple(), to make sure that our assumptions hold. If
index_truncate_tuple() (or any other function) ever called
index_form_tuple(), and ended up actually performing table access with
an exclusive buffer lock held, we'd at least be able to catch the bug
when assertions are enabled. A function that lets code assert that no
buffer locks are held, for the rare cases where external table access
is required seems like good general infrastructure.

Does this seem like a good idea? This could get pretty expensive if it
was overused, even by the standards of what we expect from
assertion-enabled builds, but we could make it optional if the
overhead got out of hand.

-- 
Peter Geoghegan



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Mark Kirkwood

On 19/04/18 00:45, Craig Ringer wrote:



I guarantee you that when you create a 100GB EBS volume on AWS EC2,
you don't get 100GB of storage preallocated. AWS are probably pretty
good about not running out of backing store, though.




Some db folks (used to anyway) advise dd'ing to your freshly attached 
devices on AWS (for performance mainly IIRC), but that would help 
prevent some failure scenarios for any thin provisioned storage (but 
probably really annoy the admins thereof).


regards
Mark



Re: pruning disabled for array, enum, record, range type partition keys

2018-04-18 Thread Tom Lane
Alvaro Herrera  writes:
> I now wonder if there's anything else that equivclass.c or indxpath.c
> can teach us on this topic.

I've been meaning to review this but have been a bit distracted.
Will try to look tomorrow.

regards, tom lane



Re: Query is over 2x slower with jit=on

2018-04-18 Thread Robert Haas
On Wed, Apr 18, 2018 at 3:29 PM, Andres Freund  wrote:
> Not convinced that that is true - the issue is more likely that JIT work in 
> workers is counted as execute time... Gotta add that somehow, not sure what 
> the best way would be.

Oh, that does seem like something that should be fixed.  If that's
what is happening here, it's bound to confuse a lot of people.
Probably you need to add some code to
ExecParallelRetrieveInstrumentation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 1:45 PM, Peter Geoghegan  wrote:
> I suggest committing this patch as-is.

Actually, I see one tiny issue with extra '*' characters here:

> +* The number of attributes won't be explicitly represented if the
> +* negative infinity tuple was generated during a page split that
> +* occurred with a version of Postgres before v11.  There must be 
> a
> +* problem when there is an explicit representation that is
> +* non-zero, * or when there is no explicit representation and the
> +* tuple is * evidently not a pre-pg_upgrade tuple.

I also suggest fixing this indentation before commit:

> +   /*
> +*Cannot leak memory here, TupleDescCopy() doesn't allocate any
> +* inner structure, so, plain pfree() should clean all allocated memory
> +*/

-- 
Peter Geoghegan



Re: pruning disabled for array, enum, record, range type partition keys

2018-04-18 Thread Alvaro Herrera
Amit Langote wrote:
> On Thu, Apr 19, 2018 at 12:01 AM, Alvaro Herrera
>  wrote:

> > Makes sense.  Still, I was expecting that pruning of hash partitioning
> > would also work for pseudotypes, yet it doesn't.
> 
> It does?

Aha, so it does.

While staring at this new code, I was confused as to why we didn't use
the commutator if the code above had determined one.  I was unable to
cause a test to fail, so I put that thought aside.

Some time later, after restructuring the code in a way that seemed to
make more sense to me (and saving one get_op_opfamily_properties call
for the case of the not-equals operator), I realized that with the new
code we can store the opstrategy in the PartClause instead of leaving it
as Invalid and look it up again later, so I did that.  And lo and
behold, the tests that used commutators started failing!  So I fixed
that one in the obvious way, and the tests work fully again.

Please give this version another look.  I also rewrote a couple of
comments.

I now wonder if there's anything else that equivclass.c or indxpath.c
can teach us on this topic.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 340b0b3a97af0fa07562bc4434a4908402c3efbd Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 18 Apr 2018 18:18:51 -0300
Subject: [PATCH v5] Fix pruning code to determine comparison function
 correctly

It's unreliable to determine one using the constant expression's
type directly (for example, it doesn't work correctly when the
expression contains an array, enum, etc.).  Instead, use righttype
of the operator, the one that supposedly passed the op_in_opfamily
test using the partitioning opfamily.
---
 src/backend/partitioning/partprune.c  | 101 +++
 src/test/regress/expected/partition_prune.out | 138 ++
 src/test/regress/sql/partition_prune.sql  |  53 ++
 3 files changed, 250 insertions(+), 42 deletions(-)

diff --git a/src/backend/partitioning/partprune.c 
b/src/backend/partitioning/partprune.c
index 7666c6c412..b8a006e774 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -1426,10 +1426,12 @@ match_clause_to_partition_key(RelOptInfo *rel,
OpExpr *opclause = (OpExpr *) clause;
Expr   *leftop,
   *rightop;
-   Oid commutator = InvalidOid,
+   Oid op_lefttype,
+   op_righttype,
+   commutator = InvalidOid,
negator = InvalidOid;
Oid cmpfn;
-   Oid exprtype;
+   int op_strategy;
boolis_opne_listp = false;
PartClauseInfo *partclause;
 
@@ -1483,33 +1485,39 @@ match_clause_to_partition_key(RelOptInfo *rel,
return PARTCLAUSE_UNSUPPORTED;
 
/*
-* Normally we only bother with operators that are listed as 
being
-* part of the partitioning operator family.  But we make an 
exception
-* in one case -- operators named '<>' are not listed in any 
operator
-* family whatsoever, in which case, we try to perform partition
-* pruning with it only if list partitioning is in use.
+* Determine the input types of the operator we're considering.
+*
+* Normally we only care about operators that are listed as 
being part
+* of the partitioning operator family.  But there is one 
exception:
+* the not-equals operators are not listed in any operator 
family
+* whatsoever, but we can find their negator in the opfamily 
and it's
+* an equality op.  We can use one of those if we find it, but 
they
+* are only useful for list partitioning.
 */
-   if (!op_in_opfamily(opclause->opno, partopfamily))
+   if (op_in_opfamily(opclause->opno, partopfamily))
{
+   Oid oper;
+
+   oper = OidIsValid(commutator) ? commutator : 
opclause->opno;
+   get_op_opfamily_properties(oper, partopfamily, false,
+  
_strategy, _lefttype,
+  
_righttype);
+   }
+   else
+   {
+   /* Not good unless list partitioning */
if (part_scheme->strategy != PARTITION_STRATEGY_LIST)
return 

Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-18 Thread David Rowley
On 18 April 2018 at 21:36, Ashutosh Bapat
 wrote:
> On Wed, Apr 18, 2018 at 5:37 AM, David Rowley
>> a) Disable run-time pruning during execution.
>> b) Disable run-time pruning during planning.
>> c) Both of the above.
>>
>> The differentiation of the above is important when you consider
>> PREPAREd statements. Currently, no enable_ GUC will affect a
>> pre-PREPAREd query. We might want to keep that rule despite there
>> being flexibility not to, in this case.
>
>
> If run-time pruning is disabled, why do we want to waste CPU cycles
> and memory to produce plan time details? It might be useful to do so,
> if there was a large chance that people prepared a statement which
> could use partition pruning with run-time pruning disables but
> EXECUTEd it with run-time pruning enabled. It will be less likely that
> the session which prepares a plan would change the GUCs before
> executing it.

I have to admit, I can't really imagine any valid cases where disabling
this feature would be useful. Generally, enable_* properties can be
used to coax the planner into producing some plan shape that it
otherwise didn't due to some costing problem.  I can only imagine it
might be useful to disable either for testing or as a workaround for
some bug that might crop up. Perhaps that's not enough reason to go
and add a GUC that'll likely need to exist forever. But it probably
does mean that we'd want c) so that the code is completely disabled as
soon as the setting is off.  If we just did it at plan time then
pre-PREPAREd queries might still prune.  That does not seem very
useful if it's being disabled due to the discovery of some bug.

The more I think about this the more undecided I am as to whether we
need to add a GUC for this at all, so I'm keen to hear more people
voice their opinion about this.  If bugs are the only true reason to
add it, then the need for the GUC should diminish every day that
nobody reports any bugs.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-04-18 Thread Tom Lane
I wrote:
> Relation truncation throws away the page image in memory without ever
> writing it to disk.  Then, if the subsequent file truncate step fails,
> we have a problem, because anyone who goes looking for that page will
> fetch it afresh from disk and see the tuples as live.

> There are WAL entries recording the row deletions, but that doesn't
> help unless we crash and replay the WAL.

> It's hard to see a way around this that isn't fairly catastrophic for
> performance :-(.

Just to throw out a possibly-crazy idea: maybe we could fix this by
PANIC'ing if truncation fails, so that we replay the row deletions from
WAL.  Obviously this would be intolerable if the case were frequent,
but we've had only two such complaints in the last nine years, so maybe
it's tolerable.  It seems more attractive than taking a large performance
hit on truncation speed in normal cases, anyway.

A gotcha to be concerned about is what happens if we replay from WAL,
come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
failure again, which is surely not unlikely.  PANIC'ing again will not
do.  I think we could probably handle that by having the replay code
path zero out all the pages it was unable to delete; as long as that
succeeds, we can call it good and move on.

Or maybe just do that in the mainline case too?  That is, if ftruncate
fails, handle it by zeroing the undeletable pages and pressing on?

> But in any case it's wrapped up in order-of-operations
> issues.  I've long since forgotten the details, but I seem to have thought
> that there were additional order-of-operations hazards besides this one.

It'd be a good idea to redo that investigation before concluding this
issue is fixed, too.  I was not thinking at the time that it'd be years
before anybody did anything, or I'd have made more notes.

regards, tom lane



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 1:32 PM, Teodor Sigaev  wrote:
>> I don't understand. We do check the number of attributes on rightmost
>> pages, but we do so separately, in the main loop. For every item that
>> isn't the high key.
>
> Comment added, pls, verify. And refactored _bt_check_natts(), I hope, now
> it's a bit more readable.

The new comment looks good.

Now I understand what you meant about _bt_check_natts(). And, I agree
that this is an improvement -- the extra verbosity is worth it.

> I didn't do that in v1, sorry, I was unclear. Attached patch contains all
> changes suggested in my previous email.

The new BTreeTupSetNAtts() assertion looks good to me.

I suggest committing this patch as-is.

Thank you
-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Teodor Sigaev

I don't understand. We do check the number of attributes on rightmost
pages, but we do so separately, in the main loop. For every item that
isn't the high key.
Comment added, pls, verify. And refactored _bt_check_natts(), I hope, 
now it's a bit more readable.



4) BTreeTupSetNAtts - seems, it's better to add a check that nattrs fits
into the BT_N_KEYS_OFFSET_MASK mask, and it should not touch the
BT_RESERVED_OFFSET_MASK bits; now it will overwrite those bits.


An assertion sounds like it would be an improvement, though I don't
see that in the patch you posted.
I didn't do that in v1, sorry, I was unclear. Attached patch contains 
all changes suggested in my previous email.


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index be0206d58e..1a605f9944 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -698,6 +698,9 @@ nextpage:
  *	 "real" data item on the page to the right (if such a first item is
  *	 available).
  *
+ * - That tuples report that they have the expected number of attributes.
+ *	 INCLUDE index pivot tuples should not contain non-key attributes.
+ *
  * Furthermore, when state passed shows ShareLock held, and target page is
  * internal page, function also checks:
  *
@@ -722,43 +725,35 @@ bt_target_page_check(BtreeCheckState *state)
 	elog(DEBUG2, "verifying %u items on %s block %u", max,
 		 P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock);
 
-
-	/* Check the number of attributes in high key if any */
-	if (!P_RIGHTMOST(topaque))
+	/*
+	 * Check the number of attributes in high key. Note, rightmost page doesn't
+	 * contain a high key, so nothing to check
+	 */
+	if (!P_RIGHTMOST(topaque) &&
+		!_bt_check_natts(state->rel, state->target, P_HIKEY))
 	{
-		if (!_bt_check_natts(state->rel, state->target, P_HIKEY))
-		{
-			ItemId		itemid;
-			IndexTuple	itup;
-			char	   *itid,
-	   *htid;
+		ItemId		itemid;
+		IndexTuple	itup;
 
-			itemid = PageGetItemId(state->target, P_HIKEY);
-			itup = (IndexTuple) PageGetItem(state->target, itemid);
-			itid = psprintf("(%u,%u)", state->targetblock, P_HIKEY);
-			htid = psprintf("(%u,%u)",
-			ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-			ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+		itemid = PageGetItemId(state->target, P_HIKEY);
+		itup = (IndexTuple) PageGetItem(state->target, itemid);
 
-			ereport(ERROR,
-	(errcode(ERRCODE_INDEX_CORRUPTED),
-	 errmsg("wrong number of index tuple attributes for index \"%s\"",
-			RelationGetRelationName(state->rel)),
-	 errdetail_internal("Index tid=%s natts=%u points to %s tid=%s page lsn=%X/%X.",
-		itid,
-		BTreeTupGetNAtts(itup, state->rel),
-		P_ISLEAF(topaque) ? "heap" : "index",
-		htid,
-		(uint32) (state->targetlsn >> 32),
-		(uint32) state->targetlsn)));
-		}
+		ereport(ERROR,
+(errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("wrong number of high key index tuple attributes in index \"%s\"",
+		RelationGetRelationName(state->rel)),
+ errdetail_internal("Index block=%u natts=%u block type=%s page lsn=%X/%X.",
+	state->targetblock,
+	BTreeTupleGetNAtts(itup, state->rel),
+	P_ISLEAF(topaque) ? "heap" : "index",
+	(uint32) (state->targetlsn >> 32),
+	(uint32) state->targetlsn)));
 	}
 
-
 	/*
 	 * Loop over page items, starting from first non-highkey item, not high
-	 * key (if any).  Also, immediately skip "negative infinity" real item (if
-	 * any).
+	 * key (if any).  Most tests are not performed for the "negative infinity"
+	 * real item (if any).
 	 */
 	for (offset = P_FIRSTDATAKEY(topaque);
 		 offset <= max;
@@ -791,7 +786,7 @@ bt_target_page_check(BtreeCheckState *state)
 		tupsize, ItemIdGetLength(itemid),
 		(uint32) (state->targetlsn >> 32),
 		(uint32) state->targetlsn),
-	 errhint("This could be a torn page problem")));
+	 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
 		if (!_bt_check_natts(state->rel, state->target, offset))
@@ -806,11 +801,11 @@ bt_target_page_check(BtreeCheckState *state)
 
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
-	 errmsg("wrong number of index tuple attributes for index \"%s\"",
+	 errmsg("wrong number of index tuple attributes in index \"%s\"",
 			RelationGetRelationName(state->rel)),
 	 errdetail_internal("Index tid=%s natts=%u points to %s tid=%s page lsn=%X/%X.",
 		itid,
-		BTreeTupGetNAtts(itup, state->rel),
+		BTreeTupleGetNAtts(itup, state->rel),
 		P_ISLEAF(topaque) ? "heap" : "index",
 		htid,
 		(uint32) (state->targetlsn >> 32),
@@ -818,8 +813,8 @@ bt_target_page_check(BtreeCheckState *state)
 		}
 
 		/*
-		 * Don't try to generate scankey using 

Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-04-18 Thread Tom Lane
"MauMau"  writes:
> I'd like to continue to think of a solution and create a patch, based
> on the severity and how the customer will respond to our answer.  I
> have a feeling that we have to say it's a bit serious, since it
> requires recovery from a base backup, not just rebuilding indexes.
> The patch development may be after PGCon.

It's definitely something we need to work on.  It looks to me like
the original thread died because we weren't in a part of the cycle
where we wanted to work on major patches ... and here we are again :-(.
So yeah, targeting a fix for v12 might be the right way to think
about it, seeing how seldom the problem crops up.

regards, tom lane



Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-04-18 Thread MauMau
From: Tom Lane
[ re-reads thread... ]  The extra assumption you need in order to have
trouble is that the blocks in question are dirty in shared buffers and
have never been written to disk since their rows were deleted.  Then
the situation is that the page image on disk shows the rows as live,
while the up-to-date page image in memory correctly shows them as
dead.
Relation truncation throws away the page image in memory without ever
writing it to disk.  Then, if the subsequent file truncate step fails,
we have a problem, because anyone who goes looking for that page will
fetch it afresh from disk and see the tuples as live.


Thank you so much, I got it!  And I'm always impressed at how quick
and concise you are, while you are busy addressing multiple issues and
answering user questions.  Maybe I wouldn't be surprised to hear that
there are multiple clones of Tom Lane.

I'd like to continue to think of a solution and create a patch, based
on the severity and how the customer will respond to our answer.  I
have a feeling that we have to say it's a bit serious, since it
requires recovery from a base backup, not just rebuilding indexes.
The patch development may be after PGCon.

Regards
MauMau






Re: Proposal: Adding json logging

2018-04-18 Thread Alvaro Herrera
John W Higgins wrote:
> On Sun, Apr 15, 2018 at 11:08 AM, David Arnold  wrote:
> 
> > >This would appear to solve multiline issues within Fluent.
> > >https://docs.fluentd.org/v0.12/articles/parser_multiline
> >
> > I definitely looked at that, but what guarantees do I have that the
> > sequence is always ERROR/STATEMENT/DETAIL? And not the other way round?
> 
> Have you asked that question? You seem to at least have opened the source
> code - did you try to figure out what the logging format is?

I looked at this a couple of days ago.  I think parsing with this
library is possible to a certain extent, and the problems stem from
limitations of the library.  So, it turns out that the firstline can be
set to a single regex that searches for PANIC, FATAL, ERROR, WARNING,
LOG, NOTICE, DEBUG.  That's always the first line in any postgres log
event.

A log event contains some subsequent lines.  Those start either with a
tab (which is a continuation of the previous line) or with one of
DETAIL, CONTEXT, HINT, STATEMENT, QUERY.  This seems very simple to
parse (just add lineN patterns for those), *except* that the messages
can be multiline too; and where would you assign the continuation lines
for each of those?  parser_multiline does not support that.

Another thing worth keeping in mind is that you need to change the regex
depending on log_line_prefix, which sounds very painful.

All in all, the best approach might be to create a specific
parser_postgresql.rb plugin.  Seems much easier to write two dozen lines
of Ruby than change all of PostgreSQL's logging infrastructure.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Tom Lane
I wrote:
> Pavan Deolasee  writes:
>> What if we remember the buffers as seen by count_nondeletable_pages() and
>> then just discard those specific buffers instead of scanning the entire
>> shared_buffers again?

> That's an idea.

BTW, before pushing too hard on any of this, we need to think about the
data-corruption hazard that MauMau just reminded us about.  I'm afraid
what we're likely to end up with after the dust settles is worse
performance than today, not better :-(.

https://postgr.es/m/5BBC590AE8DF4ED1A170E4D48F1B53AC@tunaPC

regards, tom lane



Re: Proposal: Adding json logging

2018-04-18 Thread David Arnold
Excellent phrasing (thanks to Christophe!): "There is a large class of log
analysis tools out there that have trouble with multiline formats and we
should be good ecosystem players"

> I'm much less fussed by this in contrib/ (with the same concern you
noted), at a minimum as an example of how to do logging in other formats.

This would be a well-balanced compromise: almost every distribution
flavour also packages contrib, so installing contrib and loading such a module
as a convenient pre-packaged shared library would be an excellent
solution for the big majority of Postgres users.

Now we are moving to the solutions space :) I wanted to wait this week,
though, to give sufficient time to comment on all aspects and would do
another wrap-up next weekend. This suggestion would definitely be part of
it.

On Wed., Apr. 18, 2018 at 14:10, Christophe Pettus ()
wrote:

>
> > On Apr 18, 2018, at 11:59, Robert Haas  wrote:
> >
> > I'm not sure exactly how you intended this comment, but it seems to
> > me that whether CSV is easy or hard to parse, somebody might
> > legitimately find JSON more convenient.
>
> Of course.  The specific comment I was replying to made a couple of jumps
> that I wanted to unwind: The first is that we don't have a machine-readable
> format for PostgreSQL (we do, CSV), and that there was "no substantial
> objection to this need."
>
> If the requirement is: "There is a large class of log analysis tools out
> there that have trouble with multiline formats and we should be good
> ecosystem players," that's fine.  (I'm a bit sour about the number of tools
> being written with one-line-per-event baked into them and whose solution to
> any other format is "use regex," but that's neither here nor there, I
> suppose.)
>
> My primary objection to creating new output formats is that it creates an
> implicit burden on downstream tools to adopt them.  For example, a lot of
> query analysis tools don't yet process JSON-format plans, and they've been
> around for a while.  By introducing a new format in core (which was the
> starting proposal), we're essentially telling all the tools (such as
> pgbadger) that might absorb them that we expect them to adopt that too.
>
> > For the record, I'm tentatively in favor of including something like
> > this in contrib.
>
> I'm much less fussed by this in contrib/ (with the same concern you
> noted), at minimum as an example of how to do logging in other formats.
>
> --
> -- Christophe Pettus
>x...@thebuild.com
>
> --
DAVID ARNOLD
Gerente General
xoe.solutions
dar@xoe.solutions
+57 (315) 304 13 68
*Confidentiality Note: * This email may contain confidential and/or private
information. If you received this email in error please delete and notify
sender.
*Environmental Consideration: * Please avoid printing this email on paper,
unless really necessary.


Re: Query is over 2x slower with jit=on

2018-04-18 Thread Andres Freund


On April 18, 2018 12:16:35 PM PDT, Robert Haas  wrote:
>On Wed, Apr 18, 2018 at 11:50 AM, Andres Freund 
>wrote:
>> JIT has cost, and sometimes it's not beneficial. Here our heuristics
>> when to JIT appear to be a bit off. In the parallel world it's worse
>> because the JITing is duplicated for parallel workers atm.
>
>It seems like you're describing it as if the JIT just didn't produce
>gains sufficient to make up for the cost of doing it, but that's not
>really the issue here AFAICS.  Here the JIT actually made code that
>ran slower than the un-JIT-ted code.  That seems like a different sort
>of problem.

Not convinced that that is true - the issue is more likely that JIT work in 
workers is counted as execute time... Gotta add that somehow, not sure what the 
best way would be.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Query is over 2x slower with jit=on

2018-04-18 Thread Robert Haas
On Wed, Apr 18, 2018 at 11:50 AM, Andres Freund  wrote:
> JIT has cost, and sometimes it's not beneficial. Here our heuristics
> when to JIT appear to be a bit off. In the parallel world it's worse
> because the JITing is duplicated for parallel workers atm.

It seems like you're describing it as if the JIT just didn't produce
gains sufficient to make up for the cost of doing it, but that's not
really the issue here AFAICS.  Here the JIT actually made code that
ran slower than the un-JIT-ted code.  That seems like a different sort
of problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal: Adding json logging

2018-04-18 Thread Christophe Pettus

> On Apr 18, 2018, at 11:59, Robert Haas  wrote:
> 
> I'm not sure exactly how you intended this comment, but it seems to
> me that whether CSV is easy or hard to parse, somebody might
> legitimately find JSON more convenient.

Of course.  The specific comment I was replying to made a couple of jumps that 
I wanted to unwind: The first is that we don't have a machine-readable format 
for PostgreSQL (we do, CSV), and that there was "no substantial objection to 
this need."

If the requirement is: "There is a large class of log analysis tools out there
that have trouble with multiline formats and we should be good ecosystem
players," that's fine.  (I'm a bit sour about the number of tools being written
with one-line-per-event baked into them and whose solution to any other format 
is "use regex," but that's neither here nor there, I suppose.)

My primary objection to creating new output formats is that it creates an 
implicit burden on downstream tools to adopt them.  For example, a lot of query
analysis tools don't yet process JSON-format plans, and they've been around for 
a while.  By introducing a new format in core (which was the starting 
proposal), we're essentially telling all the tools (such as pgbadger) that 
might absorb them that we expect them to adopt that too.

> For the record, I'm tentatively in favor of including something like
> this in contrib.

I'm much less fussed by this in contrib/ (with the same concern you noted), at 
minimum as an example of how to do logging in other formats.

--
-- Christophe Pettus
   x...@thebuild.com




Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-04-18 Thread Tom Lane
"MauMau"  writes:
> However, I have a question.  How does the truncation failure in
> autovacuum lead to duplicate keys?  The failed-to-be-truncated pages
> should only contain dead tuples, so pg_dump's table scan should ignore
> dead tuples in those pages.

[ re-reads thread... ]  The extra assumption you need in order to have
trouble is that the blocks in question are dirty in shared buffers and
have never been written to disk since their rows were deleted.  Then
the situation is that the page image on disk shows the rows as live,
while the up-to-date page image in memory correctly shows them as dead.
Relation truncation throws away the page image in memory without ever
writing it to disk.  Then, if the subsequent file truncate step fails,
we have a problem, because anyone who goes looking for that page will
fetch it afresh from disk and see the tuples as live.

There are WAL entries recording the row deletions, but that doesn't
help unless we crash and replay the WAL.

It's hard to see a way around this that isn't fairly catastrophic for
performance :-(.  But in any case it's wrapped up in order-of-operations
issues.  I've long since forgotten the details, but I seem to have thought
that there were additional order-of-operations hazards besides this one.

regards, tom lane



Re: Proposal: Adding json logging

2018-04-18 Thread Robert Haas
On Sun, Apr 15, 2018 at 1:07 PM, Christophe Pettus  wrote:
>> On Apr 15, 2018, at 09:51, David Arnold  wrote:
>> 1. Throughout this vivid discussion a good portion of support has already 
>> been manifested for the need of a more structured (machine readable) logging 
>> format. There has been no substantial objection to this need.
>
> I'm afraid I don't see that.  While it's true that as a standard, CSV is 
> relatively ill-defined, as a practical matter in PostgreSQL it is very easy 
> to write code that parses .csv format.

I'm not sure exactly how you intended this comment, but it seems to
me that whether CSV is easy or hard to parse, somebody might
legitimately find JSON more convenient.  For example, and as has been
discussed on this thread, if you have a system that is consuming the
logs that already knows how to parse JSON but does not know how to
parse CSV, then you will find the JSON format to be convenient.

For the record, I'm tentatively in favor of including something like
this in contrib.  I think it's useful to have more examples of how to
use our existing hooks in contrib, and I think this is useful on
principle.

I am a little concerned about this bit from the README, though:


Note that logging_collector should be enabled in postgresql.conf to
ensure consistent log outputs.  As JSON strings are longer than normal
logs generated by PostgreSQL, this module increases the odds of malformed
log entries.


I'm not sure I understand the issue, but I don't like malformed log entries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Deadlock in multiple CIC.

2018-04-18 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Will look into that too.  I'm not sure that adding extra expected
>> outputs is sane, though --- might be best to just force the intended
>> isolation level within those tests.

> As I recall (not much, admittedly) that was one of the options we
> considered in the old commit, but since the other isolation levels
> behaved differently we ageed that it was worth adding coverage for them.
> I don't know which ones are failing now; maybe forcing a specific
> isolation level is sufficient.

So the other one that fails for me is tuplelock-update, and a bit of
bisection testing confirms that it's never worked under serializable
isolation.  The "expected" behavior in that case is unexciting --- all but
the s1 transaction just fail with "could not serialize access".  So I'm
not convinced whether it's better to provide that as an expected output,
or to change the test to force a lower isolation level where it's testing
something useful.  Anyway, that's your test case so I'll leave it to you
to do something with it.

> Clearly we should have done something to make sure these tests were run
> periodically with different isolation levels.

Yeah :-(.  Both of these test cases have failed this scenario since
they were created.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Tom Lane
Pavan Deolasee  writes:
> What if we remember the buffers as seen by count_nondeletable_pages() and
> then just discard those specific buffers instead of scanning the entire
> shared_buffers again?

That's an idea.

> Surely we revisit all to-be-truncated blocks before
> actual truncation. So we already know which buffers to discard. And we're
> holding exclusive lock at that point, so nothing can change underneath. Of
> course, we can't really remember a large number of buffers, so we can do
> this in small chunks.

Hm?  We're deleting the last N consecutive blocks, so it seems like we
just need to think in terms of clearing that range.  I think this can
just be a local logic change inside DropRelFileNodeBuffers().

You could optimize it fairly easily with some heuristic that compares
N to sizeof shared buffers; if it's too large a fraction, the existing
implementation will be cheaper than a bunch of hashtable probes.
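
For illustration, a minimal sketch of that per-block approach (untested, and
only a sketch: it assumes the existing buffer-mapping helpers INIT_BUFFERTAG,
BufTableHashCode, BufTableLookup and the bufmgr-private InvalidateBuffer can
be used this way, and that the caller holds AccessExclusiveLock as VACUUM
truncation does):

static void
DropRelFileNodeBuffersByRange(RelFileNode rnode, ForkNumber forkNum,
                              BlockNumber firstDelBlock, BlockNumber nblocks)
{
    BlockNumber blkno;

    for (blkno = firstDelBlock; blkno < nblocks; blkno++)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partitionLock;
        int         buf_id;

        /* probe the buffer mapping table for this block */
        INIT_BUFFERTAG(tag, rnode, forkNum, blkno);
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        /* InvalidateBuffer re-checks the tag, so a stale hit is harmless */
        if (buf_id >= 0)
            InvalidateBuffer(GetBufferDescriptor(buf_id));
    }
}

A caller could then apply the heuristic above: take this path only when
nblocks - firstDelBlock is a small fraction of NBuffers, and fall back to the
existing full scan of shared_buffers otherwise.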

regards, tom lane



Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-04-18 Thread MauMau
Hello,

It seems to me that our customer might have hit an unresolved data
corruption issue which is already known in this ML, but I can't figure
out why this happened.  I'd appreciate if you could give me your
thoughts.  Depending on the severity of this issue and the customer's
request, I think I'll submit a patch to solve the unresolved issue.

The customer is using PostgreSQL 9.2 on Windows.  Autovacuum failed to
truncate the pages at the end of a table, which is probably due to
anti-virus software.  FYI, automatic checkpoint was in progress when
this error occurred.

ERROR:  could not truncate file "base/aaa/bbb" to 58 blocks:
Permission denied

After a while, an application got a unique key violation error.  The
customer says that there shouldn't be any duplicate keys.

ERROR:  duplicate key value violates unique constraint "pk_xxx"

The output of pg_dump on that table certainly includes multiple pairs
of rows with the same primary key values... data corruption.



Another Japanese user, who is not our customer, hit the same problem
with 9.4, which was not solved (note: the mail is in Japanese).  He
said he repeatedly encountered the same error even after REINDEXing
with 9.4, but it doesn't happen with 9.1.  I wonder if there's
something introduced in 9.2 which causes the issue, such as index-only
scan stuff:

https://ml.postgresql.jp/pipermail/pgsql-jp/2016-October/016865.html


The problem with truncation failure was found in 2010.  The user
reported the problem as another phenomenon on 9.0.  The problem could
not be solved even by leading hackers here -- Tom, Robert, Alvaro,
Heikki, Greg Stark, etc.

TODO
https://wiki.postgresql.org/wiki/Todo

Restructure truncation logic to be more resistant to failure
This also involves not writing dirty buffers for a truncated or
dropped relation
http://archives.postgresql.org/pgsql-hackers/2010-08/msg01032.php



Tom's comments in above thread

Imagine that we have some rows at the end of a table, we delete them,
we vacuum before the next checkpoint.  Vacuum decides it can now
truncate away the last pages, but fails to do so.  The original page
state is still on disk, which means we have lost the fact of the
deletion --- the rows are now effectively live again, though their
index entries are probably gone.

...
Still, we have a live issue with heap truncation during plain VACUUM.
However, the scope of the problem seems a lot less than I was
thinking.
Maybe write-the-buffers-first is a sufficient longterm solution.

...
So it seems like the only case where there is really grounds for PANIC
on failure is the VACUUM case.  And there we might apply Heikki's idea
of trying to zero the untruncatable pages first.

...
I'm thinking that we need some sort of what-to-do-on-error flag passed
into RelationTruncate, plus at least order-of-operations fixes in
several other places, if not a wholesale refactoring of this whole call
stack.  But I'm running out of steam and don't have a concrete proposal
to make right now.  In any case, we've got more problems here than just
the original one of forgetting dirty buffers too soon.



However, I have a question.  How does the truncation failure in
autovacuum lead to duplicate keys?  The failed-to-be-truncated pages
should only contain dead tuples, so pg_dump's table scan should ignore
dead tuples in those pages.

Regards
MauMau






Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Pavan Deolasee
On Wed, Apr 18, 2018 at 10:50 PM, Fujii Masao  wrote:

>
>
> I'm not sure if it's safe to forcibly cancel VACUUM's truncation while
> scanning shared_buffers. That scan happens after WAL-logging and before
> the actual truncation.
>
>
Ah ok. I misread your proposal. This is about the shared_buffers scan
in DropRelFileNodeBuffers() and we can't cancel that operation.

What if we remember the buffers as seen by count_nondeletable_pages() and
then just discard those specific buffers instead of scanning the entire
shared_buffers again? Surely we revisit all to-be-truncated blocks before
actual truncation. So we already know which buffers to discard. And we're
holding exclusive lock at that point, so nothing can change underneath. Of
course, we can't really remember a large number of buffers, so we can do
this in small chunks. Scan last K blocks, remember those K buffers, discard
those K buffers, truncate the relation and then try for next K blocks. If
another backend requests lock on the table, we give up or retry after a
while.
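
To make the shape of that loop concrete, here is a rough control-flow sketch
(hypothetical only: collect_empty_tail_blocks() is an invented stand-in for
the re-check-and-remember step, and the buffer invalidation inside the
truncation path would have to discard just the remembered buffers rather
than scan all of shared_buffers):

#define TRUNCATE_CHUNK_BLOCKS 1024

static void
lazy_truncate_in_chunks(Relation onerel, BlockNumber old_rel_pages,
                        BlockNumber new_rel_pages)
{
    while (old_rel_pages > new_rel_pages)
    {
        BlockNumber chunk_start;

        chunk_start = Max(new_rel_pages,
                          old_rel_pages - TRUNCATE_CHUNK_BLOCKS);

        /* re-verify the chunk is still empty and remember its buffers */
        if (!collect_empty_tail_blocks(onerel, chunk_start, old_rel_pages))
            break;              /* someone inserted into the tail; stop here */

        /* truncate just this chunk (buffer discard uses the remembered blocks) */
        RelationTruncate(onerel, chunk_start);

        old_rel_pages = chunk_start;

        /* give up politely if another backend is waiting for the lock */
        if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
            break;
    }
}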

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WIP: Covering + unique indexes.

2018-04-18 Thread Peter Geoghegan
On Wed, Apr 18, 2018 at 10:10 AM, Teodor Sigaev  wrote:
> I mostly agree with your patch, nice work, but I have some notices for your
> patch:

Thanks.

> 1)
> bt_target_page_check():
> if (!P_RIGHTMOST(topaque) &&
> !_bt_check_natts(state->rel, state->target, P_HIKEY))
>
> Seems not very obvious: it looks like we don't need to check nattrs on
> rightmost page. Okay, I remember that on the rightmost page there is no hikey at
> all, but at least a comment should be added. Implicitly, bt_target_page_check()
> already takes into account 'is the page rightmost or not?' by using
> P_FIRSTDATAKEY, so it may be better to move the rightmost check into
> bt_target_page_check() with some refactoring of the if-logic:

I don't understand. We do check the number of attributes on rightmost
pages, but we do so separately, in the main loop. For every item that
isn't the high key.

This code appears before the main bt_target_page_check() loop because
we're checking the high key itself, on its own, which is a new thing.
The high key is also involved in the loop (on non-rightmost pages),
but that's only because we check real items *against* the high key (we
assume the high key is good and that the item might be bad). The high
key is involved in every iteration of the main loop (on non-rightmost
pages), rather than getting its own loop.

That said, I am quite happy if you want to put a comment about this
being the rightmost page at the beginning of the check.

> 2)
> Style notice:
> ItemPointerSetInvalid(_tid);
> +   BTreeTupSetNAtts(, 0);
> if (PageAddItem(page, (Item) , sizeof(IndexTupleData),
> P_HIKEY,
> It's better to have blank line between BTreeTupSetNAtts() and if clause.

Sure.

> 3) Naming BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use the full
> Tuple word in the downlink macros, here we use just Tup. Seems better to
> have Tuple in both cases. Or Tup, but still in both cases.

+1

> 4) BTreeTupSetNAtts - it seems better to add a check that nattrs fits into the
> BT_N_KEYS_OFFSET_MASK mask and does not touch the BT_RESERVED_OFFSET_MASK
> bits; right now it will overwrite those bits.

An assertion sounds like it would be an improvement, though I don't
see that in the patch you posted.
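
For illustration only, the suggested sanity check could be folded into the
setter along these lines (a sketch; the storage details, i.e. keeping the
attribute count in the t_tid offset field behind INDEX_ALT_TID_MASK, are
assumptions based on the patch under discussion):

#define BTreeTupleSetNAtts(itup, n) \
	do { \
		/* nattrs must fit in the key-count bits and leave reserved bits alone */ \
		AssertMacro(((n) & BT_N_KEYS_OFFSET_MASK) == (n)); \
		(itup)->t_info |= INDEX_ALT_TID_MASK; \
		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n)); \
	} while (0)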

> The attached patch is rebased to current head and contains a comment
> improvement in index_truncate_tuple() - you save some memory with the
> TupleDescCopy() call but didn't explain why pfree() is enough to free all
> allocated memory.

Makes sense.

-- 
Peter Geoghegan



Re: pgindent run soon?

2018-04-18 Thread Tom Lane
Teodor Sigaev  writes:
>> If there are large refactoring or bug-fix patches that haven't landed
>> yet, then it'd be appropriate to wait for those to get in, but I'm not
>> aware of such at the moment.

> Pls, wait
> https://www.postgresql.org/message-id/9c63951d-7696-ecbb-b832-70db7ed3f39b%40sigaev.ru

Sure.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Fujii Masao
On Wed, Apr 18, 2018 at 11:29 PM, Pavan Deolasee
 wrote:
>
>
> On Tue, Apr 17, 2018 at 11:05 PM, Fujii Masao  wrote:
>>
>> Hi,
>>
>> I'd like to propose to add $SUBJECT for performance improvement.
>>
>> When VACUUM tries to truncate the trailing empty pages, it scans
>> shared_buffers to invalidate the pages-to-truncate while holding an
>> AccessExclusive lock on the relation. So if shared_buffers is huge,
>> other transactions need to wait for a very long time before accessing
>> the relation, which would cause response-time spikes; for example, I
>> observed such spikes several times on the server with shared_buffers =
>> 300GB while running the benchmark. Therefore, I'm thinking of proposing
>> $SUBJECT and enabling it to avoid such spikes for that relation.
>
>
> Alvaro reminded me that we already have a mechanism in place which forces
> VACUUM to give up the exclusive lock if another backend is waiting on the
> lock for more than a certain pre-defined duration. AFAICS we give up the lock,
> but again retry truncation from the previously left off position. What if we
> make that lock-wait duration configurable on a per-table basis? And may be a
> special value to never truncate (though it seems quite excessive to me and a
> possible footgun)

I'm not sure if it's safe to forcibly cancel VACUUM's truncation while
scanning shared_buffers. That scan happens after WAL-logging and before
the actual truncation.

> I was actually thinking in the other direction. So between the time VACUUM
> figures out it can possibly truncate last K pages, some backend may insert a
> tuple in some page and make the truncation impossible. What if we truncate
> the FSM before starting the backward scan so that new inserts go into the
> pages prior to the truncation point, if possible. That will increase the
> chances of VACUUM being able to truncate all the empty pages. Though I think
> in some cases it might lead to unnecessary further extension of the
> relation. Maybe we could use some heuristic based on available free space in the
> table prior to the truncation point?

Isn't this too complicated? I wonder what heuristic we can use here.

Regards,

-- 
Fujii Masao



Re: pgindent run soon?

2018-04-18 Thread Teodor Sigaev

If there are large refactoring or bug-fix patches that haven't landed
yet, then it'd be appropriate to wait for those to get in, but I'm not
aware of such at the moment.

Pls, wait
https://www.postgresql.org/message-id/9c63951d-7696-ecbb-b832-70db7ed3f39b%40sigaev.ru


Thank you.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-18 Thread Teodor Sigaev

I mostly agree with your patch, nice work, but I have some notes on your
patch:

1)
bt_target_page_check():
if (!P_RIGHTMOST(topaque) &&
!_bt_check_natts(state->rel, state->target, P_HIKEY))

Seems not very obvious: it looks like we don't need to check nattrs on
rightmost page. Okay, I remember that on the rightmost page there is no hikey at
all, but at least a comment should be added. Implicitly, bt_target_page_check()
already takes into account 'is the page rightmost or not?' by using
P_FIRSTDATAKEY, so it may be better to move the rightmost check into
bt_target_page_check() with some refactoring of the if-logic:


if (offset > maxoff)
return true; //nothing to check, also covers empty rightmost page

if (P_ISLEAF) {
if (offnum >= P_FIRSTDATAKEY)
...
else /* if (offnum == P_HIKEY) */
...
}
else // !P_ISLEAF
{
if (offnum == P_FIRSTDATAKEY)
...
else if (offnum > P_FIRSTDATAKEY)
...
else /* if (offnum == P_HIKEY) */
...
}

I see only 3 possible nattrs values: 0, nkey and nkey+nincluded, but
collapsing the if-clause to three branches causes difficulties for code readers. Let
the compiler optimize that. Sorry for the late notice, but it caught my attention only
when I noticed the (!P_RIGHTMOST && !_bt_check_natts) condition.


2)
Style notice:
ItemPointerSetInvalid(_tid);
+   BTreeTupSetNAtts(, 0);
if (PageAddItem(page, (Item) , sizeof(IndexTupleData), P_HIKEY,
It's better to have blank line between BTreeTupSetNAtts() and if clause.

3) Naming BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use the full
Tuple word in the downlink macros, here we use just Tup. Seems better to have
Tuple in both cases. Or Tup, but still in both cases.


4) BTreeTupSetNAtts - it seems better to add a check that nattrs fits into the
BT_N_KEYS_OFFSET_MASK mask and does not touch the BT_RESERVED_OFFSET_MASK
bits; right now it will overwrite those bits.


The attached patch is rebased to current head and contains a comment improvement
in index_truncate_tuple() - you save some memory with the TupleDescCopy()
call but didn't explain why pfree() is enough to free all allocated memory.





Peter Geoghegan wrote:

On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov
 wrote:

Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc
argument, not a relation?  It doesn't need the number of key attributes anyway,
only the total number of attributes.  Then _bt_isequal() would be able to use
BTreeTupGetNAtts().


That would make the BTreeTupGetNAtts() assertions quite a bit more
verbose, since there is usually no existing tuple descriptor variable,
but there is almost always a "rel" variable. The coverage within
_bt_isequal() does not seem important, because we only use it with the
page high key in rare cases, where _bt_moveright() will already have
tested the highkey.


I think it's completely OK to fix broken things when you've to touch
them.  Probably, Teodor would decide to make that by separate commit.
So, it's up to him.


You're right to say that this old negative infinity tuple code within
_bt_insertonpg() is broken code, and not just dead code. The code
doesn't just confuse things (e.g. see recent commit 2a67d644). It also
seems like it could actually be harmful. This is code that could only
ever corrupt your database.

I'm fine if Teodor wants to commit that change separately, of course.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index be0206d58e..7efd7ac47b 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -698,6 +698,9 @@ nextpage:
  *	 "real" data item on the page to the right (if such a first item is
  *	 available).
  *
+ * - That tuples report that they have the expected number of attributes.
+ *	 INCLUDE index pivot tuples should not contain non-key attributes.
+ *
  * Furthermore, when state passed shows ShareLock held, and target page is
  * internal page, function also checks:
  *
@@ -722,43 +725,32 @@ bt_target_page_check(BtreeCheckState *state)
 	elog(DEBUG2, "verifying %u items on %s block %u", max,
 		 P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock);
 
-
-	/* Check the number of attributes in high key if any */
-	if (!P_RIGHTMOST(topaque))
+	/* Check the number of attributes in high key */
+	if (!P_RIGHTMOST(topaque) &&
+		!_bt_check_natts(state->rel, state->target, P_HIKEY))
 	{
-		if (!_bt_check_natts(state->rel, state->target, P_HIKEY))
-		{
-			ItemId		itemid;
-			IndexTuple	itup;
-			char	   *itid,
-	   *htid;
+		ItemId		itemid;
+		IndexTuple	itup;
 
-			itemid = PageGetItemId(state->target, P_HIKEY);
-			itup = (IndexTuple) PageGetItem(state->target, itemid);
-			itid = psprintf("(%u,%u)", 

Re: Query is over 2x slower with jit=on

2018-04-18 Thread Andres Freund


On April 18, 2018 9:50:48 AM PDT, Chapman Flack  wrote:
>On 04/18/2018 12:27 PM, Simon Riggs wrote:
>
>> Please change the name of the "JIT" parameter to something meaningful
>> to humans before this gets too far into the wild.
>> 
>> SSL is somewhat understandable because it's not a Postgres-private
>> term.
>
>JIT is hardly a Postgres-private term. It's a familiar term in a
>widespread community, though I might cede the point that the
>community in question is more compiler wonks than DBAs.

You're right. There's another thread where we discussed this, also due to a
concern of Simon's. Can we not do so separately here?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Speedup of relation deletes during recovery

2018-04-18 Thread Fujii Masao
On Wed, Apr 18, 2018 at 10:44 AM, Michael Paquier  wrote:
> On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote:
>> Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
>> Per the discussion [1], this unused function was left intentionally.
>> But it's still dead code since 2012, so I'd like to remove it. Patch 
>> attached.
>
> Indeed, it's close to six years and the status is the same.  So let's
> drop it.  I have been surrounding the area to see if any modules
> actually use those, particularly on github, but I could not find
> callers.
>
> The patch looks logically fine to me.  In your first message, you
> mentioned that the replay time increased a lot.  Do you have numbers to
> share with some large settings of shared_buffers?

No. But after my colleague truncated more than one hundred tables on
the server with shared_buffers = 300GB, the recovery could not finish
even 10 minutes after it had started. So I had to shut down the server
immediately, temporarily set shared_buffers to a very small value, and
start the server so that the recovery would finish quickly.

> It would be better to wait for v12 branch to open before pushing
> anything, as the focus is on stabilizing things on v11.

Yes, so I added this to next CommitFest.

Regards,

-- 
Fujii Masao



Re: Query is over 2x slower with jit=on

2018-04-18 Thread Chapman Flack
On 04/18/2018 12:27 PM, Simon Riggs wrote:

> Please change the name of the "JIT" parameter to something meaningful
> to humans before this gets too far into the wild.
> 
> SSL is somewhat understandable because it's not a Postgres-private term.

JIT is hardly a Postgres-private term. It's a familiar term in a
widespread community, though I might cede the point that the
community in question is more compiler wonks than DBAs.

-Chap



Re: Deadlock in multiple CIC.

2018-04-18 Thread Alvaro Herrera
Tom Lane wrote:

> *** /home/postgres/pgsql/src/test/isolation/expected/lock-update-delete_1.out 
>   Mon Feb 12 14:53:46 2018
> --- 
> /home/postgres/pgsql/src/test/isolation/output_iso/results/lock-update-delete.out
>Wed Apr 18 11:30:23 2018
> ***
> *** 150,156 
>   
>   t  
>   step s1l: <... completed>
> ! error in steps s2_unlock s1l: ERROR:  could not serialize access due to 
> concurrent update
>   
>   starting permutation: s2b s1l s2u s2_blocker1 s2r s2_unlock
>   pg_advisory_lock
> --- 150,158 
>   
>   t  
>   step s1l: <... completed>
> ! keyvalue  
> ! 
> ! 1  1  
>   
>   starting permutation: s2b s1l s2u s2_blocker1 s2r s2_unlock
>   pg_advisory_lock
> 
> It looks like maybe this one wasn't updated in 533e9c6b0 --- would
> you check/confirm that?

I think the new output is correct in REPEATABLE READ but it represents a
bug for the SERIALIZABLE mode.

The case is: a tuple is updated, with its key not modified; a concurrent
transaction is trying to read the tuple.  The original expected output
says that the reading transaction is aborted, which matches what the
test comment says:

# When run in REPEATABLE READ or SERIALIZABLE transaction isolation levels, all
# permutations that commit s2 cause a serializability error; all permutations
# that rollback s2 can get through.

In REPEATABLE READ it seems fine to read the original version of the
tuple (returns 1) and not raise an error (the reading transaction will
simply see the value that was current when it started).  But in
SERIALIZABLE mode, as far as I understand it, this case should raise a
serializability error.


I hope I'm wrong.  Copying Kevin, just in case.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Query is over 2x slower with jit=on

2018-04-18 Thread Simon Riggs
On 18 April 2018 at 16:50, Andres Freund  wrote:
> On 2018-04-18 17:35:31 +0200, Andreas Joseph Krogh wrote:
>> With jit=on:
>> https://explain.depesz.com/s/vYB
>> Planning Time: 0.336 ms
>>  JIT:
>>   Functions: 716
>>   Generation Time: 78.404 ms
>>   Inlining: false
>>   Inlining Time: 0.000 ms
>>   Optimization: false
>>   Optimization Time: 43.916 ms
>>   Emission Time: 600.031 ms
>
> Any chance this is a debug LLVM build?
>
>
>> What's the deal with jit making it slower?
>
> JIT has cost, and sometimes it's not beneficial. Here our heuristics
> when to JIT appear to be a bit off. In the parallel world it's worse
> because the JITing is duplicated for parallel workers atm.

Please change the name of the "JIT" parameter to something meaningful
to humans before this gets too far into the wild.

SSL is somewhat understandable because it's not a Postgres-private term.

geqo is regrettable and we really don't want any more too
short/abbreviated parameter names.

Think of our EOU if every GUC was a TLA.

Thanks

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Sv: Re: Query is over 2x slower with jit=on

2018-04-18 Thread Andreas Joseph Krogh
On Wednesday, 18 April 2018 at 17:50:55, Andres Freund wrote:
On 2018-04-18 17:35:31 +0200, Andreas Joseph Krogh wrote:
 > With jit=on:
 > https://explain.depesz.com/s/vYB
 > Planning Time: 0.336 ms
 >  JIT:
 >   Functions: 716
 >   Generation Time: 78.404 ms
 >   Inlining: false
 >   Inlining Time: 0.000 ms
 >   Optimization: false
 >   Optimization Time: 43.916 ms
 >   Emission Time: 600.031 ms

 Any chance this is a debug LLVM build?


 > What's the deal with jit making it slower?

 JIT has cost, and sometimes it's not beneficial. Here our heuristics
 when to JIT appear to be a bit off. In the parallel world it's worse
 because the JITing is duplicated for parallel workers atm.
 
PostgreSQL is built with "--enable-debug --with-llvm". LLVM is the one that
comes with Ubuntu 17.10.
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com 
 


 


Re: Setting rpath on llvmjit.so?

2018-04-18 Thread Yuriy Zhuravlev
I was talking about the autoconf build system; /tools/msvc is an extra
home-grown build system.

On Thu, 19 Apr 2018, 00:58 Andres Freund,  wrote:

> On 2018-04-18 15:54:59 +, Yuriy Zhuravlev wrote:
> > Current autoconf system not working on Windows at all, what we talk
> about?
>
> We generate windows project files. See src/tools/msvc/
>
> - Andres
>


Re: Setting rpath on llvmjit.so?

2018-04-18 Thread Yuriy Zhuravlev
>
> I also politely decline the offer to be forced to use XCode on mac.


Why? We support MSVC and not nmake, so what's the difference with Xcode? Also,
it's just an extra benefit from cmake/meson.


Re: Setting rpath on llvmjit.so?

2018-04-18 Thread Andres Freund
On 2018-04-18 15:54:59 +, Yuriy Zhuravlev wrote:
> The current autoconf system does not work on Windows at all; what are we talking about?

We generate windows project files. See src/tools/msvc/

- Andres



Re: Setting rpath on llvmjit.so?

2018-04-18 Thread Yuriy Zhuravlev
The current autoconf system does not work on Windows at all; what are we talking about?

On Wed, 18 Apr 2018, 23:57 Robert Haas,  wrote:

> On Tue, Apr 17, 2018 at 4:13 PM, Andres Freund  wrote:
> > I'd not advocate for this solely based on the age of autoconf. But the
> > separate windows buildsystem which makes it very hard to build
> > extensions separately is a good reason on its own. As is the fact that
> > recursive make as we're using it has significant issues. Both of those
> > would be avoided by using cmake or such.
>
> How much of the pain building extensions on Windows comes from our
> build system and how much of it comes from the fact that it's Windows?
>  My guess would have been that the latter is by far the bigger issue,
> but maybe I'm wrong.
>
> There is really nothing keeping us from removing or reducing the use
> of recursive make without switching to cmake.  And it's probably not
> even that much work.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>


Re: Query is over 2x slower with jit=on

2018-04-18 Thread Andres Freund
On 2018-04-18 17:35:31 +0200, Andreas Joseph Krogh wrote:
> With jit=on:
> https://explain.depesz.com/s/vYB
> Planning Time: 0.336 ms 
>  JIT:
>   Functions: 716
>   Generation Time: 78.404 ms
>   Inlining: false
>   Inlining Time: 0.000 ms
>   Optimization: false
>   Optimization Time: 43.916 ms
>   Emission Time: 600.031 ms

Any chance this is a debug LLVM build?


> What's the deal with jit making it slower?

JIT has cost, and sometimes it's not beneficial. Here our heuristics
when to JIT appear to be a bit off. In the parallel world it's worse
because the JITing is duplicated for parallel workers atm.

Greetings,

Andres Freund



Re: [FEATURE PATCH] pg_stat_statements with plans (v02)

2018-04-18 Thread Julian Markwort
On Fri, 2018-04-06 at 13:57 -0700, legrand legrand wrote:
> At startup time there are 2 identical plans found in the view
> I thought it should have been only one: the "initial" one
> (as long as there is no "good" or "bad" one).

Yes, those are 'remnants' from the time where I had two columns, one
for good_plan and one for bad_plan.
Whenever only one plan existed (either because the statement hadn't
been executed more than once or because the deviation from the previous
plan's time wasn't significant enough), the patch would display the
same plan string for both columns.

I'll have to be a little creative, but I think I'll be able to print
only one plan if both the good and bad plan are essentially the
"inital" plan.

You can be assured that the plan string isn't saved twice for those
cases, the good_plan and bad_plan structs just used the same  offset
for the qtexts file.

> maybe 3 "plan_types" like "init", "good" and "bad" should help.

I think the categorization into good and bad plans can be made based
on the execution time of said plan.
I'd suggest using something like
SELECT * FROM pg_stat_statements_plans ORDER BY queryid, plan_time;
This way, if there is only one row for any queryid, this plan is
essentially the "initial" one. If there are two plans, their goodness
can be inferred from the execution times...

> Will a new line be created for good or bad plan if the plan is the
> same ?
> shouldn't we capture "constants" and "parameters" instead?

Capturing constants and parameters would require us to normalize the
plan, which isn't done currently. (Primarily, because it would involve
a lot of logic; secondly, because for my use case, I'm interested in
the specific constants and parameters that led to a good or bad plan.)

Do you have a use case in mind which would profit from normalized
plans?

> Is query text needed in plan?
> It can be huge and is already available (in normalized format)
> in pg_stat_statements. If really needed in raw format
> (with constants) we could force pgss to store it (in replacement of the
> normalized one) using a guc pg_stat_statements.normalized = false
> (I have a patch to do that).

The query is part of what's returned by an explain statement, so I
thought I'd keep the 'plan' intact.
Since all long strings (queries and plans alike) are stored in a file
on disk, I'm not too worried about the space these strings take
up.
I'd think that a hugely long query would lead to an even longer plan,
in which case the problem to focus on might not be to reduce the length
of the string stored by a statistical plugin...

> 
> In Thread:
> http://www.postgresql-archive.org/Poc-pg-stat-statements-with-planid-td6014027.html
> pg_stat_statements view has a new key with dbid,userid,queryid + planid
> and 2 columns added: "created" and "last_updated".

I've taken a look at your patch. I agree that having a plan identifier
would be great, but I'm not a big fan of the jumbling. That's a lot of
hashing that needs to be done to decide whether two plans are
essentially equivalent or not.

Maybe, in the future, in a different patch, we could add functionality
that would allow us to compare two plan trees. There is even a remark
in the header of src/backend/nodes/equalfuncs.c regarding this:
 * Currently, in fact, equal() doesn't know how to compare Plan trees
 * either.  This might need to be fixed someday.

> Maybe your view pg_stat_statements_plans could be transformed:
> same key as pgss: dbid,userid,queryid,planid
> and THE plan column to store the explain result (no more need for
> plan_time nor tz that are already available in modified pgss).

The documentation [of pg_stat_statements] gives no clear indication
which fields qualify as primary keys, mainly because the hashing that
generates the queryid isn't guaranteed to produce unique results...
So I'm not sure which columns should be used to create associations
between pg_stat_statements and pg_stat_statements_plans.


> Those changes to pgss are far from being accepted by the community:
> - planid calculation has to be discussed (I reused the one from pg_stat_plans,
>   but I would have preferred the explain command to provide it ...),
> - "created" and "last_updated" columns also,
> - ...
> 
> So maybe your strategy to keep pgss unchanged, and add an extension view, is
> better.
> There is a risk of duplicated information (like execution_time, calls, ...)

It might be possible to completely separate these two views into two
extensions. (pg_stat_statements_plans could use its own hooks to
monitor query execution)
But that would probably result in a lot of duplicated code (for the
querytexts file) and would be a lot more elaborate to create, as I'd
need to recreate most of the hashmaps and locks.

Piggybacking onto the existing logic of pg_stat_statements to store a
plan every once in a while seems simpler to me.


Regards
Julian



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Simon Riggs
On 17 April 2018 at 20:09, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Andres was working on a radix tree structure to fix this problem, but
>> that seems to be abandoned now, and it seems a major undertaking.  While
>> I agree that the proposed solution is a wart, it seems much better than
>> no solution at all.  Can we consider Fujii's proposal as a temporary
>> measure until we fix shared buffers?  I'm +1 on it myself.
>
> Once we've introduced a user-visible reloption it's going to be
> practically impossible to get rid of it, so I'm -1.  I'd much rather
> see somebody put some effort into the radix-tree idea than introduce
> a kluge that we'll be stuck with, and that doesn't even provide a
> good user experience.  Disabling vacuum truncation is *not* something
> that I think we should recommend.

The truncation at the end of VACUUM takes an AccessExclusiveLock,
which is already user visible. Using a radix tree won't alter that.

ISTM the user might be interested in having the *lock* NOT happen, so
I am +1 for the suggestion regardless of whether radix tree ever
happens.

The lock itself can be cancelled, so the user would also be interested
in explicitly requesting a retry with a separate command/function.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Deadlock in multiple CIC.

2018-04-18 Thread Tom Lane
I wrote:
>>> (A couple of the other isolation tests do fail reliably under this
>>> scenario; is it worth hardening them?)

>> Yes, I think it's worth making them pass somehow -- see commits
>> f18795e7b74c, a0eae1a2eeb6.

> Will look into that too.  I'm not sure that adding extra expected
> outputs is sane, though --- might be best to just force the intended
> isolation level within those tests.

Hmm, so one of the ones that fails is lock-update-delete, which I see
*already has* an alternate output file for serializable mode ... but
it doesn't match what I get:

*** /home/postgres/pgsql/src/test/isolation/expected/lock-update-delete_1.out   
Mon Feb 12 14:53:46 2018
--- 
/home/postgres/pgsql/src/test/isolation/output_iso/results/lock-update-delete.out
   Wed Apr 18 11:30:23 2018
***
*** 150,156 
  
  t  
  step s1l: <... completed>
! error in steps s2_unlock s1l: ERROR:  could not serialize access due to 
concurrent update
  
  starting permutation: s2b s1l s2u s2_blocker1 s2r s2_unlock
  pg_advisory_lock
--- 150,158 
  
  t  
  step s1l: <... completed>
! keyvalue  
! 
! 1  1  
  
  starting permutation: s2b s1l s2u s2_blocker1 s2r s2_unlock
  pg_advisory_lock

It looks like maybe this one wasn't updated in 533e9c6b0 --- would
you check/confirm that?

regards, tom lane



Re: Deadlock in multiple CIC.

2018-04-18 Thread Alvaro Herrera
Tom Lane wrote:

> Anyway, at this point I'm going to give up on the debug logging, revert
> 9.4 to its prior state, and then see if the transaction-restart patch
> makes the problem go away.

Agreed, thanks.

> >> (A couple of the other isolation tests do fail reliably under this
> >> scenario; is it worth hardening them?)
> 
> > Yes, I think it's worth making them pass somehow -- see commits
> > f18795e7b74c, a0eae1a2eeb6.
> 
> Will look into that too.  I'm not sure that adding extra expected
> outputs is sane, though --- might be best to just force the intended
> isolation level within those tests.

As I recall (not much, admittedly) that was one of the options we
considered in the old commit, but since the other isolation levels
behaved differently we agreed that it was worth adding coverage for them.
I don't know which ones are failing now; maybe forcing a specific
isolation level is sufficient.

Clearly we should have done something to make sure these tests were run
periodically with different isolation levels.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Query is over 2x slower with jit=on

2018-04-18 Thread Andreas Joseph Krogh
Hi all.
 
I don't know whether this is expected or not but I'm experiencing over 2x 
slowdown on a large query in PG-11 with JIT=on.
 
(query is a prepared statement executed with "explain analyze execute 
myprepared(arg1, arg2, ..., argn)")
 
After 10 executions these are the results (the first 5 executed in > 16s, then 
the plan changed)
 
With jit=on:
https://explain.depesz.com/s/vYB
Planning Time: 0.336 ms 
 JIT:
  Functions: 716
  Generation Time: 78.404 ms
  Inlining: false
  Inlining Time: 0.000 ms
  Optimization: false
  Optimization Time: 43.916 ms
  Emission Time: 600.031 ms
 Execution Time: 2035.150 ms
 (385 rows)
  
With jit=off:
https://explain.depesz.com/s/X6mA
Planning Time: 0.371 ms 
 Execution Time: 833.941 ms
 (377 rows)
  
Both are master as of 55d26ff638f063fbccf57843f2c27f9795895a5c
 
The query largely consists of CTEs with aggregation which are FULL OUTER 
JOIN'ed.
 
On v10 the query executes in:
Execution time: 1159.628 ms
  
So v11 (with jit=off) is about 25% faster (due to parallel hash-join I think), 
which is nice!
 
What's the deal with jit making it slower?
 
--
Andreas Joseph Krogh



Re: Deadlock in multiple CIC.

2018-04-18 Thread Tom Lane
I wrote:
>> It's still not entirely clear what's happening on okapi, ...

okapi has now passed two consecutive runs with elog(LOG) messages in place
between DefineIndex's snapmgr calls.  Considering that it had failed 37 of
44 test runs since 47a3a13 went in, I think two successive passes is
sufficient evidence to conclude that we have a Heisenbug in which the
presence of debug tooling affects the result.  And that in turn suggests
strongly that it's a compiler bug.  Broken interprocedural optimization,
perhaps?  Although it'd have to be cross-file optimization, which is
more than I thought icc would do.

Anyway, at this point I'm going to give up on the debug logging, revert
9.4 to its prior state, and then see if the transaction-restart patch
makes the problem go away.

>> (A couple of the other isolation tests do fail reliably under this
>> scenario; is it worth hardening them?)

> Yes, I think it's worth making them pass somehow -- see commits
> f18795e7b74c, a0eae1a2eeb6.

Will look into that too.  I'm not sure that adding extra expected
outputs is sane, though --- might be best to just force the intended
isolation level within those tests.

regards, tom lane



Re: pgindent run soon?

2018-04-18 Thread Robert Haas
On Tue, Apr 17, 2018 at 12:57 PM, Tom Lane  wrote:
> Now that feature freeze is past, I wonder if it's time to run pgindent.

+1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-18 Thread Robert Haas
On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
 wrote:
> I did go and start working on a patch to test how possible this would
> be and came up with the attached. I've left a stray
> MemoryContextStatsDetail call in there which does indicate that
> something is not being freed. I'm just not sure what it is yet.
>
> The patch does happen to improve performance slightly, but that is
> most likely due to the caching of the ExprStates rather than the
> change of memory management. It's not really possible to do that with
> the reset unless we stored the executor's memory context in
> PartitionPruneContext and did a context switch back inside
> partkey_datum_from_expr before calling ExecInitExpr.

10% is more than a "slight" improvement, I'd say!  It's certainly got
to be worth avoiding the repeated calls to ExecInitExpr, whatever we
do about the memory contexts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pruning disabled for array, enum, record, range type partition keys

2018-04-18 Thread Amit Langote
On Thu, Apr 19, 2018 at 12:01 AM, Alvaro Herrera
 wrote:
> Amit Langote wrote:
>
>> On 2018/04/18 7:11, Alvaro Herrera wrote:
>>
>> @@ -1546,8 +1546,8 @@ match_clause_to_partition_key(RelOptInfo *rel,
>>case PARTITION_STRATEGY_HASH:
>>   cmpfn = get_opfamily_proc(part_scheme->partopfamily[partkeyidx],
>> -   op_righttype, op_righttype,
>> -   HASHEXTENDED_PROC);
>> +   part_scheme->partopcintype[partkeyidx],
>> +   op_righttype, HASHEXTENDED_PROC);
>>
>> This change is not quite right, because it disables pruning.  The above
>> returns InvalidOid as there are no hash AM procedures (in pg_amproc) whose
>> lefttype and righttype don't match.
>
> Makes sense.  Still, I was expecting that pruning of hash partitioning
> would also work for pseudotypes, yet it doesn't.

It does?

+-- array type hash partition key
+create table pph_arrpart (a int[]) partition by hash (a);
+create table pph_arrpart1 partition of pph_arrpart for values with
(modulus 2, remainder 0);
+create table pph_arrpart2 partition of pph_arrpart for values with
(modulus 2, remainder 1);
+insert into pph_arrpart values ('{1}'), ('{1, 2}'), ('{4, 5}');
+select tableoid::regclass, * from pph_arrpart order by 1;
+   tableoid   |   a
+--+---
+ pph_arrpart1 | {1,2}
+ pph_arrpart1 | {4,5}
+ pph_arrpart2 | {1}
+(3 rows)
+
+explain (costs off) select * from pph_arrpart where a = '{1}';
+   QUERY PLAN
+
+ Append
+   ->  Seq Scan on pph_arrpart2
+ Filter: (a = '{1}'::integer[])
+(3 rows)
+
+explain (costs off) select * from pph_arrpart where a = '{1, 2}';
+QUERY PLAN
+--
+ Append
+   ->  Seq Scan on pph_arrpart1
+ Filter: (a = '{1,2}'::integer[])
+(3 rows)

Thanks,
Amit



Re: pruning disabled for array, enum, record, range type partition keys

2018-04-18 Thread Alvaro Herrera
Amit Langote wrote:

> On 2018/04/18 7:11, Alvaro Herrera wrote:
> 
> @@ -1546,8 +1546,8 @@ match_clause_to_partition_key(RelOptInfo *rel,
>case PARTITION_STRATEGY_HASH:
>   cmpfn = get_opfamily_proc(part_scheme->partopfamily[partkeyidx],
> -   op_righttype, op_righttype,
> -   HASHEXTENDED_PROC);
> +   part_scheme->partopcintype[partkeyidx],
> +   op_righttype, HASHEXTENDED_PROC);
> 
> This change is not quite right, because it disables pruning.  The above
> returns InvalidOid as there are no hash AM procedures (in pg_amproc) whose
> lefttype and righttype don't match.

Makes sense.  Still, I was expecting that pruning of hash partitioning
would also work for pseudotypes, yet it doesn't.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Setting rpath on llvmjit.so?

2018-04-18 Thread Robert Haas
On Tue, Apr 17, 2018 at 4:13 PM, Andres Freund  wrote:
> I'd not advocate for this solely based on the age of autoconf. But the
> separate windows buildsystem which makes it very hard to build
> extensions separately is a good reason on its own. As is the fact that
> recursive make as we're using it has significant issues. Both of those
> would be avoided by using cmake or such.

How much of the pain building extensions on Windows comes from our
build system and how much of it comes from the fact that it's Windows?
 My guess would have been that the latter is by far the bigger issue,
but maybe I'm wrong.

There is really nothing keeping us from removing or reducing the use
of recursive make without switching to cmake.  And it's probably not
even that much work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problem while setting the fpw with SIGHUP

2018-04-18 Thread Robert Haas
On Wed, Apr 18, 2018 at 10:37 AM, Amit Kapila  wrote:
> On Fri, Apr 13, 2018 at 10:36 PM, Robert Haas  wrote:
>> On Thu, Apr 12, 2018 at 9:29 PM, Michael Paquier  wrote:
>>> Still does it matter when the change is effective?
>>
>> I don't really care deeply about when the change takes effect, but I
>> do care about whether the time when the system *says* the change took
>> effect is the same as when it *actually* took effect.  If those aren't
>> the same, it's confusing.
>>
>
> So, what in your opinion is the way to deal with this?  If we make it
> a PGC_POSTMASTER parameter, it will have a very clear behavior and
> users don't need to bother whether they have a risk of torn page
> problem or not and as a side-impact the code will be simplified as
> well.  However, as Michael said the people who get the benefit of this
> option by disabling/enabling this parameter might complain.  Keeping
> it as a SIGHUP option has the drawback that even after the user has
> enabled it, there is a risk of torn pages.

I would just document the risks.  If the documentation says that you
can't rely on the value until after the next checkpoint, or whatever
the rule is, then I think we're fine.  I don't think that we really
have the infrastructure to do any better; if we try, we'll just end up
with odd warts.  Documenting the current set of warts is less churn
and less work.
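
To make the scenario concrete, the SIGHUP-driven change under discussion
looks roughly like this (an illustration only, not taken from the thread):

    ALTER SYSTEM SET full_page_writes = off;
    SELECT pg_reload_conf();   -- signals the server to re-read the config (SIGHUP)
    SHOW full_page_writes;     -- can already report 'off' here, even though the
                               -- checkpointer may not have actually switched the
                               -- WAL behavior yet

The gap between what SHOW reports and when the setting really takes effect
is exactly what would need to be spelled out in the documentation.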

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problem while setting the fpw with SIGHUP

2018-04-18 Thread Amit Kapila
On Fri, Apr 13, 2018 at 10:36 PM, Robert Haas  wrote:
> On Thu, Apr 12, 2018 at 9:29 PM, Michael Paquier  wrote:
>> Still does it matter when the change is effective?
>
> I don't really care deeply about when the change takes effect, but I
> do care about whether the time when the system *says* the change took
> effect is the same as when it *actually* took effect.  If those aren't
> the same, it's confusing.
>

So, what in your opinion is the way to deal with this?  If we make it
a PGC_POSTMASTER parameter, it will have a very clear behavior and
users don't need to bother whether they have a risk of torn page
problem or not and as a side-impact the code will be simplified as
well.  However, as Michael said the people who get the benefit of this
option by disabling/enabling this parameter might complain.  Keeping
it as a SIGHUP option has the drawback that even after the user has
enabled it, there is a risk of torn pages.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Setting rpath on llvmjit.so?

2018-04-18 Thread Peter Eisentraut
On 4/17/18 16:14, Andres Freund wrote:
> I still think cmake is the least unreasonable path going forward.

I would rather try to make Meson work and if needed add features back
into Meson.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Documentation for bootstrap data conversion

2018-04-18 Thread Tom Lane
John Naylor  writes:
> On 4/9/18, Tom Lane  wrote:
>> Meh, I think either is fine really.  I do recall changing something
>> in bki.sgml that referred to both "bootstrap relations" and "bootstrap
>> catalogs" in practically the same sentence.  I think that *is* confusing,
>> because it's not obvious whether relation and catalog are meant to be
>> interchangeable terms (and, in general, they aren't).  If we wanted to
>> do anything here I'd be more interested in s/relation/catalog/g in this
>> usage.

> I was curious, so did these searches

> git grep  'bootstrap\S* relation'
> git grep  'system .*relation'

> and dug through a bit to find cases where 'catalog' is clearly a
> better term. Most of these are in the pg_*.h/.dat file boilerplate
> comments, which would be easy enough to change with a script. If we're
> going to do that, the word order of this phrase now seems a bit
> awkward:
> definition of the system "access method" catalog (pg_am)
> so we may as well go the extra step and do:
> definition of the "access method" system catalog (pg_am)

> Thoughts?

Makes sense to me --- I suspect that the original phrasing was chosen
by somebody whose native language wasn't English.

> Beyond that, there are many cases where the language might be
> debatable, or at least it's not obvious to the casual observer whether
> it could also be referring to an index or toast table, so it's
> probably best to leave well enough alone for most cases.

Agreed, there's no need to go overboard here.  But let's fix that
awkward construction in the catalog header files, as well as anyplace
else where it seems like a clear win.  Will you send a patch?

regards, tom lane



Re: remove quoting hacks and simplify bootscanner.l

2018-04-18 Thread Tom Lane
John Naylor  writes:
> For the bootstrap data conversion, it was desirable for postgres.bki
> to remain unchanged, so some ugly quoting hacks were added to
> genbki.pl to match the quoting conventions in the DATA() lines. At
> this point, it's possible (and worthwhile I think) to remove those,
> and along the way simplify the tokenizing rules in bootscanner.l. This
> will result in some largish changes to postgres.bki, but they're easy
> to reason about and have no functional consequence. Make check passes.

Forgot to follow up to this last night, but I pushed this with a couple of
changes:

* I didn't see a reason to remove '-' from the set of "id" characters.
That'd force quoting of data fields that are just "-", which there are
a lot of, so it would bulk up the .bki file for no gain.

* I didn't like assuming that Perl's \w exactly matches the set of
characters in the "id" production, so I changed that to use a
regex character class matching bootscanner.l's.

Also I did a bit of additional work to make single and double quotes
less magic.  It was kind of tempting to rethink how bootscanner.l
parses double-quoted fields, but in the end I just left that as-is
and made the Perl code cope with it.  I think as long as people can
write quotes in the .dat files without thinking too hard, nobody
will care how weird it looks in the .bki file.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Pavan Deolasee
On Tue, Apr 17, 2018 at 11:05 PM, Fujii Masao  wrote:

> Hi,
>
> I'd like to propose to add $SUBJECT for performance improvement.
>
> When VACUUM tries to truncate the trailing empty pages, it scans
> shared_buffers to invalidate the pages-to-truncate while holding an
> AccessExclusive lock on the relation. So if shared_buffers is huge, other
> transactions need to wait for a very long time before accessing the
> relation, which would cause response-time spikes; for example, I observed
> such spikes several times on a server with shared_buffers = 300GB while
> running the benchmark. Therefore, I'm thinking to propose $SUBJECT and
> enable it to avoid such spikes for that relation.
>

Alvaro reminded me that we already have a mechanism in place which forces
VACUUM to give up the exclusive lock if another backend has been waiting on
the lock for more than a certain pre-defined duration. AFAICS we give up the
lock, but then retry truncation from where we previously left off. What if
we make that lock-wait duration configurable on a per-table basis? And maybe
add a special value to never truncate (though that seems quite excessive to
me and a possible footgun).
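
To make the alternatives concrete, hypothetical reloption syntax (the names
are invented here purely for the sake of discussion) might look like:

    -- never truncate this table at the end of VACUUM
    ALTER TABLE big_table SET (vacuum_truncate = off);
    -- give up the AccessExclusiveLock after 100ms of another backend waiting
    ALTER TABLE big_table SET (vacuum_truncate_lock_wait = 100);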

I was actually thinking in the other direction. Between the time VACUUM
figures out it can possibly truncate the last K pages and the time it
actually does so, some backend may insert a tuple into one of those pages
and make the truncation impossible. What if we truncate the FSM before
starting the backward scan, so that new inserts go into the pages prior to
the truncation point, if possible? That will increase the chances of VACUUM
being able to truncate all the empty pages. Though I think in some cases it
might lead to unnecessary further extension of the relation. Maybe we could
use some heuristic based on the free space available in the table prior to
the truncation point?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Built-in connection pooling

2018-04-18 Thread Vladimir Borodin


> On 18 Apr 2018, at 16:24, David Fetter wrote:
> 
> On Wed, Apr 18, 2018 at 02:52:39PM +0300, Konstantin Knizhnik wrote:
>> Yandex team is following this approach with their Odysseus
>> (multithreaded version of pgbouncer with many of pgbouncer issues
>> fixed).
> 
> Have they opened the source to Odysseus?  If not, do they have plans to?

No, we haven't yet. Yep, we plan to do it by the end of May.

> 
> Best,
> David.
> -- 
> David Fetter  http://fetter.org/
> Phone: +1 415 235 3778
> 
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
> 

--
May the force be with you…
https://simply.name



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Amit Kapila
On Wed, Apr 18, 2018 at 7:46 AM, Andres Freund  wrote:
> On 2018-04-18 10:46:51 +0900, Michael Paquier wrote:
>> On Tue, Apr 17, 2018 at 06:13:31PM -0700, Andres Freund wrote:
>> > Not sure what you mean?
>>
>> Do you need help on it?  I suggest that I could undertake the proposed
>> patch and submit it earlier in the development cycle of v12.
>
> I think it's at the very least two months of serious development work to
> get it into a state ready for submission. And a good chunk of that not
> even sketched out.  Replacing the hashtable is the easy part, the memory
> management (Complicated due to lock-freeness. I'm thinking of using a
> variant of epoch based reclamation) isn't really there, the management
> of shared "open relations" state are the hard parts...
>
> So yes, I could use help on it, but it'll be a lot of actual design and
> investigatory work.
>

I think it makes sense to pursue that approach, but it might be worth
considering some alternative till we have it.  I remember that last time
(in 2015) we discussed another solution [1] to this problem
(or a similar one) and we left it unattended in the hope that we would
get a better solution, but we are still in the same situation.  I
think in general it is better to go with the approach which can fix
the root cause of the problem, but if that is going to take a long
time, it is not terrible to provide some workable solution which can
help users.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1JPLGjpMeJ5YLNE7bpNBhP2EQe_rDR%2BAw3atNfj9WkAGg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Postgres stucks in deadlock detection

2018-04-18 Thread Konstantin Knizhnik



On 16.04.2018 14:11, Konstantin Knizhnik wrote:



On 14.04.2018 10:09, Юрий Соколов wrote:
On Fri, 13 Apr 2018 at 21:10, Andres Freund wrote:


Hi,

On 2018-04-13 19:13:07 +0300, Konstantin Knizhnik wrote:
> On 13.04.2018 18:41, Andres Freund wrote:
> > On 2018-04-13 16:43:09 +0300, Konstantin Knizhnik wrote:
> > > Updated patch is attached.
> > > + /*
> > > +  * Ensure that only one backend is checking for deadlock.
> > > +  * Otherwise under high load cascade of deadlock timeout expirations can cause stuck of Postgres.
> > > +  */
> > > + if (!pg_atomic_test_set_flag(>activeDeadlockCheck))
> > > + {
> > > +  enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout);
> > > +         return;
> > > + }
> > > + inside_deadlock_check = true;
> > I can't see that ever being accepted.  This means there's absolutely no
> > bound for deadlock checks happening even under light concurrency, even
> > if there's no contention for a large fraction of the time.
>
> It may cause problems only if
> 1. There is large number of active sessions
> 2. They perform deadlock-prone queries (so no attempts to avoid deadlocks
>    at application level)
> 3. Deadlock timeout is set to be very small (10 msec?)

That's just not true.


> Otherwise either probability that all backends once and once again are
> trying to check deadlocks concurrently is very small (and can be even more
> reduced by using random timeout for subsequent deadlock checks), either
> system can not normally function in any case because large number of
> clients fall into deadlock.

Operating systems batch wakeups.


> I completely agree that there are plenty of different approaches, but IMHO
> the currently used strategy is the worst one, because it can stall system
> even if there are not deadlocks at all.


> I always think that deadlock is a programmer's error rather than normal
> situation. May be it is wrong assumption

It is.


> So before implementing some complicated solution of the problem (too slow
> deadlock detection), I think that first it is necessary to understand
> whether there is such problem at all and under which workload it can happen.

Sure. I'm not saying that you shouldn't experiment with a patch like the
one you sent. What I am saying is that that can't be the actual solution
that will be integrated.


What about my version, at
https://www.postgresql.org/message-id/flat/bac42052debbd66e8d5f786d8abe8...@postgrespro.ru ?
It still performs deadlock detection every time, but it tries to detect it
with a shared lock first, and only if there is a probability of a real
deadlock does it recheck with an exclusive lock.


Even a shared lock leads to some stalling for active transactions, but in
the catastrophic situation where many backends check for a nonexistent
deadlock at the same time, it greatly reduces the pause.


Unfortunately your patch doesn't help with inserts.
In most cases Postgres obtains exclusive partition locks: any lock/unlock
operation requires an exclusive lock. Shared locks on all partitions are
used for collecting some information or to perform recovery.


Below are the results for three different versions:

1. Vanilla Postgres:

[kk@hp06 knizhnik]$ pgbench -n -c 1000 -j 36 -T 100 -P 1 -f test.sql postgres

progress: 1.0 s, 227.9 tps, lat 109.560 ms stddev 189.328
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 4.0 s, 53.0 tps, lat 1145.417 ms stddev 1607.484
progress: 5.0 s, 19.0 tps, lat 236.807 ms stddev 819.307
progress: 6.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 7.0 s, 2.0 tps, lat 6649.944 ms stddev 9.691
progress: 8.0 s, 60.0 tps, lat 2343.313 ms stddev 3335.967
progress: 9.0 s, 89.0 tps, lat 1813.495 ms stddev 3337.904
progress: 10.0 s, 99.1 tps, lat 2093.653 ms stddev 3758.468
progress: 11.0 s, 94.9 tps, lat 2424.560 ms stddev 4258.622
progress: 12.0 s, 79.0 tps, lat 2037.880 ms stddev 4152.258
progress: 13.0 s, 30.0 tps, lat 80.697 ms stddev 24.854
progress: 14.0 s, 2.0 tps, lat 71.642 ms stddev 0.022
progress: 15.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 16.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 17.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 18.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 19.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 20.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 21.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 22.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 23.0 s, 3.0 tps, lat 22396.463 ms stddev 1.964
progress: 24.0 s, 1.0 tps, lat 23667.927 ms stddev 0.001
progress: 25.0 s, 25.0 tps, lat 8862.603 ms stddev 11545.517
progress: 26.0 s, 27.0 tps, lat 6738.338 ms stddev 10984.215
progress: 27.0 s, 40.0 tps, 

Re: Built-in connection pooling

2018-04-18 Thread Konstantin Knizhnik



On 18.04.2018 16:41, Heikki Linnakangas wrote:

On 18/04/18 07:52, Konstantin Knizhnik wrote:



On 18.04.2018 13:36, Heikki Linnakangas wrote:

On 18/04/18 06:10, Konstantin Knizhnik wrote:

But there are still use cases which cannot be covered by an external
connection pooler.


Can you name some? I understand that the existing external connection
poolers all have their limitations. But are there some fundamental
issues that can *only* be addressed by a built-in implementation?


Well, maybe I missed something, but I do not know how to efficiently support
1. Temporary tables
2. Prepared statements
3. Session GUCs
with any external connection pooler (with a pooling level other than
session).


Me neither. What makes it easier to do these things in an internal 
connection pooler? What could the backend do differently, to make 
these easier to implement in an external pooler?


All these things are addressed now in my builtin connection pool
implementation:
1. Temporary tables are maintained by creating a private temporary
namespace for each session.
2. Prepared statements are supported by adding a unique session prefix to
each prepared statement name (so there is a single prepared-statement cache
in the backend, but each session has its own prepared statements).
3. Each session maintains a list of updated GUCs, and they are
saved/restored on reschedule.


It was not so difficult to implement all this stuff (the main trouble I had
was with GUCs), but it looks like none of these are possible for an external
connection pooler.
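
For illustration, the per-session state in question is the ordinary kind (a
generic SQL sketch, not taken from the patch):

    -- each of these creates state that lives inside one particular backend:
    CREATE TEMP TABLE scratch (id int);
    PREPARE get_one (int) AS SELECT $1;
    SET work_mem = '64MB';
    -- with transaction-level pooling the next transaction may be handed to a
    -- different backend, where the temp table and prepared statement do not
    -- exist and work_mem is back to its previous value; the built-in pool
    -- described above keeps or remaps this state when it reschedules sessions.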



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Built-in connection pooling

2018-04-18 Thread Heikki Linnakangas

On 18/04/18 07:52, Konstantin Knizhnik wrote:



On 18.04.2018 13:36, Heikki Linnakangas wrote:

On 18/04/18 06:10, Konstantin Knizhnik wrote:

But there are still use cases which cannot be covered by an external
connection pooler.


Can you name some? I understand that the existing external connection
poolers all have their limitations. But are there some fundamental
issues that can *only* be addressed by a built-in implementation?


Well, maybe I missed something, but I do not know how to efficiently support
1. Temporary tables
2. Prepared statements
3. Session GUCs
with any external connection pooler (with a pooling level other than session).


Me neither. What makes it easier to do these things in an internal 
connection pooler? What could the backend do differently, to make these 
easier to implement in an external pooler?


- Heikki



Re: ON CONFLICT DO UPDATE for partitioned tables

2018-04-18 Thread Alvaro Herrera
Amit Langote wrote:
> On 2018/04/18 0:04, Alvaro Herrera wrote:
> > Amit Langote wrote:
> > 
> >> I just confirmed my hunch that this wouldn't somehow do the right thing
> >> when the OID system column is involved.  Like this case:
> > 
> > This looks too big a patch to pursue now.  I'm inclined to just remove
> > the equalTupdesc changes.
> 
> OK.  Here is the patch that removes equalTupdesc optimization.

Hmm.  If we modify (during pg12, of course -- not now) partition tables
that are created identical to their parent table so that they share the
pg_type row, this would become useful.  Unless there's a reason why that
change is completely unworkable, I'd just leave it there.  (I claim that
it works like that only because it used to work like that, not because
it's impossible to make it work the other way.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: VM map freeze corruption

2018-04-18 Thread Alvaro Herrera
Pavan Deolasee wrote:
> On Wed, Apr 18, 2018 at 7:37 AM, Wood, Dan  wrote:

> > My analysis is that heap_prepare_freeze_tuple->FreezeMultiXactId()
> > returns FRM_NOOP if the MultiXACT locked rows haven't committed.  This
> > results in changed=false and totally_frozen=true (as initialized).  When
> > this returns to lazy_scan_heap(), no rows are added to the frozen[] array.
> > Yet, tuple_totally_frozen is true.  This means the page is marked frozen in
> > the VM, even though the MultiXACT row was left untouched.
> >
> > A fix to heap_prepare_freeze_tuple() that seems to do the trick is:
> > else
> > {
> > Assert(flags & FRM_NOOP);
> > +  totally_frozen = false;
> > }
> >
> 
> That's a great find!

Indeed.

This family of bugs (introduced by freeze map changes in 9.6) was
initially fixed in 38e9f90a227d but this spot was missed in that fix.

IMO the cause is the totally_frozen variable, which starts life in a
bogus state (true) and the different code paths can set it to the right
state, or by inaction end up deciding that the initial bogus state was
correct in the first place.  Errors of omission are far too easy in that
kind of model, ISTM, so I propose this slightly different patch, which
albeit yet untested seems easier to verify and easier to get right.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4fdb549099..791958a543 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6800,9 +6800,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
  xl_heap_freeze_tuple *frz, 
bool *totally_frozen_p)
 {
 	bool		changed = false;
-	bool		freeze_xmax = false;
+	bool		xmin_frozen = false;	/* is xmin frozen after this? */
+	bool		freeze_xmax = false;	/* whether to freeze xmax now */
+	bool		xmax_frozen = false;	/* is xmax frozen after this? */
 	TransactionId xid;
-	bool		totally_frozen = true;
 
frz->frzflags = 0;
frz->t_infomask2 = tuple->t_infomask2;
@@ -6829,10 +6830,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
+   xmin_frozen = true;
}
-   else
-   totally_frozen = false;
}
+   else if (HeapTupleHeaderXminFrozen(tuple))
+   xmin_frozen = true;
 
/*
 * Process xmax.  To thoroughly examine the current Xmax value we need to
@@ -6870,7 +6872,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
-   totally_frozen = false;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6892,7 +6893,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
 
changed = true;
-   totally_frozen = false;
}
else
{
@@ -6923,9 +6923,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,

 xid)));
freeze_xmax = true;
}
-   else
-   totally_frozen = false;
}
+	else if (((tuple->t_infomask & HEAP_XMAX_INVALID) == HEAP_XMAX_INVALID) ||
+			 !TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
+   xmax_frozen = true;
 
if (freeze_xmax)
{
@@ -6941,6 +6942,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
+   xmax_frozen = true;
}
 
/*
@@ -6981,7 +6983,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
}
}
 
-   *totally_frozen_p = totally_frozen;
+   *totally_frozen_p = xmin_frozen && xmax_frozen;
return changed;
 }
 


Re: Built-in connection pooling

2018-04-18 Thread Konstantin Knizhnik



On 18.04.2018 16:24, David Fetter wrote:

On Wed, Apr 18, 2018 at 02:52:39PM +0300, Konstantin Knizhnik wrote:

Yandex team is following this approach with their Odysseus
(multithreaded version of pgbouncer with many of pgbouncer issues
fixed).

Have they opened the source to Odysseus?  If not, do they have plans to?


It is better to ask Vladimir Borodin (Yandex) about it.
But as far as I know, the answer is yes.
The Yandex policy is to make their products available to the community.
I just wonder why the community was not interested in knowing the details
of this project...


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Built-in connection pooling

2018-04-18 Thread Konstantin Knizhnik



On 18.04.2018 16:09, Craig Ringer wrote:

On 18 April 2018 at 19:52, Konstantin Knizhnik
 wrote:


As far as I know, most DBMSes have some kind of internal connection
pooling.
In Oracle, for example, you can create dedicated and non-dedicated backends.
I wonder why we do not want to have something similar in Postgres.

I want to, and I know many others do too.

But the entire PostgreSQL architecture makes it hard to do well, and
means it requires heavy changes to do it in a way that will be
maintainable and reliable.

Making it work, and making something maintainable and mergeable, are
two different things. Something I continue to struggle with myself.


Here I completely agree with you.
Now my prototype "works": it is able to correctly handle errors,
transaction rollbacks, long-lived transactions, ... but I am quite sure
that there are a lot of untested cases where it will work incorrectly.
But still I do not think that making built-in connection pooling really
reliable is out of reach.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Built-in connection pooling

2018-04-18 Thread David Fetter
On Wed, Apr 18, 2018 at 02:52:39PM +0300, Konstantin Knizhnik wrote:
> Yandex team is following this approach with their Odysseus
> (multithreaded version of pgbouncer with many of pgbouncer issues
> fixed).

Have they opened the source to Odysseus?  If not, do they have plans to?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Built-in connection pooling

2018-04-18 Thread Craig Ringer
On 18 April 2018 at 19:52, Konstantin Knizhnik
 wrote:

> As far as I know, most DBMSes have some kind of internal connection
> pooling.
> In Oracle, for example, you can create dedicated and non-dedicated backends.
> I wonder why we do not want to have something similar in Postgres.

I want to, and I know many others do too.

But the entire PostgreSQL architecture makes it hard to do well, and
means it requires heavy changes to do it in a way that will be
maintainable and reliable.

Making it work, and making something maintainable and mergeable, are
two different things. Something I continue to struggle with myself.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Oddity in tuple routing for foreign partitions

2018-04-18 Thread Etsuro Fujita

(2018/04/18 14:44), Amit Langote wrote:

On 2018/04/17 16:41, Etsuro Fujita wrote:

In the INSERT/COPY-tuple-routing case, as explained by Amit, the
RTE at that position in the EState's range table is the one for the
partitioned table of a given partition, so the statement would be true.
BUT in the UPDATE-tuple-routing case, the RTE is the one for the given
partition, not for the partitioned table, so the statement would not be
true.  In the latter case, we don't need to create a child RTE and replace
the original RTE with it, but I handled both cases the same way for
simplicity.


Oh, I didn't really consider this part carefully.  That the resultRelInfo
received by BeginForeignInsert (when called by ExecInitRoutingInfo) could
be a reused UPDATE result relation.  It might be possible to justify the
parent_rte/child_rte terminology by explaining it a bit better.


Yeah, I think that terminology would be confusing, so I changed it to 
old_rte/new_rte.



I see
three cases that arise during tuple routing:

1. This is INSERT and so the resultRelation that's received in
BeginForeignInsert has been freshly created in ExecInitPartitionInfo
and it bears node->nominalRelation or 1 as its ri_RangeTableIndex


Right.


2. This is UPDATE and the resultRelInfo that's received in
BeginForeignInsert has been freshly created in ExecInitPartitionInfo
and it bears node->nominalRelation or 1 as its ri_RangeTableIndex


In that case, we have a valid plan node, so it would only bear 
node->nominalRelation.



3. This is UPDATE and the resultRelInfo that's received in
BeginForeignInsert is a reused one, in which case, it bears the planner
assigned ri_RangeTableIndex


Right.


In all three cases, I think we can rely on using ri_RangeTableIndex to
fetch a valid "parent" RTE from es_range_table.


I slept on this, and I noticed there is another bug in case #2.  Here is an
example with the previous patch:


postgres=# create table utrtest (a int, b text) partition by list (a);
postgres=# create table loct (a int check (a in (1)), b text);
postgres=# create foreign table remp (a int check (a in (1)), b text) 
server loopback options (table_name 'loct');

postgres=# create table locp (a int check (a in (2)), b text);
postgres=# alter table utrtest attach partition remp for values in (1);
postgres=# alter table utrtest attach partition locp for values in (2);
postgres=# create trigger loct_br_insert_trigger before insert on loct 
for each row execute procedure br_insert_trigfunc();


where the trigger function is the one defined in an earlier email.

postgres=# insert into utrtest values (2, 'qux');
INSERT 0 1
postgres=# update utrtest set a = 1 where a = 2 returning *;
 a |  b
---+-
 1 | qux
(1 row)

UPDATE 1

Wrong result!  The result should be concatenated with ' triggered !'.

In case #2, since we initialize expressions for the partition's 
ResultRelInfo including RETURNING by translating the attnos of the 
corresponding expressions in the ResultRelInfo for the first subplan 
target rel, I think we should replace the RTE for the first subplan 
target rel, not the RTE for the nominal rel, with the new one created 
for the foreign table.  Attached is an updated version for fixing this 
issue.



Do you think we need to clarify this in a comment?


Yeah, but I updated the comments about this in a slightly different way in
the attached.  Does that make sense?


Thanks for the comments!

Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 7454,7459  select tableoid::regclass, * FROM itrtest;
--- 7454,7501 
   remp1| 1 | foo
  (1 row)
  
+ delete from itrtest;
+ drop index loct1_idx;
+ -- Test that remote triggers work with insert tuple routing
+ create function br_insert_trigfunc() returns trigger as $$
+ begin
+ 	new.b := new.b || ' triggered !';
+ 	return new;
+ end
+ $$ language plpgsql;
+ create trigger loct1_br_insert_trigger before insert on loct1
+ 	for each row execute procedure br_insert_trigfunc();
+ create trigger loct2_br_insert_trigger before insert on loct2
+ 	for each row execute procedure br_insert_trigfunc();
+ -- The new values are concatenated with ' triggered !'
+ insert into itrtest values (1, 'foo') returning *;
+  a |b
+ ---+-
+  1 | foo triggered !
+ (1 row)
+ 
+ insert into itrtest values (2, 'qux') returning *;
+  a |b
+ ---+-
+  2 | qux triggered !
+ (1 row)
+ 
+ insert into itrtest values (1, 'test1'), (2, 'test2') returning *;
+  a | b 
+ ---+---
+  1 | test1 triggered !
+  2 | test2 triggered !
+ (2 rows)
+ 
+ with result as (insert into itrtest values (1, 'test1'), (2, 'test2') returning *) select * from result;
+  a | b 
+ ---+---
+  1 | test1 triggered !
+  2 | test2 triggered !
+ (2 rows)
+ 
+ drop trigger loct1_br_insert_trigger on loct1;
+ 

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-18 Thread Bruce Momjian
On Tue, Apr 17, 2018 at 02:41:42PM -0700, Andres Freund wrote:
> On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:
> > On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> > > That doesn't seem like a very practical way. It's better than nothing,
> > > of course, but I wonder how would that work with containers (where I
> > > think you may not have access to the kernel log at all). Also, I'm
> > > pretty sure the messages do change based on kernel version (and possibly
> > > filesystem) so parsing it reliably seems rather difficult. And we
> > > probably don't want to PANIC after I/O error on an unrelated device, so
> > > we'd need to understand which devices are related to PostgreSQL.
> 
> You can certainly have access to the kernel log in containers. I'd
> assume such a script wouldn't check various system logs but instead tail
> /dev/kmsg or such. Otherwise the variance between installations would be
> too big.

I was thinking 'dmesg', but the result is similar.

> There's not *that* many different type of error messages and they don't
> change that often. If we'd just detect error for the most common FSs
> we'd probably be good. Detecting a few general storage layer message
> wouldn't be that hard either, most things have been unified over the
> last ~8-10 years.

It is hard to know exactly what the message format should be for each
operating system because it is hard to generate them on demand, and we
would need to filter based on Postgres devices.

The other issue is that once you see a message during a checkpoint and
exit, you don't want to see that message again after the problem has
been fixed and the server restarted.  The simplest solution is to save
the output of the last check and look for only new entries.  I am
attaching a script I run every 15 minutes from cron that emails me any
unexpected kernel messages.

I am thinking we would need a contrib module with sample scripts for
various operating systems.

> > Replying to your specific case, I am not sure how we would use a script
> > to check for I/O errors/space-exhaustion if the postgres user doesn't
> > have access to it.
> 
> Not sure what you mean?
> 
> Space exhaustiion can be checked when allocating space, FWIW. We'd just
> need to use posix_fallocate et al.

I was asking about cases where permissions prevent viewing of kernel
messages.  I think you can view them in containers, but in virtual
machines you might not have access to the host operating system's kernel
messages, and that might be where they are.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
:

[ "$#" -gt 1 -o \( "$#" -eq 1 -a "$1" != "-i" \) ] && echo "Usage:  $(basename 
$0) [-i]" 1>&2 && exit 1

# change this to a safe directory
TMP="$HOME"

dmesg -T |
egrep -v '\<(mounted filesystem with ordered data mode|MSI/MSI-X|lowering kernel.perf_event_max_sample_rate|promiscuous mode|bad checksum|fragment too large|short packet)\>' > $TMP/0

if [ "$1" != "-i" -a -e /u/dmesg.log ]
then	diff /u/dmesg.log $TMP/0 | sed -n 's/^> //p' > $TMP/1
[ -s $TMP/1 ] && cat $TMP/1 | mail -s 'changed dmesg output' root
fi

[ -s $TMP/1 -o "$1" = "-i" ] && cp $TMP/0 /u/dmesg.log

