Re: [HACKERS] Redesigning checkpoint_segments

2015-02-23 Thread Andres Freund
On 2015-02-22 21:24:56 -0500, Robert Haas wrote:
 On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek p...@2ndquadrant.com wrote:
  I am wondering a bit about interaction with wal_keep_segments.
  One thing is that wal_keep_segments is still specified as a number of
  segments and not in size units; maybe it would be worth changing that too?
  And the other thing is that, if set, wal_keep_segments is the real
  max_wal_size from the user's perspective (not from the perspective of the
  algorithm in this patch, but the user does not really care about that),
  which is somewhat weird given the naming.
 
 It seems like wal_keep_segments is more closely related to
 wal_*min*_size.  The idea of both settings is that each is a minimum
 amount of WAL we want to keep around for some purpose.  But they're
 not quite the same, I guess, because wal_min_size just forces us to
 keep that many files around - they can be overwritten whenever.
 wal_keep_segments is an amount of actual WAL data we want to keep
 around.
 
 Would it make sense to require that wal_keep_segments = wal_min_size?

I don't think so. Right now checkpoint_segments is a useful tool to
relatively effectively control the amount of WAL that needs to be
replayed in the event of a crash. wal_keep_segments in contrast doesn't
have much to do with the normal working of the system, except that it
delays recycling of WAL segments a bit.

With a condition like the above, how would you set things up so that you
have 50k segments around for replication (say a good day's worth), but will
never have to replay more than ~800 segments (i.e. something like
checkpoint_segments = 800)?
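
[Editorial aside, not part of the original mail: the kind of setup described
above can be expressed with the pre-patch GUCs roughly like this, the numbers
being purely hypothetical:

    ALTER SYSTEM SET wal_keep_segments = 50000;  -- ~a day's worth kept for standbys
    ALTER SYSTEM SET checkpoint_segments = 800;  -- bounds WAL replayed after a crash
    SELECT pg_reload_conf();

Both GUCs can be changed with a reload.]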

Am I missing something?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Abbreviated keys for Numeric

2015-02-23 Thread Andrew Gierth
 Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes:

 Tomas Interesting, but I think Gavin was asking about how much
 Tomas variation was there for each tested case (e.g. query executed on
 Tomas the same code / dataset). And in those cases the padding /
 Tomas alignment won't change, and thus the effects you describe won't
 Tomas influence the results, no?

My point is exactly the fact that since the result is not affected, this
variation between runs of the same code is of no real relevance to the
question of whether a given change in performance can properly be
attributed to a patch.

Put it this way: suppose I have a test that when run repeatedly with no
code changes takes 6.10s (s=0.025s), and I apply a patch that changes
that to 6.26s (s=0.025s). Did the patch have an impact on performance?

Now suppose that instead of applying the patch I insert random amounts
of padding in an unused function and find that my same test now takes a
mean of 6.20s (s=0.058s) when I take the best timing for each padding
size and calculate stats across sizes. Now it looks obvious that the
actual code of the patch probably wasn't responsible for any change...

The numbers used here aren't theoretical; they are obtained by testing a
single query - select * from d_flt order by v offset 1000 where
d_flt contains 5 million float8 values - over 990 times with 33
different random padding sizes (uniform in 0-32767). Here's a scatter
plot, with 3 runs of each padding size so you can see the repeatability:
http://tinyurl.com/op9qg8a
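
[Editorial aside, not part of the original mail: the "best timing per padding
size, stats across sizes" calculation described above could be done with a
query along these lines, assuming the raw timings were loaded into a
hypothetical table timings(padding_bytes int, run int, seconds float8):

    SELECT avg(best) AS mean, stddev_samp(best) AS s
    FROM (SELECT padding_bytes, min(seconds) AS best
          FROM timings
          GROUP BY padding_bytes) AS per_padding;
]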

-- 
Andrew.




Re: [HACKERS] Parallel Seq Scan

2015-02-23 Thread Kohei KaiGai
 Amit and I had a long discussion about this on Friday while in Boston
 together.  I previously argued that the master and the slave should be
 executing the same node, ParallelSeqScan.  However, Amit argued
 persuasively that what the master is doing is really pretty different
 from what the worker is doing, and that they really ought to be
 running two different nodes.  This led us to cast about for a better
 design, and we came up with something that I think will be much
 better.

 The basic idea is to introduce a new node called Funnel.  A Funnel
 node will have a left child but no right child, and its job will be to
 fire up a given number of workers.  Each worker will execute the plan
 which is the left child of the funnel.  The funnel node itself will
 pull tuples from all of those workers, and can also (if there are no
 tuples available from any worker) execute the plan itself.  So a
 parallel sequential scan will look something like this:

 Funnel
   Workers: 4
   -> Partial Heap Scan on xyz

 What this is saying is that each worker is going to scan part of the
 heap for xyz; together, they will scan the whole thing.


What is the best way to determine the number of workers? A fixed number is
one idea. It may also make sense to add a new common field to the Path node
describing what portion of the node's execution can be parallelized, or
whether it cannot run in parallel at all. Alternatively, instead of at plan
time, we may be able to determine the number at execution time, according to
the number of concurrently running workers and the number of CPU cores.

 The neat thing about this way of separating things out is that we can
 eventually write code to push more stuff down into the funnel.  For
 example, consider this:

 Nested Loop
   -> Seq Scan on foo
   -> Index Scan on bar
        Index Cond: bar.x = foo.x

 Now, if a parallel sequential scan is cheaper than a regular
 sequential scan, we can instead do this:

 Nested Loop
   -> Funnel
        -> Partial Heap Scan on foo
   -> Index Scan on bar
        Index Cond: bar.x = foo.x

 The problem with this is that the nested loop/index scan is happening
 entirely in the master.  But we can have logic that fixes that by
 knowing that a nested loop can be pushed through a funnel, yielding
 this:

 Funnel
   -> Nested Loop
        -> Partial Heap Scan on foo
        -> Index Scan on bar
             Index Cond: bar.x = foo.x

 Now that's pretty neat.  One can also imagine doing this with
 aggregates.  Consider:

I guess the planner enhancement would live around add_paths_to_joinrel().
When any of the underlying join paths support multi-node execution, the new
code would add a Funnel node on top of these join paths. Just my thought.

 HashAggregate
   -> Funnel
        -> Partial Heap Scan on foo
             Filter: x = 1

 Here, we can't just push the HashAggregate down through the Funnel, but
 given suitable infrastructure we could convert that to something like this:

 HashAggregateFinish
   -> Funnel
        -> HashAggregatePartial
             -> Partial Heap Scan on foo
                  Filter: x = 1

 That'd be swell.

 You can see that something like this will also work for breaking off
 an entire plan tree and shoving it down into a worker.  The master
 can't participate in the computation in that case, but it's otherwise
 the same idea.

I believe this is the overall vision we've discussed in the thread about
combining aggregate functions, although people have primarily considered
applying this feature to aggregate push-down across joins.

One key piece of infrastructure may be the capability to define a combining
function for an aggregate. It would inform the planner that the given
aggregate supports two-stage execution. In addition to this, we may need a
planner enhancement to inject the partial aggregate node during path
construction.

Probably, we also have to set a flag to inform the later stage (which
constructs the Agg plan) that the underlying scan/join node performs partial
aggregation, so the final aggregation has to expect transition-state data
instead of the usual row-by-row arguments.
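
[Editorial aside, not part of the original mail: in plain SQL terms, the
two-stage split amounts to computing per-worker partial aggregates and then
combining them, conceptually something like the sketch below, where the
GROUP BY is only a stand-in for the per-worker partitioning of the heap and
the names come from the example plans above:

    SELECT sum(partial_sum) AS total        -- "finish" step combines partial results
    FROM (SELECT sum(x) AS partial_sum      -- each worker computes a partial aggregate
          FROM foo
          GROUP BY x % 4                    -- hypothetical stand-in for 4 workers' chunks
         ) AS partials;

A real implementation would combine serialized transition states via the
combining function described above, rather than rely on sum() being
trivially decomposable.]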

Also, I think a HashJoin with a very large outer relation but a much smaller
inner one is a good candidate for distribution across multiple nodes. Even
if a multi-node HashJoin has to read the small inner relation N times,
splitting up the very large outer relation should still be a win.

Thanks,
--
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] SSL renegotiation

2015-02-23 Thread Florian Weimer
On 02/22/2015 02:05 PM, Andres Freund wrote:
 On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote:
 I honestly wonder why postgres uses renegotiation at all. The motivation
 that cryptoanalysis is easier as more data is sent seems quite
 far-fetched.
 
 I don't think so. There's a fair number of algorithms that can/could be
 attacked much more easily with lots of data available, especially if you
 can guess/know/control some of the data.  Additionally, renegotiating
 regularly helps to constrain a possible key leakage to a certain amount
 of time. With backend connections often being alive for weeks at a time,
 that's not a bad thing.

Renegotiation will be removed from future TLS versions because it is
considered unnecessary with modern ciphers:

  https://github.com/tlswg/tls13-spec/issues/38

If ciphers require rekeying, that mechanism will be provided at the TLS
layer in the future.

I think you could remove renegotiation from PostgreSQL as long as you
offer something better than RC4 in the TLS handshake.
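
[Editorial aside, not part of the original mail: "something better than RC4"
is already controllable through the existing ssl_ciphers GUC using OpenSSL
cipher-list syntax, e.g. something along the lines of

    ALTER SYSTEM SET ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL:!RC4';

noting that in releases of this era the setting only takes effect at server
start.]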

-- 
Florian Weimer / Red Hat Product Security




Re: [HACKERS] pg_dump gets attributes from tables in extensions

2015-02-23 Thread Michael Paquier
On Mon, Feb 23, 2015 at 5:57 PM, Rushabh Lathia rushabh.lat...@gmail.com
wrote:
 Thanks to the handy test case, I was able to replicate the scenario in my
 local environment. And yes, the tbinfo->dobj.ext_member check in
 getTableAttrs() does fix the issue.

 Looking more into the pg_dump code, I found that PG generally doesn't use a
 tbinfo->dobj.ext_member check to ignore an object. Mostly we do this kind
 of check using tbinfo->dobj.dump (look at dumpTable() for reference). Do
 you have any particular reason for choosing dobj.ext_member over dobj.dump?

Hm. I have been pondering this code more, and I am dropping the patch:
adding a check based on either the flag tracking dumpable objects or on
ext_member visibly breaks the logic that binary upgrade and tables in
extensions rely on. In particular, this portion of getExtensionMembership()
makes me think so:
        /*
         * Normally, mark the member object as not to be dumped.  But in
         * binary upgrades, we still dump the members individually, since the
         * idea is to exactly reproduce the database contents rather than
         * replace the extension contents with something different.
         */
        if (!dopt->binary_upgrade)
            dobj->dump = false;
        else
            dobj->dump = refdobj->dump;
Regards,
-- 
Michael


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-23 Thread Fujii Masao
On Fri, Feb 20, 2015 at 9:33 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
    */
   WaitLatch(&XLogCtl->recoveryWakeupLatch,
             WL_LATCH_SET | WL_TIMEOUT,
 -           5000L);
 +           wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where the large interval is set
 and the user want to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well as mentioned by Alexey at the beginning of
 the thread to reduce the number of requests when requesting WAL
 archives from an external storage type AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing 
 something.
 The timeout of the WaitLatch is just the interval to check for the trigger 
 file
 while waiting for more WAL to arrive from streaming replication. Not 
 related to
 the retry time to restore WAL from the archive.

 [Re-reading the code...]
 Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a
 maximum of 5s then. I also noticed in previous patch that the wait was
 maximized to 5s. To begin with, a loop should have been used if it was
 a sleep, but as now patch uses a latch this limit does not make much
 sense... Patch updated is attached.

 On second thought, the interval to check the trigger file is very different
 from the wait time to retry to retrieve WAL. So it seems strange and even
 confusing to control them by one parameter. If we really want to make
 the interval for the trigger file configurable, we should invent new GUC for 
 it.
 But I don't think that it's worth doing that. If someone wants to react the
 trigger file more promptly for fast promotion, he or she basically can use
 pg_ctl promote command, instead. Thought?

 Hm, OK.

 Attached is the updated version of the patch. I changed the parameter so that
 it doesn't affect the interval of checking the trigger file.

 -static pg_time_t last_fail_time = 0;
 -pg_time_t   now;
 +TimestampTz now = GetCurrentTimestamp();
 +TimestampTz last_fail_time = now;

 I reverted the code here as it was. I don't think GetCurrentTimestamp() needs
 to be called for each WaitForWALToBecomeAvailable().

 +WaitLatch(&XLogCtl->recoveryWakeupLatch,
 +          WL_LATCH_SET | WL_TIMEOUT,
 +          wait_time / 1000);

 Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.

 Yeah, though I am wondering why this was not added after 89fd72cb.

 +{"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS,

 WAL_SETTINGS should be REPLICATION_STANDBY? Changed.

 Sure.

So I pushed the patch.

Regards,

-- 
Fujii Masao




[HACKERS] ERROR: cannot GetMultiXactIdMembers() during recovery

2015-02-23 Thread Marko Tiikkaja

Hi,

Andres asked me on IRC to report this here.  Since we upgraded our 
standby servers to 9.1.15 (though the master is still running 9.1.14), 
we've seen the error in $SUBJECT a number of times.  I managed to 
reproduce it today by running the same query over and over again, and 
attached is the back trace.


Let me know if you need any additional information.


.m
#0  GetMultiXactIdMembers (multi=56513428, xids=0x7fff9e3691e8) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/transam/multixact.c:923
#1  0x7f029a917dd4 in MultiXactIdIsRunning (multi=optimized out) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/transam/multixact.c:386
#2  0x7f029abad79d in HeapTupleSatisfiesVacuum (tuple=0x7f0145f2c060, 
OldestXmin=2418699920, buffer=111945) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/utils/time/tqual.c:1184
#3  0x7f029a8f08f5 in index_getnext (scan=scan@entry=0x7f029b66ac78, 
direction=direction@entry=ForwardScanDirection) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/index/indexam.c:644
#4  0x7f029a9fe646 in IndexNext (node=node@entry=0x7f029b669550) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeIndexscan.c:78
#5  0x7f029a9f41dc in ExecScanFetch (recheckMtd=0x7f029a9fe5c0 
IndexRecheck, accessMtd=0x7f029a9fe600 IndexNext, node=0x7f029b669550)
at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:82
#6  ExecScan (node=node@entry=0x7f029b669550, 
accessMtd=accessMtd@entry=0x7f029a9fe600 IndexNext, 
recheckMtd=recheckMtd@entry=0x7f029a9fe5c0 IndexRecheck)
at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:167
#7  0x7f029a9fe72b in ExecIndexScan (node=node@entry=0x7f029b669550) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeIndexscan.c:146
#8  0x7f029a9ec968 in ExecProcNode (node=node@entry=0x7f029b669550) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:398
#9  0x7f029a9fd3e1 in MultiExecHash (node=node@entry=0x7f029b6690b0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHash.c:103
#10 0x7f029a9eca74 in MultiExecProcNode (node=node@entry=0x7f029b6690b0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:536
#11 0x7f029a9fdd94 in ExecHashJoin (node=node@entry=0x7f029b667aa0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:177
#12 0x7f029a9ec8b8 in ExecProcNode (node=node@entry=0x7f029b667aa0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:447
#13 0x7f029a9fe319 in ExecHashJoinOuterGetTuple (hashvalue=0x7fff9e369554, 
hjstate=0x7f029b666c80, outerNode=0x7f029b667aa0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:656
#14 ExecHashJoin (node=node@entry=0x7f029b666c80) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:209
#15 0x7f029a9ec8b8 in ExecProcNode (node=node@entry=0x7f029b666c80) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:447
#16 0x7f029aa05a59 in ExecSort (node=node@entry=0x7f029b666a10) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSort.c:103
#17 0x7f029a9ec898 in ExecProcNode (node=node@entry=0x7f029b666a10) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:458
#18 0x7f029aa096e0 in begin_partition 
(winstate=winstate@entry=0x7f029b665530) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeWindowAgg.c:683
#19 0x7f029aa0b49b in ExecWindowAgg 
(winstate=winstate@entry=0x7f029b665530) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeWindowAgg.c:1287
#20 0x7f029a9ec868 in ExecProcNode (node=0x7f029b665530) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:470
#21 0x7f029a9f41dc in ExecScanFetch (recheckMtd=0x7f029aa084f0 
SubqueryRecheck, accessMtd=0x7f029aa08500 SubqueryNext, node=0x7f029b664e60)
at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:82
#22 ExecScan (node=node@entry=0x7f029b664e60, 
accessMtd=accessMtd@entry=0x7f029aa08500 SubqueryNext, 
recheckMtd=recheckMtd@entry=0x7f029aa084f0 SubqueryRecheck)
at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:167
#23 0x7f029aa08538 in ExecSubqueryScan (node=node@entry=0x7f029b664e60) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSubqueryscan.c:85
#24 0x7f029a9ec938 in ExecProcNode (node=node@entry=0x7f029b664e60) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:412
#25 0x7f029aa05a59 in ExecSort (node=node@entry=0x7f029b664bf0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSort.c:103
#26 0x7f029a9ec898 in ExecProcNode 

Re: [HACKERS] Enforce creation of destination folders for source files in pg_regress (Was: pg_regress writes into source tree)

2015-02-23 Thread Robert Haas

 On Feb 22, 2015, at 5:41 AM, Michael Paquier michael.paqu...@gmail.com 
 wrote:
 This is up to the maintainer of each extension to manage their code
 tree. However I can imagine that some people would be grateful if we
 allow them to not need sql/ and expected/ containing only one single
 .gitignore file ignoring everything with *, making the code tree of
 their extensions cleaner.

I don't see this as an improvement. What's wrong with a .gitignore file 
ignoring everything?

...Robert



Re: [HACKERS] ERROR: cannot GetMultiXactIdMembers() during recovery

2015-02-23 Thread Andres Freund
Hi,

On 2015-02-23 15:00:35 +0100, Marko Tiikkaja wrote:
 Andres asked me on IRC to report this here.  Since we upgraded our standby
 servers to 9.1.15 (though the master is still running 9.1.14), we've seen
 the error in $SUBJECT a number of times.

FWIW, I think this is just as borked in 9.1.14 and will likely affect
all of 9.0 - 9.2. The problem is that in those releases multixacts
aren't maintained on the standby in a way that allows access.

index_getnext() itself is actually pretty easy to fix, it already checks
whether the scan started while in recovery when using the result of the
error triggering HeapTupleSatisfiesVacuum(), just too late.  I don't
remember other HTSV callers that can run in recovery, given that DDL is
obviously impossible and we don't support serializable while in
recovery.

Alternatively we could make MultiXactIdIsRunning() return false in < 9.3
when in recovery. I think that'd end up fixing things, but it seems
awfully fragile to me.

I do see a HTSU in pgrowlocks.c - that's not really safe during recovery
in < 9.3, given it accesses multixacts. I guess it needs to throw an error.

I wonder if we shouldn't put an Assert() in HTSV/HTSU to prevent such
problems.

Greetings,

Andres Freund
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-23 Thread Fujii Masao
On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello,

 Attached is a patch which has following changes,

 As suggested above block ID in xlog structs has been replaced by chunk ID.
 Chunk ID is used to distinguish between different types of xlog record
 fragments.
 Like,
 XLR_CHUNK_ID_DATA_SHORT
 XLR_CHUNK_ID_DATA_LONG
 XLR_CHUNK_BKP_COMPRESSED
 XLR_CHUNK_BKP_WITH_HOLE

 In block references, the block ID follows the chunk ID and retains its
 original functionality.
 This approach adds 1 byte for each block reference in an xlog record, but
 it separates the IDs referring to the different fragments of an xlog record
 from the actual block ID that is used to refer to block references within
 the record.

I've not read this logic yet, but ISTM there is a bug in that new WAL format
because I got the following error and the startup process could not replay
any WAL records when I set up replication and enabled wal_compression.

LOG:  record with invalid length at 0/3B0
LOG:  record with invalid length at 0/3000518
LOG:  Invalid block length in record 0/30005A0
LOG:  Invalid block length in record 0/3000D60
...

Regards,

-- 
Fujii Masao




Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Andres Freund
Hi,

On 2015-02-23 15:25:57 +, Thom Brown wrote:
 I've noticed that if the primary is started and then a base backup is
  immediately taken from it and started as a synchronous standby, it
 doesn't replicate and the primary hangs indefinitely when trying to run any
 WAL-generating statements.  It only recovers when either the primary is
 restarted (which has to use a fast shutdown otherwise it also hangs
 forever), or the standby is restarted.
 
 Here's a way of reproducing it:
 ...
 Note that if you run the commands one by one, there isn't a problem.  If
 you run it as a script, the standby doesn't connect to the primary.  There
 aren't any errors reported by either the standby or the primary.  The
 primary's wal sender process reports the following:
 
 wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158
 
 Anyone know why this would be happening?  And if this could be a problem in
 other scenarios?

Given that normally a walsender doesn't wait for syncrep, I guess this is
the above backend having just done authentication. If you gdb into the
walsender, what's the backtrace?

We previously had discussions about that being rather annoying; I
unfortunately don't remember enough of the thread to reference it
here. If it really is this, I think we should add some more smarts about
only enabling syncrep once a backend is fully up and maybe even remove
it from more scenarios during commits generally (e.g. if no xid was
assigned and we just pruned something).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Thom Brown
Hi,

I've noticed that if the primary is started and then a base backup is
immediately taken from it and started as a synchronous standby, it
doesn't replicate and the primary hangs indefinitely when trying to run any
WAL-generating statements.  It only recovers when either the primary is
restarted (which has to use a fast shutdown otherwise it also hangs
forever), or the standby is restarted.

Here's a way of reproducing it:

---

mkdir -p -m 0700 primary standby1

initdb -N -k -D primary -E 'UTF8'

cat << PRIMARYCONFIG >> primary/postgresql.conf
shared_buffers = 8MB
logging_collector = on
log_line_prefix = '%m - %u - %d'
synchronous_standby_names = 'standby1'
max_connections = 8
wal_level = 'hot_standby'
port = 5530
max_wal_senders = 3
wal_keep_segments = 6
PRIMARYCONFIG

cat << PRIMARYHBA >> primary/pg_hba.conf
local   replication rep_user   trust
hostreplication rep_user   127.0.0.1/32trust
hostreplication rep_user   ::1/128 trust
PRIMARYHBA

pg_ctl start -D primary

psql -p 5530 -h localhost -c "SET SESSION synchronous_commit TO off; CREATE USER rep_user REPLICATION;" -d postgres

pg_basebackup -x -D standby1 -h localhost -p 5530 -U rep_user

cat << STANDBYCONFIG >> standby1/postgresql.conf
port = 5531
hot_standby = on
STANDBYCONFIG

cat << STANDBYRECOVERY >> standby1/recovery.conf
standby_mode = 'on'
recovery_target_timeline = 'latest'
primary_conninfo = 'host=127.0.0.1 user=rep_user port=5530
application_name=standby1'
STANDBYRECOVERY

pg_ctl -D standby1 start

---

Note that if you run the commands one by one, there isn't a problem.  If
you run it as a script, the standby doesn't connect to the primary.  There
aren't any errors reported by either the standby or the primary.  The
primary's wal sender process reports the following:

wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158
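
[Editorial aside, not part of the original mail: while the primary hangs
like this, two quick checks one might run against it are

    SELECT application_name, state, sync_state FROM pg_stat_replication;
    SELECT pid, state, query FROM pg_stat_activity;

to see whether the standby ever registered as a sync standby and what the
stuck connection is doing.]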

Anyone know why this would be happening?  And if this could be a problem in
other scenarios?

Thom


Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Andres Freund
On 2015-02-23 16:38:44 +0100, Andres Freund wrote:
 I unfortunately don't remember enough of the thread to reference it
 here.

Found the right keywords. The threads below
http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau
and
http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Thom Brown
On 23 February 2015 at 15:38, Andres Freund and...@2ndquadrant.com wrote:

 Hi,

 On 2015-02-23 15:25:57 +, Thom Brown wrote:
  I've noticed that if the primary is started and then a base backup is
  immediately taken from it and started as a synchronous standby, it
  doesn't replicate and the primary hangs indefinitely when trying to run
 any
  WAL-generating statements.  It only recovers when either the primary is
  restarted (which has to use a fast shutdown otherwise it also hangs
  forever), or the standby is restarted.
 
  Here's a way of reproducing it:
  ...
  Note that if you run the commands one by one, there isn't a problem.  If
  you run it as a script, the standby doesn't connect to the primary.
 There
  aren't any errors reported by either the standby or the primary.  The
  primary's wal sender process reports the following:
 
  wal sender process rep_user 127.0.0.1(45243) startup waiting for
 0/3000158
 
  Anyone know why this would be happening?  And if this could be a problem
 in
  other scenarios?

 Given that normally a walsender doesn't wait for syncrep I guess this is
 the above backend just did authentication. If you gdb into the
 walsender, what's the backtrace?


#0  0x7f66d1725940 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00617faa in WaitLatchOrSocket ()
#2  0x0064741b in SyncRepWaitForLSN ()
#3  0x004bbf8f in CommitTransaction ()
#4  0x004be135 in CommitTransactionCommand ()
#5  0x00757679 in InitPostgres ()
#6  0x00675032 in PostgresMain ()
#7  0x004617ef in ServerLoop ()
#8  0x00627c9c in PostmasterMain ()
#9  0x0046223d in main ()

-- 
Thom


Re: [HACKERS] SSL renegotiation

2015-02-23 Thread Albe Laurenz
Florian Weimer wrote:
 On 02/22/2015 02:05 PM, Andres Freund wrote:
 On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote:
 I honestly wonder why postgres uses renegotiation at all. The motivation
 that cryptoanalysis is easier as more data is sent seems quite
 far-fetched.

 I don't think so. There's a fair number of algorithms that can/could be
 attacked much more easily with lots of data available, especially if you
 can guess/know/control some of the data.  Additionally, renegotiating
 regularly helps to constrain a possible key leakage to a certain amount
 of time. With backend connections often being alive for weeks at a time,
 that's not a bad thing.
 
 Renegotiation will be removed from future TLS versions because it is
 considered unnecessary with modern ciphers:
 
   https://github.com/tlswg/tls13-spec/issues/38
 
 If ciphers require rekeying, that mechanism will be provided at the TLS
 layer in the future.
 
 I think you could remove renegotiation from PostgreSQL as long as you
 offer something better than RC4 in the TLS handshake.

I'd say it is best to wait and see if and how OpenSSL changes their API when
they implement TLS 1.3.

I'd vote against removing renegotiation.  At the very least, if the feature
should prove unnecessary and cumbersome with future versions of OpenSSL,
we should retain ssl_renegotiation_limit and change the default to 0.
It might still be of value with older versions.
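
[Editorial aside, not part of the original mail: retaining the GUC and
changing the default to 0, as suggested above, amounts to

    ALTER SYSTEM SET ssl_renegotiation_limit = 0;   -- 0 disables renegotiation

which is also what individual installations can already do today if
renegotiation causes them trouble.]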

If changing the encryption is so useless, why did the TLS working group
decide to introduce rekeying as a substitute for renegotiation?

Yours,
Laurenz Albe



Re: [HACKERS] [RFC] Should smgrtruncate() avoid sending sinval message for temp relations

2015-02-23 Thread Andres Freund
Hi,

On 2014-07-26 18:16:01 +0200, Andres Freund wrote:
 On 2014-07-26 11:32:24 -0400, Tom Lane wrote:
  MauMau maumau...@gmail.com writes:
  [ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]
  
  Ugh.
  
  One line of thought is that it's pretty unsafe to be doing anything
  as complicated as transaction start/commit in a signal handler, even one
  that is sure it's not interrupting anything else.
 
 Yea, that's really not nice.

MauMau, we don't do this anymore. Could you verify that the issue is
fixed for you?

I'd completely forgotten that this thread made me work on moving
everything complicated out of signal handlers...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Thom Brown
On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com wrote:

 On 2015-02-23 16:38:44 +0100, Andres Freund wrote:
  I unfortunately don't remember enough of the thread to reference it
  here.

 Found the right keywords. The threads below

 http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau
 and

 http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau


Yes, this seems to be virtually the same issue reported.  The trace looks
the same except for RecordTransactionCommit.

-- 
Thom


Re: [HACKERS] multiple backends attempting to wait for pincount 1

2015-02-23 Thread Andres Freund
On 2015-02-17 13:14:00 -0500, Tom Lane wrote:
 Hm, good point.  On the other hand, should we worry about the possibility
 of a lost signal?  Moving the flag-clearing would guard against that,
 which the current code does not.  But we've not seen field reports of such
 issues AFAIR, so this might not be an important consideration.

I think if there were lost signals there'd be much bigger problems, given
that the same (or, in master, similar) mechanics are used for a lot of other
things, including the heavyweight and lightweight lock wait queues.

  ...
  /*
   * Make sure waiter flag is reset - it might not be if
   * ProcWaitForSignal() returned for another reason than
   * UnpinBuffer() signalling us.
   */
  LockBufHdr(bufHdr);
  buf->flags &= ~BM_PIN_COUNT_WAITER;
  Assert(bufHdr->wait_backend_pid == MyProcPid);
  UnlockBufHdr(bufHdr);

  PinCountWaitBuf = NULL;
  /* Loop back and try again */
  }
 
  to the bottom of the loop would suffice.
 
 No, I disagree.  If we maintain the rule that the signaler clears
 BM_PIN_COUNT_WAITER, then once that happens there is nothing to stop a
 third party from trying to LockBufferForCleanup on the same buffer (except
 for table-level locking conventions, which IMO this mechanism shouldn't be
 dependent on).  So this coding would potentially clear the
 BM_PIN_COUNT_WAITER flag belonging to that third party, and then fail the
 Assert --- but only in debug builds, not in production, where it would
 just silently lock up the third-party waiter.  So I think having a test to
 verify that it's still our BM_PIN_COUNT_WAITER flag is essential.

Pushed with a test guarding against that. I still think it might be
slightly better to error out if somebody else waits, but I guess it's
unlikely that we'd mistakenly add code doing that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Query Rewrite with Postgres' materialized views

2015-02-23 Thread Eric Grinstein
Thank you for your answers.
I am very eager to contribute to Postgres, especially in the materialized
views area.
I have created a thread
http://www.postgresql.org/message-id/cak7uwezzxjjsxp9uoxwhpnjjutjajvpkja9skzalsnnrdjs...@mail.gmail.com
proposing
to work on it as my Google Summer of Code project.
The consensus was that implementing the query rewrite feature would be
unfeasible in such a short period of time, but Tomas Vondra suggested I
could work on a first part of it, such as adding a staleness flag for the
MVs. If you have any opinions on the thread, I'd be pleased to read them.

Thank you and regards,

Eric

2015-02-21 16:09 GMT-02:00 Robert Haas robertmh...@gmail.com:

 On Sat, Feb 21, 2015 at 1:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  I'm not really sure what Josh was talking about in that answer.  In
  terms of doing this automatically, I doubt that's likely to happen
  until we have a way to automatically update a materialized view when
  the underlying data changes --- and Kevin Grittner has done a bunch of
  work towards that, but more is needed to get us there.
 
  Even with auto maintenance, it seems unlikely that matviews would ever
  be so transparent that it would be okay for the planner to automatically
  substitute them into a query.  The data in a matview is going to be at
  least a little bit stale, else You're Doing It Wrong.

 Perhaps, but this is a feature we've gotten MANY customer requests
 for.  IDK why, but it's true.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Redesigning checkpoint_segments

2015-02-23 Thread Heikki Linnakangas

On 02/23/2015 01:01 PM, Andres Freund wrote:

On 2015-02-22 21:24:56 -0500, Robert Haas wrote:

On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek p...@2ndquadrant.com wrote:

I am wondering a bit about interaction with wal_keep_segments.
One thing is that wal_keep_segments is still specified as a number of
segments and not in size units; maybe it would be worth changing that too?
And the other thing is that, if set, wal_keep_segments is the real
max_wal_size from the user's perspective (not from the perspective of the
algorithm in this patch, but the user does not really care about that),
which is somewhat weird given the naming.


It seems like wal_keep_segments is more closely related to
wal_*min*_size.  The idea of both settings is that each is a minimum
amount of WAL we want to keep around for some purpose.  But they're
not quite the same, I guess, because wal_min_size just forces us to
keep that many files around - they can be overwritten whenever.
wal_keep_segments is an amount of actual WAL data we want to keep
around.

Would it make sense to require that wal_keep_segments = wal_min_size?


I don't think so. Right now checkpoint_segments is a useful tool to
relatively effectively control the amount of WAL that needs to be
replayed in the event of a crash. wal_keep_segments in contrast doesn't
have much to do with the normal working of the system, except that it
delays recycling of WAL segments a bit.

With a condition like the above, how would you set things up so that you
have 50k segments around for replication (say a good day's worth), but will
never have to replay more than ~800 segments (i.e. something like
checkpoint_segments = 800)?


Right. While wal_keep_segments and wal_min_size both set a kind of a 
minimum on the amount of WAL that's kept in pg_xlog, they are different 
things, and a rule that one must be less than or greater than the other 
doesn't make sense.
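
[Editorial aside, not part of the original mail: the point is that the two
kinds of minimum are tuned independently and in different units, e.g. along
the lines of

    ALTER SYSTEM SET max_wal_size = '2GB';       -- bounds checkpoint-related WAL, size units
    ALTER SYSTEM SET wal_min_size = '512MB';     -- floor for recycled WAL, size units
    ALTER SYSTEM SET wal_keep_segments = 100;    -- extra WAL kept for standbys, a segment count

with the GUC names as used in this thread and the values purely
illustrative.]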


Everyone seems to be happy with the names and behaviour of the GUCs, so 
committed.


- Heikki





Re: [HACKERS] mogrify and indent features for jsonb

2015-02-23 Thread Dmitry Dolgov
Hi, Petr, thanks for the review.

 I think it would be better if the indent printing didn't put the start
of array ([) and start of dictionary ({) on a separate line
Did you mean this?

[
    {
        "a": 1,
        "b": 2
    }
]

I tried to verify this in several ways (http://jsonprettyprint.com/,
JSON.stringify, json.tool from python), and I always get this result
(a new line after "[" and "{").

 I don't really understand the point of h_atoi() function.
Initially, this function was meant to combine the conversion logic and some
specific verifications. But I agree, strtol is the more correct way; I
should improve this.

 The code looks ok as well except maybe the replacePath could use a couple
of comments
I have already added several comments for this function in the jsonbx
extension (and it looks like I should add even more in the near future), and
I think we can update this patch one more time with that improvement.

 About the Assert(ARR_ELEMTYPE(path) == TEXTOID);
I based my work on the hstore extension, which contains this kind of
assertion. But I suppose it's not required anymore, so I removed it from the
extension. And I think we can also remove it from the patch.

On 18 February 2015 at 08:32, Petr Jelinek p...@2ndquadrant.com wrote:

 Hi,

 I looked at the patch and have several comments.

 First let me say that modifying the individual paths of the json is the
 feature I miss the most in the current implementation so I am happy that
 this patch was submitted.

 I would prefer slightly the patch split into two parts, one for the indent
 printing and one with the manipulation functions, but it's not too big
 patch so it's not too bad as it is.

 There is one compiler warning that I see:
 jsonfuncs.c:3533:1: warning: no previous prototype for ‘jsonb_delete_path’
 [-Wmissing-prototypes]
  jsonb_delete_path(PG_FUNCTION_ARGS)

 I think it would be better if the indent printing didn't put the start of
 array ([) and start of dictionary ({) on a separate line, since most pretty
 printing implementations outside of Postgres (like JSON.stringify or
 python's json module) don't do that. This is purely about consistency with
 the rest of the world.

 The json_ident might be better named as json_pretty perhaps?

 I don't really understand the point of h_atoi() function. What's wrong
 with using strtol like pg_atoi does? Also there is no integer overflow
 check anywhere in that function.

 There is tons of end of line whitespace mess in jsonb_indent docs.

 Otherwise everything I tried so far works as expected. The code looks ok
 as well except maybe the replacePath could use couple of comments (for
 example the line which uses the h_atoi) to make it easier to follow.

 About the:

 +   /* XXX : why do we need this assertion? The functions is declared
 to take text[] */
 +   Assert(ARR_ELEMTYPE(path) == TEXTOID);


 I wonder about this also, some functions do that, some don't, I am not
 really sure what is the rule there myself.

 --
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services






Re: [HACKERS] Bug in pg_dump

2015-02-23 Thread Gilles Darold
On 17/02/2015 14:44, Michael Paquier wrote:
 On Tue, Jan 20, 2015 at 8:06 PM, Gilles Darold gilles.dar...@dalibo.com 
 wrote:
 On 19/01/2015 14:41, Robert Haas wrote:
 On Thu, Jan 15, 2015 at 6:26 AM, Gilles Darold gilles.dar...@dalibo.com 
 wrote:
 I attach a patch that solves the issue in pg_dump, let me know if it might
 be included in Commit Fest or if the three other solutions are a better
 choice.
 I think a fix in pg_dump is appropriate, so I'd encourage you to add
 this to the next CommitFest.

 Ok, thanks Robert. The patch have been added to next CommitFest under
 the Bug Fixes section.

 I've also made some review of the patch and more test. I've rewritten
 some comments and added a test when TableInfo is NULL to avoid potential
 pg_dump crash.

 New patch attached: pg_dump.c-extension-FK-v2.patch
 So, I looked at your patch and I have some comments...

 The approach taken by the patch looks correct to me as we cannot
 create FK constraints after loading the data in the case of an
 extension, something that can be done with a data-only dump.

 Now, I think that the routine hasExtensionMember may impact
 performance unnecessarily in the case of databases with many tables,
 say thousands or more. And we can easily avoid this performance
 problem by checking the content of each object dumped by doing the
 legwork directly in getTableData. Also, most of the NULL pointer
 checks are not needed as most of those code paths would crash if
 tbinfo is NULL, and actually only keeping the one in dumpConstraint is
 fine.

Yes, that's exactly what we discussed in Moscow; thanks for removing the
hasExtensionMember() routine. About the NULL pointers, I was not sure that
it could not crash in some other parts, so I decided to check everywhere.
If it's ok to just check in dumpConstraint(), that's fine.
 
 On top of those things, I have written a small extension able to
 reproduce the problem that I have included as a test module in
 src/test/modules. The dump/restore check is done using the TAP tests,
 and I have hacked prove_check to install as well the contents of the
 current folder to be able to use the TAP routines with an extension
 easily. IMO, as we have no coverage of pg_dump with extensions I think
 that it would be useful to ensure that this does not break again in
 the future.

Great! I did not have time to do that, thank you so much.

 All the hacking I have done during the review results in the set of
 patch attached:
 - 0001 is your patch, Gilles, with some comment fixes as well as the
 performance issue with hasExtensionMember fix
 - 0002 is the modification of prove_check that makes TAP tests usable
 with extensions
 - 0003 is the test module covering the tests needed for pg_dump, at
 least for the problem of this thread.

 Gilles, how does that look to you?

Looks great to me. I have tested with the postgis_topology extension and
everything works fine.

 (Btw, be sure to generate your patches directly with git next time. :) )

Sorry, some old reminiscence :-)

Thanks for the review.
 
Best regards,

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org





[HACKERS] json_populate_record issue - TupleDesc reference leak

2015-02-23 Thread Pavel Stehule
Hi

When I tested json_populate_record in the test at
http://stackoverflow.com/questions/7711432/how-to-set-value-of-composite-variable-field-using-dynamic-sql/28673097#28673097
I found a small issue:

create type pt as (a int, b int);

postgres=# select json_populate_record('(10,20)'::pt, '{}');
WARNING:  TupleDesc reference leak: TupleDesc 0x7f10fcf41400 (567018,-1)
still referenced
 json_populate_record
----------------------
 (10,20)
(1 row)

jsonb is ok

postgres=# select jsonb_populate_record('(10,20)'::pt, '{}');
 jsonb_populate_record
-----------------------
 (10,20)
(1 row)

Regards

Pavel


Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Andres Freund
On 2015-02-23 15:48:25 +, Thom Brown wrote:
 On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com wrote:
 
  On 2015-02-23 16:38:44 +0100, Andres Freund wrote:
   I unfortunately don't remember enough of the thread to reference it
   here.
 
  Found the right keywords. The threads below
 
  http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau
  and
 
  http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau
 
 
 Yes, this seems to be virtually the same issue reported.  The trace looks
 the same except for RecordTransactionCommit.

So, I proposed in
http://www.postgresql.org/message-id/20140707155113.gb1...@alap3.anarazel.de
that we make sequences assign a xid and only wait for syncrep when a xid
is assigned. The biggest blocker was that somebody would have to do some
code reviewing to find other locations that might need similar
treatment.

I did a, quick, grep for XLogInsert() and I think we're otherwise
fine. There's some debatable cases:

* XLOG_STANDBY_LOCK  doesn't force a xid to be assigned. I think it's
  harmless though, as we really only need to wait for that to be
  replicated if the transaction did something relevant (i.e. catalog
  changes). And those will force xid assignment.
* 2pc records don't assign a xid. But twophase.c does its own waiting,
  so that's fine.
* Plain vacuums will not trigger waits. But I think that's good. There's
  really no need to wait if all that's been done is some cleanup without
  visible consequences.
* Fujii brought up that we might want to wait for XLOG_SWITCH - I don't
  really see why.
* XLOG_RESTORE_POINT is a similar candidate - I don't see really valid
  arguments for making 2pc wait.


The attached, untested, patch changes things so that we
a) only wait for syncrep if we both wrote WAL and had a xid assigned
b) use an async commit if we just had a xid assigned, without having
   written WAL, even if synchronous_commit = off
c) acquire a xid when WAL logging sequence changes (arguable at least
   one of the xid assignments is redundant, but it doesn't cost
   anything, so ...)
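
[Editorial aside, not part of the original mail: the case behind (c) is a
transaction whose only WAL comes from a sequence advance, e.g.

    BEGIN;
    SELECT nextval('some_seq');  -- hypothetical sequence; writes WAL without assigning an xid today
    COMMIT;

which under (a) and (b) alone would no longer wait for sync rep, hence
forcing an xid assignment in the sequence code.]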

I think it makes sense to change a) and b) that way because there's no
need to wait for WAL flushes/syncrep waits when all that happened is
manipulations of temporary/unlogged tables or HOT pruning. It's slightly
weird that the on-disk flush and the syncrep wait essentially used two
different mechanisms for deciding when to flush.

Comments? This is obviously just a POC, but I think something like this
does make a great deal of sense.

Thom, does that help?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 16828b249b4d2ebfc04678f2570d6c439981e472 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Mon, 23 Feb 2015 17:38:40 +0100
Subject: [PATCH] Reconsider when to flush WAL and wait for syncrep while
 committing.

---
 src/backend/access/transam/xact.c | 30 ++
 src/backend/commands/sequence.c   | 23 +++
 2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 97000ef..2818a03 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1031,10 +1031,9 @@ RecordTransactionCommit(void)
 
 		/*
 		 * If we didn't create XLOG entries, we're done here; otherwise we
-		 * should flush those entries the same as a commit record.  (An
-		 * example of a possible record that wouldn't cause an XID to be
-		 * assigned is a sequence advance record due to nextval() --- we want
-		 * to flush that to disk before reporting commit.)
+		 * should trigger flushing those entries the same as a commit record
+		 * would.  This will primarily happen for HOT pruning and the like; we
+		 * want these to be flushed to disk in due time.
 		 */
 		if (!wrote_xlog)
 			goto cleanup;
@@ -1153,11 +1152,13 @@ RecordTransactionCommit(void)
 	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
-	 * transaction has not performed any WAL-logged operation.  The latter
-	 * case can arise if the current transaction wrote only to temporary
-	 * and/or unlogged tables.  In case of a crash, the loss of such a
-	 * transaction will be irrelevant since temp tables will be lost anyway,
-	 * and unlogged tables will be truncated.  (Given the foregoing, you might
+	 * transaction has not performed any WAL-logged operation or didn't assign
+	 * a xid.  The transaction can end up not writing any WAL, even if it has
+	 * a xid, if it only wrote to temporary and/or unlogged tables.  It can
+	 * end up having written WAL without and xid if it did HOT pruning.  In
+	 * case of a crash, the loss of such a transaction will be irrelevant;
+	 * temp tables will be lost anyway, 

Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-02-23 Thread David Steele
On 2/18/15 10:25 AM, David Steele wrote:
 On 2/18/15 6:11 AM, Fujii Masao wrote:
 The pg_audit doesn't log BIND parameter values when prepared statement is 
 used.
 Seems this is an oversight of the patch. Or is this intentional?
 
 It's actually intentional - following the model I talked about in my
 earlier emails, the idea is to log statements only.  This also follows
 on 2ndQuadrant's implementation.

Unfortunately, I think it's beyond the scope of this module to log bind
variables.  I'm following not only 2ndQuadrant's implementation, but
Oracle's as well.

 Logging values is interesting, but I'm sure the user would want to
 specify which columns to log, which I felt was beyond the scope of the
 patch.
 
 pg_audit cannot log a statement like SELECT 1 which doesn't access any
 database object. Is this intentional? I think that there are many users
 who want to audit even such statements.
 
 I think I see how to make this work.  I'll work on it for the next
 version of the patch.

This has been fixed in the v2 patch.

 Imagine the case where you call the user-defined function which executes
 many nested statements. In this case, pg_audit logs only top-level statement
 (i.e., issued directly by client) every time nested statement is executed.
 In fact, one call of such UDF can cause lots of *same* log messages. I think
 this is problematic.
 
 I agree - not sure how to go about addressing it, though.  I've tried to
 cut down on the verbosity of the logging in general, but of course it
 can still be a problem.
 
 Using security definer and a different logging GUC for the defining role
 might work.  I'll add that to my unit tests and see what happens.

That didn't work - but I didn't really expect it to.

Here are two options I thought of:

1) Follow Oracle's "as session" option and only log each statement type
against an object the first time it happens in a session.  This would
greatly reduce logging, but might be too little detail.  It would
increase the memory footprint of the module to add the tracking.

2) Only log once per call to the backend.  Essentially, we would only
log the statement you see in pg_stat_activity.  This could be a good
option because it logs what the user accesses directly, rather than
functions, views, etc. which hopefully are already going through a
review process and can be audited that way.

Would either of those address your concerns?

-- 
- David Steele
da...@pgmasters.net





Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Tomas Vondra
Hi,

On 22.2.2015 22:30, Peter Geoghegan wrote:
 On Sun, Feb 22, 2015 at 1:19 PM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 In short, this fixes all the cases except for the ASC sorted data. I
 haven't done any code review, but I think we want this.

 I'll use data from the i5-2500k, but it applies to the Xeon too, except
 that the Xeon results are more noisy and the speedups are not that
 significant.

 For the 'text' data type, and 'random' dataset, the results are these:

    scale    datum    cost-model
  -------------------------------
       10     328%         323%
      100     392%         391%
      200      96%         565%
      300      97%         572%
      400      97%         571%
      500      98%         570%

 The numbers are speedup vs. master, so 100% means exactly the same
 speed, 200% means twice as fast.

 So while with 'datum' patch this actually caused very nice speedup for
 small datasets - about 3-4x speedup up to 1M rows, for larger datasets
 we've seen small regression (~3% slower). With the cost model fix, we
 actually see a significant speedup (about 5.7x) for these cases.
 
 Cool.
 
 I haven't verified whether this produces the same results, but if it
 does this is very nice.

 For 'DESC' dataset (i.e. data sorted in reverse order), we do get even
 better numbers, with up to 6.5x speedup on large datasets.

 But for 'ASC' dataset (i.e. already sorted data), we do get this:

    scale    datum    cost-model
  -------------------------------
       10      85%          84%
      100      87%          87%
      200      76%          96%
      300      82%          90%
      400      91%          83%
      500      93%          81%

 Ummm, not that great, I guess :-(
 
 You should try it with the data fully sorted like this, but with one 
 tiny difference: The very last tuple is out of order. How does that 
 look?

So here are the results for ASC-ordered dataset, with one 'unsorted' row
added to the end of the dataset. As before the complete scripts are
attached, and the raw results are available in a spreadsheet:

http://bit.ly/18g1nTU
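
[Editorial aside, not part of the original mail: the dataset variant
described here is simply an already-sorted table with a single out-of-place
row appended, e.g. something like

    CREATE TABLE stuff_int_asc AS
        SELECT i AS randint FROM generate_series(1, 1000000) i;
    INSERT INTO stuff_int_asc VALUES (0);  -- the one out-of-order row at the end

with the table name and row count purely illustrative and the column name
taken from the queries below.]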

The durations are much higher than without the single unsorted row added
at the end. Queries often take 20x longer to finish (on the same code),
depending on the scale.

The speedup results (compared to master) look like this:

 scale   query#   datum   numeric   cost model
    10        1    859%      861%         856%
    10        2    811%      814%         805%
    10        3    100%      100%          97%
   100        1    805%      804%         807%
   100        2    769%      773%         770%
   100        3    100%      100%          98%
   200        1     97%       97%         673%
   200        2     96%       97%         646%
   200        3     99%      101%         678%
   300        1     98%       98%         578%
   300        2     96%       97%         557%
   300        3     99%      101%         579%
   400        1     99%       99%         513%
   400        2     97%       98%         497%
   400        3     99%      101%         510%
   500        1     99%       99%         469%
   500        2     97%       98%         456%
   500        3     99%      101%         466%

What's interesting here is that some queries are much faster, but query
#3 is slow until we hit 2M rows:

select * from (select * from stuff_int_desc order by randint
   offset 1000) foo

Looking at the previous tests, I see this is exactly what's happening to
this query with 'random' dataset - it's slightly slower than master up
until 2M rows, when it suddenly jumps to the same speedup as the other
queries. Can we do something about that?

Anyway, I'm wondering what conclusion we can draw from this. I believe the
vast majority of datasets in production won't be perfectly sorted, because
when the table is CLUSTERed by an index we tend to use an index scan to do
the sort (so no problem), or the data are not actually perfectly sorted (and
here we get a significant speedup).


-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


scripts.tgz
Description: application/tgz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for Numeric

2015-02-23 Thread Tomas Vondra
Hi,

On 23.2.2015 11:59, Andrew Gierth wrote:
 Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes:
 
  Tomas Interesting, but I think Gavin was asking about how much
  Tomas variation was there for each tested case (e.g. query executed on
  Tomas the same code / dataset). And in those cases the padding /
  Tomas alignment won't change, and thus the effects you describe won't
  Tomas influence the results, no?
 
 My point is exactly the fact that since the result is not affected,
 this variation between runs of the same code is of no real relevance
 to the question of whether a given change in performance can properly
 be attributed to a patch.
 
 Put it this way: suppose I have a test that when run repeatedly with no
 code changes takes 6.10s (s=0.025s), and I apply a patch that changes
 that to 6.26s (s=0.025s). Did the patch have an impact on performance?
 
 Now suppose that instead of applying the patch I insert random amounts
 of padding in an unused function and find that my same test now takes a
 mean of 6.20s (s=0.058s) when I take the best timing for each padding
 size and calculate stats across sizes. Now it looks obvious that the
 actual code of the patch probably wasn't responsible for any change...
 
 The numbers used here aren't theoretical; they are obtained by testing a
 single query - select * from d_flt order by v offset 1000 where
 d_flt contains 5 million float8 values - over 990 times with 33
 different random padding sizes (uniform in 0-32767). Here's a scatter
 plot, with 3 runs of each padding size so you can see the repeatability:
 http://tinyurl.com/op9qg8a


I think we're talking about slightly different things, then.

I believe Gavin was asking about variability for executions with a
particular code (i.e. with fixed amount of padding), to decide whether
it even makes sense to compare results for different patches or whether
the differences are just random noise.

Interpreting those differences - whether they are due to changes in the
algorithm or a result of some padding somewhere else in the code, that's
of course important too.

I believe the small regressions (1-10%) for small data sets might be
caused by this 'random padding' effect, because that's probably where
L1/L2 cache is most important. For large datasets the caches are
probably not as efficient anyway, so the random padding makes no
difference, and the speedup is just as good as for the other queries.
See for example this:

  http://www.postgresql.org/message-id/54eb580c.2000...@2ndquadrant.com

But I'm speculating here ... time for a profiler, I guess.


-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Thom Brown
On 23 February 2015 at 16:53, Andres Freund and...@2ndquadrant.com wrote:

 On 2015-02-23 15:48:25 +, Thom Brown wrote:
  On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com
 wrote:
 
   On 2015-02-23 16:38:44 +0100, Andres Freund wrote:
I unfortunately don't remember enough of the thread to reference it
here.
  
   Found the right keywords. The threads below
  
  
 http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau
   and
  
  
 http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau
  
 
  Yes, this seems to be virtually the same issue reported.  The trace looks
  the same except for RecordTransactionCommit.

 So, I proposed in

 http://www.postgresql.org/message-id/20140707155113.gb1...@alap3.anarazel.de
 that we make sequences assign a xid and only wait for syncrep when a xid
 is assigned. The biggest blocker was that somebody would have to do some
 code reviewing to find other locations that might need similar
 treatment.

 I did a, quick, grep for XLogInsert() and I think we're otherwise
 fine. There's some debatable cases:

 * XLOG_STANDBY_LOCK  doesn't force a xid to be assigned. I think it's
   harmless though, as we really only need to wait for that to be
   replicated if the transaction did something relevant (i.e. catalog
   changes). And those will force xid assignment.
 * 2pc records don't assign a xid. But twophase.c does its own waiting,
   so that's fine.
 * Plain vacuums will not trigger waits. But I think that's good. There's
   really no need to wait if all that's been done is some cleanup without
   visible consequences.
 * Fujii brought up that we might want to wait for XLOG_SWITCH - I don't
   really see why.
 * XLOG_RESTORE_POINT is a similar candidate - I don't see really valid
   arguments for making 2pc wait.


 The attached, untested, patch changes things so that we
 a) only wait for syncrep if we both wrote WAL and had a xid assigned
 b) use an async commit if we just had a xid assigned, without having
written WAL, even if synchronous_commit = off
 c) acquire a xid when WAL logging sequence changes (arguable at least
one of the xid assignments is redundant, but it doesn't cost
anything, so ...)

 I think it makes sense to change a) and b) that way because there's no
 need to wait for WAL flushes/syncrep waits when all that happened is
 manipulations of temporary/unlogged tables or HOT pruning. It's slightly
 weird that the on-disk flush and the syncrep wait essentially used two
 different mechanisms for deciding when to flush.
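
To make the distinction concrete, two illustrative statements (a sketch;
object names invented): the first writes WAL but traditionally assigns no
xid, the second assigns an xid but writes no WAL because the table is
temporary.

    CREATE SEQUENCE myseq;
    SELECT nextval('myseq');           -- WAL-logged, historically no xid assigned
    CREATE TEMP TABLE scratch (i int);
    INSERT INTO scratch VALUES (1);    -- xid assigned, but no WAL for the data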

 Comments? This is obviously just a POC, but I think something like this
 does make a great deal of sense.

 Thom, does that help?


Yeah, this appears to eliminate the problem, at least in the case I
reported.

Thanks

-- 
Thom


Re: [HACKERS] SSL renegotiation

2015-02-23 Thread Andres Freund
On 2015-02-23 15:15:31 +0100, Florian Weimer wrote:
 On 02/22/2015 02:05 PM, Andres Freund wrote:
  On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote:
  I honestly wonder why postgres uses renegotiation at all. The motivation
  that cryptoanalysis is easier as more data is sent seems quite
  far-fetched.
  
  I don't think so. There's a fair number of algorithms that can/could be
  attacked much more easily with lots of data available. Especially if you
  can guess/know/control some of the data.  Additionally renegotiating
  regularly helps to constrain a possible key leakage to a certain amount
  of time. With backend connections often being alive for weeks at a time
  that's not a bad thing.
 
 Renegotiation will be removed from future TLS versions because it is
 considered unnecessary with modern ciphers:
 
   https://github.com/tlswg/tls13-spec/issues/38
 
 If ciphers require rekeying, that mechanism will be provided at the TLS
 layer in the future.
 
 I think you could remove renegotiation from PostgreSQL as long as you
 offer something better than RC4 in the TLS handshake.

As far as I understand it, this argument misses an important
point. Without protocol level rekeying, handled by the library in the
background, never changing the session key pretty much breaks PFS. In
HTTP the sessions are almost always short enough that such a
consideration doesn't play a role, but it's far from uncommon to have
database connections that live weeks and transport hundreds of gigabytes
in their lifetime.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSL renegotiation

2015-02-23 Thread Henry B Hotz
Renegotiation should be a best practice. Trouble is it's been broken (at the 
protocol level) three times in the last few years so it's a massive hole in 
practice. 

Ideally we should leave the renegotiate in, and only remove it if configure 
detects a broken version of TLS.

Personal email. hbh...@oxy.edu

 On Feb 23, 2015, at 7:01 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:
 
 I'd say it is best to wait if and how OpenSSL change their API when they
 implement TLS 1.3.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)

2015-02-23 Thread David Steele
Hi Stephen,

Thanks for your review.  All fixed except for comments below:

On 2/17/15 10:34 AM, Stephen Frost wrote:
 +/*
 + * Check privileges granted indirectly via role memberships. We do this 
 in
 + * a separate pass to minimize expensive indirect membership tests.  In
 + * particular, it's worth testing whether a given ACL entry grants any
 + * privileges still of interest before we perform the has_privs_of_role
 + * test.
 + */
 
 I'm a bit on the fence about this..  Can you provide a use-case where
 doing this makes sense..?  Does this mean I could grant admin_role1 to
 audit and then get auditing on everything that user1 has access to?
 That seems like it might be useful for environments where such roles
 already exist though it might end up covering more than is intended..

The idea is that if there are already ready-made roles to be audited
then they don't need to be reconstituted for the audit role.  You could
just do:

grant admin_role to audit;
grant user_role to audit;

Of course, we could list multiple roles in the pg_audit.role GUC, but I
thought this would be easier to use and maintain since there was some
worry about GUCs being fragile when they refer to database objects.

 +
 +# test.pl - pgAudit Unit Tests
 +
 
 This is definitely nice..
 
 diff --git a/doc/src/sgml/pgaudit.sgml b/doc/src/sgml/pgaudit.sgml
 new file mode 100644
 index 000..f3f4ab9
 --- /dev/null
 +++ b/doc/src/sgml/pgaudit.sgml
 @@ -0,0 +1,316 @@
 +<!-- doc/src/sgml/pgaudit.sgml -->
 +
 +<sect1 id="pgaudit" xreflabel="pgaudit">
 +  <title>pg_audit</title>
 
 There seems to be a number of places which are 'pgaudit' and a bunch
 that are 'pg_audit'.  I'm guessing you were thinking 'pg_audit', but
 it'd be good to clean up and make them all consistent.

Fixed, though I still left the file name as pgaudit.sgml since all but
one module follow that convention.

 +<para>
 +  Session auditing allows the logging of all commands that are executed 
 by
 +  a user in the backend.  Each command is logged with a single entry and
 +  includes the audit type (e.g. <literal>SESSION</literal>), command 
 type
 +  (e.g. <literal>CREATE TABLE</literal>, <literal>SELECT</literal>) and
 +  statement (e.g. <literal>select * from test</literal>).
 +
 +  Fully-qualified names and object types will be logged for
 +  <literal>CREATE</literal>, <literal>UPDATE</literal>, and
 +  <literal>DROP</literal> commands on <literal>TABLE</literal>,
 +  <literal>MATVIEW</literal>, <literal>VIEW</literal>,
 +  <literal>INDEX</literal>, <literal>FOREIGN TABLE</literal>,
 +  <literal>COMPOSITE TYPE</literal>, <literal>INDEX</literal>, and
 +  <literal>SEQUENCE</literal> objects as well as function calls.
 +</para>
 
 Ah, you do have a list of what objects you get fully qualified names
 for, nice.  Are there obvious omissions from that list..?  If so, we
 might be able to change what happens with objectAccess to include them..

It seems like these are the objects where having a name really matters.
 I'm more interested in using the deparse code to handle fully-qualified
names for additional objects rather than adding hooks.

 +<sect3>
 +  <title>Examples</title>
 +
 +  <para>
 +Set <literal>pgaudit.log = 'read, ddl'</literal> in
 +<literal>postgresql.conf</literal>.
 +  </para>
 
 Perhaps I missed it, but it'd be good to point out that GUCs can be set
 at various levels.  I know we probably say that somewhere else, but it's
 particularly relevant for this.

Yes, it's very relevant for this patch.  Do you think it's enough to
call out the functionality, or should I provide examples?  Maybe a
separate section just for this concept?
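
For what it's worth, a sketch of the kind of examples such a section could
show (the database and role names here are invented); like any other setting,
the GUC can be attached at the database, role, or session level:

    ALTER DATABASE mydb SET pg_audit.log = 'read, ddl';
    ALTER ROLE auditor IN DATABASE mydb SET pg_audit.log = 'all';
    SET pg_audit.log = 'ddl';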

Patch v2 is attached for all changes except the doc change above.

-- 
- David Steele
da...@pgmasters.net
diff --git a/contrib/Makefile b/contrib/Makefile
index 195d447..d8e75f4 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
pageinspect \
passwordcheck   \
pg_archivecleanup \
+   pg_audit\
pg_buffercache  \
pg_freespacemap \
pg_prewarm  \
diff --git a/contrib/pg_audit/Makefile b/contrib/pg_audit/Makefile
new file mode 100644
index 000..32bc6d9
--- /dev/null
+++ b/contrib/pg_audit/Makefile
@@ -0,0 +1,20 @@
+# pg_audit/Makefile
+
+MODULE = pg_audit
+MODULE_big = pg_audit
+OBJS = pg_audit.o
+
+EXTENSION = pg_audit
+
+DATA = pg_audit--1.0.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_audit
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_audit/pg_audit--1.0.0.sql 
b/contrib/pg_audit/pg_audit--1.0.0.sql
new file 

Re: [HACKERS] Allow snapshot too old error, to prevent bloat

2015-02-23 Thread Amit Kapila
On Sun, Feb 22, 2015 at 10:18 PM, Kevin Grittner kgri...@ymail.com wrote:

 Amit Kapila amit.kapil...@gmail.com wrote:

  It seems to me that SQL Server also uses similar mechanism to
  avoid the bloat in version store (place to store previous
  versions or record).

  I think if other leading databases provide a way to control the
  bloat, it indicates that most of the customers having
  write-intensive workload would like to see such an option.

 Yeah, I think many users are surprised by the damage that can be
 done to a database by an idle connection or long-running read-only
 query, especially when coming from other databases which have
 protections against that.

 The bad news is that changing to specifying the limit in time
 rather than transaction IDs is far more complicated than I thought.
 Basically, we need to associate times in the past with the value of
 latestCompletedXid (or derived values like snapshot xmin) at those
 times.  I was initially thinking that by adding a timestamp to the
 snapshot we could derive this from active snapshots.  For any one
 connection, it's not that hard for it to scan the pairingheap of
 snapshots and pick out the right values; but the problem is that it
 is process-local memory and this needs to work while the connection
 is sitting idle for long periods of time.

Could you please explain in slightly more detail why it can't work
if we use a timestamp instead of snapshot->xmin in your patch in
the function TestForOldSnapshot()?

 I've got a couple ideas
 on how to approach it, but neither seems entirely satisfying:

 (1) Use a new timer ID to interrupt the process whenever its
 oldest snapshot which hasn't yet crossed the old snapshot
 threshold does so.  The problem is that this seems as though it
 would need to do an awful lot more work (scanning the pairing heap
 for the best match) than anything else is currently doing in a
 timeout handler.  One alternative would be to keep a second pairing
 heap to track snapshots which have not crossed the threshold,
 removing items as they get old -- that would make the work needed
 in the timeout handler closer to other handlers.

 (2) Use a coarse enough granularity on time and a short enough
 maximum for the GUC to just keep a circular buffer of the mappings
 in memory.  We might be able to make this dense enough that one
 minute resolution for up to 60 days could fit in 338kB.  Obviously
 that could be reduced with coarser granularity or a shorter
 maximum.
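
(As a quick sanity check on that figure, assuming one 4-byte xid per
one-minute slot:

    SELECT 60 * 24 * 60              AS one_minute_slots,  -- 86400 slots in 60 days
           60 * 24 * 60 * 4 / 1024.0 AS size_in_kb;        -- ~337.5 kB

which indeed lands at roughly 338kB.)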


How exactly will this allow returning the "snapshot too old" error? Do we
need a check for the page (its LSN) as in your current patch?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] POLA violation with \c service=

2015-02-23 Thread Alvaro Herrera

David Fetter wrote:

 My thinking behind this was that the patch is a bug fix and intended
 to be back-patched, so I wanted to mess with as little infrastructure
 as possible.  A new version of libpq seems like a very big ask for
 such a case.  You'll recall that the original problem was that
 
 \c service=foo
 
 only worked accidentally for some pretty narrow use cases and broke
 without much of a clue for the rest.  It turned out that the general
 problem was that options given to psql on the command line were not
 even remotely equivalent to \c, even though they were documented to
 be.

So, in view of these arguments and those put forward by Pavel
downthread, I think the attached is an acceptable patch for the master
branch.  It doesn't apply to back branches though; 9.4 and 9.3 have a
conflict in tab-complete.c, 9.2 has additional conflicts in command.c,
and 9.1 and 9.0 are problematic all over because they don't have
src/common.  Could you please submit patches adapted for each group of
branches?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Peter Geoghegan
On Mon, Feb 23, 2015 at 11:17 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 So while it's true that for the 3rd query we get much worse results
 compared to the other queries (i.e. we don't get 400% speedup but ~3%
 slowdown compared to master), it's true that master performs
 exceptionally well for this query with small datasets. Once we get to 2M
 rows, the master performance drops significantly but cost-model keeps
 the performance characteristics and the speedup jumps back to ~700%
 which is nice.

 These numbers are for the 'ASC + unsorted row' test, but I do get
 exactly the same pattern for the 'random' tests done previously.

Yeah. Looks like you're comparing a case where the old cost model did
the right thing anyway (i.e. used abbreviation). The difference would
then be entirely explainable as noise. Right?

 It would be nice if we could address the 3% regression for the last
 query, but I guess it's not a big deal. The numbers in general are
 absolutely impressive. Kudos.

Thanks.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POLA violation with \c service=

2015-02-23 Thread Alvaro Herrera
Here's the real attachment.  Previous one was a misguided shell
redirection.  Meh.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 830d41b9d23716af29491cbab59808c35e25ec12 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Fri, 20 Feb 2015 14:36:41 -0300
Subject: [PATCH] Fix psql's \c to work with URIs and conninfo strings
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This makes it more consistent with what can be given in the command
line.

Authors: Andrew Dunstan, David Fetter, Andrew Gierth, Pavel Stěhule
---
 doc/src/sgml/ref/psql-ref.sgml| 30 ++---
 src/bin/psql/command.c| 69 +++
 src/bin/psql/tab-complete.c   | 20 ++--
 src/common/Makefile   |  2 +-
 src/common/connstrings.c  | 52 +
 src/include/common/connstrings.h  | 16 +
 src/interfaces/libpq/fe-connect.c | 36 +---
 7 files changed, 162 insertions(+), 63 deletions(-)
 create mode 100644 src/common/connstrings.c
 create mode 100644 src/include/common/connstrings.h

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index a637001..a218af8 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -796,18 +796,23 @@ testdb=gt;
   /varlistentry
 
   varlistentry
-termliteral\c/literal or literal\connect/literal literal[ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ]/literal/term
+termliteral\c   [ { [ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ] | replaceable class=parameterconninfo/replaceable string } ] /literal/term
+termliteral\connect [ { [ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ] | replaceable class=parameterconninfo/replaceable string } ] /literal/term
 listitem
 para
 Establishes a new connection to a productnamePostgreSQL/
-server. If the new connection is successfully made, the
-previous connection is closed. If any of replaceable
+server using positional parameters as described below, a
+parameterconninfo/parameter string, or a acronymURI/acronym.
+If the new connection is successfully made, the
+previous connection is closed.
+When using positional parameters, if any of replaceable
 class=parameterdbname/replaceable, replaceable
 class=parameterusername/replaceable, replaceable
 class=parameterhost/replaceable or replaceable
 class=parameterport/replaceable are omitted or specified
 as literal-/literal, the value of that parameter from the
-previous connection is used. If there is no previous
+previous connection is used.
+If using positional parameters and there is no previous
 connection, the applicationlibpq/application default for
 the parameter's value is used.
 /para
@@ -822,6 +827,23 @@ testdb=gt;
 mechanism that scripts are not accidentally acting on the
 wrong database on the other hand.
 /para
+
+para
+Positional syntax:
+programlisting
+=gt; \c mydb myuser host.dom 6432
+/programlisting
+/para
+
+para
+The conninfo form detailed in xref linkend=LIBPQ-CONNSTRING
+takes two forms: keyword-value pairs, and URIs:
+/para
+programlisting
+=gt; \c service=foo
+=gt; \c host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable
+=gt; \c postgresql://tom@localhost/mydb?application_name=myapp
+/programlisting
 /listitem
   /varlistentry
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 7c9f28d..7d8d5bb 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1607,6 +1607,8 @@ do_connect(char *dbname, char *user, char *host, char *port)
 	PGconn	   *o_conn = pset.db,
 			   *n_conn;
 	char	   *password = NULL;
+	bool		keep_password = true;
+	bool		has_connection_string = false;
 
 	if (!o_conn && (!dbname || !user || !host || !port))
 	{
@@ -1620,15 +1622,31 @@ do_connect(char *dbname, char *user, char *host, char *port)
 		return false;
 	}
 
-	if (!dbname)
-		dbname = PQdb(o_conn);
 	if (!user)
 		user = PQuser(o_conn);
+
 	if (!host)
 		host = PQhost(o_conn);
+
 	if (!port)
 		port = PQport(o_conn);
 
+	if (dbname)
+		has_connection_string = recognized_connection_string(dbname);
+
+	keep_password =
+		

Re: [HACKERS] [RFC] LSN Map

2015-02-23 Thread Heikki Linnakangas

On 01/13/2015 01:22 PM, Marco Nenciarini wrote:

Il 08/01/15 20:18, Jim Nasby ha scritto:

On 1/7/15, 3:50 AM, Marco Nenciarini wrote:

The current implementation tracks only heap LSN. It currently does not
track any kind of indexes, but this can be easily added later.


Would it make sense to do this at a buffer level, instead of at the heap
level? That means it would handle both heap and indexes.
  I don't know if LSN is visible that far down though.


Where exactly you are thinking to handle it?


Dunno, but Jim's got a point. This is a maintenance burden to all 
indexams, if they all have to remember to update the LSN map separately. 
It needs to be done in some common code, like in PageSetLSN or 
XLogInsert or something.


Aside from that, isn't this horrible from a performance point of view? 
The patch doubles the buffer manager traffic, because any update to any 
page will also need to modify the LSN map. This code is copied from the 
visibility map code, but we got away with it there because the VM only 
needs to be updated the first time a page is modified. Subsequent 
updates will know the visibility bit is already cleared, and don't need 
to access the visibility map.


And scalability: Whether you store one value for every N pages, or the 
LSN of every page, this is going to have a huge effect of focusing 
contention to the LSN pages. Currently, if ten backends operate on ten 
different heap pages, for example, they can run in parallel. There will 
be some contention on the WAL insertions (much less in 9.4 than before). 
But with this patch, they will all fight for the exclusive lock on the 
single LSN map page.


You'll need to find a way to not update the LSN map on every update. For 
example, only update the LSN page on the first update after a checkpoint 
(although that would still have a big contention focusing effect right 
after a checkpoint).


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json_populate_record issue - TupleDesc reference leak

2015-02-23 Thread Pavel Stehule
by the way - this feature is undocumented - I thought the value was only
used as a type holder and its contents were not used.

It should be documented better - if I understand correctly, it is the
foundation for implementing the #= hstore operator

some nice example

postgres=# select json_populate_record('(10,20)'::pt, '{"a":30}');
 json_populate_record
--
 (30,20)
(1 row)

The mentioned bug is a corner case - the bugfix is simple:

diff --git a/src/backend/utils/adt/jsonfuncs.c
b/src/backend/utils/adt/jsonfuncs.c
new file mode 100644
index a8cdeaa..6e83f78
*** a/src/backend/utils/adt/jsonfuncs.c
--- b/src/backend/utils/adt/jsonfuncs.c
*** populate_record_worker(FunctionCallInfo
*** 2114,2119 
--- 2114,2122 
 */
if (hash_get_num_entries(json_hash) == 0 && rec)
{
+   if (have_record_arg)
+   ReleaseTupleDesc(tupdesc);
+
hash_destroy(json_hash);
PG_RETURN_POINTER(rec);
}




2015-02-23 18:27 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:

 Hi

 When I tested json_populate_function in test
 http://stackoverflow.com/questions/7711432/how-to-set-value-of-composite-variable-field-using-dynamic-sql/28673097#28673097
 I found a small issue

 create type pt as (a int, b int);

 postgres=# select json_populate_record('(10,20)'::pt, '{}');
 WARNING:  TupleDesc reference leak: TupleDesc 0x7f10fcf41400 (567018,-1)
 still referenced
  json_populate_record
 --
  (10,20)
 (1 row)

 jsonb is ok

 postgres=# select jsonb_populate_record('(10,20)'::pt, '{}');
  jsonb_populate_record
 ---
  (10,20)
 (1 row)

 Regards

 Pavel




Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Peter Geoghegan
On Mon, Feb 23, 2015 at 8:40 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 The durations are much higher than without the single unsorted row added
 at the end. Queries often take 20x longer to finish (on the same code),
 depending on the scale.

I knew this would happen.  :-)

 What's interesting here is that some queries are much faster, but query
 #3 is slow until we hit 2M rows:

 select * from (select * from stuff_int_desc order by randint
offset 1000) foo

Not sure why that would be. Can you see what a #define
DEBUG_ABBREV_KEYS (within varlena.c) build shows happens with the
cost model? Enable debug1 output for that.

 Looking at the previous tests, I see this is exactly what's happening to
 this query with 'random' dataset - it's slightly slower than master up
 until 2M rows, when it suddenly jumps to the same speedup as the other
 queries. Can we do something about that?

Possibly.

 Anyway, I'm wondering what conclusion we can do from this? I believe
 vast majority of datasets in production won't be perfectly sorted,
 because when the table is CLUSTERed by index we tend to use index scan
 to do the sort (so no problem), or the data are not actually perfectly
 sorted (and here we get significant speedup).

One conclusion might be that commit a3f0b3d6 should not have added the
pre-check for perfectly sorted input, which is not in the Bentley &
Mcilroy original - exploiting the fact that the set is already
partially sorted is a fine thing, but we're not doing the necessary
bookkeeping. But that's not really why I suggested that test case.
Rather, I wanted to put the possible regression in the event of
perfectly sorted input in perspective. Maybe that regression isn't
worth worrying about, if in the event of a single tuple being out of
order at the end of the set the outcome dramatically shifts to
strongly favor abbreviation. Another way of looking at it would be
that the pre-sorted case is an argument against strxfrm() sorting in
general more than abbreviated keys in particular, an argument that
seems new and strange to me. The analysis of algorithms focuses on the
worst case and average case for a reason, and (say) the glibc
documentation never considers this when discussing strxfrm().

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mogrify and indent features for jsonb

2015-02-23 Thread Thom Brown
On 15 February 2015 at 03:06, Andrew Dunstan and...@dunslane.net wrote:


 Attached is a patch to provide a number of very useful facilities to jsonb
 that people have asked for. These are based on work by Dmitry Dolgov in his
 jsonbx extension, but I take responsibility for any bugs.

 The facilities are:

 new operations:

 concatenation:    jsonb || jsonb -> jsonb
 deletion:         jsonb - text -> jsonb
 deletion:         jsonb - int -> text

 new functions:

 produce indented text: jsonb_indent(jsonb) -> text
 change an element at a path:  jsonb_replace(jsonb, text[], jsonb) -> jsonb.


 It would be relatively trivial to add:

 delete an element at a path: jsonb_delete(jsonb, text[]) -> json


Would this support deleting "type" and the value 'dd' from the following?:

{"a": 1, "b": 2, "c": {"type": "json", "stuff": "test"}, "d":
["aa","bb","cc","dd"]}
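
Purely as a sketch of what that could look like, if the path-based
jsonb_delete(jsonb, text[]) mentioned above materialises (hypothetical
semantics: keys addressed by name, array elements by zero-based index):

    -- hypothetical path-based form, not part of the posted patch
    SELECT jsonb_delete(j, '{c,type}'),   -- drop the nested "type" key
           jsonb_delete(j, '{d,3}')       -- drop "dd" by its array position
    FROM (SELECT '{"a": 1, "b": 2, "c": {"type": "json", "stuff": "test"},
                   "d": ["aa","bb","cc","dd"]}'::jsonb AS j) s;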



 and I think we should do that for the sake of completeness.

 The docs might need a little extra work, and the indent code definitely
 needs work, which I hope to complete in the next day or two, but I wanted
 to put a stake in the ground.


This is high on my wanted list, so thanks for working on this.

Seems to work well for me with a few tests.

Is there a way to take the json:

'{"a": 1, "b": 2, "c": {"type": "json", "stuff": "test"}, "d":
["aa","bb","cc","dd"]}'

and add "ee" to "d" without replacing it?  I can think of ways of doing it
currently, but it's very convoluted just for pushing a value to an array.
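
One possible spelling with the proposed pieces, assuming || concatenates
arrays the way the jsonbx extension does (a sketch, not tested against the
patch):

    -- assumes the patch's jsonb_replace() and || operator
    SELECT jsonb_replace(j, '{d}', (j -> 'd') || '["ee"]'::jsonb)
    FROM (SELECT '{"a": 1, "d": ["aa","bb","cc","dd"]}'::jsonb AS j) s;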

Also, are there any plans to support the following?:

jsonb - text[] # Provide list of keys to delete in array
jsonb - jsonb # Deduplicate key:value pairs
jsonb & jsonb # Return overlapping jsonb (opposite of jsonb - jsonb)

Thanks

Thom


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Tomas Vondra
On 23.2.2015 19:22, Peter Geoghegan wrote:
 On Mon, Feb 23, 2015 at 8:40 AM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 The durations are much higher than without the single unsorted row added
 at the end. Queries often take 20x longer to finish (on the same code),
 depending on the scale.
 
 I knew this would happen.  :-)
 
 What's interesting here is that some queries are much faster, but query
 #3 is slow until we hit 2M rows:

 select * from (select * from stuff_int_desc order by randint
offset 1000) foo
 
 Not sure why that would be. Can you see what a #define 
 DEBUG_ABBREV_KEYS (within varlena.c) build shows happens with the 
 cost model? Enable debug1 output for that.

test=# select * from (select * from stuff_text_asc order by randtxt
offset 1000) foo;
DEBUG:  abbrev_distinct after 160: 152.862464 (key_distinct: 155.187096,
norm_abbrev_card: 0.955390, prop_card: 0.20)
DEBUG:  abbrev_distinct after 320: 314.784336 (key_distinct: 313.425344,
norm_abbrev_card: 0.983701, prop_card: 0.20)
DEBUG:  abbrev_distinct after 640: 629.050490 (key_distinct: 643.945298,
norm_abbrev_card: 0.982891, prop_card: 0.20)
DEBUG:  abbrev_distinct after 1280: 1194.271440 (key_distinct:
1257.153875, norm_abbrev_card: 0.933025, prop_card: 0.20)
DEBUG:  abbrev_distinct after 2560: 2402.820431 (key_distinct:
2631.772790, norm_abbrev_card: 0.938602, prop_card: 0.20)
DEBUG:  abbrev_distinct after 5120: 4490.142750 (key_distinct:
5261.865309, norm_abbrev_card: 0.876981, prop_card: 0.20)
DEBUG:  abbrev_distinct after 10240: 8000.884171 (key_distinct:
10123.941172, norm_abbrev_card: 0.781336, prop_card: 0.20)
DEBUG:  abbrev_distinct after 20480: 13596.546899 (key_distinct:
20563.364696, norm_abbrev_card: 0.663894, prop_card: 0.13)
DEBUG:  abbrev_distinct after 40960: 23369.899021 (key_distinct:
40500.600680, norm_abbrev_card: 0.570554, prop_card: 0.084500)
DEBUG:  abbrev_distinct after 81920: 30884.544030 (key_distinct:
81466.848436, norm_abbrev_card: 0.377009, prop_card: 0.054925)
 randtxt
-
(0 rows)

 Looking at the previous tests, I see this is exactly what's
 happening to this query with 'random' dataset - it's slightly
 slower than master up until 2M rows, when it suddenly jumps to the
 same speedup as the other queries. Can we do something about that?
 
 Possibly.
 
 Anyway, I'm wondering what conclusion we can do from this? I believe
 vast majority of datasets in production won't be perfectly sorted,
 because when the table is CLUSTERed by index we tend to use index scan
 to do the sort (so no problem), or the data are not actually perfectly
 sorted (and here we get significant speedup).
 
 One conclusion might be that commit a3f0b3d6 should not have added the
 pre-check for perfectly sorted input, which is not in the Bentley &
 Mcilroy original - exploiting the fact that the set is already
 partially sorted is a fine thing, but we're not doing the necessary

Would that pre-check hurt even the random test case? I can imagine it
might behave strangely with almost perfectly sorted data (when only the
very last row is unsorted), but not for random data (but see below).

 bookkeeping. But that's not really why I suggested that test case.
 Rather, I wanted to put the possible regression in the event of
 perfectly sorted input in perspective. Maybe that regression isn't
 worth worrying about, if in the event of a single tuple being out of
 order at the end of the set the outcome dramatically shifts to
 strongly favor abbreviation. Another way of looking at it would be
 that the pre-sorted case is an argument against strxfrm() sorting in
 general more than abbreviated keys in particular, an argument that
 seems new and strange to me. The analysis of algorithms focuses on the
 worst case and average case for a reason, and (say) the glibc
 documentation never considers this when discussing strxfrm().

Yeah, every algorithm has a worst case - either you get a very fast
execution on average with a few rare cases when it's much slower, or you
get very bad average performance.

But after looking at the data a bit more, I actually think this is a red
herring. After looking at the actual runtimes (and not just speedups), I
get this:

  scale   query#   master   cost-model   speedup
     10        1      884          103      856%
    100        1    12153         1506      807%
    200        1    26187         3894      673%
    300        1    38147         6601      578%
    400        1    56332        10976      513%
    500        1    77009        16409      469%

     10        2      883          110      805%
    100        2    12114         1574      770%
    200        2    26104         4038      646%
    300        2    38028         6828      557%
    400        2    56138        11294      497%
    500        2    76720        16829

[HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)

2015-02-23 Thread Alvaro Herrera
I found that the OBJECT_ATTRIBUTE symbol is useless.  I can just remove
it and replace it with OBJECT_COLUMN, and everything continues to work;
no test fails that I can find.

I thought we had a prohibition against ALTER TABLE when used on
composites, but it's not as severe as I thought.  The following commands
fail in master:

ALTER TABLE comptype RENAME TO comptype2;   -- HINT: Use ALTER TYPE
ALTER TABLE comptype SET SCHEMA sch;-- HINT: Use ALTER TYPE

However, the following command works in master:

ALTER TABLE comptype RENAME COLUMN a TO b;
and has the same effect as this:
ALTER TYPE comptype RENAME ATTRIBUTE a TO b;
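
A quick way to see this for yourself (sketch):

CREATE TYPE comptype AS (a int);
ALTER TABLE comptype RENAME COLUMN a TO b;    -- accepted, even though comptype is a type
ALTER TYPE comptype RENAME ATTRIBUTE b TO c;  -- the equivalent ALTER TYPE spelling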

The RENAME ATTRIBUTE case in RenameStmt is the only thing currently
using OBJECT_ATTRIBUTE; therefore, since in precisely that case we do
not prohibit using ALTER TABLE, we can just remove OBJECT_ATTRIBUTE
completely.  That leads to the attached patch, which changes no test
result at all.

This symbol was added in

commit e440e12c562432a2a695b8054964fb34e3bd823e
Author: Peter Eisentraut pete...@gmx.net
Date:   Sun Sep 26 14:41:03 2010 +0300

Add ALTER TYPE ... ADD/DROP/ALTER/RENAME ATTRIBUTE

Like with tables, this also requires allowing the existence of
composite types with zero attributes.

reviewed by KaiGai Kohei

Thoughts?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index d899dd7..1da38c0 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -1591,7 +1591,6 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address,
 			break;
 		case OBJECT_TYPE:
 		case OBJECT_DOMAIN:
-		case OBJECT_ATTRIBUTE:
 		case OBJECT_DOMCONSTRAINT:
 			if (!pg_type_ownercheck(address.objectId, roleid))
 aclcheck_error_type(ACLCHECK_NOT_OWNER, address.objectId);
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 78b54b4..7f85fa4 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -330,7 +330,6 @@ ExecRenameStmt(RenameStmt *stmt)
 			return RenameRelation(stmt);
 
 		case OBJECT_COLUMN:
-		case OBJECT_ATTRIBUTE:
 			return renameatt(stmt);
 
 		case OBJECT_RULE:
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index dcf5b98..d14346f 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -1056,7 +1056,6 @@ EventTriggerSupportsObjectType(ObjectType obtype)
 			/* no support for event triggers on event triggers */
 			return false;
 		case OBJECT_AGGREGATE:
-		case OBJECT_ATTRIBUTE:
 		case OBJECT_CAST:
 		case OBJECT_COLUMN:
 		case OBJECT_COLLATION:
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6c21002..8ad933d 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -7848,7 +7848,7 @@ RenameStmt: ALTER AGGREGATE func_name aggr_args RENAME TO name
 			| ALTER TYPE_P any_name RENAME ATTRIBUTE name TO name opt_drop_behavior
 {
 	RenameStmt *n = makeNode(RenameStmt);
-	n->renameType = OBJECT_ATTRIBUTE;
+	n->renameType = OBJECT_COLUMN;
 	n->relationType = OBJECT_TYPE;
 	n->relation = makeRangeVarFromAnyName($3, @3, yyscanner);
 	n->subname = $6;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 6d26986..2b6fc3e 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1615,9 +1615,6 @@ AlterObjectTypeCommandTag(ObjectType objtype)
 		case OBJECT_AGGREGATE:
 			tag = ALTER AGGREGATE;
 			break;
-		case OBJECT_ATTRIBUTE:
-			tag = ALTER TYPE;
-			break;
 		case OBJECT_CAST:
 			tag = ALTER CAST;
 			break;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ac13302..09607eb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1212,7 +1212,6 @@ typedef struct SetOperationStmt
 typedef enum ObjectType
 {
 	OBJECT_AGGREGATE,
-	OBJECT_ATTRIBUTE,			/* type's attribute, when distinct from column */
 	OBJECT_CAST,
 	OBJECT_COLUMN,
 	OBJECT_COLLATION,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Precedence of NOT LIKE, NOT BETWEEN, etc

2015-02-23 Thread Tom Lane
While fooling around with testing operator precedence warnings,
I discovered that there's some existing precedence behavior that's
not at all what I expected.  Consider these examples (all done in
9.4; this is longstanding behavior):

regression=# SELECT 1 < '(3,4)'::point LIKE 42.0;
ERROR:  operator does not exist: point ~~ numeric
LINE 1: SELECT 1 < '(3,4)'::point LIKE 42.0;
  ^

regression=# SELECT 1 < '(3,4)'::point NOT LIKE 42.0;
ERROR:  operator does not exist: integer < point
LINE 1: SELECT 1 < '(3,4)'::point NOT LIKE 42.0;
 ^

regression=# SELECT 1 LIKE '(3,4)'::point < 42.0;
ERROR:  operator does not exist: integer ~~ point
LINE 1: SELECT 1 LIKE '(3,4)'::point < 42.0;
 ^

regression=# SELECT 1 NOT LIKE '(3,4)'::point < 42.0;
ERROR:  operator does not exist: integer !~~ point
LINE 1: SELECT 1 NOT LIKE '(3,4)'::point < 42.0;
 ^

These queries are silly of course, the point is just to expose which
operator the parser thinks binds more tightly.  And what we have above is:
LIKE binds more tightly than <, in either combination.  NOT LIKE binds
less tightly than < if < is on the left, but more tightly than < if <
is on the right.
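
In other words, judging from the error messages, the four statements are
effectively being parenthesized as:

SELECT 1 < ('(3,4)'::point LIKE 42.0);       -- LIKE wins over <
SELECT (1 < '(3,4)'::point) NOT LIKE 42.0;   -- < wins over NOT LIKE
SELECT (1 LIKE '(3,4)'::point) < 42.0;       -- LIKE wins over <
SELECT (1 NOT LIKE '(3,4)'::point) < 42.0;   -- NOT LIKE wins over <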

This seems like a bug; it's certainly not what you'd call intuitive.

The cause, if I'm understanding the Bison manual correctly, is that
the NOT token has lower precedence than '<', so at the critical point
where it has scanned "1 < '(3,4)'::point" and has to either reduce that
(thus binding < more tightly) or shift, it looks at the lookahead token
which is either LIKE or NOT in the first two examples above.  LIKE has
higher priority than '<' so in the first example it shifts; but NOT
has lower priority than '<' so in the second example it reduces.

The other productions starting with a_expr NOT ... also have the
same problem; that includes NOT ILIKE, NOT SIMILAR TO, NOT BETWEEN,
and NOT IN.

I'm not seeing any terribly pleasing ways to fix this.  Aside from
the option of doing nothing, it seems like these are the choices:

1. We could hack base_yylex() to reduce NOT LIKE to a single token
which could be given the same precedence as LIKE.  Ditto for the other
four cases.  This would result in nice behavior for these cases, but as
Robert has correctly pointed out, hacks in base_yylex() risk creating
other poorly-understood behaviors.

2. We could change the precedence levels so that LIKE/ILIKE/SIMILAR,
BETWEEN, and IN all have precedence just above NOT, which would pretty
much mask the problem since there would be no other tokens with in-between
precedence levels.  In the context of the operator precedence changes
I proposed earlier, this would mean inserting the IS tests and comparison
operators between IN_P and POSTFIXOP rather than where the
single-character comparison ops live now.  We would likely also have to
flatten LIKE/BETWEEN/IN into a single %nonassoc precedence level to avoid
having weird interactions among them.  This isn't terribly attractive
because it would risk larger behavioral changes than the previous
proposal.

Thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reduce pinning in btree indexes

2015-02-23 Thread Heikki Linnakangas

On 02/15/2015 02:19 AM, Kevin Grittner wrote:

Interestingly, the btree README points out that using the old TID
with a new tuple poses no hazard for a scan using an MVCC snapshot,
because the new tuple would not be visible to a snapshot created
that long ago.


The first question is: Do we really need that interlock for the non-MVCC 
snapshots either?


If we do: For non-MVCC snapshots, we need to ensure that all index scans 
that started before the VACUUM did complete before the VACUUM does. I 
wonder if we could find a different mechanism to enforce that. Using the 
pin-interlock for that has always seemed a bit silly to me. Perhaps grab 
a new heavy-weight lock on the table whenever a non-MVCC index scan on 
the table begins, and have VACUUM wait on it.



I found that the LP_DEAD hinting
would be a problem with an old TID, but I figured we could work
around that by storing the page LSN into the scan position structure
when the page contents were read, and only doing hinting if that
matched when we were ready to do the hinting.  That wouldn't work
for an index which was not WAL-logged, so the patch still holds
pins for those.


Or you could use GetFakeLSNForUnloggedRel().


Robert pointed out that the visibility information
for an index-only scan wasn't checked while the index page READ
lock was held, so those scans also still hold the pins.


Why does an index-only scan need to hold the pin?


Finally, there was an "optimization" for marking buffer position
for possible restore that was incompatible with releasing the pin.
I use quotes because the "optimization" adds overhead to every move
to the next page in order set some variables in a structure when a
mark is requested instead of running two fairly small memcpy()
statements.  The two-day benchmark of the customer showed no
performance hit, and looking at the code I would be amazed if the
optimization yielded a measurable benefit.  In general, optimization
by adding overhead to moving through a scan to save time in a mark
operation seems dubious.


Hmm. Did your test case actually exercise mark/restore? The memcpy()s 
are not that small. Especially if it's an index-only scan, you're 
copying a large portion of the page. Some scans call markpos on every 
tuple, so that could add up.



At some point we could consider building on this patch to recheck
index conditions for heap access when a non-MVCC snapshot is used,
check the visibility map for referenced heap pages when the TIDs
are read for an index-only scan, and/or skip LP_DEAD hinting for
non-WAL-logged indexes.  But all those are speculative future work;
this is a conservative implementation that just didn't modify
pinning where there were any confounding factors.


Understood. Still, I'd like to see if we can easily get rid of the 
pinning altogether.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Precedence of NOT LIKE, NOT BETWEEN, etc

2015-02-23 Thread Tom Lane
I wrote:
 I'm not seeing any terribly pleasing ways to fix this.  Aside from
 the option of doing nothing, it seems like these are the choices:

 1. We could hack base_yylex() to reduce NOT LIKE to a single token
 which could be given the same precedence as LIKE.  Ditto for the other
 four cases.  This would result in nice behavior for these cases, but as
 Robert has correctly pointed out, hacks in base_yylex() risk creating
 other poorly-understood behaviors.

 2. We could change the precedence levels so that LIKE/ILIKE/SIMILAR,
 BETWEEN, and IN all have precedence just above NOT, which would pretty
 much mask the problem since there would be no other tokens with in-between
 precedence levels.  In the context of the operator precedence changes
 I proposed earlier, this would mean inserting the IS tests and comparison
 operators between IN_P and POSTFIXOP rather than where the
 single-character comparison ops live now.  We would likely also have to
 flatten LIKE/BETWEEN/IN into a single %nonassoc precedence level to avoid
 having weird interactions among them.  This isn't terribly attractive
 because it would risk larger behavioral changes than the previous
 proposal.

I thought of another possibility:

3. Leave everything as-is but mark the NOT-operator productions as having
the precedence of NOT rather than of LIKE etc.  This would change the
behavior only for the NOT-LIKE-followed-by- example, and would make the
two cases for NOT LIKE consistent though they'd remain inconsistent with
LIKE.  This behavior seems at least somewhat explainable/documentable
(NOT-foo operators have the precedence of NOT), whereas what we have
seems about impossible to justify.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-23 Thread Rahila Syed
Hello,

Attached is a patch with the following changes:

As suggested above, the block ID in the xlog structs has been replaced by a
chunk ID. The chunk ID is used to distinguish between different types of xlog
record fragments, such as:
XLR_CHUNK_ID_DATA_SHORT
XLR_CHUNK_ID_DATA_LONG
XLR_CHUNK_BKP_COMPRESSED
XLR_CHUNK_BKP_WITH_HOLE

In block references, the block ID follows the chunk ID and retains its
original functionality. This approach increases the data by 1 byte for each
block reference in an xlog record, but it separates the IDs that refer to
different fragments of an xlog record from the actual block ID, which is used
to refer to block references within the record.

Following are WAL numbers for each scenario,

                       WAL
FPW compression on     121.652 MB
FPW compression off    148.998 MB
HEAD                   148.764 MB

Compression remains nearly the same as before. There is some difference in WAL
between HEAD and HEAD+patch+compression OFF; this difference corresponds to the
1-byte increase with each block reference of an xlog record.

Thank you,
Rahila Syed





On Wed, Feb 18, 2015 at 7:53 PM, Syed, Rahila rahila.s...@nttdata.com
wrote:

 Hello,

 I think we should change the xlog format so that the block_id (which
 currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the
 block id but something like XLR_CHUNK_ID. Which is used as is for
 XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to
 XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
 XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the
 block id following the chunk id.

 Yes, that'll increase the amount of data for a backup block by 1 byte,
 but I think that's worth it. I'm pretty sure we will be happy about the
 added extensibility pretty soon.

 To clarify my understanding of the above change,

 Instead of a block id to reference different fragments of an xlog record,
 a single byte field "chunk_id" should be used.  "chunk_id" will be the same as
 XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments.
 But for block references, it will store the following values in order to
 store information about the backup blocks.
 #define XLR_CHUNK_BKP_COMPRESSED  0x01
 #define XLR_CHUNK_BKP_WITH_HOLE 0x02
 ...

 The new xlog format should look as follows:

 Fixed-size header (XLogRecord struct)
 Chunk_id(add a field before id field in XLogRecordBlockHeader struct)
 XLogRecordBlockHeader
 Chunk_id
  XLogRecordBlockHeader
 ...
 ...
 Chunk_id ( rename id field of the XLogRecordDataHeader struct)
 XLogRecordDataHeader[Short|Long]
  block data
  block data
  ...
  main data

 I will post a patch based on this.

 Thank you,
 Rahila Syed

 -Original Message-
 From: Andres Freund [mailto:and...@2ndquadrant.com]
 Sent: Monday, February 16, 2015 5:26 PM
 To: Syed, Rahila
 Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists
 Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

 On 2015-02-16 11:30:20 +, Syed, Rahila wrote:
  - * As a trivial form of data compression, the XLOG code is aware that
  - * PG data pages usually contain an unused hole in the middle,
  which
  - * contains only zero bytes.  If hole_length > 0 then we have removed
  - * such a hole from the stored data (and it's not counted in the
  - * XLOG record's CRC, either).  Hence, the amount of block data
  actually
  - * present is BLCKSZ - hole_length bytes.
  + * Block images are able to do several types of compression:
  + * - When wal_compression is off, as a trivial form of compression,
  + the
  + * XLOG code is aware that PG data pages usually contain an unused
 hole
  + * in the middle, which contains only zero bytes.  If length < BLCKSZ
  + * then we have removed such a hole from the stored data (and it is
  + * not counted in the XLOG record's CRC, either).  Hence, the amount
  + * of block data actually present is length bytes.  The hole offset
  + * on page is defined using hole_offset.
  + * - When wal_compression is on, block images are compressed using a
  + * compression algorithm without their hole to improve compression
  + * process of the page. length corresponds in this case to the
  + length
  + * of the compressed block. hole_offset is the hole offset of the
  + page,
  + * and the length of the uncompressed block is defined by
  + raw_length,
  + * whose data is included in the record only when compression is
  + enabled
  + * and with_hole is set to true, see below.
  + *
  + * is_compressed is used to identify if a given block image is
  + compressed
  + * or not. Maximum page size allowed on the system being 32k, the
  + hole
  + * offset cannot be more than 15-bit long so the last free bit is
  + used to
  + * store the compression state of block image. If the maximum page
  + size
  + * allowed is increased to a value higher than that, we should
  + consider
  + * increasing this structure size as well, 

Re: [HACKERS] pg_dump gets attributes from tables in extensions

2015-02-23 Thread Rushabh Lathia
Thanks to the handy test case, I was able to replicate the scenario in my
local environment. And yes, the tbinfo->dobj.ext_member check in
getTableAttrs() does fix the issue.

Looking more into the pg_dump code, I found that PG generally doesn't use a
tbinfo->dobj.ext_member check to ignore an object. Mostly we do this kind of
check using tbinfo->dobj.dump (look at dumpTable() for reference). Do you
have any particular reason for choosing dobj.ext_member over dobj.dump?

On Fri, Feb 20, 2015 at 12:20 PM, Michael Paquier michael.paqu...@gmail.com
 wrote:

 On Fri, Feb 20, 2015 at 5:33 AM, Peter Eisentraut pete...@gmx.net wrote:
  On 2/16/15 2:45 AM, Michael Paquier wrote:
  While looking at the patch to fix pg_dump with extensions containing
  tables referencing each other, I got surprised by the fact that
  getTableAttrs tries to dump table attributes even for tables that are
  part of an extension. Is that normal?
  Attached is a patch that I think makes things right, but not dumping any
  tables that are part of ext_member.
 
  Can you provide an example/test case?  (e.g., which publicly available
  extension contains tables with attributes?)

 Sure. Attached is a simplified version of the extension I used for the
 other patch on pg_dump.
 $ psql -c 'create extension dump_test'
 CREATE EXTENSION
 $ psql -At -c '\dx+ dump_test'
 table aa_tab_fkey
 table bb_tab_fkey
 $ pg_dump -v 21 | grep columns and types
 pg_dump: finding the columns and types of table public.bb_tab_fkey
 pg_dump: finding the columns and types of table public.aa_tab_fkey
 --
 Michael


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Rushabh Lathia


Re: [HACKERS] __attribute__ for non-gcc compilers

2015-02-23 Thread Oskari Saarenmaa
23.02.2015, 04:31, Robert Haas kirjoitti:
 On Tue, Feb 17, 2015 at 8:41 AM, Oskari Saarenmaa o...@ohmu.fi wrote:
 15.01.2015, 21:58, Robert Haas kirjoitti:
 On Wed, Jan 14, 2015 at 5:54 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 I think I'd for now simply not define pg_attribute_aligned() on
 platforms where it's not supported, instead of defining it empty. If we
 need a softer variant we can name it pg_attribute_aligned_if_possible or
 something.

 Sounds sane?

 Yes, that sounds like a much better plan.

 Attached an updated patch rebased on today's git master that never
 defines aligned or packed empty.

 This is also included in the current commitfest,
 https://commitfest.postgresql.org/4/115/
 
 Is this going to play nicely with pgindent?

I ran pgindent on the tree with this patch applied (with a few changes,
pgindent modified atomics headers a bit) and the changes looked ok to
me, mostly pgindent just rewrapped lines like this:

-extern void quickdie(SIGNAL_ARGS) pg_attribute_noreturn;
+extern void
+quickdie(SIGNAL_ARGS) pg_attribute_noreturn;


but there were two cases where it produced a bit weird indentation:

 #ifdef __arm__
-pg_attribute_packed			/* Appropriate whack upside the head for ARM */
+   pg_attribute_packed /* Appropriate whack upside the head for ARM */
 #endif
 ItemPointerData;

and

 void
-pg_attribute_noreturn
+   pg_attribute_noreturn
 plpgsql_yyerror(const char *message)
 {
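
For context, the macro arrangement being discussed is roughly the following
(a simplified sketch under assumed compiler checks, not the actual patch):

/* Define the wrappers only where the compiler supports the attribute;
 * deliberately provide no empty fallback for aligned/packed, so that
 * unsupported platforms fail at compile time instead of silently losing
 * the alignment guarantee. */
#if defined(__GNUC__)
#define pg_attribute_aligned(a)	__attribute__((aligned(a)))
#define pg_attribute_packed		__attribute__((packed))
#define pg_attribute_noreturn	__attribute__((noreturn))
#else
#define pg_attribute_noreturn	/* harmless when empty */
/* no pg_attribute_aligned / pg_attribute_packed defined here */
#endif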


/ Oskari


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] deparsing utility commands

2015-02-23 Thread Stephen Frost
Andres,

* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2015-02-21 14:51:32 -0500, Stephen Frost wrote:
  It'd be *really* nice to be able to pass an object identifier to some
  function and get back the CREATE (in particular, though perhaps DROP, or
  whatever) command for it.  This gets us *awful* close to that without
  actually giving it to us and that's bugging me.  The issue is the
  parsetree, which I understand may be required in some cases (ALTER,
  primarily, it seems?), but isn't always.
  
  The USING and WITH CHECK quals can be extracted from the polForm also,
  of course, it's what psql and pg_dump are doing, after all.
  
  So, why depend on the parsetree?  What about have another argument which
  a user could use without the parsetree, for things that it's absolutely
  required for today (eg: ALTER sub-command differentiation)?
 
 I'm really not wild about pretty massively expanding the scope at the
 eleventh hour.  There's a fair number of commands where this the
 deparsing command will give you a bunch of commands - look at CREATE
 TABLE and CREATE SCHEMA ... for the most extreme examples. For those
 there's no chance to do that without the parse tree available.

For my 2c, I don't agree with either of your assumptions above.  I don't
see this as a massive expansion of the scope, nor that we're in the 11th
hour at this point.  Agreed, it's the last commitfest, but we're near
the beginning of it and Alvaro has already made modifications to the
patch set, as is generally expected to happen for any patch in a
commitfest, without much trouble.  The changes which I'm suggesting are
nearly trivial for the larger body of work and only slightly more than
trivial for the more complicated pieces.

 Yes, it might be possible to use the same code for a bunch of minor
 commands, but not for the interesting/complex stuff.

We can clearly rebuild at least CREATE commands for all objects without
access to the parse tree, obviously pg_dump manages somehow.  I didn't
specify that a single command had to be used.  Further, the actual
deparse_CreateStmt doesn't look like it'd have a terribly hard time
producing something without access to the parsetree.

The whole premise of this approach is that we're using the results of
the catalog changes as the basis for what's needed to recreate the
command.  The places where we're referring to the parsetree look more
like a crutch of convenience than anything done for a particularly good
reason.  Consider this part of 0011, which includes deparse_CreateStmt:

/*
 * Process table elements: column definitions and constraints.  Only
 * the column definitions are obtained from the parse node itself.  To
 * get constraints we rely on pg_constraint, because the parse node
 * might be missing some things such as the name of the constraints.
 */

and then looking down through to deparse_ColumnDef, we see that the
ColumnDef passed in is used almost exclusively for the *column name*
(which is used to look up the entry in pg_attribute).  Looping over
the pg_attribute entries for the table in attnum order would
certainly return the same result, or something has gone wrong.
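
As a simple illustration of that point (this query is just an example of
what the catalogs already provide; it is not part of the patch), the column
definitions can be rebuilt from pg_attribute alone, for whatever relation is
at hand:

SELECT a.attname,
       pg_catalog.format_type(a.atttypid, a.atttypmod) AS coltype,
       a.attnotnull
FROM pg_catalog.pg_attribute a
WHERE a.attrelid = 'some_table'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;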

Beyond the inheritance consideration, the only other reference actually
leads to what you argued (correctly, in my view) was a mistake in the
code - where a NOT NULL is omitted for the primary key case.  Weren't you
also complaining about the 'OF type' form being a potential issue?
That'd also go away with the approach I'm advocating.  In short, I
suspect that if this approach had been taken originally, at least some
of the concerns and issued levied against the current implementation
wouldn't exist.  Thankfully, the way the code has been developed, the
majority of the code is general infrastructure and the changes I'm
suggesting are all in code which is simpler, thanks to that
infrastructure already being there.

All I'm suggesting is that we focus on collecting the information from
the catalog and avoid using the parsetree.  For at least the CREATE and
DROP cases, that should be entirely possible.  This does end up being a
bit different from the original goal, which was closer to reproducing
exactly the command that was specified, but as shown above, that
probably wasn't what we ever really wanted.  The original command is
ambiguous on a number of levels and even where it isn't we can get the
canonical information we need straight from the catalog.

  Maybe that's possible and maybe it isn't, but at least for the CREATE
  cases we should be able to avoid forcing a user to provide a built
  parsetree, and that'd be *really* nice and open up this feature to
  quite a few other use-cases that I can think of.  I'd further suggest
  that we provide these command to the SQL level, and then have a
  wrapper which will take the name of an object, resolve it to Oid, and
  then pass back the CREATE command for it.
 
 I think this is a different feature that might end up sharing some of
 the infrastructure, but not more.


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-23 Thread Michael Paquier
On Mon, Feb 23, 2015 at 8:57 PM, Fujii Masao masao.fu...@gmail.com wrote:

 So I pushed the patch.


Thank you.
-- 
Michael


Re: [HACKERS] Bug in pg_dump

2015-02-23 Thread Michael Paquier
On Tue, Feb 24, 2015 at 2:17 AM, Gilles Darold gilles.dar...@dalibo.com
wrote:

 Looks great to me, I have tested with the postgis_topology extension
 everything works fine.


Actually, after looking more in depth at the internals of pg_dump (I did
that last night for another patch), I think that both patches are wrong.
First, this patch marks a table in an extension as always dumpable:
+  /* Mark member table as dumpable */
+  configtbl->dobj.dump = true;
And then many checks on ext_member are added in many code paths to ensure
that the definitions of objects in extensions are never dumped.
But actually this assumption is not true all the time, per this code in
getExtensionMembership():
if (!dopt->binary_upgrade)
    dobj->dump = false;
else
    dobj->dump = refdobj->dump;
So this patch would break binary upgrade, where some extension objects
should be dumped (one reason why I hadn't noticed that before is that
the pg_upgrade tests do not include extensions).

Hence, one idea coming to my mind to fix the problem would be to add some
extra processing directly in getExtensionMembership(), after building the
DO_TABLE_DATA objects with makeTableDataInfo(): check the FK
dependencies and add a dependency link with addObjectDependency(). The good
part of that is that even user tables that reference extension tables
with an FK can be restored correctly, because their constraint is added
*after* their data is loaded. I noticed as well that with this patch the
--data-only mode was dumping tables in the correct order.

Speaking of which, patches implementing this idea are attached. The module
test case has been improved as well, with a check using a table that is not
in an extension but is linked by an FK to a table in an extension.
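
To illustrate the scenario the improved test covers (the table below is just
an example modeled on the test module, not the exact test case):

CREATE EXTENSION dump_test;

-- A table outside the extension referencing a table that belongs to it;
-- its FK constraint must be restored only after the extension table's data.
CREATE TABLE user_tab_fkey (
    id int REFERENCES bb_tab_fkey(id)
);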
-- 
Michael
From 7fb84d68df023d913c448f2498987ca4f0a70595 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 17 Feb 2015 07:39:23 +
Subject: [PATCH 1/3] Fix ordering of tables part of extensions linked with
 constraints

Additional checks are done on FK constraints that potentially link
extension objects to each other, and data dump ordering is ensured. Note
that this does not take into account foreign keys of tables that are
not part of an extension referencing an extension table.
---
 src/bin/pg_dump/pg_dump.c | 56 ++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2b53c72..5b1b240 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -15236,7 +15236,8 @@ dumpRule(Archive *fout, DumpOptions *dopt, RuleInfo *rinfo)
 }
 
 /*
- * getExtensionMembership --- obtain extension membership data
+ * getExtensionMembership --- obtain extension membership data and check FK
+ * dependencies among extension tables.
  */
 void
 getExtensionMembership(Archive *fout, DumpOptions *dopt, ExtensionInfo extinfo[],
@@ -15423,6 +15424,59 @@ getExtensionMembership(Archive *fout, DumpOptions *dopt, ExtensionInfo extinfo[]
 	}
 }
 			}
+
+			/*
+			 * Now that all the TableInfoData objects have been created for
+			 * all the extensions, check their FK dependencies and register
+			 * them to ensure correct data ordering.  Note that this is not
+			 * a problem for user tables not included in an extension
+			 * referencing with a FK tables in extensions as their constraint
+			 * is declared after dumping their data. In --data-only mode the
+			 * table ordering is ensured as well thanks to
+			 * getTableDataFKConstraints().
+			 */
+			for (j = 0; j < nconfigitems; j++)
+			{
+int			i_confrelid, k;
+PQExpBuffer query2 = createPQExpBuffer();
+TableInfo  *configtbl;
+Oid			configtbloid = atooid(extconfigarray[j]);
+
+configtbl = findTableByOid(configtbloid);
+if (configtbl == NULL || configtbl->dataObj == NULL)
+	continue;
+
+appendPQExpBuffer(query2,
+	  "SELECT confrelid "
+	  "FROM pg_catalog.pg_constraint "
+	  "WHERE conrelid = '%u' "
+	  "AND contype = 'f'",
+	  configtbloid);
+
+res = ExecuteSqlQuery(fout, query2->data, PGRES_TUPLES_OK);
+ntups = PQntuples(res);
+i_confrelid = PQfnumber(res, "confrelid");
+
+for (k = 0; k < ntups; k++)
+{
+	Oid			confrelid;
+	TableInfo  *reftable;
+
+	confrelid = atooid(PQgetvalue(res, k, i_confrelid));
+	reftable = findTableByOid(confrelid);
+
+	if (reftable == NULL || reftable->dataObj == NULL)
+		continue;
+
+	/*
+	 * Make referencing TABLE_DATA object depend on the
+	 * referenced table's TABLE_DATA object.
+	 */
+	addObjectDependency(&configtbl->dataObj->dobj,
+		reftable->dataObj->dobj.dumpId);
+}
+resetPQExpBuffer(query2);
+			}
 		}
 		if (extconfigarray)
 			free(extconfigarray);
-- 
2.3.0

From ad5cd360243b44e735195c5e94df3c21e8f18e07 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 17 Feb 2015 22:29:28 +0900
Subject: [PATCH 2/3] Make 

Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)

2015-02-23 Thread Kouhei Kaigai
Please see check_object_ownership(). It checks the relation's ownership
for OBJECT_COLUMN; however, the type's ownership is the correct check for
OBJECT_ATTRIBUTE.
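
For reference, the distinction being pointed out is roughly the following
(a simplified sketch assuming the usual backend headers, not the actual body
of check_object_ownership()):

static void
check_column_or_attribute_ownership(ObjectType objtype, Oid relid, Oid typid)
{
	if (objtype == OBJECT_COLUMN)
	{
		/* column of a relation: the relation's owner is checked */
		if (!pg_class_ownercheck(relid, GetUserId()))
			aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_CLASS,
						   get_rel_name(relid));
	}
	else						/* OBJECT_ATTRIBUTE: composite type */
	{
		/* here the type's owner is what ought to be checked */
		if (!pg_type_ownercheck(typid, GetUserId()))
			aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_TYPE,
						   format_type_be(typid));
	}
}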

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Alvaro Herrera
 Sent: Tuesday, February 24, 2015 4:02 AM
 To: Pg Hackers; Peter Eisentraut
 Subject: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE
 for composites)
 
 I found that the OBJECT_ATTRIBUTE symbol is useless.  I can just remove
 it and replace it with OBJECT_COLUMN, and everything continues to work;
 no test fails that I can find.
 
 I thought we had a prohibition against ALTER TABLE when used on
 composites, but it's not as severe as I thought.  The following commands
 fail in master:
 
 ALTER TABLE comptype RENAME TO comptype2; -- HINT: Use ALTER TYPE
 ALTER TABLE comptype SET SCHEMA sch;  -- HINT: Use ALTER TYPE
 
 However, the following command works in master:
 
 ALTER TABLE comptype RENAME COLUMN a TO b;
 and has the same effect as this:
 ALTER TYPE comptype RENAME ATTRIBUTE a TO b;
 
 The RENAME ATTRIBUTE case in RenameStmt is the only thing currently
 using OBJECT_ATTRIBUTE; therefore, since in precisely that case we do
 not prohibit using ALTER TABLE, we can just remove OBJECT_ATTRIBUTE
 completely.  That leads to the attached patch, which changes no test
 result at all.
 
 This symbol was added in
 
 commit e440e12c562432a2a695b8054964fb34e3bd823e
 Author: Peter Eisentraut pete...@gmx.net
 Date:   Sun Sep 26 14:41:03 2010 +0300
 
 Add ALTER TYPE ... ADD/DROP/ALTER/RENAME ATTRIBUTE
 
 Like with tables, this also requires allowing the existence of
 composite types with zero attributes.
 
 reviewed by KaiGai Kohei
 
 Thoughts?
 
 --
 Álvaro Herrerahttp://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_check_dir comments and implementation mismatch

2015-02-23 Thread Noah Misch
On Sun, Feb 22, 2015 at 07:57:41PM -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Fri, Feb 20, 2015 at 12:59 AM, Noah Misch n...@leadboat.com wrote:
  On Mon, Feb 02, 2015 at 03:48:33PM -0500, Robert Haas wrote:
  If readdir() fails and closedir() succeeds, the return will be -1 but
  errno will be 0.
 
  Out of curiosity, have you seen a closedir() implementation behave that 
  way?
  It would violate C99 (The value of errno is zero at program startup, but 
  is
  never set to zero by any library function.) and POSIX.
 
  No.  Good point, I didn't think about that.  I think this way is safer, 
  though.
 
 While the spec forbids library functions from setting errno to zero, there
 is no restriction on them changing errno in other ways despite returning
 success; their exit-time value of errno is only well-defined if they fail.
 So we do need to preserve errno explicitly across closedir(), or we may
 report the wrong failure from readdir().

Yes.  I'm happy with the commit itself.
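
A minimal sketch of the pattern being described (not the committed
pg_check_dir code): save readdir()'s errno before calling closedir(), since
even a successful library call may change errno.

#include <dirent.h>
#include <errno.h>
#include <string.h>

/* Returns 1 if the directory is empty, 0 if not, -1 on error. */
static int
check_dir_empty(const char *path)
{
	DIR		   *dir;
	struct dirent *de;
	int			result = 1;
	int			saved_errno;

	if ((dir = opendir(path)) == NULL)
		return -1;

	errno = 0;
	while ((de = readdir(dir)) != NULL)
	{
		if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0)
		{
			result = 0;			/* found a real entry */
			break;
		}
	}
	saved_errno = errno;		/* remember any readdir() failure */

	if (closedir(dir) != 0 && saved_errno == 0)
		saved_errno = errno;	/* closedir() itself failed */

	if (saved_errno != 0)
	{
		errno = saved_errno;
		return -1;
	}
	return result;
}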


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] deparsing utility commands

2015-02-23 Thread Andres Freund
Hi,

On 2015-02-23 19:48:43 -0500, Stephen Frost wrote:
  Yes, it might be possible to use the same code for a bunch of minor
  commands, but not for the interesting/complex stuff.
 
 We can clearly rebuild at least CREATE commands for all objects without
 access to the parse tree, obviously pg_dump manages somehow.

That's not the same. pg_dump recreates a static state, not a running log
of changes that can be replayed.

 I didn't specify that a single command had to be used.

Well, if you want to get 'the create table statement' you'll need to
decide what you want. With ddl event triggers it's clear - something
that, when replayed, results in the same catalog contents. Such an easy
definition doesn't, as far as I know, exist for a set of functions you
seem to imagine.

 Further, the actual deparse_CreateStmt doesn't look like it'd have a
 terribly hard time producing something without access to the
 parsetree.

The interesting bit isn't the deparse_CreateStmt itself, but all the
stuff that's also triggered when you do a CREATE TABLE
nontrivial-stuff;. utility.c will first transform the statement and then
run and stash every created subcommand. And the subcommands will
*also* get deparsed.

Just think about what to do about CREATE TABLE foo(id serial primary
key, data text, bar_id REFERENCES foo.bar); - there's no way you can get
which other objects to dump from the catalog alone. What about a schema,
considering CREATE SCHEMA ... (schema_elements)+;?
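
To make that concrete, a single command like

CREATE TABLE foo (id serial PRIMARY KEY, data text);

implicitly expands into roughly the following (simplified; the exact set of
subcommands stashed by the event trigger machinery may differ):

CREATE SEQUENCE foo_id_seq;
CREATE TABLE foo (
    id   integer NOT NULL DEFAULT nextval('foo_id_seq'),
    data text
);
ALTER SEQUENCE foo_id_seq OWNED BY foo.id;
ALTER TABLE foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);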

Sure, individual subcommands could refer to the catalog instead of the
parse tree. But to get the whole thing you can't easily just refer to
it.

 All I'm suggesting is that we focus on collecting the information from
 the catalog and avoid using the parsetree.  For at least the CREATE and
 DROP cases, that should be entirely possible.

DROP's already in 9.4; the additions in 9.5 were more or less usability
things.  The problem with generating DROPs right now is more about
identifying which one you want to drop and checking the dependencies - the
latter I'm not sure how to do without actually executing the DROP.

   Maybe that's possible and maybe it isn't, but at least for the CREATE
   cases we should be able to avoid forcing a user to provide a built
   parsetree, and that'd be *really* nice and open up this feature to
   quite a few other use-cases that I can think of.  I'd further suggest
   that we provide these command to the SQL level, and then have a
   wrapper which will take the name of an object, resolve it to Oid, and
   then pass back the CREATE command for it.
  
  I think this is a different feature that might end up sharing some of
  the infrastructure, but not more.
 
 I know you've looked through this code also and I really don't get the
 comment that only some of this infrastructure would be shared.  As I
 tried to point out, for the 'CREATE POLICY' case, a few minor
 modifications would have it provide exactly what I'm suggesting and I'm
 sure that most of the cases would be similar.  Simply looking through
 the code with an eye towards avoiding the parsetree in favor of pulling
 information from the catalog would illustrate the point I'm making, I
 believe.

I've no problem at all changing the CREATE POLICY (and some other) deparse
functions to look more at the catalog than at the command. What I don't see
is changing all of the CREATE deparse functions, guaranteeing that they are
usable for getting the DDL of catalog objects, and providing SQL-accessible
infrastructure for that. That's a *massive* undertaking.

What I mean with 'sharing some of the infrastructure' is that I can see
a good portion of the deparse_* functions being reused for what you
desire.

But the decision about which of those to call will be an entirely
separate project. So is the whole new infrastructure for considering locking
and visibility (all the deparse stuff uses continually evolving catalog
snapshots!) that it'll need, since that's a problem the event trigger stuff
has much less cause to care about, because the objects are new rows and thus
can't be updated by other backends.

  I don't think it's all that related, so I'm not surprised. If you
  execute a CREATE TABLE(id serial primary key); you'll get a bunch of
  commands with this facility - it'd be much less clear what exactly shall
  be dumped in the case of some hypothetical GUI when you want to dump the
  table.
 
 I really don't think it's all that strange or complicated to consider
 and we've got a rather nice example of what a good approach would be.

Right. We've got a *massive* program that solves dependencies and doesn't
do all that many useful/correct things if you only let it dump
individual objects. And that dumping of individual objects is what you
want... ;)

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Tomas Vondra
On 24.2.2015 01:44, Peter Geoghegan wrote:
 On Mon, Feb 23, 2015 at 4:41 PM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 Are you going to add this into the CF? Would be nice to get this into
 9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider
 it part of one of the existing sortsupport patches.
 
 It's a bugfix, IMV. I guess I should add it to the current CF, but I
 think I might be forbidden from doing so as a non-CF-manager...


Yes, I consider that a bugfix too, fixing a performance regression.

regards

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about to have relnamespace and relrole?

2015-02-23 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 19 Feb 2015 15:30:53 -0500, Peter Eisentraut pete...@gmx.net wrote in 
54e647fd.5000...@gmx.net
 On 2/18/15 3:44 AM, Kyotaro HORIGUCHI wrote:
  Hello, this is the patchset v2 of this feature.
  
  0001-Add-regrole.patch
  0002-Add-regnamespace.patch
  0003-Check-new-reg-types-are-not-used-as-default-values.patch
  0004-Documentation-for-new-OID-alias-types.patch
 
 Somehow, these patches ended up in the commit fest without an author
 listed.  That should probably not be possible.

Mmm.. I saw the author for this is listed here,

https://commitfest.postgresql.org/4/

Did you fix that manually for me?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)

2015-02-23 Thread Kouhei Kaigai
 Kouhei Kaigai wrote:
  Please see check_object_ownership(). It checks relation's ownership
  if OBJECT_COLUMN, however, type's ownership is the correct check if
  OBJECT_ATTRIBUTE.
 
 Hmm.  Is there any case where the two are different?

AlterObjectTypeCommandTag()?

OBJECT_ATTRIBUTE makes an ALTER TYPE tag, but an ALTER COLUMN tag is
made with OBJECT_COLUMN.

The above two cases are all I could find.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Peter Geoghegan
On Mon, Feb 23, 2015 at 4:41 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 Are you going to add this into the CF? Would be nice to get this into
 9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider
 it part of one of the existing sortsupport patches.

It's a bugfix, IMV. I guess I should add it to the current CF, but I
think I might be forbidden from doing so as a non-CF-manager...
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about to have relnamespace and relrole?

2015-02-23 Thread Michael Paquier
On Tue, Feb 24, 2015 at 11:35 AM, Kyotaro HORIGUCHI 
horiguchi.kyot...@lab.ntt.co.jp wrote:

 Hello,

 At Thu, 19 Feb 2015 15:30:53 -0500, Peter Eisentraut pete...@gmx.net
 wrote in 54e647fd.5000...@gmx.net
  On 2/18/15 3:44 AM, Kyotaro HORIGUCHI wrote:
   Hello, this is the patchset v2 of this feature.
  
   0001-Add-regrole.patch
   0002-Add-regnamespace.patch
   0003-Check-new-reg-types-are-not-used-as-default-values.patch
   0004-Documentation-for-new-OID-alias-types.patch
 
  Somehow, these patches ended up in the commit fest without an author
  listed.  That should probably not be possible.

 Mmm.. I saw the author for this is listed here,

 https://commitfest.postgresql.org/4/

 Did you fix that manually for me?


Looking at the log entry:
2015-02-19 21:11:08 Michael Paquier (michael-kun) Changed authors to
Kyotaro Horiguchi (horiguti)
I do a pass from time to time on the patches...
-- 
Michael


Re: [HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code

2015-02-23 Thread David Steele
On 2/23/15 6:55 PM, Michael Paquier wrote:
 Hi all,
 
 This is a purely informational email...
 I have put my hands on a Raspberry PI 2, and I have found that it takes
 6 minutes to compile the backend code using the 4 cores of the ARMv7
 processor, and close to 20 minutes on a single core (without ccache).
 The test has been done using ArchLinux ARM with a micro SD card of class
 10 (said to be able to manage 50MB/s in write, 60MB/s in read).
 In comparison, the buildfarm machine hamster, which is a Rasp PI 1,
 takes close to 2 hours to do the same operation, using the same OS and a
 similar class 10 card (30MB/s but the I/O is not the bottleneck).

Nice!  Looks like Moore's Law works even on tight budgets.

-- 
- David Steele
da...@pgmasters.net





Re: [HACKERS] pg_dump gets attributes from tables in extensions

2015-02-23 Thread Rushabh Lathia
On Mon, Feb 23, 2015 at 7:45 PM, Michael Paquier michael.paqu...@gmail.com
wrote:



 On Mon, Feb 23, 2015 at 5:57 PM, Rushabh Lathia rushabh.lat...@gmail.com
 wrote:
  Thanks to the easy handy testcase, was able to replicate the test
 scenario
  on my local environment. And yes tbinfo-dobj.ext_member check into
  getTableAttrs() do fix the issue.
 
  Looking more into pg_dump code what I found that, generally PG don't have
  tbinfo-dobj.ext_member check to ignore the object. Mostly we do this
 kind
  of check using tbinfo-dobj.dump (look at dumpTable() for reference). Do
 you
  have any particular reason if choosing dobj.ext_member over dobj.dump ?

 Hm. I have been pondering about this code more and I am dropping the patch
 as either adding a check based on the flag to track dumpable objects or
 ext_member visibly breaks the logic that binary upgrade and tables in
 extensions rely on. Particularly this portion makes me think so in
 getExtensionMembership():
 /*
  * Normally, mark the member object as not to be dumped.  But in
  * binary upgrades, we still dump the members individually, since the
  * idea is to exactly reproduce the database contents rather than
  * replace the extension contents with something different.
  */
 if (!dopt->binary_upgrade)
     dobj->dump = false;
 else
     dobj->dump = refdobj->dump;


Ok. Looking at the above code in getExtensionMembership(), it seems like the
fix you suggested is not correct. I added a new table with a DEFAULT attribute
to the dump_test extension, and pg_dump with --binary-upgrade is failing with
"pg_dump: invalid column number 1 for table bb_tab_fkey".

rushabh@rushabh-centos-vm:dump_test$ cat dump_test--1.0.sql
/* dump_test/dump_test--1.0.sql */

-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION dump_test" to load this file. \quit

CREATE TABLE IF NOT EXISTS bb_tab_fkey (
id int PRIMARY KEY
);

CREATE TABLE IF NOT EXISTS aa_tab_fkey (
id int REFERENCES bb_tab_fkey(id)
);

CREATE TABLE IF NOT EXISTS foo ( a int default 100 );

This gives another strong reason to add an if (!tbinfo->dobj.dump) check
rather than the ext_member check. What do you say?

Regards,


[HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code

2015-02-23 Thread Michael Paquier
Hi all,

This is a purely informational email...
I have put my hands on a Raspberry PI 2, and I have found that it takes 6
minutes to compile the backend code using the 4 cores of the ARMv7
processor, and close to 20 minutes on a single core (without ccache). The
test has been done using ArchLinux ARM with a micro SD card of class 10
(said to be able to manage 50MB/s in write, 60MB/s in read).
In comparison, the buildfarm machine hamster, which is a Rasp PI 1, takes
close to 2 hours to do the same operation, using the same OS and a similar
class 10 card (30MB/s but the I/O is not the bottleneck).

Regards,
-- 
Michael


Re: [HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code

2015-02-23 Thread Adam Brightwell
Michael,


 This is a purely informational email...
 I have put my hands on a Raspberry PI 2, and I have found that it takes 6
 minutes to compile the backend code using the 4 cores of the ARMv7
 processor, and close to 20 minutes on a single core (without ccache). The
 test has been done using ArchLinux ARM with a micro SD card of class 10
 (said to be able to manage 50MB/s in write, 60MB/s in read).
 In comparison, the buildfarm machine hamster, which is a Rasp PI 1, takes
 close to 2 hours to do the same operation, using the same OS and a similar
 class 10 card (30MB/s but the I/O is not the bottleneck).


Thanks for sharing this info.  I'm looking forward to getting my hands on
an rpi2 soon for some of my other projects.  Having an idea of the
increased performance, especially related to compilation, was certainly
something I was curious about.

-Adam



-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Tomas Vondra
On 23.2.2015 21:52, Peter Geoghegan wrote:
 On Mon, Feb 23, 2015 at 11:17 AM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 So while it's true that for the 3rd query we get much worse results
 compared to the other queries (i.e. we don't get 400% speedup but ~3%
 slowdown compared to master), it's true that master performs
 exceptionally well for this query with small datasets. Once we get to 2M
 rows, the master performance drops significantly but cost-model keeps
 the performance characteristics and the speedup jumps back to ~700%
 which is nice.

 These numbers are for the 'ASC + unsorted row' test, but I do get
 exactly the same pattern for the 'random' tests done previously.
 
 Yeah. Looks like you're comparing a case where the old cost model
 did the right thing anyway (i.e. used abbreviation). The difference
 would then be entirely explainable as noise. Right?

I don't think it's just noise - on the i5-2500 CPU, the results are very
consistent. For 1M random rows with this query (#3):

select * from (select * from stuff_text order by randtxt
   offset 1000) foo

I do get this (with 20 runs of the query):

             min       max       median    mean      stddev

master       932.449   932.902   932.637   932.639   0.116
cost-model   957.16    958.47    957.497   957.586   0.411

That is super-consistent, IMHO, and the ~2.6% slowdown is clearly
visible. But it might be due to the padding described by Andrew.

Similarly on the Xeon E5450 CPU, although the results there are much
more noisy. Maybe it's more

 It would be nice if we could address the 3% regression for the
 last query, but I guess it's not a big deal. The numbers in general
 are absolutely impressive. Kudos.
 
 Thanks.

Are you going to add this into the CF? Would be nice to get this into
9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider
it part of one of the existing sortsupport patches.

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT UPDATE and logical decoding

2015-02-23 Thread Peter Geoghegan
On Fri, Feb 20, 2015 at 3:44 PM, Peter Geoghegan p...@heroku.com wrote:
 they'd only see a
 REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of
 an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that
 was the definitive outcome. No need for output plugins to consider
 REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.

 Yes. It'd be easiest if the only the final insert/update were actually
 WAL logged as full actions.

 Well, that implies that we'd actually know that we'd succeed when WAL
 logging the speculative heap tuple's insertion. We literally have no
 way of knowing if that's the case at that point, though - that's just
 the nature of value locking scheme #2's optimistic approach.

Attached patch does this. This is just a sketch, to provoke a quick
discussion (which is hopefully all this needs). I did things this way
because it seemed more efficient than rolling this into a whole new
revision of the ON CONFLICT patch series.

Ostensibly, this does the right thing: The test_decoding output plugin
only ever reports either an INSERT or an UPDATE within a transaction
(that contains an INSERT ... ON CONFLICT UPDATE). With jjanes_upsert
running, I have not found any problematic cases where this fails to be
true (although I have certainly been less than thorough so far -
please take that into consideration).

I have two basic concerns:

* Do you think that what I've sketched here is roughly the right
approach? Consolidating super deletions and insertions at transaction
reassembly seemed like the right thing to do, but obviously I'm not
best qualified to judge that. Clearly, it would be possible to
optimize this so REORDER_BUFFER_CHANGE_INSERT peek ahead happens
much more selectively, which I haven't bothered with yet.

* What are the hazards implied by my abuse of the
ReorderBufferIterTXNNext() interface? It looks like it should be okay
to peek ahead like this, because it has a slightly odd convention
around memory management that accidentally works. But honestly, I'm
feeling too lazy to make myself grok the code within
ReorderBufferIterTXNNext() at the moment, and so have not put concerns
about the code being totally broken to rest yet. Could the current
ReorderBufferChange be freed before it is sent to a logical decoding
plugin by way of a call to the rb->apply_change() callback (when the
xact is spooled to disk, for example)?

Thanks
-- 
Peter Geoghegan
From 53784c818b39d39c6a1f3588f7823a9010074869 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan peter.geoghega...@gmail.com
Date: Sun, 22 Feb 2015 17:02:14 -0800
Subject: [PATCH 7/7] Logical decoding support for ON CONFLICT UPDATE

---
 src/backend/replication/logical/decode.c|  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 35 +++--
 src/include/replication/reorderbuffer.h |  3 ++-
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e7614bd..a9cd0df 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -682,7 +682,10 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+	if (!(xlrec->flags & XLOG_HEAP_KILLED_SPECULATIVE_TUPLE))
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+	else
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_DELETE;
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 20bb3b7..8f9101e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -401,6 +401,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INSERT:
 		case REORDER_BUFFER_CHANGE_UPDATE:
 		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 			if (change->data.tp.newtuple)
 			{
 				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
@@ -1313,7 +1314,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		/*
+		 * To merge super delete and insert records, peek-ahead is occasionally
+		 * required
+		 */
 		ReorderBufferChange *change;
+		ReorderBufferChange *nextchange = NULL;
 
 		if (using_subtxn)
 			BeginInternalSubTransaction(replay);
@@ -1323,16 +1329,21 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		rb->begin(rb, txn);
 
 		iterstate = ReorderBufferIterTXNInit(rb, txn);
-		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
+		while ((change = nextchange ? nextchange :
+				ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/* Forget about previous peek ahead */
+			nextchange = NULL;
+
 			switch (change->action)
 		

Re: [HACKERS] Reduce pinning in btree indexes

2015-02-23 Thread Robert Haas
On Mon, Feb 23, 2015 at 2:48 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Robert pointed out that the visibility information
 for an index-only scan wasn't checked while the index page READ
 lock was held, so those scans also still hold the pins.

 Why does an index-only scan need to hold the pin?

Suppose there's a dead tuple in the heap someplace, and an index
pointer pointing to that dead tuple.  An index scan reaches the
relevant page and copies all of the index pointers.  VACUUM removes
the index pointer and marks the heap page all-visible.  The scan now
uses the index pointers copied to backend-local memory and notes that
the heap-page is all-visible, so the scan sees the tuple even though
that tuple is totally gone from the heap by that point.  Holding the
pin prevents this, because we can't reach the second heap pass until
the scan is done with the page, and therefore the dead line pointer in
the heap page can't be marked unused, and therefore the page can't be
marked all-visible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for Numeric

2015-02-23 Thread Andrew Gierth
 Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes:

 Tomas I believe the small regressions (1-10%) for small data sets,
 Tomas might be caused by this 'random padding' effect, because that's
 Tomas probably where L1/L2 cache is most important. For large datasets
 Tomas the caches are probably not as efficient anyway, so the random
 Tomas padding makes no difference,

Except that _your own results_ show that this is not the case.

In your first set of results you claimed a reduction in performance to
91% for a query which is _in no way whatsoever_ affected by any of the
code changes. How is this not noise?

I refer specifically to this case from your spreadsheet:

query     select * from (select * from stuff_text order by randtxt
          offset 1000) foo
type      text
scale     5
master    26.949
datum     28.051
numeric   29.734
datum     96%
numeric   91%

This query does not invoke any code path touched by either the datum or
numeric patches! The observed slowdown is therefore just noise (assuming
here that your timings are correct).

Whether that case can be improved by tweaking the _text_ abbreviation
code is another question, one which is not relevant to either of the
patches currently in play.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] LSN Map

2015-02-23 Thread Robert Haas
On Mon, Feb 23, 2015 at 12:52 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Dunno, but Jim's got a point. This is a maintenance burden to all indexams,
 if they all have to remember to update the LSN map separately. It needs to
 be done in some common code, like in PageSetLSN or XLogInsert or something.

 Aside from that, isn't this horrible from a performance point of view? The
 patch doubles the buffer manager traffic, because any update to any page
 will also need to modify the LSN map. This code is copied from the
 visibility map code, but we got away with it there because the VM only needs
 to be updated the first time a page is modified. Subsequent updates will
 know the visibility bit is already cleared, and don't need to access the
 visibility map.

 Ans scalability: Whether you store one value for every N pages, or the LSN
 of every page, this is going to have a huge effect of focusing contention to
 the LSN pages. Currently, if ten backends operate on ten different heap
 pages, for example, they can run in parallel. There will be some contention
 on the WAL insertions (much less in 9.4 than before). But with this patch,
 they will all fight for the exclusive lock on the single LSN map page.

 You'll need to find a way to not update the LSN map on every update. For
 example, only update the LSN page on the first update after a checkpoint
 (although that would still have a big contention focusing effect right after
 a checkpoint).

I think it would make more sense to do this in the background.
Suppose there's a background process that reads the WAL and figures
out which buffers it touched, and then updates the LSN map
accordingly.  Then the contention-focusing effect disappears, because
all of the updates to the LSN map are being made by the same process.
You need some way to make sure the WAL sticks around until you've
scanned it for changed blocks - but that is mighty close to what a
physical replication slot does, so it should be manageable.
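
For comparison, this is the existing facility being referred to; a physical
replication slot makes the server retain WAL that its consumer has not yet
confirmed (shown only as an illustration of that facility; the slot name is
made up):

SELECT * FROM pg_create_physical_replication_slot('lsn_map_reader');
-- once in use, WAL is retained until the slot's restart_lsn advances
SELECT slot_name, restart_lsn FROM pg_replication_slots;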

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)

2015-02-23 Thread Alvaro Herrera
Kouhei Kaigai wrote:
 Please see check_object_ownership(). It checks relation's ownership
 if OBJECT_COLUMN, however, type's ownership is the correct check if
 OBJECT_ATTRIBUTE.

Hmm.  Is there any case where the two are different?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical column ordering

2015-02-23 Thread Tomas Vondra
Hi,

attached is the result of my first attempt to make the logical column
ordering patch work. This touches a lot of code in the executor that is
mostly new to me, so if you see something that looks like an obvious
bug, it probably is (so let me know).   


improvements

The main improvements of this version are that:

* initdb actually works (while before it was crashing)

* regression tests work, with two exceptions

  (a) 'subselect' fails because EXPLAIN prints columns in physical order
  (but we expect logical)

  (b) 'col_order' crashes because of a tuple descriptor mismatch in a
  function call (this actually causes a segfault)

The main change in this patch is that tlist_matches_tupdesc() now checks
the target list vs. physical attribute order, which may result in doing a
projection (in cases where that would not be done previously).

I do not claim this is the best approach - maybe it would be better to
keep the physical tuple and reorder it lazily. That's why I kept a few
pieces of code (fix_physno_mutator) and a few unused fields in Var.

Over time I've heard various use cases for this patch, but in most
cases they were quite speculative. If you have an idea where this might be
useful, can you explain it here, or maybe point me to a place where it's
described?

There's also a few FIXMEs, mostly from Alvaro's version of the patch.
Some of them are probably obsolete, but I wasn't 100% sure by that so
I've left them in place until I understand the code sufficiently.


randomized testing
--
I've also attached a python script for simple randomized testing. Just
execute it like this:

$ python randomize-attlognum.py -t test_1 test_2 \
   --init-script attlognum-init.sql \
   --test-script attlognum-test.sql

and it will do this over and over

$ dropdb test
$ createdb test
$ run init script
$ randomly set attlognums for the tables (test_1 and test_2)
$ run test script

It does not actually check the result, but my experience is that when
there's a bug in handling the descriptor, it results in a segfault pretty
fast (just put some varlena columns into the table).
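
For reference, setting attlognum by hand for this kind of testing might look
like the following (illustrative only: attlognum is the pg_attribute column
added by this patch, the column names are made up, and the script is more
careful about keeping the numbers consistent):

-- as a superuser, swap the logical positions of two columns of test_1
UPDATE pg_attribute SET attlognum = 2
 WHERE attrelid = 'test_1'::regclass AND attname = 'a';
UPDATE pg_attribute SET attlognum = 1
 WHERE attrelid = 'test_1'::regclass AND attname = 'b';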


plans / future
--
After discussing this with Alvaro, we've both agreed that this is far
too high-risk change to commit in the very last CF (even if it was in a
better shape). So while it's added to 2015-02 CF, we're aiming for 9.6
if things go well.


regards

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index c5892d3..7528724 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2368,6 +2368,9 @@ get_attnum_pk_pos(int *pkattnums, int pknumatts, int key)
 	return -1;
 }
 
+/*
+ * FIXME this probably needs to be tweaked.
+ */
 static HeapTuple
 get_tuple_of_interest(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals)
 {
diff --git a/contrib/spi/timetravel.c b/contrib/spi/timetravel.c
index 0699438..30e496c 100644
--- a/contrib/spi/timetravel.c
+++ b/contrib/spi/timetravel.c
@@ -314,6 +314,7 @@ timetravel(PG_FUNCTION_ARGS)
 		Oid		   *ctypes;
 		char		sql[8192];
 		char		separ = ' ';
+		Form_pg_attribute *attrs;
 
 		/* allocate ctypes for preparation */
 		ctypes = (Oid *) palloc(natts * sizeof(Oid));
@@ -322,10 +323,11 @@ timetravel(PG_FUNCTION_ARGS)
 		 * Construct query: INSERT INTO _relation_ VALUES ($1, ...)
 		 */
 		snprintf(sql, sizeof(sql), "INSERT INTO %s VALUES (", relname);
+		attrs = TupleDescGetLogSortedAttrs(tupdesc);
 		for (i = 1; i <= natts; i++)
 		{
-			ctypes[i - 1] = SPI_gettypeid(tupdesc, i);
-			if (!(tupdesc->attrs[i - 1]->attisdropped)) /* skip dropped columns */
+			ctypes[i - 1] = SPI_gettypeid(tupdesc, attrs[i - 1]->attnum);
+			if (!(attrs[i - 1]->attisdropped)) /* skip dropped columns */
 			{
 				snprintf(sql + strlen(sql), sizeof(sql) - strlen(sql), "%c$%d", separ, i);
 				separ = ',';
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 6cd4e8e..14787ce 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -79,6 +79,8 @@
 /*
  * heap_compute_data_size
  *		Determine size of the data area of a tuple to be constructed
+ *
+ * Note: input arrays must be in attnum order.
  */
 Size
 heap_compute_data_size(TupleDesc tupleDesc,
@@ -88,16 +90,23 @@ heap_compute_data_size(TupleDesc tupleDesc,
 	Size		data_length = 0;
 	int			i;
 	int			numberOfAttributes = tupleDesc->natts;
-	Form_pg_attribute *att = tupleDesc->attrs;
+	Form_pg_attribute *att = TupleDescGetPhysSortedAttrs(tupleDesc);
 
+	/*
+	 * We need to consider the attributes in physical order for storage, yet
+	 * our input arrays are in attnum order.  In this loop, i is an index
+	 * into the attphysnum-sorted attribute array, and idx is an index into the
+	 * input arrays.
+	 */
 	for (i = 0; i < numberOfAttributes; i++)