Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 5:23 PM, Stas Kelvich  wrote:

(Please do not top-post, this breaks the thread flow.)

> I’ve looked over proposed patch and migrated my shell tests scripts that i’ve 
> used for testing twophase commits on master/slave to this test framework. 
> Everything looks mature, and I didn’t encountered any problems with writing 
> new tests using this infrastructure.
> From my point of view I don’t see any problems to commit this patches in 
> their current state.

Thanks for the review!

> 0) There are several routines that does actual checking, like 
> is/command_ok/command_fails. I think it will be very handy to have wrappers 
> psql_ok/psql_fails that calls psql through the command_ok/fails.

Do you have a test case in mind for it?

> 1) Better to raise more meaningful error when IPC::Run is absent.

This has been discussed before, and as far as I recall the current
behavior has been concluded as being fine. That's where
--enable-tap-tests becomes meaningful.

> 2) --enable-tap-tests deserves mention in test/recovery/README and more 
> obvious error message when one trying to run make check in test/recovery 
> without --enable-tap-tests.

When running without --enable-tap-tests from src/test/recovery you
would get the following error per how prove_check is defined:
"TAP tests not enabled"

> 3) Is it hard to give ability to run TAP tests in extensions?

Not really. You would need to enforce a check rule or similar. For the
recovery test suite I have mapped the check rule with prove_check.

> 4) It will be handy if make check will write path to log files in case of 
> failed test.

Hm, perhaps. The log files are hardcoded in log/, so it is not like we
don't know it. That's an argument for the main TAP suite though, not
really this series of patch.

> 5) psql() accepts database name as a first argument, but everywhere in tests 
> it is ‘postgres’. Isn’t it simpler to store dbname in connstr, and have 
> separate function to change database?
> 6) Clean logs on prove restart? Clean up tmp installations?

Those are issues proper to the main TAP infrastructure, though I agree
that we could improve things here, particularly for temporary
installations that get automatically... Hm... Cleaned up should a test
failure happen?

> 7) Make check sets PGPORT PG_REGRESS for prove. Is it necessary?

No, that's not needed (I think I noticed that at some point) and
that's a bug. We could live without setting it.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2016-01 Commitfest

2016-02-04 Thread Andres Freund
On 2016-02-03 12:32:47 -0500, Tom Lane wrote:
> Heikki and/or Andres have their names on three of the remaining RFC
> patches; it's unlikely any other committer will touch those patches
> unless they take their names off.

If you want to take over the timestamp patch, please feel free to do
so. Otherwise I'll get to it in ~10 days.

I'm working on/off on the checkpointer flushing thing, so I don't think
it makes sense for somebody else to take that over at this point.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-02-04 Thread Peter Geoghegan
On Sat, Jan 30, 2016 at 5:29 AM, Robert Haas  wrote:
>> I meant use "quicksort with spillover" simply because an estimated
>> 90%+ of all tuples have already been consumed. Don't consider the
>> tuple width, etc.
>
> Hmm, it's a thought.

To be honest, it's a bit annoying that this is one issue we're stuck
on, because "quicksort with spillover" is clearly of less importance
overall. (This is a distinct issue from the issue of not using a
replacement selection style heap for the first run much of the time,
which seems to be a discussion about whether and to what extent the
*traditional* advantages of replacement selection hold today, as
opposed to a discussion about a very specific crossover point in my
patch.)

>> Really? Just try it with a heap that is not tiny. Performance tanks.
>> The fact that replacement selection can produce one long run then
>> becomes a liability, not a strength. With a work_mem of something like
>> 1GB, it's *extremely* painful.
>
> I'm not sure exactly what you think I should try.  I think a couple of
> people have expressed the concern that your patch might regress things
> on data that is all in order, but I'm not sure if you think I should
> try that case or some case that is not-quite-in-order.  "I don't see
> that you've justified that statement" is referring to the fact that
> you presented no evidence in your original post that it's important to
> sometimes use quicksorting even for run #1.  If you've provided some
> test data illustrating that point somewhere, I'd appreciate a pointer
> back to it.

I think that the answer to what you should try is simple: Any case
involving a large heap (say, a work_mem of 1GB). No other factor like
correlation seems to change the conclusion about that being generally
bad.

If you have a correlation, then that is *worse* if "quicksort with
spillover" always has us use a heap for the first run, because it
prolongs the pain of using the cache inefficient heap (note that this
is an observation about "quicksort with spillover" in particular, and
not replacement selection in general). The problem you'll see is that
there is a large heap which is __slow__ to spill from, and that's
pretty obvious with or without a correlation. In general it seems
unlikely that having one long run during the merge (i.e. no merge --
seen by having the heap build one long run because we got "lucky" and
"quicksort with spillover" encountered a correlation) can ever hope to
make up for this.

It *could* still make up for it if:

1. There isn't much to make up for in the first place, because the
heap is CPU cache resident. Testing this with a work_mem that is the
same size as CPU L3 cache seems a bit pointless to me, and I think
we've seen that a few times.

and:

2. There are many passes required without a replacement selection
heap, because the volume of data is just so much greater than the low
work_mem setting. Replacement selection makes the critical difference
because there is a correlation, perhaps strong enough to make it one
or two runs rather than, say, 10 or 20 or 100.

I've already mentioned many times that linear growth in the size of
work_mem sharply reduces the need for additional passes during the
merge phase (the observation about quadratic growth that I won't
repeat). These days, it's hard to recommend anything other than "use
more memory" to someone trying to use 4MB to sort 10GB of data. Yeah,
it would also be faster to use replacement selection for the first run
in the hope of getting lucky (actually lucky this time; no quotes),
but it's hard to imagine that that's going to be a better option, no
matter how frugal the user is. Helping users recognize when they could
use more memory effectively seems like the best strategy. That was the
idea behind multipass_warning, but you didn't like that (Greg Stark
was won over on the multipass_warning warning, though). I hope we can
offer something roughly like that at some point (a view?), because it
makes sense.

> How about always starting with replacement selection, but limiting the
> amount of memory that can be used with replacement selection to some
> small value?  It could be a separate GUC, or a hard-coded constant
> like 20MB if we're fairly confident that the same value will be good
> for everyone.  If the tuples aren't in order, then we'll pretty
> quickly come to the end of the first run and switch to quicksort.

This seems acceptable, although note that we don't have to decide
until we reach the work_mem limit, and not before.

If you want to use a heap for the first run, I'm not excited about the
idea, but if you insist then I'm glad that you at least propose to
limit it to the kind of cases that we *actually* saw regressed (i.e.
low work_mem settings -- like the default work_mem setting, 4MB).
We've seen no actual case with a larger work_mem that is advantaged by
using a heap, even *with* a strong correlation (this is actually
*worst of all*); that's where I am 

Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Ashutosh Bapat
On Thu, Feb 4, 2016 at 3:28 PM, Etsuro Fujita 
wrote:

> On 2016/02/04 17:58, Etsuro Fujita wrote:
>
>> On 2016/02/04 8:00, Robert Haas wrote:
>>
>>> On Wed, Feb 3, 2016 at 5:56 PM, Robert Haas 
>>> wrote:
>>>
 On Wed, Feb 3, 2016 at 12:08 PM, Ashutosh Bapat
  wrote:

> PFA patches with naming conventions similar to previous ones.
> pg_fdw_core_v7.patch: core changes
> pg_fdw_join_v7.patch: postgres_fdw changes for join pushdown
> pg_join_pd_v7.patch: combined patch for ease of testing.
>

> One more: I think the following in postgresGetForeignJoinPaths() is a good
> idea, but I think it's okay to just check whether root->rowMarks is
> non-NIL, because that since we have rowmarks for all base relations except
> the target, if we have root->parse->commandType==CMD_DELETE (or
> root->parse->commandType==CMD_UPDATE), then there would be at least one
> non-target base relation in the joinrel, which would have a rowmark.
>
>
Sorry, I am unable to understand it fully. But what you are suggesting that
if there are root->rowMarks, then we are sure that there is at least one
base relation apart from the target, which needs locking rows. Even if we
don't have one, still changes in a row of target relation after it was
scanned, can result in firing EPQ check, which would need the local plan to
be executed, thus even if root->rowMarks is NIL, EPQ check can fire and we
will need alternate local plan.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] WIP: Detecting SSI conflicts before reporting constraint violations

2016-02-04 Thread Peter Geoghegan
On Wed, Feb 3, 2016 at 10:48 AM, Robert Haas  wrote:
> I don't feel qualified to have an opinion on whether this is an
> improvement.  I'm a little skeptical of changes like this on general
> principle because sometimes one clientele wants error A to be reported
> rather than error B and some other clientele wants the reverse.
> Deciding who is right is above my pay grade.

Exclusion constraints can sometimes have one client deadlock, rather
than see an exclusion violation. The particular client that sees an
error is undefined either way, so I personally felt that that wasn't
very important. But that's a problem that I'm closer to, and I won't
express an opinion on this patch.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-02-04 Thread Amit Kapila
On Tue, Feb 2, 2016 at 7:19 PM, Robert Haas  wrote:
>
> On Mon, Feb 1, 2016 at 12:27 AM, Amit Kapila 
wrote:
> > Fixed.
>
> This patch doesn't build:
>
> ./xfunc.sgml:int lwlock_count = 0;
> Tabs appear in SGML/XML files
>

Changed the documentation and now I am not getting build failure.

> The #define NUM_LWLOCKS 1 just seems totally unnecessary, as does int
> lwlock_count = 0.  You're only assigning one lock!  I'd just do
> RequestAddinLWLockTranche("pg_stat_statements locks", 1); pgss->lock =
> GetLWLockAddinTranche("pg_stat_statements locks")->lock; and call it
> good.
>

Changed as per suggestion.

> I think we shouldn't foreclose the idea of core users of this facility
> by using names like NumLWLocksByLoadableModules().  Why can't an
> in-core client use this API?  I think instead of calling these "addin
> tranches" we should call them "named tranches"; thus public APIs
> RequestNamedLWLockTranche()
> and GetNamedLWLockTranche(), and private variables
> NamedLWLockTrancheRequests, NamedLWLockTrancheRequestsAllocated, etc.
> In fact,
>

Changed as per suggestion.

> I do not see an obvious reason why the two looks in CreateLWLocks()
> that end with "} while (++i < LWLockTrancheRequestsCount);" could not
> be merged, and I believe that would be cleaner than what you've got
> now.  Similarly, the two loops in GetLWLockAddinTranche() could also
> be merged.  Just keep a running total and return it when you find a
> match.
>
> I think it would be a good idea to merge LWLockAddInTrancheShmemSize
> into LWLockShmemSize.  I don't see why that function can't compute a
> grand total and return it.
>

Agreed with both the points and changed as per suggestion.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


separate_tranche_extensions_v4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2016-02-04 Thread Heikki Linnakangas

On 22/10/15 03:56, Michael Paquier wrote:

On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
 wrote:

Heikki Linnakangas wrote:


Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap that
would normally be WAL-logged, we check if the relation is in the hash table,
and skip WAL-logging if so.


I think this wasn't applied, was it?


No, it was not applied.


I dropped the ball on this one back in July, so here's an attempt to 
revive this thread.


I spent some time fixing the remaining issues with the prototype patch I 
posted earlier, and rebased that on top of current git master. See attached.


Some review of that would be nice. If there are no major issues with it, 
I'm going to create backpatchable versions of this for 9.4 and below.


- Heikki

>From 063e1aa258800873783190a9678d551b43c0e39e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 4 Feb 2016 15:21:09 +0300
Subject: [PATCH 1/1] Fix the optimization to skip WAL-logging on table created
 in same xact.

There were several bugs in the optimization to skip WAL-logging for a table
that was created (or truncated) in the same transaction, with
wal_level=minimal, leading to data loss if a crash happened after the
optimization was used:

* If the table was created, and then truncated, and then loaded with COPY,
  we would replay the truncate record at commit, and the table would end
  up being empty after replay.

* If there is a trigger on a table that modifies the same table, and you
  COPY to the table in the transaction that created it, you might have some
  WAL-logged operations on a page, performed by the trigger, intermixed with
  the non-WAL-logged inserts done by the COPY. That can lead to crash at
  recovery, because we might try to replay a WAL record that e.g. updates
  a tuple, but insertion of the tuple was not WAL-logged.
---
 src/backend/access/heap/heapam.c| 254 +++-
 src/backend/access/heap/pruneheap.c |   2 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   2 +-
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c   | 250 ---
 src/backend/commands/copy.c |  14 +-
 src/backend/commands/createas.c |   9 +-
 src/backend/commands/matview.c  |   6 +-
 src/backend/commands/tablecmds.c|   5 +-
 src/backend/commands/vacuumlazy.c   |   6 +-
 src/backend/storage/buffer/bufmgr.c |  47 --
 src/include/access/heapam.h |   8 +-
 src/include/access/heapam_xlog.h|   2 +
 src/include/catalog/storage.h   |   3 +
 src/include/storage/bufmgr.h|   2 +
 16 files changed, 487 insertions(+), 134 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f443742..79298e2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *	  the POSTGRES heap access method used for all POSTGRES
  *	  relations.
  *
+ * WAL CONSIDERATIONS
+ *	  All heap operations are normally WAL-logged. but there are a few
+ *	  exceptions. Temporary and unlogged relations never need to be
+ *	  WAL-logged, but we can also skip WAL-logging for a table that was
+ *	  created in the same transaction, if we don't need WAL for PITR or
+ *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *	  the file to disk at COMMIT instead.
+ *
+ *	  The same-relation optimization is not employed automatically on all
+ *	  updates to a table that was created in the same transacton, because
+ *	  for a small number of changes, it's cheaper to just create the WAL
+ *	  records than fsyncing() the whole relation at COMMIT. It is only
+ *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *	  operation; it will cause any subsequent updates to the table to skip
+ *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+ *	  COMMIT.
+ *
+ *	  To make that work, all modifications to heap must use
+ *	  HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *	  for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -55,6 +77,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2332,12 +2355,6 @@ FreeBulkInsertState(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is 

Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Ashutosh Bapat
>
>If so, my concern is, the helper function probably wouldn't
>> extend to the parameterized-foreign-join-path cases, though that
>> would work well for the unparameterized-foreign-join-path cases.  We
>> don't support parameterized-foreign-join paths for 9.6?
>>
>
> If we do not find a local path with given parameterization, it means
>> there are other local parameterized paths which are superior to it. This
>> possibly indicates that there will be foreign join parameterised paths
>> which are superior to this parameterized path, so we basically do not
>> create foreign join path with that parameterization.
>>
>
> The latest version of the postgres_fdw join pushdown patch will support
> only the unparameterized-path case, so we don't have to consider this, but
> why do you think the superiority of parameterizations is preserved between
> remote joining and local joining?
>

AFAIU, parameterization for local paths bubbles up from base relations. For
foreign relations, we calculate the cost of parameterization when
use_remote_estimate is ON, which means it's accurate. So, except that we
will get clause selectivity wrong (if foreign tables were analyzed
regularly even that won't be the case, I guess) resulting in some small
sway in the costs as compared to those of parameterized foreign join paths.
So, I am guessing that the local estimates for parameterized join paths
would be closer to parameterized foreign paths (if we were to produce
those). Hence my statement. There is always a possibility that those two
costs are way too different, hence I have used phrase "possibly" there. I
could be wrong.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-04 Thread Amit Kapila
On Wed, Feb 3, 2016 at 3:08 PM, Andres Freund  wrote:
>
> On 2016-02-02 10:12:25 +0530, Amit Kapila wrote:
> > @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
> >   if ((flags & (CHECKPOINT_IS_SHUTDOWN |
> > CHECKPOINT_END_OF_RECOVERY |
> >CHECKPOINT_FORCE)) == 0)
> >   {
> > - if
> > (prevPtr == ControlFile->checkPointCopy.redo &&
> > + if (GetProgressRecPtr() == prevPtr &&
> >
> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
> >   {
> >
> > I think such a check won't consider all the WAL-activity happened
> > during checkpoint (after checkpoint start, till it finishes) which was
> > not the intention of this code without your patch.
>
> Precisely.
>
> > I think both this and previous patch (hs-checkpoints-v1 ) approach
> > won't fix the issue in all kind of scenario's.
>
> Agreed.
>
> > Let me try to explain what I think should fix this issue based on
> > discussion above, suggestions by Andres and some of my own
> > thoughts:
>
> > Have a new variable in WALInsertLock that would indicate the last
> > insertion position (last_insert_pos) we write to after holding that
lock.
> > Ensure that we don't update last_insert_pos while logging standby
> > snapshot (simple way is to pass a flag in XLogInsert or distinguish
> > it based on type of record (XLOG_RUNNING_XACTS) or if you can
> > think of a more smarter way).  Now both during checkpoint and
> > in bgwriter, to find out whether there is any activity since last
> > time, we need to check all the WALInsertLocks for latest insert position
> > (by referring last_insert_pos) and compare it with the latest position
> > we have noted during last such check and decide whether to proceed
> > or not based on comparison result.  If you think it is not adequate to
> > store last_insert_pos in WALInsertLock, then we can think of having
> > it in PGPROC.
>
> Yes, that's pretty much what I was thinking of.
>

I think generally it is good idea, but one thing worth a thought is that
by doing so, we need to acquire all WAL Insertion locks every
LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
every slot, do you think it is matter of concern in any way for write
workloads or it won't effect as we need to do this periodically?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Etsuro Fujita

On 2016/02/04 17:58, Etsuro Fujita wrote:

On 2016/02/04 8:00, Robert Haas wrote:

On Wed, Feb 3, 2016 at 5:56 PM, Robert Haas 
wrote:

On Wed, Feb 3, 2016 at 12:08 PM, Ashutosh Bapat
 wrote:

PFA patches with naming conventions similar to previous ones.
pg_fdw_core_v7.patch: core changes
pg_fdw_join_v7.patch: postgres_fdw changes for join pushdown
pg_join_pd_v7.patch: combined patch for ease of testing.


One more: I think the following in postgresGetForeignJoinPaths() is a 
good idea, but I think it's okay to just check whether root->rowMarks is 
non-NIL, because that since we have rowmarks for all base relations 
except the target, if we have root->parse->commandType==CMD_DELETE (or 
root->parse->commandType==CMD_UPDATE), then there would be at least one 
non-target base relation in the joinrel, which would have a rowmark.


+   if (root->parse->commandType == CMD_DELETE ||
+   root->parse->commandType == CMD_UPDATE ||
+   root->rowMarks)
+   {
+   epq_path = GetPathForEPQRecheck(joinrel);
+   if (!epq_path)
+   {
+			elog(DEBUG3, "could not push down foreign join because a local path 
suitable for EPQ checks was not found");

+   return;
+   }
+   }

Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier
 wrote:
> On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:
>> Not wrong, and this leads to the following:
>> void rename_safe(const char *old, const char *new, bool isdir, int elevel);
>> Controlling elevel is necessary per the multiple code paths that would
>> use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does
>> that look fine?
>
> After really coding it, I finished with the following thing:
> +int
> +rename_safe(const char *old, const char *new)
>
> There is no need to extend that for directories, well we could of
> course but all the renames happen on files so I see no need to make
> that more complicated. More refactoring of the other rename() calls
> could be done as well by extending rename_safe() with a flag to fsync
> the data within a critical section, particularly for the replication
> slot code. I have let that out to not complicate more the patch.

Andres just poked me (2m far from each other now) regarding the fact
that we should fsync even after the link() calls when
HAVE_WORKING_LINK is used. So we could lose some meta data here. Mea
culpa. And the patch misses that.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Raising the checkpoint_timeout limit

2016-02-04 Thread Andres Freund
On 2016-02-03 15:07:12 -0600, Jim Nasby wrote:
> On 2/2/16 10:10 PM, Robert Haas wrote:
> >Now, you could also set such configuration settings in
> >a situation where it will not work out well.  But that is true of most
> >configuration settings.

By that argument we should probably raise the lower limit of a bunch of
parameters :P

> Yeah, if we're going to start playing parent then I think the first thing to
> do is remove the fsync GUC. The AWS team has done testing that shows it to
> be worthless from a performance standpoint now that we have synchronous
> commit, and it's an extremely large foot-bazooka to have laying around.

Meh, I don't buy that. There are workloads where that's the case, but
also ones were it's not true. Try e.g. 2PC. And yes, there's definitely
cases where 2PC makes sense, even if you don't need durability on a
local basis.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Raising the checkpoint_timeout limit

2016-02-04 Thread Peter Geoghegan
On Mon, Feb 1, 2016 at 8:16 PM, Noah Misch  wrote:
>> I'm not sure what'd actually be a good upper limit. I'd be inclined to
>> even go to as high as a week or so. A lot of our settings have
>> upper/lower limits that aren't a good idea in general.
>
> In general, I favor having limits reflect fundamental system limitations
> rather than paternalism.  Therefore, I would allow INT_MAX (68 years).

I agree. I'm in favor of having things be what is sometimes called
foolproof, but I think that you can only take that so far, and it's
mostly a matter of guiding a more or less reasonable user in the right
direction. Making it easy to do the right thing and hard to do the
wrong thing.

I don't think you can effectively design anything around a user that
makes perversely bad decision at every turn. If you investigate why a
user made a bad decision, there will usually be a chain of faulty but
not outrageous reasoning behind it.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect formula for SysV IPC parameters

2016-02-04 Thread Fujii Masao
On Wed, Feb 3, 2016 at 12:51 PM, Kyotaro HORIGUCHI
 wrote:
> Hello, I found that the formulas to calculate SEMMNI and SEMMNS
> are incorrect in 9.2 and later.
>
> http://www.postgresql.org/docs/9.5/static/kernel-resources.html
>
> All of them say that the same thing as following,
>
> | SEMMNI  Maximum number of semaphore identifiers (i.e., sets)
> |
> |   at least ceil((max_connections + autovacuum_max_workers + 4) / 16)
> |
> | SEMMNSMaximum number of semaphores system-wide
> |
> |  ceil((max_connections + autovacuum_max_workers + 4) / 16) * 17
> |plus room for other applications
>
> But actually the number of semaphores PostgreSQL needs is
> calculated as following in 9.4 and later.
>
>   numSemas = MaxConnections + NUM_AUXILIARY_PROCS(=4)
>   MaxConnections = max_connections + autovacuum_max_workers + 1 +
>max_worker_processes
>
> So, the formula for SEMMNI should be
>
> ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 
> 16)
>
> and SEMMNS should have the same fix.
>
>
> In 9.3 and 9.2, the documentation says the same thing but
> actually it is calculated as following,
>
>   numSemas = MaxConnections + NUM_AUXILIARY_PROCS(=4)
>   MaxConnections = max_connections + autovacuum_max_workers + 1 +
>GetNumShmemAttachedBgworkers()
>
> Omitting GetNumShmemAttachedBgworkers, the actual formula is
>
> ceil((max_connections + autovacuum_max_workers + 5) / 16)
>
>
> In 9.1, NUM_AUXILIARY_PROCS is 3 so the documentations is correct.
>
>
> I attached two patches for 9.2-9.3 and 9.4-9.6dev
> respectively. patch command complains a bit on applying it on
> 9.2.

Good catch!

ISTM that you also need to change the descriptions about SEMMNI and SEMMNS
under the Table 17-1.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Ashutosh Bapat
> * Is it safe to replace outerjoinpath with its fdw_outerpath the following
> way?  I think that if the join relation represented by outerjoinpath has
> local conditions that can't be executed remotely, we have to keep
> outerjoinpath in the path tree; we will otherwise fail to execute the local
> conditions.  No?
>
> +   /*
> +* If either inner or outer path is a ForeignPath
> corresponding to
> +* a pushed down join, replace it with the
> fdw_outerpath, so that we
> +* maintain path for EPQ checks built entirely of
> local join
> +* strategies.
> +*/
> +   if (IsA(joinpath->outerjoinpath, ForeignPath))
> +   {
> +   ForeignPath *foreign_path;
> +   foreign_path = (ForeignPath
> *)joinpath->outerjoinpath;
> +   if (foreign_path->path.parent->reloptkind
> == RELOPT_JOINREL)
> +   joinpath->outerjoinpath =
> foreign_path->fdw_outerpath;
> +   }
>
>
all the conditions (local and remote) should be part of fdw_outerpath as
well, since that's the alternate local path, which should produce (when
converted to the plan) the same result as the foreign path. fdw_outerpath
should be a local path set when paths for outerjoinpath->parent was being
created. Am I missing something?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-02-04 Thread Rushabh Lathia
On Wed, Feb 3, 2016 at 3:31 PM, Etsuro Fujita 
wrote:

> On 2016/01/28 15:20, Rushabh Lathia wrote:
>
>> On Thu, Jan 28, 2016 at 11:33 AM, Etsuro Fujita
>> > wrote:
>>
>> On 2016/01/27 21:23, Rushabh Lathia wrote:
>>
>> If I understood correctly, above documentation means, that if
>> FDW have
>> DMLPushdown APIs that is enough. But in reality thats not the
>> case, we
>> need  ExecForeignInsert, ExecForeignUpdate, or ExecForeignDelete
>> in case
>> DML is not pushable.
>>
>> And here fact is DMLPushdown APIs are optional for FDW, so that
>> if FDW
>> don't have DMLPushdown APIs they can still very well perform the
>> DML
>> operations using ExecForeignInsert, ExecForeignUpdate, or
>> ExecForeignDelete.
>>
>
> So documentation should be like:
>>
>> If the IsForeignRelUpdatable pointer is set to NULL, foreign
>> tables are
>> assumed to be insertable, updatable, or deletable if the FDW
>> provides
>> ExecForeignInsert, ExecForeignUpdate, or ExecForeignDelete
>> respectively,
>>
>> If FDW provides DMLPushdown APIs and the DML are pushable to the
>> foreign
>> server, then FDW still needs ExecForeignInsert,
>> ExecForeignUpdate, or
>> ExecForeignDelete for the non-pushable DML operation.
>>
>> What's your opinion ?
>>
>
> I agree that we should add this to the documentation, too.
>>
>
> I added docs to the IsForeignRelUpdatable documentation.  Also, a brief
> introductory remark has been added at the beginning of the DML pushdown
> APIs' documentation.
>
> BTW, if I understand correctly, I think we should also modify
>> relation_is_updatabale() accordingly.  Am I right?
>>
>
> Yep, we need to modify relation_is_updatable().
>>
>
> I thought I'd modify that function in the same way as
> CheckValidResultRel(), but I noticed that we cannot do that, because we
> don't have any information on whether each update is pushed down to the
> remote server by PlanDMLPushdown, during relation_is_updatabale().  So, I
> left that function as-is.  relation_is_updatabale() is just used for
> display in the information_schema views, so ISTM that that function is fine
> as-is.  (As for CheckValidResultRel(), I revised it so as to check the
> presence of DML pushdown APIs after checking the existing APIs if the given
> command will be pushed down.  The reason is because we assume the presence
> of the existing APIs, anyway.)
>
> I revised other docs and some comments, mostly for consistency.
>
>
I just started reviewing this and realized that patch is not getting applied
cleanly on latest source, it having some conflicts. Can you please upload
the correct version of patch.


Attached is an updated version of the patch, which has been created on top
> of the updated version of the bugfix patch posted by Robert in [1]
> (attached).
>
>
> Best regards,
> Etsuro Fujita
>
> [1]
> http://www.postgresql.org/message-id/ca+tgmoz40j2uc5ac1nxu03oj4crvolks15xx+ptfp-1u-8z...@mail.gmail.com
>



-- 
Rushabh Lathia


Re: [HACKERS] Using quicksort for every external sort run

2016-02-04 Thread Peter Geoghegan
On Thu, Feb 4, 2016 at 1:46 AM, Peter Geoghegan  wrote:
> It wasn't my original insight that replacement selection has become
> all but obsolete. It took me a while to come around to that point of
> view.

Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]:

"By comparison, OpenVMS sort uses a pure replacement-selection sort to
generate runs (Knuth, 1973). Replacement-selection is best for a
memory-constrained environment. On average, replacement-selection
generates runs that are twice as large as available memory, while the
QuickSort runs are typically less than half of available memory.
However, in a memory-rich environment, QuickSort is faster because it
is simpler, makes fewer exchanges on average, and has superior address
locality to exploit processor caching. "

(I believe that the authors state that "QuickSort runs are typically
less than half of available memory" because of the use of explicit
asynchronous I/O in each thread, which doesn't apply to us).

The paper also has very good analysis of the economics of sorting:

"Even for surprisingly large sorts, it is economical to perform the
sort in one pass."

Of course, memory capacities have scaled enormously in the 20 years
since this analysis was performed, so the analysis applies even at the
very low end these days. The high capacity memory system that they
advocate to get a one pass sort (instead of having faster disks) had
100MB of memory, which is of course tiny by contemporary standards. If
you pay Heroku $7 a month, you get a "Hobby Tier" database with 512MB
of memory. The smallest EC2 instance size, the t2.nano, costs about
$1.10 to run for one week, and has 0.5GB of memory.

The economics of using 4MB or even 20MB to sort 10GB of data are
already preposterously bad for everyone that runs a database server,
no matter how budget conscious they may be. I can reluctantly accept
that we need to still use a heap with very low work_mem settings to
avoid the risk of a regression (in the event of a strong correlation)
on general principle, but I'm well justified in proposing "just don't
do that" as the best practical advice.

I thought I had your agreement on that point, Robert; is that actually the case?

[1] http://www.cs.berkeley.edu/~rxin/db-papers/alphasort.pdf

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoints after database start/immediate checkpoints

2016-02-04 Thread Andres Freund
On 2016-02-03 09:57:00 -0500, Robert Haas wrote:
> On Mon, Feb 1, 2016 at 7:43 PM, Andres Freund  wrote:
> > I wonder if this essentially point at checkpoint_timeout being wrongly
> > defined: Currently it means we'll try to finish a checkpoint
> > (1-checkpoint_completion_target) * timeout before the next one - but
> > perhaps it should instead be that we start checkpoint_timeout * _target
> > before the next timeout? Afaics that'd work more graceful in the face of
> > restarts and forced checkpoints.
> 
> There's a certain appeal to that, but at the same time it seems pretty
> wonky.  Right now, you can say that a checkpoint is triggered when the
> amount of WAL reaches X or the amount of time reaches Y, but with the
> alternative definition it's a bit harder to explain what's going on
> there.

Hm, but can you, really? We *start* a checkpoint every
checkpoint_timeout, but we only finish it after
checkpoint_completion_target * timeout, or cct * segments. I find it
pretty hard to explain that we have a gap of checkpoint_timeout, where
nothing happens, after an immediate/manual checkpoint.

Defining it as: We try to *finish* a checkpoint every checkpoint_timeout
or checkpoint_segments/(max_wal_size/~3) actually seems simpler to
me. Then we just need to add that we start a checkpoint
checkpoint_completion_target before either of the above are reached.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Robert Haas
On Thu, Feb 4, 2016 at 11:55 AM, Ashutosh Bapat
 wrote:
> Patches attached with the previous mail.

The core patch seemed to me to be in good shape now, so I committed
that.  Not sure I'll be able to get to another read-through of the
main patch today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-02-04 Thread Robert Haas
On Thu, Feb 4, 2016 at 7:00 AM, Amit Kapila  wrote:
> [ new patch ]

I've committed this after heavy rewriting.  In particular, I merged
two loops, used local variables to get rid of the very long and quite
repetitive lines, and did various cleanup on the documentation and
comments.

I think we ought to move the buffer mapping, lock manager, and
predicate lock manager locks into their own tranches also, perhaps
using this new named-tranche facility.  In addition, I think we should
get rid of NumLWLocks().  It's no longer calculating anything; it just
returns NUM_FIXED_LWLOCKS, and with the changes described in the first
sentence of this paragraph, it will simply return
NUM_INDIVIDUAL_LWLOCKS.  We don't need a function just to do that; the
assignment it does can be moved to the relevant caller.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNIQUE capability to hash indexes

2016-02-04 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Not that I've heard of.  It's very hard to muster any enthusiasm for
>> improving hash indexes unless their lack of WAL-logging is fixed first.

> This is really strange though.  Surely adding WAL-logging is not an
> enormous task anymore ... I mean, we're undertaking far larger efforts
> now, the WAL logging code is simpler than before, and we even have a
> tool (ok, gotta streamline that one a little bit) to verify that the
> results are correct.

ISTR that we discussed this previously and ran into some stumbling block
or other that made it less-than-trivial.  Don't recall what though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Alvaro Herrera
David G. Johnston wrote:

> ​Learning by reading here...
> 
> http://www.postgresql.org/docs/current/static/wal-internals.html
> """
> ​After a checkpoint has been made and the log flushed, the checkpoint's
> position is saved in the file pg_control. Therefore, at the start of
> recovery, the server first reads pg_control and then the checkpoint record;
> then it performs the REDO operation by scanning forward from the log
> position indicated in the checkpoint record. Because the entire content of
> data pages is saved in the log on the first page modification after a
> checkpoint (assuming full_page_writes is not disabled), all pages changed
> since the checkpoint will be restored to a consistent state.
> 
> To deal with the case where pg_control is corrupt, we should support the
> possibility of scanning existing log segments in reverse order — newest to
> oldest — in order to find the latest checkpoint. This has not been
> implemented yet. pg_control is small enough (less than one disk page) that
> it is not subject to partial-write problems, and as of this writing there
> have been no reports of database failures due solely to the inability to
> read pg_control itself. So while it is theoretically a weak spot,
> pg_control does not seem to be a problem in practice.
> ​"""​
> 
> ​The above comment appears out-of-date if this post describes what
> presently happens.

I think you're misinterpreting Andres, or the docs, or both.

What Andres says is that the control file (pg_control) stores two
checkpoint locations: the latest one, and the one before that.  When
recovery occurs, it starts by looking up the latest checkpoint record;
if it cannot find that for whatever reason, it falls back to reading the
previous one.  (He further claims that falling back to the previous one
is a bad idea.)

What the 2nd para in the documentation is saying is something different:
it is talking about reading all the pg_xlog files (in reverse order),
which is not pg_control, and see what checkpoint records are there, then
figure out which one to use.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] insufficient qualification of some objects in dump files

2016-02-04 Thread Alvaro Herrera
Peter Eisentraut wrote:
> Some dump objects whose names are not unique on a schema level have
> insufficient details in the dump TOC.  For example, a column default
> might have a TOC entry like this:
> 
> 2153; 2604 39696 DEFAULT public a rolename

> I think we should amend the archive tag for these kinds of objects to
> include the table name, so it might look like:
> 
> 2153; 2604 39696 DEFAULT public test a rolename
> 
> Comments?

If we're changing anything, I wonder if it makes sense to go for adding
the object identity there.  Perhaps that would make the whole thing be a
little more regular, rather than have to count elements depending on
object type.  So in this case it'd be

2153; 2604 39696 DEFAULT public.test.a rolename

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread David G. Johnston
On Thu, Feb 4, 2016 at 3:57 PM, Alvaro Herrera 
wrote:

> David G. Johnston wrote:
>
> > ​Learning by reading here...
> >
> > http://www.postgresql.org/docs/current/static/wal-internals.html
> > """
> > ​After a checkpoint has been made and the log flushed, the checkpoint's
> > position is saved in the file pg_control. Therefore, at the start of
> > recovery, the server first reads pg_control and then the checkpoint
> record;
> > then it performs the REDO operation by scanning forward from the log
> > position indicated in the checkpoint record. Because the entire content
> of
> > data pages is saved in the log on the first page modification after a
> > checkpoint (assuming full_page_writes is not disabled), all pages changed
> > since the checkpoint will be restored to a consistent state.
> >
> > To deal with the case where pg_control is corrupt, we should support the
> > possibility of scanning existing log segments in reverse order — newest
> to
> > oldest — in order to find the latest checkpoint. This has not been
> > implemented yet. pg_control is small enough (less than one disk page)
> that
> > it is not subject to partial-write problems, and as of this writing there
> > have been no reports of database failures due solely to the inability to
> > read pg_control itself. So while it is theoretically a weak spot,
> > pg_control does not seem to be a problem in practice.
> > ​"""​
> >
> > ​The above comment appears out-of-date if this post describes what
> > presently happens.
>
> I think you're misinterpreting Andres, or the docs, or both.
>
> What Andres says is that the control file (pg_control) stores two
> checkpoint locations: the latest one, and the one before that.  When
> recovery occurs, it starts by looking up the latest checkpoint record;
> if it cannot find that for whatever reason, it falls back to reading the
> previous one.  (He further claims that falling back to the previous one
> is a bad idea.)
>
> What the 2nd para in the documentation is saying is something different:
> it is talking about reading all the pg_xlog files (in reverse order),
> which is not pg_control, and see what checkpoint records are there, then
> figure out which one to use.
>

Yes, I inferred something that obviously isn't true - that the system
doesn't go hunting for a valid checkpoint to begin recovery from.  While it
does not do so in the case of a corrupted pg_control file I further assumed
it never did.  That would be because the documentation doesn't make the
point of stating that two checkpoint positions exist and that PostgreSQL
will try the second one if the first one proves unusable.  Given the topic
of this thread that omission makes the documentation out-of-date.  Maybe
its covered elsewhere but since this section addresses locating a starting
point I would expect any such description ​to be here as well.

David J.


Re: [HACKERS] checkpoints after database start/immediate checkpoints

2016-02-04 Thread Robert Haas
On Thu, Feb 4, 2016 at 4:42 PM, Andres Freund  wrote:
> On 2016-02-03 09:57:00 -0500, Robert Haas wrote:
>> On Mon, Feb 1, 2016 at 7:43 PM, Andres Freund  wrote:
>> > I wonder if this essentially point at checkpoint_timeout being wrongly
>> > defined: Currently it means we'll try to finish a checkpoint
>> > (1-checkpoint_completion_target) * timeout before the next one - but
>> > perhaps it should instead be that we start checkpoint_timeout * _target
>> > before the next timeout? Afaics that'd work more graceful in the face of
>> > restarts and forced checkpoints.
>>
>> There's a certain appeal to that, but at the same time it seems pretty
>> wonky.  Right now, you can say that a checkpoint is triggered when the
>> amount of WAL reaches X or the amount of time reaches Y, but with the
>> alternative definition it's a bit harder to explain what's going on
>> there.
>
> Hm, but can you, really? We *start* a checkpoint every
> checkpoint_timeout, but we only finish it after
> checkpoint_completion_target * timeout, or cct * segments. I find it
> pretty hard to explain that we have a gap of checkpoint_timeout, where
> nothing happens, after an immediate/manual checkpoint.
>
> Defining it as: We try to *finish* a checkpoint every checkpoint_timeout
> or checkpoint_segments/(max_wal_size/~3) actually seems simpler to
> me. Then we just need to add that we start a checkpoint
> checkpoint_completion_target before either of the above are reached.

Hmm, I could go for that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNIQUE capability to hash indexes

2016-02-04 Thread Andreas Karlsson

On 02/04/2016 11:34 PM, Tom Lane wrote:

Alvaro Herrera  writes:

This is really strange though.  Surely adding WAL-logging is not an
enormous task anymore ... I mean, we're undertaking far larger efforts
now, the WAL logging code is simpler than before, and we even have a
tool (ok, gotta streamline that one a little bit) to verify that the
results are correct.


ISTR that we discussed this previously and ran into some stumbling block
or other that made it less-than-trivial.  Don't recall what though.


The last discussion I can recall was Robert's ideas on how to solve the 
splitting of buckets.


http://www.postgresql.org/message-id/ca+tgmozymojsrfxhxq06g8jhjxqcskvdihb_8z_7nc7hj7i...@mail.gmail.com

Andreas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNIQUE capability to hash indexes

2016-02-04 Thread Alvaro Herrera
Tom Lane wrote:
> Shubham Barai  writes:
> > I wanted to know if anyone is working on these projects from todo list.
> > 1.Add UNIQUE capability to hash indexes
> > 2.Allow multi-column hash indexes
> 
> Not that I've heard of.  It's very hard to muster any enthusiasm for
> improving hash indexes unless their lack of WAL-logging is fixed first.

This is really strange though.  Surely adding WAL-logging is not an
enormous task anymore ... I mean, we're undertaking far larger efforts
now, the WAL logging code is simpler than before, and we even have a
tool (ok, gotta streamline that one a little bit) to verify that the
results are correct.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Andres Freund
On 2016-02-03 09:28:24 -0500, Robert Haas wrote:
> Would we still have some way of forcing the older checkpoint record to
> be used if somebody wants to try to do that?

I think currently the best way to force an arbitrary checkpoint to be
used is creating a "custom" backup label. Not that nice.  Not sure if we
need something nice here, I don't really see a frequent need for this.

We could add another option to pg_resetxlog alternatively :/

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count_nulls(VARIADIC "any")

2016-02-04 Thread Tom Lane
Pavel Stehule  writes:
> [ num_nulls_v6.patch ]

I started looking through this.  It seems generally okay, but I'm not
very pleased with the function name "num_notnulls".  I think it would
be better as "num_nonnulls", as I see Oleksandr suggested already.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Development with Eclipse - Wrong error messages in IDE

2016-02-04 Thread Alvaro Herrera
Peter Moser wrote:

> I have some strange error message inside Eclipse, that some symbols cannot
> be found. I work with version 9.6 currently. For instance,
> 
> Symbol 'RM_HEAP_ID' could not be resolved
> src/backend/access/heap/heapam.c
> 
> It affects all occurrences of symbols that are defined in
> src/include/access/rmgrlist.h. Eclipse just says "Syntax error" here.
> 
> However, the source code compiles and runs without any compile-time error or
> warning. It is just an IDE problem I think, but it distracts me from finding
> real bugs.

Disclaimer: I've never used eclipse.

The problem is some perhaps-too-clever stuff we do to avoid repetitive
declarations of things.  The rmgr stuff uses a PG_RMGR macro, which is
defined differently in src/backend/access/transam/rmgr.c and
src/include/access/rmgr.h; the latter contains the actual enum
definition.  On the other hand Eclipse is trying to be too clever by
processing the C files, but not actually getting it completely right
(which is understandable, really).  We have other similar cases, such as
grammar keywords (kwlist.h)

I'm afraid that you'd have to teach Eclipse to deal with such things
(which might be tricky) or live with it.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 4 Feb 2016 23:06:45 +0300, Michael Paquier  
wrote in 

Re: [HACKERS] Batch update of indexes

2016-02-04 Thread Jim Nasby

On 2/4/16 1:37 AM, konstantin knizhnik wrote:

>My suspicion is that it would be useful to pre-order the new data before 
trying to apply it to the indexes.

Sorry, but ALTER INDEX is expected to work for all indexes, not only B-Tree, 
and for them sorting may not be possible...
But for B-Tree presorting inserted data should certainly increase performance.
I will think about it.


I wasn't talking about ALTER INDEX.

My theory is that if you're doing a large DML operation it might be more 
efficient to update an index as a single bulk operation, instead of 
doing it for each tuple.


If you want to do that, then you need an efficient method for finding 
everything that a DML statement changed. That's the exact same thing we 
need to support statement-level triggers being able to reference NEW and 
OLD. It's probably also what we need to support incremental update matviews.


If we had such a capability then we could add options to the AM 
infrastructure to allow indexes to support doing bulk maintenance as 
well as per-tuple maintenance (or even support only bulk maintenance...)


I don't think any of that has anything to do with ALTER INDEX.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Andres Freund
On February 5, 2016 2:52:20 AM GMT+03:00, Jim Nasby  
wrote:
>On 2/4/16 3:37 PM, Andres Freund wrote:
>> On 2016-02-03 09:28:24 -0500, Robert Haas wrote:
>>> Would we still have some way of forcing the older checkpoint record
>to
>>> be used if somebody wants to try to do that?
>>
>> I think currently the best way to force an arbitrary checkpoint to be
>> used is creating a "custom" backup label. Not that nice.  Not sure if
>we
>> need something nice here, I don't really see a frequent need for
>this.
>>
>> We could add another option to pg_resetxlog alternatively :/
>
>I guess you'd have to scan through WAL files by hand to find the next 
>oldest checkpoint?

Just look at pg control, it contains the precious location?

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 11:58 PM, Stas Kelvich  wrote:
>> On 04 Feb 2016, at 12:59, Michael Paquier  wrote:
>>> 0) There are several routines that does actual checking, like 
>>> is/command_ok/command_fails. I think it will be very handy to have wrappers 
>>> psql_ok/psql_fails that calls psql through the command_ok/fails.
>>
>> Do you have a test case in mind for it?
>
> Yes, I’ve used that to test prepare/commit while recovery (script attached, 
> it’s in WIP state, i’ll submit that later along with other twophase stuff).

Oh, OK, now I see. Well it seems to make sense for your case, though
it does not seem to be directly linked to the patch here. We could
incrementally add something on top of the existing infrastructure that
gets into the code tree once the 2PC patch gets in a more advanced
shape.

>>> 2) --enable-tap-tests deserves mention in test/recovery/README and more 
>>> obvious error message when one trying to run make check in test/recovery 
>>> without --enable-tap-tests.
>>
>> When running without --enable-tap-tests from src/test/recovery you
>> would get the following error per how prove_check is defined:
>> "TAP tests not enabled"
>
> Yes, but that message doesn’t mention --enable-tap-tests and README also 
> silent about that too. I didn’t know about that flag and had to search in 
> makefiles for this error message to see what conditions leads to it. I think 
> we can save planet from one more stackoverflow question if the error message 
> will mention that flag.

Well, that works for the whole TAP test infrastructure and not really
this patch only. Let's not forget that the goal of this thread is to
provide a basic set of tests and routines to help people building test
cases for more advanced clustering scenarios, so I'd rather not
complicate the code with side things and remain focused on the core
problem.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Jim Nasby

On 2/4/16 5:09 PM, David G. Johnston wrote:


What the 2nd para in the documentation is saying is something different:
it is talking about reading all the pg_xlog files (in reverse order),
which is not pg_control, and see what checkpoint records are there, then
figure out which one to use.


Yes, I inferred something that obviously isn't true - that the system
doesn't go hunting for a valid checkpoint to begin recovery from.  While
it does not do so in the case of a corrupted pg_control file I further
assumed it never did.  That would be because the documentation doesn't
make the point of stating that two checkpoint positions exist and that
PostgreSQL will try the second one if the first one proves unusable.
Given the topic of this thread that omission makes the documentation
out-of-date.  Maybe its covered elsewhere but since this section
addresses locating a starting point I would expect any such description
​to be here as well.


Yeah, I think we should fix the docs. Especially since I imagine that if 
you're reading that part of the docs you're probably having a really bad 
day, and bad info won't help you...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-04 Thread Jim Nasby

On 2/4/16 3:37 PM, Andres Freund wrote:

On 2016-02-03 09:28:24 -0500, Robert Haas wrote:

Would we still have some way of forcing the older checkpoint record to
be used if somebody wants to try to do that?


I think currently the best way to force an arbitrary checkpoint to be
used is creating a "custom" backup label. Not that nice.  Not sure if we
need something nice here, I don't really see a frequent need for this.

We could add another option to pg_resetxlog alternatively :/


I guess you'd have to scan through WAL files by hand to find the next 
oldest checkpoint?


I'm guessing that if this is happening in the field there's a decent 
chance people aren't noticing it, so maybe the best thing for now is to 
turn off the automatic behavior bust still have a relatively easy way to 
re-enable it. In case this is more common than we think...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: PL/Pythonu - function ereport

2016-02-04 Thread Jim Nasby

On 2/4/16 3:13 AM, Catalin Iacob wrote:

Thanks for the overview. Very helpful.


I find existing behaviour for 2, 3 and 4 unlike other Python APIs I've
seen, surprising and not very useful. If I want to log a tuple I can
construct and pass a single tuple, I don't see why plpy.info() needs
to construct it for me. And for the documented, single argument call
nothing changes.


Agreed, that usage is wonky.


The question to Bruce (and others) is: is it ok to change to the new
behaviour illustrated and change meaning for usages like 2, 3 and 4?


If any users have a bunch of code that depends on the old behavior, 
they're going to be rather irritated if we break it. If we want to 
depricate it then I think we need a GUC that allows you to get the old 
behavior back.



If we don't want that, the solution Pavel and I see is introducing a
parallel API named plpy.raise_info or plpy.rich_info or something like
that with the new behaviour and leave the existing functions
unchanged. Another option is some compatibility GUC but I don't think
it's worth the trouble and confusion.


If we're going to provide an alternative API, I'd just do 
plpy.raise(LEVEL, ...).


At this point, my vote would be:

Add a plpython.ereport_mode GUC that has 3 settings: current 
(deprecated) behavior, allow ONLY 1 argument, new behavior. The reason 
for the 1 argument option is it makes it much easier to find code that's 
still using the old behavior. I think it's also worth having 
plpy.raise(LEVEL, ...) as an alternative.


If folks feel that's overkill then I'd vote to leave the existing 
behavior alone and just add plpy.raise(LEVEL, ...).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-04 Thread Jim Nasby
On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
>> 2. A feature to suspend i/o write-out towards a particular blocks
>> >that are registered by other concurrent backend, unless it is not
>> >unregistered (usually, at the end of P2P DMA).
>> >==> to be discussed.

I think there's still a race condition here though...

A
finds buffer not in shared buffers

B
reads buffer in
modifies buffer
starts writing buffer to OS

A
Makes call to block write, but write is already in process; thinks
writes are now blocked
Reads corrupted block
Much hilarity ensues

Or maybe you were just glossing over that part for brevity.

...

> I tried to design a draft of enhancement to realize the above i/o write-out
> suspend/resume, with less invasive way as possible as we can.
> 
>ASSUMPTION: I intend to implement this feature as a part of extension,
>because this i/o suspend/resume checks are pure overhead increment
>for the core features, unless extension which utilizes it.
> 
> Three functions shall be added:
> 
> extern intGetStorageMgrNumbers(void);
> extern f_smgr GetStorageMgrHandlers(int smgr_which);
> extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
> 
> As literal, GetStorageMgrNumbers() returns the number of storage manager
> currently installed. It always return 1 right now.
> GetStorageMgrHandlers() returns the currently configured f_smgr table to
> the supplied smgr_which. It allows extensions to know current configuration
> of the storage manager, even if other extension already modified it.
> SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> the current one.
> If extension wants to intermediate 'smgr_write', extension will replace
> the 'smgr_write' by own function, then call the original function, likely
> mdwrite, from the alternative function.
> 
> In this case, call chain shall be:
> 
>FlushBuffer, and others...
> +-- smgrwrite(...)
>  +-- (extension's own function)
>   +-- mdwrite

ISTR someone (Robert Haas?) complaining that this method of hooks is
cumbersome to use and can be fragile if multiple hooks are being
installed. So maybe we don't want to extend it's usage...

I'm also not sure whether this is better done with an smgr hook or a
hook into shared buffer handling...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Noah Misch
On Thu, Feb 04, 2016 at 09:13:46PM +0300, Victor Wagner wrote:
> On Thu, 4 Feb 2016 18:33:27 +0300 Michael Paquier  
> wrote:

> > > Really, it is not so hard to add configure checks for perl modules.
> > > And we need to test not only for IPC::Run, but for Test::More too,
> > > because some Linux distributions put modules which come with perl
> > > into separate package.  
> > 
> > The last time we discussed about that on this list we concluded that
> > it was not really necessary to have such checks, for one it makes the
> > code more simple, and because this is leveraged by the presence of
> > --enable-tap-tests, tests which can get actually costly with
> > check-world. But this is digressing the subject of this thread, which
> > deals with the fact of having recovery tests integrated in core...
> 
> Of course, such configure tests should be run only if
> --enable-tap-tests is passed to the configure script
> 
> It would look  like
> 
> if test "$enable_tap_tests" = "yes"; then
>   AX_PROG_PERL_MODULES( Test::More, , AC_MSG_ERROR([Test::More is
>   necessary to run TAP Tests])
>   AX_PROG_PERL_MODULES( IPC::Run, , AC_MSG_ERROR([IPC::Run is
>   necessary to run TAP Tests])
> fi
> 
> in the configure.in
> 
> May be it is not strictly necessary, but it is really useful to see
> such problems as clear error message during configure stage, rather
> than successfully configure, compile, run tests and only then find out,
> that something is forgotten.
> 
> I don't see why having such tests in the configure.in, makes code more
> complex. It just prevents configure to finish successfully if
> --enable-tap-tests is specified and required modules are not available.

Even if detecting missing modules at "configure" time is the right thing, it
belongs in a distinct patch, discussed on a distinct thread.  The absence of
IPC::Run affects the proposed replication tests in the same way it affects
current TAP suites, so this thread has no business revisiting it.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcommitfest reply having corrupted subject line

2016-02-04 Thread Noah Misch
On Thu, Feb 04, 2016 at 09:19:19AM +0100, Magnus Hagander wrote:
> On Thu, Feb 4, 2016 at 7:26 AM, Noah Misch  wrote:
> > The following message, which bears "User-Agent: pgcommitfest",
> >
> >
> > http://www.postgresql.org/message-id/flat/20160202164101.1291.30526.p...@coridan.postgresql.org
> >
> > added spaces after commas in its subject line, compared to the subject
> > line of
> > its In-Reply-To message.
> >
> >  new subject line: Re: Add generate_series(date, date) and
> > generate_series(date, date, integer)
> > previous subject line: Re: Add generate_series(date,date) and
> > generate_series(date,date,integer)
> >
> > I see no way to manually alter the subject line from the
> > commitfest.postgresql.org review UI.  Is commitfest.postgresql.org
> > initiating
> > this change internally?
> >
> 
> That is weird.
> 
> The CF app certainly doesn't do that intentionally - it copies it from the
> archives. And if I look in the db on the cf app, it has the subject without
> the space as the field that it copied :O
> 
> The code is line 355-358 at:
> http://git.postgresql.org/gitweb/?p=pgcommitfest2.git;a=blob;f=pgcommitfest/commitfest/views.py;h=b4f19b2db470e5269943418b2402ca9ddfff0dff;hb=fec3b2431730c131a206a170a99a7610cdbacc6b#l355
> 
> the "subject" field in the db that we copy does not have the spaces... I
> honestly have no idea where they are coming from :O I'm guessing it must be
> something internally in the python libraries that create the MIME.

So it is.  The problem happens when email.mime.text.MIMEText wraps the subject
line; see attached test program.

> Have you
> seen this with any other messages, that you can recall, or just this one?

Just this one.  However, no other hackers subject line since 2015-01-01
contains a comma followed by something other than a space.
import email.mime.text

msg = email.mime.text.MIMEText('body text', _charset='utf-8')
msg['Subject'] = 'Re: generate_series(date,date)'
print(msg.as_string())

msg = email.mime.text.MIMEText('body text', _charset='utf-8')
msg['Subject'] = ('Re: '
  '  '
  '  '
  'generate_series(date,date) '
  '')
print(msg.as_string())

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNIQUE capability to hash indexes

2016-02-04 Thread Antonin Houska
Tom Lane  wrote:

> Alvaro Herrera  writes:
> > Tom Lane wrote:
> >> Not that I've heard of.  It's very hard to muster any enthusiasm for
> >> improving hash indexes unless their lack of WAL-logging is fixed first.
> 
> > This is really strange though.  Surely adding WAL-logging is not an
> > enormous task anymore ... I mean, we're undertaking far larger efforts
> > now, the WAL logging code is simpler than before, and we even have a
> > tool (ok, gotta streamline that one a little bit) to verify that the
> > results are correct.
> 
> ISTR that we discussed this previously and ran into some stumbling block
> or other that made it less-than-trivial.  Don't recall what though.

Concurrency of bucket split is the issue. It makes sense to fix the problem
before anyone tries to implement WAL. Otherwise WAL will have to be reworked
from scratch someday.

I had some ideas which I even published:

http://www.postgresql.org/message-id/32423.1427413442@localhost

Then I spent some more time on it sometime in October and improved the concept
a bit (and also found bugs in the version I had published, so please don't
spend much time looking at it). I also wrote a function to check if the
consistent (to be possibly added to pageinspect extension). I even wrote a
function that inserts tuples only into index, not into heap - I suppose that
should make comparison of index performance with and without the patch
simpler.

Now, besides making the patch easier to read, I need to test it
thoroughly. The lack of time is one problem, but I need to admit that it's a
personal issue too :-) So far I think I have a good idea, but now I should try
hard to break it.

Now that I see the problem mentioned again, I feel myself kind of "ignited".
I expect to have some leisure time at the end of February, so I'll test the
patch and post my results early in March.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2016-02-04 Thread Peter Eisentraut
On 12/29/15 10:38 AM, Bruce Momjian wrote:
> On Thu, Dec 10, 2015 at 11:10:55AM -0500, Robert Haas wrote:
>> On Wed, Dec 9, 2015 at 12:15 AM, Pavel Stehule  
>> wrote:
 On Wed, Dec 9, 2015 at 2:07 PM, Pavel Stehule 
 wrote:
> should be noted, recorded somewhere so this introduce possible
> compatibility
> issue - due default processing .psqlrc.

 That's written in the commit log, so I guess that's fine.
>>>
>>> ook
>>
>> Bruce uses the commit log to prepare the release notes, so I guess
>> he'll make mention of this.
> 
> Yes, I will certainly see it.

I generally use the master branch psql for normal work, and this change
has caused massive breakage for me.  It's straightforward to fix, but in
some cases the breakage is silent, for example if you do
something=$(psql -c ...) and the .psqlrc processing causes additional
output.  I'm not sure what to make of it yet, but I want to mention it,
because I fear there will be heartache.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2016-01 Commitfest

2016-02-04 Thread Fabien COELHO


Hello Andres,


I'm working on/off on the checkpointer flushing thing, so I don't think
it makes sense for somebody else to take that over at this point.


Yep.

I still wish you could share your current working version? The last 
version sent is from November 11th, and I think someone said that it does 
not apply cleanly anymore on head.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Thom Brown
On 4 February 2016 at 14:34, Fujii Masao  wrote:
> On Wed, Feb 3, 2016 at 11:00 AM, Robert Haas  wrote:
>> On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao  wrote:
>>> So you disagree with only third version that I proposed, i.e.,
>>> adding some hooks for sync replication? If yes and you're OK
>>> with the first and second versions, ISTM that we almost reached
>>> consensus on the direction of multiple sync replication feature.
>>> The first version can cover "one local and one remote sync standbys" case,
>>> and the second can cover "one local and at least one from several remote
>>> standbys" case. I'm thinking to focus on the first version now,
>>> and then we can work on the second to support the quorum commit
>>
>> Well, I think the only hard part of the third problem is deciding on
>> what syntax to use.  It seems like a waste of time to me to go to a
>> bunch of trouble to implement #1 and #2 using one syntax and then have
>> to invent a whole new syntax for #3.  Seriously, this isn't that hard:
>> it's not a technical problem.  It's just that we've got a bunch of
>> people who can't agree on what syntax to use.  IMO, you should just
>> pick something.  You're presumably the committer for this patch, and I
>> think you should just decide which of the 47,123 things proposed so
>> far is best and insist on that.  I trust that you will make a good
>> decision even if it's different than the decision that I would have
>> made.
>
> If we use one syntax for every cases, possible approaches that we can choose
> are mini-language, json, etc. Since my previous proposal covers only very
> simple cases, extra syntax needs to be supported for more complicated cases.
> My plan was to add the hooks so that the developers can choose their own
> syntax. But which might confuse users.
>
> Now I'm thinking that mini-language is better choice. A json has some good
> points, but its big problem is that the setting value is likely to be very 
> long.
> For example, when the master needs to wait for one local standby and
> at least one from three remote standbys in London data center, the setting
> value (synchronous_standby_names) would be
>
>   s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1,
> "nodes":["london1", "london2", "london3"]}]}'
>
> OTOH, the value with mini-language is simple and not so long as follows.
>
>   s_s_names = '2[local1, 1(london1, london2, london3)]'
>
> This is why I'm now thinking that mini-language is better. But it's not easy
> to completely implement mini-language. There seems to be many problems
> that we need to resolve. For example, please imagine the case where
> the master needs to wait for at least one from two standbys "tokyo1", "tokyo2"
> in Tokyo data center. If Tokyo data center fails, the master needs to
> wait for at least one from two standbys "london1", "london2" in London
> data center, instead. This case can be configured as follows in mini-language.
>
>   s_s_names = '1[1(tokyo1, tokyo2), 1(london1, london2)]'
>
> One problem here is; what pg_stat_replication.sync_state value should be
> shown for each standbys? Which standby should be marked as sync? potential?
> any other value like quorum? The current design of pg_stat_replication
> doesn't fit complicated sync replication cases, so maybe we need to separate
> it into several views. It's almost impossible to complete those problems.
>
> My current plan for 9.6 is to support the minimal subset of mini-language;
> simple syntax of "[name, ...]". "" specifies the number of
> sync standbys that the master needs to wait for. "[name, ...]" specifies
> the priorities of the listed standbys. This first version supports neither
> quorum commit nor nested sync replication configuration like
> "[name, [name, ...]]". It just supports very simple
> "1-level" configuration.

Whatever the solution, I'm really don't like the idea of changing the
definition of s_s_names based on the value of another GUC, mainly
because it seems hacky, but also because the name of the GUC stops
making sense.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idle In Transaction Session Timeout, revived

2016-02-04 Thread Fujii Masao
On Sun, Jan 31, 2016 at 10:33 PM, Vik Fearing  wrote:
> Attached is a rebased and revised version of my
> idle_in_transaction_session_timeout patch from last year.
>
> This version does not suffer the problems the old one did where it would
> jump out of SSL code thanks to Andres' patch in commit
> 4f85fde8eb860f263384fffdca660e16e77c7f76.
>
> The basic idea is if a session remains idle in a transaction for longer
> than the configured time, that connection will be dropped thus releasing
> the connection slot and any locks that may have been held by the broken
> client.

+1

But, IIRC, one of the problems that prevent the adoption of this feature is
the addition of gettimeofday() call after every SQL command receipt.
Have you already resolved that problem? Or we don't need to care about
it because it's almost harmless?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Peter Geoghegan
On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown  wrote:
>  public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
>  public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
>  public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |

I see the same.

I use my regular SQL query to see the breakdown of leaf/internal/root pages:

postgres=# with tots as (
  SELECT count(*) c,
  avg(live_items) avg_live_items,
  avg(dead_items) avg_dead_items,
  u.type,
  r.oid
  from (select c.oid,
  c.relpages,
  generate_series(1, c.relpages - 1) i
  from pg_index i
  join pg_opclass op on i.indclass[0] = op.oid
  join pg_am am on op.opcmethod = am.oid
  join pg_class c on i.indexrelid = c.oid
  where am.amname = 'btree') r,
lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
  group by r.oid, type)
select ct.relname table_name,
  tots.oid::regclass::text index_name,
  (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
  upper(type) page_type,
  c npages,
  to_char(avg_live_items, '990.999'),
  to_char(avg_dead_items, '990.999'),
  to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || '
%' as prop_of_index
  from tots
  join pg_index i on i.indexrelid = tots.oid
  join pg_class ct on ct.oid = i.indrelid
  where tots.oid = 'pgbench_accounts_pkey'::regclass
  order by ct.relnamespace, table_name, index_name, npages, type;
table_name│  index_name   │ non_meta_pages │ page_type
│ npages │ to_char  │ to_char  │ prop_of_index
──┼───┼┼───┼┼──┼──┼───
 pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ R
│  1 │   97.000 │0.000 │0.004 %
 pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ I
│ 97 │  282.670 │0.000 │0.354 %
 pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ L
│ 27,323 │  366.992 │0.000 │   99.643 %
(3 rows)

But this looks healthy -- I see the same with master. And since the
accounts table is listed as 1281 MB, this looks like a plausible ratio
in the size of the table to its primary index (which I would not say
is true of an 87MB primary key index).

Are you sure you have the details right, Thom?
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idle In Transaction Session Timeout, revived

2016-02-04 Thread Alvaro Herrera
David Steele wrote:

> > <...> But what I think really happens is
> > some badly-written Java application loses track of a connection
> > someplace and just never finds it again. <...>

I've seen that also, plenty of times.

> That's what I've seen over and over again.  And then sometimes it's not
> a badly-written Java application, but me, and in that case I definitely
> want the connection killed.  Without logging, if you please.

So the way to escape audit logging is to open a transaction, steal some
data, then leave the connection open so that it's not logged when it's
killed?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Victor Wagner
On Thu, 4 Feb 2016 12:59:03 +0300
Michael Paquier  wrote:

 for it?
> 
> > 1) Better to raise more meaningful error when IPC::Run is absent.  
> 
> This has been discussed before, and as far as I recall the current
> behavior has been concluded as being fine. That's where
> --enable-tap-tests becomes meaningful.

Really, it is not so hard to add configure checks for perl modules.
And we need to test not only for IPC::Run, but for Test::More too,
because some Linux distributions put modules which come with perl into
separate package.

The only problem that most m4 files with tests for perl modules, which
can be found in the Internet, have GPL license, so someone have either
to write his own and publish under PostgreSQL license or contact
author of one of them and ask to publish it under PostgreSQL license.

First seems to be much easier.
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Fujii Masao
On Wed, Feb 3, 2016 at 11:00 AM, Robert Haas  wrote:
> On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao  wrote:
>> So you disagree with only third version that I proposed, i.e.,
>> adding some hooks for sync replication? If yes and you're OK
>> with the first and second versions, ISTM that we almost reached
>> consensus on the direction of multiple sync replication feature.
>> The first version can cover "one local and one remote sync standbys" case,
>> and the second can cover "one local and at least one from several remote
>> standbys" case. I'm thinking to focus on the first version now,
>> and then we can work on the second to support the quorum commit
>
> Well, I think the only hard part of the third problem is deciding on
> what syntax to use.  It seems like a waste of time to me to go to a
> bunch of trouble to implement #1 and #2 using one syntax and then have
> to invent a whole new syntax for #3.  Seriously, this isn't that hard:
> it's not a technical problem.  It's just that we've got a bunch of
> people who can't agree on what syntax to use.  IMO, you should just
> pick something.  You're presumably the committer for this patch, and I
> think you should just decide which of the 47,123 things proposed so
> far is best and insist on that.  I trust that you will make a good
> decision even if it's different than the decision that I would have
> made.

If we use one syntax for every cases, possible approaches that we can choose
are mini-language, json, etc. Since my previous proposal covers only very
simple cases, extra syntax needs to be supported for more complicated cases.
My plan was to add the hooks so that the developers can choose their own
syntax. But which might confuse users.

Now I'm thinking that mini-language is better choice. A json has some good
points, but its big problem is that the setting value is likely to be very long.
For example, when the master needs to wait for one local standby and
at least one from three remote standbys in London data center, the setting
value (synchronous_standby_names) would be

  s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1,
"nodes":["london1", "london2", "london3"]}]}'

OTOH, the value with mini-language is simple and not so long as follows.

  s_s_names = '2[local1, 1(london1, london2, london3)]'

This is why I'm now thinking that mini-language is better. But it's not easy
to completely implement mini-language. There seems to be many problems
that we need to resolve. For example, please imagine the case where
the master needs to wait for at least one from two standbys "tokyo1", "tokyo2"
in Tokyo data center. If Tokyo data center fails, the master needs to
wait for at least one from two standbys "london1", "london2" in London
data center, instead. This case can be configured as follows in mini-language.

  s_s_names = '1[1(tokyo1, tokyo2), 1(london1, london2)]'

One problem here is; what pg_stat_replication.sync_state value should be
shown for each standbys? Which standby should be marked as sync? potential?
any other value like quorum? The current design of pg_stat_replication
doesn't fit complicated sync replication cases, so maybe we need to separate
it into several views. It's almost impossible to complete those problems.

My current plan for 9.6 is to support the minimal subset of mini-language;
simple syntax of "[name, ...]". "" specifies the number of
sync standbys that the master needs to wait for. "[name, ...]" specifies
the priorities of the listed standbys. This first version supports neither
quorum commit nor nested sync replication configuration like
"[name, [name, ...]]". It just supports very simple
"1-level" configuration.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-04 Thread Amit Kapila
On Thu, Feb 4, 2016 at 6:40 PM, Andres Freund  wrote:
>
> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
> > I think generally it is good idea, but one thing worth a thought is that
> > by doing so, we need to acquire all WAL Insertion locks every
> > LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
> > every slot, do you think it is matter of concern in any way for write
> > workloads or it won't effect as we need to do this periodically?
>
> Michael and I just had an in-person discussion, and one of the topics
> was that. The plan was basically to adapt the patch to:
> 1) Store the progress lsn inside the wal insert lock
> 2) Change the HasActivity API to return an the last LSN at which there
>was activity, instead of a boolean.
> 2) Individually acquire each insert locks's lwlock to get it's progress
>LSN, but not the exclusive insert lock. We need the lwllock to avoid
>a torn 8byte read on some platforms.
>
> I think that mostly should address your concerns?
>

Yes, this sounds better and in-anycase we can do some benchmarks
to verify the same once patch is in shape.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Idle In Transaction Session Timeout, revived

2016-02-04 Thread David Steele
On 2/4/16 5:00 AM, Alvaro Herrera wrote:
> David Steele wrote:
> 
>>> <...> But what I think really happens is
>>> some badly-written Java application loses track of a connection
>>> someplace and just never finds it again. <...>
> 
> I've seen that also, plenty of times.
> 
>> That's what I've seen over and over again.  And then sometimes it's not
>> a badly-written Java application, but me, and in that case I definitely
>> want the connection killed.  Without logging, if you please.
> 
> So the way to escape audit logging is to open a transaction, steal some
> data, then leave the connection open so that it's not logged when it's
> killed?

Well, of course I was joking, but even so I only meant the disconnect
shouldn't be logged to save me embarrassment.

But you are probably joking as well.  Oh, what a tangled web.

-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] pg_dump data structures for triggers

2016-02-04 Thread Tom Lane
Vik Fearing  writes:
> On 02/04/2016 01:44 AM, Tom Lane wrote:
>> I'm looking into fixing the problem reported here:
>> http://www.postgresql.org/message-id/1445a624-d09f-4b51-9c41-46ba1e2d6...@neveragain.de
>> namely that if we split a view into a table + rule (because of circular
>> dependencies), parallel pg_restore fails to ensure that it creates any
>> triggers for the view only after creating the rule.  If it tries to
>> create the triggers first, the backend may barf because they're the wrong
>> type of triggers for a plain table.

> No objections to this, but my "better idea" is simply to allow INSTEAD
> OF triggers on tables like discussed last year.
> http://www.postgresql.org/message-id/14c6fe168a9-1012-10...@webprd-a87.mail.aol.com

That sounds like a new feature, and not something we'd want to backpatch.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Robert Haas
On Thu, Feb 4, 2016 at 11:55 AM, Ashutosh Bapat
 wrote:
> A query with FOR UPDATE/SHARE will be considered parallel unsafe in
> has_parallel_hazard_walker() and root->glob->parallelModeOK will be marked
> false. This implies that none of the base relations and hence join relations
> will be marked as consider_parallel. IIUC your logic, none of the queries
> with FOR UPDATE/SHARE will get a local path which is marked parallel_safe
> and thus join will not be pushed down. Why do you think we need to skip
> paths that aren't parallel_safe? I have left aside this change in the latest
> patches.

I changed this back before committing but, ah nuts, you're right.  Sigh.  Sorry.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-02-04 Thread Etsuro Fujita

On 2016/02/04 19:44, Rushabh Lathia wrote:

I just started reviewing this and realized that patch is not getting applied
cleanly on latest source, it having some conflicts. Can you please upload
the correct version of patch.


I rebased the patch to the latest HEAD.  Attached is a rebased version 
of the patch.  You don't need to apply the patch 
fdw-foreign-modify-rmh-v2.patch attached before.


Thanks for the review!

Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
***
*** 1069,1074  deparseUpdateSql(StringInfo buf, PlannerInfo *root,
--- 1069,1134 
  }
  
  /*
+  * deparse remote UPDATE statement
+  *
+  * The statement text is appended to buf, and we also create an integer List
+  * of the columns being retrieved by RETURNING (if any), which is returned
+  * to *retrieved_attrs.
+  */
+ void
+ deparsePushedDownUpdateSql(StringInfo buf, PlannerInfo *root,
+ 		   Index rtindex, Relation rel,
+ 		   List	*targetlist,
+ 		   List *targetAttrs,
+ 		   List	*remote_conds,
+ 		   List **params_list,
+ 		   List *returningList,
+ 		   List **retrieved_attrs)
+ {
+ 	RelOptInfo *baserel = root->simple_rel_array[rtindex];
+ 	deparse_expr_cxt context;
+ 	int			nestlevel;
+ 	bool		first;
+ 	ListCell   *lc;
+ 
+ 	/* Set up context struct for recursion */
+ 	context.root = root;
+ 	context.foreignrel = baserel;
+ 	context.buf = buf;
+ 	context.params_list = params_list;
+ 
+ 	appendStringInfoString(buf, "UPDATE ");
+ 	deparseRelation(buf, rel);
+ 	appendStringInfoString(buf, " SET ");
+ 
+ 	/* Make sure any constants in the exprs are printed portably */
+ 	nestlevel = set_transmission_modes();
+ 
+ 	first = true;
+ 	foreach(lc, targetAttrs)
+ 	{
+ 		int			attnum = lfirst_int(lc);
+ 		TargetEntry *tle = get_tle_by_resno(targetlist, attnum);
+ 
+ 		if (!first)
+ 			appendStringInfoString(buf, ", ");
+ 		first = false;
+ 
+ 		deparseColumnRef(buf, rtindex, attnum, root);
+ 		appendStringInfoString(buf, " = ");
+ 		deparseExpr((Expr *) tle->expr, );
+ 	}
+ 
+ 	reset_transmission_modes(nestlevel);
+ 
+ 	if (remote_conds)
+ 		appendWhereClause(remote_conds, );
+ 
+ 	deparseReturningList(buf, root, rtindex, rel, false,
+ 		 returningList, retrieved_attrs);
+ }
+ 
+ /*
   * deparse remote DELETE statement
   *
   * The statement text is appended to buf, and we also create an integer List
***
*** 1091,1096  deparseDeleteSql(StringInfo buf, PlannerInfo *root,
--- 1151,1190 
  }
  
  /*
+  * deparse remote DELETE statement
+  *
+  * The statement text is appended to buf, and we also create an integer List
+  * of the columns being retrieved by RETURNING (if any), which is returned
+  * to *retrieved_attrs.
+  */
+ void
+ deparsePushedDownDeleteSql(StringInfo buf, PlannerInfo *root,
+ 		   Index rtindex, Relation rel,
+ 		   List	*remote_conds,
+ 		   List	**params_list,
+ 		   List *returningList,
+ 		   List **retrieved_attrs)
+ {
+ 	RelOptInfo *baserel = root->simple_rel_array[rtindex];
+ 	deparse_expr_cxt context;
+ 
+ 	/* Set up context struct for recursion */
+ 	context.root = root;
+ 	context.foreignrel = baserel;
+ 	context.buf = buf;
+ 	context.params_list = params_list;
+ 
+ 	appendStringInfoString(buf, "DELETE FROM ");
+ 	deparseRelation(buf, rel);
+ 
+ 	if (remote_conds)
+ 		appendWhereClause(remote_conds, );
+ 
+ 	deparseReturningList(buf, root, rtindex, rel, false,
+ 		 returningList, retrieved_attrs);
+ }
+ 
+ /*
   * Add a RETURNING clause, if needed, to an INSERT/UPDATE/DELETE.
   */
  static void
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 1314,1320  INSERT INTO ft2 (c1,c2,c3)
--- 1314,1339 
  (3 rows)
  
  INSERT INTO ft2 (c1,c2,c3) VALUES (1104,204,'ddd'), (1105,205,'eee');
+ EXPLAIN (verbose, costs off)
+ UPDATE ft2 SET c2 = c2 + 300, c3 = c3 || '_update3' WHERE c1 % 10 = 3;  -- can be pushed down
+   QUERY PLAN  
+ --
+  Update on public.ft2
+->  Foreign Update on public.ft2
+  Remote SQL: UPDATE "S 1"."T 1" SET c2 = (c2 + 300), c3 = (c3 || '_update3'::text) WHERE ((("C 1" % 10) = 3))
+ (3 rows)
+ 
  UPDATE ft2 SET c2 = c2 + 300, c3 = c3 || '_update3' WHERE c1 % 10 = 3;
+ EXPLAIN (verbose, costs off)
+ UPDATE ft2 SET c2 = c2 + 400, c3 = c3 || '_update7' WHERE c1 % 10 = 7 RETURNING *;  -- can be pushed down
+ QUERY PLAN
+ --

Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2016-02-04 Thread Amit Kapila
On Fri, Feb 5, 2016 at 3:17 AM, Robert Haas  wrote:
>
> On Thu, Feb 4, 2016 at 7:00 AM, Amit Kapila 
wrote:
> > [ new patch ]
>
> I've committed this after heavy rewriting.  In particular, I merged
> two loops, used local variables to get rid of the very long and quite
> repetitive lines, and did various cleanup on the documentation and
> comments.
>

I think there is a typo in the committed code, patch to fix the same
is attached.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


typo_named_extensions_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-04 Thread Kouhei Kaigai
> -Original Message-
> From: Jim Nasby [mailto:jim.na...@bluetreble.com]
> Sent: Friday, February 05, 2016 9:17 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org; Robert Haas
> Cc: Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
> >> 2. A feature to suspend i/o write-out towards a particular blocks
> >> >that are registered by other concurrent backend, unless it is not
> >> >unregistered (usually, at the end of P2P DMA).
> >> >==> to be discussed.
> 
> I think there's still a race condition here though...
> 
> A
> finds buffer not in shared buffers
> 
> B
> reads buffer in
> modifies buffer
> starts writing buffer to OS
> 
> A
> Makes call to block write, but write is already in process; thinks
> writes are now blocked
> Reads corrupted block
> Much hilarity ensues
> 
> Or maybe you were just glossing over that part for brevity.
> 
> ...
> 
> > I tried to design a draft of enhancement to realize the above i/o write-out
> > suspend/resume, with less invasive way as possible as we can.
> >
> >ASSUMPTION: I intend to implement this feature as a part of extension,
> >because this i/o suspend/resume checks are pure overhead increment
> >for the core features, unless extension which utilizes it.
> >
> > Three functions shall be added:
> >
> > extern intGetStorageMgrNumbers(void);
> > extern f_smgr GetStorageMgrHandlers(int smgr_which);
> > extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
> >
> > As literal, GetStorageMgrNumbers() returns the number of storage manager
> > currently installed. It always return 1 right now.
> > GetStorageMgrHandlers() returns the currently configured f_smgr table to
> > the supplied smgr_which. It allows extensions to know current configuration
> > of the storage manager, even if other extension already modified it.
> > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> > the current one.
> > If extension wants to intermediate 'smgr_write', extension will replace
> > the 'smgr_write' by own function, then call the original function, likely
> > mdwrite, from the alternative function.
> >
> > In this case, call chain shall be:
> >
> >FlushBuffer, and others...
> > +-- smgrwrite(...)
> >  +-- (extension's own function)
> >   +-- mdwrite
> 
> ISTR someone (Robert Haas?) complaining that this method of hooks is
> cumbersome to use and can be fragile if multiple hooks are being
> installed. So maybe we don't want to extend it's usage...
> 
> I'm also not sure whether this is better done with an smgr hook or a
> hook into shared buffer handling...
>
# sorry, I oversight the later part of your reply.

I can agree that smgr hooks shall be primarily designed to make storage
systems pluggable, even if we can use this hooks for suspend & resume of
write i/o stuff.
In addition, "pluggable storage" is a long-standing feature, even though
it is not certain whether existing smgr hooks are good starting point.
It may be a risk if we implement a grand feature on top of the hooks
but out of its primary purpose.

So, my preference is a mechanism to hook buffer write to implement this
feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
has nearly zero cost when no extension activate the feature.)
One downside of this approach is larger number of hook points.
We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
and FlushRelationBuffers, in addition to FlushBuffer, at least.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries

2016-02-04 Thread Haribabu Kommi
On Tue, Feb 2, 2016 at 8:57 AM, Alvaro Herrera  wrote:
> Haribabu Kommi wrote:
>
>> Hi, Do you have any further comments on the patch that needs to be
>> taken care?
>
> I do.  I think the jsonb functions you added should be added to one of
> the json .c files instead, since they seem of general applicability.

moved these functions into jsonb_util.c file.

> But actually, I don't think you have ever replied to the question of why
> are you using JSON in the first place; isn't a normal array suitable?

It was discussed and told to use JSON for options instead of array in [1],
because of that reason I changed.

> The callback stuff is not documented in check_hba() at all.  Can you
> please add an explanation just above the function, so that people trying
> to use it know what can the callback be used for?  Also a few lines
> above the callback itself would be good.

Added some details in explaining the call back function.

>Also, the third argument of
> check_hba() is a translatable message so you should enclose it in _() so
> that it is picked up for translation.  The "skipped"/"matched" words
> (and maybe others?) need to be marked similarly.

Changed mode column (skipped/matched) and reason for mismatch details
are enclosed in _() for translation. Do you find any other details needs to be
enclosed?

> That "Failed" in the errmsg in pg_hba_lookup() should be lowercase.

corrected.

> Moving to next CF.

Thanks. updated patch is attached with the above corrections.
This patch needs to be applied on top discard_hba_and_ident_cxt patch
that is posted earlier.

[1] - http://www.postgresql.org/message-id/5547db0a.2020...@gmx.net

Regards,
Hari Babu
Fujitsu Australia


pg_hba_lookup_poc_v12.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Odd behavior in foreign table modification (Was: Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW)

2016-02-04 Thread Robert Haas
On Thu, Jan 28, 2016 at 7:36 AM, Etsuro Fujita
 wrote:
> Attached is that version of the patch.
>
> I think that postgres_fdw might be able to insert a tableoid value in the
> returned slot in e.g., postgresExecForeignInsert if AFTER ROW Triggers or
> RETURNING expressions reference that value, but I didn't do anything about
> that.

Thanks.  I went with the earlier version, but split it into two commits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-04 Thread Kouhei Kaigai
> On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
> >> 2. A feature to suspend i/o write-out towards a particular blocks
> >> >that are registered by other concurrent backend, unless it is not
> >> >unregistered (usually, at the end of P2P DMA).
> >> >==> to be discussed.
> 
> I think there's still a race condition here though...
> 
> A
> finds buffer not in shared buffers
> 
> B
> reads buffer in
> modifies buffer
> starts writing buffer to OS
> 
> A
> Makes call to block write, but write is already in process; thinks
> writes are now blocked
> Reads corrupted block
> Much hilarity ensues
> 
> Or maybe you were just glossing over that part for brevity.
>
Thanks, this part was not clear from my previous description.

At the time when B starts writing buffer to OS, extension will catch
this i/o request using a hook around the smgrwrite, then the mechanism
registers the block to block P2P DMA request during B's write operation.
(Of course, it unregisters the block at end of the smgrwrite)
So, even if A wants to issue P2P DMA concurrently, it cannot register
the block until B's write operation.

In practical, this operation shall be "try lock", because B's write
operation implies existence of the buffer in main memory, so B does
not need to wait A's write operation if B switch DMA source from SSD
to main memory.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> ...
> 
> > I tried to design a draft of enhancement to realize the above i/o write-out
> > suspend/resume, with less invasive way as possible as we can.
> >
> >ASSUMPTION: I intend to implement this feature as a part of extension,
> >because this i/o suspend/resume checks are pure overhead increment
> >for the core features, unless extension which utilizes it.
> >
> > Three functions shall be added:
> >
> > extern intGetStorageMgrNumbers(void);
> > extern f_smgr GetStorageMgrHandlers(int smgr_which);
> > extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
> >
> > As literal, GetStorageMgrNumbers() returns the number of storage manager
> > currently installed. It always return 1 right now.
> > GetStorageMgrHandlers() returns the currently configured f_smgr table to
> > the supplied smgr_which. It allows extensions to know current configuration
> > of the storage manager, even if other extension already modified it.
> > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> > the current one.
> > If extension wants to intermediate 'smgr_write', extension will replace
> > the 'smgr_write' by own function, then call the original function, likely
> > mdwrite, from the alternative function.
> >
> > In this case, call chain shall be:
> >
> >FlushBuffer, and others...
> > +-- smgrwrite(...)
> >  +-- (extension's own function)
> >   +-- mdwrite
> 
> ISTR someone (Robert Haas?) complaining that this method of hooks is
> cumbersome to use and can be fragile if multiple hooks are being
> installed. So maybe we don't want to extend it's usage...
> 
> I'm also not sure whether this is better done with an smgr hook or a
> hook into shared buffer handling...
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 
> 
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count_nulls(VARIADIC "any")

2016-02-04 Thread Tom Lane
I wrote:
> Pavel Stehule  writes:
>> [ num_nulls_v6.patch ]

> I started looking through this.  It seems generally okay, but I'm not
> very pleased with the function name "num_notnulls".  I think it would
> be better as "num_nonnulls", as I see Oleksandr suggested already.

Not hearing any complaints, I pushed it with that change and some other
cosmetic adjustments.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Odd behavior in foreign table modification (Was: Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW)

2016-02-04 Thread Etsuro Fujita

On 2016/02/05 12:28, Robert Haas wrote:

On Thu, Jan 28, 2016 at 7:36 AM, Etsuro Fujita
 wrote:

Attached is that version of the patch.

I think that postgres_fdw might be able to insert a tableoid value in the
returned slot in e.g., postgresExecForeignInsert if AFTER ROW Triggers or
RETURNING expressions reference that value, but I didn't do anything about
that.



Thanks.  I went with the earlier version, but split it into two commits.


Thank you!

Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count_nulls(VARIADIC "any")

2016-02-04 Thread Pavel Stehule
Dne 5. 2. 2016 1:33 napsal uživatel "Tom Lane" :
>
> Pavel Stehule  writes:
> > [ num_nulls_v6.patch ]
>
> I started looking through this.  It seems generally okay, but I'm not
> very pleased with the function name "num_notnulls".  I think it would
> be better as "num_nonnulls", as I see Oleksandr suggested already.

I have no problem with it.

Regards

Pavel
>
> regards, tom lane


Re: [HACKERS] Incorrect formula for SysV IPC parameters

2016-02-04 Thread Kyotaro HORIGUCHI
At Thu, 4 Feb 2016 21:43:04 +0900, Fujii Masao  wrote in 

> On Wed, Feb 3, 2016 at 12:51 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello, I found that the formulas to calculate SEMMNI and SEMMNS
> > are incorrect in 9.2 and later.
> >
> > http://www.postgresql.org/docs/9.5/static/kernel-resources.html
> >
> > But actually the number of semaphores PostgreSQL needs is
> > calculated as following in 9.4 and later.
...
> > So, the formula for SEMMNI should be
> >
> > ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) 
> > / 16)
> >
> > and SEMMNS should have the same fix.
> >
> >
> > In 9.3 and 9.2, the documentation says the same thing but
...
> > ceil((max_connections + autovacuum_max_workers + 5) / 16)
> >
> > In 9.1, NUM_AUXILIARY_PROCS is 3 so the documentations is correct.
> 
> Good catch!

Thanks.

> ISTM that you also need to change the descriptions about SEMMNI and SEMMNS
> under the Table 17-1.

Oops! Thank you for pointing it out.

The original description doesn't mention the magic-number ('5' in
the above formulas, which previously was '4') so I haven't added
anything about it.

Process of which the number is limited by max_worker_processes is
called 'background process' (not 'backend worker') in the
documentation so I used the name to mention it in the additional
description.

The difference in the body text for 9.2, 9.3 is only a literal
'4' to '5' in the formula.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 6ed2f296cc5899f75a2e817f470e5da07bcb0d2c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 3 Feb 2016 11:35:43 +0900
Subject: [PATCH] Fix the formula to calculate SEMMNI and SEMMNS in
 documentation

---
 doc/src/sgml/runtime.sgml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 0db3807..2bc0a91 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -645,13 +645,13 @@ psql: could not connect to server: No such file or directory

 SEMMNI
 Maximum number of semaphore identifiers (i.e., sets)
-at least ceil((max_connections + autovacuum_max_workers + 4) / 16)
+at least ceil((max_connections + autovacuum_max_workers + 5) / 16)

 

 SEMMNS
 Maximum number of semaphores system-wide
-ceil((max_connections + autovacuum_max_workers + 4) / 16) * 17 plus room for other applications
+ceil((max_connections + autovacuum_max_workers + 5) / 16) * 17 plus room for other applications

 

@@ -712,7 +712,7 @@ psql: could not connect to server: No such file or directory
 linkend="sysvipc-parameters">).  The parameter SEMMNI
 determines the limit on the number of semaphore sets that can
 exist on the system at one time.  Hence this parameter must be at
-least ceil((max_connections + autovacuum_max_workers + 4) / 16).
+least ceil((max_connections + autovacuum_max_workers + 5) / 16).
 Lowering the number
 of allowed connections is a temporary workaround for failures,
 which are usually confusingly worded No space
-- 
1.8.3.1

>From c196c82665e4081a6f90938ac86bb63e4e60e45c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 3 Feb 2016 11:07:12 +0900
Subject: [PATCH] Fix the formula to calculate SEMMNI and SEMMNS in
 documentation

---
 doc/src/sgml/runtime.sgml | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index cda05f5..dd42dd0 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -645,13 +645,13 @@ psql: could not connect to server: No such file or directory

 SEMMNI
 Maximum number of semaphore identifiers (i.e., sets)
-at least ceil((max_connections + autovacuum_max_workers + 4) / 16)
+at least ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16)

 

 SEMMNS
 Maximum number of semaphores system-wide
-ceil((max_connections + autovacuum_max_workers + 4) / 16) * 17 plus room for other applications
+ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16) * 17 plus room for other applications

 

@@ -699,20 +699,22 @@ psql: could not connect to server: No such file or directory
 

 PostgreSQL uses one semaphore per allowed connection
-() and allowed autovacuum worker
-process (), in sets of 16.
+(), allowed autovacuum worker process
+() and allowed background
+process (), in sets of 16.
 Each such set will
 also contain a 17th semaphore which contains a magic
 number, to detect collision with semaphore sets used by
 other 

Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Robert Haas
On Thu, Feb 4, 2016 at 9:34 AM, Fujii Masao  wrote:
> Now I'm thinking that mini-language is better choice. A json has some good
> points, but its big problem is that the setting value is likely to be very 
> long.
> For example, when the master needs to wait for one local standby and
> at least one from three remote standbys in London data center, the setting
> value (synchronous_standby_names) would be
>
>   s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1,
> "nodes":["london1", "london2", "london3"]}]}'
>
> OTOH, the value with mini-language is simple and not so long as follows.
>
>   s_s_names = '2[local1, 1(london1, london2, london3)]'

Yeah, that was my thought also.  Another idea which was suggested is
to create a completely new configuration file for this.  Most people
would only have simple stuff in there, of course, but then you could
have the information spread across multiple lines.

I don't in the end care very much about how we solve this problem.
But I'm glad you agree that whatever we do to solve the simple problem
should be a logical subset of what the full solution will eventually
look like, not a completely different design.  I think that's
important.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Peter Geoghegan
On Thu, Feb 4, 2016 at 8:25 AM, Thom Brown  wrote:
>
> No, I'm not.  I've just realised that all I've been checking is the
> primary key expecting it to change in size, which is, of course,
> nonsense.  I should have been creating an index on the bid field of
> pgbench_accounts and reviewing the size of that.

Right. Because, apart from everything else, unique indexes are not
currently supported.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [patch] Proposal for \crosstabview in psql

2016-02-04 Thread Pavel Stehule
Hi

I tested last version, v11 and I have not any objection

It is working as expected

all regress tests passed, there is related documentation and new test is
attached.

This patch is ready form commiter.

Daniel, thank you very much, it is interesting feature.

Regards

Pavel


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Kyotaro HORIGUCHI
Hello, I'm studying this.

Two hunks in 0003 needed a fix but the other part applied cleanly
on master.

At Fri, 22 Jan 2016 15:17:51 +0900, Michael Paquier  
wrote in 

Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-02-04 Thread Peter Geoghegan
On Sun, Jan 31, 2016 at 7:59 PM, Andreas Karlsson  wrote:
> Nice work, I like your sorting patches.

Thanks. I like your reviews of my sorting patches. :-)

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: PL/Pythonu - function ereport

2016-02-04 Thread Catalin Iacob
On Tue, Feb 2, 2016 at 11:52 PM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> The eventual committer is likely to be much happier with this patch if
>> you guys have achieved consensus among yourselves on the best
>> approach.
>
> Actually I imagine that if there's no agreement between author and first
> reviewer, there might not *be* a committer in the first place.  Perhaps
> try to get someone else to think about it and make a decision.  It is
> possible that some other committer is able to decide by themselves but I
> wouldn't count on it.

Pavel and I agree that the backward incompatible behavior is cleaner,
but it's backward incompatible. Whether that extra cleanness is worth
the incompatibility is never obvious. In this case I think it does.
But since my opinion is just my opinion, I was planning to make the
committer be that someone else that weighs the issues and makes a
decision.

I'm new around here and picked this patch to get started due to having
Python knowledge and the patch seeming self contained enough. Being
new makes me wonder all the time "am I just making life difficult for
the patch author or is this idea genuinely good and therefore I should
push it forward?". I think more beginners that try to do reviews
struggle with this.

But, let's try to reach a decision. The existing behaviour dates back
to 0bef7ba Add plpython code by Bruce. I've added him to the mail,
maybe he can help us with a decision. I'll summarize the patch and
explain the incompatibility decision with some illustrative code.

The patch is trying to make it possible to call ereport from PL/Python
and provide the rich ereport information (detail, hint, sqlcode etc.).
There are already functions like plpy.info() but they only expose
message and they call elog instead of ereport.

See the attached incompat.py. Running it produces:

   existing behaviour
PG elog/ereport with message: 1: hi detail: None hint: None
PG elog/ereport with message: ('2: hi', 'another argument') detail:
None hint: None
PG elog/ereport with message: ('3: hi', 'another argument', 2) detail:
None hint: None
PG elog/ereport with message: ('4: hi', 'another argument', 2, 'lots',
'of', 'arguments') detail: None hint: None

 new behaviour
PG elog/ereport with message: 1: hi detail: None hint: None
PG elog/ereport with message: 2: hi detail: another argument hint: None
PG elog/ereport with message: 3: hi detail: another argument hint: 2
Traceback (most recent call last):
  File "incompat.py", line 43, in 
info_new('4: hi', 'another argument', 2, 'lots', 'of', 'arguments')
TypeError: info_new() takes at most 3 arguments (6 given)

I find existing behaviour for 2, 3 and 4 unlike other Python APIs I've
seen, surprising and not very useful. If I want to log a tuple I can
construct and pass a single tuple, I don't see why plpy.info() needs
to construct it for me. And for the documented, single argument call
nothing changes.

The question to Bruce (and others) is: is it ok to change to the new
behaviour illustrated and change meaning for usages like 2, 3 and 4?
If we don't want that, the solution Pavel and I see is introducing a
parallel API named plpy.raise_info or plpy.rich_info or something like
that with the new behaviour and leave the existing functions
unchanged. Another option is some compatibility GUC but I don't think
it's worth the trouble and confusion.

# simulated PG elog
def elog(message):
ereport(message)

# simulated PG ereport
def ereport(message, detail=None, hint=None):
print 'PG elog/ereport with message: %s detail: %s hint: %s' % (message, detail, hint)


# existing code behaves like this:
#   takes an unlimited number of arguments
#   doesn't take keyword arguments
#   makes a string representation of the tuple of all arguments unless that tuple has size 1 in which case it only makes a string representation of the first one
def info_existing(*args):
if len(args) == 1:
# special case added by Peter in 2e3b16
str_to_log = str(args[0])
else:
str_to_log = str(args)

elog(str_to_log)

# and I'm proposing to change it to do this:
#   take 1 required argument and extra optional arguments for every argument accepted by ereport
#   accepts keyword arguments
#   passing too many arguments will get rejected
def info_new(message, detail=None, hint=None):
ereport(message, detail, hint)

print 'existing behaviour'.center(40)
info_existing('1: hi')
info_existing('2: hi', 'another argument')
info_existing('3: hi', 'another argument', 2)
info_existing('4: hi', 'another argument', 2, 'lots', 'of', 'arguments')

print

print 'new behaviour'.center(40)
info_new('1: hi') # for the documented single argument case same behaviour as existing
info_new('2: hi', 'another argument')
info_new('3: hi', 'another argument', 2)
info_new('4: hi', 'another argument', 2, 'lots', 'of', 'arguments')



-- 
Sent via pgsql-hackers 

[HACKERS] Comment typo in slot.c

2016-02-04 Thread Michael Paquier
Hi all,

I just bumped into the following typo in slot.c:
/*
 * If we'd now fail - really unlikely - we wouldn't know
whether this slot
 * would persist after an OS crash or not - so, force a restart. The
-* restart would try to fysnc this again till it works.
+* restart would try to fsync this again till it works.
 */
START_CRIT_SECTION();
Regards,
-- 
Michael
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 251b549..11b44a4 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -984,7 +984,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 	/*
 	 * If we'd now fail - really unlikely - we wouldn't know whether this slot
 	 * would persist after an OS crash or not - so, force a restart. The
-	 * restart would try to fysnc this again till it works.
+	 * restart would try to fsync this again till it works.
 	 */
 	START_CRIT_SECTION();
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 9:18 PM, Victor Wagner wrote:
> It's quite good that patch sets standard of using 'use strict; use
> warnings;' in the test script.

FWIW, this is decided as being a standard rule for any modules/script
added in the main tree.

> It is bad, that Postgres-specific perl modules do not have embedded
> documentation. It would be nice to see POD documentation in the
> TestLib.pm and PostgresNode.pm instead of just comments. It would be
> much easier to test writers to read documentation using perldoc utility,
> rather than browse through the code.

Why not. I am no perlist but those prove to be helpful, however those
Postgres modules are not dedicated to a large audience, so we could
live without for now.

> I think that this patch should be commited as soon as possible in its
> current form (short of already reported reject in the PostgresNode.pm
> init function).

Thanks for your enthusiasm. Now, to do an auto-critic of my patch:

+   if ($params{allows_streaming})
+   {
+   print $conf "wal_level = hot_standby\n";
+   print $conf "max_wal_senders = 5\n";
+   print $conf "wal_keep_segments = 20\n";
+   print $conf "max_wal_size = 128MB\n";
+   print $conf "shared_buffers = 1MB\n";
+   print $conf "wal_log_hints = on\n";
+   print $conf "hot_standby = on\n";
+   }
This could have more thoughts, particularly for wal_log_hints which is
not used all the time, I think that we'd actually want to complete
that with an optional hash of parameter/values that get appended at
the end of the configuration file, then pass wal_log_hints in the
tests where it is needed. The default set of parameter is maybe fine
if done this way, still wal_keep_segments could be removed.

+# Tets for timeline switch
+# Encure that a standby is able to follow a newly-promoted standby
Two typos in two lines.
--
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 7:27 PM, Robert Haas  wrote:
> I don't in the end care very much about how we solve this problem.
> But I'm glad you agree that whatever we do to solve the simple problem
> should be a logical subset of what the full solution will eventually
> look like, not a completely different design.  I think that's
> important.

Yes, please let's use the custom language, and let's not care of not
more than 1 level of nesting so as it is possible to represent
pg_stat_replication in a simple way for the user.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Robert Haas
On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
 wrote:
> Yes, please let's use the custom language, and let's not care of not
> more than 1 level of nesting so as it is possible to represent
> pg_stat_replication in a simple way for the user.

"not" is used twice in this sentence in a way that renders me not able
to be sure that I'm not understanding it not properly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Ashutosh Bapat
On Thu, Feb 4, 2016 at 4:30 AM, Robert Haas  wrote:

> On Wed, Feb 3, 2016 at 5:56 PM, Robert Haas  wrote:
> > On Wed, Feb 3, 2016 at 12:08 PM, Ashutosh Bapat
> >  wrote:
> >> PFA patches with naming conventions similar to previous ones.
> >> pg_fdw_core_v7.patch: core changes
> >> pg_fdw_join_v7.patch: postgres_fdw changes for join pushdown
> >> pg_join_pd_v7.patch: combined patch for ease of testing.
> >
> > Hmm, I think that GetPathForEPQRecheck() is a pretty terrible name.
> > How about GetExistingJoinPath()?
>

GetExistingLocalJoinPath()? Used that.


>
> Oops.  Hit Send too soon.  Also, how about writing if
> (path->param_info != NULL) continue; instead of burying the core of
> the function in another level of indentation?


Hmm. Done.


> I think you should skip
> paths that aren't parallel_safe, too, and the documentation should be
> clear that this will find an unparameterized, parallel-safe joinpath
> if one exists.
>

A query with FOR UPDATE/SHARE will be considered parallel unsafe in
has_parallel_hazard_walker() and root->glob->parallelModeOK will be marked
false. This implies that none of the base relations and hence join
relations will be marked as consider_parallel. IIUC your logic, none of the
queries with FOR UPDATE/SHARE will get a local path which is marked
parallel_safe and thus join will not be pushed down. Why do you think we
need to skip paths that aren't parallel_safe? I have left aside this change
in the latest patches.



>
> +   ForeignPath *foreign_path;
> +   foreign_path = (ForeignPath
> *)joinpath->outerjoinpath;
>
> Maybe insert a blank line between here, and in the other, similar case.
>

Done.

Patches attached with the previous mail.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-04 Thread Andres Freund
On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
> I think generally it is good idea, but one thing worth a thought is that
> by doing so, we need to acquire all WAL Insertion locks every
> LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
> every slot, do you think it is matter of concern in any way for write
> workloads or it won't effect as we need to do this periodically?

Michael and I just had an in-person discussion, and one of the topics
was that. The plan was basically to adapt the patch to:
1) Store the progress lsn inside the wal insert lock
2) Change the HasActivity API to return an the last LSN at which there
   was activity, instead of a boolean.
2) Individually acquire each insert locks's lwlock to get it's progress
   LSN, but not the exclusive insert lock. We need the lwllock to avoid
   a torn 8byte read on some platforms.

I think that mostly should address your concerns?

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idle In Transaction Session Timeout, revived

2016-02-04 Thread Vik Fearing
On 02/04/2016 02:24 PM, Fujii Masao wrote:
> On Sun, Jan 31, 2016 at 10:33 PM, Vik Fearing  wrote:
>> Attached is a rebased and revised version of my
>> idle_in_transaction_session_timeout patch from last year.
>>
>> This version does not suffer the problems the old one did where it would
>> jump out of SSL code thanks to Andres' patch in commit
>> 4f85fde8eb860f263384fffdca660e16e77c7f76.
>>
>> The basic idea is if a session remains idle in a transaction for longer
>> than the configured time, that connection will be dropped thus releasing
>> the connection slot and any locks that may have been held by the broken
>> client.
> 
> +1
> 
> But, IIRC, one of the problems that prevent the adoption of this feature is
> the addition of gettimeofday() call after every SQL command receipt.
> Have you already resolved that problem? Or we don't need to care about
> it because it's almost harmless?

I guess it would be possible to look at MyBEEntry somehow and pull
st_state_start_timestamp from it to replace the call to
GetCurrentTimestamp(), but I don't know if it's worth doing that.

The extra call only happens if the timeout is enabled anyway, so I don't
think it matters enough to be a blocker.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2016-02-04 Thread Fabien COELHO


Hello Alvaro,


Something is wrong with patch d.  I noticed two things,

1. the total_weight stuff can overflow,


It can generate an error on overflow by checking the total_weight while 
it is being computed. I've switched total_weight to int64 so it is now 
really impossible to overflow with the 32 bit int_max limit on weight.



2. the chooseScript stuff is broken, or something.


Sorry, probably a <=/< error. I think I've fixed it and I've simplified 
the code a little bit.



Another thing is that the "transaction type" output really deserves some
more work.  I think "multiple scripts" really doesn't cut it; we should
have some YAML-like as in the latency reports, which lists all scripts
in use and their weights.


For me the current output is clear for the reader, which does not 
mean that it cannot be improve, but how is rather a matter of taste.


I've tried to improve it further, see attached patch.

If you want something else, it would help to provide a sample of what you 
expect.



Also, while I have your attention regarding accumulated "technical
debt", please have a look at the "desc" argument used in addScript etc.
It's pretty ridiculous currently.  Maybe findBuiltin / process_builtin /
process_file should return a struct containing Command ** and the
"desc" string, rather than passing desc as a separate argument.


Ok, it can return a pointer to the builtin script.


Attached is my version of the patch.  While you're messing with it, it'd
be nice if you added comments on top of your recently added functions
such as findBuiltin, process_builtin, chooseScript.


Why not.

Find attached a 18-d which addresses these concerns, and a actualized 18-e 
for the prefix.


--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ade1b53..780a520 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -262,11 +262,13 @@ pgbench  options  dbname
 
 
  
-  -b scriptname
-  --builtin scriptname
+  -b scriptname[@weight]
+  --builtin=scriptname[@weight]
   

 Add the specified builtin script to the list of executed scripts.
+An optional integer weight after @ allows to adjust the
+probability of drawing the script.  If not specified, it is set to 1.
 Available builtin scripts are: tpcb-like,
 simple-update and select-only.
 With special name list, show the list of builtin scripts
@@ -321,12 +323,14 @@ pgbench  options  dbname
  
 
  
-  -f filename
-  --file=filename
+  -f filename[@weight]
+  --file=filename[@weight]
   

 Add a transaction script read from filename to
 the list of executed scripts.
+An optional integer weight after @ allows to adjust the
+probability of drawing the test.
 See below for details.

   
@@ -686,9 +690,13 @@ pgbench  options  dbname
   What is the Transaction Actually Performed in pgbench?
 
   
-   Pgbench executes test scripts chosen randomly from a specified list.
+   pgbench executes test scripts chosen randomly
+   from a specified list.
They include built-in scripts with -b and
user-provided custom scripts with -f.
+   Each script may be given a relative weight specified after a
+   @ so as to change its drawing probability.
+   The default weight is 1.
  
 
   
@@ -1135,12 +1143,11 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 1/1
+latency average = 15.844 ms
+latency stddev = 2.715 ms
 tps = 618.764555 (including connections establishing)
 tps = 622.977698 (excluding connections establishing)
-SQL script 1: builtin: TPC-B (sort of)
- - 1 transactions (100.0% of total, tps = 618.764555)
- - latency average = 15.844 ms
- - latency stddev = 2.715 ms
+script statistics:
  - statement latencies in milliseconds:
 0.004386\set nbranches 1 * :scale
 0.001343\set ntellers 10 * :scale
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 7eb6a2d..da4f05c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -38,6 +38,7 @@
 #include "portability/instr_time.h"
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -179,6 +180,8 @@ char	   *login = NULL;
 char	   *dbName;
 const char *progname;
 
+#define WSEP '@'/* weight separator */
+
 volatile bool timer_exceeded = false;	/* flag from signal handler */
 
 /* variable definitions */
@@ -299,23 +302,26 @@ typedef struct
 static struct
 {
 	const char *name;
+	int			weight;
 	Command   **commands;
-	StatsData stats;
+	StatsData	stats;
 }	sql_script[MAX_SCRIPTS];	/* SQL script files */
 static int	num_scripts;		/* number of scripts in sql_script[] */
 static int	num_commands = 0;	/* total number of Command structs */
+static int64 total_weight = 0;
+
 static int	debug = 0;			/* debug flag */
 
 /* Define builtin test 

Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 4:43 PM, Victor Wagner  wrote:
> On Thu, 4 Feb 2016 12:59:03 +0300
> Michael Paquier  wrote:
>> > 1) Better to raise more meaningful error when IPC::Run is absent.
>>
>> This has been discussed before, and as far as I recall the current
>> behavior has been concluded as being fine. That's where
>> --enable-tap-tests becomes meaningful.
>
> Really, it is not so hard to add configure checks for perl modules.
> And we need to test not only for IPC::Run, but for Test::More too,
> because some Linux distributions put modules which come with perl into
> separate package.

The last time we discussed about that on this list we concluded that
it was not really necessary to have such checks, for one it makes the
code more simple, and because this is leveraged by the presence of
--enable-tap-tests, tests which can get actually costly with
check-world. But this is digressing the subject of this thread, which
deals with the fact of having recovery tests integrated in core...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 4:10 PM, Andres Freund  wrote:
> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
>> I think generally it is good idea, but one thing worth a thought is that
>> by doing so, we need to acquire all WAL Insertion locks every
>> LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
>> every slot, do you think it is matter of concern in any way for write
>> workloads or it won't effect as we need to do this periodically?
>
> Michael and I just had an in-person discussion, and one of the topics
> was that. The plan was basically to adapt the patch to:
> 1) Store the progress lsn inside the wal insert lock
> 2) Change the HasActivity API to return an the last LSN at which there
>was activity, instead of a boolean.
> 3) Individually acquire each insert locks's lwlock to get it's progress
>LSN, but not the exclusive insert lock. We need the lwllock to avoid
>a torn 8byte read on some platforms.

4) Switch the condition to decide if a checkpoint should be skipped
using the last activity position compared with ProcLastRecPtr in
CreateCheckpoint to see if any activity has occurred since the
checkpoint record was inserted, and do not care anymore if the
previous record and current record are on different segments. This
would basically work.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-04 Thread Andres Freund
Hi,

Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.

I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The main changes are that:
1) the significant performance regressions I saw are addressed by
   changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer
   tags, rather than the physical files
3) Writes from checkpoints, bgwriter and files are flushed, configurable
   by individual GUCs. Without that I still saw the spiked in a lot of 
circumstances.

There's also a more experimental reimplementation of bgwriter, but I'm
not sure it's realistic to polish that up within the constraints of 9.6.

Regards,

Andres 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Thom Brown
On 4 February 2016 at 15:07, Peter Geoghegan  wrote:
> On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown  wrote:
>>  public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
>>  public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
>>  public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
>
> I see the same.
>
> I use my regular SQL query to see the breakdown of leaf/internal/root pages:
>
> postgres=# with tots as (
>   SELECT count(*) c,
>   avg(live_items) avg_live_items,
>   avg(dead_items) avg_dead_items,
>   u.type,
>   r.oid
>   from (select c.oid,
>   c.relpages,
>   generate_series(1, c.relpages - 1) i
>   from pg_index i
>   join pg_opclass op on i.indclass[0] = op.oid
>   join pg_am am on op.opcmethod = am.oid
>   join pg_class c on i.indexrelid = c.oid
>   where am.amname = 'btree') r,
> lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
>   group by r.oid, type)
> select ct.relname table_name,
>   tots.oid::regclass::text index_name,
>   (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
>   upper(type) page_type,
>   c npages,
>   to_char(avg_live_items, '990.999'),
>   to_char(avg_dead_items, '990.999'),
>   to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || '
> %' as prop_of_index
>   from tots
>   join pg_index i on i.indexrelid = tots.oid
>   join pg_class ct on ct.oid = i.indrelid
>   where tots.oid = 'pgbench_accounts_pkey'::regclass
>   order by ct.relnamespace, table_name, index_name, npages, type;
> table_name│  index_name   │ non_meta_pages │ page_type
> │ npages │ to_char  │ to_char  │ prop_of_index
> ──┼───┼┼───┼┼──┼──┼───
>  pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ R
> │  1 │   97.000 │0.000 │0.004 %
>  pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ I
> │ 97 │  282.670 │0.000 │0.354 %
>  pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ L
> │ 27,323 │  366.992 │0.000 │   99.643 %
> (3 rows)
>
> But this looks healthy -- I see the same with master. And since the
> accounts table is listed as 1281 MB, this looks like a plausible ratio
> in the size of the table to its primary index (which I would not say
> is true of an 87MB primary key index).
>
> Are you sure you have the details right, Thom?

*facepalm*

No, I'm not.  I've just realised that all I've been checking is the
primary key expecting it to change in size, which is, of course,
nonsense.  I should have been creating an index on the bid field of
pgbench_accounts and reviewing the size of that.

Now I've checked it with the latest patch, and can see it working
fine.  Apologies for the confusion.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Peter Geoghegan
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
 wrote:
> I fixed it in the new version (attached).

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:

@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
 * scan key), which could be the last slot + 1.
 */
if (P_ISLEAF(opaque))
+   {
+   if (low <= PageGetMaxOffsetNumber(page))
+   {
+   IndexTuple oitup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, low));
+   /* one excessive check of equality. for possible posting
tuple update or creation */
+   if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+   && (IndexTupleSize(oitup) + sizeof(ItemPointerData) <
BTMaxItemSize(page)))
+   *updposing = true;
+   }
return low;
+   }

* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;
 ?column?
──
 t
(1 row)

B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.

* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState
*state, IndexTuple itup)
 * it off the old page, not the new one, in case we are not at leaf
 * level.
 */
-   state->btps_minkey = CopyIndexTuple(oitup);
+   ItemId iihk = PageGetItemId(opage, P_HIKEY);
+   IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+   state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.

I understand that this code in incomplete/draft:

+#define MaxPackedIndexTuplesPerPage\
+   ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+   (sizeof(ItemPointerData

But why is it different to the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?

* More explanation of the design would be nice. I suggest modifying
the nbtree README file, so it's easy to tell what the current design
is. It's hard to follow this from the thread. When I reviewed Heikki's
B-Tree patches from a couple of years ago, we spent ~75% of the time
on design, and only ~25% on code.

* I have a paranoid feeling that the deletion locking protocol
(VACUUMing index tuples concurrently and safely) may need special
consideration here. Basically, with the B-Tree code, there are several
complicated locking protocols, like for page splits, page deletion,
and interlocking with vacuum ("super exclusive lock" stuff). These are
why the B-Tree code is complicated in general, and it's very important
to pin down exactly how we deal with each. Ideally, you'd have an
explanation for why your code was correct in each of these existing
cases (especially deletion). With very complicated and important code
like this, it's often wise to be very clear about when we are talking
about your design, and when we are talking about your code. It's
generally too hard to review both at the same time.

Ideally, when you talk about your design, you'll be able to say things
like "it's clear that this existing thing is correct; at least we have
no complaints from the field. Therefore, it must be true that my new
technique is also correct, because it makes that general situation no
worse". Obviously that kind of rigor is just something we aspire to,
and still fall short of at times. Still, it would be nice to
specifically see a reason why the new code isn't special from the
point of view of the super-exclusive lock thing (which is what I mean
by deletion locking protocol + special consideration). Or why it is
special, but that's okay, or whatever. This style of review is normal
when writing B-Tree code. Some other things don't need this rigor, or
have no invariants that need to be respected/used. Maybe this is
obvious to you already, but it isn't obvious to me.

It's okay if you don't know why, but knowing that you don't have a
strong opinion about something is itself useful information.

* I see you disabled the LP_DEAD thing; why? Just because that made
bugs go away?

* Have you done much stress testing? Using pgbench with many
concurrent VACUUM FREEZE operations would be a good idea, if you
haven't already, because that is insistent about getting super
exclusive locks, unlike regular VACUUM.

* Are you keeping the restriction of 1/3 of a buffer 

Re: [HACKERS] UNIQUE capability to hash indexes

2016-02-04 Thread Tom Lane
Shubham Barai  writes:
> I wanted to know if anyone is working on these projects from todo list.
> 1.Add UNIQUE capability to hash indexes
> 2.Allow multi-column hash indexes

Not that I've heard of.  It's very hard to muster any enthusiasm for
improving hash indexes unless their lack of WAL-logging is fixed first.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNIQUE capability to hash indexes

2016-02-04 Thread David Steele
On 2/4/16 1:12 PM, Tom Lane wrote:
> Shubham Barai  writes:
>> I wanted to know if anyone is working on these projects from todo list.
>> 1.Add UNIQUE capability to hash indexes
>> 2.Allow multi-column hash indexes
> 
> Not that I've heard of.  It's very hard to muster any enthusiasm for
> improving hash indexes unless their lack of WAL-logging is fixed first.

+1.

I currently steer customers away from hash indexes and have never used
them for any project myself.  They are simply not safe.

-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


[HACKERS] UNIQUE capability to hash indexes

2016-02-04 Thread Shubham Barai
Hello hackers ,
I wanted to know if anyone is working on these projects from todo list.

1.Add UNIQUE capability to hash indexes

2.Allow multi-column hash indexes

I couldn't find any discussion about these projects.

Thanks,
Shubham Barai


Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Victor Wagner
On Thu, 4 Feb 2016 18:33:27 +0300
Michael Paquier  wrote:


> > Really, it is not so hard to add configure checks for perl modules.
> > And we need to test not only for IPC::Run, but for Test::More too,
> > because some Linux distributions put modules which come with perl
> > into separate package.  
> 
> The last time we discussed about that on this list we concluded that
> it was not really necessary to have such checks, for one it makes the
> code more simple, and because this is leveraged by the presence of
> --enable-tap-tests, tests which can get actually costly with
> check-world. But this is digressing the subject of this thread, which
> deals with the fact of having recovery tests integrated in core...

Of course, such configure tests should be run only if
--enable-tap-tests is passed to the configure script

It would look  like

if test "$enable_tap_tests" = "yes"; then
  AX_PROG_PERL_MODULES( Test::More, , AC_MSG_ERROR([Test::More is
  necessary to run TAP Tests])
  AX_PROG_PERL_MODULES( IPC::Run, , AC_MSG_ERROR([IPC::Run is
  necessary to run TAP Tests])
fi

in the configure.in

May be it is not strictly necessary, but it is really useful to see
such problems as clear error message during configure stage, rather
than successfully configure, compile, run tests and only then find out,
that something is forgotten.

I don't see why having such tests in the configure.in, makes code more
complex. It just prevents configure to finish successfully if
--enable-tap-tests is specified and required modules are not available.




-- 
   Victor Wagner 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Victor Wagner

This patch adds a long-awaited functionality to the PostgreSQL test
suite - testing of cluster configuration.

It contains bare minimum of replication and recovery test, but it should
be a good starting point for other people. 

Really, adding a much more tests for replication and recovery
is problematic, because these tests are resource-hungry, and take big
enough time even on relatively powerful machines, but  it seems to be
necessary, because they need to create several temporary installation.

So, set of tests, included into this patch is reasonably good choice. 

I think that readability of tests can be improved a bit, because these
tests would serve as an example for all tap test writers.

It's quite good that patch sets standard of using 'use strict; use
warnings;' in the test script.

It is bad, that Postgres-specific perl modules do not have embedded
documentation. It would be nice to see POD documentation in the
TestLib.pm and PostgresNode.pm instead of just comments. It would be
much easier to test writers to read documentation using perldoc utility,
rather than browse through the code.

I'll second Stas' suggestion about psql_ok/psql_fail functions.

1. psql_ok instead of just psql would provide visual feedback for the
reader of code. One would see 'here condition is tested, here is
something ended with _ok/_fail'.

It would be nice that seeing say "use Test::More tests => 4"
one can immediately see "Yes, there is three _ok's and one _fail in the
script'

2. I have use case for psql_fail code. In my libpq failover patch there
is number of cases, where it should be tested that connection is not
established,

But this is rather about further evolution of the tap test library, not
about this set of tests.

I think that this patch should be commited as soon as possible in its
current form (short of already reported reject in the PostgresNode.pm
init function).


-- 
   Victor Wagner 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Robert Haas
On Thu, Feb 4, 2016 at 2:49 PM, Michael Paquier
 wrote:
> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas  wrote:
>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>>  wrote:
>>> Yes, please let's use the custom language, and let's not care of not
>>> more than 1 level of nesting so as it is possible to represent
>>> pg_stat_replication in a simple way for the user.
>>
>> "not" is used twice in this sentence in a way that renders me not able
>> to be sure that I'm not understanding it not properly.
>
> 4 times here. Score beaten.
>
> Sorry. Perhaps I am tired... I was just wondering if it would be fine
> to only support configurations up to one level of nested objects, like
> that:
> 2[node1, node2, node3]
> node1, 2[node2, node3], node3
> In short, we could restrict things so as we cannot define a group of
> nodes within an existing group.

I see.  Such a restriction doesn't seem likely to me to prevent people
from doing anything actually useful.  But I don't know that it buys
very much either.  It's often not very much simpler to handle 2 levels
than n levels.  However, I ain't writing the code so...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
 wrote:
> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas  wrote:
>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>>  wrote:
>>> Yes, please let's use the custom language, and let's not care of not
>>> more than 1 level of nesting so as it is possible to represent
>>> pg_stat_replication in a simple way for the user.
>>
>> "not" is used twice in this sentence in a way that renders me not able
>> to be sure that I'm not understanding it not properly.
>
> 4 times here. Score beaten.
>
> Sorry. Perhaps I am tired... I was just wondering if it would be fine
> to only support configurations up to one level of nested objects, like
> that:
> 2[node1, node2, node3]
> node1, 2[node2, node3], node3
> In short, we could restrict things so as we cannot define a group of
> nodes within an existing group.

No, actually, that's stupid. Having up to two nested levels makes more
sense, a quite common case for this feature being something like that:
2{node1,[node2,node3]}
In short, sync confirmation is waited from node1 and (node2 or node3).

Flattening groups of nodes with a new catalog will be necessary to
ease the view of this data to users:
- group name?
- array of members with nodes/groups
- group type: quorum or priority
- number of items to wait for in this group
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idle In Transaction Session Timeout, revived

2016-02-04 Thread Andres Freund
On 2016-02-04 22:24:50 +0900, Fujii Masao wrote:
> But, IIRC, one of the problems that prevent the adoption of this feature is
> the addition of gettimeofday() call after every SQL command receipt.
> Have you already resolved that problem? Or we don't need to care about
> it because it's almost harmless?

Well, it only happens when the feature is enabled, not
unconditionally. So I don't think that's particularly bad, as long as
the feature is not enabled by default.


If we feel we need to something about the gettimeofday() call at some
point, I think it'd made a lot of sense to introduce a more centralize
"statement stop" time, and an extended pgstat_report_activity() that
allows to specify the timestamp. Because right now we'll essentially do
gettimeofday() calls successively when starting a command, starting a
transaction, committing a transaction, and finishing a comment. That's
pretty pointless.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Michael Paquier
On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas  wrote:
> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>  wrote:
>> Yes, please let's use the custom language, and let's not care of not
>> more than 1 level of nesting so as it is possible to represent
>> pg_stat_replication in a simple way for the user.
>
> "not" is used twice in this sentence in a way that renders me not able
> to be sure that I'm not understanding it not properly.

4 times here. Score beaten.

Sorry. Perhaps I am tired... I was just wondering if it would be fine
to only support configurations up to one level of nested objects, like
that:
2[node1, node2, node3]
node1, 2[node2, node3], node3
In short, we could restrict things so as we cannot define a group of
nodes within an existing group.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Etsuro Fujita

On 2016/01/29 17:52, Ashutosh Bapat wrote:

On Fri, Jan 29, 2016 at 2:05 PM, Etsuro Fujita
> wrote:



On 2016/01/29 1:26, Ashutosh Bapat wrote:
Here is the summary of changes from the last set of patches



2. Included fix for EvalPlanQual in postgres_fdw - an alternate
local
path is obtained from the list of paths linked to the joinrel. Since
this is done before adding the ForeignPath, we should be a local
path
available for given join.



I looked at that code in the patch (ie, postgresRecheckForeignScan
and the helper function that creates a local join path for a given
foreign join path.), briefly.  Maybe I'm missing something, but I
think that is basically the same as the fix I proposed for
addressing this issue, posted before [1], right?



Yes, although I have added functions to copy the paths, not consider
pathkeys and change the relevant members of the paths. Sorry  if I have
missed giving you due credits.



   If so, my concern is, the helper function probably wouldn't
extend to the parameterized-foreign-join-path cases, though that
would work well for the unparameterized-foreign-join-path cases.  We
don't support parameterized-foreign-join paths for 9.6?



If we do not find a local path with given parameterization, it means
there are other local parameterized paths which are superior to it. This
possibly indicates that there will be foreign join parameterised paths
which are superior to this parameterized path, so we basically do not
create foreign join path with that parameterization.


The latest version of the postgres_fdw join pushdown patch will support 
only the unparameterized-path case, so we don't have to consider this, 
but why do you think the superiority of parameterizations is preserved 
between remote joining and local joining?


Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcommitfest reply having corrupted subject line

2016-02-04 Thread Magnus Hagander
On Thu, Feb 4, 2016 at 7:26 AM, Noah Misch  wrote:

> The following message, which bears "User-Agent: pgcommitfest",
>
>
> http://www.postgresql.org/message-id/flat/20160202164101.1291.30526.p...@coridan.postgresql.org
>
> added spaces after commas in its subject line, compared to the subject
> line of
> its In-Reply-To message.
>
>  new subject line: Re: Add generate_series(date, date) and
> generate_series(date, date, integer)
> previous subject line: Re: Add generate_series(date,date) and
> generate_series(date,date,integer)
>
> I see no way to manually alter the subject line from the
> commitfest.postgresql.org review UI.  Is commitfest.postgresql.org
> initiating
> this change internally?
>

That is weird.

The CF app certainly doesn't do that intentionally - it copies it from the
archives. And if I look in the db on the cf app, it has the subject without
the space as the field that it copied :O

The code is line 355-358 at:
http://git.postgresql.org/gitweb/?p=pgcommitfest2.git;a=blob;f=pgcommitfest/commitfest/views.py;h=b4f19b2db470e5269943418b2402ca9ddfff0dff;hb=fec3b2431730c131a206a170a99a7610cdbacc6b#l355

the "subject" field in the db that we copy does not have the spaces... I
honestly have no idea where they are coming from :O I'm guessing it must be
something internally in the python libraries that create the MIME. Have you
seen this with any other messages, that you can recall, or just this one?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-04 Thread Stas Kelvich
Hi.

I’ve looked over proposed patch and migrated my shell tests scripts that i’ve 
used for testing twophase commits on master/slave to this test framework. 
Everything looks mature, and I didn’t encountered any problems with writing new 
tests using this infrastructure.

>From my point of view I don’t see any problems to commit this patches in their 
>current state.

Also some things that came into mind about test suite:

0) There are several routines that does actual checking, like 
is/command_ok/command_fails. I think it will be very handy to have wrappers 
psql_ok/psql_fails that calls psql through the command_ok/fails.

1) Better to raise more meaningful error when IPC::Run is absend.

2) --enable-tap-tests deserves mention in test/recovery/README and more obvious 
error message when one trying to run make check in test/recovery without 
--enable-tap-tests.

3) Is it hard to give ability to run TAP tests in extensions?

4) It will be handy if make check will write path to log files in case of 
failed test.

5) psql() accepts database name as a first argument, but everywhere in tests it 
is ‘postgres’. Isn’t it simpler to store dbname in connstr, and have separate 
function to change database?

6) Clean logs on prove restart? Clean up tmp installations?

7) Make check sets PGPORT PG_REGRESS for prove. Is it necessary?

> On 22 Jan 2016, at 09:17, Michael Paquier  wrote:
> 
> On Mon, Dec 21, 2015 at 4:45 PM, Michael Paquier
>  wrote:
>> As this thread is stalling a bit, please find attached a series of
>> patch gathering all the pending issues for this thread:
>> - 0001, fix config_default.pl for MSVC builds to take into account TAP tests
>> - 0002, append a node name in get_new_node (per Noah's request)
>> - 0003, the actual recovery test suite
>> Hopefully this facilitates future reviews.
> 
> Patch 2 has been pushed as c8642d9 (thanks Alvaro). The remaining two
> patches still apply and pass cleanly.
> -- 
> Michael
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-04 Thread Etsuro Fujita

On 2016/02/04 8:00, Robert Haas wrote:

On Wed, Feb 3, 2016 at 5:56 PM, Robert Haas  wrote:

On Wed, Feb 3, 2016 at 12:08 PM, Ashutosh Bapat
 wrote:

PFA patches with naming conventions similar to previous ones.
pg_fdw_core_v7.patch: core changes
pg_fdw_join_v7.patch: postgres_fdw changes for join pushdown
pg_join_pd_v7.patch: combined patch for ease of testing.


Thank you for working on this, Ashutosh and Robert!  I've not look at 
the patches closely yet, but ISTM the patches would be really in good shape!



Hmm, I think that GetPathForEPQRecheck() is a pretty terrible name.
How about GetExistingJoinPath()?


+1


Oops.  Hit Send too soon.  Also, how about writing if
(path->param_info != NULL) continue; instead of burying the core of
the function in another level of indentation?  I think you should skip
paths that aren't parallel_safe, too, and the documentation should be
clear that this will find an unparameterized, parallel-safe joinpath
if one exists.

+   ForeignPath *foreign_path;
+   foreign_path = (ForeignPath
*)joinpath->outerjoinpath;

Maybe insert a blank line between here, and in the other, similar case.


* Is it safe to replace outerjoinpath with its fdw_outerpath the 
following way?  I think that if the join relation represented by 
outerjoinpath has local conditions that can't be executed remotely, we 
have to keep outerjoinpath in the path tree; we will otherwise fail to 
execute the local conditions.  No?


+   /*
+* If either inner or outer path is a ForeignPath 
corresponding to
+* a pushed down join, replace it with the 
fdw_outerpath, so that we
+* maintain path for EPQ checks built entirely of local 
join
+* strategies.
+*/
+   if (IsA(joinpath->outerjoinpath, ForeignPath))
+   {
+   ForeignPath *foreign_path;
+   foreign_path = (ForeignPath 
*)joinpath->outerjoinpath;
+   if (foreign_path->path.parent->reloptkind == 
RELOPT_JOINREL)
+   joinpath->outerjoinpath = 
foreign_path->fdw_outerpath;
+   }

* IIUC, that function will be used by custom joins, so I think it would 
be better to put that function somewhere in the /optimizer directory 
(pathnode.c?).


Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-02-04 Thread Michael Paquier
On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund  wrote:
> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:
>> And there is no actual risk of data loss
>
> Huh?

More precise: what I mean here is that should an OS crash or a power
failure happen, we would fall back to recovery at next restart, so we
would not actually *lose* data.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-02-04 Thread Michael Paquier
On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:
> Not wrong, and this leads to the following:
> void rename_safe(const char *old, const char *new, bool isdir, int elevel);
> Controlling elevel is necessary per the multiple code paths that would
> use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does
> that look fine?

After really coding it, I finished with the following thing:
+int
+rename_safe(const char *old, const char *new)

There is no need to extend that for directories, well we could of
course but all the renames happen on files so I see no need to make
that more complicated. More refactoring of the other rename() calls
could be done as well by extending rename_safe() with a flag to fsync
the data within a critical section, particularly for the replication
slot code. I have let that out to not complicate more the patch.
-- 
Michael
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index f6da673..11287aa 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -418,9 +418,10 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	TLHistoryFilePath(path, newTLI);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing file.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link() to rename_safe() here just to be really sure that we
+	 * don't overwrite an existing file.  However, there shouldn't be one,
+	 * so rename_safe() is an acceptable substitute except for the truly
+	 * paranoid.
 	 */
 #if HAVE_WORKING_LINK
 	if (link(tmppath, path) < 0)
@@ -430,7 +431,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 		tmppath, path)));
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -508,9 +509,10 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	TLHistoryFilePath(path, tli);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link() to rename_safe() here just to be really sure that we
+	 * don't overwrite an existing logfile.  However, there shouldn't be
+	 * one, so rename_safe() is an acceptable substitute except for the
+	 * truly paranoid.
 	 */
 #if HAVE_WORKING_LINK
 	if (link(tmppath, path) < 0)
@@ -520,7 +522,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 		tmppath, path)));
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2846c4..b61fcc7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3251,9 +3251,10 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link() to rename_safe() here just to be really sure that we
+	 * don't overwrite an existing logfile.  However, there shouldn't be
+	 * one, so rename_safe() is an acceptable substitute except for the
+	 * truly paranoid.
 	 */
 #if HAVE_WORKING_LINK
 	if (link(tmppath, path) < 0)
@@ -3268,7 +3269,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3792,7 +3793,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 		 * flag, rename will fail. We'll try again at the next checkpoint.
 		 */
 		snprintf(newpath, MAXPGPATH, "%s.deleted", path);
-		if (rename(path, newpath) != 0)
+		if (rename_safe(path, newpath) != 0)
 		{
 			ereport(LOG,
 	(errcode_for_file_access(),
@@ -3800,10 +3801,12 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	  path)));
 			return;
 		}
+
 		rc = unlink(newpath);
 #else
 		rc = unlink(path);
 #endif
+
 		if (rc != 0)
 		{
 			ereport(LOG,
@@ -5291,7 +5294,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * re-enter archive recovery mode in a subsequent crash.
 	 */
 	unlink(RECOVERY_COMMAND_DONE);
-	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+	if (rename_safe(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
 		ereport(FATAL,
 (errcode_for_file_access(),