Re: [HACKERS] Asynchronous execution on FDW

2015-07-02 Thread Kyotaro HORIGUCHI
Ouch! I mistakenly made two CF entries for this patch. Could
someone remove this entry for me?

https://commitfest.postgresql.org/5/290/

The correct entry is /5/291/

==
Hello. This is the new version of the FDW asynchronous execution feature.

The status of this feature is as follows, as of the last commitfest.

- Async execution is valuable to have.
- But doing the first kick in the ExecInit phase is wrong.

So the design outline of this version is as follows:

- The patch set consists of three parts. The first is the
  infrastructure on the core side, the second is the code to enable
  asynchronous execution in postgres_fdw, and the third is an
  alternative set of three methods to adapt the fetch size, which
  makes asynchronous execution more effective.

- One problem was when to give the first kick for async execution.
  The ExecInit phase is not the right place, and the ExecProc phase
  does not fit either. An extra phase such as ExecPreProc would be
  too invasive, so I tried a pre-exec callback instead.

  Any node can register callbacks during its ExecInit turn; the
  registered callbacks are then called just before the ExecProc phase
  in the executor. The first patch adds the functions and structs to
  enable this.

- The second part is not changed from the previous version. It adds
  PgFdwConn, an extended PGconn with some members to support
  asynchronous execution.

  Asynchronous execution is kicked only for the first ForeignScan
  node on a given foreign server, and that state lasts until the
  next scan arrives. This behavior is mainly controlled in
  fetch_more_data(), and it limits the number of simultaneous
  executions per foreign server to one. This limit was chosen
  because no reasonable method to bound the multiplicity of
  execution on a *single peer* has been found so far.

- The third part consists of three experimental approaches to an
  adaptive fetch size feature.

   The first is duration-based adaptation. The patch increases the
   fetch size on every FETCH execution, but tries to keep the
   duration of every FETCH below 500 ms. It is not promising,
   though, because it looks very unstable and the behavior is nearly
   unforeseeable.

   The second is a byte-based FETCH feature. This patch adds an
   argument to the FETCH command that limits the number of bytes
   (octets) to send. But this might be an over-exposure of the
   internals: the size is counted from the internal representation
   of a tuple, so the client needs to specify the overhead of the
   internal tuple representation in bytes. This is effective but
   quite ugly.

   The third is the simplest and most straightforward way: it adds a
   foreign table option to specify the fetch_size. Its effect is
   also in doubt, since the size of the tuples for one foreign table
   varies with the returned-column list. But it is predictable for
   users and a necessary knob for those who want to tune it. A
   foreign server could also carry the same option as a default for
   its foreign tables, but this patch does not add that.


The attached patches are the following,

- 0001-Add-infrastructure-of-pre-execution-callbacks.patch
  Infrastructure of pre-execution callback

- 0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch
  FDW asynchronous execution feature

- 0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch
  Adaptive fetch size alternative 1: duration based control

- 0003b-POC-Experimental-fetch_by_size-feature.patch
  Adaptive fetch size alternative 2: FETCH by size

- 0003c-Add-foreign-table-option-to-set-fetch-size.patch
  Adaptive fetch size alternative 3: Foreign table option.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Amit Langote
On 2015-07-02 PM 03:43, Beena Emerson wrote:
> Amit wrote:
>
>> Does HA software determine a standby to promote based on replication
>> progress or would things be reliable enough for it to infer one from
>> the quorum setting specified in GUC (or wherever)? Is part of the job
>> of this patch to make the latter possible? Just wondering or perhaps
>> I am completely missing the point.
>
> Deciding the failover standby is not exactly part of this patch but we
> should be able to set up a mechanism to decide which is the best standby
> to be promoted.
>
> We might not be able to conclude this from the sync parameter alone.
>
> As specified before in some cases an async standby could also be most
> eligible for the promotion.

Thanks for the explanation.

Regards,
Amit






[HACKERS] Odd behaviour of SELECT ... ORDER BY ... FOR UPDATE

2015-07-02 Thread Etsuro Fujita
Hi,

While working on the foreign-join-pushdown issue, I noticed that in READ
COMMITTED isolation level it's possible that the result of SELECT ...
ORDER BY ... FOR UPDATE is not sorted correctly due to concurrent
updates that replaced the sort key columns with new values as shown in
the below example.  That seems odd to me.  So, I'd like to propose
raising an error rather than returning a possibly-incorrect result for
cases where the sorted tuples to be locked were modified by concurrent
updates.  Patch attached.  Is it OK to add this to the current CF?

Create an environment:

postgres=# create table test (a int);
CREATE TABLE
postgres=# insert into test values (1);
INSERT 0 1
postgres=# insert into test values (2);
INSERT 0 1

Run an example:

[Terminal 1]
postgres=# begin;
BEGIN
postgres=# update test set a = 3 where a = 1;
UPDATE 1

[Terminal 2]
postgres=# select * from test order by a for update;

[Terminal 1]
postgres=# commit;
COMMIT

[Terminal 2]
(The following result will be shown after the commit in Terminal 1.
Note that the output ordering is not correct.)
 a
---
 3
 2
(2 rows)

Best regards,
Etsuro Fujita
*** a/src/backend/executor/nodeLockRows.c
--- b/src/backend/executor/nodeLockRows.c
*************** TupleTableSlot *				/* return: a tuple or NULL */
*** 39,44 ****
--- 39,45 ----
  ExecLockRows(LockRowsState *node)
  {
  	TupleTableSlot *slot;
+ 	Plan	   *plan;
  	EState	   *estate;
  	PlanState  *outerPlan;
  	bool		epq_needed;
*************** ExecLockRows(LockRowsState *node)
*** 47,52 ****
--- 48,54 ----
  	/*
  	 * get information from the node
  	 */
+ 	plan = node->ps.plan;
  	estate = node->ps.state;
  	outerPlan = outerPlanState(node);
  
*************** lnext:
*** 214,219 ****
--- 216,225 ----
  					ereport(ERROR,
  							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  							 errmsg("could not serialize access due to concurrent update")));
+ 				if (((LockRows *) plan)->ordering)
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ 							 errmsg("tuple to be locked was changed due to concurrent update")));
  if (ItemPointerEquals(hufd.ctid, tuple.t_self))
  {
  	/* Tuple was deleted, so don't return it */
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copyLockRows(const LockRows *from)
*** 967,972 ****
--- 967,973 ----
  	/*
  	 * copy remainder of node
  	 */
+ 	COPY_SCALAR_FIELD(ordering);
  	COPY_NODE_FIELD(rowMarks);
  	COPY_SCALAR_FIELD(epqParam);
  
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outLockRows(StringInfo str, const LockRows *node)
*** 842,847 ****
--- 842,848 ----
  
  	_outPlanInfo(str, (const Plan *) node);
  
+ 	WRITE_BOOL_FIELD(ordering);
  	WRITE_NODE_FIELD(rowMarks);
  	WRITE_INT_FIELD(epqParam);
  }
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
*** 4801,4807 ****
   *	  Build a LockRows plan node
   */
  LockRows *
! make_lockrows(Plan *lefttree, List *rowMarks, int epqParam)
  {
  	LockRows   *node = makeNode(LockRows);
  	Plan	   *plan = &node->plan;
--- 4801,4807 ----
   *	  Build a LockRows plan node
   */
  LockRows *
! make_lockrows(Plan *lefttree, bool ordering, List *rowMarks, int epqParam)
  {
  	LockRows   *node = makeNode(LockRows);
  	Plan	   *plan = &node->plan;
*************** make_lockrows(Plan *lefttree, List *rowMarks, int epqParam)
*** 4816,4821 ****
--- 4816,4822 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  
+ 	node->ordering = ordering;
  	node->rowMarks = rowMarks;
  	node->epqParam = epqParam;
  
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** grouping_planner(PlannerInfo *root, double tuple_fraction)
*** 2302,2308 ****
--- 2302,2311 ----
  	 */
  	if (parse->rowMarks)
  	{
+ 		bool		ordering = parse->sortClause ? true : false;
+ 
  		result_plan = (Plan *) make_lockrows(result_plan,
+ 			 ordering,
  			 root->rowMarks,
  			 SS_assign_special_param(root));
  
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct SetOp
*** 809,814 ****
--- 809,815 ----
  typedef struct LockRows
  {
  	Plan		plan;
+ 	bool		ordering;		/* do we need to preserve the ordering? */
  	List	   *rowMarks;		/* a list of PlanRowMark's */
  	int			epqParam;		/* ID of Param for EvalPlanQual re-eval */
  } LockRows;
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern Group *make_group(PlannerInfo *root, List *tlist, List *qual,
*** 74,80 ****
  		   Plan *lefttree);
  extern Plan *materialize_finished_plan(Plan *subplan);
  extern Unique *make_unique(Plan *lefttree, List *distinctList);
! extern LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
  extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
  		   

Re: [HACKERS] WAL-related tools and .paritial WAL file

2015-07-02 Thread Michael Paquier
On Thu, Jul 2, 2015 at 2:06 PM, Fujii Masao wrote:
> I implemented the patch accordingly. Patch attached.

Cool, thanks.

I have done some tests with pg_archivecleanup, and it works as
expected, aka if I define a backup file, the backup file as well as
the segments equal or newer than it remain. It works as well when
defining a .partial file. I have done as well some testing with
pg_resetxlog and the partial segment gets removed.

Here are some comments:
1) pgarchivecleanup.sgml needs to be updated:
   In this mode, if you specify a <filename>.backup</> file name, then
only the file prefix
Here we should mention that it is also the case of a .partial file.
2)
-	 * restartWALFileName is a .backup filename, make sure we use the prefix
-	 * of the filename, otherwise we will remove wrong files since
+	 * restartWALFileName is a .partial or .backup filename, make sure we
+	 * use the prefix of the filename, otherwise we will remove wrong
+	 * files since
 	 * 000000010000000000000010.00000020.backup is after
 	 * 000000010000000000000010.
Shouldn't this be made clearer as well regarding .partial files? For
example with a comment like this:
"otherwise we will remove wrong files since
000000010000000000000010.00000020.backup or
000000010000000000000010.00000020.partial are after
000000010000000000000010". Simply not mentioning those file names
directly is fine for me as well.
3) Something not caused by this patch that I just noticed... But
pg_resetxlog does not remove .backup files in pg_xlog. Shouldn't they
get removed as well?
Regards,
-- 
Michael




Re: [HACKERS] Asynchronous execution on FDW

2015-07-02 Thread Michael Paquier
On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
> Ouch! I mistakenly made two CF entries for this patch. Could
> someone remove this entry for me?
>
> https://commitfest.postgresql.org/5/290/
>
> The correct entry is /5/291/

Done.
-- 
Michael




Re: [HACKERS] Odd behaviour of SELECT ... ORDER BY ... FOR UPDATE

2015-07-02 Thread Marko Tiikkaja
On 7/2/15 9:15 AM, Etsuro Fujita wrote:
> While working on the foreign-join-pushdown issue, I noticed that in READ
> COMMITTED isolation level it's possible that the result of SELECT ...
> ORDER BY ... FOR UPDATE is not sorted correctly due to concurrent
> updates that replaced the sort key columns with new values as shown in
> the below example.  That seems odd to me.  So, I'd like to propose
> raising an error rather than returning a possibly-incorrect result for
> cases where the sorted tuples to be locked were modified by concurrent
> updates.

I don't like the idea of READ COMMITTED suddenly throwing errors due to
concurrency problems.  Using FOR UPDATE correctly is really tricky, and
this is just one example.  And a documented one, at that, too.


.m




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Fujii Masao
On Thu, Jul 2, 2015 at 3:21 AM, Josh Berkus j...@agliodbs.com wrote:
> All:
>
> Replying to multiple people below.
>
> On 07/01/2015 07:15 AM, Fujii Masao wrote:
>> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus j...@agliodbs.com wrote:
>>> You're confusing two separate things.  The primary manageability problem
>>> has nothing to do with altering the parameter.  The main problem is: if
>>> there is more than one synch candidate, how do we determine *after the
>>> master dies* which candidate replica was in synch at the time of
>>> failure?  Currently there is no way to do that.  This proposal plans to,
>>> effectively, add more synch candidate configurations without addressing
>>> that core design failure *at all*.  That's why I say that this patch
>>> decreases overall reliability of the system instead of increasing it.
>>
>> I agree this is a problem even today, but it's basically independent from
>> the proposed feature *itself*. So I think that it's better to discuss and
>> work on the problem separately. If so, we might be able to provide
>> good way to find new master even if the proposed feature finally fails
>> to be adopted.
>
> I agree that they're separate features.  My argument is that the quorum
> synch feature isn't materially useful if we don't create some feature to
> identify which server(s) were in synch at the time the master died.
>
> The main reason I'm arguing on this thread is that discussion of this
> feature went straight into GUC syntax, without ever discussing:
>
> * what use cases are we serving?
> * what features do those use cases need?
>
> I'm saying that we need to have that discussion first before we go into
> syntax.  We gave up on quorum commit in 9.1 partly because nobody was
> convinced that it was actually useful; that case still needs to be
> established, and if we can determine *under what circumstances* it's
> useful, then we can know if the proposed feature we have is what we want
> or not.
>
> Myself, I have two use cases for changes to sync rep:
>
> 1. the ability to specify a group of three replicas in the same data
> center, and have commit succeed if it succeeds on two of them.  The
> purpose of this is to avoid data loss even if we lose the master and one
> replica.
>
> 2. the ability to specify that synch needs to succeed on two replicas in
> two different data centers.  The idea here is to be able to ensure
> consistency between all data centers.

Yeah, I'm also thinking those *simple* use cases. I'm not sure
how many people really want to have very complicated quorum
commit setting.

> Speaking of which: how does the proposed patch roll back the commit on
> one replica if it fails to get quorum?

You mean the case where there are two sync replicas and the master
needs to wait until both send the ACK, and then one replica goes down?
In this case, the master receives the ACK from only one replica and
must keep waiting until a new sync replica appears and sends back
the ACK. So the committed transaction (the written WAL record) would
not be rolled back.

> Well, one possibility is to have each replica keep a flag which
> indicates whether it thinks it's in sync or not.  This flag would be
> updated every time the replica sends a sync-ack to the master. There's a
> couple issues with that though:

I don't think this is a good approach, because there can be cases
where you need to promote a standby server that does not have the sync
flag. Imagine the case where you have sync and async standby servers.
When the master goes down, the async standby might be ahead of the
sync one. This is possible in practice. In this case, it might be
better to promote the async standby instead of the sync one, because
the remaining sync standby, which is behind, can easily catch up with
the new master.

We could promote the sync standby in this case. But since the
remaining async standby is ahead, it's not easy for it to follow the
new master. Probably a new base backup would need to be taken onto the
async standby from the new master, or pg_rewind would need to be
executed. That is, the async standby basically needs to be set up
again.

So I'm thinking that we basically need to check the progress on each
standby to choose the new master.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Beena Emerson
Amit wrote:

> Does HA software determine a standby to promote based on replication
> progress or would things be reliable enough for it to infer one from
> the quorum setting specified in GUC (or wherever)? Is part of the job
> of this patch to make the latter possible? Just wondering or perhaps
> I am completely missing the point.

Deciding the failover standby is not exactly part of this patch but we
should be able to set up a mechanism to decide which is the best standby to
be promoted. 

We might not be able to conclude this from the sync parameter alone.

As specified before in some cases an async standby could also be most
eligible for the promotion.



-- 
Beena Emerson

--
View this message in context: 
http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5856201.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Michael Paquier
On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote
langote_amit...@lab.ntt.co.jp wrote:
> On 2015-07-02 PM 03:12, Fujii Masao wrote:
>> So I'm thinking that we basically need to check the progress on each
>> standby to choose new master.
>
> Does HA software determine a standby to promote based on replication
> progress or would things be reliable enough for it to infer one from
> the quorum setting specified in GUC (or wherever)? Is part of the job
> of this patch to make the latter possible? Just wondering or perhaps
> I am completely missing the point.

Replication progress is a factor of choice, but not the only one. The
sole role of this patch is just to allow us to have more advanced
policy in defining how synchronous replication works, aka how we want
to let the master acknowledge a commit synchronously from a set of N
standbys. In any case, this is something unrelated to the discussion
happening here.
-- 
Michael




Re: [HACKERS] pg_basebackup and replication slots

2015-07-02 Thread Michael Paquier
On Wed, Jul 1, 2015 at 11:35 PM, Peter Eisentraut pete...@gmx.net wrote:
> On 7/1/15 8:37 AM, Michael Paquier wrote:
>> On Wed, Jul 1, 2015 at 10:33 AM, Peter Eisentraut wrote:
>>> (If you're looking at the patch and wondering why there is no code to
>>> actually do anything with the replication slot, that's because the code
>>> that does the WAL streaming is already aware of replication slots
>>> because of the pg_receivexlog functionality that it also serves.  So all
>>> that needs to be done is set replication_slot.)
>>
>> This is cool to see this patch taking shape.
>>
>> +my ($stdout, $stderr);
>> +run [ 'psql', '-X', '-A', '-t', '-q', '-d', $dbname, '-f', '-' ],
>> '<', \$sql, '>', \$stdout, '2>', \$stderr or die;
>> +chomp $stdout;
>> +return $stdout;
>>
>> Could it be possible to chomp and return $stderr as well here? It
>> seems useful to me to perform sanity checks after calling psql.
>
> Sure, makes sense.

OK, so here is more input about this set of patches.

Patch 1 looks good. It adds some tests to cover pg_basebackup -R and
checks if standby_mode and primary_conninfo are set correctly.
Patch 2 also looks fine. It adds tests for pg_basebackup -X. Both
patches are independent of the feature.

Regarding patch 3, I have more comments:
1) I think that the documentation should clearly mention that if -R and
-S are used together, a primary_slot_name entry is added to the
generated recovery.conf with the slot name defined.
2)
 sub psql
 {
	my ($dbname, $sql) = @_;
-	run [ 'psql', '-X', '-q', '-d', $dbname, '-f', '-' ], '<', \$sql or die;
+	my ($stdout, $stderr);
+	run [ 'psql', '-X', '-A', '-t', '-q', '-d', $dbname, '-f', '-' ],
+		'<', \$sql, '>', \$stdout, '2>', \$stderr or die;
+	chomp $stdout;
+	return $stdout;
 }
RewindTest.pm has a routine called check_query defined. It would be
great to extend psql a bit more than what I suggested previously by
returning ($result, $stdout, $stderr) from it and making check_query
use it directly. This way we have a unique interface to run psql and
capture output. I don't mind writing this refactoring patch on top of
your set if that's necessary.
3)
+command_ok([ 'pg_basebackup', '-D', "$tempdir/backupxs_sl_R", '-X',
'stream', '-S', 'slot1', '-R' ],
+   'pg_basebackup with replication slot and -R runs');
+$recovery_conf = slurp_file "$tempdir/backupR/recovery.conf";
+like(slurp_file("$tempdir/backupxs_sl_R/recovery.conf"),
slurp_file is slurping an incorrect file and $recovery_conf is used
nowhere after, so you could simply remove this line.
Regards,
-- 
Michael




Re: [HACKERS] Odd behaviour of SELECT ... ORDER BY ... FOR UPDATE

2015-07-02 Thread Etsuro Fujita
Hi Marko,

On 2015/07/02 16:27, Marko Tiikkaja wrote:
> On 7/2/15 9:15 AM, Etsuro Fujita wrote:
>> While working on the foreign-join-pushdown issue, I noticed that in READ
>> COMMITTED isolation level it's possible that the result of SELECT ...
>> ORDER BY ... FOR UPDATE is not sorted correctly due to concurrent
>> updates that replaced the sort key columns with new values as shown in
>> the below example.  That seems odd to me.  So, I'd like to propose
>> raising an error rather than returning a possibly-incorrect result for
>> cases where the sorted tuples to be locked were modified by concurrent
>> updates.
>
> I don't like the idea of READ COMMITTED suddenly throwing errors due to
> concurrency problems.  Using FOR UPDATE correctly is really tricky, and
> this is just one example.  And a documented one, at that, too.

Ah, you are right.  I'll withdraw this.  Sorry for the noise.

Best regards,
Etsuro Fujita




Re: [HACKERS] Asynchronous execution on FDW

2015-07-02 Thread Kyotaro HORIGUCHI
Thank you.

At Thu, 2 Jul 2015 16:02:27 +0900, Michael Paquier michael.paqu...@gmail.com 
wrote in CAB7nPqTs0YCwXedt1P=JjxFJeoj9UzLzkLuiX8=jdtpyutn...@mail.gmail.com
> On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI
> horiguchi.kyot...@lab.ntt.co.jp wrote:
>> Ouch! I mistakenly made two CF entries for this patch. Could
>> someone remove this entry for me?
>>
>> https://commitfest.postgresql.org/5/290/
>>
>> The correct entry is /5/291/
>
> Done.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Amit Langote
On 2015-07-02 PM 03:52, Michael Paquier wrote:
> On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote
> langote_amit...@lab.ntt.co.jp wrote:
>> On 2015-07-02 PM 03:12, Fujii Masao wrote:
>>> So I'm thinking that we basically need to check the progress on each
>>> standby to choose new master.
>>
>> Does HA software determine a standby to promote based on replication
>> progress or would things be reliable enough for it to infer one from
>> the quorum setting specified in GUC (or wherever)? Is part of the job
>> of this patch to make the latter possible? Just wondering or perhaps
>> I am completely missing the point.
>
> Replication progress is a factor of choice, but not the only one. The
> sole role of this patch is just to allow us to have more advanced
> policy in defining how synchronous replication works, aka how we want
> to let the master acknowledge a commit synchronously from a set of N
> standbys. In any case, this is something unrelated to the discussion
> happening here.

Got it, thanks!

Regards,
Amit





Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Amit Langote
On 2015-07-02 PM 03:12, Fujii Masao wrote:
> So I'm thinking that we basically need to check the progress on each
> standby to choose new master.

Does HA software determine a standby to promote based on replication progress
or would things be reliable enough for it to infer one from the quorum setting
specified in GUC (or wherever)? Is part of the job of this patch to make the
latter possible? Just wondering or perhaps I am completely missing the point.

Thanks,
Amit





Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Beena Emerson
Hello,
There has been a lot of discussion. It has become a bit confusing. 
I am summarizing my understanding of the discussion till now.
Kindly let me know if I missed anything important.

Backward compatibility:
We have to keep supporting the current format and behavior for
synchronous replication (the first running standby from the s_s_names
list). If the new format does not use the GUC, then a special value
should be specifiable for s_s_names to indicate that.

Priority and quorum:
Quorum treats all standbys with the same priority, while in priority
behavior each one has a different priority and an ACK must be received
from the specified k lowest-priority servers.
I am not sure how combining both will work out.
Mostly we would like to have some standbys from each data center in
sync. Can that not be achieved by quorum only?

GUC parameter:
There are some arguments over the text format. However, if we continue
using it, specifying the number before the group is a more readable
option than specifying it after:
s_s_names = 3(A, (P,Q), 2(X,Y,Z)) is better compared to
s_s_names = (A, (P,Q), (X,Y,Z) (2)) (3)

Catalog Method:
Is it safe to assume we are not going ahead with the catalog approach?
A system catalog and some built-in functions to set the sync parameters
are not viable because they can cause
 - a promoted master to sync-rep itself
 - changes to the catalog to wait continuously for an ACK from a down server.
The main problem with an unlogged system catalog is data loss during a crash.

JSON:
I agree it would make the GUC very complex and unreadable. We can
consider using it as metadata.
I think the only point in favor of JSON is being able to set it using
functions instead of having to edit and reload, right?

Identifying standby:
The main concern with the current use of application_name seems to be
that multiple standbys with the same name would implicitly form a group
(maybe across data clusters too?).
I agree it would be better to have a mechanism to uniquely identify a
standby; groups can then be made using whatever method we use to set
the sync requirements.

The concern about deciding which standby is to be promoted seems to be
a separate issue altogether.




-- 
Beena Emerson

--
View this message in context: 
http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5856216.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-02 Thread Amit Kapila
On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On 13 May 2015 at 09:35, Heikki Linnakangas hlinn...@iki.fi wrote:
>> In summary, the X^1.5 correction seems to work pretty well. It doesn't
>> completely eliminate the problem, but it makes it a lot better.
>
> Agreed


Do we want to consider whether wal_compression is enabled, as that can
reduce the effect of full_page_writes?

Also, I am planning to run some tests for this patch, but I am not
sure whether tps and/or latency numbers from pgbench are sufficient,
or whether people want to see actual read/write counts via some form
of dynamic tracing (stap), as done by the reporter of this issue.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Memory Accounting v11

2015-07-02 Thread Simon Riggs
On 14 June 2015 at 23:51, Tomas Vondra tomas.von...@2ndquadrant.com wrote:
>> The current state, where HashAgg just blows up the memory, is just not
>> reasonable, and we need to track the memory to fix that problem.
>
>> Meh. HashAgg could track its memory usage without loading the entire
>> system with a penalty.
>
> +1 to a solution like that, although I don't think that's doable without
> digging the info from memory contexts somehow.

Jeff is right, we desperately need a solution and this is the place to
start.

Tom's concern remains valid: we must not load the entire system with a
penalty.


The only questions I have are:

* If the memory allocations adapt to the usage pattern, then we expect to
see few memory chunk allocations. Why are we expecting the entire system
to experience a penalty?

* If we do not manage our resources, how are we certain this does not
induce a penalty? Not tracking memory could be worse than tracking it.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Patch to improve a few appendStringInfo* calls

2015-07-02 Thread Heikki Linnakangas

On 06/15/2015 03:56 AM, David Rowley wrote:
> On 29 May 2015 at 12:51, Peter Eisentraut pete...@gmx.net wrote:
>> On 5/12/15 4:33 AM, David Rowley wrote:
>>> Shortly after I sent the previous patch I did a few more searches and
>>> also found some more things that are not quite right.
>>> Most of these are to use the binary append method when the length of the
>>> string is already known.
>>
>> For these cases it might be better to invent additional functions such
>> as stringinfo_to_text() and appendStringInfoStringInfo() instead of
>> repeating the pattern of referring to data and length separately.
>
> You're probably right. It would be nicer to see some sort of wrapper
> functions that cleaned these up a bit.
>
> I really think that's something for another patch though, this patch just
> intends to put things the way they're meant to be in the least invasive way
> possible. What you're proposing is changing the way it's meant to work,
> which will cause much more code churn.
>
> I've attached a re-based patch.


Applied the straightforward parts. I left out the changes like


-   appendStringInfoString(&collist, buf.data);
+   appendBinaryStringInfo(&collist, buf.data, buf.len);


because they're not an improvement in readablity, IMHO, and they were 
not in performance-critical paths.


I also noticed that the space after CREATE EVENT TRIGGER name in 
pg_dump was actually spurious, and just added a space before the newline. 
I removed that space altogether.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL-related tools and .partial WAL file

2015-07-02 Thread Fujii Masao
On Thu, Jul 2, 2015 at 3:48 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Jul 2, 2015 at 2:06 PM, Fujii Masao wrote:
 I implemented the patch accordingly. Patch attached.

 Cool, thanks.

 I have done some tests with pg_archivecleanup, and it works as
 expected, aka if I define a backup file, the backup file as well as
 the segments equal or newer than it remain. It works as well when
 defining a .partial file. I have done as well some testing with
 pg_resetxlog and the partial segment gets removed.

Thanks for testing the patch!

Attached is the updated version of the patch.

 Here are some comments:
 1) pgarchivecleanup.sgml needs to be updated:
In this mode, if you specify a <filename>.backup</filename> file name, then
 only the file prefix
 Here we should mention that it is also the case of a .partial file.

Applied.

 2)
 -* restartWALFileName is a .backup filename, make sure we use the 
 prefix
 -* of the filename, otherwise we will remove wrong files since
 +* restartWALFileName is a .partial or .backup filename, make
 sure we use
 +* the prefix of the filename, otherwise we will remove wrong
 files since
  * 00010010.0020.backup is after
  * 00010010.
 Shouldn't this be made clearer as well regarding .partial files? For
 example with a comment like that:
 otherwise we will remove wrong files since
 00010010.0020.backup or
 00010010.0020.partial are after
 00010010. Simply not mentioning those file names
 directly is fine for me.

Applied.

 3) Something not caused by this patch that I just noticed... But
 pg_resetxlog does not remove .backup files in pg_xlog. Shouldn't they
 get moved away as well?

pg_resetxlog also doesn't remove .history files in pg_xlog. Those remaining
.backup and .history files are harmless after pg_resetxlog is executed.
So I don't think that it's required to remove them. Of course we can do that,
but some existing applications might depend on the current behavior...
So unless there is a strong reason to do that, I'd like to leave it as it is.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/ref/pg_xlogdump.sgml
--- b/doc/src/sgml/ref/pg_xlogdump.sgml
***
*** 215,220  PostgreSQL documentation
--- 215,226 
  Only the specified timeline is displayed (or the default, if none is
  specified). Records in other timelines are ignored.
  </para>
+ 
+   <para>
+ <application>pg_xlogdump</application> cannot read WAL files with suffix
+ <literal>.partial</literal>. If those files need to be read, <literal>.partial</literal>
+ suffix needs to be removed from the filename.
+   </para>
  </refsect1>
 
  <refsect1>
*** a/doc/src/sgml/ref/pgarchivecleanup.sgml
--- b/doc/src/sgml/ref/pgarchivecleanup.sgml
***
*** 60,67  archive_cleanup_command = 'pg_archivecleanup <replaceable>archivelocation</replaceable> %r'
 <para>
  When used as a standalone program all WAL files logically preceding the
  <replaceable>oldestkeptwalfile</replaceable> will be removed from <replaceable>archivelocation</replaceable>.
! In this mode, if you specify a <filename>.backup</filename> file name, then only the file prefix
! will be used as the <replaceable>oldestkeptwalfile</replaceable>. This allows you to remove
  all WAL files archived prior to a specific base backup without error.
  For example, the following example will remove all files older than
  WAL file name <filename>000100370010</filename>:
--- 60,69 
 <para>
  When used as a standalone program all WAL files logically preceding the
  <replaceable>oldestkeptwalfile</replaceable> will be removed from <replaceable>archivelocation</replaceable>.
! In this mode, if you specify a <filename>.partial</filename> or <filename>.backup</filename>
! file name, then only the file prefix will be used as the
! <replaceable>oldestkeptwalfile</replaceable>. This treatment of <filename>.backup</filename>
! file name allows you to remove
  all WAL files archived prior to a specific base backup without error.
  For example, the following example will remove all files older than
  WAL file name <filename>000100370010</filename>:
*** a/src/bin/pg_archivecleanup/pg_archivecleanup.c
--- b/src/bin/pg_archivecleanup/pg_archivecleanup.c
***
*** 125,131  CleanupPriorWALFiles(void)
  			 * file. Note that this means files are not removed in the order
  			 * they were originally written, in case this worries you.
  			 */
! 			if (IsXLogFileName(walfile) &&
  				strcmp(walfile + 8, exclusiveCleanupFileName + 8) < 0)
  			{
  /*
--- 125,131 
  			 * file. Note that this means files are not removed in the order
  			 * they were originally written, in case this worries you.
  			 */
! 			if ((IsXLogFileName(walfile) || IsPartialXLogFileName(walfile)) &&
  				strcmp(walfile + 8, exclusiveCleanupFileName + 8) < 0)
  			{
  /*
***
*** 181,187  CleanupPriorWALFiles(void)
   * SetWALFileNameForCleanup()
   *
   *	  Set the earliest WAL filename that we want to keep on the archive
!  *	  and 

Re: [HACKERS] raw output from copy

2015-07-02 Thread Pavel Stehule
Hi

I'll do it today evening

Pavel

2015-07-02 12:55 GMT+02:00 Simon Riggs si...@2ndquadrant.com:

 On 1 July 2015 at 07:42, Pavel Golub pa...@microolap.com wrote:


 I looked through the patch. Sources are OK. However, I didn't find any
 docs or test cases. Would you please provide me with a short description of
 this feature and why it is important? Because I didn't manage to find the
 old Andrew Dunstan's post either.


 Feature sounds OK, so lets do it.

 Pavel S, please submit a polished patch. Coding guidelines, tests, docs
 etc. Set back to Waiting On Author.

 --
 Simon Riggs                http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-02 Thread Simon Riggs
On 13 May 2015 at 09:35, Heikki Linnakangas hlinn...@iki.fi wrote:


 In summary, the X^1.5 correction seems to work pretty well. It doesn't
 completely eliminate the problem, but it makes it a lot better.


Agreed


 I don't want to over-compensate for the full-page-write effect either,
 because there are also applications where that effect isn't so big. For
 example, an application that performs a lot of updates, but all the updates
 are on a small number of pages, so the full-page-write storm immediately
 after checkpoint doesn't last long. A worst case for this patch would be
 such an application - lots of updates on only a few pages - with a long
 checkpoint_timeoout but relatively small checkpoint_segments, so that
 checkpoints are always driven by checkpoint_segments. I'd like to see some
 benchmarking of that worst case before committing anything like this.


We could do better, but that is not a reason not to commit this, as is.
Commit, please.

This has been in place for a while and still remains: TODO: reduce impact
of full page writes

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] psql tabcomplete - minor bugfix - tabcomplete for SET ROLE TO xxx

2015-07-02 Thread Pavel Stehule
2015-07-02 11:03 GMT+02:00 Heikki Linnakangas hlinn...@iki.fi:

 On 05/29/2015 10:41 AM, Pavel Stehule wrote:

 2015-05-29 9:28 GMT+02:00 Jeevan Chalke jeevan.cha...@gmail.com:

  I agree with Peter that we don't tab-complete everything we possibly
 could, but using tabs after SET ROLE TO provides DEFAULT as an option,
 which seems wrong.
 This patch adds a list of roles there, which I guess is better to have than
 giving something unusual.

  ...

 But back to this topic. I think it is a little bit different due to the
 fact that we support two very similar syntaxes for one feature. And it looks
 a little bit strange that one way is supported by autocomplete and the other is not.


 Yeah, it's a bit strange. We have a specific autocomplete rule for SET
 ROLE, but SET ROLE TO is treated as a generic GUC. With your patch, we'd
 also lose the auto-completion to SET ROLE TO DEFAULT.

  I think we want to encourage people to use the SQL-standard syntax SET
 ROLE ... rather than the PostgreSQL-specific SET ROLE TO ... On the
 whole, this just doesn't seem like much of an improvement. I'll mark this
 as 'rejected' in the commitfest app.


ok

Pavel


 PS. I note that the auto-completion for SET XXX TO ... is pretty lousy in
 general. We have rules for DateStyle, IntervalStyle, GEQO and search_path,
 but that's it. That could be expanded a lot. All enum-type GUCs could be
 handled with a single rule that queries pg_settings.enumvals, for example,
 and booleans would be easy too. But that's a different story.


 - Heikki




Re: [HACKERS] PATCH: remove nclients/nthreads constraint from pgbench

2015-07-02 Thread Heikki Linnakangas

On 05/08/2015 03:02 PM, Fabien COELHO wrote:


Remove pgbench constraint that the number of clients must be a multiple of
the number of threads, by sharing clients among threads as evenly as
possible.

Rationale: allows testing the database load with any number of clients
without an unrelated constraint. The current setting means for instance that
testing with a prime number of clients always uses one thread, whereas
other numbers of clients can use a different number of threads. This
constraint does not make much sense.
..
Minor v2 update to change a badly chosen variable name.


+1, it's really annoying that you can't just do -j <high number> and 
always run with that.


This doesn't behave correctly if you set -j to greater than -c, and also 
use the rate limit option:


 This works 
$ ./pgbench -R 100 -j3 -c 3 -T 10 postgres
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 5
query mode: simple
number of clients: 3
number of threads: 3
duration: 10 s
number of transactions actually processed: 1031
latency average: 1.397 ms
latency stddev: 0.276 ms
rate limit schedule lag: avg 0.119 (max 1.949) ms
tps = 102.947257 (including connections establishing)
tps = 102.967251 (excluding connections establishing)


 This does not; too small TPS 
$ ./pgbench -R 100 -j100 -c 3 -T 10 postgres
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 5
query mode: simple
number of clients: 3
number of threads: 100
duration: 10 s
number of transactions actually processed: 40
latency average: 3.573 ms
latency stddev: 1.724 ms
rate limit schedule lag: avg 0.813 (max 4.722) ms
tps = 3.246618 (including connections establishing)
tps = 3.246639 (excluding connections establishing)

I think that can be fixed by just setting nthreads internally to 
nclients, if nthreads > nclients. But please double-check that the logic 
used to calculate throttle_delay makes sense in general, when nthreads 
is not a multiple of nclients.


- Heikki





Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2015-07-02 Thread Heikki Linnakangas

On 07/01/2015 09:19 PM, Peter Geoghegan wrote:

On Wed, Jul 1, 2015 at 10:45 AM, Peter Geoghegan p...@heroku.com wrote:

On Tue, Jun 30, 2015 at 11:19 AM, Heikki Linnakangas hlinn...@iki.fi wrote:

I agree it would be cleaner to have a separate CHECK_UNIQUE_XXX code for
speculative insertions. You've defined CHECK_UNIQUE_SPECULATIVE as like
CHECK_UNIQUE_PARTIAL, but you don't have to insert the index tuple if
there's a conflict. I think it'd be better to define it as like
CHECK_UNIQUE_YES, but return FALSE instead of throwing an error on
conflict. The difference is that the aminsert would not be allowed to
return FALSE when there is no conflict.


Suppose we do it that way. Then what's the difference between
CHECK_UNIQUE_SPECULATIVE and CHECK_UNIQUE_PARTIAL? You've just
effectively required the CHECK_UNIQUE_YES case to not physically
insert a physical tuple before throwing an error, which does not seem
essential to the existing definition of CHECK_UNIQUE_YES -- you've
redefined CHECK_UNIQUE_YES in a way that nbtree happens to meet at the
moment. If we had an amcanunique AM that worked a bit like exclusion
constraints, this new obligation for CHECK_UNIQUE_YES might make it
impossible for that to work.


Another more obvious and important thing: CHECK_UNIQUE_YES waits for
conflicts to be resolved before returning to its caller. If you don't
get an error, you're done. CHECK_UNIQUE_PARTIAL never waits, and if we
add a CHECK_UNIQUE_SPECULATIVE, it ought to not wait either.

Sure, if a speculative inserter detects a conflict, it still has to
wait. But not in the aminsert call, and not until it cleans up its
physical insertion (by super-deleting). Clearly a
CHECK_UNIQUE_SPECULATIVE would have to be much closer to
CHECK_UNIQUE_PARTIAL than to CHECK_UNIQUE_YES.


Why is it not OK for aminsert to do the waiting, with 
CHECK_UNIQUE_SPECULATIVE?


- Heikki





Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-07-02 Thread Kouhei Kaigai
Hi Fujita-san,

Sorry for my late.

 On 2015/06/27 21:09, Kouhei Kaigai wrote:
  BTW, if you try newer version of postgres_fdw foreign join patch,
  please provide me a way to reproduce the problem.
 
  OK
 
  Did you forget to attach the patch, or v17 is in use?
 
 Sorry, I made a mistake.  The problem was produced using v16 [1].
 
  I'd like to suggest a solution that re-construct remote tuple according
  to the fdw_scan_tlist on ExecScanFetch, if given scanrelid == 0.
  It enables to run local qualifier associated with the ForeignScan node,
  and it will also work for the case when tuple in es_epqTupleSet[] was
  local heap.
 
  Maybe I'm missing something, but I don't think your proposal works
  properly because we don't have any component ForeignScan state node or
  subsidiary join state node once we've replaced the entire join with the
  ForeignScan performing the join remotely, IIUC.  So, my image was to
  have another subplan for EvalPlanQual as well as the ForeignScan, to do
  the entire join locally for the component test tuples if we are inside
  an EvalPlanQual recheck.
 
  Hmm... Probably, we have two standpoints to tackle the problem.
 
  The first standpoint tries to handle the base foreign table as
  a prime relation for locking. Thus, we have to provide a way to
  fetch a remote tuple identified with the supplied ctid.
  The advantage of this approach is the way to fetch tuples from
  base relation is quite similar to the existing form, however,
  its disadvantage is another side of the same coin, because the
  ForeignScan node with scanrelid==0 (that represents remote join
  query) may have local qualifiers which shall run on the tuple
  according to fdw_scan_tlist.
 
 IIUC, I think this approach would also need to evaluate join conditions
 and remote qualifiers in addition to local qualifiers locally, for
 component tuples that were re-fetched from the remote (and remaining
 component tuples that were copied from whole-row vars, if any), in cases
 where the re-fetched tuples were updated versions of those tuples rather
 than the same versions previously obtained.
 
  One other standpoint tries to handle a bunch of base foreign
  tables as a unit. That means, if any of base foreign table is
  the target of locking, it prompts FDW driver to fetch the latest
  joined tuple identified by ctid, even if this join contains
  multiple base relations to be locked.
  The advantage of this approach is that we can use qualifiers of
  the ForeignScan node with scanrelid==0 and no need to pay attention
  of remote qualifier and/or join condition individually.
  Its disadvantage is, we may extend EState structure to keep the
  joined tuples, in addition to es_epqTupleSet[].
 
 That is an idea.  However, ISTM there is another disadvantage; that is
 not efficient because that would need to perform another remote join
 query having such additional conditions during an EvalPlanQual check, as
 you proposed.
 
  I'm inclined to think the later standpoint works well, because
  it does not need to reproduce an alternative execution path in
  local side again, even if a ForeignScan node represents much
  complicated remote query.
  If we would fetch tuples of individual base relations, we need
  to reconstruct identical join path to be executed on remote-
  side, don't it?
 
 Yeah, that was my image for fixing this issue.
 
  IIUC, the purpose of EvalPlanQual() is to ensure the tuples to
  be locked is still visible, so it is not an essential condition
  to fetch base tuples individually.
 
 I think so too, but taking the similarity and/or efficiency of
 processing into consideration, I would vote for the idea of having an
 alternative execution path on the local side.  That would also allow FDW
 authors to write the foreign join pushdown functionality in their FDWs
 with smaller effort.

Even though I'd like to see a committer's opinion, I could not come up
with an idea better than what you proposed: a foreign-/custom-scan
has an alternative plan if scanrelid==0.

Let me introduce a few cases we should pay attention.

Foreign/CustomScan node may stack; that means a Foreign/CustomScan node
may have child node that includes another Foreign/CustomScan node with
scanrelid==0.
(At this moment, ForeignScan cannot have a child node; however, more
aggressive push-down [1] will need the same feature to fetch tuples from
a local relation and construct a VALUES() clause.)
In this case, the highest Foreign/CustomScan node (that is also nearest
to LockRows or ModifyTable) runs the alternative sub-plan that includes
the scan/join plans dominated by fdw_relids or custom_relids.

For example:
  LockRows
   -> HashJoin
        -> CustomScan (AliceJoin)
             -> SeqScan on t1
             -> CustomScan (CarolJoin)
                  -> SeqScan on t2
                  -> SeqScan on t3
        -> Hash
             -> CustomScan (BobJoin)
                  -> SeqScan on t4
                  -> ForeignScan (remote join involves ft5, ft6)

In this case, AliceJoin will have alternative sub-plan to join t1, t2
and t3, then 

[HACKERS] 9.6 First Commitfest Begins

2015-07-02 Thread Heikki Linnakangas

Hello everyone! It is time to start reviewing patches:

1. Pick a patch from the list at:

https://commitfest.postgresql.org/5/

2. Review it. Test it. Try to break it. Do we want the feature? Any 
weird interactions in corner-cases?


3. Reply on the thread on pgsql-hackers with your findings.

It is perfectly OK to review a patch that someone else has signed up for 
as reviewer - different people tend to pay attention to different things.


- Heikki




Re: [HACKERS] psql tabcomplete - minor bugfix - tabcomplete for SET ROLE TO xxx

2015-07-02 Thread Heikki Linnakangas

On 05/29/2015 10:41 AM, Pavel Stehule wrote:

2015-05-29 9:28 GMT+02:00 Jeevan Chalke jeevan.cha...@gmail.com:


I agree with Peter that we don't tab-complete everything we possibly
could, but using tabs after SET ROLE TO provides DEFAULT as an option,
which seems wrong.
This patch adds a list of roles there, which I guess is better to have than
giving something unusual.


...

But back to this topic. I think it is a little bit different due to the
fact that we support two very similar syntaxes for one feature. And it looks
a little bit strange that one way is supported by autocomplete and the other is not.


Yeah, it's a bit strange. We have a specific autocomplete rule for SET 
ROLE, but SET ROLE TO is treated as a generic GUC. With your patch, 
we'd also lose the auto-completion to SET ROLE TO DEFAULT.


I think we want to encourage people to use the SQL-standard syntax SET 
ROLE ... rather than the PostgreSQL-specific SET ROLE TO ... On the 
whole, this just doesn't seem like much of an improvement. I'll mark 
this as 'rejected' in the commitfest app.


PS. I note that the auto-completion for SET XXX TO ... is pretty lousy 
in general. We have rules for DateStyle, IntervalStyle, GEQO and 
search_path, but that's it. That could be expanded a lot. All enum-type 
GUCs could be handled with a single rule that queries 
pg_settings.enumvals, for example, and booleans would be easy too. But 
that's a different story.


- Heikki





Re: [HACKERS] Patch to improve a few appendStringInfo* calls

2015-07-02 Thread David Rowley
On 2 July 2015 at 21:59, Heikki Linnakangas hlinn...@iki.fi wrote:

 Applied the straightforward parts.


Thanks for committing.


 I left out the changes like

  -   appendStringInfoString(&collist, buf.data);
 +   appendBinaryStringInfo(&collist, buf.data, buf.len);


 because they're not an improvement in readablity, IMHO, and they were not
 in performance-critical paths.


Perhaps we can come up with appendStringInfoStringInfo at some later date.

--
 David Rowley                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] pg_archivecleanup, and backup filename to specify as an argument

2015-07-02 Thread Fujii Masao
Hi,

While I'm implementing the patch around pg_archivecleanup, I found
the following warning about pg_archivecleanup in the document.

Note that the .backup file name passed to the program should not
include the extension.

ISTM that pg_archivecleanup works as expected even if the .backup file
with the extension is specified as follows. So, isn't this warning already
obsolete? Or am I missing something?

$ pg_archivecleanup -d -x .zip myarchive
00010009.0028.backup.zip
pg_archivecleanup: keep WAL file myarchive/00010009 and later
pg_archivecleanup: removing file myarchive/00010005.zip
pg_archivecleanup: removing file myarchive/00010003.zip
pg_archivecleanup: removing file myarchive/00010001.zip
pg_archivecleanup: removing file myarchive/00010007.zip
pg_archivecleanup: removing file myarchive/00010006.zip
pg_archivecleanup: removing file myarchive/00010004.zip
pg_archivecleanup: removing file myarchive/00010002.zip
pg_archivecleanup: removing file myarchive/00010008.zip

Regards,

-- 
Fujii Masao




Re: [HACKERS] PATCH: remove nclients/nthreads constraint from pgbench

2015-07-02 Thread Fabien COELHO


This doesn't behave correctly if you set -j to greater than -c, and also use 
the rate limit option:

[...]

I think that can be fixed by just setting nthreads internally to nclients, if 
nthreads > nclients. But please double-check that the logic used to calculate 
throttle_delay makes sense in general, when nthreads is not a multiple of 
nclients.


Indeed, I probably did not consider nthreads > nclients. The fix you 
suggest seems ok. I'll check it before sending a new v3.


--
Fabien.




Re: [HACKERS] WALWriter active during recovery

2015-07-02 Thread Andres Freund
On 2015-07-02 14:34:48 +0100, Simon Riggs wrote:
 This was pushed back from last CF and I haven't worked on it at all, nor
 will I.
 
 Pushing back again.

Let's "return with feedback", not "move" it, then. Moving entries
along which aren't expected to receive updates anytime soon isn't a good
idea; there are more than enough entries each CF.




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Simon Riggs
On 2 July 2015 at 09:44, Beena Emerson memissemer...@gmail.com wrote:


 I am not sure how combining both will work out.


Use cases needed.


 Catalog Method:
 Is it safe to assume we may not going ahead with the Catalog approach?


Yes

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-07-02 Thread Syed, Rahila
Hello,

Unless I am missing something, I guess you can still keep the actual code that 
updates counters outside the core if you adopt an approach that Simon suggests.
Yes. The code to extract progress information from VACUUM and store it in shared 
memory can be outside core even with pg_stat_activity as a user interface.

Whatever the view (existing/new), any related counters would have a valid 
(non-NULL) value when read off the view iff hooks are set perhaps because you 
have an extension that sets them. 
I guess that means any operation that supports progress tracking would have 
an extension with suitable hooks implemented.
Do you mean to say that any operation/application that wants the progress-tracking 
feature will dynamically load the progress-checker module, which will set the 
hooks for progress reporting?
If yes, unless I am missing something, such dynamic loading cannot happen if we 
use pg_stat_activity, as it gets values from shared memory. The module has to be 
a shared_preload_library to allocate shared memory, which means it can be loaded 
only at server restart. Am I missing something?

Thank you,
Rahila Syed







Re: [HACKERS] raw output from copy

2015-07-02 Thread Andrew Dunstan


On 07/02/2015 07:14 AM, Pavel Stehule wrote:

Hi

I'll do it today evening




Pavel,

Please don't top-post on the PostgreSQL lists. You've been around here 
long enough to know that bottom posting is our custom.


I posted a patch for this in 2013 at 
http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but 
it can apply to a SELECT, and doesn't need COPY. Nobody seemed very 
interested, so I dropped it. Apparently people now want something along 
these lines, which is good.


cheers

andrew







Re: [HACKERS] WALWriter active during recovery

2015-07-02 Thread Simon Riggs
On 2 July 2015 at 14:38, Andres Freund and...@anarazel.de wrote:

 On 2015-07-02 14:34:48 +0100, Simon Riggs wrote:
  This was pushed back from last CF and I haven't worked on it at all, nor
  will I.
 
  Pushing back again.

 Let's "return with feedback", not "move" it, then. Moving entries
 along which aren't expected to receive updates anytime soon isn't a good
 idea; there are more than enough entries each CF.


Although I agree, the interface won't let me do that, so will leave as-is.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Patch to improve a few appendStringInfo* calls

2015-07-02 Thread Alvaro Herrera
David Rowley wrote:
 On 2 July 2015 at 21:59, Heikki Linnakangas hlinn...@iki.fi wrote:
 
  I left out the changes like
 
   -   appendStringInfoString(&collist, buf.data);
  +   appendBinaryStringInfo(&collist, buf.data, buf.len);
 
 
  because they're not an improvement in readablity, IMHO, and they were not
  in performance-critical paths.
 
 Perhaps we can come up with appendStringInfoStringInfo at some later date.

I had this exact thought when I saw your patch (though it was
appendStringInfoSI to me because the other name is too long and a bit
confusing).  It seems straightforward enough ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Reducing the size of BufferTag & remodeling forks

2015-07-02 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 1) Introduce a shared pg_relfilenode table. Every table, even
    shared/nailed ones, gets an entry therein. It's there to make it
    possible to uniquely allocate relfilenodes across databases &
    tablespaces.
 2) Replace relation forks, with the exception of the init fork which is
    special anyway, with separate relfilenodes. Stored in separate
    columns in pg_class.

 Thoughts?

I'm concerned about the traffic and contention involved with #1.
I'm also concerned about the assumption that relfilenode should,
or even can be, unique across an entire installation.  (I suppose
widening it to 8 bytes would fix some of the hazards there, but
that bloats your buffer tag again.)

But here's the big problem: you're talking about a huge amount of
work for what seems likely to be a microscopic improvement in some
operations.  Worse, we'll be taking penalties for other operations.
How will you do DropDatabaseBuffers() for instance?

CREATE DATABASE is going to be a problem, too.

regards, tom lane




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Heikki Linnakangas

On 07/02/2015 04:33 PM, Simon Riggs wrote:

On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:


On 06/27/2015 07:45 AM, Amit Kapila wrote:


Sometime back on one of the PostgreSQL blog [1], there was
discussion about the performance of drop/truncate table for
large values of shared_buffers and it seems that as the value
of shared_buffers increase the performance of drop/truncate
table becomes worse.  I think those are not often used operations,
so it never became priority to look into improving them if possible.

I have looked into it and found that the main reason for such
a behaviour is that for those operations it traverses whole
shared_buffers and it seems to me that we don't need that
especially for not-so-big tables.  We can optimize that path
by looking into buff mapping table for the pages that exist in
shared_buffers for the case when table size is less than some
threshold (say 25%) of shared buffers.

Attached patch implements the above idea and I found that
performance doesn't dip much with patch even with large value
of shared_buffers.  I have also attached script and sql file used
to take performance data.



I'm marking this as returned with feedback in the commitfest. In
addition to the issues raised so far, ISTM that the patch makes dropping a
very large table with small shared_buffers slower
(DropForkSpecificBuffers() is O(n) where n is the size of the relation,
while the current method is O(shared_buffers))


There are no unresolved issues with the approach, nor is it true that it is
slower. If you think there are some, you should say what they are, not act
high-handedly to reject a patch without reason.


Oh, I missed the NBuffers / 4 threshold.

(The total_blocks calculation is prone to overflowing, btw, if you have 
a table close to 32 TB in size. But that's trivial to fix)



I concur that we should explore using a radix tree or something else that
would naturally allow you to find all buffers for relation/database X
quickly.


I doubt that it can be managed efficiently while retaining optimal memory
management. If it can, it's going to be very low priority (or should be).

This approach works, so let's do it, now. If someone comes up with a better
way later, great.


*shrug*. I don't think this is ready to be committed. I can't stop you 
from working on this, but as far as this commitfest is concerned, case 
closed.


- Heikki





Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Heikki Linnakangas

On 07/02/2015 04:18 PM, Amit Kapila wrote:

On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote:


On 06/27/2015 07:45 AM, Amit Kapila wrote:


Sometime back, on one of the PostgreSQL blogs [1], there was
discussion about the performance of drop/truncate table for
large values of shared_buffers, and it seems that as the value
of shared_buffers increases, the performance of drop/truncate
table becomes worse.  I think those are not often-used operations,
so it never became a priority to look into improving them if possible.

I have looked into it and found that the main reason for such
behaviour is that those operations traverse the whole of
shared_buffers, and it seems to me that we don't need that,
especially for not-so-big tables.  We can optimize that path
by looking into the buffer mapping table for the pages that exist in
shared_buffers for the case when the table size is less than some
threshold (say 25%) of shared buffers.

The attached patch implements the above idea, and I found that
performance doesn't dip much with the patch even with large values
of shared_buffers.  I have also attached the script and sql file used
to take performance data.


I'm marking this as "returned with feedback" in the commitfest. In
addition to the issues raised so far,

Don't you think the approach discussed (delayed cleanup of buffers
during checkpoint scan) is sufficient to fix the issues raised?


Dunno, but there is no patch for that.

- Heikki





Re: [HACKERS] WALWriter active during recovery

2015-07-02 Thread Simon Riggs
On 2 July 2015 at 14:31, Fujii Masao masao.fu...@gmail.com wrote:

 On Thu, Mar 5, 2015 at 5:22 PM, Fujii Masao masao.fu...@gmail.com wrote:
  On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao masao.fu...@gmail.com
 wrote:
  On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs si...@2ndquadrant.com
 wrote:
  Currently, WALReceiver writes and fsyncs data it receives. Clearly,
  while we are waiting for an fsync we aren't doing any other useful
  work.
 
  Following patch starts WALWriter during recovery and makes it
  responsible for fsyncing data, allowing WALReceiver to progress other
  useful actions.
 
  With the patch, replication didn't work fine on my machine. I started
  the standby server after removing all the WAL files from the standby.
  ISTM that the patch doesn't handle that case. That is, in that case,
  the standby tries to start up walreceiver and replication to retrieve
  the REDO-starting checkpoint record *before* starting up walwriter
  (IOW, before reaching the consistent point). Then, since walreceiver works
  without walwriter, no received WAL data can be fsync'd in the standby.
  So replication cannot advance further. I think that walwriter needs
  to start before walreceiver starts.
 
  I just marked this patch as Waiting on Author.

 This patch was moved to the current CF with the status "Needs review".
 But there are already some review comments which have not been addressed
 yet,
 so I marked the patch as "Waiting on Author" again.


This was pushed back from last CF and I haven't worked on it at all, nor
will I.

Pushing back again.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WAL-related tools and .paritial WAL file

2015-07-02 Thread Michael Paquier
On Thu, Jul 2, 2015 at 8:42 PM, Fujii Masao wrote:
 3) Something not caused by this patch that I just noticed... But
 pg_resetxlog does not remove .backup files in pg_xlog. Shouldn't they
 get moved away as well?

 pg_resetxlog also doesn't remove .history files in pg_xlog. Those remaining
 .backup and .history files are harmless after pg_resetxlog is executed.
 So I don't think that it's required to remove them. Of course we can do that,
 but some existing applications might depend on the current behavior...
 So unless there is a strong reason to do that, I'd like to leave it as it is.

Well, I was just surprised to not see them wiped out. Let's not change,
then, behavior that has existed for ages. The rest of the patch looks
fine to me (I would add dots at the end of sentences in comment
blocks, but that's a detail). The new macros really make the code
easier to read and understand!
-- 
Michael




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Andres Freund
On 2015-07-02 16:08:03 +0300, Heikki Linnakangas wrote:
 I'm marking this as "returned with feedback" in the commitfest. In addition
 to the issues raised so far, ISTM that the patch makes dropping a very large
 table with small shared_buffers slower (DropForkSpecificBuffers() is O(n),
 where n is the size of the relation, while the current method is
 O(shared_buffers)).

I think upthread there was talk about only using the O(relsize) approach
if relsize < NBuffers. That's actually a pretty common scenario,
especially in test suites.
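The threshold idea being debated can be sketched in miniature. Everything below is illustrative: the names (`drop_relation_buffers`, `probe_mapping_table`, `NBUFFERS`, and so on) are invented for this sketch and are not the patch's actual symbols, and the counters stand in for real hash probes and buffer-header inspections.

```c
#include <stdint.h>

#define NBUFFERS 1024           /* pretend shared_buffers holds 1024 pages */

static long work_done;          /* counts lookups/inspections performed */

/* hypothetical O(1) probe of the buffer mapping hash for one block */
static void probe_mapping_table(uint32_t blkno)
{
    (void) blkno;
    work_done++;
}

/* current code path: inspect every buffer header */
static void scan_all_shared_buffers(void)
{
    work_done += NBUFFERS;
}

static long drop_relation_buffers(uint32_t rel_nblocks)
{
    work_done = 0;
    if (rel_nblocks < NBUFFERS / 4)
    {
        /* O(relsize): cheap for small tables, common in test suites */
        for (uint32_t blk = 0; blk < rel_nblocks; blk++)
            probe_mapping_table(blk);
    }
    else
    {
        /* O(shared_buffers): unchanged path for big relations */
        scan_all_shared_buffers();
    }
    return work_done;
}
```

For a 10-block table the small-relation path does 10 probes instead of inspecting all 1024 buffers, which is why the win is largest in test suites full of tiny tables; past the 25% threshold the existing full scan remains the cheaper strategy.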

 I concur that we should explore using a radix tree or something else that
 would naturally allow you to find all buffers for relation/database X
 quickly.

I unsurprisingly think that's the right way to go. But I'm not sure
whether it isn't worth adding another layer of band-aid till we're there...

Greetings,

Andres Freund




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Amit Kapila
On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 06/27/2015 07:45 AM, Amit Kapila wrote:

 Sometime back, on one of the PostgreSQL blogs [1], there was
 discussion about the performance of drop/truncate table for
 large values of shared_buffers, and it seems that as the value
 of shared_buffers increases, the performance of drop/truncate
 table becomes worse.  I think those are not often-used operations,
 so it never became a priority to look into improving them if possible.

 I have looked into it and found that the main reason for such
 behaviour is that those operations traverse the whole of
 shared_buffers, and it seems to me that we don't need that,
 especially for not-so-big tables.  We can optimize that path
 by looking into the buffer mapping table for the pages that exist in
 shared_buffers for the case when the table size is less than some
 threshold (say 25%) of shared buffers.

 The attached patch implements the above idea, and I found that
 performance doesn't dip much with the patch even with large values
 of shared_buffers.  I have also attached the script and sql file used
 to take performance data.


 I'm marking this as "returned with feedback" in the commitfest. In
addition to the issues raised so far,

Don't you think the approach discussed (delayed cleanup of buffers
during checkpoint scan) is sufficient to fix the issues raised?

 ISTM that the patch makes dropping a very large table with small
shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the
size of the relation, while the current method is O(shared_buffers))


Can you please explain how you reached that conclusion?
It does that only when the relation size is less than 25% of shared
buffers.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Reducing the size of BufferTag remodeling forks

2015-07-02 Thread Andres Freund
Hi,

I've complained a number of times that our BufferTag is ridiculously
large:
typedef struct buftag
{
RelFileNode rnode;  /* physical relation identifier */
ForkNumber  forkNum;
BlockNumber blockNum;   /* blknum relative to begin of reln */
} BufferTag;

typedef struct RelFileNode
{
Oid spcNode;/* tablespace */
Oid dbNode; /* database */
Oid relNode;/* relation */
} RelFileNode;

that amounts to 20 bytes. That's problematic because we frequently have
to compare or hash the entire buffer tag. Comparing 20 bytes is rather
branch intensive, and shows up noticeably in profiles.  It's also a
stumbling block on the way to a smarter buffer mapping data structure,
because it makes e.g. trees rather deep.
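The size argument can be made concrete with a toy comparison. `WideTag` mirrors the 20-byte layout quoted above, while `NarrowTag` is the proposed 8-byte tag with an installation-wide-unique relfilenode; both types and helper functions are illustrative sketches, not the real PostgreSQL definitions.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

typedef struct WideTag
{
    uint32_t spcNode;           /* tablespace */
    uint32_t dbNode;            /* database */
    uint32_t relNode;           /* relation */
    int32_t  forkNum;
    uint32_t blockNum;
} WideTag;                      /* 20 bytes to hash and compare */

typedef struct NarrowTag
{
    uint32_t relNode;           /* installation-wide-unique relfilenode */
    uint32_t blockNum;
} NarrowTag;                    /* 8 bytes: one machine word */

static bool wide_tag_eq(const WideTag *a, const WideTag *b)
{
    /* five 32-bit comparisons, or a 20-byte memcmp */
    return memcmp(a, b, sizeof(WideTag)) == 0;
}

static bool narrow_tag_eq(NarrowTag a, NarrowTag b)
{
    /* the whole tag packs into a single 64-bit compare */
    uint64_t x = ((uint64_t) a.relNode << 32) | a.blockNum;
    uint64_t y = ((uint64_t) b.relNode << 32) | b.blockNum;
    return x == y;
}
```

The narrow form is one branch-free integer comparison, and as a fixed-width 64-bit key it also slots naturally into radix-tree-style mapping structures.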

The buffer tag is currently used in two situations:

1) Dealing with the buffer mapping, we need to identify the underlying
   file uniquely and we need the block number (8 bytes).

2) When writing out a block we need, in addition to 1), to have
   information about where to store the file. That requires the
   tablespace and database.

You may know that a filenode (RelFileNode->relNode) is currently *not*
unique across databases and tablespaces.

Additionally you might have noticed that the above description also
disregards relation forks.

I think we should work towards 1) being sufficient for its purpose. My
suggestion to get there is twofold:

1) Introduce a shared pg_relfilenode table. Every table, even
   shared/nailed ones, gets an entry therein. It's there to make it
   possible to uniquely allocate relfilenodes across databases &
   tablespaces.

2) Replace relation forks, with the exception of the init fork which is
   special anyway, with separate relfilenodes. Stored in separate
   columns in pg_class.

This scheme has a number of advantages: We don't need to look at the
filesystem anymore to find out whether a relfilenode exists. The buffer
tags are 8 bytes. The number of stats doesn't scale as O(#forks *
#relations) anymore, allowing us to add additional forks more easily.

I think something akin to init forks is going to survive, because they have
to be copied without access to the catalogs - but that's fine, they just
aren't allowed to go through shared buffers. Afaics that's not a
problem.

Obviously this is a rather high-level description, but right now this
sounds doable to me.

Thoughts?


- Andres




Re: [HACKERS] raw output from copy

2015-07-02 Thread Andrew Dunstan


On 07/02/2015 09:43 AM, Simon Riggs wrote:
On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote:



Please don't top-post on the PostgreSQL lists. You've been around
here long enough to know that bottom posting is our custom.

I posted a patch for this in 2013 at
http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net
but it can apply to a SELECT, and doesn't need COPY. Nobody seemed
very interested, so I dropped it. Apparently people now want
something along these lines, which is good.


It's a shame that both solutions are restricted to either COPY or psql.

Both of those are working on suggestions from Tom, so there is no 
history of preference there.


Can we have both please, gentlemen?

If we implemented Andrew's solution, how would we request it in a COPY 
statement? Seems like we would want the RAW format keyword anyway.






What's the use case? My original motivation was that I had a function 
that returned a bytea (it was a PDF in fact) that I wanted to be able to 
write to a file. Of course, this is easy enough to do with a client 
library like perl's DBD::Pg, but it seems sad to have to resort to that 
for something so simple.


My original suggestion 
(http://www.postgresql.org/message-id/4ea1b83b.2050...@pgexperts.com) 
was to invent a \bcopy command.


I don't have a problem in building in a RAW mode for copy, but we'll 
still need to teach psql how to deal with it.


Another case where it could be useful is JSON - so we can avoid having 
to play tricks like 
http://adpgtech.blogspot.com/2014/09/importing-json-data.html. Similar 
considerations probably apply to XML, and the tricks are less guaranteed 
to work.


cheers

andrew




Re: [HACKERS] Reducing the size of BufferTag remodeling forks

2015-07-02 Thread Andres Freund
On 2015-07-02 09:51:59 -0400, Tom Lane wrote:
 Andres Freund and...@anarazel.de writes:
  1) Introduce a shared pg_relfilenode table. Every table, even
 shared/nailed ones, gets an entry therein. It's there to make it
 possible to uniquely allocate relfilenodes across databases &
 tablespaces.
  2) Replace relation forks, with the exception of the init fork which is
 special anyway, with separate relfilenodes. Stored in separate
 columns in pg_class.
 
  Thoughts?
 
 I'm concerned about the traffic and contention involved with #1.

I don't think that'll be that significant in comparison to all the other
work done when creating a relation. Unless we do something wrong, it'll
be highly unlikely to get row-level contention, as the oids of the
individual relations will be from the oid counter or something similar.

 I'm also concerned about the assumption that relfilenode should,
 or even can be, unique across an entire installation.  (I suppose
 widening it to 8 bytes would fix some of the hazards there, but
 that bloats your buffer tag again.)

Why? Because it limits the number of relations & forks we can have to
2**32? That seems like an extraordinarily large limit. The catalog sizes
(pg_attribute most prominently) are a problem at a much lower number of
relations than that. Also rel/catcache management.

 But here's the big problem: you're talking about a huge amount of
 work for what seems likely to be a microscopic improvement in some
 operations.

I don't think it's microscopic at all. Just hacking database &
tablespace out of the hashing & comparisons in the buffer tag (obviously not
a correct thing, but works well enough for pgbench) results in quite
measurable performance benefits. But the main point isn't the performance
improvements themselves, but that it opens the door to smarter buffer
mapping algorithms, which e.g. will allow ordered access.  Also, not
having the current problem with increasing the number of forks would be
good.

 Worse, we'll be taking penalties for other operations.
 How will you do DropDatabaseBuffers() for instance?

 CREATE DATABASE is going to be a problem, too.

More prominently than that, without access to the database/tablespace we
couldn't even write out dirty buffers in a reasonable manner.  That's
why I think we're going to have to continue storing those two in the
buffer descriptors, just not include them in the buffer mapping.

Greetings,

Andres Freund




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:
  I'm marking this as "returned with feedback" in the commitfest.

  There are no unresolved issues with the approach, nor is it true that it is
  slower. If you think there are some, you should say what they are, not act
  high-handedly to reject a patch without reason.

Have you read the thread?  There were plenty of objections, as well as
a design for a better solution.  In addition, we should think about
more holistic solutions such as Andres' nearby proposal (which I doubt
will work as-is, but maybe somebody will think of how to fix it).
Committing a band-aid will just make it harder to pursue real fixes.

regards, tom lane




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-07-02 Thread Etsuro Fujita

Hi KaiGai-san,

On 2015/07/02 18:31, Kouhei Kaigai wrote:

Even though I'd like to see a committer's opinion, I could not come up
with an idea better than what you proposed; a foreign/custom scan
has an alternative plan if scanrelid==0.


I'd also like to hear the opinion!


Let me introduce a few cases we should pay attention to.

Foreign/CustomScan nodes may stack; that means a Foreign/CustomScan node
may have a child node that includes another Foreign/CustomScan node with
scanrelid==0.
(At this moment, ForeignScan cannot have a child node; however, more
aggressive push-down [1] will need the same feature to fetch tuples from
a local relation and construct a VALUES() clause.)
In this case, the highest Foreign/CustomScan node (that is also nearest
to LockRows or ModifyTable) runs the alternative sub-plan that includes
the scan/join plans dominated by fdw_relids or custom_relids.

For example:
   LockRows
   -> HashJoin
      -> CustomScan (AliceJoin)
         -> SeqScan on t1
         -> CustomScan (CarolJoin)
            -> SeqScan on t2
            -> SeqScan on t3
      -> Hash
         -> CustomScan (BobJoin)
            -> SeqScan on t4
            -> ForeignScan (remote join involves ft5, ft6)

In this case, AliceJoin will have an alternative sub-plan to join t1, t2
and t3, which shall then be used on EvalPlanQual(). Also, BobJoin will
have an alternative sub-plan to join t4, ft5 and ft6. CarolJoin and the
ForeignScan will also have alternative sub-plans; however, these are
not used in this case.
Probably, it works fine.


Yeah, I think so too.


Do we have a potential scenario where a foreign/custom join is located
above a LockRows node? (Subquery expansion may give such a case?)
Anyway, it doesn't cause a problem, does it?


IIUC, I don't think we have such a case.


As the next step, how do we implement this design?
I guess that the planner needs to keep a path that contains neither
a foreign join nor a custom join with scanrelid==0.
Probably, a cheapest_builtin_path in RelOptInfo is needed that
never contains this remote/custom join logic, as a seed for the
alternative sub-plan.


Yeah, I think so too, but I've not figured out how to implement this yet.

To be honest, ISTM that it's difficult to do that simply and efficiently
for the foreign/custom-join-pushdown API that we have for 9.5.  It's a
little late, but what I started thinking is to redesign that API so that
it is called at standard_join_search, as discussed in [2]: (1) place
that API call *after* the set_cheapest call, and (2) place
another set_cheapest call after that API call for each joinrel.  By the
first set_cheapest call, I think we could probably save the alternative
path that we need in cheapest_builtin_path.  By the second
set_cheapest call following that API call, we could also consider
foreign/custom-join-pushdown paths.  What do you think about this idea?



create_foreignscan_plan() or create_customscan_plan() will be
able to construct these alternative plans, regardless of the
extensions. So, individual FDWs/CSPs don't need to care about
this alternative sub-plan, do they?

After that, once ExecScanFetch() is called under EvalPlanQual(),
these Foreign/CustomScan with scanrelid==0 runs the alternative
sub-plan, to validate the latest tuple.

Hmm... It looks to me a workable approach.


Yeah, I think so too.


Fujita-san, are you available to make a patch with this approach?
If so, I'd like to volunteer to review it.


Yeah, I'm willing to make a patch if we reach a consensus!  And I'd
be happy if you helped me with the work!


Best regards,
Etsuro Fujita

[2] http://www.postgresql.org/message-id/5451.1426271...@sss.pgh.pa.us




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Fujii Masao
On Thu, Jul 2, 2015 at 5:44 PM, Beena Emerson memissemer...@gmail.com wrote:
 Hello,
 There has been a lot of discussion. It has become a bit confusing.
 I am summarizing my understanding of the discussion till now.
 Kindly let me know if I missed anything important.

 Backward compatibility:
 We have to provide support for the current format and behavior of
 synchronous replication (the first running standby from the s_s_names list).
 If the new format does not include the GUC, then a special value is to be
 specified for s_s_names to indicate that.

 Priority and quorum:
 Quorum treats all the standbys with the same priority, while in priority
 behavior each one has a different priority and an ACK must be received from
 the specified k servers with the lowest priority values.
 I am not sure how combining both will work out.
 Mostly we would like some standbys from each data center to be in
 sync. Can that not be achieved by quorum only?

So you're wondering if there is a use case where both quorum and priority
are used together?

For example, please imagine the case where you have two standby servers
(say A and B) in the local site, and one standby server (say C) in a remote
disaster recovery site. You want to set up sync replication so that the
master waits for an ACK from either A or B, i.e., the setting 1(A, B). Also,
only when either A or B crashes do you want to make the master wait for an
ACK from either the remaining local standby or C. On the other hand, you
don't want to use a setting like 1(A, B, C), because in that setting C can
be the sync standby, and both A and B might be far behind C when the master
crashes. In that case, you would need to promote the remote standby server C
to be the new master... which is what you'd like to avoid.

The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar.
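One possible reading of that nested setting, sketched as code. The grammar itself is Michael's proposal; the evaluation below is only an illustrative interpretation — each 1[...] is a priority group whose sync member is the first listed standby that is running, and the outer 1(...) is a quorum satisfied once any one group's sync member has ACKed. All names are invented for the sketch.

```c
#include <stdbool.h>

enum { A, B, C, NSTANDBYS };

typedef struct Standby
{
    bool running;
    bool acked;
} Standby;

/* 1[...]: index of the highest-priority running member, or -1 if none */
static int group_sync_member(const Standby s[], const int members[], int n)
{
    for (int i = 0; i < n; i++)
        if (s[members[i]].running)
            return members[i];
    return -1;
}

/* 1(1[A, C], 1[B, C]): commit may proceed once one group is satisfied */
static bool sync_satisfied(const Standby s[NSTANDBYS])
{
    static const int g1[] = {A, C};
    static const int g2[] = {B, C};
    int m1 = group_sync_member(s, g1, 2);
    int m2 = group_sync_member(s, g2, 2);
    return (m1 >= 0 && s[m1].acked) || (m2 >= 0 && s[m2].acked);
}
```

Under this reading, while A and B are both up, an ACK from remote C alone never releases a commit — exactly the property the flat 1(A, B, C) setting lacks — yet C steps in automatically as soon as a local standby crashes.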

Regards,

-- 
Fujii Masao




Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Simon Riggs
On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 06/27/2015 07:45 AM, Amit Kapila wrote:

 Sometime back, on one of the PostgreSQL blogs [1], there was
 discussion about the performance of drop/truncate table for
 large values of shared_buffers, and it seems that as the value
 of shared_buffers increases, the performance of drop/truncate
 table becomes worse.  I think those are not often-used operations,
 so it never became a priority to look into improving them if possible.

 I have looked into it and found that the main reason for such
 behaviour is that those operations traverse the whole of
 shared_buffers, and it seems to me that we don't need that,
 especially for not-so-big tables.  We can optimize that path
 by looking into the buffer mapping table for the pages that exist in
 shared_buffers for the case when the table size is less than some
 threshold (say 25%) of shared buffers.

 The attached patch implements the above idea, and I found that
 performance doesn't dip much with the patch even with large values
 of shared_buffers.  I have also attached the script and sql file used
 to take performance data.


 I'm marking this as "returned with feedback" in the commitfest. In
 addition to the issues raised so far, ISTM that the patch makes dropping a
 very large table with small shared_buffers slower
 (DropForkSpecificBuffers() is O(n), where n is the size of the relation,
 while the current method is O(shared_buffers)).


There are no unresolved issues with the approach, nor is it true that it is
slower. If you think there are some, you should say what they are, not act
high-handedly to reject a patch without reason.

 I concur that we should explore using a radix tree or something else that
 would naturally allow you to find all buffers for relation/database X
 quickly.


I doubt that it can be managed efficiently while retaining optimal memory
management. If it can, it's going to be very low priority (or should be).

This approach works, so let's do it, now. If someone comes up with a better
way later, great.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Amit Kapila
On Thu, Jul 2, 2015 at 7:27 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/02/2015 04:18 PM, Amit Kapila wrote:


 Don't you think the approach discussed (delayed cleanup of buffers
 during checkpoint scan) is sufficient to fix the issues raised?


 Dunno, but there is no patch for that.


That's fair enough; however, your reply sounded to me like there is
no further need to explore this idea unless we have something like a
radix-tree-based approach.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] raw output from copy

2015-07-02 Thread Pavel Stehule
2015-07-02 16:02 GMT+02:00 Andrew Dunstan and...@dunslane.net:


 On 07/02/2015 09:43 AM, Simon Riggs wrote:

  On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote:


 Please don't top-post on the PostgreSQL lists. You've been around
 here long enough to know that bottom posting is our custom.

 I posted a patch for this in 2013 at
 http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net
 but it can apply to a SELECT, and doesn't need COPY. Nobody seemed
 very interested, so I dropped it. Apparently people now want
 something along these lines, which is good.


 It's a shame that both solutions are restricted to either COPY or psql.

 Both of those are working on suggestions from Tom, so there is no history
 of preference there.

 Can we have both please, gentlemen?

 If we implemented Andrew's solution, how would we request it in a COPY
 statement? Seems like we would want the RAW format keyword anyway.




 What's the use case? My original motivation was that I had a function that
 returned a bytea (it was a PDF in fact) that I wanted to be able to write
 to a file. Of course, this is easy enough to do with a client library like
 perl's DBD::Pg, but it seems sad to have to resort to that for something so
 simple.

 My original suggestion (
 http://www.postgresql.org/message-id/4ea1b83b.2050...@pgexperts.com) was
 to invent a \bcopy command.

 I don't have a problem in building in a RAW mode for copy, but we'll still
 need to teach psql how to deal with it.


It can be used from psql without any problems.



 Another case where it could be useful is JSON - so we can avoid having to
 play tricks like 
 http://adpgtech.blogspot.com/2014/09/importing-json-data.html. Similar
 considerations probably apply to XML, and the tricks are less guaranteed to
 work.

 cheers

 andrew



Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-07-02 Thread Heikki Linnakangas

On 06/27/2015 07:45 AM, Amit Kapila wrote:

Sometime back, on one of the PostgreSQL blogs [1], there was
discussion about the performance of drop/truncate table for
large values of shared_buffers, and it seems that as the value
of shared_buffers increases, the performance of drop/truncate
table becomes worse.  I think those are not often-used operations,
so it never became a priority to look into improving them if possible.

I have looked into it and found that the main reason for such
behaviour is that those operations traverse the whole of
shared_buffers, and it seems to me that we don't need that,
especially for not-so-big tables.  We can optimize that path
by looking into the buffer mapping table for the pages that exist in
shared_buffers for the case when the table size is less than some
threshold (say 25%) of shared buffers.

The attached patch implements the above idea, and I found that
performance doesn't dip much with the patch even with large values
of shared_buffers.  I have also attached the script and sql file used
to take performance data.


I'm marking this as "returned with feedback" in the commitfest. In
addition to the issues raised so far, ISTM that the patch makes dropping
a very large table with small shared_buffers slower
(DropForkSpecificBuffers() is O(n), where n is the size of the relation,
while the current method is O(shared_buffers)).


I concur that we should explore using a radix tree or something else 
that would naturally allow you to find all buffers for relation/database 
X quickly.


- Heikki





Re: [HACKERS] raw output from copy

2015-07-02 Thread Andrew Dunstan


On 07/02/2015 09:02 AM, Andrew Dunstan wrote:


On 07/02/2015 07:14 AM, Pavel Stehule wrote:

Hi

I'll do it today evening




Pavel,

Please don't top-post on the PostgreSQL lists. You've been around here 
long enough to know that bottom posting is our custom.


I posted a patch for this in 2013 at 
http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net 
but it can apply to a SELECT, and doesn't need COPY. Nobody seemed 
very interested, so I dropped it. Apparently people now want something 
along these lines, which is good.



For reference, here's the Wayback Machine's version of the original blog 
post referred to: 
http://web.archive.org/web/20110916023912/http://people.planetpostgresql.org/andrew/index.php?/archives/196-Clever-trick-challenge.html


cheers

andrew






Re: [HACKERS] WALWriter active during recovery

2015-07-02 Thread Fujii Masao
On Thu, Mar 5, 2015 at 5:22 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Currently, WALReceiver writes and fsyncs data it receives. Clearly,
 while we are waiting for an fsync we aren't doing any other useful
 work.

 Following patch starts WALWriter during recovery and makes it
 responsible for fsyncing data, allowing WALReceiver to progress other
 useful actions.

 With the patch, replication didn't work fine on my machine. I started
 the standby server after removing all the WAL files from the standby.
 ISTM that the patch doesn't handle that case. That is, in that case,
 the standby tries to start up walreceiver and replication to retrieve
 the REDO-starting checkpoint record *before* starting up walwriter
 (IOW, before reaching the consistent point). Then, since walreceiver works
 without walwriter, no received WAL data can be fsync'd in the standby.
 So replication cannot advance further. I think that walwriter needs
 to start before walreceiver starts.

 I just marked this patch as Waiting on Author.

This patch was moved to the current CF with the status "Needs review".
But there are already some review comments which have not been addressed yet,
so I marked the patch as "Waiting on Author" again.

Regards,

-- 
Fujii Masao




Re: [HACKERS] raw output from copy

2015-07-02 Thread Simon Riggs
On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote:


 Please don't top-post on the PostgreSQL lists. You've been around here
 long enough to know that bottom posting is our custom.

 I posted a patch for this in 2013 at 
 http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but
 it can apply to a SELECT, and doesn't need COPY. Nobody seemed very
 interested, so I dropped it. Apparently people now want something along
 these lines, which is good.


It's a shame that both solutions are restricted to either COPY or psql.

Both of those are working on suggestions from Tom, so there is no history
of preference there.

Can we have both please, gentlemen?

If we implemented Andrew's solution, how would we request it in a COPY
statement? Seems like we would want the RAW format keyword anyway.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] raw output from copy

2015-07-02 Thread Pavel Stehule
2015-07-02 15:43 GMT+02:00 Simon Riggs si...@2ndquadrant.com:

 On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote:


 Please don't top-post on the PostgreSQL lists. You've been around here
 long enough to know that bottom posting is our custom.

 I posted a patch for this in 2013 at 
 http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but
 it can apply to a SELECT, and doesn't need COPY. Nobody seemed very
 interested, so I dropped it. Apparently people now want something along
 these lines, which is good.


 It's a shame that both solutions are restricted to either COPY or psql.

 Both of those are working on suggestions from Tom, so there is no history
 of preference there.

 Can we have both please, gentlemen?

 If we implemented Andrew's solution, how would we request it in a COPY
 statement? Seems like we would want the RAW format keyword anyway.


I prefer a COPY-like solution - it can be used on both sides (server,
client), and it can be used a little more simply with the psql -c XXX pattern.

Regards

Pavel



 --
 Simon Riggs                http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Information of pg_stat_ssl visible to all users

2015-07-02 Thread Peter Eisentraut
On 6/10/15 2:17 AM, Magnus Hagander wrote:
 AIUI that one was just about the DN field, and not about the rest. If I
 understand you correctly, you are referring to the whole thing, not just
 one field?

I think at least the DN field shouldn't be visible to unprivileged users.

Actually, I think the whole view shouldn't be accessible to unprivileged
users, except maybe your own row.





Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-07-02 Thread Kouhei Kaigai
  Let me introduce a few cases we should pay attention.
 
  Foreign/CustomScan node may stack; that means a Foreign/CustomScan node
  may have child node that includes another Foreign/CustomScan node with
  scanrelid==0.
  (At this moment, ForeignScan cannot have child node, however, more
  aggressive push-down [1] will need same feature to fetch tuples from
  local relation and construct VALUES() clause.)
  In this case, the highest Foreign/CustomScan node (that is also nearest
  to LockRows or ModifyTable) runs the alternative sub-plan that includes
  scan/join plans dominated by fdw_relids or custom_relids.
 
  For example:
 LockRows
  - HashJoin
- CustomScan (AliceJoin)
  - SeqScan on t1
  - CustomScan (CarolJoin)
- SeqScan on t2
- SeqScan on t3
- Hash
  - CustomScan (BobJoin)
- SeqScan on t4
- ForeignScan (remote join involves ft5, ft6)
 
  In this case, AliceJoin will have alternative sub-plan to join t1, t2
  and t3, then it shall be used on EvalPlanQual(). Also, BobJoin will
  have alternative sub-plan to join t4, ft5 and ft6. CarolJoin and the
  ForeignScan will also have alternative sub-plan, however, these are
  not used in this case.
  Probably, it works fine.
 
 Yeah, I think so too.

Sorry, I need to adjust my explanation above a bit:

In this case, AliceJoin will have alternative sub-plan to join t1 and
CarolJoin, then CarolJoin will have alternative sub-plan to join t2 and
t3. Also, BobJoin will have alternative sub-plan to join t4 and the
ForeignScan with remote join, and this ForeignScan node will have
alternative sub-plan to join ft5 and ft6.

Why is this recursive design better? Because it makes the planner enhancement
much simpler than the overall approach. Please see my explanation in the
section below.

  On the next step, how do we implement this design?
  I guess that planner needs to keep a path that contains neither
  foreign-join nor custom-join with scanrelid==0.
  Probably, cheapest_builtin_path of RelOptInfo is needed that
  never contains these remote/custom join logic, as a seed of
  alternative sub-plan.
 
 Yeah, I think so too, but I've not figured out how to implement this yet.

 To be honest, ISTM that it's difficult to do that simply and efficiently
 for the foreign/custom-join-pushdown API that we have for 9.5.  It's a
 little late, but what I started thinking is to redesign that API so that
 that API is called at standard_join_search, as discussed in [2]; (1) to
 place that API call *after* the set_cheapest call and (2) to place
 another set_cheapest call after that API call for each joinrel.  By the
 first set_cheapest call, I think we could probably save an alternative
 path that we need in cheapest_builtin_path.  By the second
 set_cheapest call following that API call, we could consider
 foreign/custom-join-pushdown paths also.  What do you think about this idea?

Disadvantage is larger than advantage, sorry.
The reason why we put the foreign/custom-join hook on add_paths_to_joinrel()
is that the source relations (inner/outer) were not obvious elsewhere; thus,
we could not reproduce which relations are the source of this join.
So I had to give up when I tried this approach before.


My idea is that we save the cheapest_total_path of RelOptInfo onto the
new cheapest_builtin_path just before the GetForeignJoinPaths() hook.
Why? It should be a built-in join logic, never be a foreign/custom-join,
because of the hook location; only built-in logic shall be added here.
Even if either/both of the join sub-trees contain a foreign/custom-join,
those paths have their own alternative sub-plans at their level, so there is
no need to care about them at the current level. (This is the reason why I
adjusted my explanation above.)
Once this built-in path is kept and a foreign/custom-join gets chosen by
set_cheapest(), it is easy to attach this sub-plan to the ForeignScan or
CustomScan node.
I don't see any significant downside in this approach.
What is your opinion?
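
The idea above can be sketched in pseudocode. This is an illustrative sketch
only, not actual PostgreSQL source: cheapest_builtin_path and
consider_join_pushdown are hypothetical names, and the real hook invocation
takes more arguments.

```
/* Hypothetical sketch -- not actual PostgreSQL code.  Save the cheapest
 * built-in join path before the pushdown hook can add its own paths. */
consider_join_pushdown(root, joinrel):
    /* only built-in join paths (nestloop/merge/hash) exist so far */
    set_cheapest(joinrel);
    joinrel->cheapest_builtin_path = joinrel->cheapest_total_path;

    /* let the FDW / custom-scan provider add pushed-down join paths */
    if (fdwroutine->GetForeignJoinPaths)
        fdwroutine->GetForeignJoinPaths(root, joinrel, ...);

    /* re-pick the cheapest path, now including pushdown candidates;
     * if a ForeignScan/CustomScan wins, the plan built from
     * cheapest_builtin_path becomes its EvalPlanQual alternative */
    set_cheapest(joinrel);
```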


Regarding the development timeline, I prefer to put in a workaround
so as not to trigger the Assert() in ExecScanFetch().
We may add a warning in the documentation not to replace a built-in
join if either/both of the sub-trees are the target of UPDATE/DELETE or
FOR SHARE/UPDATE.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com




Re: [HACKERS] raw output from copy

2015-07-02 Thread Andrew Dunstan


On 07/02/2015 11:02 AM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

Does the COPY line protocol even support binary data?

The protocol, per se, just transmits a byte stream.  There is a field
in the CopyInResponse/CopyOutResponse messages that indicates whether
a text or binary copy is being done.  One thing we'd have to consider
is whether raw mode is sufficiently different from binary to justify
an additional value for this field, and if so whether that constitutes
a protocol break.

IIRC, psql wouldn't really care; it just transfers the byte stream to or
from the target file, regardless of text or binary mode.  But there might
be other client libraries that are smarter and expect binary mode to
mean the binary file format specified in the COPY reference page.  So
there may be value in being explicit about raw mode in these messages.

A key point in all this is that people who need raw transfer probably
need it in both directions, a point that your SELECT proposal cannot
satisfy, but hacking COPY could.  So I lean towards the latter really.
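
For illustration, the format field mentioned above lives in the
CopyOutResponse ('H') message: an int8 overall format code (0 = text,
1 = binary) followed by an int16 column count and one int16 format code per
column. A minimal stand-alone parser over a raw byte buffer might look like
this (the function names are mine, not part of libpq):

```c
#include <assert.h>
#include <stdint.h>

/* Read a big-endian 16-bit integer from the wire buffer. */
static int be16(const uint8_t *p)
{
    return (p[0] << 8) | p[1];
}

/*
 * Parse a CopyOutResponse message and return the overall format code
 * (0 = text, 1 = binary).  Layout: 'H', int32 length, int8 format,
 * int16 column count, then one int16 format code per column.
 */
static int parse_copy_out_response(const uint8_t *msg, int *ncols)
{
    int format = msg[5];        /* int8 after type byte + int32 length */
    *ncols = be16(msg + 6);     /* int16 number of columns */
    return format;
}
```

A hypothetical raw mode could claim a new value for that int8 field; clients
that only switch on 0/1 would then need updating, which is exactly the
protocol-break question raised above.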





OK, let's do that. I await the result with interest.

cheers

andrew




Re: [HACKERS] [BUGS] BUG #13126: table constraint loses its comment

2015-07-02 Thread Petr Jelinek

On 2015-05-27 15:10, Michael Paquier wrote:

On Wed, Apr 29, 2015 at 1:30 PM, Michael Paquier
michael.paqu...@gmail.com wrote:

On Sun, Apr 26, 2015 at 6:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:

x...@resolvent.net writes:

In some circumstances, the comment on a table constraint disappears.  Here
is an example:


Hm, yeah.  The problem is that ATExecAlterColumnType() rebuilds all the
affected indexes from scratch, and it isn't doing anything about copying
their comments to the new objects (either comments on the constraints, or
comments directly on the indexes).

The least painful way to fix it might be to charter ATPostAlterTypeCleanup
to create COMMENT commands and add those to the appropriate work queue,
rather than complicating the data structure initially emitted by
ATExecAlterColumnType.  But it'd still be a fair amount of new code I'm
afraid.

Not planning to fix this personally, but maybe someone else would like to
take it up.


In order to fix this, an idea would be to add a new routine in
ruleutils.c that generates the COMMENT query string, and then call it
directly from tablecmds.c. On master, I imagine that we could even add
some SQL interface if there is some need.
Thoughts?


After looking at this problem, I noticed that the test case given
above does not cover everything: primary key indexes, constraints
(CHECK for example) and indexes alone have also their comments removed
after ALTER TABLE when those objects are re-created. I finished with
the patch attached to fix everything, patch that includes a set of
regression tests covering all the code paths added.



I was going through the code and have few comments:
- Why do you change the return value of TryReuseIndex? Can't we reuse 
the same OidIsValid(stmt->oldNode) check that ATExecAddIndex is 
doing instead?


- I think the changes to ATPostAlterTypeParse should follow more closely 
the coding of transformTableLikeClause - namely use the idxcomment


- Also the changes to ATPostAlterTypeParse move the code somewhat 
uncomfortably over the 80 char width, I don't really see a way to fix 
that except for moving some of the nesting out to another function.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] raw output from copy

2015-07-02 Thread Andrew Dunstan


On 07/02/2015 10:07 AM, Pavel Stehule wrote:



2015-07-02 16:02 GMT+02:00 Andrew Dunstan and...@dunslane.net:



On 07/02/2015 09:43 AM, Simon Riggs wrote:

On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote:


Please don't top-post on the PostgreSQL lists. You've been
around
here long enough to know that bottom posting is our custom.

I posted a patch for this in 2013 at
http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net

but it can apply to a SELECT, and doesn't need COPY.
Nobody seemed
very interested, so I dropped it. Apparently people now want
something along these lines, which is good.


It's a shame that both solutions are restricted to either COPY
or psql.

Both of those are working on suggestions from Tom, so there is
no history of preference there.

Can we have both please, gentlemen?

If we implemented Andrew's solution, how would we request it
in a COPY statement? Seems like we would want the RAW format
keyword anyway.




What's the use case? My original motivation was that I had a
function that returned a bytea (it was a PDF in fact) that I
wanted to be able to write to a file. Of course, this is easy
enough to do with a client library like perl's DBD::Pg, but it
seems sad to have to resort to that for something so simple.

My original suggestion
(http://www.postgresql.org/message-id/4ea1b83b.2050...@pgexperts.com)
was to invent a \bcopy command.

I don't have a problem in building in a RAW mode for copy, but
we'll still need to teach psql how to deal with it.


It can be used from psql without any problems.



In fact your patch will not work with psql's \copy nor to stdout at all, 
unless I'm misreading it:


   -if (cstate->binary)
   +if (cstate->binary || cstate->raw)
 ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg(COPY BINARY is not supported to stdout or from
   stdin)));


So it looks like you're only supporting this where the server is writing 
to a file. That's horribly narrow, and certainly doesn't meet my 
original need.


Does the COPY line protocol even support binary data? If not, we're dead 
in the water here from the psql POV. Because my patch doesn't use the 
COPY protocol it doesn't have this problem.


Perhaps we should do both, although I'm not sure I understand the use 
case for the COPY solution.


cheers

andrew




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-07-02 Thread Tom Lane
Syed, Rahila rahila.s...@nttdata.com writes:
 Hello,
 Unless I am missing something, I guess you can still keep the actual code 
 that updates counters outside the core if you adopt an approach that Simon 
 suggests.
 Yes. The code to extract progress information from VACUUM and storing in 
 shared memory can be outside core even with pg_stat_activity as a user 
 interface.

 Whatever the view (existing/new), any related counters would have a valid 
 (non-NULL) value when read off the view iff hooks are set perhaps because 
 you have an extension that sets them. 
 I guess that means any operation that supports progress tracking would 
 have an extension with suitable hooks implemented.
 Do you mean to say that any operation/application that wants the progress
 tracking feature will dynamically load the progress-checker module, which
 will set the hooks for progress reporting?
 If yes, unless I am missing something, such dynamic loading cannot happen if
 we use pg_stat_activity, as it gets values from shared memory. The module has
 to be a shared_preload_library to allocate shared memory, so the module can
 be loaded only at server restart. Am I missing something?

TBH, I think that designing this as a hook-based solution is adding a
whole lot of complexity for no value.  The hard parts of the problem are
collecting the raw data and making the results visible to users, and
both of those require involvement of the core code.  Where is the benefit
from pushing some trivial intermediate arithmetic into an external module?
If there's any at all, it's certainly not enough to justify problems such
as you mention here.

So I'd just create a pgstat_report_percent_done() type of interface in
pgstat.c and then teach VACUUM to call it directly.
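
A minimal sketch of what such an interface could look like, with an
in-process array standing in for pgstat's shared memory; all names here are
hypothetical, not the actual pgstat API:

```c
#include <assert.h>

#define MAX_BACKENDS 64

/* Stand-in for per-backend slots in pgstat shared memory. */
static int progress_percent[MAX_BACKENDS];

/* VACUUM would call this periodically as it advances through the heap. */
static void pgstat_report_percent_done(int backend_id, int percent)
{
    if (percent < 0)
        percent = 0;
    if (percent > 100)
        percent = 100;
    progress_percent[backend_id] = percent;   /* one atomic-sized store */
}

/* What a pg_stat_activity-style view would read back. */
static int pgstat_get_percent_done(int backend_id)
{
    return progress_percent[backend_id];
}
```

Since the core both collects the raw data and exposes it, no hook or
loadable module is needed, which is the point made above.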

regards, tom lane




Re: [HACKERS] Freeze avoidance of very large table.

2015-07-02 Thread Sawada Masahiko
On Thu, Jul 2, 2015 at 1:06 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Jul 2, 2015 at 12:13 AM, Sawada Masahiko sawada.m...@gmail.com 
 wrote:
 On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko sawada.m...@gmail.com 
 wrote:
 On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko sawada.m...@gmail.com 
 wrote:
 On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko sawada.m...@gmail.com 
 wrote:
 On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby jim.na...@bluetreble.com 
 wrote:
 On 4/23/15 11:06 AM, Petr Jelinek wrote:

 On 23/04/15 17:45, Bruce Momjian wrote:

 On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
 Agreed, no extra file, and the same write volume as currently.  It 
 would
 also match pg_clog, which uses two bits per transaction --- maybe we 
 can
 reuse some of that code.


 Yeah, this approach seems promising. We probably can't reuse code from
 clog because the usage pattern is different (key for clog is xid, while
 for visibility/freeze map ctid is used). But visibility map storage
 layer is pretty simple so it should be easy to extend it for this use.


 Actually, there may be some bit manipulation functions we could reuse;
 things like efficiently counting how many things in a byte are set. 
 Probably
 doesn't make sense to fully refactor it, but at least CLOG is a good 
 source
 for cut/paste/whack.


 I agree with adding a bit that indicates corresponding page is
 all-frozen into VM, just like CLOG.
 I'll change the patch as second version patch.


 The second patch is attached.

 In second patch, I added a bit that indicates all tuples in page are
 completely frozen into visibility map.
 The visibility map became a bitmap with two bit per heap page:
 all-visible and all-frozen.
 The logics around vacuum, insert/update/delete heap are almost same as
 previous version.

 This patch lack some point: documentation, comment in source code,
 etc, so it's WIP patch yet,
 but I think that it's enough to discuss about this.


 The previous patch no longer applies cleanly to HEAD.
 The attached v2 patch is latest version.

 Please review it.

 Attached new rebased version patch.
 Please give me comments!

 Now we should review your design and approach rather than code,
 but since I got an assertion error while trying the patch, I report it.

 initdb -D test -k caused the following assertion failure.

 vacuuming database template1 ... TRAP:
 FailedAssertion(!(((PageHeader) (heapPage))->pd_flags & 0x0004)),
 File: visibilitymap.c, Line: 328)
 sh: line 1: 83785 Abort trap: 6
 /dav/000_add_frozen_bit_into_visibilitymap_v3/bin/postgres --single
 -F -O -c search_path=pg_catalog -c exit_on_error=true template1 
 /dev/null
 child process exited with exit code 134
 initdb: removing data directory test

Thank you for the bug report and comments.

Fixed version is attached, and source code comment is also updated.
Please review it.

And I explain again here about what this patch does, current design.

- An additional bit for the visibility map.
I added an additional bit, say an all-frozen bit, which indicates whether
all tuples of the corresponding page are frozen, to the visibility map.
This structure is similar to CLOG.
So the size of the VM is twice what it is today.
Also, the flags of each heap page header might be set to PD_ALL_FROZEN,
as well as all-visible.

- Setting and clearing the all-frozen bit
Update, delete and insert (multi-insert) operations clear the bit
of that page, and clear the flags of the page header at the same time.
Only the vacuum operation can set the bit, if all tuples of a page are frozen.

- Anti-wraparound vacuum
Today we have to scan the whole table for XID anti-wraparound vacuum, and
it's really quite expensive because of the disk I/O.
The main benefit of this proposal is to reduce and avoid such an
extremely large quantity of I/O, even when an anti-wraparound vacuum is
executed.
In the lazy_scan_heap() function, I added such logic experimentally.

There were several other ideas in the previous discussion, such as a
read-only table or a frozen map. But the advantage of this direction is that
we don't need an additional heap file, and we can use the matured VM
mechanism.
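
The two-bits-per-page layout can be illustrated with the following address
arithmetic; the bit assignments here are illustrative, not necessarily the
ones the patch uses:

```c
#include <assert.h>
#include <stdint.h>

/* With two bits per heap page (all-visible + all-frozen), each
 * visibility-map byte covers four heap pages. */
#define BITS_PER_HEAPBLOCK 2
#define HEAPBLOCKS_PER_BYTE (8 / BITS_PER_HEAPBLOCK)   /* 4 */

#define VM_ALL_VISIBLE 0x01
#define VM_ALL_FROZEN  0x02

/* Which map byte holds this heap block's bits. */
static uint32_t vm_byte_for_block(uint32_t blkno)
{
    return blkno / HEAPBLOCKS_PER_BYTE;
}

/* Bit offset of this heap block's pair of bits within that byte. */
static int vm_shift_for_block(uint32_t blkno)
{
    return BITS_PER_HEAPBLOCK * (blkno % HEAPBLOCKS_PER_BYTE);
}

/* Set the all-frozen bit for a block in a toy map, touching nothing else. */
static void vm_set_all_frozen(uint8_t *map, uint32_t blkno)
{
    map[vm_byte_for_block(blkno)] |=
        (uint8_t)(VM_ALL_FROZEN << vm_shift_for_block(blkno));
}

static int vm_is_all_frozen(const uint8_t *map, uint32_t blkno)
{
    return ((map[vm_byte_for_block(blkno)] >> vm_shift_for_block(blkno))
            & VM_ALL_FROZEN) ? 1 : 0;
}
```

The doubling of the map size mentioned above falls straight out of
BITS_PER_HEAPBLOCK going from 1 to 2.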

Regards,

--
Sawada Masahiko
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..835d714 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 Buffer newbuf, HeapTuple oldtup,
 HeapTuple newtup, HeapTuple old_key_tup,
-bool all_visible_cleared, bool new_all_visible_cleared);
+bool all_visible_cleared, bool new_all_visible_cleared,
+bool all_frozen_cleared, bool new_all_frozen_cleared);
 static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
 			 Bitmapset *hot_attrs,
 			 Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2107,7 +2108,8 @@ heap_insert(Relation 

Re: [HACKERS] Raising our compiler requirements for 9.6

2015-07-02 Thread Robert Haas
On Wed, Jul 1, 2015 at 6:24 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-07-02 00:15:14 +0200, Andres Freund wrote:
 animal      OS          compiler              inline   quiet inline   varargs

 brolga      cygwin      gcc-4.3               y        y

 4.3 obviously supports varargs. Human error.

 pademelon   HP-UX 10.2  HP C Compiler 10.32   n        n              n

 Since, buildfarm/quiet inline test issues aside, pademelon is the only
 animal not supporting inlines and varargs, I think we should just go
 ahead and start to use both.

So far this thread is all about the costs of desupporting compilers
that don't have these features, and you're making a good argument
(that I think we all agree with) that the cost is small.  But you
haven't really said what the benefits are.

In the case of static inline, the main problem with the status quo
AFAICS is that you have to do a little dance any time you'd otherwise
use a static inline function (for those following along at home, git
grep INCLUDE_DEFINITIONS src/include to see how this is done).  Now,
it is obviously the case that the first time somebody has to do this
dance, they will be somewhat inconvenienced, but once you know how to
do it, it's not incredibly complicated.  So, just to play the devil's
advocate here, who really cares?  The way we're doing it right now
works everywhere and is as efficient on each platform as the compiler
on that platform can support.  I accept that there is some developer
convenience to not having to worry about the INCLUDE_DEFINITIONS
thing, and maybe that's a sufficient reason to do it, but is that the
only benefit we're hoping to get?
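
For readers following along, the "dance" is that a header cannot simply
define a function body on compilers lacking inline support, so the body is
wrapped in guards and exactly one .c file defines INCLUDE_DEFINITIONS to emit
it. With static inline required, the header can just contain the definition.
A trivial sketch (the guard macros in the comment mirror the pattern; the
function itself is an invented example):

```c
#include <assert.h>

/*
 * Old-style pattern (simplified): the header guards the body so that only
 * one translation unit emits it on compilers without inline support:
 *
 *     #if defined(USE_INLINE) || defined(INCLUDE_DEFINITIONS)
 *     STATIC_IF_INLINE int add_clamped(int a, int b, int limit) { ... }
 *     #endif
 *
 * With static inline unconditionally available, the header just says:
 */
static inline int add_clamped(int a, int b, int limit)
{
    int sum = a + b;                /* saturating add, inlined per call site */
    return sum > limit ? limit : sum;
}
```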

I'd consider an argument for adopting one of these features to be much
stronger if were accompanied by arguments like:

- If we require feature X, we can reduce the size of the generated
code and improve performance.
- If we require feature X, we can reduce the risk of bugs.
- If we require feature X, static analyzers will be able to understand
our code better.

If everybody else here says working around the lack of feature X is
too much of a pain in the butt and we want to adopt it purely to make
development easier, I'm not going to sit here and try to fight the
tide.  But I personally don't find such arguments all that exciting.
It's not at all clear to me that static inline or variadic macros
would make our lives meaningfully better, and like Peter, I think //
comments are a not something we should adopt, because one style is
better than two.  Designated initializers would have meaningful
documentation value and might help to decrease the risk of bugs, so
that almost seems more interesting to me than static inline.  We'd
need to check what pgindent thinks of them, though, and how much
platform coverage we'd be surrendering.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-07-02 Thread Alvaro Herrera
Amit Kapila wrote:
 On Sat, Jun 27, 2015 at 12:54 AM, Robert Haas robertmh...@gmail.com wrote:
 
  On Mon, Jun 15, 2015 at 2:52 PM, Amit Kapila
  amit.kapil...@gmail.com wrote:
   Attached patch provides a fix as per above discussion.
 
  I think we should emit some LOG messages here.  When we detect the
  file is there:
 
  LOG: ignoring tablespace_map file because no backup_label file exists
 
  If the rename fails:
 
  LOG: could not rename file %s to %s: %m
 
  If it works:
 
  LOG: renamed file %s to %s
 
 Added the above log messages in attached patch with small change
 such that in message, file names will be displayed with quotes as most
 of other usages of rename (failure) in that file uses quotes to display
 filenames.

Why emit two messages?  Can we reduce that to a single one?  Maybe the
first one could be errdetail or something.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] raw output from copy

2015-07-02 Thread Simon Riggs
On 2 July 2015 at 15:07, Pavel Stehule pavel.steh...@gmail.com wrote:


 It can be used from psql without any problems.


It can, but your patch does not yet do that, while Andrew's does.

We want a solution that works from psql and other clients. Hopefully the
same-ish solution.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Rework the way multixact truncations work

2015-07-02 Thread Robert Haas
On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund and...@anarazel.de wrote:
 New version attached.

0002 looks good, but the commit message should perhaps mention the
comment fix.  Or commit that separately.

Will look at 0003 next.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] two-arg current_setting() with fallback

2015-07-02 Thread Tom Lane
Jeevan Chalke jeevan.cha...@enterprisedb.com writes:
 Attached patch which fixes my review comments.

Applied with minor adjustments (mostly cosmetic, but did neither of you
notice the compiler warning?)

regards, tom lane




Re: [HACKERS] Add checksums without --initdb

2015-07-02 Thread Heikki Linnakangas

On 07/02/2015 11:28 PM, Andres Freund wrote:

On 2015-07-02 22:53:40 +0300, Heikki Linnakangas wrote:

Add an enabling-checksums mode to the server where it calculates checksums
for anything it writes, but doesn't check or complain about incorrect
checksums on reads. Put the server into that mode, and then have a
background process that reads through all data in the cluster, calculates
the checksum for every page, and writes all the data back. Once that's
completed, checksums can be fully enabled.


You'd need, afaics, a bgworker that connects to every database to read
pg_class, to figure out what type of page a relfilenode has. And this
probably should call back into the relevant AM or such.


Nah, we already assume that every relation data file follows the 
standard page format, at least enough to have the checksum field at the 
right location. See FlushBuffer() - it just unconditionally calculates 
the checksum before writing out the page. (I'm not totally happy about 
that, but that ship has sailed)
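
The background pass described above reduces to "compute the checksum while
skipping the stored checksum field itself, stamp it, write the page back". A
toy stand-alone version, using FNV-1a in place of the real pg_checksum_page
algorithm and an assumed 2-byte checksum at offset 8:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192
#define CHECKSUM_OFFSET 8   /* pd_checksum sits early in the page header */

/* Toy page checksum: FNV-1a over the page, skipping the 2-byte checksum
 * field itself (a stand-in for the real pg_checksum_page algorithm). */
static uint16_t toy_page_checksum(const uint8_t *page)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < PAGE_SIZE; i++)
    {
        if (i == CHECKSUM_OFFSET || i == CHECKSUM_OFFSET + 1)
            continue;           /* don't checksum the checksum */
        h = (h ^ page[i]) * 16777619u;
    }
    return (uint16_t)(h ^ (h >> 16));
}

/* The background pass: visit every page, stamp its checksum, write back. */
static void checksum_enable_pass(uint8_t *pages, size_t npages)
{
    for (size_t i = 0; i < npages; i++)
    {
        uint8_t *page = pages + i * PAGE_SIZE;
        uint16_t ck = toy_page_checksum(page);
        memcpy(page + CHECKSUM_OFFSET, &ck, sizeof(ck));
    }
}
```

Because the checksum bytes are excluded from the hash, stamping the value
does not invalidate it, which is what lets the enabling mode rewrite pages
while reads remain tolerant.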

- Heikki





Re: [HACKERS] PATCH:do not set Win32 server-side socket buffer size on windows 2012

2015-07-02 Thread Heikki Linnakangas

On 04/10/2015 01:46 PM, chenhj wrote:

PostgreSQL set Win32 server-side socket buffer size to 32k since 2006, for 
performance reasons.

While, on newer versions of Windows, such as Windows 2012, the default socket
buffer size is 64k, and it seems to have better performance (higher throughput).
So I propose to apply the attached patch (based on the snapshot of 9.5dev) to
set the Win32 server-side socket buffer size to 32k only when the default
value is less than 32k.


Seems reasonable. I edited the comment somewhat, and added #ifdefs on 
the new variables to avoid compiler warnings on other platforms.



OS            default socket buffer size (from getsockopt(SO_SNDBUF))
Windows 7:    8k
Windows 2003: 8k
Windows 2008: 8k
Windows 8:    64k
Windows 2012: 64k


The following is my performance test for various SO_SNDBUF setting.


Test method:
Use psql to fetch about 100MB data from PostgreSQL(Windows) with various 
SO_SNDBUF setting.
[chenhj@node2 ~]$ time psql -h dbsvr -p 5432 -U postgres -A -t -c select
'1'::char(1000),generate_series(1,10) >/dev/null


real0m3.295s
user0m0.222s
sys0m0.250s


Environment1(default SO_SNDBUF 32k):
Client: PostgreSQL 9.4.1 at RHEL6(x64)
Server: PostgreSQL 9.4.1 at Windows 2012(x64)
Network:1Gbit LAN


Result(execute time):
default(64K),1.118s
set SO_SNDBUF to 32K,  3.295s(the current implement)
set SO_SNDBUF to 64K,  2.048s
set SO_SNDBUF to 128K,   1.404s
set SO_SNDBUF to 256K,   1.290s


1) When using Windows as the client OS, the result is similar, but there is no
/dev/null used by my test on Windows.
2) I think the reason that the default (64K) is faster than setting SO_SNDBUF
to 64K is
that dynamic send buffering is disabled after setting the SO_SNDBUF option.
https://msdn.microsoft.com/en-us/library/windows/desktop/bb736549(v=vs.85).aspx

Dynamic send buffering for TCP was added on Windows 7 and Windows Server 2008 
R2. By default,
dynamic send buffering for TCP is enabled unless an application sets the 
SO_SNDBUF socket option on the stream socket.

Environment2(default SO_SNDBUF 32k):
Client: PostgreSQL 9.4.1 at RHEL6(x64)
Server: PostgreSQL 9.4.1 at Windows 2008 R2(x64)
Network:1Gbit LAN


Result(execute time):
default(8K),  7.370s
set SO_SNDBUF to 32K, 4.159s(the current implement)
set SO_SNDBUF to 64K, 2.875s
set SO_SNDBUF to 128K,  1.593s
set SO_SNDBUF to 256K,  1.324s
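
The check-then-set logic the patch implements can be sketched portably
(POSIX here rather than Winsock; ensure_min_sndbuf is an invented helper
name, not part of the patch):

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ensure the socket's send buffer is at least `want` bytes: query the
 * current size first and only call setsockopt() when the default is too
 * small, leaving OS-level dynamic send buffering enabled when possible.
 * Returns the resulting buffer size, or -1 on error.
 */
static int ensure_min_sndbuf(int fd, int want)
{
    int cur = 0;
    socklen_t len = sizeof(cur);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &cur, &len) < 0)
        return -1;
    if (cur < want)
    {
        (void) setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &want, sizeof(want));
        len = sizeof(cur);
        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &cur, &len) < 0)
            return -1;
    }
    return cur;     /* note: Linux reports back double the requested value */
}
```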


I was about to commit the attached, but when I tested this between my 
Windows 8.1 virtual machine and Linux host, I was not able to see any 
performance difference. It may be because the case is hobbled by other 
inefficiencies, in the virtualization or somewhere else, but I wonder if 
others can reproduce the speedup?


- Heikki
From 2a8b9c5fe10d5b8dc4e4bcf00ed0025bc6d24a23 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakan...@iki.fi
Date: Thu, 2 Jul 2015 22:09:08 +0300
Subject: [PATCH 1/1] Don't set SO_SNDBUF on recent Windows versions that have
 a bigger default.

It's unnecessary to set it if the default is higher in the first place.
Furthermore, setting SO_SNDBUF disables the so-called dynamic send
buffering feature, which hurts performance further.

Chen Huajun

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4b37ed..7672287 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -726,6 +726,11 @@ StreamConnection(pgsocket server_fd, Port *port)
 	if (!IS_AF_UNIX(port->laddr.addr.ss_family))
 	{
 		int			on;
+#ifdef WIN32
+		int			oldopt;
+		int			optlen;
+		int			newopt;
+#endif
 
 #ifdef	TCP_NODELAY
 		on = 1;
@@ -747,16 +752,43 @@ StreamConnection(pgsocket server_fd, Port *port)
 #ifdef WIN32
 
 		/*
-		 * This is a Win32 socket optimization.  The ideal size is 32k.
-		 * http://support.microsoft.com/kb/823764/EN-US/
+		 * This is a Win32 socket optimization.  The OS send buffer should be
+		 * large enough to send the whole Postgres send buffer in one go, or
+		 * performance suffers.  The Postgres send buffer can be enlarged if a
+		 * very large message needs to be sent, but we won't attempt to
+		 * enlarge the OS buffer if that happens, so somewhat arbitrarily
+		 * ensure that the OS buffer is at least PQ_SEND_BUFFER_SIZE * 4.
+		 * (That's 32kB with the current default).
+		 *
+		 * The default OS buffer size used to be 8kB in earlier Windows
+		 * versions, but was raised to 64kB in Windows 2012.  So it shouldn't
+		 * be necessary to change it in later versions anymore.  Changing it
+		 * unnecessarily can even reduce performance, because setting
+		 * SO_SNDBUF in the application disables the dynamic send buffering
+		 * feature that was introduced in Windows 7.  So before fiddling with
+		 * SO_SNDBUF, check if the current buffer size is already large
+		 * enough and only increase it if necessary.
+		 *
+		 * See 

Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-07-02 Thread Amit Kapila
On Thu, Jul 2, 2015 at 7:44 PM, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:

 Amit Kapila wrote:
 
  Added the above log messages in the attached patch, with a small change
  such that in the message, file names will be displayed with quotes, as most
  other usages of rename (failure) in that file use quotes to display
  filenames.

 Why emit two messages?

not necessary.

  Can we reduce that to a single one?  Maybe the
 first one could be errdetail or something.


I think it is better the other way (basically have the second one as errdetail).
We already have one similar message in xlog.c done that way.
ereport(LOG,
        (errmsg("online backup mode canceled"),
         errdetail("\"%s\" was renamed to \"%s\".",
                   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));

Attached patch consolidates errmsg into one message.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


rename_mapfile_if_backupfile_not_present_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synch failover WAS: Support for N synchronous standby servers - take 2

2015-07-02 Thread Fujii Masao
On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus j...@agliodbs.com wrote:
 On 07/02/2015 12:44 PM, Andres Freund wrote:
 On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
 So there's two parts to this:

 1. I need to ensure that data is replicated to X places.

 2. I need to *know* which places data was synchronously replicated to
 when the master goes down.

 My entire point is that (1) alone is useless unless you also have (2).

 I think there's a good set of usecases where that's really not the case.

 Please share!  My plea for usecases was sincere.  I can't think of any.

 And do note that I'm talking about information on the replica, not on
 the master, since in any failure situation we don't have the old
 master around to check.

 How would you, even theoretically, synchronize that knowledge to all the
 replicas? Even when they're temporarily disconnected?

 You can't, which is why what we need to know is when the replica thinks
 it was last synced from the replica side.  That is, a sync timestamp and
 lsn from the last time the replica ack'd a sync commit back to the
 master successfully.  Based on that information, I can make an informed
 decision, even if I'm down to one replica.

 ... because we would know definitively which servers were in sync.  So
 maybe that's the use case we should be supporting?

 If you want automated failover you need a leader election amongst the
 surviving nodes. The replay position is all they need to elect the node
 that's furthest ahead, and that information exists today.

 I can do that already.  If quorum synch commit doesn't help us minimize
 data loss any better than async replication or the current 1-redundant,
 why would we want it?  If it does help us minimize data loss, how?

In your example of 2 : { local_replica, london_server, nyc_server },
if there is nothing like quorum commit, only local_replica is synch
and the other two are async. In this case, if the local data center gets
destroyed, you need to promote either london_server or nyc_server. But
since they are async, they might not have the data which has already been
committed on the master. So data loss! Of course, as I said yesterday,
they might have all the data and no data loss happens at the promotion.
But the point is that there is no guarantee that no data loss happens.
OTOH, if we use quorum commit, we can guarantee that either london_server
or nyc_server has all the data which has been committed on the master.

So I think that quorum commit is helpful for minimizing the data loss.

Regards,

-- 
Fujii Masao




Re: [HACKERS] bugfix: incomplete implementation of errhidecontext

2015-07-02 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On 2015-06-08 14:44:53 +, Jeevan Chalke wrote:
 This is trivial bug fix in the area of hiding error context.
 
 I observed that there are two places from which we are calling this function
 to hide the context in log messages. Those were broken.

 Broken in which sense? They did prevent stuff to go from the server log?

 I'm not convinced that hiding stuff from the client is really
 necessarily the same as hiding it from the server log. We e.g. always
 send the verbose log to the client, even if we only send the terse
 version to the server log.  I don't mind adjusting things for
 errhidecontext(), but it's not just a bug.

Not only is it not just a bug, I disagree that it's a bug at all.
The documentation of the errhidestmt function is crystal clear about
what it does:

 * errhidecontext --- optionally suppress CONTEXT: field of log entry

That says log entry, not anything else.  Furthermore, this is clearly
modeled on errhidestmt(), which also only affects what's written to the
log.

Generally our position on error reporting is that it's the client's
responsibility to decide what parts of a report it will or won't show
to the user, so even if we agreed the overall behavior was undesirable,
I do not think this is the appropriate fix.

I especially object to the part of the patch that suppresses calling the
context callback stack functions; that's just introducing inconsistent
behavior for no reason.  It doesn't prevent collection of context (there
are lots of errcontext() calls directly in ereports, which this wouldn't
stop), and it will break callers that are using those callbacks for
anything more than just calling errcontext().  An example here is that in
clauses.c's sql_inline_error_callback, this would not only suppress the
CONTEXT line but also reporting of the error cursor location.

What is the actual use-case that prompted this complaint?

regards, tom lane




Re: [HACKERS] WAL-related tools and .paritial WAL file

2015-07-02 Thread Fujii Masao
On Thu, Jul 2, 2015 at 8:59 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Jul 2, 2015 at 8:42 PM, Fujii Masao wrote:
 3) Something not caused by this patch that I just noticed... But
 pg_resetxlog does not remove .backup files in pg_xlog. Shouldn't they
 get moved away as well?

 pg_resetxlog also doesn't remove .history files in pg_xlog. Those remaining
 .backup and .history files are harmless after pg_resetxlog is executed.
 So I don't think that it's required to remove them. Of course we can do that,
 but some existing applications might depend on the current behavior...
 So unless there is a strong reason to do that, I'd like to leave it as it is.

 Well, I was just surprised to not see them wiped out. Let's not change
 the behavior, then, that has existed for ages. The rest of the patch looks
 fine to me (I would add dots at the end of sentences in comment
 blocks, but that's a detail). The new macros really make the code
 easier to read and understand!

Yep!

Applied the patch. Thanks!

Regards,

-- 
Fujii Masao




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-07-02 Thread Etsuro Fujita

On 2015/07/02 23:13, Kouhei Kaigai wrote:

To be honest, ISTM that it's difficult to do that simply and efficiently
for the foreign/custom-join-pushdown API that we have for 9.5.  It's a
little late, but what I started thinking is to redesign that API so that
it is called at standard_join_search, as discussed in [2]; (1) to
place that API call *after* the set_cheapest call and (2) to place
another set_cheapest call after that API call for each joinrel.  By the
first set_cheapest call, I think we could probably save an alternative
path that we need in cheapest_builtin_path.  By the second
set_cheapest call following that API call, we could consider
foreign/custom-join-pushdown paths as well.  What do you think about this idea?



Disadvantage is larger than advantage, sorry.
The reason why we put the foreign/custom-join hook on add_paths_to_joinrel()
is that the source relations (inner/outer) were not obvious there, thus
we cannot reproduce which relations are the source of this join.
So I had to give up when I tried this approach before.


Maybe I'm missing something, but my image of this approach is that if 
base relations for a given joinrel are all foreign tables and belong to 
the same foreign server, then by calling that API there we consider the 
remote join over all the foreign tables, and that if not, we give up on 
considering the remote join.



My idea is that we save the cheapest_total_path of RelOptInfo onto the
new cheapest_builtin_path just before the GetForeignJoinPaths() hook.
Why? It should be a built-in join logic, never be a foreign/custom-join,
because of the hook location; only built-in logic shall be added here.


My concern about your idea is that since (a) add_paths_to_joinrel 
is called multiple times per joinrel and (b) repeated add_path 
calls through GetForeignJoinPaths in add_paths_to_joinrel might remove 
old paths that are built-in, it's possible to save a path that is not 
built-in as the cheapest_total_path and thus to save that path wrongly 
as the cheapest_builtin_path.  There might be a good way to cope with 
that, though.



Regarding to the development timeline, I prefer to put something
workaround not to kick Assert() on ExecScanFetch().
We may add a warning in the documentation not to replace built-in
join if either/both of sub-trees are target of UPDATE/DELETE or
FOR SHARE/UPDATE.


I'm not sure that that is a good idea, but anyway, I think we need to 
hurry fixing this issue.


Best regards,
Etsuro Fujita




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-07-02 Thread Amit Langote
On 2015-07-02 PM 11:00, Syed, Rahila wrote:
 Hello,
 
 Unless I am missing something, I guess you can still keep the actual code 
 that updates counters outside the core if you adopt an approach that Simon 
 suggests.
 Yes. The code to extract progress information from VACUUM and store it in 
 shared memory can be outside core even with pg_stat_activity as a user 
 interface.
 
 Whatever the view (existing/new), any related counters would have a valid 
 (non-NULL) value when read off the view iff hooks are set perhaps because 
 you have an extension that sets them. 
 I guess that means any operation that supports progress tracking would 
 have an extension with suitable hooks implemented.
 Do you mean to say that any operation/application that wants the progress tracking 
 feature will dynamically load the progress checker module, which will set the 
 hooks for progress reporting?
 If yes, unless I am missing something, such dynamic loading cannot happen if 
 we use pg_stat_activity, as it gets values from shared memory. The module has to 
 be a shared_preload_library
 to allocate shared memory. So this means the module can be loaded only 
 at server restart. Am I missing something?
 

Assuming that a set of hooks per command plus shared memory structure(s) is the way
to go, I meant to say that hook implementations per command would be in their
own separate modules (loaded at server start, of course, for shared memory). Of
those, your proposed patch has vacuum_progress, for example. And in the context of
my comment above, that means the view would show NULL for commands for which
the module has not been set up in advance. IOW, between showing NULL in the
view and dynamically loading hook functions, we choose the former because I
don't know what the latter would mean in postgres.

Having said that, Tom's suggestion to export pgstat.c function(s) for
command(s) may be a more appealing way to go.

Thanks,
Amit





Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-02 Thread Fabien COELHO


Hello Simon,


We could do better, but that is not a reason not to commit this, as is.
Commit, please.


My 0,02€: Please do not commit without further testing...

I've submitted a patch to improve checkpoint write scheduling, including 
X00 hours of performance tests on various cases. This patch significantly 
changes the load distribution over the whole checkpoint, and AFAICS 
has been tested only on rather small cases.


I'm not sure that the power 1.5 is the right one for all cases. For a big 
checkpoint over 30 minutes, it may have, or not, very large and possibly 
unwanted effects. Maybe the 1.5 factor should really be a GUC. Well, what 
I really think is that it needs performance measurements.


In conclusion, and very egoistically, I would prefer if this patch could 
wait for the checkpoint scheduling patch to be considered, as it would 
basically invalidate the X00 hours of performance tests I ran :-)


--
Fabien.


Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-02 Thread Fujii Masao
On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout
klep...@svana.org wrote:
 On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
 Hi,

 On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
  === Start with an empty database

 My guess is you have wal_level = minimal?

 Default config, was just initdb'd. So yes, the default wal_level =
 minimal.

  === Note the index file is 8KB.
  === At this point nuke the database server (in this case it was simply
  === destroying the container it was running in.

 How did you continue from there? The container has persistent storage?
 Or are you reapplying the WAL somewhere else?

 The container has persistent storage on the host. What I think is
 actually unusual is that the script that started postgres was missing
 an 'exec', so postgres never got the signal to shut down.

  martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p 
  /data/postgres/pg_xlog/ 00010001 |grep -wE 
  '16389|16387|16393'
  rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
  0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; 
  tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; 
  oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; 
  shutdown
  rmgr: Storage len (rec/tot): 16/48, tx:  0, lsn: 
  0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
  rmgr: Sequencelen (rec/tot):158/   190, tx:686, lsn: 
  0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
  rmgr: Storage len (rec/tot): 16/48, tx:686, lsn: 
  0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
  rmgr: Storage len (rec/tot): 16/48, tx:686, lsn: 
  0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
  rmgr: Sequencelen (rec/tot):158/   190, tx:686, lsn: 
  0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
  rmgr: Storage len (rec/tot): 16/48, tx:686, lsn: 
  0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: 
  base/16385/16389 to 0 blocks
  rmgr: Storage len (rec/tot): 16/48, tx:686, lsn: 
  0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: 
  base/16385/16393 to 0 blocks
  pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero 
  length at 0/16BE740

 Note that the truncate will lead to a new, different, relfilenode.

 Really? Comparing the relfilenodes gives the same values before and
 after the truncate.

Yep, the relfilenodes are not changed in this case because CREATE TABLE and
TRUNCATE were executed in the same transaction block.

  ctmp=# select * from test;
  ERROR:  could not read block 0 in file base/16385/16393: read only 0 of 
  8192 bytes

 Hm. I can't reproduce this. Can you include a bit more details about how
 to reproduce?

 Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
 can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing
the following steps.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
 begin;
 create table test(id serial primary key);
 truncate table test;
 commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
select * from test;

The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only the index file truncation is
logged, and the index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, the index file is truncated
to 0 bytes... A very simple fix is to log the index creation in that case,
but I'm not sure if that's OK to do..

Regards,

-- 
Fujii Masao




Re: [HACKERS] Exposing PG_VERSION_NUM in pg_config

2015-07-02 Thread Michael Paquier
On Fri, Jul 3, 2015 at 6:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Michael Paquier michael.paqu...@gmail.com writes:
 On Wed, Mar 25, 2015 at 8:26 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... So attached is a patch that adds VERSION_NUM in
 Makefile.global.

 While there was not exactly universal consensus that we need this, the
 patch as given is merely two lines, so it seems awfully cheap to Just
 Do It.  Hence, I've gone ahead and committed it.  If we start getting
 complaints about use-cases this doesn't cover, we can re-discuss whether
 it's worth doing more.

This looks fine to me. Thanks.
-- 
Michael




[HACKERS] Interesting study what is C in practice

2015-07-02 Thread Greg Stark
We spend a lot of time debating what non standard behaviours are safe to
rely on because other software relies on it. Someone actually went and
surveyed some major projects about what behaviours are being counted on.

The results aren't as conclusive as I would have expected, but even so
some of them are still interesting.

http://www.cl.cam.ac.uk/~pes20/cerberus/notes50-2015-05-24-survey-discussion.html


Re: [HACKERS] bugfix: incomplete implementation of errhidecontext

2015-07-02 Thread Pavel Stehule
2015-07-03 1:07 GMT+02:00 Tom Lane t...@sss.pgh.pa.us:

 Andres Freund and...@anarazel.de writes:
  On 2015-06-08 14:44:53 +, Jeevan Chalke wrote:
  This is trivial bug fix in the area of hiding error context.
 
  I observed that there are two places from which we are calling this
 function
  to hide the context in log messages. Those were broken.

  Broken in which sense? They did prevent stuff to go from the server log?

  I'm not convinced that hiding stuff from the client is really
  necessarily the same as hiding it from the server log. We e.g. always
  send the verbose log to the client, even if we only send the terse
  version to the server log.  I don't mind adjusting things for
  errhidecontext(), but it's not just a bug.

 Not only is it not just a bug, I disagree that it's a bug at all.
 The documentation of the errhidestmt function is crystal clear about
 what it does:

  * errhidecontext --- optionally suppress CONTEXT: field of log entry

 That says log entry, not anything else.  Furthermore, this is clearly
 modeled on errhidestmt(), which also only affects what's written to the
 log.

 Generally our position on error reporting is that it's the client's
 responsibility to decide what parts of a report it will or won't show
 to the user, so even if we agreed the overall behavior was undesirable,
 I do not think this is the appropriate fix.

 I especially object to the part of the patch that suppresses calling the
 context callback stack functions; that's just introducing inconsistent
 behavior for no reason.  It doesn't prevent collection of context (there
 are lots of errcontext() calls directly in ereports, which this wouldn't
 stop), and it will break callers that are using those callbacks for
 anything more than just calling errcontext().  An example here is that in
 clauses.c's sql_inline_error_callback, this would not only suppress the
 CONTEXT line but also reporting of the error cursor location.


I didn't know it - My idea was, when CONTEXT is not showed, then is useless
to collect data for it.



 What is the actual use-case that prompted this complaint?


I would to use it for controlling (enabling, disabling) CONTEXT in RAISE
statement in plpgsql. I am thinking so one option for this purpose is
enough, and I would not to add other option to specify LOG, CLIENT.

Regards

Pavel



 regards, tom lane



Re: [HACKERS] PATCH: remove nclients/nthreads constraint from pgbench

2015-07-02 Thread Fabien COELHO


This doesn't behave correctly if you set -j to greater than -c, and also use 
the rate limit option:


Indeed. The attached v3 fixes the case when nthreads > nclients.

--
Fabien.
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index a808546..2517a3a 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -326,8 +326,7 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
<para>
 Number of worker threads within <application>pgbench</application>.
 Using more than one thread can be helpful on multi-CPU machines.
-The number of clients must be a multiple of the number of threads,
-since each thread is given the same number of client sessions to manage.
+Clients are distributed as evenly as possible among available threads.
 Default is 1.
</para>
   </listitem>
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 2c3e365..8dfc2fb 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2792,6 +2792,7 @@ main(int argc, char **argv)
 	int			c;
 	int			nclients = 1;	/* default number of simulated clients */
 	int			nthreads = 1;	/* default number of threads */
+	int			nclients_dealt;	/* clients distributed between threads */
 	int			is_init_mode = 0;		/* initialize mode? */
 	int			is_no_vacuum = 0;		/* no vacuum at all before testing? */
 	int			do_vacuum_accounts = 0; /* do vacuum accounts before testing? */
@@ -3114,6 +3115,10 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* Adjust threads to clients */
+	if (nthreads > nclients)
+		nthreads = nclients;
+
 	/* compute a per thread delay */
 	throttle_delay *= nthreads;
 
@@ -3153,12 +3158,6 @@ main(int argc, char **argv)
 	if (nxacts <= 0 && duration <= 0)
 		nxacts = DEFAULT_NXACTS;
 
-	if (nclients % nthreads != 0)
-	{
-		fprintf(stderr, "number of clients (%d) must be a multiple of number of threads (%d)\n", nclients, nthreads);
-		exit(1);
-	}
-
 	/* --sampling-rate may be used only with -l */
 	if (sample_rate > 0.0 && !use_log)
 	{
@@ -3359,19 +3358,24 @@ main(int argc, char **argv)
 
 	/* set up thread data structures */
 	threads = (TState *) pg_malloc(sizeof(TState) * nthreads);
+	nclients_dealt = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 		thread->tid = i;
-		thread->state = &state[nclients / nthreads * i];
-		thread->nstate = nclients / nthreads;
+		thread->state = &state[nclients_dealt];
+		thread->nstate =
+			(nclients - nclients_dealt + nthreads - i - 1) / (nthreads - i);
 		thread->random_state[0] = random();
 		thread->random_state[1] = random();
 		thread->random_state[2] = random();
 		thread->throttle_latency_skipped = 0;
 		thread->latency_late = 0;
 
+		nclients_dealt += thread->nstate;
+
 		if (is_latencies)
 		{
 			/* Reserve memory for the thread to store per-command latencies */
@@ -3395,6 +3399,9 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* all clients must be assigned to a thread */
+	Assert(nclients_dealt == nclients);
+
 	/* get start up time */
 	INSTR_TIME_SET_CURRENT(start_time);
 



Re: [HACKERS] WAL logging problem in 9.4.3?

2015-07-02 Thread Martijn van Oosterhout
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
 Hi,
 
 On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
  === Start with an empty database
 
 My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level =
minimal.

  === Note the index file is 8KB.
  === At this point nuke the database server (in this case it was simply 
  === destroying the container it was running in.
 
 How did you continue from there? The container has persistent storage?
 Or are you reapplying the WAL somewhere else?

The container has persistent storage on the host. What I think is
actually unusual is that the script that started postgres was missing
an 'exec', so postgres never got the signal to shut down.

  martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p 
  /data/postgres/pg_xlog/ 00010001 |grep -wE 
  '16389|16387|16393'
  rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
  0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; 
  tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; 
  oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; 
  shutdown
  rmgr: Storage len (rec/tot): 16/48, tx:  0, lsn: 
  0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
  rmgr: Sequencelen (rec/tot):158/   190, tx:686, lsn: 
  0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
  rmgr: Storage len (rec/tot): 16/48, tx:686, lsn: 
  0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
  rmgr: Storage len (rec/tot): 16/48, tx:686, lsn: 
  0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
  rmgr: Sequencelen (rec/tot):158/   190, tx:686, lsn: 
  0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
  rmgr: Storage len (rec/tot): 16/48, tx:686, lsn: 
  0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: 
  base/16385/16389 to 0 blocks
  rmgr: Storage len (rec/tot): 16/48, tx:686, lsn: 
  0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: 
  base/16385/16393 to 0 blocks
  pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero 
  length at 0/16BE740
 
 Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and
after the truncate.
 
  ctmp=# select * from test;
  ERROR:  could not read block 0 in file base/16385/16393: read only 0 of 
  8192 bytes
 
 Hm. I can't reproduce this. Can you include a bit more details about how
 to reproduce?

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] multivariate statistics / patch v7

2015-07-02 Thread Kyotaro HORIGUCHI
Hello, I started to work on this patch.

 attached is v7 of the multivariate stats patch. The main improvement
 is major refactoring of the clausesel.c portion - splitting the
 awfully long spaghetti-style functions into smaller pieces, making it
 much more understandable etc.

Thank you, it looks clearer. I have some comments from a brief
look at this. This patchset is relatively large, so I will comment
on a per-notice basis.. which means I'll send comments before
examining the entirety of this patchset. Sorry in advance for the
desultory comments.

===
General comments:

- You included unnecessary files such as regression.diffs in
  these patches.

- Now OID 3307 is used by pg_stat_file. I moved
  pg_mv_stats_dependencies_info/show to 3311/3312.

- Single-variate stats have a mechanism to inject arbitrary
  values as statistics, that is, get_relation_stats_hook and
  similar stuff. I want a similar mechanism for multivariate
  statistics, too.

0001:

- I also don't think it is the right thing for expression_tree_walker
  to recognize RestrictInfo, since it is not a part of an expression.

0003:

- In clauselist_selectivity, find_stats is uselessly called for a
  single clause. It should be called only after the clause list is
  found to consist of more than one clause.

- The search for vars to be compared with mv-stat columns, which
  find_stats does, should stop at disjunctions. But this patch
  doesn't behave that way, which should be unwanted behavior. The
  following steps show that.


 =# CREATE TABLE t1 (a int, b int, c int);
 =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, ) a);
 =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
  Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
 =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
 =# ANALYZE t1;
 =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
  Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)

 Rows changed unwantedly.

 It seems not so simple a thing as your code assumes.

 I do assume some of those pieces are unnecessary because there already
 is a helper function with the same purpose (that I'm just not aware
 of). But IMHO this piece of code begins to look reasonable
 (especially when compared to the previous state).

Yeah, such work should be done later :p  This patch is
not so invasive as to make that impossible.

 The other major improvement is review of the comments (including
 FIXMEs and TODOs), and removal of the obsolete / misplaced ones. And
 there was plenty of those ...
 
 These changes made this version ~20k smaller than v6.
 
 The patch also rebases to current master, which I assume shall be
 quite stable - so hopefully no more duplicate OIDs for a while.
 
 There are 6 files attached, but only 0002-0006 are actually part of
 the multivariate statistics patch itself. The first part makes it
 possible to use pull_varnos() with expression trees containing
 RestrictInfo nodes, but maybe this is not the right way to fix this
 (there's another thread where this was discussed).

As mentioned above, checking whether mv stats can be applied is a
more complex matter than you are currently assuming. I will also
consider that.

 Also, the regression tests testing plan choice with multivariate stats
 (e.g. that a bitmap index scan is chosen instead of index scan) fail
 from time to time. I suppose this happens because the invalidation
 after ANALYZE is not processed before executing the query, so the
 optimizer does not see the stats, or something like that.

I have seen that happen, but so far I have no idea how it occurs.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Beena Emerson
Josh Berkus wrote:
 
 Say you take this case:
 
 2 : { local_replica, london_server, nyc_server }
 
 ... which should ensure that any data which is replicated is replicated
 to at least two places, so that even if you lose the entire local
 datacenter, you have the data on at least one remote data center.

 EXCEPT: say you lose both the local datacenter and communication with
 the london server at the same time (due to transatlantic cable issues, a
 huge DDOS, or whatever).  You'd like to promote the NYC server to be the
 new master, but only if it was in sync at the time its communication
 with the original master was lost ... except that you have no way of
 knowing that.

Please consider the following:

If we have multiple replicas in each DC, we can use the following:

3(local1, 1(london1, london2), 1(nyc1, nyc2))

In this case at least one standby from each DC is a sync standby. When
the local and London data centers are lost, NYC promotion can be done
by comparing the LSNs.

Quorum would also ensure that even if one of the standbys in a data
center goes down, another can take over, preventing data loss.

In the case 3(local1, london1, nyc1)

If nyc1 is down, transactions would wait indefinitely. This can be
avoided.
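The nested quorum syntax above could be evaluated along these lines. This is a hypothetical sketch with invented names, not the proposed patch: a group is satisfied when at least `need` of its children are satisfied, and a leaf is satisfied when that standby has acknowledged the commit.

```c
/* Sketch of evaluating a nested quorum spec such as
 * 3(local1, 1(london1, london2), 1(nyc1, nyc2)).
 * All names are illustrative, not from any patch. */
typedef struct QuorumNode
{
    int need;                      /* required children; 0 for a leaf */
    int acked;                     /* leaf only: 1 if standby has acked */
    struct QuorumNode *children[8];
    int nchildren;
} QuorumNode;

/* Returns 1 if this node's quorum requirement is met. */
static int
quorum_satisfied(const QuorumNode *n)
{
    if (n->nchildren == 0)
        return n->acked;           /* leaf standby */

    int ok = 0;
    for (int i = 0; i < n->nchildren; i++)
        ok += quorum_satisfied(n->children[i]);
    return ok >= n->need;
}
```

With local1, london1 and nyc1 acked, the spec above is satisfied; if both NYC standbys are down, it is not, which matches Beena's argument that a per-DC subgroup avoids waiting on a single fixed standby.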

--
Beena Emerson




[HACKERS] PostgreSQL 9.5 Alpha 1 build fail with perl 5.22

2015-07-02 Thread Marco Atzeri

On 7/2/2015 5:16 PM, Dave Page wrote:

The PostgreSQL Global Development Group announces that the alpha
release of PostgreSQL 9.5, the latest version of the world's leading
open source database, is available today.  This release contains
previews of all of the features which will be available in the final
release of version 9.5, although some details will change before then.
Please download, test, and report what you find.

Help Test for Bugs
--




Building on Cygwin with

$ perl --version

This is perl 5, version 22, subversion 0 (v5.22.0) built for 
cygwin-thread-multi-64int


the build fails here; is anyone seeing the same on other platforms?

make -C hstore_plperl all
make[1]: Entering directory '/cygdrive/e/cyg_pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/build/contrib/hstore_plperl'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
  -Wendif-labels -Wmissing-format-attribute -Wformat-security
  -fno-strict-aliasing -fwrapv -fexcess-precision=standard -ggdb -O2 -pipe
  -Wimplicit-function-declaration
  -I/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/src/pl/plperl
  -I/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore
  -I.
  -I/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl
  -I../../src/include
  -I/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/src/include
  -I/usr/lib/perl5/5.22/i686-cygwin-threads-64int/CORE
  -c -o hstore_plperl.o
  /pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl/hstore_plperl.c
In file included from /pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl/hstore_plperl.c:6:0:
/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore/hstore.h:79:0: warning: "HS_KEY" redefined

 #define HS_KEY(arr_,str_,i_) ((str_) + HSE_OFF((arr_)[2*(i_)]))
 ^
In file included from /usr/lib/perl5/5.22/i686-cygwin-threads-64int/CORE/perl.h:3730:0,
                 from /pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/src/pl/plperl/plperl.h:48,
                 from /pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl/hstore_plperl.c:4:
/usr/lib/perl5/5.22/i686-cygwin-threads-64int/CORE/util.h:221:0: note: this is the location of the previous definition

 #  define HS_KEY(setxsubfn, popmark, apiver, xsver) \
 ^
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
  -Wendif-labels -Wmissing-format-attribute -Wformat-security
  -fno-strict-aliasing -fwrapv -fexcess-precision=standard -ggdb -O2 -pipe
  -Wimplicit-function-declaration
  -shared -o hstore_plperl.dll -Wl,--out-implib=libhstore_plperl.a
  hstore_plperl.o -L../../src/port -L../../src/common -Wl,-no-undefined
  -Wl,--allow-multiple-definition -Wl,--enable-auto-import -L/usr/local/lib
  -Wl,--as-needed -L../../src/backend -lpostgres -lpgcommon -lpgport -lintl
  -lssl -lcrypto -lz -lreadline -lcrypt -lldap

hstore_plperl.o: In function `hstore_to_plperl':
/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl/hstore_plperl.c:16: undefined reference to `hstoreUpgrade'
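For context, the HS_KEY warning arises because hstore's hstore.h and perl's util.h both define a macro with that name. A minimal sketch of this kind of collision, and the usual #undef workaround glue code can apply, looks like this (the macro bodies here are invented for illustration, not copied from the real headers):

```c
/* First definition: stands in for perl's HS_KEY in util.h. */
#define HS_KEY(x) ((x) + 1)

/* Without the #undef below, redefining HS_KEY triggers the same
 * "HS_KEY redefined" warning seen in the build log. Glue code that
 * must include both headers typically drops the unwanted definition
 * explicitly before the second include. */
#undef HS_KEY

/* Second definition: stands in for hstore.h's HS_KEY. */
#define HS_KEY(x) ((x) * 2)

/* Uses whichever definition is in effect at this point. */
static int
use_key(int v)
{
    return HS_KEY(v);
}
```

The later `undefined reference to hstoreUpgrade` is a separate linker problem (the hstore object code is not being linked in), not a consequence of the macro collision.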





Re: [HACKERS] Rework the way multixact truncations work

2015-07-02 Thread Robert Haas
On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas robertmh...@gmail.com wrote:
 Will look at 0003 next.

+appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",

I don't think we typically use this style for notating intervals.

 case XLOG_MULTIXACT_CREATE_ID:
 id = CREATE_ID;
 break;
+case XLOG_MULTIXACT_TRUNCATE_ID:
+id = TRUNCATE;
+break;

If XLOG_MULTIXACT_CREATE_ID -> CREATE_ID, then why not
XLOG_MULTIXACT_TRUNCATE_ID -> TRUNCATE_ID?

+ * too old to general truncation records.

s/general/generate/

+MultiXactId oldestMXactDB;

Data type should be OID.

+ * Recompute limits as we are now fully started, we now can correctly
+ * compute how far a members wraparound is away.

s/,/:/ or something.  This isn't particularly good English as written.

+ * Computing the actual limits is only possible once the data directory is
+ * in a consistent state. There's no need to compute the limits while
+ * still replaying WAL as no new multis can be created anyway. So we'll
+ * only do further checks after TrimMultiXact() has been called.

Multis can be and are created during replay.  What this should really
say is that we have no choice about whether to create them or not: we
just have to replay whatever's there.

+(errmsg("performing legacy multixact truncation,
upgrade master")));

This message needs work.  I'm not sure exactly what it should say, but
I'm pretty sure that's not clear enough.

I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().  The checkpoint hasn't really happened at that
point yet; you might truncate away stuff, then crash before the
checkpoint is complete, and then when you restart recovery, you've got
trouble.  I think you should be very, very cautious about rejiggering
the order of operations here.  The current situation is not good, but
casually rejiggering it can make things much worse.

- * If no multixacts exist, then oldestMultiXactId will be the next
- * multixact that will be created, rather than an existing multixact.
+ * Determine the offset of the oldest multixact.  Normally, we can read
+ * the offset from the multixact itself, but there's an important special
+ * case: if there are no multixacts in existence at all, oldestMXact
+ * obviously can't point to one.  It will instead point to the multixact
+ * ID that will be assigned the next time one is needed.

There's no need to change this; it means the same thing either way.

Generally, I think you've weakened the logic in SetOffsetVacuumLimit()
appreciably here.  The existing code is careful never to set
oldestOffsetKnown false when it was previously true.  Your rewrite
removes that property.  Generally, I see no need for this function to
be overhauled to the extent that you have, and would suggest reverting
the changes that aren't absolutely required.

I don't particularly like the fact that find_multixact_start() calls
SimpleLruFlush().  I think that's really a POLA violation: you don't
expect that a function that looks like a simple inquiry is going to go
do a bunch of unrelated I/O in the background.  If somebody called
find_multixact_start() with any frequency, you'd be sad.  You're just
doing it this way because of the context in which you expect
find_multixact_start() to be called, which does not seem very
future-proof.  I prefer Thomas's approach.

If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
away, does it need to wait, or could it ConditionalAcquire and bail
out if the lock isn't obtained?

+ * Make sure to only attempt truncation if there's values to truncate
+ * away. In normal processing values shouldn't go backwards, but there's
+ * some corner cases (due to bugs) where that's possible.

I think this comment should be more detailed.  Is that talking about
the same thing as this comment:

- * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
- * oldestMXact might point to a multixact that does not exist.
- * Autovacuum will eventually advance it to a value that does exist,
- * and we want to set a proper offsetStopLimit when that happens,
- * so call DetermineSafeOldestOffset here even if we're not actually
- * truncating.

This comment seems to be saying there's a race condition:

+ * XXX: It's also possible that the page that oldestMXact is on has
+ * already been truncated away, and we crashed before updating
+ * oldestMXact.

But why is that an XXX?  I think this is just a case of recovery
needing tolerate redo of an action already redone.

I'm not convinced that it's a good idea to remove
lastCheckpointedOldest and replace it with nothing.  It seems like a
very good idea to have two separate pieces of state in shared memory:

- The oldest offset that we think anyone might need to access to make

Re: [HACKERS] Memory leak fixes for pg_dump, pg_dumpall, initdb and pg_upgrade

2015-07-02 Thread Heikki Linnakangas

On 06/08/2015 09:48 AM, Michael Paquier wrote:

Hi all,

Please find attached a set of fixes for a couple of things in src/bin:
- pg_dump/pg_dumpall:
-- getFormattedTypeName, convertTSFunction and myFormatType return
strdup'd results that are never free'd.
-- convertTSFunction returns const char. I fail to see the point of
that... In my opinion we are fine with just returning a char pointer,
which is strdup'd so that it can be freed by the caller.
- initdb's and pg_regress' use getaddrinfo, but do not free the
returned result with freeaddrinfo().
- Coverity noticed on the way some leaked memory in pg_upgrade's
equivalent_locale().

Those issues have been mostly spotted by Coverity, I may have spotted
some of them while looking at similar code paths... In any case that's
Coverity's win ;)


Fixing most of these is not really an improvement, IMO. They're in 
pg_dump and pg_upgrade, which you only run once and then it exits, so as 
long as the leaks are not in some tight loops that execute millions of 
times, it doesn't matter.


I committed some of these that seemed like improvements on readability 
grounds, but please just mark the rest as "ignore" in Coverity.


- Heikki




Re: [HACKERS] Reducing ClogControlLock contention

2015-07-02 Thread Robert Haas
On Wed, Jul 1, 2015 at 6:21 AM, Andres Freund and...@anarazel.de wrote:
 On 2015-07-01 11:19:40 +0100, Simon Riggs wrote:
 What tricks are being used??

 Please explain why taking 2 locks is bad here, yet works fine elsewhere.

 I didn't say anything about 'bad'. It's more complicated than one
 lock. Suddenly you have to care about lock ordering and such. The
 algorithms for ensuring correctness get more complicated.

Taking two locks might also be more expensive than just taking one.  I
suppose benchmarking will reveal whether there is an issue there.
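The lock-ordering concern can be made concrete with a toy sketch (invented names, single-threaded, purely illustrative; nothing to do with the actual ClogControlLock patch): as long as every code path acquires the two locks in one globally agreed order, call sites that name them in opposite orders can never deadlock each other.

```c
/* Toy lock with a numeric id used to define a global acquisition
 * order. In a real system lock_one would block; here it just records
 * the order in which locks were taken, so we can observe the rule. */
typedef struct
{
    int held;   /* 1 while "held" */
    int id;     /* position in the global lock order */
} ToyLock;

static int order_log[4];
static int order_n = 0;

static void
lock_one(ToyLock *l)
{
    l->held = 1;
    order_log[order_n++] = l->id;
}

static void
unlock_one(ToyLock *l)
{
    l->held = 0;
}

/* The discipline: always take the lower-id lock first, regardless of
 * the order the caller passed them in. */
static void
acquire_both(ToyLock *x, ToyLock *y)
{
    if (x->id > y->id)
    {
        ToyLock *t = x;
        x = y;
        y = t;
    }
    lock_one(x);
    lock_one(y);
}
```

Whether the caller writes acquire_both(&a, &b) or acquire_both(&b, &a), the recorded acquisition order is identical, which is exactly the property that rules out the classic two-lock deadlock.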

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Configurable location for extension .control files

2015-07-02 Thread Heikki Linnakangas

On 03/04/2015 09:41 AM, Oskari Saarenmaa wrote:

18.02.2015, 01:49, Jim Nasby wrote:

On 2/17/15 4:39 PM, Oskari Saarenmaa wrote:

10.06.2013, 17:51, Dimitri Fontaine wrote:

Andres Freund and...@2ndquadrant.com writes:

In any case, no packager is going to ship an insecure-by-default
configuration, which is what Dimitri seems to be fantasizing would
happen.  It would have to be local option to relax the permissions
on the directory, no matter where it is.


*I* don't want that at all. All I'd like to have is a postgresql.conf
   option specifying additional locations.


Same from me. I think I would even take non-plural location.


Here's a patch to allow overriding extension installation directory.
The patch allows superusers to change it at runtime, but we could also
restrict it to postgresql.conf if needed.  I don't really see a point in
restricting it (or not implementing the option at all) as the superuser
can already load custom C code and sql from anywhere in the filesystem;
not having configurable extension directory just makes it harder to test
extensions and to create private clusters of PG data in a custom
location without using custom binaries.


I'm interested in this because it could potentially make it possible to
install SQL extensions without OS access. (My understanding is there's
some issue with writing the extension files even if LIBDIR is writable
by the OS account).


I'm not sure this patch makes extensions without OS access any easier to
implement; you still need to write the files somewhere, and someone
needs to set up the cluster with a nonstandard extension path.


Hmm. I think you're on to something with this patch, but I couldn't 
exactly figure out what. What is the purpose of this patch?


I can see a few things that you might want to do:

1. You might want to create a cluster using existing binaries, and set 
it up so that you can install extra extensions from a custom directory. 
Handy if you don't have write access to /usr, or you only want to make 
an extension available in one cluster but not others.


2. Like 1, but you want to replace the set of extensions altogether.

3. Writing an automated regression test for a custom extension.

With all of those, you have the problem that you actually want to 
override both the lib-dir and the extensions-dir. So you'll need to set 
dynamic_library_path too. For 3, fiddling with the configuration file is 
a bit tedious. And PGXS doesn't currently have support for setting up a 
test cluster anyway.


Oh, and why are we only worried about extensions? There's other stuff in 
'share'-directory that you might also want to override in some clusters 
or as part of regression tests: timezone and tsearch data.


Note that you can make Postgres search for all of those things in a 
different directory by copying the postgres binary elsewhere. It's a 
bit hacky, but it works.

- Heikki





Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2015-07-02 Thread Peter Geoghegan
On Thu, Jul 2, 2015 at 1:30 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 Sure, if a speculative inserter detects a conflict, it still has to
 wait. But not in the aminsert call, and not until it cleans up its
 physical insertion (by super-deleting). Clearly a
 CHECK_UNIQUE_SPECULATIVE would have to be much closer to
 CHECK_UNIQUE_PARTIAL than to CHECK_UNIQUE_YES.


 Why is it not OK for aminsert to do the waiting, with
 CHECK_UNIQUE_SPECULATIVE?

Well, waiting means getting a ShareLock on the other session's XID.
You can't do that without first releasing your locks, unless you're
okay with unprincipled deadlocks.

-- 
Peter Geoghegan




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Josh Berkus
On 07/01/2015 11:12 PM, Fujii Masao wrote:
 I don't think this is good approach because there can be the case where
 you need to promote even the standby server not having sync flag.
 Please imagine the case where you have sync and async standby servers.
 When the master goes down, the async standby might be ahead of the
 sync one. This is possible in practice. In this case, it might be better to
 promote the async standby instead of sync one. Because the remaining
 sync standby which is behind can easily follow up with new master.

If we're always going to be polling the replicas for furthest ahead,
then why bother implementing quorum synch at all? That's the basic
question I'm asking.  What does it buy us that we don't already have?

I'm serious, here.  Without any additional information on synch state at
failure time, I would never use quorum synch.  If there's someone on
this thread who *would*, let's speak to their use case and then we can
actually get the feature right.  Anyone?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-07-02 Thread Andres Freund
On 2015-07-02 11:46:25 -0400, Robert Haas wrote:
 In the case of static inline, the main problem with the status quo
 AFAICS is that you have to do a little dance any time you'd otherwise
 use a static inline function (for those following along at home, git
 grep INCLUDE_DEFINITIONS src/include to see how this is done).
 Now,
 it is obviously the case that the first time somebody has to do this
 dance, they will be somewhat inconvenienced, but once you know how to
 do it, it's not incredibly complicated.

I obviously know the scheme (having introduced it in pg), but I think
the costs are actually rather significant. It makes development more
complex, it makes things considerably harder to follow, and it's quite
annoying to test (manually redefine PG_USE_INLINE, recompile whole
tree).

Several times people have also argued against using inlines with that
trick because it slows down older platforms.

 So, just to play the devil's advocate here, who really cares?

I consider our time one of the most scarce resources around.

 I'd consider an argument for adopting one of these features to be much
 stronger if were accompanied by arguments like:
 
 - If we require feature X, we can reduce the size of the generated
 code and improve performance.

Converting some of the bigger macros (I tested fastgetattr) to inlines
actually does both.

See also 
http://archives.postgresql.org/message-id/4407.1435763473%40sss.pgh.pa.us

Now, all that is possible with the INCLUDE_DEFINITIONS stuff, but it
makes it considerably more expensive time/complexity wise.


 If everybody else here says working around the lack of feature X is
 too much of a pain in the butt and we want to adopt it purely to make
 development easier, I'm not going to sit here and try to fight the
 tide.  But I personally don't find such arguments all that exciting.

I find that an odd attitude.

 I think // comments are a not something we should adopt, because one
 style is better than two

I don't particularly care about // comments either. They do have the
advantage of allowing three more characters of actual content in one
line comments ...

Greetings,

Andres Freund




Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2015-07-02 Thread Peter Eisentraut
On 5/30/15 10:14 PM, Andreas Karlsson wrote:
 I have written a patch which makes it possible to change SSL
 certificates (and other SSL parameters, including the CRL) without
 restarting PostgreSQL. In fact this patch also makes it possible to turn
 on or off ssl entirely without restart. It does so by initializing a new
 SSL context when the postmaster receives a SIGHUP, and if the
 initialization succeeded the old context is replaced by the new.

I think this would be a useful feature, and the implementation looks
sound.  But I don't like how the reload is organized.  Reinitializing
the context in the sighup handler, aside from questions about how much
work one should do in a signal handler, would cause SSL reinitialization
for unrelated reloads.  We have the GUC assign hook mechanism for
handling this sort of thing.  The trick would be that when multiple
SSL-related settings change, you only want to do one reinitialization.
You could either have the different assign hooks communicate with each
other somehow, or have them set a "need SSL init" flag that is checked
somewhere else.
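A minimal sketch of that flag-based approach (hypothetical names throughout; this is not the actual GUC assign-hook machinery, just the shape of the idea): each SSL-related hook only records that reinitialization is needed, and one place after the whole reload performs a single reinit, no matter how many SSL settings changed.

```c
#include <string.h>

/* Invented stand-ins for GUC state and the SSL context rebuild. */
static int  ssl_reinit_needed = 0;
static int  reinit_count = 0;        /* counts context rebuilds */
static char ssl_cert_file[64];
static char ssl_key_file[64];

/* Assign hooks: record the new value and set the flag; crucially,
 * they do NOT reinitialize the SSL context themselves. */
static void
assign_ssl_cert_file(const char *newval)
{
    strncpy(ssl_cert_file, newval, sizeof(ssl_cert_file) - 1);
    ssl_reinit_needed = 1;
}

static void
assign_ssl_key_file(const char *newval)
{
    strncpy(ssl_key_file, newval, sizeof(ssl_key_file) - 1);
    ssl_reinit_needed = 1;
}

/* Called once after all GUCs for a SIGHUP reload are processed:
 * performs at most one reinitialization per reload. */
static void
maybe_reinit_ssl(void)
{
    if (ssl_reinit_needed)
    {
        reinit_count++;              /* stands in for rebuilding the context */
        ssl_reinit_needed = 0;
    }
}
```

Changing both the certificate and the key in one reload still triggers exactly one rebuild, which is the deduplication Peter is asking for.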

 There was some previous discussion[1] on the mailing list about what the
 proper context should be for the SSL parameters, but as far as I can
 tell the discussion never reached a conclusion. I have changed the SSL
 GUCs to PGC_SIGHUP since I felt that was the closest to the truth, but it
 is not a perfect fit (the backends won't reload the SSL context). Should
 we add a new context for the SSL GUCs?

I think PGC_SIGHUP is fine for this.





Re: [HACKERS] Raising our compiler requirements for 9.6

2015-07-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So far this thread is all about the costs of desupporting compilers
 that don't have these features, and you're making a good argument
 (that I think we all agree with) that the cost is small.  But you
 haven't really said what the benefits are.

I made the same remark with respect to varargs macros, and I continue
to think that the cost-benefit ratio there is pretty marginal.

However, I do think that there's a case to be made for adopting static
inline.  The INCLUDE_DEFINITIONS dance is very inconvenient, so it
discourages people from using static inlines over macros, even in cases
where the macro approach is pretty evil (usually, because of multiple-
evaluation hazards).  If we allowed static inlines then we could work
towards having a safer coding standard along the lines of "thou shalt
not write macros that evaluate their arguments more than once".
So the benefits here are easier development and less risk of bugs.
On the other side of the ledger, the costs are pretty minimal; we will
not actually break any platforms, at worst we'll make them unpleasant
to develop on because of lots of warning messages.  We have some platforms
like that already, and it's not been a huge problem.
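A small self-contained example of the multiple-evaluation hazard Tom mentions (not from the PostgreSQL tree): a macro that mentions its argument twice evaluates a side-effecting argument twice, while the static inline evaluates it exactly once.

```c
/* Macro version: the argument appears twice, so it is evaluated
 * twice -- the classic hazard with side-effecting arguments. */
#define SQUARE_MACRO(x) ((x) * (x))

/* Inline version: the argument is evaluated exactly once. */
static inline int
square_inline(int x)
{
    return x * x;
}

/* Side-effecting helpers to make the evaluation counts observable. */
static int macro_calls = 0;
static int inline_calls = 0;

static int
next_macro(void)
{
    return ++macro_calls;
}

static int
next_inline(void)
{
    return ++inline_calls;
}
```

SQUARE_MACRO(next_macro()) calls next_macro twice and multiplies two *different* values (1 and 2), whereas square_inline(next_inline()) calls its helper once and squares a single value, which is why replacing such macros with inlines removes a whole class of bugs.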

regards, tom lane




Re: [HACKERS] Freeze avoidance of very large table.

2015-07-02 Thread Simon Riggs
On 2 July 2015 at 16:30, Sawada Masahiko sawada.m...@gmail.com wrote:


 Also, the flags of each heap page header might be set PD_ALL_FROZEN,
 as well as all-visible


Is it possible to have VM bits set to frozen but not visible?

The description makes those two states sound independent of each other.

Are they? Or not? Do we test for an impossible state?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Fix autoconf deprecation warnings

2015-07-02 Thread Heikki Linnakangas

On 05/31/2015 05:22 AM, Andreas Karlsson wrote:

I have attached new versions which apply on the current master.


Thanks, applied.


--- a/configure.in
+++ b/configure.in
@@ -1473,7 +1473,7 @@ if test x$pgac_cv_func_sigsetjmp = xyes; then
   AC_DEFINE(HAVE_SIGSETJMP, 1, [Define to 1 if you have sigsetjmp().])
 fi

-AC_DECL_SYS_SIGLIST
+AC_CHECK_DECLS([sys_siglist])

 AC_CHECK_FUNC(syslog,
   [AC_CHECK_HEADER(syslog.h,


Hmm, according to the autoconf manual:


Macro: AC_DECL_SYS_SIGLIST

Same as:

  AC_CHECK_DECLS([sys_siglist], [], [],
  [#include <signal.h>
  /* NetBSD declares sys_siglist in <unistd.h>.  */
  #ifdef HAVE_UNISTD_H
  # include <unistd.h>
  #endif
  ])

See AC_CHECK_DECLS.


So I replaced AC_DECL_SYS_SIGLIST with that full snippet instead.

- Heikki





Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-07-02 Thread Simon Riggs
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Looks functionally complete

Need a test to show that ALTER TABLE works on views, as discussed on this 
thread. And confirmation that pg_dump is not broken by this.

Message-ID: 20140321205828.gb3969...@tornado.leadboat.com

Needs documentation

The new status of this patch is: Waiting on Author




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-02 Thread Josh Berkus
On 07/02/2015 11:31 AM, Andres Freund wrote:
 On 2015-07-02 11:10:27 -0700, Josh Berkus wrote:
 If we're always going to be polling the replicas for furthest ahead,
 then why bother implementing quorum synch at all? That's the basic
 question I'm asking.  What does it buy us that we don't already have?
 
 What do those topic have to do with each other? A standby fundamentally
 can be further ahead than what the primary knows about. So you can't do
 very much with that knowledge on the master anyway?
 
 I'm serious, here.  Without any additional information on synch state at
 failure time, I would never use quorum synch.  If there's someone on
 this thread who *would*, let's speak to their use case and then we can
 actually get the feature right.  Anyone?
 
 How would you otherwise ensure that your data is both on a second server
 in the same DC and in another DC? Which is a pretty darn common desire?

So there's two parts to this:

1. I need to ensure that data is replicated to X places.

2. I need to *know* which places data was synchronously replicated to
when the master goes down.

My entire point is that (1) alone is useless unless you also have (2).
And do note that I'm talking about information on the replica, not on
the master, since in any failure situation we don't have the old master
around to check.

Say you take this case:

2 : { local_replica, london_server, nyc_server }

... which should ensure that any data which is replicated is replicated
to at least two places, so that even if you lose the entire local
datacenter, you have the data on at least one remote data center.

EXCEPT: say you lose both the local datacenter and communication with
the london server at the same time (due to transatlantic cable issues, a
huge DDOS, or whatever).  You'd like to promote the NYC server to be the
new master, but only if it was in sync at the time its communication
with the original master was lost ... except that you have no way of
knowing that.

Given that, we haven't really reduced our data loss potential or
improved availability from the current 1-redundant synch rep.  We still
need to wait to get the London server back to figure out if we want to
promote or not.

Now, this configuration would reduce the data loss window:

3 : { local_replica, london_server, nyc_server }

As would this one:

2 : { local_replica, nyc_server }

... because we would know definitively which servers were in sync.  So
maybe that's the use case we should be supporting?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



