Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-08 Thread Amit Kapila
On Mon, Sep 8, 2014 at 10:39 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Sat, Jul 26, 2014 at 9:32 PM, Robert Haas robertmh...@gmail.com
wrote:
  On Fri, Jul 25, 2014 at 4:16 PM, Alvaro Herrera
  alvhe...@2ndquadrant.com wrote:
   On Fri, Jul 25, 2014 at 02:11:32PM -0400, Robert Haas wrote:
   + pq_mq_busy = true;
   +
   + iov[0].data = &msgtype;
   + iov[0].len = 1;
   + iov[1].data = s;
   + iov[1].len = len;
   +
   + Assert(pq_mq_handle != NULL);
   + result = shm_mq_sendv(pq_mq_handle, iov, 2, false);
   +
   + pq_mq_busy = false;
  
   Don't you need a PG_TRY block here to reset pq_mq_busy?
 
  No.  If shm_mq_sendv is interrupted, we can't use the shm_mq any more.
  But since that should only happen if an interrupt arrives while the
  queue is full, I think that's OK.

 I think that not only an interrupt, but any other error in the
 shm_mq_sendv() path (one example is WaitLatch()) could lead to the
 same behaviour.

  (Think about the alternatives: if
  the queue is full, we have no way of notifying the launching process
  without waiting for it to retrieve the results, but it might not do
  that right away, and if we've been killed we need to die *now* not
  later.)
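
For reference, the PG_TRY-based alternative being discussed would look roughly
like the sketch below. This is only an illustration of that alternative, not the
patch's code; as Robert notes, after a failed shm_mq_sendv() the queue is not
reusable anyway.

/*
 * Illustrative sketch only: reset pq_mq_busy even when shm_mq_sendv()
 * errors out (variables are the same ones as in the excerpt above).
 */
pq_mq_busy = true;
PG_TRY();
{
	iov[0].data = &msgtype;
	iov[0].len = 1;
	iov[1].data = s;
	iov[1].len = len;

	Assert(pq_mq_handle != NULL);
	result = shm_mq_sendv(pq_mq_handle, iov, 2, false);
}
PG_CATCH();
{
	pq_mq_busy = false;
	PG_RE_THROW();
}
PG_END_TRY();
pq_mq_busy = false;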

 So in such cases, what is the advice to users? Currently they will
 see the below message:
 postgres=# select * from pg_background_result(5124) as (x int);
 ERROR:  lost connection to worker process with PID 5124

 One way is to ask them to check the logs, but what if they want
 to handle the error and take some action?

 Another point about error handling: to execute the SQL in
 pg_background_worker_main(), it starts a transaction which, I think,
 doesn't get aborted if an error occurs

For this I was just referring to the error handling code of
StartBackgroundWorker(); however, during shutdown of the process it
will call AbortOutOfAnyTransaction(), which will take care of aborting
the transaction.

 and similarly, handling for the timeout seems to be missing in the
 error path.

As we are going to exit on error anyway, I'm not sure this will be
required; however, having it for a clean exit seems better.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors

2014-09-08 Thread Pavel Stehule
2014-09-08 6:27 GMT+02:00 Stephen Frost sfr...@snowman.net:

 * Pavel Stehule (pavel.steh...@gmail.com) wrote:
  ignore_nulls in array_to_json_pretty probably is not necessary. On the other
  hand, the cost is zero, and we can have it for API consistency.

 I'm willing to be persuaded either way on this, really.  I do think it
 would be nice for both array_to_json and row_to_json to be single
 functions which take default values, but as for if array_to_json has a
 ignore_nulls option, I'm on the fence and would defer to people who use
 that function regularly (I don't today).

 Beyond that, I'm pretty happy moving forward with this patch.


ok

Regards

Pavel



 Thanks,

 Stephen



Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Heikki Linnakangas

On 09/07/2014 05:11 PM, Костя Кузнецов wrote:

Hello.
I reworked vacuum for the GiST index.
All tests pass.
I also tested vacuum on a table with 2 million rows; all is OK.
On my machine the old vacuum takes about 9 seconds; this version takes about 6-7 seconds.
Please review.


If I'm reading this correctly, the patch changes gistbulkdelete to scan 
the index in physical order, while the old code starts from the root and 
scans the index from left to right, in logical order.


Scanning the index in physical order is wrong, if any index pages are 
split while vacuum runs. A page split could move some tuples to a 
lower-numbered page, so that the vacuum will not scan those tuples.


In the b-tree code, we solved that problem back in 2006, so it can be 
done but requires a bit more code. In b-tree, we solved it with a 
vacuum cycle ID number that's set on the page halves when a page is 
split. That allows VACUUM to identify pages that have been split 
concurrently after it has already seen them, and jump back to vacuum them too. See commit 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. 
It should be possible to do something similar in GiST, and in fact you 
might be able to reuse the NSN field that's already set on the page 
halves on split, instead of adding a new vacuum cycle ID.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-09-08 Thread Albe Laurenz
I wrote:
 I gave it a spin and could not find any undesirable behaviour, and the
 output of EXPLAIN ANALYZE looks like I'd expect.
 
 I noticed that you use the list length of fdw_private to check if
 the UPDATE or DELETE is pushed down to the remote server or not.
 
 While this works fine, I wonder if it wouldn't be better to have some
 explicit flag in fdw_private for that purpose.  Future modifications that
 change the list length might easily overlook that it is used for this
 purpose, thereby breaking the code.
 
 Other than that it looks alright to me.

Maybe I should have mentioned that I have set the patch to "Waiting for Author"
because I'd like to hear your opinion on that, but I'm prepared to set it
to "Ready for Committer" soon.
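
Just to illustrate the kind of explicit flag being suggested (all names below
are invented for this sketch, not taken from the actual patch), the planner side
could stash a boolean marker in fdw_private and the executor side could read it
back, instead of inferring it from the list length:

/* Sketch only: hypothetical explicit "pushed down" marker in fdw_private. */
fdw_private = list_make3(makeString(sql.data),
						 makeInteger(can_push_down ? 1 : 0),	/* explicit flag */
						 retrieved_attrs);

/* ... later, at execution time ... */
bool		pushed_down = intVal(list_nth(fdw_private, 1)) != 0;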

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Alexander Korotkov
On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 09/07/2014 05:11 PM, Костя Кузнецов wrote:

 Hello.
 I reworked vacuum for the GiST index.
 All tests pass.
 I also tested vacuum on a table with 2 million rows; all is OK.
 On my machine the old vacuum takes about 9 seconds; this version takes about
 6-7 seconds.
 Please review.


 If I'm reading this correctly, the patch changes gistbulkdelete to scan
 the index in physical order, while the old code starts from the root and
 scans the index from left to right, in logical order.

 Scanning the index in physical order is wrong, if any index pages are
 split while vacuum runs. A page split could move some tuples to a
 lower-numbered page, so that the vacuum will not scan those tuples.

 In the b-tree code, we solved that problem back in 2006, so it can be done
 but requires a bit more code. In b-tree, we solved it with a vacuum cycle
 ID number that's set on the page halves when a page is split. That allows
 VACUUM to identify pages that have been split concurrently sees them, and
 jump back to vacuum them too. See commit http://git.postgresql.org/
 gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d9
 7b82555c80. It should be possible to do something similar in GiST, and in
 fact you might be able to reuse the NSN field that's already set on the
 page halves on split, instead of adding a new vacuum cycle ID.


The idea is right. But in fact, does GiST ever recycle any page? It has
the F_DELETED flag, but ISTM this flag is never set. So I think it's possible
that this patch works correctly. However, GiST probably sometimes
leaves a new page unused after a server crash.
Anyway, I'm not a fan of committing the patch in this shape. We need to let
GiST recycle pages first, then implement VACUUM similarly to b-tree.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Alexander Korotkov
On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 On 09/07/2014 05:11 PM, Костя Кузнецов wrote:

 Hello.
 I reworked vacuum for the GiST index.
 All tests pass.
 I also tested vacuum on a table with 2 million rows; all is OK.
 On my machine the old vacuum takes about 9 seconds; this version takes about
 6-7 seconds.
 Please review.


 If I'm reading this correctly, the patch changes gistbulkdelete to scan
 the index in physical order, while the old code starts from the root and
 scans the index from left to right, in logical order.

 Scanning the index in physical order is wrong, if any index pages are
 split while vacuum runs. A page split could move some tuples to a
 lower-numbered page, so that the vacuum will not scan those tuples.

 In the b-tree code, we solved that problem back in 2006, so it can be
 done but requires a bit more code. In b-tree, we solved it with a vacuum
 cycle ID number that's set on the page halves when a page is split. That
 allows VACUUM to identify pages that have been split concurrently sees
 them, and jump back to vacuum them too. See commit
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=
 5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do
 something similar in GiST, and in fact you might be able to reuse the NSN
 field that's already set on the page halves on split, instead of adding a
 new vacuum cycle ID.


 Idea is right. But in fact, does GiST ever recycle any page? It has
 F_DELETED flag, but ISTM this flag is never set. So, I think it's possible
 that this patch is working correctly. However, probably GiST sometimes
 leaves new page unused due to server crash.
 Anyway, I'm not fan of committing patch in this shape. We need to let GiST
 recycle pages first, then implement VACUUM similar to b-tree.


Another note. Assuming we have an NSN which can play the role of a vacuum cycle
ID, can we implement a sequential (with possible jump-back) index scan for
GiST?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Heikki Linnakangas

On 09/08/2014 11:08 AM, Alexander Korotkov wrote:

On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com

wrote:



On 09/07/2014 05:11 PM, Костя Кузнецов wrote:


Hello.
I reworked vacuum for the GiST index.
All tests pass.
I also tested vacuum on a table with 2 million rows; all is OK.
On my machine the old vacuum takes about 9 seconds; this version takes about 6-7
seconds.
Please review.



If I'm reading this correctly, the patch changes gistbulkdelete to scan
the index in physical order, while the old code starts from the root and
scans the index from left to right, in logical order.

Scanning the index in physical order is wrong, if any index pages are
split while vacuum runs. A page split could move some tuples to a
lower-numbered page, so that the vacuum will not scan those tuples.

In the b-tree code, we solved that problem back in 2006, so it can be done
but requires a bit more code. In b-tree, we solved it with a vacuum cycle
ID number that's set on the page halves when a page is split. That allows
VACUUM to identify pages that have been split concurrently sees them, and
jump back to vacuum them too. See commit http://git.postgresql.org/
gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d9
7b82555c80. It should be possible to do something similar in GiST, and in
fact you might be able to reuse the NSN field that's already set on the
page halves on split, instead of adding a new vacuum cycle ID.


Idea is right. But in fact, does GiST ever recycle any page? It has
F_DELETED flag, but ISTM this flag is never set. So, I think it's possible
that this patch is working correctly. However, probably GiST sometimes
leaves new page unused due to server crash.


Hmm. Its correctness depends on the fact that when a page is split, the 
new right page is always put on a higher-numbered page than the old 
page, which hinges on the fact that we never recycle pages.


We used to delete pages in VACUUM FULL, but that got removed when 
old-style VACUUM FULL was changed to just rewrite the whole heap. So if 
you have pg_upgraded from an old index, it's possible that you still 
have some F_DELETED pages in the tree. And then there's the case of 
unused pages after crash that you mentioned. So no, you can't really 
rely on that. We could of course remove the code that checks the FSM 
altogether, and always extend the relation on page split. But of course 
the long-term solution is to allow recycling pages.
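
(For context, the allocation choice in question is roughly the sketch below; the
real gistNewBuffer() also deals with locking, temporary indexes and deleted
pages, so treat this as an outline only, and the function name is made up.)

/* Simplified sketch of GiST page allocation on a split. */
static Buffer
gist_sketch_new_buffer(Relation rel)
{
	BlockNumber blkno = GetFreeIndexPage(rel);	/* try the FSM first */

	if (blkno != InvalidBlockNumber)
		return ReadBuffer(rel, blkno);	/* may be LOWER-numbered than the split page */

	return ReadBuffer(rel, P_NEW);		/* extend: always a higher block number */
}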



Anyway, I'm not fan of committing patch in this shape. We need to let GiST
recycle pages first, then implement VACUUM similar to b-tree.


+1. Although I guess we could implement the b-tree style strategy first, 
and implement page recycling later. It's just a bit hard to test that 
it's correct, when there is no easy way to get deleted pages in the 
index to begin with.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Heikki Linnakangas

On 09/08/2014 11:19 AM, Alexander Korotkov wrote:

On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com
wrote:


On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:


In the b-tree code, we solved that problem back in 2006, so it can be
done but requires a bit more code. In b-tree, we solved it with a vacuum
cycle ID number that's set on the page halves when a page is split. That
allows VACUUM to identify pages that have been split concurrently sees
them, and jump back to vacuum them too. See commit
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=
5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do
something similar in GiST, and in fact you might be able to reuse the NSN
field that's already set on the page halves on split, instead of adding a
new vacuum cycle ID.

...


Another note. Assuming we have NSN which can play the role of vacuum cycle
ID, can we implement sequential (with possible jump back) index scan for
GiST?


Yeah, I think it would work. It's pretty straightforward: the page split 
code already sets the NSN just when we need it. Vacuum needs to 
memorize the current NSN when it begins, and whenever it sees a page 
with a higher NSN (or the FOLLOW_RIGHT flag is set), follow the 
right-link if it points to a lower-numbered page.
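
A rough sketch of that loop (not the patch itself; locking, WAL and the actual
tuple removal are omitted, and vacuum_page() is a made-up helper) might be:

GistNSN		startNSN = GetInsertRecPtr();	/* memorize the NSN at start */
BlockNumber	npages = RelationGetNumberOfBlocks(rel);
BlockNumber	blkno;

for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
{
	BlockNumber cur = blkno;

	for (;;)
	{
		Buffer		buf = ReadBuffer(rel, cur);
		Page		page = BufferGetPage(buf);
		BlockNumber rightlink = GistPageGetOpaque(page)->rightlink;
		bool		jump_back;

		vacuum_page(rel, buf);			/* hypothetical: remove dead tuples */

		/* Split after we started, with the other half behind us? */
		jump_back = (GistFollowRight(page) ||
					 GistPageGetNSN(page) > startNSN) &&
			rightlink != InvalidBlockNumber &&
			rightlink < blkno;

		ReleaseBuffer(buf);

		if (!jump_back)
			break;
		cur = rightlink;				/* revisit the page we would otherwise miss */
	}
}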


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-08 Thread Mitsumasa KONDO
Hi,

Here is the review result.

#1. Patch compatibility
There is a little bit of hunk offset, but it applies to the latest master.

#2. Functionality
No problem.

#3. Documentation
I think the modulo operator explanation should be put last in the documentation,
because the other operators are more frequently used.

#4. Algorithm
You proposed three modulo algorithms:
1. general modulo, 2. floor modulo, and 3. Euclidean modulo.

They calculate different values when the second operand or the remainder is negative.
Calculation examples are here:

1. general modulo (patch1)
5 %  3 = 2
5 % -3 = 1
   -5 %  3 = -1

2. floor modulo (patch2, 3)
   5 %  3  =  2
   5 % -3  = -2
  -5 %  3  =  2

3. euclidian modulo (patch2)
   5 %  3  =  2
   5 % -3  =  4
  -5 %  3  =  2

That's all.

I think that if we want an equal-probability and intuitive random
generator, we should select floor modulo, as you can see in the example above.
It can create contrapositive random numbers; methods 1 and 2 cannot.

I think Euclidean modulo isn't needed by many people. If we add it,
many people will be confused, because they don't know the mathematical
algorithm.

So I like patch 3, which is simple and practical.

If you agree with or reply to my comments, I will mark it ready for committer.

Best Regards,
--
Mitsumasa KONDO


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-09-08 Thread Etsuro Fujita

(2014/09/08 16:18), Albe Laurenz wrote:

I wrote:

I gave it a spin and could not find any undesirable behaviour, and the
output of EXPLAIN ANALYZE looks like I'd expect.


Thank you for the review!


I noticed that you use the list length of fdw_private to check if
the UPDATE or DELETE is pushed down to the remote server or not.

While this works fine, I wonder if it wouldn't be better to have some
explicit flag in fdw_private for that purpose.  Future modifications that
change the list length might easily overlook that it is used for this
purpose, thereby breaking the code.



Other than that it looks alright to me.



Maybe I should have mentioned that I have set the patch to Waiting for Author
because I'd like to hear your opinion on that, but I'm prepared to set it
to Ready for Committer soon.


I agree with you on that point.  So, I've updated the patch to have the 
explicit flag, as you proposed.  Attached is the updated version of the 
patch.  In this version, I've also revised the code and its comments a bit.


Sorry for the delay.

Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
***
*** 188,197  is_foreign_expr(PlannerInfo *root,
  	if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt))
  		return false;
  
- 	/* Expressions examined here should be boolean, ie noncollatable */
- 	Assert(loc_cxt.collation == InvalidOid);
- 	Assert(loc_cxt.state == FDW_COLLATE_NONE);
- 
  	/*
  	 * An expression which includes any mutable functions can't be sent over
  	 * because its result is not stable.  For example, sending now() remote
--- 188,193 
***
*** 927,932  deparseUpdateSql(StringInfo buf, PlannerInfo *root,
--- 923,982 
  }
  
  /*
+  * deparse remote UPDATE statement
+  *
+  * The statement text is appended to buf, and we also create an integer List
+  * of the columns being retrieved by RETURNING (if any), which is returned
+  * to *retrieved_attrs.
+  */
+ void
+ deparseDirectUpdateSql(StringInfo buf, PlannerInfo *root,
+ 	   Index rtindex, Relation rel,
+ 	   List	*remote_conds,
+ 	   List	*targetlist,
+ 	   List *targetAttrs,
+ 	   List *returningList,
+ 	   List **retrieved_attrs)
+ {
+ 	RelOptInfo *baserel = root->simple_rel_array[rtindex];
+ 	List	   *params_list = NIL;
+ 	deparse_expr_cxt context;
+ 	bool		first;
+ 	ListCell   *lc;
+ 
+ 	/* Set up context struct for recursion */
+ 	context.root = root;
+ 	context.foreignrel = baserel;
+ 	context.buf = buf;
+ 	context.params_list = &params_list;
+ 
+ 	appendStringInfoString(buf, "UPDATE ");
+ 	deparseRelation(buf, rel);
+ 	appendStringInfoString(buf, " SET ");
+ 
+ 	first = true;
+ 	foreach(lc, targetAttrs)
+ 	{
+ 		int			attnum = lfirst_int(lc);
+ 		TargetEntry *tle = get_tle_by_resno(targetlist, attnum);
+ 
+ 		if (!first)
+ 			appendStringInfoString(buf, ", ");
+ 		first = false;
+ 
+ 		deparseColumnRef(buf, rtindex, attnum, root);
+ 		appendStringInfo(buf, " = ");
+ 		deparseExpr((Expr *) tle->expr, &context);
+ 	}
+ 	if (remote_conds)
+ 		appendWhereClause(buf, root, baserel, remote_conds,
+ 						  true, &params_list);
+ 
+ 	deparseReturningList(buf, root, rtindex, rel, false,
+ 		 returningList, retrieved_attrs);
+ }
+ 
+ /*
   * deparse remote DELETE statement
   *
   * The statement text is appended to buf, and we also create an integer List
***
*** 949,954  deparseDeleteSql(StringInfo buf, PlannerInfo *root,
--- 999,1031 
  }
  
  /*
+  * deparse remote DELETE statement
+  *
+  * The statement text is appended to buf, and we also create an integer List
+  * of the columns being retrieved by RETURNING (if any), which is returned
+  * to *retrieved_attrs.
+  */
+ void
+ deparseDirectDeleteSql(StringInfo buf, PlannerInfo *root,
+ 	   Index rtindex, Relation rel,
+ 	   List	*remote_conds,
+ 	   List *returningList,
+ 	   List **retrieved_attrs)
+ {
+ 	RelOptInfo *baserel = root->simple_rel_array[rtindex];
+ 	List	   *params_list = NIL;
+ 
+ 	appendStringInfoString(buf, "DELETE FROM ");
+ 	deparseRelation(buf, rel);
+ 	if (remote_conds)
+ 		appendWhereClause(buf, root, baserel, remote_conds,
+ 						  true, &params_list);
+ 
+ 	deparseReturningList(buf, root, rtindex, rel, false,
+ 		 returningList, retrieved_attrs);
+ }
+ 
+ /*
   * Add a RETURNING clause, if needed, to an INSERT/UPDATE/DELETE.
   */
  static void
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 998,1004  INSERT INTO ft2 (c1,c2,c3)
--- 998,1025 
  (3 rows)
  
  INSERT INTO ft2 (c1,c2,c3) VALUES (1104,204,'ddd'), (1105,205,'eee');
+ EXPLAIN (verbose, costs off)
+ UPDATE ft2 SET c2 = c2 + 300, c3 = c3 || '_update3' WHERE c1 % 10 = 3;  -- can be pushed down
+   QUERY PLAN  
+ 

Re: [HACKERS] pg_receivexlog and replication slots

2014-09-08 Thread Michael Paquier
On Wed, Sep 3, 2014 at 11:40 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Aug 31, 2014 at 9:45 AM, Magnus Hagander mag...@hagander.net wrote:
 Do we really want those Asserts? There is not a single Assert in
 bin/pg_basebackup today - as is the case for most things in bin/. We
 typically use regular if statements for things that can happen, and
 just ignore the others I think - since the callers are fairly simple
 to trace.

 I have no opinion on whether we want these particular Assert() calls,
 but I note that using Assert() in front-end code only became possible
 in February of 2013, as a result of commit
 e1d25de35a2b1f809e8f8d7b182ce0af004f3ec9.  So the lack of assertions
 there may not be so much because people thought it was a bad idea as
 that it didn't use to work.  Generally, I favor the use of Assert() in
 front-end code in the same scenarios in which we use it in back-end
 code: for checks that shouldn't burden production builds but are
 useful during development.

Well, that was exactly why they were added in the first place. The assertions
have been placed in some functions to check for incompatible
combinations of argument values when those functions are called. I
don't mind re-adding them if people agree that they make sense. IMO
they do, and they also help the development of future utilities
using the replication protocol in src/bin/pg_basebackup, as much as the
refactoring work done on this thread.
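
For what it's worth, the kind of check in question is tiny; a hypothetical
example (the function name and condition are invented, not taken from the patch)
would be:

#include "postgres_fe.h"	/* supplies Assert() for frontend code since 9.3 */

static void
start_streaming(const char *slot_name, bool use_slot)	/* hypothetical */
{
	/* callers must not ask for a slot without naming one */
	Assert(!use_slot || slot_name != NULL);

	/* ... */
}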
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-08 Thread Andres Freund
On 2014-09-04 14:19:47 +0200, Andres Freund wrote:
 Yes. I plan to push the patch this weekend. Sorry for the delay.

I'm about to push this. Is it ok to first push it to master and only
backpatch once a couple buildfarm cycles haven't complained?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Alexander Korotkov
On Mon, Sep 8, 2014 at 12:51 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 09/08/2014 11:19 AM, Alexander Korotkov wrote:

 On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com
 
 wrote:

  On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

  In the b-tree code, we solved that problem back in 2006, so it can be
 done but requires a bit more code. In b-tree, we solved it with a
 vacuum
 cycle ID number that's set on the page halves when a page is split.
 That
 allows VACUUM to identify pages that have been split concurrently sees
 them, and jump back to vacuum them too. See commit
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=
 5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do
 something similar in GiST, and in fact you might be able to reuse the
 NSN
 field that's already set on the page halves on split, instead of adding
 a
 new vacuum cycle ID.

 ...


 Another note. Assuming we have NSN which can play the role of vacuum
 cycle
 ID, can we implement sequential (with possible jump back) index scan
 for
 GiST?


 Yeah, I think it would work. It's pretty straightforward, the page split
 code already sets the NSN, just when we need it. Vacuum needs to memorize
 the current NSN when it begins, and whenever it sees a page with a higher
 NSN (or the FOLLOW_RIGHT flag is set), follow the right-link if it points
 to lower-numbered page.


I mean full index scan feature for SELECT queries might be implemented as
well as sequential VACUUM.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] postgres_fdw behaves oddly

2014-09-08 Thread Etsuro Fujita
(2014/09/02 18:55), Etsuro Fujita wrote:
 (2014/09/01 20:15), Etsuro Fujita wrote:
 While working on [1], I've found that postgres_fdw behaves oddly:

I've revised the patch a bit further.  Please find attached a patch.

Thanks,

Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
***
*** 252,257  foreign_expr_walker(Node *node,
--- 252,265 
  if (var->varno == glob_cxt->foreignrel->relid &&
  	var->varlevelsup == 0)
  {
+ 	/*
+ 	 * System columns can't be sent to remote.
+ 	 *
+ 	 * XXX: we should probably send ctid to remote.
+ 	 */
+ 	if (var->varattno < 0)
+ 		return false;
+ 
  	/* Var belongs to foreign table */
  	collation = var->varcollid;
  	state = OidIsValid(collation) ? FDW_COLLATE_SAFE : FDW_COLLATE_NONE;
***
*** 1261,1266  deparseVar(Var *node, deparse_expr_cxt *context)
--- 1269,1279 
  	if (node->varno == context->foreignrel->relid &&
  		node->varlevelsup == 0)
  	{
+ 		/*
+ 		 * System columns shouldn't be examined here.
+ 		 */
+ 		Assert(node->varattno >= 0);
+ 
  		/* Var belongs to foreign table */
  		deparseColumnRef(buf, node->varno, node->varattno, context->root);
  	}
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***
*** 20,25 
--- 20,26 
  #include <math.h>
  
  #include "access/skey.h"
+ #include "access/sysattr.h"
  #include "catalog/pg_class.h"
  #include "foreign/fdwapi.h"
  #include "miscadmin.h"
***
*** 1945,1950  create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 1946,1953 
  	RelOptInfo *rel = best_path->path.parent;
  	Index		scan_relid = rel->relid;
  	RangeTblEntry *rte;
+ 	Bitmapset  *attrs_used = NULL;
+ 	ListCell   *lc;
  	int			i;
  
  	/* it should be a base rel... */
***
*** 1993,2008  create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
  	 * bit of a kluge and might go away someday, so we intentionally leave it
  	 * out of the API presented to FDWs.
  	 */
  	scan_plan->fsSystemCol = false;
  	for (i = rel->min_attr; i < 0; i++)
  	{
! 		if (!bms_is_empty(rel->attr_needed[i - rel->min_attr]))
  		{
  			scan_plan->fsSystemCol = true;
  			break;
  		}
  	}
  
  	return scan_plan;
  }
  
--- 1996,2030 
  	 * bit of a kluge and might go away someday, so we intentionally leave it
  	 * out of the API presented to FDWs.
  	 */
+ 
+ 	/*
+ 	 * Add all the attributes needed for joins or final output.  Note: we must
+ 	 * look at reltargetlist, not the attr_needed data, because attr_needed
+ 	 * isn't computed for inheritance child rels.
+ 	 */
+ 	pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used);
+ 
+ 	/* Add all the attributes used by restriction clauses. */
+ 	foreach(lc, rel->baserestrictinfo)
+ 	{
+ 		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+ 
+ 		pull_varattnos((Node *) rinfo->clause, rel->relid, &attrs_used);
+ 	}
+ 
+ 	/* Are any system columns requested from rel? */
  	scan_plan->fsSystemCol = false;
  	for (i = rel->min_attr; i < 0; i++)
  	{
! 		if (bms_is_member(i - FirstLowInvalidHeapAttributeNumber, attrs_used))
  		{
  			scan_plan->fsSystemCol = true;
  			break;
  		}
  	}
  
+ 	bms_free(attrs_used);
+ 
  	return scan_plan;
  }
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-08 Thread Robert Haas
On Mon, Sep 8, 2014 at 8:07 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-04 14:19:47 +0200, Andres Freund wrote:
 Yes. I plan to push the patch this weekend. Sorry for the delay.

 I'm about to push this. Is it ok to first push it to master and only
 backpatch once a couple buildfarm cycles haven't complained?

That will have the disadvantage that src/tools/git_changelog will show
the commits separately instead of grouping them together; so it's
probably best not to make a practice of it.  But I think it's up to
your discretion how to handle it in any particular case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-08 Thread Fabien COELHO


Hello Mutsumara-san,


#3. Documentation
I think modulo operator explanation should put at last at the doc line.
Because the others are more frequently used.



So I like patch3 which is simple and practical.


Ok.


If you agree or reply my comment, I will mark ready for commiter.


Please find attached v4, which is v3 plus an improved documentation
which is clearer about the sign of the remainder.

--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..a815ad3 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -1574,6 +1574,22 @@ top:
 	}
 	snprintf(res, sizeof(res), INT64_FORMAT, ope1 / ope2);
 }
+else if (strcmp(argv[3], "%") == 0)
+{
+	int64_t remainder;
+	if (ope2 == 0)
+	{
+		fprintf(stderr, "%s: division by zero in modulo\n", argv[0]);
+		st->ecnt++;
+		return true;
+	}
+	/* divisor signed remainder */
+	remainder = ope1 % ope2;
+	if ((ope2 < 0 && remainder > 0) ||
+		(ope2 > 0 && remainder < 0))
+		remainder += ope2;
+	snprintf(res, sizeof(res), INT64_FORMAT, remainder);
+}
 else
 {
 	fprintf(stderr, "%s: unsupported operator %s\n", argv[0], argv[3]);
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index b479105..66ec622 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -735,7 +735,9 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
   Each <replaceable>operand</> is either an integer constant or a
   <literal>:</><replaceable>variablename</> reference to a variable
   having an integer value.  The <replaceable>operator</> can be
-  <literal>+</>, <literal>-</>, <literal>*</>, or <literal>/</>.
+  <literal>+</>, <literal>-</>, <literal>*</>, <literal>%</> or <literal>/</>.
+  The modulo operation (<literal>%</>) is based on the floored division, so
+  that the remainder keeps the sign of the divisor.
  </para>

  <para>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-08 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-09-04 14:19:47 +0200, Andres Freund wrote:
 Yes. I plan to push the patch this weekend. Sorry for the delay.

 I'm about to push this. Is it ok to first push it to master and only
 backpatch once a couple buildfarm cycles haven't complained?

It makes for a cleaner commit history if you push concurrently into
all the branches you intend to patch.  That also gives more buildfarm
runs, which seems like a good thing for this sort of patch.

That is, assuming that we ought to backpatch at all, which to my mind
is debatable.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-08 Thread Robert Haas
On Mon, Sep 8, 2014 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 It makes for a cleaner commit history if you push concurrently into
 all the branches you intend to patch.  That also gives more buildfarm
 runs, which seems like a good thing for this sort of patch.

 That is, assuming that we ought to backpatch at all, which to my mind
 is debatable.

We're not going to backpatch the main patch to make spinlock
primitives act as compiler barriers - or at least, I will object
loudly.

But what we're talking about here is a bug fix for Sparc.  And surely
we ought to back-patch that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/pgSQL 2

2014-09-08 Thread Merlin Moncure
On Fri, Sep 5, 2014 at 6:18 PM, Andrew Dunstan and...@dunslane.net wrote:

 On 09/05/2014 12:37 PM, Merlin Moncure wrote:

 On Thu, Sep 4, 2014 at 6:40 PM, Florian Pflug f...@phlo.org wrote:

 Cost of hidden IO cast is negative too. If we can change it, then we can
 increase speed.

 But the whole power of PL/pgSQL comes from the fact that it allows you to
  use the full set of postgres data types and operators, and that it is
  seamlessly
  integrated with SQL. Without that, PL/pgSQL is about as appealing as
 BASIC
 as a programming language...

 Right, and it's exactly those types and operators that are the cause
 of the performance issues.  A compiled pl/pgsql would only get serious
 benefit for scenarios involving tons of heavy iteration or funky local
 data structure manipulation.  Those scenarios are somewhat rare in
  practice for database applications and often better handled in
  another pl should they happen.

  plv8 is emerging as the best non-sql pl: it's JIT compiled by the plv8
  runtime, the javascript language is designed for embedding, and the
 json data structure has nice similarities with postgres's arrays and
 types.  In fact, if I *were* to attempt pl/pgsql compiling, I'd
 probably translate the code to plv8 and hand it off to the llvm
 engine.  You'd still have to let postgres handle most of the operator
 and cast operations but you could pull some things into the plv8
 engine.  Probably, this would be a net loser since plv8 (unlike
 plpgsql) has to run everything through SPI.

 plpgsql makes extensive use of SPI. Just look at the source code if you
 don't believe me.

oh, certainly.  pl/pgsql also has the ability to bypass SPI for many
simple expressions.  Other pls generally don't do this because they
can't if they want to guarantee SQL semantics... that's ok then
because they don't have to as the language runtime handles operations
local to the function and everything runs under that language's rules.

In a nutshell, my thinking here is to translate pl/pgsql to pl/v8
javascript and then let the optimizing v8 runtime take it from there.
This is IMNSHO a tiny challenge relative to writing an optimization
engine for pl/pgsql by hand.  Think of it as coffeescript for
databases.

It's a nice thought, but there are a lot of roadblocks to making it
happen -- starting with the lack of a javascript library that would
wrap the C postgres datatype routines so you wouldn't have to call in
to SPI for every little thing; as you know, even i := i + 1; can't be
handled by native javascript operations.
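
For example, even the addition in i := i + 1 ultimately has to be done by the
backend's own int4 routines if SQL semantics are to be preserved; roughly (a
sketch of the kind of call such a wrapper library would expose, not anything
that exists today):

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

/*
 * Sketch: what "i := i + 1" boils down to for an int4 once you bypass SPI.
 * Real pl/pgsql goes through the expression machinery so that operator
 * lookup, casts and overflow checks follow SQL semantics.
 */
static Datum
increment_int4(Datum i)
{
	return DirectFunctionCall2(int4pl, i, Int32GetDatum(1));
}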

 plv8 also has a nice find_function gadget that lets you find and call
 another plv8 function directly instead of having to use an SPI call.

Yeah -- this is another reason why pl/v8 is a nice as a compilation
target.  javascript as we all know is a language with a long list of
pros and cons but it's designed for embedding.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-08 Thread Bruce Momjian
On Mon, Sep  8, 2014 at 10:08:04AM -0400, Robert Haas wrote:
 On Mon, Sep 8, 2014 at 8:07 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-09-04 14:19:47 +0200, Andres Freund wrote:
  Yes. I plan to push the patch this weekend. Sorry for the delay.
 
  I'm about to push this. Is it ok to first push it to master and only
  backpatch once a couple buildfarm cycles haven't complained?
 
 That will have the disadvantage that src/tools/git_changelog will show
 the commits separately instead of grouping them together; so it's
 probably best not to make a practice of it.  But I think it's up to
 your discretion how to handle it in any particular case.

Uh, the git_changelog timespan check is 24 hours, so if the delay is less
than 24 hours, I think we are ok, e.g.:

# Might want to make this parameter user-settable.
my $timestamp_slop = 24 * 60 * 60;


-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-08 Thread Mitsumasa KONDO
Hi Fabien-san,

Thank you for your fast work!

2014-09-08 23:08 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr:


 Hello Mutsumara-san,

  #3. Documentation
 I think modulo operator explanation should put at last at the doc line.
 Because the others are more frequently used.


  So I like patch3 which is simple and practical.


 Ok.

  If you agree or reply my comment, I will mark ready for commiter.


 Please find attached v4, which is v3 plus an improved documentation
 which is clearer about the sign of the remainder.


 The attached seems to have no problem, but I'd like to comment on the order of
explanation of the five formulas.

A fixed version is attached. Please confirm it, and I will mark it ready for committer.

Best regards,
--
Mitsumasa KONDO


pgbench-modulo-4-1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 But what we're talking about here is a bug fix for Sparc.  And surely
 we ought to back-patch that.

Ah.  OK, no objection.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-08 Thread Fabien COELHO



The attached is seemed no problem. But I'd like to say about order of
explanation in five formulas.

Fix version is here. Please confirm, and I mark it for ready for 
commiter.


I'm ok with this version.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-09-08 Thread Robert Haas
On Wed, Sep 3, 2014 at 7:34 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Well, I think you're certainly right that a hash table lookup is more
 expensive than modulo on a 32-bit integer; so much is obvious.  But a hash
 join can increase the number of batches on the fly, but only by
 doubling it, so you might go from 4 batches to 8 when 5 would really
 have been enough.  And a hash join also can't *reduce* the number of
 batches on the fly, which might matter a lot.  Getting the number of
 batches right avoids I/O, which is a lot more expensive than CPU.

 Regarding the estimates, I don't see much difference between the two
 approaches when handling this issue.

 It's true you can wait with deciding how many partitions (aka batches)
 to create until work_mem is full, at which point you have more
 information than at the very beginning. You know how many tuples you've
 already seen, how many tuples you expect (which is however only an
 estimate etc.). And you may use that to estimate the number of
 partitions to create.

I think it's significantly better than that.  The first point I'd make
is that if work_mem never fills up, you don't need to batch anything
at all.  That's a potentially huge win over batching a join we thought
was going to overrun work_mem, but didn't.

But even if work_mem does fill up, I think we still come out ahead,
because we don't necessarily need to dump the *entirety* of each batch
to disk.  For example, suppose there are 900 distinct values and only
300 of them can fit in memory at a time.  We read the input until
work_mem is full and we see a previously-unseen value, so we decide to
split the input up into 4 batches.  We now finish reading the input.
Each previously-seen value gets added to an existing in-memory group,
and each new value gets written into one of four disk files.  At
the end of the input, 300 groups are complete, and we have four files
on disk each of which contains the data for 150 of the remaining 600
groups.
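
Very roughly, the control flow being described here (as I read Jeff's approach)
is the following; every name in the sketch is invented purely for illustration:

/* Sketch only: not actual patch code. */
while ((tup = get_next_input_tuple()) != NULL)
{
	AggGroup   *group = lookup_group(hashtable, tup);

	if (group != NULL)
		advance_transition(group, tup);		/* group already in memory */
	else if (!work_mem_exhausted())
		advance_transition(create_group(hashtable, tup), tup);
	else
	{
		/* new group seen after memory filled: spill the raw tuple */
		int		batchno = tuple_hash(tup) % num_spill_batches;

		write_to_batch_file(batchno, tup);
	}
}
emit_completed_groups(hashtable);	/* the 300 finished groups in the example */
/* then reread each batch file and repeat, splitting further if needed */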

Now, the alternative strategy is to batch from the beginning.  Here,
we decide right from the get-go that we're using 4 batches, so batch
#1 goes into memory and the remaining 3 batches get written to three
different disk files.  At the end of the input, 225 groups are
complete, and we have three files on disk each of which contains the
data for 225 of the remaining 675 groups.  This seems clearly
inferior, because we have written 675 groups to disk when it would
have been possible to write only 600.

The gains can be even more significant when the input data is skewed.
For example, suppose things are as above, but ten values account for
90% of all the inputs, and the remaining 890 values account for the
other 10% of the inputs.  Furthermore, let's suppose we have no table
statistics or they are totally wrong.  In Jeff's approach, as long as
each of those values occurs at least once before work_mem fills up,
they'll all be processed in the initial pass through the data, which
means we will write at most 10% of the data to disk.  In fact it will
be a little bit less, because batch 1 will have not only the 10
frequently-occurring values but also 290 others, so our initial pass
through the data will complete 300 groups covering (if the
less-frequent values occur with uniform frequency) 93.258% of the
data.  The remaining ~6.8% will be split up into 4 files which we can
then reread and process.  But if we use the other approach, we'll only
get 2 or 3 of the 10 commonly-occurring values in the first batch, so
we expect to write about 75% of the data out to one of our three batch
files.  That's a BIG difference - more than 10x the I/O load that
Jeff's approach would have incurred.  Now, admittedly, we could use a
skew optimization similar to the one we use for hash joins to try to
get the MCVs into the first batch, and that would help a lot when the
statistics are right - but sometimes the statistics are wrong, and
Jeff's approach doesn't care.  It just keeps on working.

 That however comes at a cost - it's not really a memory-bounded hash
 aggregate, because you explicitly allow exceeding work_mem as more
 tuples for existing groups arrive.

Well, that would be true for now, but as has been mentioned, we can
add new methods to the aggregate infrastructure to serialize and
de-serialize transition states.  I guess I agree that, in the absence
of such infrastructure, your patch might be a better way to handle
cases like array_agg, but I'm pretty happy to see that infrastructure
get added.

Hmm.  It occurs to me that it could also be really good to add a
merge transition states operator to the aggregate infrastructure.
That would allow further improvements to Jeff's approach for cases
like array_agg.  If we serialize a transition state to disk because
it's not fitting in memory, we don't need to reload it before
continuing to process the group, or at least not right away.  We can
instead just start a new transitions state and then merge all of the
accumulated 

Re: [HACKERS] proposal (9.5) : psql unicode border line styles

2014-09-08 Thread Pavel Stehule
Hi

I removed the dynamic allocation and reduced the patch size.

What I tested: the old unicode style is the same as the new unicode style;
nothing was changed. Some fields are specified in the refresh_utf8format
function.

Regards

Pavel




2014-09-08 4:44 GMT+02:00 Stephen Frost sfr...@snowman.net:

 Pavel,

 * Pavel Stehule (pavel.steh...@gmail.com) wrote:
  2014-07-23 8:38 GMT+02:00 Tomas Vondra t...@fuzzy.cz:
   OK, thanks. The new version seems OK to me.
 
  Thank you

 I've started looking over the patch and went back through the previous
 thread about it.  For my part, I'm in favor of adding this capability,
 but I'm not terribly happy about how it was done.  In particular,
 get_line_style() seems pretty badly hacked around, and I don't really
 like having the prepare_unicode_format call underneath it allocating
 memory and then passing back up the need to free that memory via a new
 field in the structure.  Also, on a quick glance, are you sure that the
 new 'unicode' output matches the same as the old 'unicode' did (with
 pg_utf8format)?

 I would think we'd simply set up a structure which is updated when the
 linestyle is changed, which is surely going to be much less frequently
 than the request for which linestyle to use happens, and handle all of
 the line styles in more-or-less the same way rather than doing something
 completely different for unicode than for the others.

 Thanks,

 Stephen

commit 509f8a92525889651653a75356d3fa57b58f3141
Author: Pavel Stehule pavel.steh...@gooddata.com
Date:   Mon Sep 8 17:18:43 2014 +0200

remove palloc

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index db314c3..84233d0 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2299,6 +2299,42 @@ lo_import 152801
   </para>
   </listitem>
   </varlistentry>
+
+  <varlistentry>
+  <term><literal>unicode_border_style</literal></term>
+  <listitem>
+  <para>
+  Sets the border line drawing style to one
+  of <literal>single</literal> or <literal>double</literal>
+  This option only affects the <literal>unicode</>
+  linestyle
+  </para>
+  </listitem>
+  </varlistentry>
+
+  <varlistentry>
+  <term><literal>unicode_column_style</literal></term>
+  <listitem>
+  <para>
+  Sets the column line drawing style to one
+  of <literal>single</literal> or <literal>double</literal>
+  This option only affects the <literal>unicode</>
+  linestyle
+  </para>
+  </listitem>
+  </varlistentry>
+
+  <varlistentry>
+  <term><literal>unicode_header_style</literal></term>
+  <listitem>
+  <para>
+  Sets the header line drawing style to one
+  of <literal>single</literal> or <literal>double</literal>
+  This option only affects the <literal>unicode</>
+  linestyle
+  </para>
+  </listitem>
+  </varlistentry>
 </variablelist>
 </para>
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index a66093a..fd05aae 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1054,6 +1054,9 @@ exec_command(const char *cmd,
"footer", "format", "linestyle", "null",
"numericlocale", "pager", "recordsep",
"tableattr", "title", "tuples_only",
+"unicode_border_linestyle",
+"unicode_column_linestyle",
+"unicode_header_linestyle",
NULL
 			};
 
@@ -2248,6 +2251,40 @@ _align2string(enum printFormat in)
	return "unknown";
 }
 
+/*
+ * Parse entered unicode linestyle. Returns true, when entered string is
+ * known linestyle: single, double else returns false.
+ */
+static bool 
+set_unicode_line_style(printQueryOpt *popt, const char *value, size_t vallen, unicode_linestyle *linestyle)
+{
+	if (pg_strncasecmp("single", value, vallen) == 0)
+		*linestyle = UNICODE_LINESTYLE_SINGLE;
+	else if (pg_strncasecmp("double", value, vallen) == 0)
+		*linestyle = UNICODE_LINESTYLE_DOUBLE;
+	else
+		return false;
+
+	/* input is ok, generate new unicode style */
+	refresh_utf8format(&(popt->topt));
+
+	return true;
+}
+
+static const char *
+_unicode_linestyle2string(int linestyle)
+{
+	switch (linestyle)
+	{
+		case UNICODE_LINESTYLE_SINGLE:
+			return "single";
+			break;
+		case UNICODE_LINESTYLE_DOUBLE:
+			return "double";
+			break;
+	}
+	return "unknown";
+}
 
 bool
 do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
@@ -2305,6 +2342,42 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
 
 	}
 
+	/* set unicode border line style */
+	else if (strcmp(param, "unicode_border_linestyle") == 0)
+	{
+		if (!value)
+			;
+		else if (!set_unicode_line_style(popt, value, vallen, &popt->topt.unicode_border_linestyle))
+		{
+			psql_error("\\pset: allowed unicode border linestyle are single, double\n");
+			return false;
+		}
+	}
+
+	/* set unicode column line style */
+	else if (strcmp(param, "unicode_column_linestyle") == 0)
+	{

Re: [HACKERS] PQputCopyEnd doesn't adhere to its API contract

2014-09-08 Thread Robert Haas
On Thu, Sep 4, 2014 at 6:38 PM, David Johnston
david.g.johns...@gmail.com wrote:
 The implied suggestion is that if I do find any other areas that look like
 they need fixing - even in the same file - I should separate them out into a
 separate patch.

Yes.

 Though I have seen various "while I was in there I also
 fixed such-and-such" commits previously, so the line is at least somewhat
 fluid.

Yep, it's a judgement call.  As a general rule though, I think there's
often a pretty clear connection between the main topic of the commit
and the changes folded in.  Another way to think about this is that,
at least for non-committers (and frequently also for committers), any
patch someone writes is going to need an upvote from *at least* one
other person in order to go in.  If somebody can look at the patch and
easily determine that it solves some problem and doesn't make anything
worse, then they're likely to like it (and if they're a committer,
maybe commit it).  But if they see changes they like mixed in with
changes they don't like, then they're likely to either bounce it back,
or just say, hmm, this looks like it will take some time to deal with,
let me put that on my TODO list.  And of course sometimes it never
makes it off the TODO list.

Now, it's a fair point that this makes it hard to get large-scale
changes done.  But, to some extent, that's a good thing.  Prudence,
indeed, will dictate that source code or documentation long
established should not be changed for light or transient causes.  For
the rest, if you feel a large scale change is really needed, it's best
to start with a proposal: I think we need to rehash the documentation
in section X because of reason Y.  You've made a few proposals to
rehash sections of the documentation on pgsql-hackers, but I haven't
actually seen clear and compelling justification for those reworkings.
Clearly, you like it better the new way, but the person who did it the
existing way likely prefers their version, and they've been around
longer than you.  :-)  By stating your objectives up-front, you can
see whether people agree with those objectives or not.  If they do,
you can hope to find the final patch criticized only on fine details
which are easily remedied; but if they don't, then some rethinking may
be needed.

 This seems pointless.  Of course general documentation will be less
 specific than documentation for specific functions.

 The existing wording was being verbose in order to be correct.  In a summary
 like this I'd trade being reasonably accurate and general for the precision
 that is included elsewhere.

This is an example of a goal where you might solicit people's general
thoughts before starting.  Maybe people will agree that removing
details in a certain place is useful, or maybe they won't, but it
needs discussion.

 One of the trade-offs I mentioned... it's more style than anything, but
 removing the parenthetical (if there is no error in the command) and
 writing it more directly seemed preferable in an overview such as this.

 Better:  The function will either throw an error or return a PGresult
 object[...]

Nope.  This is not C++, nor is it the backend.  It will not throw anything.

 +  para
 +   Second, the application should use the functions in this
 +   section to receive data rows or transmit buffer loads.  Buffer loads
 are
 +   not guaranteed to be processed until the copy transfer is completed.
 +  /para

 The main change here vs. the existing text is that you're now using
  the phrase "buffer loads" to refer to what gets transmitted, and "data
  rows" to refer to what gets received.  The existing text uses the term
  "data rows" for both, which seems correct to me.  My first reaction on
  reading your revised text was "wait, what's a buffer load?".

 So, my generalization policy working in reverse - since the transmit side
 does not have to send complete rows, implying that it does here is (albeit
 acceptably) inaccurate.

The existing wording doesn't say that each call to one of the
functions in question must contain only whole data rows of itself.  It
merely says that these are the functions you need to use to send data
rows, which is true.

 -   At this point further SQL commands can be issued via
 -   functionPQexec/function.  (It is not possible to execute other SQL
 -   commands using the same connection while the commandCOPY/command
 -   operation is in progress.)

 Removing this text doesn't seem like a good idea.  It's a quite
 helpful clarification.  The note you've added in its place doesn't
 seem like a good substitute for it, and more generally, I think we
 should avoid the overuse of constructs like note.  Emphasis needs to
 be used minimally or it loses its meaning.

 Was trying to remove repetition here - happy to consider alternative way of
 doing so if the note is objectionable.

I guess it doesn't seem repetitive to me.

 It's great to be able to read the documentation like a book, but it also wants
 to be useful for scanning and a 

Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Heikki Linnakangas

On 09/08/2014 03:26 PM, Alexander Korotkov wrote:

On Mon, Sep 8, 2014 at 12:51 PM, Heikki Linnakangas hlinnakan...@vmware.com

wrote:



On 09/08/2014 11:19 AM, Alexander Korotkov wrote:


On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com



wrote:

  On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas 

hlinnakan...@vmware.com wrote:

  In the b-tree code, we solved that problem back in 2006, so it can be

done but requires a bit more code. In b-tree, we solved it with a
vacuum
cycle ID number that's set on the page halves when a page is split.
That
allows VACUUM to identify pages that have been split concurrently sees
them, and jump back to vacuum them too. See commit
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=
5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do
something similar in GiST, and in fact you might be able to reuse the
NSN
field that's already set on the page halves on split, instead of adding
a
new vacuum cycle ID.


...


Another note. Assuming we have NSN which can play the role of vacuum
cycle
ID, can we implement sequential (with possible jump back) index scan
for
GiST?


Yeah, I think it would work. It's pretty straightforward, the page split
code already sets the NSN, just when we need it. Vacuum needs to memorize
the current NSN when it begins, and whenever it sees a page with a higher
NSN (or the FOLLOW_RIGHT flag is set), follow the right-link if it points
to lower-numbered page.


I mean full index scan feature for SELECT queries might be implemented as
well as sequential VACUUM.


Oh, sorry, I missed that. If you implement a full-index scan like that, 
you might visit some tuples twice, so you'd have to somehow deal with 
the duplicates. For a bitmap index scan it would be fine.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-09-08 Thread Jeff Janes
On Sun, Aug 17, 2014 at 7:46 PM, Fujii Masao masao.fu...@gmail.com wrote:


 Thanks for reviewing the patch! ISTM that I failed to make the patch from
 my git repository... Attached is the rebased version.



I get some compiler warnings on v2 of this patch:

reloptions.c:219: warning: excess elements in struct initializer
reloptions.c:219: warning: (near initialization for 'intRelOpts[15]')


Cheers,

Jeff


Re: [HACKERS] [v9.5] Custom Plan API

2014-09-08 Thread Robert Haas
On Thu, Sep 4, 2014 at 7:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 Regarding to the attached three patches:
 [1] custom-path and hook
 It adds register_custom_path_provider() interface for registration of
 custom-path entrypoint. Callbacks are invoked on set_plain_rel_pathlist
 to offer alternative scan path on regular relations.
 I may need to explain the terms in use. I call the path-node "custom-path";
 it is the step prior to the population of the plan-node (like custom-scan
 and potentially custom-join and so on). The node object created by
 CreateCustomPlan() is called "custom-plan" because it is an abstraction for
 all the potential custom-xxx nodes; custom-scan is the first of these.

I don't think it's a good thing that add_custom_path_type is declared
as void (*)(void *) rather than having a real type.  I suggest we add
the path-creation callback function to CustomPlanMethods instead, like
this:

void (*CreateCustomScanPath)(PlannerInfo *root, RelOptInfo *baserel,
RangeTblEntry *rte);

Then, register_custom_path_provider() can just take CustomPathMethods
* as an argument; and create_customscan_paths can just walk the list
of CustomPlanMethods objects and call CreateCustomScanPath for each
one where that is non-NULL.  This conflates the path generation
mechanism with the type of path getting generated a little bit, but I
don't see any real downside to that.  I don't see a reason why you'd
ever want two different providers to offer the same type of
custompath.
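
As a rough illustration of that shape (all type, field, and function names
below are hypothetical sketches of the suggestion, not the patch's actual
API), a provider would then boil down to something like:

/* Sketch only: callback-based custom path provider registration */
typedef struct CustomPathMethods
{
	const char *CustomName;

	/* Offer alternative scan paths for a base relation (may do nothing). */
	void		(*CreateCustomScanPath) (PlannerInfo *root,
										 RelOptInfo *baserel,
										 RangeTblEntry *rte);
} CustomPathMethods;

/* Hypothetical registration entry point, taking the method table directly. */
extern void register_custom_path_provider(const CustomPathMethods *methods);

static void
my_create_scan_path(PlannerInfo *root, RelOptInfo *baserel, RangeTblEntry *rte)
{
	/* build a CustomPath and add_path() it when this provider applies */
}

static const CustomPathMethods my_methods = {
	"my_provider",
	my_create_scan_path
};

void
_PG_init(void)
{
	register_custom_path_provider(&my_methods);
}

create_customscan_paths() would then simply walk the registered method tables
and call CreateCustomScanPath for each one where that callback is non-NULL.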

Don't the changes to src/backend/optimizer/plan/createplan.c belong in patch #2?

 [2] custom-scan node
 It adds custom-scan node support. The custom-scan node is expected to
 generate contents of a particular relation or sub-plan according to its
 custom-logic.
 Custom-scan provider needs to implement callbacks of CustomScanMethods
 and CustomExecMethods. Once a custom-scan node is populated from
 custom-path node, the backend calls back these methods in the planning
 and execution stage.

It looks to me like this patch is full of holdovers from its earlier
life as a more-generic CustomPlan node.  In particular, it contains
numerous defenses against the case where scanrelid != 0.  These are
confusingly written as scanrelid > 0, but I think really they're just
bogus altogether: if this is specifically a CustomScan, not a
CustomPlan, then the relid should always be filled in.  Please
consider what can be simplified here.

The comment in _copyCustomScan looks bogus to me.  I think we should
*require* a static method table.

In create_custom_plan, you write if (IsA(custom_plan, CustomScan)) { lots of
stuff; } else elog(ERROR, ...).  I think it would be clearer to write
if (!IsA(custom_plan, CustomScan)) elog(ERROR, ...); lots of stuff;

 Also, regarding to the use-case of multi-exec interface.
 Below is an EXPLAIN output of PG-Strom. It shows the custom GpuHashJoin has
 two sub-plans; GpuScan and MultiHash.
 GpuHashJoin is stacked on the GpuScan. It is a case when these nodes utilize
 multi-exec interface for more efficient data exchange between the nodes.
 GpuScan already keeps a data structure that is suitable to send to/recv from
 GPU devices and constructed on the memory segment being DMA available.
 If we have to form a tuple, pass it via row-by-row interface, then deform it,
 it will become a major performance degradation in this use case.

 postgres=# explain select * from t10 natural join t8 natural join t9 where x > 10;
                                            QUERY PLAN
 ---
  Custom (GpuHashJoin)  (cost=10979.56..90064.15 rows=333 width=49)
    pseudo scan tlist: 1:(t10.bid), 3:(t10.aid), 4:t10.x, 2:t8.data, 5:[t8.aid], 6:[t9.bid]
    hash clause 1: ((t8.aid = t10.aid) AND (t9.bid = t10.bid))
    ->  Custom (GpuScan) on t10  (cost=1.00..88831.26 rows=327 width=16)
          Host References: aid, bid, x
          Device References: x
          Device Filter: (x > 10::double precision)
    ->  Custom (MultiHash)  (cost=464.56..464.56 rows=1000 width=41)
          hash keys: aid, bid
          ->  Hash Join  (cost=60.06..464.56 rows=1000 width=41)
                Hash Cond: (t9.data = t8.data)
                ->  Index Scan using t9_pkey on t9  (cost=0.29..357.29 rows=1 width=37)
                ->  Hash  (cost=47.27..47.27 rows=1000 width=37)
                      ->  Index Scan using t8_pkey on t8  (cost=0.28..47.27 rows=1000 width=37)
  Planning time: 0.810 ms
 (15 rows)

Why can't the Custom(GpuHashJoin) node build the hash table internally
instead of using a separate node?

Also, for this patch we are only considering custom scan.  Custom join
is another patch.  We don't need to provide infrastructure for that
patch in this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Scaling shared buffer eviction

2014-09-08 Thread Merlin Moncure
On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Client Count/Patch_Ver (tps)      8      16      32      64     128
 HEAD                          58614  107370  140717  104357   65010
 Patch                         60092  113564  165014  213848  216065

 This data is median of 3 runs, detailed report is attached with mail.
 I have not repeated the test for all configurations, as there is no
 major change in design/algorithm which can effect performance.
 Mark has already taken tpc-b data which ensures that there is
 no problem with it, however I will also take it once with latest version.

Well, these numbers are pretty much amazing.  Question: It seems
there's obviously quite a bit of contention left;  do you think
there's still a significant amount of time in the clock sweep, or is
the primary bottleneck the buffer mapping locks?

merlin




Re: [HACKERS] pgcrypto: PGP signatures

2014-09-08 Thread Jeff Janes
On Fri, Sep 5, 2014 at 4:38 AM, Marko Tiikkaja ma...@joh.to wrote:

 Hi all,

 I've updated the patch with a number of changes:
   1) I've documented the current limitations of signatures
   2) I've expanded section F.25.3 to add information about signatures
 (though I'm not sure why this part is in the user-facing documentation in
 the first place).
   3) I've changed the code to use ntohl() and pg_time_t as per Thomas'
 comments.
   4) I've changed the code to consistently use while (1) instead of for
 (;;) (except for the math library, but I didn't touch that at all)

 I've also changed the behaviour when passing a message with a signature to
 the decrypt functions which don't verify signatures.  They now report
 ERROR:  Wrong key or corrupt data instead of decrypting and silently
 ignoring the signature.  The behaviour is now backwards compatible, but I
 see two ways we could possibly possibly improve this:
   1) Produce a better error message (I'm sure most people don't know about
 the hidden debug=1 setting)
   2) Provide an option to ignore the signature if decrypting the data is
 desirable even if the signature can't be verified



If I understand the sequence here: the current git HEAD is that
pgp_pub_decrypt would throw an error if given a signed and encrypted
message, an earlier version of your patch changed that to decrypt the
message and ignore the signature, and the current version went back to
throwing an error.

I think I prefer the middle of those behaviors.  The original behavior
seems like a bug to me, and I don't think we need to be backwards
compatible with bugs.  Why should a function called decrypt care if the
message is also signed?  That is not its job.

If we decide to throw the error, a better error message certainly wouldn't
hurt.  And the output of 'debug=1' is generally not comprehensible unless
you are familiar with the source code, so that is not a substitute.

(By the way, there are now 2 patches in this series named
pgcrypto_sigs.v3.patch--so be careful which one you look at.)

There seems to be a memory leak in pgp_sym_decrypt_verify that does not
exist in pgp_sym_decrypt.  It is about 58 bytes per decryption.

Perl test script:


my $dbh=connect(...);
my $pub=`cat public.asc`;
my $pri=`cat private.asc`;


my $enc= $dbh->prepare("select
armor(pgp_sym_encrypt_sign('asdlkfjsldkfjsadf',?,dearmor(?),'debug=1'))");
my $dec= $dbh->prepare("select
pgp_sym_decrypt_verify(dearmor(?),?,dearmor(?),'debug=1')");
my $i=1;

$enc->execute("foobar$i",$pri);
my ($message)=$enc->fetchrow();

foreach my $ii (1..100) {
  ## my $i=$ii;
  $dec->execute($message,"foobar$i",$pub);
  my ($message2)=$dec->fetchrow();
  die unless $message2 eq "asdlkfjsldkfjsadf";
  warn "$i\t", time() if $i%1000 ==0;
};



Cheers,

Jeff


Re: [HACKERS] implement subject alternative names support for SSL connections

2014-09-08 Thread Heikki Linnakangas

On 09/05/2014 07:30 PM, Alexey Klyukin wrote:

On Thu, Sep 4, 2014 at 10:23 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:


Hmm. Perhaps we should use X509_NAME_get_index_by_NID + X509_NAME_get_entry
instead of X509_NAME_get_text_by_NID. You could then pass the ASN1_STRING
object to the certificate_name_entry_validate_match() function, and have it
do the ASN1_STRING_length() and ASN1_STRING_data() calls too.

...

I think we should:

1. Check if there's a common name, and if so, print that
2. Check if there is exactly one SAN, and if so, print that
3. Just print an error without mentioning names.

There's a lot of value in printing the name if possible, so I'd really like
to keep that. But I agree that printing all the names if there are several
would get complicated and the error message could become very long. Yeah,
the error message might need to be different for cases 1 and 2. Or maybe
phrase it "server certificate's name \"%s\" does not match host name
\"%s\"", which would be reasonable for both 1. and 2.


Thank you, I've implemented both suggestions in the attached new
version of the patch.
On the error message, I've decided to show either a single name, or
the first examined name and the number of other names present in the
certificate, i.e:


psql: server name "example.com" and 2 other names from server SSL certificate do not
match host name "example-foo.com"


The error does not state whether the names comes from the CN or from
the SAN section.


I'd reword that slightly, to:

psql: server certificate for "example.com" (and 2 other names) does not
match host name "example-foo.com"


I never liked the current wording, "server name foo", very much. I 
think it's important to use the word "server certificate" in the error 
message, to make it clear where the problem is.


For translations, that message should be pluralized, but there is no 
libpq_ngettext macro equivalent to ngettext(), like libpq_gettext. I 
wonder if that was left out on purpose, or if we just haven't needed 
that in libpq before. Anyway, I think we need to add that for this.
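
For what it's worth, a minimal sketch of such a macro pair, modeled on the
existing libpq_gettext (and ignoring the one-time textdomain initialization
the real function performs), might be:

/* libpq-int.h style declaration -- sketch only */
#ifdef ENABLE_NLS
extern char *libpq_ngettext(const char *msgid, const char *msgid_plural,
							unsigned long n);
#else
#define libpq_ngettext(s, p, n) ((n) == 1 ? (s) : (p))
#endif

/* ... and a matching implementation in fe-misc.c, mirroring libpq_gettext: */
#ifdef ENABLE_NLS
char *
libpq_ngettext(const char *msgid, const char *msgid_plural, unsigned long n)
{
	return dngettext(PG_TEXTDOMAIN("libpq"), msgid, msgid_plural, n);
}
#endif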



This version also checks for the availability of the subject name, it
looks like RFC 5280 does not require it at all.

4.1.2.6.  Subject

The subject field identifies the entity associated with the public
key stored in the subject public key field.  The subject name MAY be
carried in the subject field and/or the subjectAltName extension.


Ok.


The pattern of allocating the name in the subroutine and then
reporting it (and releasing when necessary) in the main function is a
little bit ugly, but it looks to me the least ugly of anything else I
could come up (i.e. extracting those names at the time the error
message is shown).


I reworked that a bit, see attached. What do you think?

I think this is ready for commit, except for two things:

1. The pluralization of the message for translation

2. I still wonder if we should follow the RFC 6125 and not check the 
Common Name at all, if SANs are present. I don't really understand the 
point of that rule, and it seems unlikely to pose a security issue in 
practice, because surely a CA won't sign a certificate with a 
bogus/wrong CN, because an older client that doesn't look at the SANs at 
all would use the CN anyway. But still... what does a Typical Web 
Browser do?


- Heikki

diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index f950fc3..4f6f324 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -60,9 +60,13 @@
 #ifdef USE_SSL_ENGINE
 #include <openssl/engine.h>
 #endif
+#include <openssl/x509v3.h>
 
 static bool verify_peer_name_matches_certificate(PGconn *);
 static int verify_cb(int ok, X509_STORE_CTX *ctx);
+static int verify_peer_name_matches_certificate_name(PGconn *conn,
+                                                     ASN1_STRING *name,
+                                                     char **store_name);
 static void destroy_ssl_system(void);
 static int initialize_SSL(PGconn *conn);
 static PostgresPollingStatusType open_client_SSL(PGconn *);
@@ -473,98 +477,229 @@ wildcard_certificate_match(const char *pattern, const char *string)
 
 
 /*
- * Verify that common name resolves to peer.
+ * Check if a name from a server's certificate matches the peer's hostname.
+ *
+ * Returns 1 if the name matches, and 0 if it does not. On error, returns
+ * -1, and sets the libpq error message.
+ *
+ * The name extracted from the certificate is returned in *store_name. The
+ * caller is responsible for freeing it.
  */
-static bool
-verify_peer_name_matches_certificate(PGconn *conn)
+static int
+verify_peer_name_matches_certificate_name(PGconn *conn, ASN1_STRING *name_entry,
+                                          char **store_name)
 {
-   char   

Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-09-08 Thread Erik Rijkers
On Mon, September 8, 2014 18:02, Alvaro Herrera wrote:
 Here's version 18.  I have renamed it: These are now BRIN indexes.


I get into a BadArgument after:


$ cat crash.sql

-- drop table if exists t_100_000_000 cascade;
   create table t_100_000_000 as select cast(i as integer) from
generate_series(1, 100000000) as f(i) ;

-- drop index if exists t_100_000_000_i_brin_idx;
   create index t_100_000_000_i_brin_idx on t_100_000_000 using 
brin(i); select
pg_size_pretty(pg_relation_size('t_100_000_000_i_brin_idx'));

select i from t_100_000_000 where i between 1 and 100; -- ( + 99 )


Log file says:

TRAP: BadArgument("!(((context) != ((void *) 0) && ((((const Node*)((context)))->type) == T_AllocSetContext)))", File:
"mcxt.c", Line: 752)
2014-09-08 19:54:46.071 CEST 30151 LOG:  server process (PID 30336) was 
terminated by signal 6: Aborted
2014-09-08 19:54:46.071 CEST 30151 DETAIL:  Failed process was running: select 
i from t_100_000_000 where i between 1
and 100;



The crash is caused by the last select statement; the table and index create 
are OK.

It only happens with a largish table; small tables are OK.



Linux / Centos / 32 GB.

 PostgreSQL 9.5devel_minmax_20140908_1809_0640c1bfc091 on 
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.9.1, 64-bit

 setting  |  current_setting
--+
 autovacuum   | off
 port | 6444
 shared_buffers   | 100MB
 effective_cache_size | 4GB
 work_mem | 10MB
 maintenance_work_mem | 1GB
 checkpoint_segments  | 20
 data_checksums   | on
 server_version   | 9.5devel_minmax_20140908_1809_0640c1bfc091
 pg_postmaster_start_time | 2014-09-08 19:53 (uptime: 0d 0h 6m 54s)

'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' 
'--with-pgport=6444'
'--bindir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/bin'
'--libdir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/lib' 
'--enable-depend' '--enable-cassert' '--enable-debug'
'--with-perl' '--with-openssl' '--with-libxml' 
'--with-extra-version=_minmax_20140908_1809_0640c1bfc091'


pgpatches/0095/minmax/20140908/minmax-18.patch


thanks,


Erik Rijkers






Re: [HACKERS] pgcrypto: PGP signatures

2014-09-08 Thread Marko Tiikkaja

On 2014-09-08 7:30 PM, Jeff Janes wrote:

On Fri, Sep 5, 2014 at 4:38 AM, Marko Tiikkaja ma...@joh.to wrote:

I've also changed the behaviour when passing a message with a signature to
the decrypt functions which don't verify signatures.  They now report
ERROR:  Wrong key or corrupt data instead of decrypting and silently
ignoring the signature.  The behaviour is now backwards compatible, but I
see two ways we could possibly possibly improve this:
   1) Produce a better error message (I'm sure most people don't know about
the hidden debug=1 setting)
   2) Provide an option to ignore the signature if decrypting the data is
desirable even if the signature can't be verified




If i understand the sequence here: The current git HEAD is that
pgp_pub_decrypt would throw an error if given a signed and encrypted
message, and earlier version of your patch changed that to decrypt the
message and ignore the signature, and the current version went back to
throwing an error.


You got that right, yes.


I think I prefer the middle of those behaviors.  The original behavior
seems like a bug to me, and I don't think we need to be backwards
compatible with bugs.  Why should a function called decrypt care if the
message is also signed?  That is not its job.


Yeah, that seems reasonable, I guess.  I'm kind of torn between the two 
behaviours to be honest.  But perhaps it would make sense to change the 
previous behaviour (i.e. go back to the way this patch was earlier) and 
document that somewhere.



There seems to be a memory leak in pgp_sym_decrypt_verify that does not
exist in pgp_sym_decrypt.  It is about 58 bytes per decryption.


Interesting.  Thanks!  I'll have a look.


.marko




Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-09-08 Thread Alvaro Herrera
Erik Rijkers wrote:

 Log file says:
 
 TRAP: BadArgument("!(((context) != ((void *) 0) && ((((const Node*)((context)))->type) == T_AllocSetContext)))", File:
 "mcxt.c", Line: 752)
 2014-09-08 19:54:46.071 CEST 30151 LOG:  server process (PID 30336) was 
 terminated by signal 6: Aborted
 2014-09-08 19:54:46.071 CEST 30151 DETAIL:  Failed process was running: 
 select i from t_100_000_000 where i between 1
 and 100;

A double-free mistake -- here's a patch.  Thanks.


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c89a167..6ac012c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -388,10 +388,7 @@ bringetbitmap(PG_FUNCTION_ARGS)
 			PointerGetDatum(key));
 	addrange = DatumGetBool(add);
 	if (!addrange)
-	{
-		brin_free_tuple(tup);
 		break;
-	}
 }
 			}
 



Re: [HACKERS] Scaling shared buffer eviction

2014-09-08 Thread Robert Haas
On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 Apart from above, I think for this patch, cat version bump is required
 as I have modified system catalog.  However I have not done the
 same in patch as otherwise it will be bit difficult to take performance
 data.

 One regression failed on linux due to spacing issue which is
 fixed in attached patch.

I took another read through this patch.  Here are some further review comments:

1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
both num_to_free and tmp_num_to_free.  I'd get rid of tmp_num_to_free
and move the declaration of num_to_free inside the outer loop.  I'd
also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
tmp_recent_backend_clocksweep into the innermost scope in which they
are used.

2. Also in that function, I think the innermost bit of logic could be
rewritten more compactly, and in such a way as to make it clearer for
what set of instructions the buffer header will be locked:

    LockBufHdr(bufHdr);
    if (bufHdr->refcount == 0)
    {
        if (bufHdr->usage_count > 0)
            bufHdr->usage_count--;
        else
            add_to_freelist = true;
    }
    UnlockBufHdr(bufHdr);
    if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
        num_to_free--;

3. This comment is now obsolete:

+   /*
+* If bgwriterLatch is set, we need to waken the bgwriter, but we should
+* not do so while holding freelist_lck; so set it after releasing the
+* freelist_lck.  This is annoyingly tedious, but it happens
at most once
+* per bgwriter cycle, so the performance hit is minimal.
+*/
+

We're not actually holding any lock in need of releasing at that point
in the code, so this can be shortened to "If bgwriterLatch is set, we
need to waken the bgwriter."

 * Ideally numFreeListBuffers should get called under freelist spinlock,

That doesn't make any sense.  numFreeListBuffers is a variable, so you
can't call it.  The value should be *read* under the spinlock, but
it is.  I think this whole comment can be deleted and replaced with
"If the number of free buffers has fallen below the low water mark,
awaken the bgreclaimer to repopulate it."

4. StrategySyncStartAndEnd() is kind of a mess.  One, it can return
the same victim buffer that's being handed out - at almost the same
time - to a backend running the clock sweep; if it does, they'll fight
over the buffer.  Two, the *end out parameter actually returns a
count, not an endpoint.  I think we should have
BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
top of the inner loop rather than the bottom, and change
StrategySyncStartAndEnd() so that it knows nothing about
victimbuf_lck.  Let's also change StrategyGetBuffer() to call
StrategySyncNextVictimBuffer() so that the logic is centralized in one
place, and rename StrategySyncStartAndEnd() to something that better
matches its revised purpose.  Maybe StrategyGetReclaimInfo().

5. Have you tested that this new bgwriter statistic is actually
working?  Because it looks to me like BgMoveBuffersToFreelist is
changing BgWriterStats but never calling pgstat_send_bgwriter(), which
I'm thinking will mean the counters accumulate forever inside the
reclaimer but never get forwarded to the stats collector.

6. StrategyInitialize() uses #defines for
HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
for clamping.  Let's have constants for all of those (and omit mention
of the specific values in the comments).

7. The line you've added to the definition of view pg_stat_bgwriter
doesn't seem to be indented the same way as all the others.  Tab vs.
space problem?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-08 Thread Robert Haas
On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra t...@fuzzy.cz wrote:
 as Heikki mentioned in his commitfest status message, this patch still
 has no reviewers :-( Is there anyone willing to pick up that, together
 with the 'dense allocation' patch (as those two are closely related)?

 I'm looking in Robert's direction, as he's the one who came up with the
 dense allocation idea, and also commented on the hasjoin bucket resizing
 patch ...

 I'd ask Pavel Stehule, but as he's sitting next to me in the office he's
 not really independent ;-) I'll do some reviews on the other patches
 over the weekend, to balance this of course.

Will any of those patches be, heh heh, mine?

I am a bit confused by the relationship between the two patches you
posted.  The combined patch applies, but the other one does not.  If
this is a sequence of two patches, it seems like it would be more
helpful to post A and B rather than B and A+B.  Perhaps the other
patch is on a different thread but there's not an obvious reference to
said thread here.

In ExecChooseHashTableSize(), if buckets_bytes is independent of
nbatch, could we just compute it before working out nbatch, and just
deduct it from hash_table_bytes?  That seems like it'd do the same
thing for less work.  Either way, what happens if buckets_bytes is
already bigger than hash_table_bytes?

A few more stylistic issues that I see:

+   if ((hashtable->nbatch == 1) &&
+   if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+   if (size > (HASH_CHUNK_SIZE/8))

While I'm all in favor of using parentheses generously where it's
needed to add clarity, these and similar usages seem over the top to
me.

On a related note, the first of these reads like this if (stuff) { if
(more stuff) { do things } } which makes one wonder why we need two if
statements.

+
+   /* used for dense allocation of tuples (into linked chunks) */
+   HashChunk   chunks_batch;   /*  one list for the whole batch */
+
 }  HashJoinTableData;

If the original coding didn't have a blank line between the last
structure member and the closing brace, it's probably not right to
make it that way now.  There are similar issues at the end of some
functions you modified, and a few other places (like the new code in
ExecChooseHashTableSize and chunk_alloc) where there are extra blank
lines at the starts and ends of blocks.

+char * chunk_alloc(HashJoinTable hashtable, int size) {

Eh... no.

+   /* XXX maybe we should use MAXALIGN(size) here ... */

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-08 Thread Robert Haas
On Mon, Sep 8, 2014 at 1:09 AM, Amit Kapila amit.kapil...@gmail.com wrote:
  Don't you need a PG_TRY block here to reset pq_mq_busy?

 No.  If shm_mq_sendv is interrupted, we can't use the shm_mq any more.
 But since that should only happen if an interrupt arrives while the
 queue is full, I think that's OK.

 I think here not only on interrupt, but any other error in this
 function shm_mq_sendv() path (one example is WaitLatch())
 could lead to same behaviour.

Probably true.  But I think we generally count on that to be no-fail,
or close to it, so I'm not really worried about it.

 (Think about the alternatives: if
 the queue is full, we have no way of notifying the launching process
 without waiting for it to retrieve the results, but it might not do
 that right away, and if we've been killed we need to die *now* not
 later.)

 So in such cases what is the advise to users, currently they will
 see the below message:
 postgres=# select * from pg_background_result(5124) as (x int);
 ERROR:  lost connection to worker process with PID 5124

 One way is to ask them to check logs, but what about if they want
 to handle error and take some action?

They have to check the logs.  To reiterate what I said before, there
is no reasonable way to both have the background worker terminate
quickly and also guarantee that the full error message is received by
the process that started it.  You have to pick one, and I stick by the
one I picked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-08 Thread Robert Haas
On Mon, Sep 8, 2014 at 2:05 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Another point about error handling is that to execute the sql in
 function pg_background_worker_main(), it starts the transaction
 which I think doesn't get aborted if error occurs

 For this I was just referring error handling code of
 StartBackgroundWorker(), however during shutdown of process, it
 will call AbortOutOfAnyTransaction() which will take care of aborting
 transaction.

Yeah.

 and similarly handling
 for timeout seems to be missing in error path.

 As we are anyway going to exit on error, so not sure, if this will be
 required, however having it for clean exit seems to be better.

Can you be more specific?

I'm generally of the view that there's little point in spending time
cleaning things up that will go away anyway when the process exits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ALTER TABLESPACE MOVE command tag tweak

2014-09-08 Thread Alvaro Herrera
Stephen Frost wrote:
 Alvaro,
 
 * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
  ALTER TABLE ALL IN TABLESPACE xyz
  which AFAICS should work since ALL is already a reserved keyword.
 
 Pushed to master and REL9_4_STABLE.

Thanks.  One more tweak --- the whole reason for fiddling with this is
to ensure that event triggers support this operation.  Therefore this
node should be handled by ProcessUtilitySlow, not
standard_ProcessUtility, as in the attached patch.

(I checked the documentation for necessary updates; turns out that the
table in the event triggers chapter says that ddl_command_end etc
support the command ALTER TABLE, and since this is the tag returned by
the new ALTER TABLE ALL IN TABLESPACE command, there is no update
needed.  In fact, one can argue that the table is wrong currently
because it doesn't say that ALTER TABLE ALL IN TABLESPACE is not
supported.)

I propose this for 9.4 too.

 Apologies on it taking so long -
 things have been a bit interesting for me over the past month or two. :)

I bet they have!  Have fun,

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 40ac47f..e2c2d3d 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -507,10 +507,6 @@ standard_ProcessUtility(Node *parsetree,
 			AlterTableSpaceOptions((AlterTableSpaceOptionsStmt *) parsetree);
 			break;
 
-		case T_AlterTableMoveAllStmt:
-			AlterTableMoveAll((AlterTableMoveAllStmt *) parsetree);
-			break;
-
 		case T_TruncateStmt:
 			ExecuteTruncate((TruncateStmt *) parsetree);
 			break;
@@ -1296,6 +1292,10 @@ ProcessUtilitySlow(Node *parsetree,
 AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
 break;
 
+			case T_AlterTableMoveAllStmt:
+AlterTableMoveAll((AlterTableMoveAllStmt *) parsetree);
+break;
+
 			case T_DropStmt:
 ExecDropStmt((DropStmt *) parsetree, isTopLevel);
 break;



Re: [HACKERS] proposal: plpgsql - Assert statement

2014-09-08 Thread Robert Haas
On Fri, Sep 5, 2014 at 2:16 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Assert is usually implemented as a custom function and used via a PERFORM
 statement now:

 -- usual current solution
 PERFORM Assert(some expression)

 I would like to implement Assert as a plpgsql internal statement, due to the
 bigger possibilities for designing the syntax and internal implementation now
 and in the future. Moreover, as a plpgsql statement it should be compatible
 with any current code, because there should not be a collision between the
 SQL and PLpgSQL namespaces. So this design doesn't break any current code.

 I propose following syntax with following ecosystem:

 ASSERT [ NOTICE, WARNING, EXCEPTION ]
   [ string expression or literal - explicit message ]
   [ USING clause - same as RAISE stmt (possible in future ) ]
   ( ROW_COUNT ( = | <> ) ( 1 | 0 ) |
   ( QUERY some query should not be empty ) |
   ( CHECK some expression should be true )
   ( IS NOT NULL expression should not be null )

 Every variant (ROW_COUNT, QUERY, CHECK, IS NOT NULL) has own default message

That's probably not the ugliest syntax I've *ever* seen, but it's
definitely the ugliest syntax I've seen today.

I previously proposed RAISE ASSERT ... WHERE, which seems like a nice
variant of what we've already got, but perhaps this whole discussion
merely illustrates that it's hard to get more than 1 vote for any
proposal in this area.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ALTER TABLESPACE MOVE command tag tweak

2014-09-08 Thread Stephen Frost
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Stephen Frost wrote:
  * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
   ALTER TABLE ALL IN TABLESPACE xyz
   which AFAICS should work since ALL is already a reserved keyword.
  
  Pushed to master and REL9_4_STABLE.
 
 Thanks.  One more tweak --- the whole reason for fiddling with this is
 to ensure that event triggers support this operation.  Therefore this
 node should be handled by ProcessUtilitySlow, not
 standard_ProcessUtility, as in the attached patch.

Ah, right, of course.  I recall looking for what else might need to be
changed and apparently missed that distinction in the call sites.

 (I checked the documentation for necessary updates; turns out that the
 table in the event triggers chapter says that ddl_command_end etc
 support the command ALTER TABLE, and since this is the tag returned by
 the new ALTER TABLE ALL IN TABLESPACE command, there is no update
 needed.  In fact, one can argue that the table is wrong currently
 because it doesn't say that ALTER TABLE ALL IN TABLESPACE is not
 supported.)

Heh, yes, true.

 I propose this for 9.4 too.

Agreed.  Looks pretty straight-forward, will update soon.

  Apologies on it taking so long -
  things have been a bit interesting for me over the past month or two. :)
 
 I bet they have!  Have fun,

Thanks! :)

Stephen




Re: [HACKERS] Scaling shared buffer eviction

2014-09-08 Thread Thom Brown
On 5 September 2014 14:19, Amit Kapila amit.kapil...@gmail.com wrote:

 On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:
 
  Apart from above, I think for this patch, cat version bump is required
  as I have modified system catalog.  However I have not done the
  same in patch as otherwise it will be bit difficult to take performance
  data.

 One regression failed on linux due to spacing issue which is
 fixed in attached patch.


Here's a set of test results against this patch:

8-core AMD FX-9590, 32GB RAM

Config:
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB

pgbench scale factor = 3000
run duration time = 5min

HEAD
Client Count/No. Of Runs (tps)      8      16      32      64     128
Run-1                           31568   41890   48638   49266   41845
Run-2                           31613   41879   48597   49332   41736
Run-3                           31674   41987   48647   49320   41745


Patch=scalable_buffer_eviction_v7.patch
Client Count/No. Of Runs (tps)      8      16      32      64     128
Run-1                           31880   42943   49359   49901   43779
Run-2                           32150   43078   48934   49828   43769
Run-3                           32323   42894   49481   49600   43529

-- 
Thom


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-08 Thread Tomas Vondra
On 8.9.2014 22:44, Robert Haas wrote:
 On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra t...@fuzzy.cz wrote:
 as Heikki mentioned in his commitfest status message, this patch
 still has no reviewers :-( Is there anyone willing to pick up that,
 together with the 'dense allocation' patch (as those two are
 closely related)?

 I'm looking in Robert's direction, as he's the one who came up with
 the dense allocation idea, and also commented on the hasjoin bucket
 resizing patch ...

 I'd ask Pavel Stehule, but as he's sitting next to me in the office
 he's not really independent ;-) I'll do some reviews on the other
 patches over the weekend, to balance this of course.
 
  Will any of those patches be, heh heh, mine?

I'll exchange some of the credit for +1. Deal? ;-)

 I am a bit confused by the relationship between the two patches you 
 posted. The combined patch applies, but the other one does not. If 
 this is a sequence of two patches, it seems like it would be more 
 helpful to post A and B rather than B and A+B. Perhaps the other 
 patch is on a different thread but there's not an obvious reference
 to said thread here.

Yeah, it's probably a bit confusing. The thing is the dense allocation
idea was discussed in a different thread, so I posted the patch there:

  http://www.postgresql.org/message-id/53cbeb8a@fuzzy.cz

The patch tweaking hash join buckets builds on the dense allocation,
because it (a) makes up for the memory used for more buckets and (b)
it's actually easier to rebuild the buckets this way.

So I only posted the separate patch for those who want to do a review,
and then a complete patch with both parts combined. But it sure may be
a bit confusing.


 In ExecChooseHashTableSize(), if buckets_bytes is independent of 
  nbatch, could we just compute it before working out nbatch, and just 
 deduct it from hash_table_bytes? That seems like it'd do the same 
 thing for less work.

Well, we can do that. But I really don't think that's going to make
measurable difference. And I think the code is more clear this way.

 Either way, what happens if buckets_bytes is already bigger than 
 hash_table_bytes?

Hm, I don't see how that could happen.

FWIW, I think the first buckets_bytes formula is actually wrong:

   buckets_bytes
 = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);

and should be

   buckets_bytes =
 sizeof(HashJoinTuple) * (1 << my_log2(ntuples / NTUP_PER_BUCKET));

instead. Also, we might consider that we never use less than 1024
buckets (but that's minor issue, I guess).


But once we get into the need batching branch, we do this:

   lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET
+ sizeof(HashJoinTuple)));

which includes both the bucket (pointer) and tuple size, and IMHO
guarantees that bucket_bytes will never be over work_mem (which is what
hash_table_bytes is).

The only case where this might happen is (tupsize < 8), thanks to the
my_log2 (getting (50% work_mem + epsilon), doubled to 100% work_mem).

But tupsize is defined as this:

tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(sizeof(MinimalTupleData)) +
MAXALIGN(tupwidth);

and HJTUPLE_OVERHEAD alone is 12B, MinimalTupleData is 11B (ignoring
alignment) etc.

So I believe this is safe ...


 A few more stylistic issues that I see:
 
  +   if ((hashtable->nbatch == 1) &&
  +   if (hashtable->nbuckets_optimal <= (INT_MAX/2))
  +   if (size > (HASH_CHUNK_SIZE/8))
 
 While I'm all in favor of using parentheses generously where it's
 needed to add clarity, these and similar usages seem over the top to
 me.

It seemed more readable for me at the time of coding it, and it still
seems better this way, but ...

https://www.youtube.com/watch?v=CxK_nA2iVXw

 On a related note, the first of these reads like this if (stuff) { if
 (more stuff) { do things } } which makes one wonder why we need two if
 statements.

We probably don't ...

 
 +
 +   /* used for dense allocation of tuples (into linked chunks) */
 +   HashChunk   chunks_batch;   /*  one list for the whole batch */
 +
  }  HashJoinTableData;
 
 If the original coding didn't have a blank line between the last
 structure member and the closing brace, it's probably not right to
 make it that way now.  There are similar issues at the end of some
 functions you modified, and a few other places (like the new code in
 ExecChooseHashTableSize and chunk_alloc) where there are extra blank
 lines at the starts and ends of blocks.

I'll fix that. FWIW, those issues seem to be in the 'dense allocation'
patch.

 
 +char * chunk_alloc(HashJoinTable hashtable, int size) {
 
 Eh... no.
 
 +   /* XXX maybe we should use MAXALIGN(size) here ... */
 
 +1.

I'm not sure what the 'no' is pointing at. Maybe that the parenthesis
should be on the next line? And is the +1 about doing MAXALIGN?
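
(Presumably the "no" is about the usual PostgreSQL layout for function
definitions -- return type and opening brace each on their own line -- i.e.
roughly:

char *
chunk_alloc(HashJoinTable hashtable, int size)
{
	/* allocation logic here */
	return NULL;
}

rather than everything on one line.)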

regards
Tomas



Re: [HACKERS] PQputCopyEnd doesn't adhere to its API contract

2014-09-08 Thread David G Johnston
On Mon, Sep 8, 2014 at 11:45 AM, Robert Haas [via PostgreSQL] 
ml-node+s1045698n5818200...@n5.nabble.com wrote:

 On Thu, Sep 4, 2014 at 6:38 PM, David Johnston [hidden email] wrote:

  One of the trade-offs I mentioned...its more style than anything but
  removing the parenthetical (if there is not error in the command) and
  writing it more directly seemed preferable in an overview such as this.
 
  Better:  The function will either throw an error or return a PGresult
  object[...]

 Nope.  This is not C++, nor is it the backend.  It will not throw
 anything.


The current sentence reads:
The response to this (if there is no error in the command) will be a
PGresult object bearing a status code of PGRES_COPY_OUT or PGRES_COPY_IN
(depending on the specified copy direction).

Maybe this is taken for granted by others, and thus can be excluded here,
but I'm trying to specify what happens if there is an error in the
command...  Apparently one does not get back a PGresult object...

Would simply saying: A successful response to this will be a PGresult
object... be accurate (enough)?


  I appreciate the time you have taken on this and will look at my
 thoughts
  with the new understanding you have given me.
 
  Thank You!

 Thanks for getting involved.  I apologize if I was brusque here or at
 other times in the past (or future).  Unfortunately I've been insanely
 busy and it doesn't bring out the best in me, but I really do value
 your (and everyone's efforts) to move PostgreSQL forward.


The tone of all your responses, including the first one pointing out my
lack of supporting context and initially over-ambitious patch, has been
very professional.  A lot of what you've said resonates since I think
subconsciously I understood that I needed to make fundamental changes to my
approach and style, but it definitely helps to have someone point out
specific items and concerns to move those thoughts to the conscious part of
the mind.  Thank you again for doing that for me.

Ignoring style and such, did anything I wrote help to clarify your
understanding of why PQputCopyEnd does what it does?  As I say this and
start to try and read the C code, I think that it is a good source for
understanding novice assumptions, but there is a gap between the docs and
the code - though I'm not sure you've identified the only (or even proper)
location.



Unlike PQputCopyData, there is no explicit logic (pqIsnonblocking(conn)) in
PQputCopyEnd for non-blocking mode (there is an implicit one via pqFlush).
 This does seem like an oversight - if a minor one, since the likelihood of
not being able to add the EOF to the connection's buffer seems low - but I
would expect, on the basis of symmetry alone, for both of
them to have buffer-filled testing logic.  And, depending on how large
*errormsg is, the risk of being unable to place the data in the buffer
isn't as small as in the expected EOF case.

I'm getting way beyond my knowledge level here but my assumption from the
documentation was that the async mode behavior of returning zero revolved
around retrying in order to place the data into the buffer - not retrying
to make sure the buffer is empty.  The caller deals with that on their own
based upon their blocking mode decision.  Though we would want to call
pqFlush for blocking mode callers here since this function should block
until the buffer is clear.

So, I thought I agreed with your suggestion that if the final pqFlush
returns a 1 then pqPutCopyEnd should return 0.  Additionally, if in
non-blocking mode, and especially if *errormsg is not NULL, the available
connection buffer length logic in pqPutCopyData should be evaluated here as
well.

However, the 0 being returned due to the pqFlush really isn't anything
special for a non-blocking mode user (they have already been told to call
pqFlush themselves after calling pqPutCopyEnd) and doesn't apply for
blocking mode.  Admittedly they could skip their call if they get a return
value of 1 from pqPutCopyEnd but I'm not sure recommending that
optimization has much going for it.  And, again as you said (I am just
discovering this for myself as much as possible), if the pqFlush caused the
0 you would not want to retry whereas in the filled buffer version you
would.  So we do have to pick one or the other situation and adjust the
documentation accordingly.

The most useful and compatible solution is to make pqPutCopyEnd synchronous
regardless of the user selected blocking mode - which it mostly is now but
the final pqFlush should be in a loop - and adjust the documentation in the
areas noted in my patch-email to accommodate that fact.
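
To make the caller-side contract being discussed concrete, here is a sketch of
finishing a COPY on a non-blocking connection using only documented libpq
entry points (the empty loops stand in for real select()/poll() waits on the
socket; error handling is abbreviated):

#include <libpq-fe.h>

/* Sketch only: queue end-of-copy, flush, then collect the result. */
static int
finish_copy_nonblocking(PGconn *conn)
{
	int			rc;
	int			flush;
	PGresult   *res;

	/* 1 = queued, 0 = buffer full (retry later), -1 = error */
	while ((rc = PQputCopyEnd(conn, NULL)) == 0)
		;						/* wait for socket writability in real code */
	if (rc < 0)
		return -1;

	/* Non-blocking callers must then flush until PQflush() returns 0. */
	while ((flush = PQflush(conn)) == 1)
		;						/* wait for socket writability in real code */
	if (flush < 0)
		return -1;

	/* Finally, collect the COPY command's result(s). */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
		{
			PQclear(res);
			return -1;
		}
		PQclear(res);
	}
	return 0;
}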

Regardless, the logic surrounding placing the *errormsg string into the
buffer needs affirmation and a note whether it will block waiting to be put
on a full connection buffer.

Note that non-blocking users seeing a 1 on the pqPutCopyEnd probably still
end up doing their own pqFlush looping calls but 

Re: [HACKERS] PQputCopyEnd doesn't adhere to its API contract

2014-09-08 Thread David G Johnston
On Mon, Sep 8, 2014 at 6:19 PM, David Johnston david.g.johns...@gmail.com
wrote:

 On Mon, Sep 8, 2014 at 11:45 AM, Robert Haas [via PostgreSQL] 
 ml-node+s1045698n5818200...@n5.nabble.com wrote:

 On Thu, Sep 4, 2014 at 6:38 PM, David Johnston [hidden email] wrote:

  One of the trade-offs I mentioned...its more style than anything but
  removing the parenthetical (if there is not error in the command) and
  writing it more directly seemed preferable in an overview such as this.
 
  Better:  The function will either throw an error or return a PGresult
  object[...]

 Nope.  This is not C++, nor is it the backend.  It will not throw
 anything.


 The current sentence reads:
 The response to this (if there is no error in the command) will be a
 PGresult object bearing a status code of PGRES_COPY_OUT or PGRES_COPY_IN
 (depending on the specified copy direction).

 Maybe this is taken for granted by others, and thus can be excluded here,
 but I'm trying to specify what happens if there is an error in the
 command...  Apparently one does not get back a PGresult object...

 Would simply saying: A successful response to this will be a PGresult
 object... be accurate (enough)?


Apparently, NULL is the correct answer.  Cannot that just be assumed to be
the case, or does saying that a failure scenario here returns NULL (or
something other than a PGresult object) impart useful knowledge?

Dave





Re: [HACKERS] Scaling shared buffer eviction

2014-09-08 Thread Amit Kapila
On Mon, Sep 8, 2014 at 10:12 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  Client Count/Patch_Ver (tps)      8      16      32      64     128
  HEAD                          58614  107370  140717  104357   65010
  Patch                         60092  113564  165014  213848  216065
 
  This data is median of 3 runs, detailed report is attached with mail.
  I have not repeated the test for all configurations, as there is no
  major change in design/algorithm which can effect performance.
  Mark has already taken tpc-b data which ensures that there is
  no problem with it, however I will also take it once with latest
version.

 Well, these numbers are pretty much amazing.  Question: It seems
 there's obviously quite a bit of contention left;  do you think
 there's still a significant amount of time in the clock sweep, or is
 the primary bottleneck the buffer mapping locks?

I think contention around the clock sweep is very minimal, and for the buffer
mapping locks it has reduced significantly (you can refer upthread to
some of the LWLOCK stat data I posted after this patch), but
maybe we can get more out of it by improving hash table
concurrency.  However, apart from both of the above, the next thing
I have seen in profiles was palloc (at least that is what I remember
from some profiling I did a few months back during the
development of this patch).  At that time it seemed to me a totally
different optimisation, so I left it for another patch.
Another point is that the machine on which I am doing the performance
test has 16 cores (64 hardware threads), so I am not sure how much
scaling we can expect.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] proposal: plpgsql - Assert statement

2014-09-08 Thread Craig Ringer
On 09/05/2014 05:21 PM, Pavel Stehule wrote:
 
 *shrug*  Doing it in SQL would probably break more stuff.  I'm trying to
 contain the damage.  And arguably, this is mostly only useful in PL/PgSQL.

I've wanted assertions in SQL enough that I often write trivial wrappers
around `raise` in PL/PgSQL for use in `CASE` statements etc.
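
A minimal sketch of that kind of wrapper (the function name, message, and the
table/column in the usage example are illustrative only):

-- plain assert-style wrapper usable from SQL expressions
CREATE OR REPLACE FUNCTION assert_true(p_cond boolean,
                                       p_msg text DEFAULT 'assertion failed')
RETURNS boolean
LANGUAGE plpgsql IMMUTABLE AS
$$
BEGIN
    IF p_cond IS NOT TRUE THEN
        RAISE EXCEPTION '%', p_msg;
    END IF;
    RETURN true;
END;
$$;

-- e.g. inside a CASE expression:
SELECT CASE WHEN assert_true(x IS NOT NULL, 'x is null') THEN x END
FROM t;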

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: plpgsql - Assert statement

2014-09-08 Thread Craig Ringer
On 09/09/2014 05:20 AM, Robert Haas wrote:
 
 I previously proposed RAISE ASSERT ... WHERE, which seems like a nice
 variant of what we've already got, but perhaps this whole discussion
 merely illustrates that it's hard to get more than 1 vote for any
 proposal in this area.

Well, you have Petr's for RAISE EXCEPTION ... WHEN, and I'd also like
that or RAISE ASSERT ... WHEN.

Much (much) saner than the other proposals on this thread IMO.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

