Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-12 Thread Michael Paquier
On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 8 January 2014 08:33, Simon Riggs si...@2ndquadrant.com wrote:

 Patch attached, implemented to reduce writes by SELECTs only.

This patch is registered in this CF. It does not apply anymore and
needs a rebase. Robert and Amit have also provided some comments,
but they have not been addressed. Is it fair to mark it as returned
with feedback even though it has not been reviewed within the last month?
-- 
Michael




Re: [HACKERS] proposal (9.5) : psql unicode border line styles

2014-09-12 Thread Pavel Stehule
2014-09-12 5:14 GMT+02:00 Stephen Frost sfr...@snowman.net:

 Pavel,

 * Pavel Stehule (pavel.steh...@gmail.com) wrote:
  Any idea, tip how to it?

 Attached is what I had been thinking.

 Thoughts?


yes, it is better. I didn't understand before.

I thought about it because it allows designing a multiline end style in the future.

I am not a fan of ↵ and … in the resulting tables, and it would be nice if I
could optionally change them.

With your change we can design another pure style in contrast to the current
rich style. But I don't have an idea how to name it.

Regards

Pavel



 Thanks!

 Stephen



Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Amit Kapila
On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
  +++ b/src/backend/postmaster/bgreclaimer.c

 A fair number of comments in that file refer to bgwriter...

Will fix.

  @@ -0,0 +1,302 @@
  +/*-------------------------------------------------------------------------
  + *
  + * bgreclaimer.c
  + *
  + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
  + * attempts to keep regular backends from having to run clock sweep (which
  + * they would only do when they don't find a usable shared buffer from
  + * freelist to read in another page).

 That's not really accurate. Freelist pages are often also needed to
 write new pages, without reading anything in.

Agreed, but the same is used in bgwriter file as well; so if we
change here, we might want to change bgwriter file header as well.

 I'd phrase it as which
 they only need to do if they don't find a victim buffer from the
 freelist

victim buffer sounds more like a buffer which it will get from
clock sweep; how about next candidate (the same term is used in the
function header of StrategyGetBuffer())?

   In the best scenario all requests
  + * for shared buffers will be fulfilled from freelist as the background
  + * reclaimer process always tries to maintain buffers on freelist.  However,
  + * regular backends are still empowered to run clock sweep to find a usable
  + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.

 empowered sounds strange to me. 'still can run the clock sweep'?

No harm in changing it as you suggest; however, the same wording is used in
the file header of bgwriter.c as well, so I think let's keep this usage as it
is, since there is no correctness issue here.

  + * The bgwriter is started by the postmaster as soon as the startup subprocess
  + * finishes, or as soon as recovery begins if we are doing archive recovery.

 Why only archive recovery? I guess (only read this far...) it's not just
 during InArchiveRecovery recovery but also StandbyMode?

It will be for both.

 But I don't see
 why we shouldn't use it during normal crash recovery. That's also often
 painfully slow and the reclaimer could help? Less, but still.

Yes, it might improve things a bit; however, the main benefit of this patch is
under heavy load, which means that it mainly addresses contention, and in the
crash recovery case there will not be any contention because there are no
backend processes.

Also, I think that to enable bgreclaimer during crash recovery we need one new
signal, which is not a problem, but it will make more sense to enable it later
based on the benefit, if there is any.


  +static void bgreclaim_quickdie(SIGNAL_ARGS);
  +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
  +static void ReqShutdownHandler(SIGNAL_ARGS);
  +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

 This looks inconsistent.

I have kept it based on bgwriter, so I'm not sure if it's good to change.
However, if we want consistent naming, I would like to use
something like:

ReclaimShutdownHandler
ReclaimQuickDieHandler
..
..

  + /*
  +  * If an exception is encountered, processing resumes here.
  +  *
  +  * See notes in postgres.c about the design of this coding.
  +  */
  + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
  + {
  + /* Since not using PG_TRY, must reset error stack by hand */
  + error_context_stack = NULL;
..

 No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
 good idea, regardless of it possibly being true today (which I'm not
 sure about yet).

I will add LWLockReleaseAll() in exception handling as discussed
elsewhere in thread.


  +
  + /* Now we can allow interrupts again */
  + RESUME_INTERRUPTS();

 Other processes sleep for a second here, I think that's a good
 idea. E.g. that bit:

Agreed, will make change as per suggestion.


  + /*
  +  * Loop forever
  +  */
  + for (;;)
  + {
  + int rc;
  +
  +
  + /*
  +  * Backend will signal bgreclaimer when the number of buffers in
  +  * freelist falls below the low water mark of freelist.
  +  */
  + rc = WaitLatch(&MyProc->procLatch,
  +                WL_LATCH_SET | WL_POSTMASTER_DEATH,
  +                -1);

 That's probably not going to work well directly after a (re)start of
 bgreclaim (depending on how you handle the water mark, I'll see in a
 bit).

Could you please be more specific here?


  +Background Reclaimer's Processing
  +---------------------------------
..
  +Two water mark indicators are used to maintain sufficient number of buffers
  +on freelist.  Low water mark indicator is used by backends to wake bgreclaimer
  +when the number of buffers in freelist falls below it.  High water mark
  +indicator is used by bgreclaimer to move 
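
In short, the scheme the quoted README describes boils down to a reclaimer loop
of roughly this shape (a rough sketch only; the helper names
num_freelist_buffers(), clock_sweep_victim(), freelist_push() and the
water-mark constant are placeholders, not identifiers from the actual patch):

    for (;;)
    {
        int rc = WaitLatch(&MyProc->procLatch,
                           WL_LATCH_SET | WL_POSTMASTER_DEATH,
                           -1);

        if (rc & WL_POSTMASTER_DEATH)
            exit(1);                        /* postmaster died, give up */
        ResetLatch(&MyProc->procLatch);

        /*
         * A backend set our latch because the freelist dropped below the
         * low water mark; refill it up to the high water mark.
         */
        while (num_freelist_buffers() < FREELIST_HIGH_WATER_MARK)
        {
            Buffer  buf = clock_sweep_victim();   /* unpinned, usage_count 0 */

            if (!BufferIsValid(buf))
                break;
            freelist_push(buf);
        }
    }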

Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-12 Thread Michael Paquier
On Fri, Sep 12, 2014 at 3:19 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 8 January 2014 08:33, Simon Riggs si...@2ndquadrant.com wrote:

 Patch attached, implemented to reduce writes by SELECTs only.

 This patch is registered in this CF. It does not apply anymore and
 needs a rebase. Robert and Amit have provided as well some comments
 but they have not been addressed. Is it fair to mark it as returned
 with feedback even if it has not been reviewed within the last month?
For the time being, status has been changed to waiting on author.
-- 
Michael




Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-09-12 Thread Albe Laurenz
Tom Lane wrote:
 Stephen Frost sfr...@snowman.net writes:
 I have to admit that, while I applaud the effort made to have this
 change only be to postgres_fdw, I'm not sure that having the
 update/delete happening during the Scan phase and then essentially
 no-op'ing the ExecForeignUpdate/ExecForeignDelete is entirely in-line
 with the defined API.
 
 Yeah, I've started looking at this patch and that seemed like not
 necessarily such a wise choice.  I think it'd be better if the generated
 plan tree had a different structure in this case.  The existing approach
 is an impressive hack but it's hard to call it anything but a hack.

I guess that the idea is inspired by this comment on postgres_fdw.c:

 * Note: currently, the plan tree generated for UPDATE/DELETE will always
 * include a ForeignScan that retrieves ctids (using SELECT FOR UPDATE)
 * and then the ModifyTable node will have to execute individual remote
 * UPDATE/DELETE commands.  If there are no local conditions or joins
 * needed, it'd be better to let the scan node do UPDATE/DELETE RETURNING
 * and then do nothing at ModifyTable.  Room for future optimization ...

 I'm not sure offhand what the new plan tree ought to look like.  We could
 just generate a ForeignScan node, but that seems like rather a misnomer.
 Is it worth inventing a new ForeignUpdate node type?  Or maybe it'd still
 be ForeignScan but with a flag field saying hey this is really an update
 (or a delete).  The main benefit I can think of right now is that the
 EXPLAIN output would be less strange-looking --- but EXPLAIN is hardly
 the only thing that ever looks at plan trees, so having an outright
 misleading plan structure is likely to burn us down the line.

I can understand these qualms.
I wonder if ForeignUpdate is such a good name though, since it would
surprise the uninitiated that in the regular (no push-down) case the
actual modification is *not* performed by ForeignUpdate.
So it should rather be a ForeignModifyingScan, but I personally would
prefer a has_side_effects flag on ForeignScan.

 One advantage of getting the core code involved in the decision about
 whether an update/delete can be pushed down is that FDW-independent checks
 like whether there are relevant triggers could be implemented in the core
 code, rather than having to duplicate them (and later maintain them) in
 every FDW that wants to do this.  OTOH, maybe the trigger issue is really
 the only thing that could be shared, not sure.  Stuff like is there a
 LIMIT probably has to be in the FDW since some FDWs could support pushing
 down LIMIT and others not.

You are right, the gain would probably be limited.

Yours,
Laurenz Albe



Re: [HACKERS] inherit support for foreign tables

2014-09-12 Thread Etsuro Fujita

(2014/09/11 20:51), Heikki Linnakangas wrote:

On 09/11/2014 02:30 PM, Etsuro Fujita wrote:

So, should I split the patch into the two?


Yeah, please do.


OK, Will do.

Thanks,

Best regards,
Etsuro Fujita




Re: [HACKERS] Incorrect initialization of sentPtr in walsender.c

2014-09-12 Thread Heikki Linnakangas

On 09/12/2014 03:17 AM, Michael Paquier wrote:

On Fri, Sep 12, 2014 at 9:08 AM, Michael Paquier
michael.paqu...@gmail.com wrote:

In walsender.c, sentPtr is initialized as follows:
static XLogRecPtr sentPtr = 0;
Isn't that incorrect and shouldn't we use InvalidXLogRecPtr instead?

Actually by looking more around I found a couple of extra places where
the same inconsistencies are present, mainly in xlog.c and
walreceiver.c. Updated patch attached for all those things.


InvalidXLogRecPtr == 0, so it's just a style issue which one is more 
correct.


I haven't looked at those places closely, but it seems possible that at 
least some of those variables are supposed to be initialized to a value 
smaller than any valid WAL position, rather than just Invalid. In other 
words, if we defined InvalidXLogRecPtr as INT64_MAX rather than 0, we 
would still want those variables to be initialized to zero. As I said, I 
didn't check the code, but before we change those that ought to be checked.


- Heikki





Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-09-12 Thread Etsuro Fujita

(2014/09/12 16:03), Albe Laurenz wrote:

Tom Lane wrote:

Stephen Frost sfr...@snowman.net writes:

I have to admit that, while I applaud the effort made to have this
change only be to postgres_fdw, I'm not sure that having the
update/delete happening during the Scan phase and then essentially
no-op'ing the ExecForeignUpdate/ExecForeignDelete is entirely in-line
with the defined API.


Yeah, I've started looking at this patch and that seemed like not
necessarily such a wise choice.  I think it'd be better if the generated
plan tree had a different structure in this case.  The existing approach
is an impressive hack but it's hard to call it anything but a hack.


Thank you for the review, Tom and Stephen!


I guess that the idea is inspired by this comment on postgres_fdw.c:

  * Note: currently, the plan tree generated for UPDATE/DELETE will always
  * include a ForeignScan that retrieves ctids (using SELECT FOR UPDATE)
  * and then the ModifyTable node will have to execute individual remote
  * UPDATE/DELETE commands.  If there are no local conditions or joins
  * needed, it'd be better to let the scan node do UPDATE/DELETE RETURNING
  * and then do nothing at ModifyTable.  Room for future optimization ...


That's right.  Thanks, Laurenz!

And in addition to that, I've created the patch with the conscious aim 
of not getting the core code involved, because I was thinking this is 
just an optimization.  But I have to admit I was too conscious of that.



I'm not sure offhand what the new plan tree ought to look like.  We could
just generate a ForeignScan node, but that seems like rather a misnomer.
Is it worth inventing a new ForeignUpdate node type?  Or maybe it'd still
be ForeignScan but with a flag field saying hey this is really an update
(or a delete).  The main benefit I can think of right now is that the
EXPLAIN output would be less strange-looking --- but EXPLAIN is hardly
the only thing that ever looks at plan trees, so having an outright
misleading plan structure is likely to burn us down the line.


I can understand these qualms.
I wonder if ForeignUpdate is such a good name though, since it would
surprise the uninitiate that in the regular (no push-down) case the
actual modification is *not* performed by ForeignUpdate.
So it should rather be a ForeignModifyingScan, but I personally would
prefer a has_side_effects flag on ForeignScan.


+1 for improving the EXPLAIN output by inventing a plan tree with a 
different structure in this case, in general.



One advantage of getting the core code involved in the decision about
whether an update/delete can be pushed down is that FDW-independent checks
like whether there are relevant triggers could be implemented in the core
code, rather than having to duplicate them (and later maintain them) in
every FDW that wants to do this.


Good idea!


Stuff like is there a
LIMIT probably has to be in the FDW since some FDWs could support pushing
down LIMIT and others not.


That's what I have in mind.  I'll work on that at the next CF.

Thanks,

Best regards,
Etsuro Fujita




Re: [HACKERS] pg_basebackup vs. Windows and tablespaces

2014-09-12 Thread Dilip kumar
On 11 September 2014 10:21, Amit kapila Wrote,

I don't think currently such a limitation is mentioned in docs,
however I think we can update the docs at below locations:

1. In description of pg_start_backup in below page:
http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
2. In Explanation of Base Backup, basically under heading
Making a Base Backup Using the Low Level API at below
page:
http://www.postgresql.org/docs/devel/static/continuous-archiving.html#BACKUP-BASE-BACKUP

In general, we can explain about symlink_label file where ever
we are explaining about backup_label file.

If you think it is sufficient to explain about symlink_label in
above places, then I can update the patch.

I think this will be sufficient….


Regards,
Dilip




Re: [HACKERS] Suspicious check (src/backend/access/gin/gindatapage.c)

2014-09-12 Thread Heikki Linnakangas

On 09/12/2014 03:48 AM, Michael Paquier wrote:

On Fri, Sep 12, 2014 at 7:35 AM, Gaetano Mendola mend...@gmail.com wrote:

At line 650 I can read:

   if ((leaf->lsize - segsize) - (leaf->lsize - segsize) < BLCKSZ / 4)
  break;

I believe one of the two should be leaf->rsize

Yes this condition is broken. Shouldn't it be that instead when
appending items at the end of a page?
if ((leaf->lsize - segsize) - (leaf->rsize + segsize) < BLCKSZ / 4)


Yep, that's what it was supposed to be. The consequence was that when 
appending to the index, the left page was always packed as tight as 
possible. Which is not actually that bad a behavior, in fact if you're 
just appending values and never do any other updates, it's optimal. 
That's probably why no-one has noticed. But it's not what was intended.


But now that I look at it more closely, fixing this as you suggested 
doesn't actually yield the intended behavior either. The intention was 
to fill the left page 75% full. However, after fixing that typo, the 
code would move items to the left page so that the difference between 
the halves is BLCKSZ / 4, which does not give a 75% ratio. For example, 
if there are 8200 bytes of data in total, the fixed code would 
actually put 5124 bytes on the left page, and 3076 bytes on the right, 
which makes the left page only 62% full.
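
For reference, the arithmetic behind those numbers (assuming BLCKSZ = 8192, so
BLCKSZ / 4 = 2048):

    lsize + rsize = 8200,  stop moving once lsize - rsize < 2048
      =>  lsize ≈ (8200 + 2048) / 2 = 5124,  rsize ≈ 3076
      =>  left page fill ≈ 5124 / 8192 ≈ 62.5%

    intended 75% target:  lsize ≈ 3 * BLCKSZ / 4 = 6144,  6144 / 8192 = 75%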


Fixed, per the attached patch. Now that the logic is fixed, I hope we 
won't get complaints that the indexes are bigger, if you fill a table by 
appending to the end. I wonder if we should aim at an even more uneven 
split; the default fillfactor for B-trees is 90%, for example. I didn't 
go that high when I wrote that, because the code in previous versions 
always did a 50/50 split. But it could be argued that a higher 
fillfactor makes sense for a GIN index - they typically don't get as 
much random updates as a B-tree.


- Heikki

diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 76e0cb3..e2d15a8 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -621,9 +621,9 @@ dataPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 		/*
 		 * Had to split.
 		 *
-		 * We already divided the segments between the left and the right
-		 * page. The left page was filled as full as possible, and the rest
-		 * overflowed to the right page. When building a new index, that's
+		 * leafRepackItems already divided the segments between the left and
+		 * the right page. It filled the left page as full as possible, and
+		 * put the rest to the right page. When building a new index, that's
 		 * good, because the table is scanned from beginning to end and there
 		 * won't be any more insertions to the left page during the build.
 		 * This packs the index as tight as possible. But otherwise, split
@@ -631,9 +631,10 @@ dataPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 		 * until they're balanced.
 		 *
 		 * As a further heuristic, when appending items to the end of the
-		 * page, split 75/25, one the assumption that subsequent insertions
-		 * will probably also go to the end. This packs the index somewhat
-		 * tighter when appending to a table, which is very common.
+		 * page, try to make the left page 75% full, on the assumption that
+		 * subsequent insertions will probably also go to the end. This packs
+		 * the index somewhat tighter when appending to a table, which is very
+		 * common.
 		 */
 		if (!btree-isBuild)
 		{
@@ -645,14 +646,18 @@ dataPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 if (lastleftinfo->action != GIN_SEGMENT_DELETE)
 {
 	segsize = SizeOfGinPostingList(lastleftinfo->seg);
+
+	/*
+	 * Note that we check that the right page doesn't become
+	 * more full than the left page even when appending. It's
+	 * possible that we added enough items to make both pages
+	 * more than 75% full.
+	 */
+	if ((leaf->lsize - segsize) - (leaf->rsize + segsize) < 0)
+		break;
 	if (append)
 	{
-		if ((leaf->lsize - segsize) - (leaf->lsize - segsize) < BLCKSZ / 4)
-			break;
-	}
-	else
-	{
-		if ((leaf->lsize - segsize) - (leaf->rsize + segsize) < 0)
+		if ((leaf->lsize - segsize) < (BLCKSZ * 3) / 4)
 			break;
 	}
 



Re: [HACKERS] Suspicious check (src/backend/access/gin/gindatapage.c)

2014-09-12 Thread Heikki Linnakangas

On 09/12/2014 11:38 AM, Heikki Linnakangas wrote:

Now that the logic is fixed, I hope we
won't get complaints that the indexes are bigger, if you fill a table by
appending to the end. I wonder if we should aim at an even more uneven
split; the default fillfactor for B-trees is 90%, for example. I didn't
go that high when I wrote that, because the code in previous versions
always did a 50/50 split. But it could be argued that a higher
fillfactor makes sense for a GIN index - they typically don't get as
much random updates as a B-tree.


Actually, we should add a fillfactor reloption to GIN. But that's 9.5 
material.


- Heikki





Re: [HACKERS] pg_basebackup vs. Windows and tablespaces

2014-09-12 Thread Amit Kapila
On Fri, Sep 12, 2014 at 1:50 PM, Dilip kumar dilip.ku...@huawei.com wrote:
 On 11 September 2014 10:21, Amit kapila Wrote,
 I don't think currently such a limitation is mentioned in docs,

 however I think we can update the docs at below locations:
 1. In description of pg_start_backup in below page:
 
http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-ADMIN-BACKUP

 2. In Explanation of Base Backup, basically under heading
 Making a Base Backup Using the Low Level API at below
 page:
 
http://www.postgresql.org/docs/devel/static/continuous-archiving.html#BACKUP-BASE-BACKUP

 In general, we can explain about symlink_label file where ever
 we are explaining about backup_label file.


 If you think it is sufficient to explain about symlink_label if
 above places, then I can update the patch.

 I think this will be sufficient….

Please find updated patch to include those documentation changes.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


extend_basebackup_to_include_symlink_v3.patch
Description: Binary data



Re: [HACKERS] Stating the significance of Lehman Yao in the nbtree README

2014-09-12 Thread Heikki Linnakangas

On 09/11/2014 11:47 PM, Peter Geoghegan wrote:

On Tue, Sep 9, 2014 at 12:01 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

+ Although it could be argued that Lehman and Yao isn't followed to the
+ letter because single pages are read locked as the tree is descended,
+ this is at least necessary to support deletion, a common requirement
+ which L&Y hardly acknowledge.  Read locks also ensure that B-tree
+ pages are self-consistent (L&Y appear to assume atomic page reads and
+ writes).


This is just duplicating the existing paragraph. I don't see the point of
this.


The point is to make clear the reason for the differences - evidently,
Amit felt it was unclear why we don't follow L&Y. I am suggesting that
L&Y's talk of having no read locks is unrealistic (it might be
realistic in the 21st century, but that's a whole other story).


IMHO the existing paragraph does a much better job explaining that:


Lehman and Yao don't require read locks, but assume that in-memory
copies of tree pages are unshared.  Postgres shares in-memory buffers
among backends.  As a result, we do page-level read locking on btree
pages in order to guarantee that no record is modified while we are
examining it.


Amit: did you notice that paragraph in the README? If not, and now that 
you have read it, does that paragraph make things clear? If that's not 
enough, what do you think is missing?


- Heikki





Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-09-12 Thread David Rowley
On Fri, Sep 12, 2014 at 3:35 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Sep 11, 2014 at 7:14 AM, David Rowley dgrowle...@gmail.com
 wrote:
 

 1. I don't think that I'm currently handling removing eclass members
  properly. So far the code just removes the Vars that belong to the relation
  being removed. I likely should also be doing bms_del_member(ec->ec_relids, relid);
  on the eclass, but perhaps I should just be marking the whole class
  as ec_broken = true and adding another eclass with everything that the
  broken one has minus the parts from the removed relation?

 I haven't read the patch, but I think the question is why this needs
 to be different than what we do for left join removal.


I discovered over here -
http://www.postgresql.org/message-id/CAApHDvo5wCRk7uHBuMHJaDpbW-b_oGKQOuisCZzHC25=h3_...@mail.gmail.com
during the early days of the semi and anti join removal code that the
planner was trying to generate paths to the dead rel. I managed to track
the problem down to eclass members still existing for the dead rel. I guess
we must not have eclass members for outer rels? or we'd likely have seen
some troubles with left join removals already.

In the meantime I'll fix up the inner join removal code to properly delete
the ec_relids member for the dead rel. I guess probably the restrict info
should come out too.

I know it's late in the commitfest, but since there was next to no interest
in semi and anti join removals, can I rename the patch in the commitfest
app to be Inner join removals? It's either that or I'd mark that patch as
rejected and submit this one in October.

Regards

David Rowley


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Heikki Linnakangas

On 09/12/2014 12:46 AM, Peter Geoghegan wrote:

On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas robertmh...@gmail.com wrote:

I think I said pretty clearly that it was.


I agree that you did, but it wasn't clear exactly what factors you
were asking me to simulate.


All factors.


Do you want me to compare the same string a million times in a loop,
both with a strcoll() and with a memcmp()?


Yes.


Should I copy it into a buffer to add a NUL byte?


Yes.


Or should it be a new string each time, with a cache miss expected
some proportion of the time?


Yes.

I'm being facetious - it's easy to ask for tests when you're not the one 
running them. But seriously, please do run all the tests that you 
think make sense.


I'm particularly interested in the worst case. What is the worst case 
for the proposed memcmp() check? Test that. If the worst case regresses 
significantly, then we need to have a discussion of how likely that 
worst case is to happen in real life, what the performance is like in 
more realistic almost-worst-case scenarios, does it need to be tunable, 
is the trade-off worth it, etc. But if the worst case regresses less 
than, say, 1%, and there are some other cases where you get a 300% speed 
up, then I think it's safe to say that the optimization is worth it, 
without any more testing or discussion.
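
For what it's worth, the kind of standalone worst-case test being discussed can
be sketched in a few lines of C (illustrative only, not the patch's actual code
path: a long shared prefix with differing tails, so the memcmp() shortcut never
matches and is pure overhead):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        /* Use the collation from the environment (e.g. LANG=en_US.UTF-8). */
        setlocale(LC_COLLATE, "");

        /* Worst case for the shortcut: long common prefix, unequal tails. */
        char a[128], b[128];
        memset(a, 'x', sizeof(a));
        memset(b, 'x', sizeof(b));
        a[126] = 'a';
        b[126] = 'b';
        a[127] = b[127] = '\0';

        const int   loops = 1000000;
        long        sink = 0;
        clock_t     t0, t1, t2;
        int         i;

        t0 = clock();
        for (i = 0; i < loops; i++)
            sink += strcoll(a, b);

        t1 = clock();
        for (i = 0; i < loops; i++)
        {
            if (memcmp(a, b, 127) == 0)     /* equal bytes => equal strings */
                continue;
            sink += strcoll(a, b);          /* never skipped in this worst case */
        }
        t2 = clock();

        printf("strcoll only:         %.3f s\n", (double) (t1 - t0) / CLOCKS_PER_SEC);
        printf("memcmp, then strcoll: %.3f s\n", (double) (t2 - t1) / CLOCKS_PER_SEC);
        printf("(ignore: %ld)\n", sink);
        return 0;
    }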

- Heikki





Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-09-12 Thread David Rowley
On Fri, Sep 12, 2014 at 3:47 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Robert Haas robertmh...@gmail.com writes:
  On Thu, Sep 11, 2014 at 7:14 AM, David Rowley dgrowle...@gmail.com
 wrote:
  5. I've added a flag to pg_class called relhasfkey. Currently this gets set
  to true when a foreign key is added, though I've added nothing to set it
  back to false again. I notice that relhasindex gets set back to false during
  vacuum, if vacuum happens to find there not to be any indexes on the rel. I
  didn't put my logic here as I wasn't too sure if scanning pg_constraint
  during a vacuum seemed very correct, so I just left out the setting-it-to-false
  logic, based on the fact that I noticed that relhaspkey gets away
  with quite lazy setting-back-to-false logic (only when there are no indexes of
  any kind left on the relation at all).

  The alternative to resetting the flag somehow is not having it in the
  first place.  Would that be terribly expensive?


I'd imagine not really expensive. I guess I just thought that it would be a
good idea to save having to bother looking in pg_constraint for
foreign keys when none exist. The scan uses pg_constraint_conrelid_index, so
it would only ever see the constraints for the rel being cached/loaded.



 The behavior of relhaspkey is a legacy thing that we've tolerated only
 because nothing whatsoever in the backend depends on it at all.  I'm not
 eager to add more equally-ill-defined pg_class columns.


I guess it's certainly not required. It would be easier to add it later if
we decided it was a good idea, rather than having to keep it forever and a
day if it's next to useless.

I'll remove it from the patch.

Regards

David Rowley


Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Ants Aasma
On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund and...@2ndquadrant.com wrote:
  Hm. Perhaps we should do a bufHdr->refcount != zero check without
  locking here? The atomic op will transfer the cacheline exclusively to
  the reclaimer's CPU. Even though it very shortly afterwards will be
  touched afterwards by the pinning backend.

 Meh.  I'm not in favor of adding more funny games with locking unless
 we can prove they're necessary for performance.

 Well, this in theory increases the number of processes touching buffer
 headers regularly. Currently, if you have one read IO intensive backend,
 there's pretty much only process touching the cachelines. This will make
 it two. I don't think it's unreasonable to try to reduce the cacheline
 pingpong caused by that...

I don't think it will help much. A pinned buffer is pretty likely to
be in modified state in the cache of the cpu of the pinning backend.
Even taking a look at the refcount will trigger a writeback and
demotion to shared state. When time comes to unpin the buffer the
cacheline must again be promoted to exclusive state introducing
coherency traffic. Not locking the buffer only saves transferring the
cacheline back to the pinning backend, not a huge amount of savings.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Andres Freund
On 2014-09-12 12:38:48 +0300, Ants Aasma wrote:
 On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund and...@2ndquadrant.com wrote:
   Hm. Perhaps we should do a bufHdr->refcount != zero check without
   locking here? The atomic op will transfer the cacheline exclusively to
   the reclaimer's CPU. Even though it very shortly afterwards will be
   touched afterwards by the pinning backend.
 
  Meh.  I'm not in favor of adding more funny games with locking unless
  we can prove they're necessary for performance.
 
  Well, this in theory increases the number of processes touching buffer
  headers regularly. Currently, if you have one read IO intensive backend,
  there's pretty much only process touching the cachelines. This will make
  it two. I don't think it's unreasonable to try to reduce the cacheline
  pingpong caused by that...
 
 I don't think it will help much. A pinned buffer is pretty likely to
 be in modified state in the cache of the cpu of the pinning backend.

Right. Unless you're on a MOESI platform. I'd really like to know why
that's not more widely used.

 Even taking a look at the refcount will trigger a writeback and
 demotion to shared state.  When time comes to unpin the buffer the
 cacheline must again be promoted to exclusive state introducing
 coherency traffic. Not locking the buffer only saves transfering the
 cacheline back to the pinning backend, not a huge amount of savings.

Yes. But: In many, if not most, cases the cacheline will be read a
couple times before modifying it via the spinlock.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] implement subject alternative names support for SSL connections

2014-09-12 Thread Heikki Linnakangas

On 09/11/2014 08:46 PM, Alexey Klyukin wrote:

On Mon, Sep 8, 2014 at 8:04 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

2. I still wonder if we should follow the RFC 6125 and not check the Common
Name at all, if SANs are present. I don't really understand the point of
that rule, and it seems unlikely to pose a security issue in practice,
because surely a CA won't sign a certificate with a bogus/wrong CN, because
an older client that doesn't look at the SANs at all would use the CN
anyway. But still... what does a Typical Web Browser do?


At least Firefox and Safari seem to follow the RFC strictly, according to
some anecdotal evidence (I haven't got enough time to dig into the
source code yet):

http://superuser.com/questions/230136/primary-common-name-in-subjectaltname
https://bugzilla.mozilla.org/show_bug.cgi?id=238142

Chrome seems to follow them too, so the major open-source browsers are
ignoring the Common Name if the SubjectAltName is present.
Still, I see no win in ignoring the CN (except for the shorter code); it
just annoys some users that forgot to put the CN name also in the SAN
section, so my 5 cents is that we should check both.


Hmm. If that's what the browsers do, I think we should also err on the 
side of caution here. Ignoring the CN is highly unlikely to cause anyone 
a problem; a CA worth its salt should not issue a certificate with a CN 
that's not also listed in the SAN section. But if you have such a 
certificate anyway for some reason, it shouldn't be too difficult to get 
a new certificate. Certificates expire every 1-3 years, so there 
must be a procedure to renew them anyway.


- Heikki





[HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

2014-09-12 Thread Andres Freund
Hi,

Abhijit and I investigated a customer problem which has shown that crash 
recovery + unlogged relations don't always work well together:

A condensed version of how crash recovery works is:

StartupXLOG()
{
...
if (ControlFile->state != DB_SHUTDOWNED)
   InRecovery = true;

if (InRecovery)
{
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);

/* perform crash recovery till the end of WAL */
...
CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
...
}

PreallocXlogFiles(EndOfLog);

if (InRecovery)
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
...
/* finish startup */
}

the second important part is:

CreateCheckPoint(flags)
{
...
if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
shutdown = true;
...
if (shutdown)
ControlFile->state = DB_SHUTDOWNED;
UpdateControlFile();
...
}

If you consider a crash due to ENOSPC the problem is that first crash
recovery is performed. Then a checkpoint is performed which is marked as
END_OF_RECOVERY - which marks the database as being properly shut down!
So far, while not pretty, so good. The problem is that if we crash after
the CreateCheckPoint(), e.g. because of xlog preallocation or the new
files created in ResetUnloggedRelations(), the server will restart *but*
will *not* perform crash recovery anymore, as pg_control is marked as
DB_SHUTDOWNED!

That leaves you with a database which has all the _init forks, but not
the main forks... Leading to scary and unexpected errors.

Should somebody google this: The problem can be fixed by forcing the
server into crash recovery again using an immediate shutdown.


Personally I think it's quite the mistake that END_OF_RECOVERY
checkpoints are treated as shutdown checkpoints. The claimed reason
that:
 *
 * Note that we write a shutdown checkpoint rather than an on-line
 * one. This is not particularly critical, but since we may be
 * assigning a new TLI, using a shutdown checkpoint allows us to have
 * the rule that TLI only changes in shutdown checkpoints, which
 * allows some extra error checking in xlog_redo.
 *
and
/*
 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
 * issued at a different time.
 */

isn't very convincing as those checks could just as well be saved in a
flags argument in the checkpoint. The likelihood of this confusion
causing further bugs (IIRC it already has caused a couple) seems high.

What I like even less is that pg_control is actually marked as
DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
the database was *NOT* shut down peacefully. I don't see active bugs due to
it besides this, but I think it's likely to either have or create further
ones.


Because at least the former is something that we obviously can't (and
don't want to) change in the back branches, I think the solution for this
particular problem is to simply move the ResetUnloggedRelations() call a
couple lines up to before the CreateCheckPoint(). That should fix this.
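
In other words, the condensed version above would become something like this
(a sketch only; the exact placement within StartupXLOG still needs care):

    if (InRecovery)
    {
        ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);

        /* perform crash recovery till the end of WAL */
        ...

        /*
         * Create the unlogged relations' main forks *before* the checkpoint
         * that marks pg_control DB_SHUTDOWNED, so that a crash in between
         * still triggers crash recovery.
         */
        ResetUnloggedRelations(UNLOGGED_RELATION_INIT);

        CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
        ...
    }

    PreallocXlogFiles(EndOfLog);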

Comments, other opinions?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

2014-09-12 Thread Andres Freund
Hi,

On 2014-09-12 13:22:46 +0200, Andres Freund wrote:
 Because at least the former is something that obviously we can't (and
 don't want) to change in the back branches I think the solution for this
 particular problem is to simply move the ResetUnloggedRelations() call a
 couple lines up to before the CreateCheckPoint(). That should fix this.
 
 Comments, other opinions?

A second thing I realized, just after hitting send, is that I think
ResetUnloggedRelations(UNLOGGED_RELATION_INIT) actually has to fsync the
created files. As we rely on the !init relations being there they really
need to be fsynced. During normal running that's ensured via the smgr
during checkpoints, but that's not the case here.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] expanded mode is still broken

2014-09-12 Thread Pavel Stehule
Hi

I checked HEAD and border formatting is broken


postgres=# \pset border 3
Border style (border) is 3.
postgres=# \l
List of databases
[ RECORD 1 ]-+--
| Name  | postgres  |
| Owner | postgres  |
| Encoding  | UTF8  |
| Collate   | en_US.UTF-8   |
| Ctype | en_US.UTF-8   |
| Access privileges |   |
[ RECORD 2 ]-+--
| Name  | template0 |
| Owner | postgres  |
| Encoding  | UTF8  |
| Collate   | en_US.UTF-8   |
| Ctype | en_US.UTF-8   |
| Access privileges | =c/postgres  +|
|   | postgres=CTc/postgres |
[ RECORD 3 ]-+--
| Name  | template1 |
| Owner | postgres  |
| Encoding  | UTF8  |
| Collate   | en_US.UTF-8   |
| Ctype | en_US.UTF-8   |
| Access privileges | =c/postgres  +|
|   | postgres=CTc/postgres |
-+--

postgres=# \pset linestyle unicode
Line style (linestyle) is unicode.
postgres=# \l
List of databases
[ RECORD 1 ]─┬──
│ Name  │ postgres  │
│ Owner │ postgres  │
│ Encoding  │ UTF8  │
│ Collate   │ en_US.UTF-8   │
│ Ctype │ en_US.UTF-8   │
│ Access privileges │   │
[ RECORD 2 ]─┼──
│ Name  │ template0 │
│ Owner │ postgres  │
│ Encoding  │ UTF8  │
│ Collate   │ en_US.UTF-8   │
│ Ctype │ en_US.UTF-8   │
│ Access privileges │ =c/postgres  ↵│
│   │ postgres=CTc/postgres │
[ RECORD 3 ]─┼──
│ Name  │ template1 │
│ Owner │ postgres  │
│ Encoding  │ UTF8  │
│ Collate   │ en_US.UTF-8   │
│ Ctype │ en_US.UTF-8   │
│ Access privileges │ =c/postgres  ↵│
│   │ postgres=CTc/postgres │
─┴──

Probably not all of Sergey's fixes were applied; when I tested it (9.4 with
Sergey's fixes) it worked without issues (I thought).

Regards

Pavel


Re: [HACKERS] proposal (9.5) : psql unicode border line styles

2014-09-12 Thread Pavel Stehule
2014-09-12 8:19 GMT+02:00 Pavel Stehule pavel.steh...@gmail.com:



 2014-09-12 5:14 GMT+02:00 Stephen Frost sfr...@snowman.net:

 Pavel,

 * Pavel Stehule (pavel.steh...@gmail.com) wrote:
  Any idea, tip how to it?

 Attached is what I had been thinking.

 Thoughts?


 yes, it is better. I didn't understand before.

 I though about it because it allows to design multiline end style in
 future.

 I am not a fan of ↵ and … in resulted tables, and can be nice if I can
 to optionally change it.


I thought about it, and we cannot drop them now. These symbols are
necessary, because we don't support a line between rows.

I am thinking the 05 patch should be final.

Regards

Pavel



 With your change we can design other pure style against current rich
 style. But I don't have a idea, how to name it.

 Regards

 Pavel



 Thanks!

 Stephen





Re: [HACKERS] pg_basebackup vs. Windows and tablespaces

2014-09-12 Thread Dilip kumar
On 12 September 2014 14:34, Amit Kapila Wrote

Please find updated patch to include those documentation changes.

Looks fine, moved to Ready for Committer.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Incorrect initialization of sentPtr in walsender.c

2014-09-12 Thread Michael Paquier
On Fri, Sep 12, 2014 at 4:55 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I haven't looked at those places closely, but it seems possible that at
 least some of those variables are supposed to be initialized to a value
 smaller than any valid WAL position, rather than just Invalid. In other
 words, if we defined InvalidXLogRecPtr as INT64_MAX rather than 0, we would
 still want those variables to be initialized to zero. As I said, I didn't
 check the code, but before we change those that ought to be checked.

Ah, OK. I just had a look at that, and receivedUpto and lastComplaint
in xlog.c need to use the lowest pointer value possible as they do a
couple of comparisons with other positions. This is also the case
for sentPtr in walsender.c. However, that's not the case for writePtr
and flushPtr in walreceiver.c as those positions are just used for
direct comparison with LogstreamResult, so we could use
InvalidXLogRecPtr in this case.

What do you think of the addition of a #define for the lowest possible
XLOG location pointer? I've wanted that as well a couple of times when
working on clients using replication connections, for example for
START_REPLICATION. That would be better than hardcoding a position
like '0/0', and would make the current code more solid.

Patch attached in case.

Regards,
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 34f2fc0..fc42b5f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -163,7 +163,7 @@ HotStandbyState standbyState = STANDBY_DISABLED;
 static XLogRecPtr LastRec;
 
 /* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+static XLogRecPtr receivedUpto = LowestXLogRecPtr;
 static TimeLineID receiveTLI = 0;
 
 /*
@@ -11003,7 +11003,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		curFileTLI = tli;
 		RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 			 PrimarySlotName);
-		receivedUpto = 0;
+		receivedUpto = LowestXLogRecPtr;
 	}
 
 	/*
@@ -11266,7 +11266,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int
 emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
-	static XLogRecPtr lastComplaint = 0;
+	static XLogRecPtr lastComplaint = LowestXLogRecPtr;
 
 	if (readSource == XLOG_FROM_PG_XLOG && emode == LOG)
 	{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..c5103f7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1037,8 +1037,8 @@ XLogWalRcvFlush(bool dying)
 static void
 XLogWalRcvSendReply(bool force, bool requestReply)
 {
-	static XLogRecPtr writePtr = 0;
-	static XLogRecPtr flushPtr = 0;
+	static XLogRecPtr writePtr = InvalidXLogRecPtr;
+	static XLogRecPtr flushPtr = InvalidXLogRecPtr;
 	XLogRecPtr	applyPtr;
 	static TimestampTz sendTime = 0;
 	TimestampTz now;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 844a5de..819ac28 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -140,7 +140,7 @@ static XLogRecPtr sendTimeLineValidUpto = InvalidXLogRecPtr;
  * How far have we sent WAL already? This is also advertised in
 * MyWalSnd->sentPtr.  (Actually, this is the next WAL location to send.)
  */
-static XLogRecPtr sentPtr = 0;
+static XLogRecPtr sentPtr = LowestXLogRecPtr;
 
 /* Buffers for constructing outgoing messages and processing reply messages. */
 static StringInfoData output_message;
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 3b8e738..f7d88b4 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -25,9 +25,13 @@ typedef uint64 XLogRecPtr;
  * WAL segment, initializing the first WAL page at XLOG_SEG_SIZE, so no XLOG
  * record can begin at zero.
  */
-#define InvalidXLogRecPtr	0
+#define InvalidXLogRecPtr		((XLogRecPtr) 0)
 #define XLogRecPtrIsInvalid(r)	((r) == InvalidXLogRecPtr)
 
+/* Minimum value possible for a location pointer of XLOG */
+#define LowestXLogRecPtr		((XLogRecPtr) 0)
+#define XLogRecPtrIsLowest(r)	((r) == LowestXLogRecPtr)
+
 /*
  * XLogSegNo - physical log file sequence number.
  */



Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Robert Haas
On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Attached is the patch split as suggested:

 (a) hashjoin-nbuckets-v14a-size.patch

 * NTUP_PER_BUCKET=1
 * counting buckets towards work_mem
 * changes in ExecChooseHashTableSize (due to the other changes)

OK, I'm going to work on this one now.  I have some ideas about how to
simplify this a bit and will post an update in a few hours once I see
how that pans out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_basebackup vs. Windows and tablespaces

2014-09-12 Thread Amit Kapila
On Fri, Sep 12, 2014 at 5:07 PM, Dilip kumar dilip.ku...@huawei.com wrote:

 On 12 September 2014 14:34, Amit Kapila Wrote

 Please find updated patch to include those documentation changes.



 Looks fine, Moved to Ready for committer.

Thanks a lot for the review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] proposal (9.5) : psql unicode border line styles

2014-09-12 Thread Stephen Frost
* Pavel Stehule (pavel.steh...@gmail.com) wrote:
 I though about it, and we cannot to drop it now. These symbols are
 necessary, because we don't support line between rows.
 
 I am thinking so 05 patch should be final

Ok.  I'm going to play with it a bit more but barring issues, should get
it committed today.

Thanks!

Stephen




Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-09-12 Thread Tom Lane
David Rowley dgrowle...@gmail.com writes:
 On Fri, Sep 12, 2014 at 3:35 AM, Robert Haas robertmh...@gmail.com wrote:
 I haven't read the patch, but I think the question is why this needs
 to be different than what we do for left join removal.

 I discovered over here -
 http://www.postgresql.org/message-id/CAApHDvo5wCRk7uHBuMHJaDpbW-b_oGKQOuisCZzHC25=h3_...@mail.gmail.com
 during the early days of the semi and anti join removal code that the
 planner was trying to generate paths to the dead rel. I managed to track
 the problem down to eclass members still existing for the dead rel. I guess
 we must not have eclass members for outer rels? or we'd likely have seen
 some troubles with left join removals already.

Mere existence of an eclass entry ought not cause paths to get built.
It'd be worth looking a bit harder into what's happening there.

regards, tom lane




[HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Alexander Korotkov
Hi!

Let's consider some examples.

# select '[1,2]'::jsonb @> '[1,2,2]'::jsonb;
 ?column?
--
 f
(1 row)

One may think it's because the second jsonb array contains two 2s, i.e. that
containment takes the count of equal elements into account.

# select '[1,1,2]'::jsonb @> '[1,2,2]'::jsonb;
 ?column?
--
 t
(1 row)

But it doesn't. Jsonb containment only takes the length of the array into account.

# select '[[1,2]]'::jsonb @> '[[1,2,2]]'::jsonb;
 ?column?
--
 t
(1 row)

Even more weird :)
The reason why jsonb containment behaves this way is the check at the beginning
of jsonb_contains. It makes a fast check of the jsonb type and element count
before calling JsonbDeepContains.

if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
PG_RETURN_BOOL(false);

It's likely that JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) should be
checked only for objects, not arrays. Also, should JsonbDeepContains do the
same fast check when it deals with nested objects?
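
A sketch of what the adjusted fast path could look like if the count shortcut
is applied to objects only (illustrative, not a tested patch):

    if (JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
        PG_RETURN_BOOL(false);

    /*
     * Object keys are unique, so a smaller object can never contain a larger
     * one.  Arrays may hold duplicate elements, so their element counts say
     * nothing about containment and the shortcut must be skipped.
     */
    if (JB_ROOT_IS_OBJECT(tmpl) &&
        JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl))
        PG_RETURN_BOOL(false);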

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] expanded mode is still broken

2014-09-12 Thread Stephen Frost
Pavel,

* Pavel Stehule (pavel.steh...@gmail.com) wrote:
 I checked HEAD and border formatting is broken

I agree that 'border 3' and the ascii / unicode linestyles generate
odd-looking output, however...

[...]

 Probably not all Sergey's fixes was applied, when I tested it (9.4 with
 Sergey' fixes) it works without issues (I though)

I went through all of psql-wrapped-expanded-fix-v5.patch (from
CAJTaR30koAru2tucYiAZ=jjffi8tz0k7dhfsy-y7bsfj347...@mail.gmail.com) and
the whole thing appears to have been applied.  There aren't any
regression tests with '\pset border 3' and the documentation indicates
that it's only sensible for HTML (where it simply tranlates into the
border=.. attribute) and latex/latex-longtable.  That said, it looks
like it worked in 9.3.  I'm guessing there's a case somewhere which
isn't handling border = 3 as identical to border = 2 after these changes.

If there's a specific patch that fixes this, please share a link.  I'll
spend a bit of time looking at it, but I'll admit that I've not spent a
lot of time in this code and it isn't exactly the most straight-forward
code in our tree today..

Thanks,

Stephen




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-12 Thread Alvaro Herrera
Michael Paquier wrote:
 On Fri, Sep 12, 2014 at 3:19 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila amit.kapil...@gmail.com wrote:
  On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On 8 January 2014 08:33, Simon Riggs si...@2ndquadrant.com wrote:
 
  Patch attached, implemented to reduce writes by SELECTs only.
 
  This patch is registered in this CF. It does not apply anymore and
  needs a rebase. Robert and Amit have provided as well some comments
  but they have not been addressed. Is it fair to mark it as returned
  with feedback even if it has not been reviewed within the last month?
 For the time being, status has been changed to waiting on author.

As it happens, I was studying this patch yesterday on the flight back
home.  I gave it a quick look; I noticed it was in the commitfest and
hadn't seen any review activity for many months, which seemed odd.

Anyway I first read the whole thread to know what to focus on, before
going over the patch itself.  Once I finished reading the emails, I had
a vague idea of how I thought it would work: my thinking was that
heap/index scans would either call heap_page_prune_opt, or not,
depending on whether they were part of a read-only executor node.  So if
you have a query that updates a certain table, and while doing so scans
another table in read-only mode, then the HOT updates would be enabled
for the table being written, but disabled for the one being read.

As it turns out, the patch as written is nothing at all like that, and
TBH I don't think I like it very much.

My idea is that we would have a new executor flag, say
EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be
read-only, and reset it on those that aren't, such as LockRows and
ModifyTable (obviously we need to pass it down correctly from parent to
children).  Then in ExecInitSeqScan and ExecInitIndexScan, if we see the
flag set, we call heap/index_set_allow_prune(false) for the heap scan;
same thing in index scans.  (I envisioned it as a boolean rather than
enabling a certain number of cleanups per scan.)

I tried to code this but I think it doesn't work correctly, and no time
for debug currently.  Anyway let me know what you think of this general
idea.
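
A sketch of the executor side of that idea (EXEC_FLAG_READ_ONLY is the flag
proposed above; the exact scan-state field is glossed over here):

    /* In ExecInitSeqScan, after the underlying heap scan has been opened: */
    if (eflags & EXEC_FLAG_READ_ONLY)
        heap_set_allow_prune(scanstate->ss_currentScanDesc, false);

    /* ...plus the corresponding index_set_allow_prune() call in ExecInitIndexScan. */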

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2b336b0..b859f89 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -342,9 +342,11 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	snapshot = scan->rs_snapshot;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * enabled.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_allow_prune)
+		heap_page_prune_opt(scan->rs_rd, buffer);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -1349,6 +1351,9 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
 
+	/* HOT pruning is initially allowed; caller can turn it off if desired */
+	scan->rs_allow_prune = true;
+
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
 	 */
@@ -1440,6 +1445,12 @@ heap_endscan(HeapScanDesc scan)
 	pfree(scan);
 }
 
+void
+heap_set_allow_prune(HeapScanDesc scan, bool allow)
+{
+	scan->rs_allow_prune = allow;
+}
+
 /* 
  *		heap_getnext	- retrieve next tuple in scan
  *
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..cb2f075 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -253,6 +253,7 @@ index_beginscan(Relation heapRelation,
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->xs_allow_prune = true;
 
 	return scan;
 }
@@ -387,6 +388,12 @@ index_endscan(IndexScanDesc scan)
 	IndexScanEnd(scan);
 }
 
+void
+index_set_allow_prune(IndexScanDesc scan, bool allow_prune)
+{
+	scan->xs_allow_prune = allow_prune;
+}
+
 /* 
  *		index_markpos  - mark a scan position
  * 
@@ -520,9 +527,9 @@ index_fetch_heap(IndexScanDesc scan)
 			 ItemPointerGetBlockNumber(tid));
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page if enabled, but only if we weren't already on this page
 		 */
-		if (prev_buf != scan->xs_cbuf)
+		if (prev_buf != scan->xs_cbuf && scan->xs_allow_prune)
 			heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
 	}
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index fbd7492..8f4964c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1351,7 +1351,7 @@ BeginCopy(bool is_from,
 		 *
 		 * ExecutorStart computes a result tupdesc for us
 		 */
-		

Re: [HACKERS] expanded mode is still broken

2014-09-12 Thread Stephen Frost
Pavel,

* Stephen Frost (sfr...@snowman.net) wrote:
 That said, it looks like it worked in 9.3.

Hmm, actually, no, it didn't.

sfrost@tamriel:~ psql --version
psql (PostgreSQL) 9.3.5
sfrost@tamriel:~ psql -d postgres
psql (9.3.5)
Type help for help.

postgres=# \pset expanded 
Expanded display is on.
postgres=# \pset border 3
Border style is 3.
postgres=# \l
List of databases
[ RECORD 1 ]-+--
| Name  | postgres  |
| Owner | postgres  |
| Encoding  | UTF8  |
| Collate   | en_US.UTF-8   |
| Ctype | en_US.UTF-8   |
| Access privileges |   |

...

I've found a few places where we don't treat border > 2 the same as
border == 2 and if I'm able to make it work then I'll probably go ahead
and commit it, unless anyone objects, but if I run into trouble then
I'll probably just punt on it as I don't know that it really deserves a
lot of effort.

The way the documentation reads for 'border' is pretty terrible though,
in my view, so I'll take a pass at cleaning that up too.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-12 Thread Simon Riggs
On 12 September 2014 14:54, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

 My idea is that we would have a new executor flag, say
 EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be
 read-only, and reset it on those that aren't, such as LockRows and
 ModifyTable (obviously we need to pass it down correctly from parent to
 children).  Then in ExecInitSeqScan and ExecInitIndexScan, if we see the
 flag set, we call heap/index_set_allow_prune(false) for the heap scan;
 same thing in index scans.  (I envisioned it as a boolean rather than
 enabling a certain number of cleanups per scan.)

 I tried to code this but I think it doesn't work correctly, and no time
 for debug currently.  Anyway let me know what you think of this general
 idea.

Thanks for looking at this.

My concern was to ensure that UPDATEs and DELETEs continue to call
heap_page_prune_opt while larger SELECTs do not.

This is achieved without a counter, so after some thought I like it
better; simple is good. Happy to progress from here, or you can?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-12 Thread Alvaro Herrera
Simon Riggs wrote:
 On 12 September 2014 14:54, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 
  My idea is that we would have a new executor flag, say
  EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be
  read-only, and reset it on those that aren't, such as LockRows and
  ModifyTable (obviously we need to pass it down correctly from parent to
  children).  Then in ExecInitSeqScan and ExecInitIndexScan, if we see the
  flag set, we call heap/index_set_allow_prune(false) for the heap scan;
  same thing in index scans.  (I envisioned it as a boolean rather than
  enabling a certain number of cleanups per scan.)
 
  I tried to code this but I think it doesn't work correctly, and no time
  for debug currently.  Anyway let me know what you think of this general
  idea.
 
 Thanks for looking at this.
 
 My concern was to ensure that UPDATEs and DELETEs continue to call
 heap_page_prune_opt while larger SELECTs do not.
 
 This is achieved without a counter, so after some thought I like it
 better; simple is good. Happy to progress from here, or you can?

Please feel free to take over.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] implement subject alternative names support for SSL connections

2014-09-12 Thread Heikki Linnakangas

On 09/12/2014 01:30 PM, Heikki Linnakangas wrote:

On 09/11/2014 08:46 PM, Alexey Klyukin wrote:

On Mon, Sep 8, 2014 at 8:04 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

2. I still wonder if we should follow the RFC 6125 and not check the Common
Name at all, if SANs are present. I don't really understand the point of
that rule, and it seems unlikely to pose a security issue in practice,
because surely a CA won't sign a certificate with a bogus/wrong CN, because
an older client that doesn't look at the SANs at all would use the CN
anyway. But still... what does a Typical Web Browser do?


At least Firefox and Safari seem to follow RFC strictly, according to
some anecdotical evidences (I haven't got enough time to dig into the
source code yet):

http://superuser.com/questions/230136/primary-common-name-in-subjectaltname
https://bugzilla.mozilla.org/show_bug.cgi?id=238142

Chrome seem to follow them, so the major open-source browsers are
ignoring the Common Name if the SubjectAltName is present.
Still, I see no win in ignoring CN (except for the shorter code), it
just annoys some users that forgot to put the CN name also in the SAN
section, so my 5 cents is that we should check both.


Hmm. If that's what the browsers do, I think we should also err on the
side of caution here. Ignoring the CN is highly unlikely to cause anyone
a problem; a CA worth its salt should not issue a certificate with a CN
that's not also listed in the SAN section. But if you have such a
certificate anyway for some reason, it shouldn't be too difficult to get
a new certificate. Certificates expire every 1-3 years anyway, so there
must be a procedure to renew them anyway.


Committed, with that change, ie. the CN is not checked if SANs are present.

Thanks for bearing through all these iterations!

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-12 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 12 September 2014 14:54, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 My idea is that we would have a new executor flag, say
 EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be
 read-only, and reset it on those that aren't, such as LockRows and
 ModifyTable (obviously we need to pass it down correctly from parent to
 children).  Then in ExecInitSeqScan and ExecInitIndexScan, if we see the
 flag set, we call heap/index_set_allow_prune(false) for the heap scan;
 same thing in index scans.  (I envisioned it as a boolean rather than
 enabling a certain number of cleanups per scan.)
 
 I tried to code this but I think it doesn't work correctly, and no time
 for debug currently.  Anyway let me know what you think of this general
 idea.

 Thanks for looking at this.

 My concern was to ensure that UPDATEs and DELETEs continue to call
 heap_page_prune_opt while larger SELECTs do not.

I think there's another way to think about it: what about saying that
the query's target relation(s) are subject to pruning, while others
are not?  Then you do not need an executor flag, you just need to
look at the estate->es_result_relations array (or maybe even only at
estate->es_result_relation_info).  This would have the advantage of
doing what-I-think-is-the-right-thing for updates/deletes involving
joins to other tables.  The mechanism Alvaro describes would probably
have to prune all tables involved in such a query; do we really want
that?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-12 Thread Tom Lane
I wrote:
 I think there's another way to think about it: what about saying that
 the query's target relation(s) are subject to pruning, while others
 are not?  Then you do not need an executor flag, you just need to
 look at the estate->es_result_relations array (or maybe even only at
 estate->es_result_relation_info).

After a little bit I remembered there was already a function for this.
So specifically, I'd suggest using ExecRelationIsTargetRelation()
to decide whether to mark the scan as requiring pruning.
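
For concreteness, a minimal sketch of how that could slot into ExecInitSeqScan
(hypothetical, not part of the posted patch; it reuses the heap_set_allow_prune()
helper from Alvaro's draft, and "scandesc" stands for the HeapScanDesc the node
has already opened):

	/*
	 * Sketch only: allow HOT pruning during this scan only when the scanned
	 * relation is a result relation (UPDATE/DELETE target) of the query.
	 */
	heap_set_allow_prune(scandesc,
						 ExecRelationIsTargetRelation(estate,
													  node->scanrelid));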

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanded mode is still broken

2014-09-12 Thread Stephen Frost
Pavel, All,

* Stephen Frost (sfr...@snowman.net) wrote:
 * Stephen Frost (sfr...@snowman.net) wrote:
  That said, it looks like it worked in 9.3.
 
 Hmm, actually, no, it didn't.
[...]

Alright, attached is a patch which fixes it.  Barring objections, I'll
go ahead and back-patch this also, since it's clearly wrong and there
was a definite attempt to have this case work (a few places explicitly
reset opt_border to 2 if it's higher, but that wasn't being passed down
to print_aligned_vertical_line, and it wasn't doing it).

Thanks,

Stephen
diff --git a/src/bin/psql/print.c b/src/bin/psql/print.c
index 34365fd..d6e0487 100644
--- a/src/bin/psql/print.c
+++ b/src/bin/psql/print.c
@@ -1147,16 +1147,15 @@ cleanup:
 
 
 static void
-print_aligned_vertical_line(const printTableContent *cont,
+print_aligned_vertical_line(const printTextFormat *format,
+			const unsigned short opt_border,
 			unsigned long record,
 			unsigned int hwidth,
 			unsigned int dwidth,
 			printTextRule pos,
 			FILE *fout)
 {
-	const printTextFormat *format = get_line_style(cont->opt);
 	const printTextLineFormat *lformat = format->lrule[pos];
-	unsigned short opt_border = cont->opt->border;
 	unsigned int i;
 	int			reclen = 0;
 
@@ -1435,11 +1434,11 @@ print_aligned_vertical(const printTableContent *cont, FILE *fout)
 lhwidth++; /* for newline indicators */
 
 			if (!opt_tuples_only)
-print_aligned_vertical_line(cont, record++, lhwidth,
-			dwidth, pos, fout);
+print_aligned_vertical_line(format, opt_border, record++,
+			lhwidth, dwidth, pos, fout);
 			else if (i != 0 || !cont->opt->start_table || opt_border == 2)
-print_aligned_vertical_line(cont, 0, lhwidth, dwidth,
-			pos, fout);
+print_aligned_vertical_line(format, opt_border, 0, lhwidth,
+			dwidth, pos, fout);
 		}
 
 		/* Format the header */
@@ -1624,7 +1623,7 @@ print_aligned_vertical(const printTableContent *cont, FILE *fout)
 	if (cont->opt->stop_table)
 	{
 		if (opt_border == 2 && !cancel_pressed)
-			print_aligned_vertical_line(cont, 0, hwidth, dwidth,
+			print_aligned_vertical_line(format, opt_border, 0, hwidth, dwidth,
 		PRINT_RULE_BOTTOM, fout);
 
 		/* print footers */


signature.asc
Description: Digital signature


Re: [HACKERS] pgbench throttling latency limit

2014-09-12 Thread Heikki Linnakangas

On 09/11/2014 03:36 PM, Fabien COELHO wrote:


Hello Heikki,


Now that I've finished the detour and committed and backpatched the changes
to the way latency is calculated, we can get back to this patch. It needs to
be rebased.


Before rebasing, I think that there are a few small problems with the
modification applied to switch from an integer range to double.

(1) ISTM that the + 0.5 which remains in the PoissonRand computation comes
from the previous integer approach and is not needed here. If I'm not
mistaken the formula should be plain:

   -log(uniform) * center


No. The +0.5 is to round the result to the nearest integer, instead of 
truncating it down.



(2) I'm not sure of the name center, I think that lambda or
  mean would be more appropriate.


(shrug), I guess. The comment says that it's the value the average of a 
series values is centered on, so center doesn't seem too bad. I guess 
the mathematically accurate term would be expected value.



(3) I wish that the maximum implied multiplier could be explicitely
  documented in the source code. From pg_rand48 source code, I think
  that it is 33.27106466687737


Oh, ok. That's an even smaller multiplier than I got just by feeding 
DBL_MIN to the formula. I don't think that's worth worrying about. That 
might change if the implementation of pg_erand48() is changed, so I'm a 
bit reluctant to state it explicitly.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER SYSTEM RESET?

2014-09-12 Thread Fujii Masao
On Wed, Sep 10, 2014 at 9:06 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Sep 3, 2014 at 12:08 AM, Christoph Berg c...@df7cb.de wrote:
 Re: Vik Fearing 2014-09-02 5405d2d9.9050...@dalibo.com
  Uhm, are we agreed on the decision on not to backpatch this?  I would
  think this should have been part of the initial ALTER SYSTEM SET patch
  and thus should be backpatched to 9.4.

 I think it belongs in 9.4 as well, especially if we're having another beta.

 My original complaint was about 9.4, so I'd like to see it there, yes.

 ISTM that the consensus is to do the back-patch to 9.4.
 Does anyone object to the back-patch? If not, I will do that.

Done because no one voiced objection against the back-patch.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-09-12 Thread Heikki Linnakangas

On 09/11/2014 05:16 PM, Fabien COELHO wrote:



How should skipped transactions should be taken into account in the log file
output, with and without aggregation? I assume we'll want to have some trace
of skipped transactions in the logs.


The problem with this point is that how to report something not done is
unclear, especially as the logic of the log is one line per performed
transaction.

Obviously we can log something, but as the transaction are not performed
the format would be different, which break the expectation of a simple and
homogeneous log file format that people like to process directly.

So bar any great idea, I would suggest not to log skipped transactions and
to wait for someone to need to have access to this detailed information
and for whom the final report is not enough.


We have to come up with something. The point of the log file is that it 
contains all the information you need to build your own graphs, when 
pgbench's built-in reporting features are not enough. If it doesn't 
contain detailed information about skipped transactions, a report based 
on the log file would be inaccurate, or at least misleading.


How about printing a line in the log for every skipped transaction, with 
the string skipped in place of the latency. The completion time can 
show the time when the transaction was skipped, and the lag can show the 
difference between the scheduled time and the time it was skipped. Or 
put another way, print a line as if the transaction completed 
immediately, but with the skipped string in the latency field.


The skipped string will trip a program that doesn't expect that, but 
since this is a new feature that you have to enable manually, that's OK.


The output would look something like this (modified from the manual's 
example by hand, so the numbers don't add up):


0 199 2241 0 1175850568 995598 1020
0 200 2465 0 1175850568 998079 1010
0 201 skipped 1175850569 608 3011
0 202 skipped 1175850569 608 2400
0 203 skipped 1175850569 608 1000
0 204 2513 0 1175850569 608 500
0 205 2038 0 1175850569 2663 500

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-09-12 Thread Tom Lane
Albe Laurenz laurenz.a...@wien.gv.at writes:
 Tom Lane wrote:
 I'm not sure offhand what the new plan tree ought to look like.  We could
 just generate a ForeignScan node, but that seems like rather a misnomer.
 Is it worth inventing a new ForeignUpdate node type?  Or maybe it'd still
 be ForeignScan but with a flag field saying hey this is really an update
 (or a delete).  The main benefit I can think of right now is that the
 EXPLAIN output would be less strange-looking --- but EXPLAIN is hardly
 the only thing that ever looks at plan trees, so having an outright
 misleading plan structure is likely to burn us down the line.

 I can understand these qualms.
 I wonder if ForeignUpdate is such a good name though, since it would
 surprise the uninitiate that in the regular (no push-down) case the
 actual modification is *not* performed by ForeignUpdate.
 So it should rather be a ForeignModifyingScan, but I personally would
 prefer a has_side_effects flag on ForeignScan.

I was envisioning that the EXPLAIN output would look like

Foreign Scan on tab1
  Remote SQL: SELECT ...

for the normal case, versus

Foreign Update on tab1
  Remote SQL: UPDATE ...

for the pushed-down-update case (and similarly for DELETE).  For a
non-optimized update it'd still be a ForeignScan underneath a ModifyTable.

As for the internal representation, I was thinking of adding a CmdType
field to struct ForeignScan, with currently only CMD_SELECT, CMD_UPDATE,
CMD_DELETE as allowed values, though possibly in future we'd think of a
reason to allow CMD_INSERT there.  This is more or less isomorphic to your
has_side_effects flag, but it allows distinguishing UPDATE from DELETE
which might be useful.
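
For illustration, the struct change being floated would look roughly like this
(sketch only; the field name and its position are assumptions, not settled):

	typedef struct ForeignScan
	{
		Scan		scan;
		CmdType		operation;		/* proposed: CMD_SELECT, or CMD_UPDATE /
									 * CMD_DELETE for a pushed-down statement */
		List	   *fdw_exprs;		/* expressions that FDW may evaluate */
		List	   *fdw_private;	/* private data for FDW */
		bool		fsSystemCol;	/* true if any "system column" is needed */
	} ForeignScan;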

The only thing that's bothering me about this concept is that I'm not
seeing how to scale it up to handling a pushed-down update on a join,
ie, UPDATE foo ... FROM bar ... where both foo and bar are remote.
Maybe it's silly to worry about that until join push-down is done;
but in that case I'd vote for postponing this whole patch until we
have join push-down.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto: PGP signatures

2014-09-12 Thread Alvaro Herrera
Marko Tiikkaja wrote:

 On 9/8/14 7:30 PM, Jeff Janes wrote:

 If i understand the sequence here: The current git HEAD is that
 pgp_pub_decrypt would throw an error if given a signed and encrypted
 message, and earlier version of your patch changed that to decrypt the
 message and ignore the signature, and the current version went back to
 throwing an error.
 
 I think I prefer the middle of those behaviors.  The original behavior
 seems like a bug to me, and I don't think we need to be backwards
 compatible with bugs.  Why should a function called decrypt care if the
 message is also signed?  That is not its job.
 
 I haven't updated the patch yet because I don't want to waste my
 time going back and forth until we have a consensus, but I think I
 prefer Jeff's suggestion here to make the _decrypt() functions
 ignore signatures.  Does anyone else want to voice their opinion?

+1 for ignoring sigs.  If somebody want to check sigs, that's a
separate step.  Maybe we could have an optional boolean flag, default
false, to enable checking sigs, but that seems material for a future
patch.

That said, I do wonder if it's a behavior change with security
implications: if somebody is relying on the current behavior of throwing
an error when sigs don't match, they wouldn't be thrilled to hear that
their security checks now fail to detect a problem because we don't
verify signatures when decrypting.  In other words, is this established
practice already?  If not, it's okay; otherwise, hmm.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal (9.5) : psql unicode border line styles

2014-09-12 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
 * Pavel Stehule (pavel.steh...@gmail.com) wrote:
  I though about it, and we cannot to drop it now. These symbols are
  necessary, because we don't support line between rows.
  
  I am thinking so 05 patch should be final
 
 Ok.  I'm going to play with it a bit more but barring issues, should get
 it committed today.

Alright, pushed with some additional cleanup, and also cleaned up the
border documentation a bit.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] proposal (9.5) : psql unicode border line styles

2014-09-12 Thread Pavel Stehule
2014-09-12 18:09 GMT+02:00 Stephen Frost sfr...@snowman.net:

 * Stephen Frost (sfr...@snowman.net) wrote:
  * Pavel Stehule (pavel.steh...@gmail.com) wrote:
   I though about it, and we cannot to drop it now. These symbols are
   necessary, because we don't support line between rows.
  
   I am thinking so 05 patch should be final
 
  Ok.  I'm going to play with it a bit more but barring issues, should get
  it committed today.

 Alright, pushed with some additional cleanup, and also cleaned up the
 border documentation a bit.


Thank you very much

Regards

Pavel



 Thanks!

 Stephen



Re: [HACKERS] expanded mode is still broken

2014-09-12 Thread Pavel Stehule
It works perfectly now

Thank you

Pavel

2014-09-12 16:37 GMT+02:00 Stephen Frost sfr...@snowman.net:

 Pavel, All,

 * Stephen Frost (sfr...@snowman.net) wrote:
  * Stephen Frost (sfr...@snowman.net) wrote:
   That said, it looks like it worked in 9.3.
 
  Hmm, actually, no, it didn't.
 [...]

 Alright, attached is a patch which fixes it.  Barring objections, I'll
 go ahead and back-patch this also, since it's clearly wrong and there
 was a definite attempt to have this case work (a few places explicitly
 reset opt_border to 2 if it's higher, but that wasn't being passed down
 to print_aligned_vertical_line, and it wasn't doing it).

 Thanks,

 Stephen



Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Attached is the patch split as suggested:

 (a) hashjoin-nbuckets-v14a-size.patch

 * NTUP_PER_BUCKET=1
 * counting buckets towards work_mem
 * changes in ExecChooseHashTableSize (due to the other changes)

 OK, I'm going to work on this one now.  I have some ideas about how to
 simplify this a bit and will post an update in a few hours once I see
 how that pans out.

Here's what I came up with.  I believe this has the same semantics as
your patch for less code, but please check, because I might be wrong.
The main simplifications I made were:

(1) In master, we decide whether or not to batch based on the size of
the data, without knowing the number of buckets.  If we want to
account for the bucket overhead, that doesn't work.  But in your
patch, we basically computed the number of buckets we'd need for the
single-batch case, then decide whether to do a single batch, and if
yes, computed the same values over again with the same formula phrased
differently.  I revised that so that the single-batch case is
calculated only once.  If everything fits, we're all set, and don't
need to recalculate anything.

(2) In your patch, once we decide we need to batch, you set nbatch as
now, and then check whether the computed value is still adequate after
subtracting buckets_byte from hash_table_bytes.  I revised that so
that we make the initial batch estimate of dbatch by dividing
inner_rel_bytes by hash_table_bytes - bucket_bytes rather than
hash_table_bytes.  I think that avoids the need to maybe increase
nbatch again afterwards.

I'm comfortable with this version if you are, but (maybe as a
follow-on commit) I think we could make this even a bit smarter.  If
inner_rel_bytes + bucket_bytes > hash_table_bytes but inner_rel_bytes
+ bucket_bytes / 4 < hash_table_bytes, then reduce the number of
buckets by 2x or 4x to make it fit.  That would provide a bit of
additional safety margin against cases where this patch might
unluckily lose.
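
Something along these lines, perhaps (untested sketch of that follow-on idea,
not part of the attached patch):

	/*
	 * Sketch: if only the bucket-header overhead pushes us past work_mem,
	 * shrink the bucket array by 2x or 4x rather than going multi-batch.
	 */
	if (inner_rel_bytes + bucket_bytes > hash_table_bytes &&
		inner_rel_bytes + bucket_bytes / 4 <= hash_table_bytes)
	{
		while (inner_rel_bytes + bucket_bytes > hash_table_bytes &&
			   nbuckets > 1024)
		{
			nbuckets /= 2;
			bucket_bytes /= 2;
		}
	}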

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3ef7cfb..b3203ba 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -386,7 +386,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -396,12 +396,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	long		bucket_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
-	int			nbatch;
+	int			nbatch = 1;
 	int			nbuckets;
-	int			i;
+	double		dbuckets;
 
 	/* Force a plausible relation size if no info */
 	if (ntuples <= 0.0)
@@ -460,56 +461,61 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 	/*
 	 * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
-	 * memory is filled.  Set nbatch to the smallest power of 2 that appears
-	 * sufficient.  The Min() steps limit the results so that the pointer
-	 * arrays we'll try to allocate do not exceed work_mem.
+	 * memory is filled, assuming a single batch.  The Min() step limits the
+	 * results so that the pointer arrays we'll try to allocate do not exceed
+	 * work_mem.
 	 */
 	max_pointers = (work_mem * 1024L) / sizeof(void *);
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
+	dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
+	dbuckets = Min(dbuckets, max_pointers);
+	nbuckets = Max((int) dbuckets, 1024);
+	nbuckets = 1 << my_log2(nbuckets);
+	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
 
-	if (inner_rel_bytes > hash_table_bytes)
+	/*
+	 * If there's not enough space to store the projected number of tuples
+	 * and the required bucket headers, we will need multiple batches.
+	 */
+	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
+		long		bucket_size;
 
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
+		/*
+		 * Estimate the number of buckets we'll want to have when work_mem
+		 * is entirely full.  Each bucket will contain a bucket pointer plus
+		 * NTUP_PER_BUCKET tuples, whose projected size already includes
+		 * overhead for the hash code, pointer to the next tuple, etc.
+		 */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
 		lbuckets = Min(lbuckets, max_pointers);
 		nbuckets = (int) lbuckets;
+		bucket_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are 

Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-12 Thread Josh Berkus
On 09/11/2014 06:56 PM, Arthur Silva wrote:
 
 In my testings with the github archive data the savings -
 performance-penalty was fine, but I'm not confident in those results
 since there were only 8 top level keys.

Well, we did want to see that the patch doesn't create a regression with
data which doesn't fall into the problem case area, and your test did
that nicely.

 For comparison, some twitter api objects[1] have 30+ top level keys. If
 I have time in the next couple of days I'll conduct some testings with
 the public twitter fire-hose data.

Yah, if we have enough time for me to get the Mozilla Socorro test
environment working, I can also test with Mozilla crash data.  That has
some deep nesting and very large values.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 2:15 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Amit: did you notice that paragraph in the README? If not, and now that you
 have read it, does that paragraph make things clear? If that's not enough,
 what do you think is missing?

Amit raised the fact that LY say that no read locks are required, so
clearly he read that paragraph. We don't try and justify that right
now, and trying to explain the whole point of LY without first
explaining this satisfactorily seems odd. The current text reads:


Lehman and Yao don't require read locks, but assume that in-memory
copies of tree pages are unshared.


If they assume that they're unshared, uh, then why bother with all
that locking stuff? Or does this mean that page reads and writes are
(dubiously) assumed atomic without locks? If you take a step back, you
can see how confusing that is. LY don't get around to explaining
this, but it's pretty clear that this is what's going on. If LY did a
better job of explaining their algorithm, they'd just say that the
latch coupling (coupling, or crabbing, of what we'd call buffer
locks) previously required is no longer required, but single shared
locks are required. As I pointed out, it looks like someone figured
out a way to make that true much, much later [1]. I think page-level
MVCC designs might have also been used, but basically our
interpretation of LY is the standard one. We cannot really be
considered to have deviated from LY, since AFAICT everyone else went
with the same interpretation.

FWIW, in recent years Stonebraker has used latch coupling as an
example of the (implicitly no longer acceptable) overhead of designs
influenced by System R and ARIES. This is unfounded, but plenty of
bright people still at least believe that latch coupling is common.

[1] http://www.vldb.org/conf/2001/P181.pdf
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-12 Thread Robert Haas
On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote:
 So, I finally got time to test Tom's latest patch on this.

 TLDR: we want to go with Tom's latest patch and release beta3.

 Figures:

 So I tested HEAD against the latest lengths patch.  Per Arthur Silva, I
 checked uncompressed times for JSONB against compressed times.  This
 changed the picture considerably.

Did you


-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote:
 So, I finally got time to test Tom's latest patch on this.

 TLDR: we want to go with Tom's latest patch and release beta3.

 Figures:

 So I tested HEAD against the latest lengths patch.  Per Arthur Silva, I
 checked uncompressed times for JSONB against compressed times.  This
 changed the picture considerably.

 Did you

Blah.

Did you test Heikki's patch from here?

http://www.postgresql.org/message-id/53ec8194.4020...@vmware.com

Tom didn't like it, but I thought it was rather clever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 12:57 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Sep 12, 2014 at 2:15 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 Amit: did you notice that paragraph in the README? If not, and now that you
 have read it, does that paragraph make things clear? If that's not enough,
 what do you think is missing?

 Amit raised the fact that LY say that no read locks are required, so
 clearly he read that paragraph. We don't try and justify that right
 now, and trying to explain the whole point of LY without first
 explaining this satisfactorily seems odd. The current text reads:

 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.
 

 If they assume that they're unshared, uh, then why bother with all
 that locking stuff? Or does this mean that page reads and writes are
 (dubiously) assumed atomic without locks? If you take a step back, you
 can see how confusing that is. LY don't get around to explaining
 this, but it's pretty clear that this is what's going on. If LY did a
 better job of explaining their algorithm, they'd just say that the
 latch coupling (coupling, or crabbing, of what we'd call buffer
 locks) previously required is no longer required, but single shared
 locks are required. As I pointed out, it looks like someone figured
 out a way to make that true much, much later [1]. I think page-level
 MVCC designs might have also been used, but basically our
 interpretation of LY is the standard one. We cannot really be
 considered to have deviated from LY, since AFAICT everyone else went
 with the same interpretation.

Gosh, I think you're making this way more complicated than it needs to
be.  My interpretation of the above statement was that they knew
individual page reads and writes would need to be made atomic -
probably using some form of simple locking - but omitted that from
their pseudocode for clarity.  Such elisions are common in computer
science literature and don't typically detract from understanding.  It
is assumed that the implementor knows enough to avoid the trivial
pitfalls of whatever is being discussed and therefore focuses on the
higher-level algorithmic issues.

If this is what we're arguing about, it's completely not worth the
time we've spent on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-12 Thread Josh Berkus
On 09/12/2014 10:00 AM, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote:
 So, I finally got time to test Tom's latest patch on this.

 TLDR: we want to go with Tom's latest patch and release beta3.

 Figures:

 So I tested HEAD against the latest lengths patch.  Per Arthur Silva, I
 checked uncompressed times for JSONB against compressed times.  This
 changed the picture considerably.

 Did you
 
 Blah.
 
 Did you test Heikki's patch from here?
 
 http://www.postgresql.org/message-id/53ec8194.4020...@vmware.com
 
 Tom didn't like it, but I thought it was rather clever.
 

Yes, I posted the results for that a couple weeks ago; Tom had posted a
cleaned-up version of that patch, but materially it made no difference
in sizes or extraction times compared with Tom's lengths-only patch.
Same for Arthur's tests.

It's certainly possible that there is a test case for which Heikki's
approach is superior, but if so we haven't seen it.  And since it's
approach is also more complicated, sticking with the simpler
lengths-only approach seems like the way to go.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-09-12 Thread Fabien COELHO



(1) ISTM that the + 0.5 which remains in the PoissonRand computation comes
from the previous integer approach and is not needed here. If I'm not
mistaken the formula should be plain:

   -log(uniform) * center


No. The +0.5 is to round the result to the nearest integer, instead of 
truncating it down.


Hmmm... probably ok. I'll have to think about it a bit.

In that case, it seems much clearer to do: round(-log(uniform) * xxx) 
instead of relying on the truncation of the cast.
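
For reference, the two spellings being compared are (sketch, not the exact
pgbench code; "delay" is just an illustrative variable name):

	/* current: add 0.5 and let the cast truncate, i.e. round to nearest */
	delay = (int64) (-log(uniform) * center + 0.5);

	/* suggested: make the rounding explicit */
	delay = (int64) round(-log(uniform) * center);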



(2) I'm not sure of the name center, I think that lambda or
  mean would be more appropriate.


(shrug), I guess. The comment says that it's the value the average of a 
series values is centered on, so center doesn't seem too bad. I guess the 
mathematically accurate term would be expected value.


The word center does not appear once on the wikipedia page about 
Poisson distribution. Its mean is called lambda (or rather λ:-) all 
over the place. I find expected value rather too generic, but it is 
better than center.



(3) I wish that the maximum implied multiplier could be explicitely
  documented in the source code. From pg_rand48 source code, I think
  that it is 33.27106466687737


Oh, ok. That's an even smaller multiplier than I got just by feeding DBL_MIN 
to the formula. I don't think that's worth worrying about. That might change 
if the implementation of pg_erand48() is changed, so I'm a bit reluctant to 
state it explicitly.


I think that it is an important information, so it deserves to appear.

If pg_erand48 implementation changes, it should not be called 48, 
because the max value and the above multiplier limit are completely linked 
to the 16*3=48 structure of the random construction.
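
The figure is also easy to re-derive, which is an argument for stating it in a
comment; a tiny standalone check (assuming erand48's smallest nonzero value is
2^-48):

	#include <math.h>
	#include <stdio.h>

	int
	main(void)
	{
		/* largest -log(uniform) multiplier: -log(2^-48) = 48 * ln(2) */
		printf("%.14f\n", -log(ldexp(1.0, -48)));	/* ~33.27106466687737 */
		return 0;
	}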


If the code change then the comments need be updated, that is life.

--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-12 Thread Simon Riggs
On 12 September 2014 15:30, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 I think there's another way to think about it: what about saying that
 the query's target relation(s) are subject to pruning, while others
 are not?  Then you do not need an executor flag, you just need to
 look at the estate-es_result_relations array (or maybe even only at
 estate-es_result_relation_info).

 After a little bit I remembered there was already a function for this.
 So specifically, I'd suggest using ExecRelationIsTargetRelation()
 to decide whether to mark the scan as requiring pruning.

Sounds cool. Thanks both, this is sounding like a viable route now.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 10:10 AM, Robert Haas robertmh...@gmail.com wrote:
 Gosh, I think you're making this way more complicated than it needs to
 be.  My interpretation of the above statement was that they knew
 individual page reads and writes would need to be made atomic -
 probably using some form of simple locking - but omitted that from
 their pseudocode for clarity.

That clearly isn't the case. The introductory paragraph of LY says
the following:

Our solution compares favorably with earlier solutions in that the
locking scheme is simpler (no read-locks are used) and only a (small)
constant number of nodes are locked by any update process at any given
time.

They clearly and prominently state that not needing read locks is a
major advantage of their algorithm, which doesn't quite ring true.

 If this is what we're arguing about, it's completely not worth the
 time we've spent on it.

It isn't. It's a minor point, originally raised by Amit.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/12/2014 10:00 AM, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote:
 So, I finally got time to test Tom's latest patch on this.

 TLDR: we want to go with Tom's latest patch and release beta3.

 Figures:

 So I tested HEAD against the latest lengths patch.  Per Arthur Silva, I
 checked uncompressed times for JSONB against compressed times.  This
 changed the picture considerably.

 Did you

 Blah.

 Did you test Heikki's patch from here?

 http://www.postgresql.org/message-id/53ec8194.4020...@vmware.com

 Tom didn't like it, but I thought it was rather clever.

 Yes, I posted the results for that a couple weeks ago; Tom had posted a
 cleaned-up version of that patch, but materially it made no difference
 in sizes or extraction times compared with Tom's lengths-only patch.
 Same for Arthur's tests.

 It's certainly possible that there is a test case for which Heikki's
 approach is superior, but if so we haven't seen it.  And since it's
 approach is also more complicated, sticking with the simpler
 lengths-only approach seems like the way to go.

Huh, OK.  I'm slightly surprised, but that's why we benchmark these things.

Thanks for following up on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Gregory Smith

On 9/11/14, 7:01 AM, Andres Freund wrote:
I'm not convinced that these changes can be made without also changing 
the bgwriter logic. Have you measured whether there are differences in 
how effective the bgwriter is? Not that it's very effective right now :)


The current background writer tuning went out of its way to do nothing 
when it wasn't clear there was something that always worked.  What 
happened with all of the really clever schemes was that they worked on 
some workloads, and just trashed others.  Most of the gain from the 8.3 
rewrite came from looking at well theorized ideas for how to handle 
things like pre-emptive LRU scanning for writes, and just throwing them 
out altogether in favor of ignoring the problem.  The magic numbers left 
in or added to the code  were tuned to do very little work, 
intentionally.  If anything, since then the pressure to do nothing has 
gone up in the default install, because now people are very concerned 
about extra wakeups using power.


To find bad cases before, I was running about 30 different test 
combinations by the end, Heikki was running another set in the EDB lab, 
I believe there was a lab at NTT running their own set too. What went in 
was the combination that didn't break any of them badly--not the one 
that performed best on the good workloads.


This looks like it's squashed one of the very fundamental buffer scaling 
issues though; well done Amit.  What I'd like to see is preserving the 
heart of that while touching as little as possible. When in doubt, do 
nothing; let the backends suck it up and do the work themselves.


I had to take a health break from community development for a while, and 
I'm hoping to jump back into review again for the rest of the current 
development cycle.  I'll go back to my notes and try to recreate the 
pathological cases that plagued both the 8.3 BGW rewrite and the aborted 
9.2 fsync spreading effort I did; get those running again and see how 
they do on this new approach.  I have a decent sized 24 core server that 
should be good enough for this job.  I'll see what I can do.


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 6:40 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 Even more weird :)

Agreed.

 The reason why jsonb contains behaves so is check in the beginning of
 jsonb_contains. It makes fast check of jsonb type and elements count before
 calling JsonbDeepContains.

 if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
 JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
 PG_RETURN_BOOL(false);

 It's likely that JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) should be
 checked only for objects, not arrays. Also, should JsonbDeepContains do the
 same fast check when it deals with nested objects?

I think this is due to commit 364ddc. That removed the extra step that
had arrays sorted (and then de-duped) ahead of time, which made arrays
behave like objects at the top level. I think that this sort + de-dup
step was mischaracterized as purely a performance thing (possibly by
me). Basically, JsonbDeepContains() is consistent with the previous
behavior at the top level, but not the current (inadvertently altered)
behavior. I think the fix is probably a return to the previous
behavior. I'll take a closer look.
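
For reference, the guard change Alexander suggests above would amount to
something like this in jsonb_contains (sketch only, not a committed fix):

	/* containment across object/array kinds is never possible */
	if (JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
		PG_RETURN_BOOL(false);

	/*
	 * The element-count shortcut is only valid for objects: keys are unique,
	 * whereas arrays may repeat elements, so a shorter array can still
	 * contain a longer template.
	 */
	if (JB_ROOT_IS_OBJECT(val) &&
		JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl))
		PG_RETURN_BOOL(false);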


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-09-12 Thread Robert Haas
On Thu, Sep 11, 2014 at 8:40 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 On Thu, Sep 11, 2014 at 11:24 AM, Kouhei Kaigai kai...@ak.jp.nec.com
 wrote:
  Don't the changes to src/backend/optimizer/plan/createplan.c belong
  in patch #2?
 
  The borderline between #1 and #2 is little bit bogus. So, I moved most
  of portion into #1, however, invocation of InitCustomScan (that is a
  callback in CustomPlanMethod) in create_custom_plan() is still in #2.

 Eh, create_custom_scan() certainly looks like it is in #1 from here, or
 at least part of it is.  It calculates tlist and clauses and then does
 nothing with them.  That clearly can't be the right division.

 I think it would make sense to have create_custom_scan() compute tlist and
 clauses first, and then pass those to CreateCustomPlan().  Then you don't
 need a separate InitCustomScan() - which is misnamed anyway, since it has
 nothing to do with ExecInitCustomScan().

 The only reason why I put separate hooks here is, create_custom_scan() needs
 to know exact size of the CustomScan node (including private fields), however,
 it is helpful for extensions to kick its callback to initialize the node
 next to the common initialization stuff.

Why does it need to know that?  I don't see that it's doing anything
that requires knowing the size of that node, and if it is, I think it
shouldn't be.  That should get delegated to the callback provided by
the custom plan provider.

 Regarding to the naming, how about GetCustomScan() instead of 
 InitCustomScan()?
 It follows the manner in create_foreignscan_plan().

I guess that's a bit better, but come to think of it, I'd really like
to avoid baking in the assumption that the custom path provider has to
return any particular type of plan node.  A good start would be to
give it a name that doesn't imply that - e.g. PlanCustomPath().
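
For illustration, the callback shape under discussion might look roughly like
this (names and argument list are assumptions, nothing here is settled API):

	/* core computes tlist and clauses; the provider returns whatever Plan
	 * node it chooses to build for the CustomPath */
	typedef Plan *(*PlanCustomPath_function) (PlannerInfo *root,
											  RelOptInfo *rel,
											  CustomPath *best_path,
											  List *tlist,
											  List *clauses);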

  OK, I revised. Now custom-scan assumes it has a particular valid
  relation to be scanned, so no code path with scanrelid == 0 at this moment.
 
  Let us revisit this scenario when custom-scan replaces relation-joins.
  In this case, custom-scan will not be associated with a particular
  base- relation, thus it needs to admit a custom-scan node with scanrelid
 == 0.

 Yeah, I guess the question there is whether we'll want let CustomScan have
 scanrelid == 0 or require that CustomJoin be used there instead.

 Right now, I cannot imagine a use case that requires individual CustomJoin
 node because CustomScan with scanrelid==0 (that performs like custom-plan
 rather than custom-scan in actually) is sufficient.

 If a CustomScan gets chosen instead of built-in join logics, it shall looks
 like a relation scan on the virtual one that is consists of two underlying
 relation. Callbacks of the CustomScan has a responsibility to join underlying
 relations; that is invisible from the core executor.

 It seems to me CustomScan with scanrelid==0 is sufficient to implement
 an alternative logic on relation joins, don't need an individual node
 from the standpoint of executor.

That's valid logic, but it's not the only way to do it.  If we have
CustomScan and CustomJoin, either of them will require some adaption
to handle this case.  We can either allow a custom scan that isn't
scanning any particular relation (i.e. scanrelid == 0), or we can
allow a custom join that has no children.  I don't know which way will
come out cleaner, and I think it's good to leave that decision to one
side for now.

  Why can't the Custom(GpuHashJoin) node build the hash table
  internally instead of using a separate node?
 
  It's possible, however, it prevents to check sub-plans using EXPLAIN
  if we manage inner-plans internally. So, I'd like to have a separate
  node being connected to the inner-plan.

 Isn't that just a matter of letting the EXPLAIN code print more stuff?
 Why can't it?

 My GpuHashJoin takes multiple relations to load them a hash-table.
 On the other hand, Plan node can have two underlying relations at most
 (inner/outer). Outer-side is occupied by the larger relation, so it
 needs to make multiple relations visible using inner-branch.
 If CustomScan can has a list of multiple underlying plan-nodes, like
 Append node, it can represent the structure above in straightforward
 way, but I'm uncertain which is the better design.

Right.  I think the key point is that it is *possible* to make this
work without a multiexec interface, and it seems like we're agreed
that it is.  Now perhaps we will decide that there is enough benefit
in having multiexec support that we want to do it anyway, but it's
clearly not a hard requirement, because it can be done without that in
the way you describe here.  Let's leave to the future the decision as
to how to proceed here; getting the basic thing done is hard enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-09-12 Thread Fujii Masao
On Wed, Sep 10, 2014 at 10:37 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Fujii Masao wrote:
 On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp wrote:

  PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting.
  Wouldn't it be easy-to-use to have only one parameter,
  PENDING_LIST_CLEANUP_SIZE?  How about setting PENDING_LIST_CLEANUP_SIZE to
  work_mem as the default value when running the CREATE INDEX command?

 That's an idea. But there might be some users who want to change
 the cleanup size per session like they can do by setting work_mem,
 and your idea would prevent them from doing that...

 So what about introduing pending_list_cleanup_size also as GUC?
 That is, users can set the threshold by using either the reloption or
 GUC.

 Yes, I think having both a GUC and a reloption makes sense -- the GUC
 applies to all indexes, and can be tweaked for individual indexes using
 the reloption.

Agreed.

 I'm not sure about the idea of being able to change it per session,
 though.  Do you mean that you would like insert processes use a very
 large value so that they can just append new values to the pending list,
 and have vacuum use a small value so that it cleans up as soon as it
 runs?  Two things: 1. we could have an autovacuum_ reloption which
 only changes what autovacuum does; 2. we could have autovacuum run
 index cleanup actions separately from actual vacuuming.

Yes, I was thinking something like that. But if autovacuum
has already been able to handle that, it's nice. Anyway,
as you pointed out, it's better to have both GUC and reloption
for the cleanup size of pending list.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote:
 It's certainly possible that there is a test case for which Heikki's
 approach is superior, but if so we haven't seen it.  And since it's
 approach is also more complicated, sticking with the simpler
 lengths-only approach seems like the way to go.

 Huh, OK.  I'm slightly surprised, but that's why we benchmark these things.

The argument for Heikki's patch was never that it would offer better
performance; it's obvious (at least to me) that it won't.  The argument
was that it'd be upward-compatible with what we're doing now, so that
we'd not have to force an on-disk compatibility break with 9.4beta2.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication commands and log_statements

2014-09-12 Thread Fujii Masao
On Wed, Sep 10, 2014 at 7:39 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Thanks for reviewing the patch!

 On Wed, Sep 10, 2014 at 4:57 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 On 08/28/2014 11:38 AM, Fujii Masao wrote:

 On Thu, Jun 19, 2014 at 5:29 PM, Ian Barwick i...@2ndquadrant.com wrote:

 - minor rewording for the description, mentioning that statements will
still be logged as DEBUG1 even if parameter set to 'off' (might
prevent reports of the kind I set it to 'off', why am I still seeing
log entries?).

 <para>
  Causes each replication command to be logged in the server log.
  See <xref linkend="protocol-replication"> for more information about
  replication commands. The default value is <literal>off</>. When set to
  <literal>off</>, commands will be logged at log level <literal>DEBUG1</literal>.
  Only superusers can change this setting.
 </para>


 Yep, fixed. Attached is the updated version of the patch.


 I don't think it's necessary to mention that the commands will still be
 logged at DEBUG1 level. We log all kinds of crap at the various DEBUG
 levels, and they're not supposed to be used in normal operation.

 Agreed. I removed that mention from the document.


 - I feel it would be more consistent to use the plural form
for this parameter, i.e. log_replication_commands, in line with
log_lock_waits, log_temp_files, log_disconnections etc.


 But log_statement is in the singular form. So I just used
 log_replication_command. For the consistency, maybe we need to
 change both parameters in the plural form? I don't have strong
 opinion about this.


 Yeah, we seem to be inconsistent. log_replication_commands would sound
 better to me in isolation, but then there is log_statement..

 Agreed. I changed the GUC name to log_replication_commands.


 I'll mark this as Ready for Committer in the commitfest app (I assume you'll
 take care of committing this yourself when ready).

 Attached is the updated version of the patch. After at least one day
 I will commit the patch.

Applied. Thanks all!

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-09-12 Thread Fabien COELHO


The output would look something like this (modified from the manual's example 
by hand, so the numbers don't add up):


0 199 2241 0 1175850568 995598 1020
0 200 2465 0 1175850568 998079 1010
0 201 skipped 1175850569 608 3011
0 202 skipped 1175850569 608 2400
0 203 skipped 1175850569 608 1000
0 204 2513 0 1175850569 608 500
0 205 2038 0 1175850569 2663 500


My 0.02€: ISTM that the number of columns should stay the same whether it 
is skipped or not, so the file_no should be kept. Maybe keeping it a 
number (-1) would make sense, or just a sign (-), which means no value 
with gnuplot for instance. Or skipped.


Basically I would be fine with that, but as I do not use the log file 
feature I'm not sure that my opinion should count.


Note that there are also potential issues with the aggregate logging and 
the sampling stuff.


--
Fabien.


Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 I think this is due to commit 364ddc. That removed the extra step that
 had arrays sorted (and then de-duped) ahead of time, which made arrays
 behave like objects at the top level. I think that this sort + de-dup
 step was mischaracterized as purely a performance thing (possibly by
 me). Basically, JsonbDeepContains() is consistent with the previous
 behavior at the top level, but not the current (inadvertently altered)
 behavior. I think the fix is probably a return to the previous
 behavior. I'll take a closer look.

I'm confused.  Are you proposing to return to sort + de-dup of JSON
arrays?  Surely that is completely broken.  Arrays are ordered.

regards, tom lane




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 11:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm confused.  Are you proposing to return to sort + de-dup of JSON
 arrays?  Surely that is completely broken.  Arrays are ordered.

Sorry, my earlier remarks were premature. In fact, that alteration
only applied to existence, not containment. However, arrays are
ordered for the purposes of equality, but not containment.

-- 
Peter Geoghegan




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Josh Berkus
On 09/12/2014 06:40 AM, Alexander Korotkov wrote:
 Hi!
 
 Let's consider some examples.
 
 # select '[1,2]'::jsonb @> '[1,2,2]'::jsonb;
  ?column?
 --
  f
 (1 row)
 
 One may think it's because the second jsonb array contains two 2s. So,
 contains takes care about the count of equal elements.

JSONB arrays are allowed to have repeating elements.  It's keys which
are not allowed to repeat.

 
 # select '[1,1,2]'::jsonb @> '[1,2,2]'::jsonb;
  ?column?
 --
  t
 (1 row)
 
 But, it's not. Jsonb contains takes care only about length of array.

OK, now, that's messed up.

 
 # select '[[1,2]]'::jsonb @> '[[1,2,2]]'::jsonb;
  ?column?
 --
  t
 (1 row)
 
 Even more weird :)
 The reason why jsonb contains behaves so is the check at the beginning
 of jsonb_contains. It makes a fast check of jsonb type and element count
 before calling JsonbDeepContains.
 
 if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
 JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
 PG_RETURN_BOOL(false);
 
 It's likely that JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) should be
 checked only for objects, not arrays.

Yeah, agreed.
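
For reference, a minimal sketch of the adjustment being suggested (illustrative only, not a committed fix; it reuses the fragment quoted above): keep the fast exit on mismatched root types, but apply the element-count shortcut only when the roots are objects, since a template array may legitimately repeat elements that all match a single element of the other array.

if (JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
	PG_RETURN_BOOL(false);

/* the count-based shortcut is only valid for objects, whose keys are unique */
if (JB_ROOT_IS_OBJECT(val) &&
	JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl))
	PG_RETURN_BOOL(false);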


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] pgcrypto: PGP signatures

2014-09-12 Thread Abhijit Menon-Sen
(I have't read the patch, or even earlier correspondence in this
thread, so I apologise for just jumping in.)

At 2014-09-12 12:50:45 -0300, alvhe...@2ndquadrant.com wrote:

 +1 for ignoring sigs.  If somebody want to check sigs, that's a
 separate step. 

For what it's worth, although it seems logical to split up cryptographic
primitives like this, I think it's widely recognised these days to have
contributed to plenty of bad crypto implementations. There seems to be a
general trend of moving towards higher-level interfaces that require
fewer decisions and can be relied upon to do the Right Thing.

I don't like the idea of ignoring signature verification errors any more
than I would like "if somebody wants to check the HMAC before decrypting,
that's a separate step".

Of course, all that is an aside. If the function ever threw an error on
signature verification failures, I would strongly object to changing it
to ignore such errors for exactly the reasons you mention already.

-- Abhijit




Re: [HACKERS] pgbench throttling latency limit

2014-09-12 Thread Gregory Smith

On 9/10/14, 10:57 AM, Fabien COELHO wrote:
Indeed. I think that people do not like it to change. I remember that 
I suggested to change timestamps to .yy instead of the 
unreadable  yyy, and was told not to, because some people have 
tools which process the output so the format MUST NOT CHANGE. So my 
behavior is now to avoid touching anything in this area.


That somewhat hysterical version of events isn't what I said. Heikki has 
the right idea for backpatching, so let me expand on that rationale, 
with an eye toward whether 9.5 is the right time to deal with this.


Not all software out there will process epoch timestamps with 
milliseconds added as a fraction at the end.  Being able to read an 
epoch time in seconds as an integer is a well defined standard; the 
fraction part is not.


Here's an example of the problem, from a Mac OS X system:

$ date -j -f "%a %b %d %T %Z %Y" "`date`" +%s
1410544903
$ date -r 1410544903
Fri Sep 12 14:01:43 EDT 2014
$ date -r 1410544903.532
usage: date [-jnu] [-d dst] [-r seconds] [-t west] [-v[+|-]val[ymwdHMS]] ...
[-f fmt date | [[[mm]dd]HH]MM[[cc]yy][.ss]] [+format]

The current file format allows any random shell script to use a tool 
like cut to pull out the second resolution timestamp column as an epoch 
integer field, then pass it through even a utility as simple as date to 
reformat that.  And for a lot of people, second resolution is perfectly 
fine anyway.


The change you propose will make that job harder for some people, in 
order to make the job you're interested in easier.  I picked the 
simplest possible example, but there are more.  Whether epoch timestamps 
can have millisecond parts depends on your time library in Java; in 
Python, some behavior depends on whether you have 2.6 or earlier; I don't 
think gnuplot handles millisecond ones at all yet; the list goes on and 
on.  Some people will just have to apply a second split to the timestamp 
strings pgbench outputs, at the period, and use the left side, where right 
now they can just split the whole thing on a space.


What you want to do is actually fine with me--and as far as I know, I'm 
the producer of the most popular pgbench latency parsing script 
around--but it will be a new sort of headache.  I just wanted the 
benefit to outweigh that.  Breaking the existing scripts and burning 
compatibility with simple utilities like date was not worth the tiny 
improvement you wanted in your personal workflow.  That's just not how 
we do things in PostgreSQL.


If there's a good case that the whole format needs to be changed anyway, 
like adding a new field, then we might as well switch to fractional 
epoch timestamps too now though.  When I added timestamps to the latency 
log in 8.3, parsers that handled milliseconds were even more rare.  
Today it's still inconsistent, but the workarounds are good enough to me 
now.  There's a lot more people using things like Python instead of bash 
pipelines here in 2014 too.


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/




Re: [HACKERS] Support for N synchronous standby servers

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 1:13 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 OK. I see your point.

 Now, what about the following assumptions (somewhat restrictions to
 facilitate the user experience for setting syncrep and the
 parametrization of this feature):
 - Nodes are defined within the same set (or group) if they have the
 same priority, aka the same application_name.
 - One node cannot be a part of two sets. That's obvious...

I feel pretty strongly that we should encourage people to use a
different application_name for every server.  The fact that a server
is interchangeable for one purpose does not mean that it's
interchangeable for all purposes; let's try to keep application_name
as the identifier for a server, and design the other facilities we
need around that.

 The current patch has its own merit, but it fails in the case you and
 Heikki are describing: wait for k nodes in set 1 (nodes with lowest
 priority value), l nodes in set 2 (nodes with the 2nd-lowest
 priority value), etc.
 What it does is, if for example we have a set of nodes with priorities
 {0,1,1,2,2,3,3}, backends will wait for flush_position from the first
 s_s_num nodes. By setting s_s_num to 3, we'll wait for {0,1,1}, to 2
 {0,1,1,2}, etc.

 Now what about that: instead of waiting for the nodes in absolute
 order like the way current patch does, let's do it in a relative
 way. By that I mean that a backend waits for flush_position
 confirmation only from *1* node among a set of nodes having the same
 priority. So by using s_s_num = 3, we'll wait for {0, one node with
 1, one node with 2}, and you can guess the rest.

 The point is as well that we can keep s_s_num behavior as it is now:
 - if set at -1, we rely on the default way of doing with s_s_names
 (empty means all nodes async, at least one entry meaning that we need
 to wait for a node)
 - if set at 0, all nodes are forced to be async'd
 - if set at n > 1, we have to wait for one node in each set of the
 N-lowest priority values.
 I'd see enough users happy with those improvements, and that would
 help improving the coverage of test cases that Heikki and you
 envisioned.

Sounds confusing.  I hate to be the guy always suggesting a
mini-language (cf. recent discussion of an expression syntax for
pgbench), but we could do much more powerful and flexible things here
if we had one.  For example, suppose we let each element of
synchronous_standby_names use the constructs (X,Y,Z,...) [meaning one
of the parenthesized servers], N(X,Y,Z,...) [meaning N of the
parenthesized servers].  Then if you want to consider a commit
acknowledged when you have any two of foo, bar, and baz, you can write:

synchronous_standby_names = 2(foo,bar,baz)

And if you want to acknowledge when you've got either foo or both bar
and baz, you can write:

synchronous_standby_names = (foo,2(bar,baz))

And if you want one of foo and bar and one of baz and bletch, you can write:

synchronous_standby_names = 2((foo,bar),(baz,bletch))

The crazy-complicated policy I mentioned upthread would be:

synchronous_standby_names = (a,2((2(b,c),2(d,e)),f))
or (equivalently and simpler)
synchronous_standby_names = (a,3(b,c,f),3(d,e,f))

We could have a rule that we fall back to the next rule in
synchronous_standby_names when the first rule can never be satisfied
by the connected standbys.  For example, if you have foo, bar, and
baz, and you want any two of the three, but wish to prefer waiting for
foo over the others when it's connected, then you could write:

synchronous_standby_names = 2(foo,2(bar,baz)), 2(bar, baz)

If foo disconnects, the first rule can never be met, so we use the
second rule.  It's still 2 out of 3, just as if we'd written
2(foo,bar,baz) but we won't accept an ack from bar and baz as
sufficient unless foo is dead.

The exact syntax here is of course debatable; maybe somebody come up
with something better.  But it doesn't seem like it would be
incredibly painful to implement, and it would give us a lot of
flexibility.
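
To make the semantics concrete, here is a rough sketch in C of how such an expression could be evaluated once parsed into a tree (all names are invented; nothing like this exists in the tree today). A leaf names a standby, and an inner node with quorum N is satisfied when at least N of its children are, so (foo,bar,baz) is quorum 1 and 2(foo,bar,baz) is quorum 2; nesting falls out naturally.

#include <stdbool.h>
#include <string.h>

typedef struct SyncRule
{
	const char *standby;		/* non-NULL for a leaf node */
	int			quorum;			/* used for inner nodes */
	int			nchildren;
	const struct SyncRule **children;
} SyncRule;

static bool
standby_has_acked(const char *name, const char *const *acked, int nacked)
{
	for (int i = 0; i < nacked; i++)
	{
		if (strcmp(acked[i], name) == 0)
			return true;
	}
	return false;
}

static bool
sync_rule_satisfied(const SyncRule *rule, const char *const *acked, int nacked)
{
	/* a leaf is satisfied when that standby has acknowledged the commit */
	if (rule->standby != NULL)
		return standby_has_acked(rule->standby, acked, nacked);

	int			satisfied = 0;

	for (int i = 0; i < rule->nchildren; i++)
	{
		if (sync_rule_satisfied(rule->children[i], acked, nacked))
			satisfied++;
	}
	return satisfied >= rule->quorum;
}

Under that reading, (foo,2(bar,baz)) is a quorum-1 node whose children are the leaf foo and a quorum-2 node over bar and baz, which gives exactly the "foo, or both bar and baz" behaviour described above.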

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Fri, Sep 12, 2014 at 11:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm confused.  Are you proposing to return to sort + de-dup of JSON
 arrays?  Surely that is completely broken.  Arrays are ordered.

 Sorry, my earlier remarks were premature. In fact, that alteration
 only applied to existence, not containment. However, arrays are
 ordered for the purposes of equality, but not containment.

My remarks were also premature, because looking back at the referenced
commit, I see what it removed was a sort and de-dup as a preliminary step
in containment comparisons, not as a generic alteration of array contents.
So that was sane enough, though I concur with Heikki's opinion that it
likely failed to be a performance win.

regards, tom lane




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 11:21 AM, Josh Berkus j...@agliodbs.com wrote:
 # select '[1,1,2]'::jsonb @> '[1,2,2]'::jsonb;
  ?column?
 --
  t
 (1 row)

 But, it's not. Jsonb contains takes care only about length of array.

 OK, now, that's messed up.

To be clear: I don't think that this example is messed up (in
isolation). I think it's the correct behavior. What I find
objectionable is the inconsistency. I believe that this is Alexander's
concern too. Alexander's first example exhibits broken behavior.

-- 
Peter Geoghegan




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 5:28 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 09/12/2014 12:46 AM, Peter Geoghegan wrote:

 On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas robertmh...@gmail.com
 wrote:

 I think I said pretty clearly that it was.


 I agree that you did, but it wasn't clear exactly what factors you
 were asking me to simulate.


 All factors.

 Do you want me to compare the same string a million times in a loop,
 both with a strcoll() and with a memcmp()?


 Yes.

 Should I copy it into a buffer to add a NUL byte?


 Yes.

 Or should it be a new string each time, with a cache miss expected
 some proportion of the time?


 Yes.

 I'm being facetious - it's easy to ask for tests when you're not the one
 running them. But seriously, please do run all the tests that you think
 make sense.

 I'm particularly interested in the worst case. What is the worst case for
 the proposed memcmp() check? Test that. If the worst case regresses
 significantly, then we need to have a discussion of how likely that worst
 case is to happen in real life, what the performance is like in more
 realistic almost-worst-case scenarios, does it need to be tunable, is the
 trade-off worth it, etc. But if the worst case regresses less than, say, 1%,
 and there are some other cases where you get a 300% speed up, then I think
 it's safe to say that the optimization is worth it, without any more testing
 or discussion.

+1 to all that, including the facetious parts.

Based on discussion thus far it seems that there's a possibility that
the trade-off may be different for short strings vs. long strings.  If
the string is small enough to fit in the L1 CPU cache, then it may be
that memcmp() followed by strcoll() is not much more expensive than
strcoll().  That should be easy to figure out: write a standalone C
program that creates a bunch of arbitrary, fairly-short strings, say
32 bytes, in a big array.  Make the strings different near the end,
but not at the beginning.  Then have the program either do strcoll()
on every pair (O(n^2)) or, with a #define, memcmp() followed by
strcoll() on every pair.  It should be easy to see whether the
memcmp() adds 1% or 25% or 100%.

Then, repeat the same thing with strings that are big enough to blow
out the L1 cache, say 1MB in length.  Some intermediate sizes (64kB?)
might be worth testing, too.  Again, it should be easy to see what the
overhead is.  Once we know that, we can make intelligent decisions
about whether this is a good idea or not, and when.  If you attach the
test programs, other people (e.g. me) can also try them on other
systems (e.g. MacOS X) to see whether the characteristics there are
different than what you saw on your system.
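
Something like the following standalone sketch would do (sizes and names are placeholders; build once plain and once with -DMEMCMP_FIRST, and run under a non-C locale):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NSTRINGS	1000
#define STRLEN		32			/* try larger values too; shrink NSTRINGS accordingly */

int
main(void)
{
	char	  **strings = malloc(NSTRINGS * sizeof(char *));
	long		result = 0;
	struct timespec t0, t1;

	setlocale(LC_COLLATE, "");

	/* identical prefixes, differences only in the last few bytes */
	for (int i = 0; i < NSTRINGS; i++)
	{
		strings[i] = malloc(STRLEN + 1);
		memset(strings[i], 'x', STRLEN);
		strings[i][STRLEN] = '\0';
		snprintf(strings[i] + STRLEN - 8, 9, "%08d", i);
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NSTRINGS; i++)
	{
		for (int j = 0; j < NSTRINGS; j++)
		{
#ifdef MEMCMP_FIRST
			/* the proposed cheap equality check before the expensive comparison */
			if (memcmp(strings[i], strings[j], STRLEN) == 0)
			{
				result++;
				continue;
			}
#endif
			result += strcoll(strings[i], strings[j]);
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.3f s (checksum %ld)\n",
		   (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
		   result);
	return 0;
}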

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Support for N synchronous standby servers

2014-09-12 Thread Gregory Smith

On 9/12/14, 2:28 PM, Robert Haas wrote:
I hate to be the guy always suggesting a mini-language (cf. recent 
discussion of an expression syntax for pgbench), but we could do much 
more powerful and flexible things here if we had one. For example, 
suppose we let each element of synchronous_standby_names use the 
constructs (X,Y,Z,...)


While I have my old list history hat on this afternoon, when the 9.1 
deadline was approaching I said that some people were not going to be 
happy until is it safe to commit? calls an arbitrary function that is 
passed the names of all the active servers, and then they could plug 
whatever consensus rule they wanted into there.  And then I said that if 
we actually wanted to ship something, it should be some stupid simple 
thing like just putting a list of servers in synchronous_standby_names 
and proceeding if one is active.  One of those two ideas worked out...


Can you make a case for why it needs to be a mini-language instead of a 
function?


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/




Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 7:22 AM, Andres Freund and...@2ndquadrant.com wrote:
 What I like even less is that pg_control is actually marked as
 DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
 the database was *NOT* shut down peacefully. I don't see active bugs due
 to it besides this, but I think it's likely to either have or create further
 ones.

I agree.  The database clearly isn't shut down at end of recovery;
it's still active and we're still doing things to it.  If we crash at
that point, we need to go back into recovery on restart.  This seems
open and shut, but maybe I'm missing something.  Why shouldn't we fix
*that*?

With regard to your second email, I agree that
ResetUnloggedRelation(UNLOGGED_RELATION_INIT) needs to issue fsyncs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Support for N synchronous standby servers

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 2:44 PM, Gregory Smith gregsmithpg...@gmail.com wrote:
 On 9/12/14, 2:28 PM, Robert Haas wrote:
 I hate to be the guy always suggesting a mini-language (cf. recent
 discussion of an expression syntax for pgbench), but we could do much more
 powerful and flexible things here if we had one. For example, suppose we let
 each element of synchronous_standby_names use the constructs (X,Y,Z,...)

 While I have my old list history hat on this afternoon, when the 9.1
 deadline was approaching I said that some people were not going to be happy
 until is it safe to commit? calls an arbitrary function that is passed the
 names of all the active servers, and then they could plug whatever consensus
 rule they wanted into there.  And then I said that if we actually wanted to
 ship something, it should be some stupid simple thing like just putting a
 list of servers in synchronous_standby_names and proceeding if one is
 active.  One of those two ideas worked out...

 Can you make a case for why it needs to be a mini-language instead of a
 function?

I think so.  If we make it a function, then it's either the kind of
function that you access via pg_proc, or it's the kind that's written
in C and installed by storing a function pointer in a hook variable
from _PG_init().  The first approach is a non-starter because it would
require walsender to be connected to the database where that function
lives, which is a non-starter at least for logical replication where
we need walsender to be connected to the database being replicated.
Even if we found some way around that problem, and I'm not sure there
is one, I suspect the overhead would be pretty high.  The second
approach - a hook that can be accessed directly by loadable modules -
seems like it would work fine; the only problem is that you've
got to write your policy function in C.  But I have no issue with
exposing it that way if someone wants to write a patch.  There is no
joy in getting between the advanced user and his nutty-complicated
sync rep configuration.  However, I suspect that most users will
prefer a more user-friendly interface.
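
For what it's worth, the hook flavour could look roughly like this; the hook type, hook variable and policy function below are invented purely for illustration, and only _PG_init() is the real module entry point:

#include <stdbool.h>

typedef bool (*SyncRepSatisfied_hook_type) (const char *const *acked_standbys,
											int num_acked);

/* hypothetically declared and consulted by core before releasing waiters */
SyncRepSatisfied_hook_type SyncRepSatisfied_hook = NULL;

/* in the loadable module: */
static bool
my_sync_policy(const char *const *acked_standbys, int num_acked)
{
	return num_acked >= 2;		/* any two connected standbys will do */
}

void
_PG_init(void)
{
	SyncRepSatisfied_hook = my_sync_policy;
}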

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote:
 Based on discussion thus far it seems that there's a possibility that
 the trade-off may be different for short strings vs. long strings.  If
 the string is small enough to fit in the L1 CPU cache, then it may be
 that memcmp() followed by strcoll() is not much more expensive than
 strcoll().  That should be easy to figure out: write a standalone C
 program that creates a bunch of arbitrary, fairly-short strings, say
 32 bytes, in a big array.

While I think that's fair, the reason I didn't bother playing tricks
with only doing a (purely) opportunistic memcmp() when the string size
is under (say) CACHE_LINE_SIZE bytes is that in order for it to matter
you'd have to have a use case where the first CACHE_LINE_SIZE of bytes
matched, and the string just happened to be identical in length, but
also ultimately differed at least a good fraction of the time. That
seems like the kind of thing that it's okay to care less about. That
might have been regressed worse than what you've seen already. It's
narrow in a whole new dimension, though. The intersection of that
issue, and the issues exercised by Heikki's existing test case must be
exceedingly rare.

I'm still confused about whether or not we're talking at cross
purposes here, Robert. Are you happy to consider this as a separate
and additional question to the question of what to do in an
abbreviated comparison tie-break? The correlated multiple sort key
attributes case strikes me as very common - it's a nice to have, and
will sometimes offer a nice additional boost. On the other hand, doing
this for abbreviated comparison tie-breakers is more or less
fundamental to the patch. In my cost model, a memcmp() abbreviated key
tie-breaker that works out is equivalent to an abbreviated comparison.
This is a bit of a fudge, but close enough.

BTW, I do appreciate your work on this. I realize that if you didn't
give this patch a fair go, there is a chance that no one else would.

-- 
Peter Geoghegan




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 2:58 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote:
 Based on discussion thus far it seems that there's a possibility that
 the trade-off may be different for short strings vs. long strings.  If
 the string is small enough to fit in the L1 CPU cache, then it may be
 that memcmp() followed by strcoll() is not much more expensive than
 strcoll().  That should be easy to figure out: write a standalone C
 program that creates a bunch of arbitrary, fairly-short strings, say
 32 bytes, in a big array.

 While I think that's fair, the reason I didn't bother playing tricks
 with only doing a (purely) opportunistic memcmp() when the string size
 is under (say) CACHE_LINE_SIZE bytes is that in order for it to matter
 you'd have to have a use case where the first CACHE_LINE_SIZE of bytes
 matched, and the string just happened to be identical in length, but
 also ultimately differed at least a good fraction of the time. That
 seems like the kind of thing that it's okay to care less about. That
 might have been regressed worse than what you've seen already. It's
 narrow in a whole new dimension, though. The intersection of that
 issue, and the issues exercised by Heikki's existing test case must be
 exceedingly rare.

 I'm still confused about whether or not we're talking at cross
 purposes here, Robert. Are you happy to consider this as a separate
 and additional question to the question of what to do in an
 abbreviated comparison tie-break?

I think I've said a few times now that I really want to get this
additional data before forming an opinion.  As a certain Mr. Doyle
writes, It is a capital mistake to theorize before one has data.
Insensibly one begins to twist facts to suit theories, instead of
theories to suit facts.  I can't say it any better than that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 12:02 PM, Robert Haas robertmh...@gmail.com wrote:
 I think I've said a few times now that I really want to get this
 additional data before forming an opinion.  As a certain Mr. Doyle
 writes, It is a capital mistake to theorize before one has data.
 Insensibly one begins to twist facts to suit theories, instead of
 theories to suit facts.  I can't say it any better than that.

Well, in the abbreviated key case we might know with probability
0.9 that the memcmp() == 0 thing will work out. In the
non-abbreviated tie-breaker case, we'll definitely know nothing. That
seems like a pretty fundamental distinction, so I don't think it's
premature to ask you to consider those two questions individually.
Still, maybe it's easier to justify both cases in the same way, if we
can.

-- 
Peter Geoghegan




Re: [HACKERS] documentation update for doc/src/sgml/func.sgml

2014-09-12 Thread Andreas 'ads' Scherbaum

On 08/21/2014 12:35 PM, Fabien COELHO wrote:




I do not understand why the last sentence in the first paragraph
about bitwise ops is put there with rounding issues, which seem
unrelated. It seems to me that it belongs to the second paragraph
which is about bitwise operators.


That's the part which came from Josh Berkus. We discussed this patch
on IRC.


Hmmm. I do think the last sentence belongs to the next paragraph. The
identity of the author does not change my opinion on that point:-) If
you have another argument, maybe.


Attached is an updated version of the patch.



The wikipedia link can be simplified to a much cleaner:

 http://en.wikipedia.org/wiki/IEEE_floating_point#Rounding_rules


It can, but then you always refer to the latest version of the
Wikipedia page, which might or might not be a good idea. The link in
the patch points to the current version from yesterday, no matter how
many changes are introduced afterwards.


I doubt that IEEE floating point rounding rules are likely to change
much, so referencing the latest version is both safe & cleaner. Also,
wikipedia would change its implementation from php to something else
(well, unlikely, probably as unlikely as a change in IEEE fp rounding
rules:-).


It's really not about the IEEE changing something, but about someone
changing the Wikipedia page. The way I linked it makes sure it always
displays the same version of the page.

Of course a general rule how to link to WP would be nice ...


Regards,

--
Andreas 'ads' Scherbaum
German PostgreSQL User Group
European PostgreSQL User Group - Board of Directors
Volunteer Regional Contact, Germany - PostgreSQL Project
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 13c71af..15742f8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -924,6 +924,25 @@
     </tgroup>
    </table>
 
+   <para>
+    For functions like round(), log() and sqrt() which run against
+    either fixed-precision (NUMERIC) or floating-point numbers (e.g. REAL),
+    note that the results of these operations will differ according
+    to the input type due to rounding. This is most observable with
+    round(), which can end up rounding down as well as up for
+    any #.5 value. We use the
+    <a xmlns="http://en.wikipedia.org/w/index.php?title=IEEE_floating_point&oldid=622007055#Rounding_rules">IEEE's rules</a>
+    for rounding floating-point numbers which can be machine-specific.
+  </para>
+
+  <para>
+    The bitwise operators work only on integral data types, whereas
+    the others are available for all numeric data types. The bitwise
+    operators are also available for the bit string types
+    <type>bit</type> and <type>bit varying</type>, as
+    shown in <xref linkend="functions-bit-string-op-table">.
+   </para>
+
    <para>
     <xref linkend="functions-math-random-table"> shows functions for
     generating random numbers.
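
The behaviour the first new paragraph warns about is easy to demonstrate outside the database with a small C program (a sketch; it assumes the platform's default IEEE round-to-nearest-even mode, which is roughly what the float8 path gets via rint(), while NUMERIC rounds halves away from zero):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* rint() honours the current IEEE rounding mode; the default is
	 * round-to-nearest, ties-to-even, so #.5 can go down as well as up */
	printf("rint:  %.0f %.0f %.0f\n", rint(0.5), rint(1.5), rint(2.5));	/* 0 2 2 */

	/* round() always rounds halfway cases away from zero */
	printf("round: %.0f %.0f %.0f\n", round(0.5), round(1.5), round(2.5));	/* 1 2 3 */
	return 0;
}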



Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Josh Berkus
On 09/12/2014 11:38 AM, Peter Geoghegan wrote:
 To be clear: I don't think that this example is messed up (in
 isolation). I think it's the correct behavior. What I find
 objectionable is the inconsistency. I believe that this is Alexander's
 concern too. Alexander's first example exhibits broken behavior.

Hmmm, oh.  Yeah, I see what you mean; PostgreSQL's SQL array behavior is
that @> is true if array A contains all of the elements of array B
regardless of ordering or repetition.

jsonic=# select array[1,2,2] @> array[1,1,2]

;
 ?column?
--
 t

That's consistent with our docs and past behavior.

However, this better become a FAQ item, because it's not necessarily the
behavior that folks used to JSON but not Postgres will expect.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] pgcrypto: PGP signatures

2014-09-12 Thread Jeff Janes
On Fri, Sep 12, 2014 at 8:50 AM, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:

 Marko Tiikkaja wrote:

  On 9/8/14 7:30 PM, Jeff Janes wrote:

  If i understand the sequence here: The current git HEAD is that
  pgp_pub_decrypt would throw an error if given a signed and encrypted
  message, and earlier version of your patch changed that to decrypt the
  message and ignore the signature, and the current version went back to
  throwing an error.
  
  I think I prefer the middle of those behaviors.  The original behavior
  seems like a bug to me, and I don't think we need to be backwards
  compatible with bugs.  Why should a function called decrypt care if
 the
  message is also signed?  That is not its job.
 
  I haven't updated the patch yet because I don't want to waste my
  time going back and forth until we have a consensus, but I think I
  prefer Jeff's suggestion here to make the _decrypt() functions
  ignore signatures.  Does anyone else want to voice their opinion?

 +1 for ignoring sigs.  If somebody want to check sigs, that's a
 separate step.  Maybe we could have an optional boolean flag, default
 false, to enable checking sigs, but that seems material for a future
 patch.

 That said, I do wonder if it's a behavior change with security
 implications: if somebody is relying on the current behavior of throwing
 an error when sigs don't match, they wouldn't be thrilled to hear that
 their security checks now fail to detect a problem because we don't
 verify signatures when decrypting.  In other words, is this established
 practice already?  If not, it's okay; otherwise, hmm.



If it is established practise, I think the only security model that could
be used to justify it is "The bad guys will be nice enough to always sign
their attacks, while the good guys will refrain from signing them."  It is
not clear why the bad guys would cooperate with that.

Anyone who can produce an encrypted and signed attack message could also
produce an encrypted and unsigned one, couldn't they?

Cheers,

Jeff


Re: [HACKERS] pgbench throttling latency limit

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 2:27 PM, Gregory Smith gregsmithpg...@gmail.com wrote:
 If there's a good case that the whole format needs to be changed anyway,
 like adding a new field, then we might as well switch to fractional epoch
 timestamps too now though.  When I added timestamps to the latency log in
 8.3, parsers that handled milliseconds were even more rare.  Today it's
 still inconsistent, but the workarounds are good enough to me now.  There's
 a lot more people using things like Python instead of bash pipelines here in
 2014 too.

+1.  s/\..*// is not an onerous requirement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread Heikki Linnakangas

On 09/02/2014 09:52 AM, Fujii Masao wrote:

[RESULT]
Throughput in the benchmark.

          Multiple   Single
 off      2162.6     2164.5
 on        891.8      895.6
 pglz     1037.2     1042.3
 lz4      1084.7     1091.8
 snappy   1058.4     1073.3


Most of the CPU overhead of writing full pages is because of CRC 
calculation. Compression helps because then you have less data to CRC.


It's worth noting that there are faster CRC implementations out there 
than what we use. The Slicing-by-4 algorithm was discussed years ago, 
but was not deemed worth it back then IIRC because we typically 
calculate CRC over very small chunks of data, and the benefit of 
Slicing-by-4 and many other algorithms only show up when you work on 
larger chunks. But a full-page image is probably large enough to benefit.


What I'm trying to say is that this should be compared with the idea of 
just switching the CRC implementation. That would make the 'on' case 
faster, and and the benefit of compression smaller. I wouldn't be 
surprised if it made the 'on' case faster than compressed cases.


I don't mean that we should abandon this patch - compression makes the 
WAL smaller which has all kinds of other benefits, even if it makes the 
raw TPS throughput of the system worse. But I'm just saying that these 
TPS comparisons should be taken with a grain of salt. We probably should 
consider switching to a faster CRC algorithm again, regardless of what 
we do with compression.


- Heikki





Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Tomas Vondra
On 12.9.2014 18:49, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Attached is the patch split as suggested:

 (a) hashjoin-nbuckets-v14a-size.patch

 * NTUP_PER_BUCKET=1
 * counting buckets towards work_mem
 * changes in ExecChooseHashTableSize (due to the other changes)

 OK, I'm going to work on this one now.  I have some ideas about how to
 simplify this a bit and will post an update in a few hours once I see
 how that pans out.
 
 Here's what I came up with. I believe this has the same semantics as 
 your patch for less code, but please check, because I might be
 wrong. The main simplifications I made were:
 
 (1) In master, we decide whether or not to batch based on the size
 of the data, without knowing the number of buckets. If we want to 
 account for the bucket overhead, that doesn't work. But in your 
 patch, we basically computed the number of buckets we'd need for the 
 single-batch case, then decide whether to do a single batch, and if 
 yes, computed the same values over again with the same formula
 phrased differently. I revised that so that the single-batch case is 
 calculated only once. If everything fits, we're all set, and don't 
 need to recalculate anything.
 
 (2) In your patch, once we decide we need to batch, you set nbatch
 as now, and then check whether the computed value is still adequate
 after subtracting buckets_byte from hash_table_bytes. I revised that
 so that we make the initial batch estimate of dbatch by dividing 
 inner_rel_bytes by hash_table_bytes - bucket_bytes rather than 
 hash_table_bytes. I think that avoids the need to maybe increase 
 nbatch again afterwards.

Yes, I like those changes and I think your reasoning is correct in both
cases. It certainly makes the method shorter and more readable - I was
too stuck in the original logic, so thanks for improving this.

So +1 from me to those changes.
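
For readers following along, here is a rough sketch of the flow described in (1) and (2), with illustrative names and an assumed per-tuple overhead; it is not the committed ExecChooseHashTableSize():

#include <math.h>
#include <stddef.h>

static void
choose_hash_size_sketch(double ntuples, int tupwidth, double hash_table_bytes,
						int *nbuckets, int *nbatch)
{
	double		tupsize = tupwidth + 48;	/* assumed header/palloc overhead */
	double		inner_rel_bytes = ntuples * tupsize;
	/* NTUP_PER_BUCKET = 1: one bucket header (a pointer) per tuple */
	double		bucket_bytes = ntuples * sizeof(void *);

	if (inner_rel_bytes + bucket_bytes <= hash_table_bytes)
	{
		/* (1) single-batch case, computed once; nothing to recalculate */
		*nbuckets = (int) ntuples;
		*nbatch = 1;
		return;
	}

	/*
	 * (2) must batch: size the bucket array for the tuples that fit in
	 * memory, and estimate nbatch against the space left after the bucket
	 * array so the estimate doesn't have to be bumped again afterwards.
	 */
	*nbuckets = (int) (hash_table_bytes / (tupsize + sizeof(void *)));
	bucket_bytes = *nbuckets * (double) sizeof(void *);

	double		dbatch = ceil(inner_rel_bytes /
							  (hash_table_bytes - bucket_bytes));

	*nbatch = 2;
	while (*nbatch < dbatch)	/* at least 2, rounded up to a power of two */
		*nbatch *= 2;
}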

 I'm comfortable with this version if you are, but (maybe as a 
 follow-on commit) I think we could make this even a bit smarter. If 
  inner_rel_bytes + bucket_bytes > hash_table_bytes but
  inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
 the number of buckets by 2x or 4x to make it fit. That would provide
 a bit of additional safety margin against cases where this patch
 might unluckily lose.

I don't think that's a good idea. That essentially says 'we're shooting
for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
batching (because of smaller work_mem) actually significantly improves
performance. If we want to make this reasoning properly, deciding
whether to add batches or reduce buckets, we need a better heuristic -
that's quite complex and I'd expect it to result in a quite complex patch.

So -1 from me to this at this moment (it certainly needs much more
thought than a mere follow-on commit).

regards
Tomas




Re: [HACKERS] Stating the significance of Lehman Yao in the nbtree README

2014-09-12 Thread Kevin Grittner
Peter Geoghegan p...@heroku.com wrote:

 The introductory paragraph of L&Y says the following:

 Our solution compares favorably with earlier solutions in that the
 locking scheme is simpler (no read-locks are used) and only a (small)
 constant number of nodes are locked by any update process at any given
 time.

 They clearly and prominently state that not needing read locks is a
 major advantage of their algorithm, which doesn't quite ring true.

It's been a while since I read that paper, but my recollection is
that they assumed that each process or thread looking at a buffer
would have its own private copy of that buffer, which it could be
sure nobody was changing (even if the master copy somewhere else
was changing).  Locking was only needed to prevent conflicting
writes.  Now, whether it is safe to assume that creating a
process-local buffer and copying to it is cheaper than getting a
lock seems dicey, but that seemed to be the implicit assumption.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgbench throttling latency limit

2014-09-12 Thread Heikki Linnakangas

On 09/12/2014 08:59 PM, Fabien COELHO wrote:



The output would look something like this (modified from the manual's example
by hand, so the numbers don't add up):

0 199 2241 0 1175850568 995598 1020
0 200 2465 0 1175850568 998079 1010
0 201 skipped 1175850569 608 3011
0 202 skipped 1175850569 608 2400
0 203 skipped 1175850569 608 1000
0 204 2513 0 1175850569 608 500
0 205 2038 0 1175850569 2663 500


My 0.02€: ISTM that the number of columns should stay the same whether it
is skipped or not, so the file_no should be kept.


Oh, sorry, I totally agree. I left file_no out by mistake.


Maybe keeping it a
number (-1) would make sense, or just a sign (-), which means no value
with gnuplot for instance. Or skipped.

Basically I would be fine with that, but as I do not use the log file
feature I'm not sure that my opinion should count.

Note that there are also potential issues with the aggregate logging and
the sampling stuff.


Yep.

- Heikki




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread Abhijit Menon-Sen
At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:

 We probably should consider switching to a faster CRC algorithm again,
 regardless of what we do with compression.

As it happens, I'm already working on resurrecting a patch that Andres
posted in 2010 to switch to zlib's faster CRC implementation.

-- Abhijit




CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)

2014-09-12 Thread Heikki Linnakangas

On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:

At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:


We probably should consider switching to a faster CRC algorithm again,
regardless of what we do with compression.


As it happens, I'm already working on resurrecting a patch that Andres
posted in 2010 to switch to zlib's faster CRC implementation.


As it happens, I also wrote an implementation of Slice-by-4 the other 
day :-). Haven't gotten around to post it, but here it is.


What algorithm does zlib use for CRC calculation?

- Heikki

diff --git a/src/include/utils/pg_crc.h b/src/include/utils/pg_crc.h
index 375c405..da1af98 100644
--- a/src/include/utils/pg_crc.h
+++ b/src/include/utils/pg_crc.h
@@ -37,25 +37,60 @@ typedef uint32 pg_crc32;
 /* Finish a CRC calculation */
 #define FIN_CRC32(crc)	((crc) ^= 0xFFFFFFFF)
 
+static inline uint32
+swap(uint32 x)
+{
+	/* XXX: use configure check for this? */
+#if defined(__GNUC__) || defined(__clang__)
+	return __builtin_bswap32(x);
+#else
+	return (x << 24) | ((x << 8) & 0x00FF0000) | ((x >> 8) & 0x0000FF00) | (x >> 24);
+#endif
+}
+
 /* Accumulate some (more) bytes into a CRC */
 #define COMP_CRC32(crc, data, len)	\
 do { \
 	const unsigned char *__data = (const unsigned char *) (data); \
 	uint32		__len = (len); \
+	pg_crc32	__crc = (crc); \
+	\
+	/* Process byte at a time until the data address is aligned */ \
+	while ((((uintptr_t) __data) & 3) != 0 && __len > 0) \
+	{ \
+		int		__tab_index = ((int) ((__crc) >> 24) ^ *__data++) & 0xFF; \
+		(__crc) = pg_crc32_table[0][__tab_index] ^ ((__crc) << 8); \
+		__len--; \
+	} \
+	/* Process 4 bytes a time */ \
+	while (__len >= 4) \
+	{ \
+		uint32 *current = (uint32 *) __data; \
 \
-	while (__len-- > 0) \
+		(__crc) = swap(__crc); \
+		(__crc) ^= *current; \
+		(__crc) = pg_crc32_table[3][ (__crc)        & 0xFF] \
+			  ^ pg_crc32_table[2][((__crc) >> 8 ) & 0xFF] \
+			  ^ pg_crc32_table[1][((__crc) >> 16) & 0xFF] \
+			  ^ pg_crc32_table[0][ (__crc) >> 24]; \
+		__len -= 4; \
+		__data += 4; \
+	} \
+	/* Process the last 0-3 bytes one byte at a time */ \
+	while (__len > 0) \
 	{ \
-		int		__tab_index = ((int) ((crc) >> 24) ^ *__data++) & 0xFF; \
-		(crc) = pg_crc32_table[__tab_index] ^ ((crc) << 8); \
+		int		__tab_index = ((int) ((__crc) >> 24) ^ *__data++) & 0xFF; \
+		(__crc) = pg_crc32_table[0][__tab_index] ^ ((__crc) << 8); \
+		__len--; \
 	} \
+	(crc) = __crc; \
 } while (0)
 
 /* Check for equality of two CRCs */
 #define EQ_CRC32(c1,c2)  ((c1) == (c2))
 
 /* Constant table for CRC calculation */
-extern CRCDLLIMPORT const uint32 pg_crc32_table[];
-
+extern CRCDLLIMPORT const uint32 pg_crc32_table[4][256];
 
 #ifdef PROVIDE_64BIT_CRC
 
diff --git a/src/include/utils/pg_crc_tables.h b/src/include/utils/pg_crc_tables.h
index a487ee4..fcc9a58 100644
--- a/src/include/utils/pg_crc_tables.h
+++ b/src/include/utils/pg_crc_tables.h
@@ -33,7 +33,9 @@
  *	x^32+x^26+x^23+x^22+x^16+x^12+x^11+x^10+x^8+x^7+x^5+x^4+x^2+x+1.
  * (This is the same polynomial used in Ethernet checksums, for instance.)
  */
-const uint32 pg_crc32_table[256] = {
+const uint32 pg_crc32_table[4][256] =
+{
+{
 	0x, 0x77073096, 0xEE0E612C, 0x990951BA,
 	0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
 	0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988,
@@ -98,6 +100,205 @@ const uint32 pg_crc32_table[256] = {
 	0xBAD03605, 0xCDD70693, 0x54DE5729, 0x23D967BF,
 	0xB3667A2E, 0xC4614AB8, 0x5D681B02, 0x2A6F2B94,
 	0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D
+},
+{
+	0x, 0xC951729F, 0x49D3E37F, 0x808291E0,
+	0xF3A08CA3, 0x3AF1FE3C, 0xBA736FDC, 0x73221D43,
+	0x3C301F07, 0xF5616D98, 0x75E3FC78, 0xBCB28EE7,
+	0xCF9093A4, 0x06C1E13B, 0x864370DB, 0x4F120244,
+	0xD41608D9, 0x1D477A46, 0x9DC5EBA6, 0x54949939,
+	0x27B6847A, 0xEEE7F6E5, 0x6E656705, 0xA734159A,
+	0xE82617DE, 0x21776541, 0xA1F5F4A1, 0x68A4863E,
+	0x1B869B7D, 0xD2D7E9E2, 0x52557802, 0x9B040A9D,
+	0xDF2B2124, 0x167A53BB, 0x96F8C25B, 0x5FA9B0C4,
+	0x2C8BAD87, 0xE5DADF18, 0x65584EF8, 0xAC093C67,
+	0xE31B3E23, 0x2A4A4CBC, 0xAAC8DD5C, 0x6399AFC3,
+	0x10BBB280, 0xD9EAC01F, 0x596851FF, 0x90392360,
+	0x0B3D29FD, 0xC26C5B62, 0x42EECA82, 0x8BBFB81D,
+	0xF89DA55E, 0x31CCD7C1, 0xB14E4621, 0x781F34BE,
+	0x370D36FA, 0xFE5C4465, 0x7EDED585, 0xB78FA71A,
+	0xC4ADBA59, 0x0DFCC8C6, 0x8D7E5926, 0x442F2BB9,
+	0x65274409, 0xAC763696, 0x2CF4A776, 0xE5A5D5E9,
+	0x9687C8AA, 0x5FD6BA35, 0xDF542BD5, 0x1605594A,
+	0x59175B0E, 0x90462991, 0x10C4B871, 0xD995CAEE,
+	0xAAB7D7AD, 0x63E6A532, 0xE36434D2, 0x2A35464D,
+	0xB1314CD0, 0x78603E4F, 0xF8E2AFAF, 0x31B3DD30,
+	0x4291C073, 0x8BC0B2EC, 0x0B42230C, 0xC2135193,
+	0x8D0153D7, 0x44502148, 0xC4D2B0A8, 0x0D83C237,
+	0x7EA1DF74, 0xB7F0ADEB, 0x37723C0B, 0xFE234E94,
+	0xBA0C652D, 0x735D17B2, 0xF3DF8652, 0x3A8EF4CD,
+	0x49ACE98E, 0x80FD9B11, 0x007F0AF1, 0xC92E786E,
+	0x863C7A2A, 0x4F6D08B5, 0xCFEF9955, 0x06BEEBCA,
+	0x759CF689, 0xBCCD8416, 0x3C4F15F6, 0xF51E6769,
+	0x6E1A6DF4, 0xA74B1F6B, 0x27C98E8B, 0xEE98FC14,
+	0x9DBAE157, 0x54EB93C8, 0xD4690228, 0x1D3870B7,
+	0x522A72F3, 0x9B7B006C, 

Re: [HACKERS] Stating the significance of Lehman Yao in the nbtree README

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 12:39 PM, Kevin Grittner kgri...@ymail.com wrote:
 It's been a while since I read that paper, but my recollection is
 that they assumed that each process or thread looking at a buffer
 would have its own private copy of that buffer, which it could be
 sure nobody was changing (even if the master copy somewhere else
 was changing).  Locking was only needed to prevent conflicting
 writes.  Now, whether it is safe to assume that creating a
 process-local buffer and copying to it is cheaper than getting a
 lock seems dicey, but that seemed to be the implicit assumption.

That is one way to make reads atomic, but I don't recall any explicit
mention of it. In 1981, I think page sizes were about the same as
today, but 4K was a lot of memory. We could actually do this, with
some work. I think that this has actually been implemented elsewhere,
though. Note that L&Y have practically nothing to say about deletion -
they simply suggest that it be done offline.

It is really useful that we can recover from page splits as and when
problems arise. That's really what I'd like to prominently convey - it
is the whole point of L&Y.

-- 
Peter Geoghegan




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread Ants Aasma
On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I don't mean that we should abandon this patch - compression makes the WAL
 smaller which has all kinds of other benefits, even if it makes the raw TPS
 throughput of the system worse. But I'm just saying that these TPS
 comparisons should be taken with a grain of salt. We probably should
 consider switching to a faster CRC algorithm again, regardless of what we do
 with compression.

CRC is a pretty awfully slow algorithm for checksums. We should
consider switching it out for something more modern. CityHash,
MurmurHash3 and xxhash look like pretty good candidates, being around
an order of magnitude faster than CRC. I'm hoping to investigate
substituting the WAL checksum algorithm in 9.5.

Given the room for improvement in this area I think it would make
sense to just short-circuit the CRC calculations for testing this
patch to see if the performance improvement is due to less data being
checksummed.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)

2014-09-12 Thread Andres Freund
On 2014-09-12 23:03:00 +0300, Heikki Linnakangas wrote:
 On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
 At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:
 
 We probably should consider switching to a faster CRC algorithm again,
 regardless of what we do with compression.
 
 As it happens, I'm already working on resurrecting a patch that Andres
 posted in 2010 to switch to zlib's faster CRC implementation.
 
 As it happens, I also wrote an implementation of Slice-by-4 the other day
 :-). Haven't gotten around to post it, but here it is.
 
 What algorithm does zlib use for CRC calculation?

Also slice-by-4, with a manually unrolled loop doing 32 bytes at once,
using individual slice-by-4's. IIRC I tried removing that, and it slowed
things down overall. What it also did was move the CRC computation to a
function. I'm not sure why I did it that way, but it really might be
beneficial - if you look at profiles today there are sometimes
icache/decoding stalls...

Hm. Let me look:
http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de

Ick, there's quite some debugging leftovers ;)

I think it might be a good idea to also switch the polynomial at the same
time. I really, really think we should, when the hardware supports it, use
the polynomial that's available in SSE4.2. It has similar properties, and
can be implemented in software just the same...
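
For reference, a minimal sketch of what using that instruction looks like; note it computes CRC-32C (the Castagnoli polynomial), i.e. a different polynomial than pg_crc.h uses today, it needs -msse4.2, and a real patch would also want a runtime CPU check plus a software fallback over the same polynomial:

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>

static uint32_t
crc32c_hw(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, p, sizeof(chunk));	/* avoid unaligned access */
		crc = (uint32_t) _mm_crc32_u64(crc, chunk);
		p += 8;
		len -= 8;
	}
	while (len > 0)
	{
		crc = _mm_crc32_u8(crc, *p);
		p++;
		len--;
	}
	return crc;
}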

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 12.9.2014 18:49, Robert Haas wrote:
 On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Attached is the patch split as suggested:

 (a) hashjoin-nbuckets-v14a-size.patch

 * NTUP_PER_BUCKET=1
 * counting buckets towards work_mem
 * changes in ExecChooseHashTableSize (due to the other changes)

 OK, I'm going to work on this one now.  I have some ideas about how to
 simplify this a bit and will post an update in a few hours once I see
 how that pans out.

 Here's what I came up with. I believe this has the same semantics as
 your patch for less code, but please check, because I might be
 wrong. The main simplifications I made were:

 (1) In master, we decide whether or not to batch based on the size
 of the data, without knowing the number of buckets. If we want to
 account for the bucket overhead, that doesn't work. But in your
 patch, we basically computed the number of buckets we'd need for the
 single-batch case, then decide whether to do a single batch, and if
 yes, computed the same values over again with the same formula
 phrased differently. I revised that so that the single-batch case is
 calculated only once. If everything fits, we're all set, and don't
 need to recalculate anything.

 (2) In your patch, once we decide we need to batch, you set nbatch
 as now, and then check whether the computed value is still adequate
 after subtracting buckets_byte from hash_table_bytes. I revised that
 so that we make the initial batch estimate of dbatch by dividing
 inner_rel_bytes by hash_table_bytes - bucket_bytes rather than
 hash_table_bytes. I think that avoids the need to maybe increase
 nbatch again afterwards.

 Yes, I like those changes and I think your reasoning is correct in both
 cases. It certainly makes the method shorter and more readable - I was
 too stuck in the original logic, so thanks for improving this.

 So +1 from me to those changes.

OK, committed.  Please rebase the rest.

 I'm comfortable with this version if you are, but (maybe as a
 follow-on commit) I think we could make this even a bit smarter. If
  inner_rel_bytes + bucket_bytes > hash_table_bytes but
  inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
 the number of buckets by 2x or 4x to make it fit. That would provide
 a bit of additional safety margin against cases where this patch
 might unluckily lose.

 I don't think that's a good idea. That essentially says 'we're shooting
 for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where
 batching (because of smaller work_mem) actually significantly improves
 performance. If we want to make this reasoning properly, deciding
 whether to add batches or reduce buckets, we need a better heuristic -
 that's quite complex and I'd expect it to result in a quite complex patch.

 So -1 from me to this at this moment (it certainly needs much more
 thought than a mere follow-on commit).

OK, but let's discuss further.  You make it sound like treating
NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced.
I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range:
if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2
or NTUP_PER_BUCKET=4 may be perfectly reasonable.

I'm actually quite surprised that you find batching to be a better
strategy than skimping on buckets, because I would have expect the
opposite, almost categorically.  Batching means having to write out
the tuples we can't process right away and read them back in.  If that
involves disk I/O, I think the cost of that I/O is going to be FAR
more than the overhead we would have incurred by searching slightly
longer bucket chains.  If it didn't, then you could've set work_mem
higher and avoided batching in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread Andres Freund
On 2014-09-12 23:17:12 +0300, Ants Aasma wrote:
 On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  I don't mean that we should abandon this patch - compression makes the WAL
  smaller which has all kinds of other benefits, even if it makes the raw TPS
  throughput of the system worse. But I'm just saying that these TPS
  comparisons should be taken with a grain of salt. We probably should
  consider switching to a faster CRC algorithm again, regardless of what we do
  with compression.
 
 CRC is a pretty awfully slow algorithm for checksums. We should
 consider switching it out for something more modern. CityHash,
 MurmurHash3 and xxhash look like pretty good candidates, being around
 an order of magnitude faster than CRC. I'm hoping to investigate
 substituting the WAL checksum algorithm in 9.5.

I think that might not be a bad plan. But it'll involve *far* more
effort and arguing to change to fundamentally different algorithms. So
personally I'd just go with slice-by-4; that's relatively
uncontroversial, I think. Then maybe switch the polynomial so we can use
the CRC32 instruction.
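
For anyone unfamiliar with it, slicing-by-4 is still a table-lookup CRC; it
just consumes four input bytes per iteration using four precomputed tables
instead of one. The sketch below is a generic, self-contained version for a
reflected CRC-32 (function names invented for the example; this is not
pg_crc.c):

#include <stddef.h>
#include <stdint.h>

static uint32_t crc_tab[4][256];

/* Build the classic byte-at-a-time table, then derive three shifted tables. */
static void
crc32_sb4_init(uint32_t poly)       /* e.g. 0xEDB88320 for reflected CRC-32 */
{
    for (int i = 0; i < 256; i++)
    {
        uint32_t    c = (uint32_t) i;

        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ poly : (c >> 1);
        crc_tab[0][i] = c;
    }
    for (int i = 0; i < 256; i++)
    {
        crc_tab[1][i] = (crc_tab[0][i] >> 8) ^ crc_tab[0][crc_tab[0][i] & 0xff];
        crc_tab[2][i] = (crc_tab[1][i] >> 8) ^ crc_tab[0][crc_tab[1][i] & 0xff];
        crc_tab[3][i] = (crc_tab[2][i] >> 8) ^ crc_tab[0][crc_tab[2][i] & 0xff];
    }
}

/* Accumulate a reflected CRC-32 over buf, four bytes per loop iteration. */
static uint32_t
crc32_sb4(uint32_t crc, const unsigned char *buf, size_t len)
{
    while (len >= 4)
    {
        crc ^= (uint32_t) buf[0] |
               ((uint32_t) buf[1] << 8) |
               ((uint32_t) buf[2] << 16) |
               ((uint32_t) buf[3] << 24);
        crc = crc_tab[3][crc & 0xff] ^
              crc_tab[2][(crc >> 8) & 0xff] ^
              crc_tab[1][(crc >> 16) & 0xff] ^
              crc_tab[0][crc >> 24];
        buf += 4;
        len -= 4;
    }
    /* Leftover bytes fall back to the one-table loop. */
    while (len-- > 0)
        crc = (crc >> 8) ^ crc_tab[0][(crc ^ *buf++) & 0xff];
    return crc;
}

Starting with crc = 0xFFFFFFFF and XORing the result with 0xFFFFFFFF gives
the usual CRC-32 value; initializing the tables with the Castagnoli
polynomial instead is the "switch the polynomial" part, since that is the
CRC the SSE4.2 CRC32 instruction computes.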

 Given the room for improvement in this area I think it would make
 sense to just short-circuit the CRC calculations for testing this
 patch to see if the performance improvement is due to less data being
 checksummed.

FWIW, I don't think it's 'bad' that less data provides speedups. I don't
really see a need to factor that out of the benchmarks.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Alexander Korotkov
On Fri, Sep 12, 2014 at 10:38 PM, Peter Geoghegan p...@heroku.com wrote:

 On Fri, Sep 12, 2014 at 11:21 AM, Josh Berkus j...@agliodbs.com wrote:
  # select '[1,1,2]'::jsonb @> '[1,2,2]'::jsonb;
   ?column?
  --
   t
  (1 row)
 
  But it's not. Jsonb contains only takes the length of the array into account.
 
  OK, now, that's messed up.

 To be clear: I don't think that this example is messed up (in
 isolation). I think it's the correct behavior. What I find
 objectionable is the inconsistency. I believe that this is Alexander's
 concern too. Alexander's first example exhibits broken behavior.


Agreed. I just tried to explain how the current behaviour could look to a
user who sees it for the first time.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread Andres Freund
On 2014-09-12 22:38:01 +0300, Heikki Linnakangas wrote:
 It's worth noting that there are faster CRC implementations out there than
 what we use. The Slicing-by-4 algorithm was discussed years ago, but was not
 deemed worth it back then IIRC because we typically calculate CRC over very
 small chunks of data, and the benefit of Slicing-by-4 and many other
 algorithms only shows up when you work on larger chunks. But a full-page
 image is probably large enough to benefit.

I've recently pondered moving things around so the CRC sum can be
computed over the whole data instead of the individual chain elements. I
think, regardless of the checksum algorithm and implementation we end up
with, that might end up as a noticeable benefit.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-12 Thread Heikki Linnakangas

On 09/12/2014 08:52 PM, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote:

It's certainly possible that there is a test case for which Heikki's
approach is superior, but if so we haven't seen it.  And since that
approach is also more complicated, sticking with the simpler
lengths-only approach seems like the way to go.



Huh, OK.  I'm slightly surprised, but that's why we benchmark these things.


The argument for Heikki's patch was never that it would offer better
performance; it's obvious (at least to me) that it won't.


Performance was one argument for sure. It's not hard to come up with a
case where the all-lengths approach is much slower: take a huge array
with, say, a million elements, and fetch the last element in a tight
loop. And do that in a PL/pgSQL function without storing the datum to
disk, so that it doesn't get toasted. Not a very common thing to do in
real life, although something like that might come up if you do a lot of
json processing in PL/pgSQL. But storing offsets makes that case faster.


IOW, something like this:

do $$
declare
  ja jsonb;
  i int4;
begin
  select json_agg(g) into ja from generate_series(1, 10) g;
  for i in 1..10 loop
    perform ja -> 9;
  end loop;
end;
$$;

should perform much better with current git master or my patch than
with the all-lengths patch.
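
The underlying tradeoff is easy to see in miniature: with per-element
offsets, locating the i'th element is a single lookup, while with lengths
only you have to sum everything in front of it. A toy sketch (nothing to do
with the actual JEntry layout, just the access pattern):

#include <stddef.h>
#include <stdint.h>

/* Offsets stored per element: O(1) to find where element i starts. */
static size_t
elem_start_with_offsets(const uint32_t *offsets, size_t i)
{
    return offsets[i];
}

/* Only lengths stored: O(i), sum the lengths of everything before it. */
static size_t
elem_start_with_lengths(const uint32_t *lengths, size_t i)
{
    size_t  start = 0;

    for (size_t j = 0; j < i; j++)
        start += lengths[j];
    return start;
}

That linear walk is what the loop above exercises; the all-lengths format
trades it away for the better toast compressibility discussed earlier in
the thread.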


I'm OK with going for the all-lengths approach anyway; it's simpler, and 
working with huge arrays is hopefully not that common. But it's not a 
completely open-and-shut case.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 However, this better become a FAQ item, because it's not necessarily the
 behavior that folks used to JSON but not Postgres will expect.

No, it's a bug, not a documentation deficiency.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread k...@rice.edu
On Fri, Sep 12, 2014 at 11:17:12PM +0300, Ants Aasma wrote:
 On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  I don't mean that we should abandon this patch - compression makes the WAL
  smaller which has all kinds of other benefits, even if it makes the raw TPS
  throughput of the system worse. But I'm just saying that these TPS
  comparisons should be taken with a grain of salt. We probably should
  consider switching to a faster CRC algorithm again, regardless of what we do
  with compression.
 
 CRC is a pretty awfully slow algorithm for checksums. We should
 consider switching it out for something more modern. CityHash,
 MurmurHash3 and xxhash look like pretty good candidates, being around
 an order of magnitude faster than CRC. I'm hoping to investigate
 substituting the WAL checksum algorithm in 9.5.
 
 Given the room for improvement in this area I think it would make
 sense to just short-circuit the CRC calculations for testing this
 patch to see if the performance improvement is due to less data being
 checksummed.
 
 Regards,
 Ants Aasma

+1 for xxhash -

version    speed on 64-bits    speed on 32-bits
-------    ----------------    ----------------
XXH64      13.8 GB/s           1.9 GB/s
XXH32       6.8 GB/s           6.0 GB/s

Here is a blog about its performance as a hash function:

http://fastcompression.blogspot.com/2014/07/xxhash-wider-64-bits.html
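
For reference, computing either hash over a buffer is a one-liner (this
assumes xxhash.h from the library linked above; the zero seed is an
arbitrary choice for the example):

#include <stdio.h>
#include <string.h>
#include "xxhash.h"

int
main(void)
{
    const char         *buf = "full page image bytes ...";
    unsigned long long  h64 = XXH64(buf, strlen(buf), 0);
    unsigned int        h32 = XXH32(buf, strlen(buf), 0);

    printf("XXH64 = %016llx\nXXH32 = %08x\n", h64, h32);
    return 0;
}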

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

