Re: [HACKERS] ALTER TABLESPACE ... MOVE ALL TO ...

2014-01-18 Thread Robert Haas
On Thu, Jan 16, 2014 at 4:37 PM, Stephen Frost sfr...@snowman.net wrote:
 Greetings,

   It's a day late and I'm a dollar short, but attached is a (very) minor
   patch to allow users to more easily move their various objects from
   one tablespace to another.  Included are docs and a regression test;
   I'm happy to improve on both should folks send me suggestions.

   As we use tablespaces quite a bit, this can be extremely handy for us
   and I expect others will find it useful too.

   Thoughts?

Don't be late next time?

I did look this over and it seems fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-18 Thread Robert Haas
On Thu, Jan 16, 2014 at 3:35 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Makes sense. Can you extract that into a separate patch, please?

 I was wondering if that might cause deadlocks if an existing index is
 changed from unique to non-unique, or vice versa, as the ordering would
 change. But we don't have a DDL command to change that, so the question is
 moot.

It's not hard to imagine someone wanting to add such a DDL command.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-18 Thread Robert Haas
On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer cr...@2ndquadrant.com wrote:
 Anybody who actually uses SHIFT_JIS as an operational encoding, rather
 than as an input/output encoding, is into pain and suffering. Personally
 I'd be quite happy to see it supported as client_encoding, but forbidden
 as a server-side encoding. That's not the case right now - so since we
 support it, we'd better guard against its quirks.

I think that *is* the case right now.  pg_wchar.h sayeth:

/* followings are for client encoding only */
PG_SJIS,						/* Shift JIS (Windows-932) */
PG_BIG5,						/* Big5 (Windows-950) */
PG_GBK,							/* GBK (Windows-936) */

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-18 Thread Robert Haas
On Thu, Jan 16, 2014 at 9:54 AM, Andres Freund and...@2ndquadrant.com wrote:
 Maybe it would be better to get rid of active/in_use and have
 three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
 REPLSLOT_FREE.  Or something like that.

 Hm. Color me unenthusiastic. If you feel strongly I can change it, but
 otherwise not.

I found the active/in_use distinction confusing; I thought one
three-state flag rather than two Booleans might be clearer.  But I
might be able to just suck it up.

  - If there's a coding rule that slot->database can't be changed while
  the slot is active, then the check to make sure that the user isn't
  trying to bind to a slot with a mis-matching database could be done
  before the code described in the previous point, avoiding the need to
  go back and release the resource.
 
  I don't think slot->database should be allowed to change at all...

 Well, it can if the slot is dropped and a new one created.

 Well. That obviously requires the lwlock to be acquired...

Right; the point of this comment originally was that you had some logic
that could be moved earlier to avoid having to undo so much on a
failure.

  - I think the critical section in ReplicationSlotDrop is bogus.  If
  DeleteSlot() fails, we scarcely need to PANIC.  The slot just isn't
  gone.
 
  Well, if delete slot fails, we don't really know at which point it
  failed which means that the on-disk state might not correspond to the
  in-memory state. I don't see a point in adding code trying to handle
  that case correctly...

 Deleting the slot should be an atomic operation.  There's some
 critical point before which the slot will be picked up by recovery and
 after which it won't.  You either did that operation, or not, and can
 adjust the in-memory state accordingly.

 I am not sure I understand that point. We can either update the
 in-memory bit before performing the on-disk operations or
 afterwards. Either way, there's a way to be inconsistent if the disk
 operation fails somewhere in between (it might fail but still have
 deleted the file/directory!). The normal way to handle that in other
 places is PANICing when we don't know so we recover from the on-disk
 state.
 I really don't see the problem here? Code doesn't get more robust by
 doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
 ERROR, often that's not warranted.

People get cranky when the database PANICs because of a filesystem
failure.  We should avoid that, especially when it's trivial to do so.
 The update to shared memory should be done second and should be set
up to be no-fail.
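
To make the ordering concrete, here is a minimal standalone sketch of that
pattern - the fallible filesystem step first, the no-fail in-memory update
second.  The names and the file path are illustrative only, not the actual
slot code:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* stand-in for the slot's shared-memory state */
static bool slot_in_use = true;

static bool
drop_slot_file(const char *path)
{
	/* fallible step first: if this errors out, nothing changed in memory */
	if (unlink(path) != 0)
	{
		perror("unlink");		/* a plain ERROR is enough; no PANIC needed */
		return false;
	}

	/* no-fail step second: only now mark the in-memory slot as free */
	slot_in_use = false;
	return true;
}

int
main(void)
{
	if (!drop_slot_file("/tmp/example_slot"))
		return 1;
	printf("slot dropped, in_use=%d\n", slot_in_use);
	return 0;
}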

  The slot details get updates by the respective replication code. For
  streaming rep, that should happen via reply and feedback
  messages. For changeset extraction it happens when
  LogicalConfirmReceivedLocation() is called; the walsender interface
  does that using reply messages, the SQL interface calls it when
  finished (unless you use the _peek_ functions).

 Right, but where is this code?  I don't see this updating the reply
 and feedback message processing code to touch slots.  Did I miss that?

 It's in wal_decoding: logical changeset extraction walsender interface
 currently :(. Splitting the streaming replication part of that patch off
 isn't easy...

Ack.  I was hoping to work through these patches one at a time, but
that's not going to work if they are interdependent to that degree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] currawong is not a happy animal

2014-01-18 Thread Andrew Dunstan


On 01/18/2014 02:42 AM, Amit Kapila wrote:

On Sat, Jan 18, 2014 at 2:48 AM, Andrew Dunstan and...@dunslane.net wrote:

On 01/17/2014 03:15 PM, Tom Lane wrote:


The other possibility I was contemplating is that "export a const
variable" doesn't actually work for some reason.  We're not in the habit
of doing that elsewhere, so I don't find that theory outlandish.  Perhaps
it could be fixed by adding PGDLLIMPORT to the extern, but on the whole
I'd rather avoid the technique altogether.

Is there any specific reason why adding PGDLLIMPORT for exporting
const variables is not good, when we are already doing so for other variables
like InterruptHoldoffCount, CritSectionCount?

On searching in the code, I found that for a few const variables we do
extern PGDLLIMPORT. For example:
extern PGDLLIMPORT const int NumScanKeywords;
extern PGDLLIMPORT const char *debug_query_string;



[...]

After adding PGDLLIMPORT to variables (ImmediateInterruptOK,
MyBgworkerEntry, shm_mq_minimum_size) both the tests defined in
contrib module passed.


Attached patch fixes the problems related to test_shm_mq for me.
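
(Presumably the fix boils down to adding PGDLLIMPORT to the extern
declarations in the relevant headers, along the lines of the sketch below.
The types shown are those I'd expect from miscadmin.h, bgworker.h and
shm_mq.h; this is just the pattern, not the attached patch itself.)

extern PGDLLIMPORT volatile bool ImmediateInterruptOK;
extern PGDLLIMPORT BackgroundWorker *MyBgworkerEntry;
extern PGDLLIMPORT const Size shm_mq_minimum_size;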



OK, I'll apply this later today unless there are strong objections. That 
would get the buildfarm green again and we could argue about the rest 
later if necessary.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE TABLESPACE WITH

2014-01-18 Thread Michael Paquier
On Thu, Jan 16, 2014 at 9:03 AM, Vik Fearing vik.fear...@dalibo.com wrote:
 New patch attached, with regression tests.

Thanks for the new version, I have spent some time looking at it:
- Patch compiles without warnings.
- Done some manual testing with CREATE/ALTER TABLESPACE WITH and
checked pg_tablespace, it worked fine.
- However, the regression tests are failing because
src/test/regress/output/tablespace.source has incorrect output; it
is necessary to replace this:
/home/vik/postgresql/9.4/git/src/test/regress/testtablespace
with this:
@testtablespace@
This is something I have actually fixed in the attached patch.

So, except for the regression output problem, I think that the code is
clean, so I am marking it as Ready for Committer in the commitfest app.
Thanks,
-- 
Michael
diff --git a/doc/src/sgml/ref/create_tablespace.sgml b/doc/src/sgml/ref/create_tablespace.sgml
index 89c8907..04c5fb8 100644
--- a/doc/src/sgml/ref/create_tablespace.sgml
+++ b/doc/src/sgml/ref/create_tablespace.sgml
@@ -21,7 +21,10 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-CREATE TABLESPACE <replaceable class="parameter">tablespace_name</replaceable> [ OWNER <replaceable class="parameter">user_name</replaceable> ] LOCATION '<replaceable class="parameter">directory</replaceable>'
+CREATE TABLESPACE <replaceable class="parameter">tablespace_name</replaceable>
+[ OWNER <replaceable class="parameter">user_name</replaceable> ]
+LOCATION '<replaceable class="parameter">directory</replaceable>'
+[ WITH ( <replaceable class="PARAMETER">tablespace_option</replaceable> = <replaceable class="PARAMETER">value</replaceable> [, ... ] ) ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -87,6 +90,24 @@ CREATE TABLESPACE <replaceable class="parameter">tablespace_name</replaceable> [
    </para>
   </listitem>
  </varlistentry>
+
+ <varlistentry>
+  <term><replaceable class="parameter">tablespace_option</replaceable></term>
+  <listitem>
+   <para>
+    A tablespace parameter to be set or reset.  Currently, the only
+    available parameters are <varname>seq_page_cost</> and
+    <varname>random_page_cost</>.  Setting either value for a particular
+    tablespace will override the planner's usual estimate of the cost of
+    reading pages from tables in that tablespace, as established by
+    the configuration parameters of the same name (see
+    <xref linkend="guc-seq-page-cost">,
+    <xref linkend="guc-random-page-cost">).  This may be useful if one
+    tablespace is located on a disk which is faster or slower than the
+    remainder of the I/O subsystem.
+   </para>
+  </listitem>
+ </varlistentry>
   </variablelist>
  </refsect1>
 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 07f5221..c7aedc0 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -234,6 +234,7 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 	Oid			tablespaceoid;
 	char	   *location;
 	Oid			ownerId;
+	Datum		newOptions;
 
 	/* Must be super user */
 	if (!superuser())
@@ -317,7 +318,16 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 	values[Anum_pg_tablespace_spcowner - 1] =
 		ObjectIdGetDatum(ownerId);
 	nulls[Anum_pg_tablespace_spcacl - 1] = true;
-	nulls[Anum_pg_tablespace_spcoptions - 1] = true;
+
+	/* Generate new proposed spcoptions (text array) */
+	newOptions = transformRelOptions((Datum) 0,
+									 stmt->options,
+	 NULL, NULL, false, false);
+	(void) tablespace_reloptions(newOptions, true);
+	if (newOptions != (Datum) 0)
+		values[Anum_pg_tablespace_spcoptions - 1] = newOptions;
+	else
+		nulls[Anum_pg_tablespace_spcoptions - 1] = true;
 
 	tuple = heap_form_tuple(rel->rd_att, values, nulls);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index fb4ce2c..506ceb8 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3370,6 +3370,7 @@ _copyCreateTableSpaceStmt(const CreateTableSpaceStmt *from)
 	COPY_STRING_FIELD(tablespacename);
 	COPY_STRING_FIELD(owner);
 	COPY_STRING_FIELD(location);
+	COPY_NODE_FIELD(options);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index ccf7267..ae68b86 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1610,6 +1610,7 @@ _equalCreateTableSpaceStmt(const CreateTableSpaceStmt *a, const CreateTableSpace
 	COMPARE_STRING_FIELD(tablespacename);
 	COMPARE_STRING_FIELD(owner);
 	COMPARE_STRING_FIELD(location);
+	COMPARE_NODE_FIELD(options);
 
 	return true;
 }
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index f0b9507..9b76ac9 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3588,12 +3588,13 @@ opt_procedural:
  *
  */
 
-CreateTableSpaceStmt: CREATE TABLESPACE name OptTableSpaceOwner LOCATION Sconst
+CreateTableSpaceStmt: CREATE TABLESPACE name OptTableSpaceOwner LOCATION Sconst opt_reloptions
 

Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-18 Thread Florian Pflug
On Jan18, 2014, at 06:15 , David Rowley dgrowle...@gmail.com wrote:
 On Sat, Jan 18, 2014 at 2:20 PM, Florian Pflug f...@phlo.org wrote:
 On Jan17, 2014, at 23:34 , David Rowley dgrowle...@gmail.com wrote:
 The test turned out to become:
  if (state->expectedScale == -1)
      state->expectedScale = X.dscale;
  else if (state->expectedScale != X.dscale)
      state->expectedScale = -2;
 
 In do_numeric_accum, then when we do the inverse I just check if
 expectedScale < 0 then return NULL.
 
 Ok, so this will rescan if and only if the dscales of all inputs match.
 I still think that's overly pessimistic - we've only got a problem when we
 removed the input with the largest dscale, no? So my suggestion would be
 
 [snipped]
 
 I'd think that this avoids more restarts with about the same effort,
 but I haven't tried this though, so maybe I'm missing something.
 
 This is not quite right as it means if all the values are the same then
 we reject inverse transitions since state->maxScale will always be equal
 to X.dscale.
 But you are right about the overly strict code I've put in; we should allow
 values with less than the maximum dscale to be unaggregated without
 complaint. To implement this I needed a maxScaleCounter variable too, so we
 only reject when the maxScaleCounter gets back to 0 again.

Ups, sorry, yeah. Sounds sensible.
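
(For illustration, a minimal standalone sketch of that counting scheme; the
struct and function names are invented for the example and are not the
actual patch code:)

#include <stdbool.h>

typedef struct ScaleTrackerSketch
{
	int		maxScale;		/* largest dscale among the current inputs */
	int		maxScaleCount;	/* how many current inputs carry that dscale */
} ScaleTrackerSketch;

/* forward transition: note the dscale of a newly added input */
static void
track_scale_add(ScaleTrackerSketch *state, int dscale)
{
	if (state->maxScaleCount == 0 || dscale > state->maxScale)
	{
		state->maxScale = dscale;
		state->maxScaleCount = 1;
	}
	else if (dscale == state->maxScale)
		state->maxScaleCount++;
}

/*
 * inverse transition: returns false when we must give up and rescan,
 * i.e. when the last input carrying the maximum dscale is removed.
 */
static bool
track_scale_remove(ScaleTrackerSketch *state, int dscale)
{
	if (dscale == state->maxScale && --state->maxScaleCount == 0)
		return false;
	return true;
}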

BTW, this made me realize that MIN and MAX currently have the same issue -
they'll rescan if the inputs are all equal. We could avoid that by doing what
you did with dscale - track the number of times we've seen the maximum. I
wonder if that would be worth it - it would, unfortunately, require us to
use state type internal there too, and hence to add final functions for all
the MIN/MAX aggregates. But that seems excessive. So for now, let's just
live with that.

If we really *do* want to optimize this case, we could
come to it from a completely different angle. Aggregates already special-case
MIN and MAX to be able to use an index to evaluate SELECT MAX(c) FROM t.
If we provided a way for the transition function to call the sort operator
specified by SORTOP in CREATE AGGREGATE, one generic triple of forward and
inverse transition function and final function would work for all the
MIN and MAX aggregates. But that's material for a separate patch for 9.5.

 Note that after this fix the results for my quick benchmark now look like:
 
 create table num (num numeric(10,2) not null);
 insert into num (num) select * from generate_series(1,2);
 select sum(num) over(order by num rows between current row and unbounded 
 following) from num; -- 113 ms
 select sum(num / 10) over(order by num rows between current row and unbounded 
 following) from num; -- 156ms
 select sum(num / 1) over(order by num rows between current row and unbounded 
 following) from num; -- 144 ms
 
 So it seems to be much less prone to falling back to brute force forward
 transitions. 
 It also seems the / 10 version must have had to previously do 1 brute
 force rescan but now it looks like it can do it in 1 scan.
 
 I'm not set on it, and I'm willing to try the lazy zeroing of the scale
 tracker array, but I'm just not quite sure what extra real use cases it
 covers that the one above does not. Perhaps my way won't perform inverse
 transitions if the query did sum(numericCol / 10) or so.
 
 Dunno how many people SUM over numerics with different dscales. Easily
 possible that it's few enough to not be worth fussing over.
 
 Going by Tom's comments in the post above this is possible just by having an
 unconstrained numeric column, but I guess there's still a good chance that
 even those unconstrained numbers have the same scale or at least the scale
 will likely not vary wildly enough to make us have to perform brute force
 forward transitions for each row.

Yeah, I'm convinced by now that your approach is the right trade-off there.
Those who do have values with wildly different dscales in their columns can
always add a cast to normalize them, if they experience a lot of restarts.

So let's just add a sentence or two to the SUM(numeric) documentation about
this, and be done.

 * build_aggregate_fnexprs() should allow NULL to be passed for 
 invtransfn_oid,
  I think. I don't quite like that we construct that even for plain aggregates,
  and avoiding that requires just an additional if.
 
 I'm not quite sure what you mean on this. It passes InvalidOid in normal
 aggregate calls (search for: InvalidOid, /* invtrans is not needed here */)
 and only looks up the function in build_aggregate_fnexprs if
 (OidIsValid(invtransfn_oid)) is true. I'm not sure how this can be improved
 since that function is used for window aggregates and normal aggregates.

I was thinking about checking for **invtransfnexpr = NULL, and not assigning
if it is. But on second thought, you're right - the additional variable doesn't
really hurt. So let's leave it as it is.

 * Don't we need to check for volatile function in the filter 

[HACKERS] ALTER SYSTEM SET typos and fix for temporary file name management

2014-01-18 Thread Michael Paquier
Hi all,

After going through commit 65d6e4c (introducing ALTER SYSTEM SET), I
noticed a couple of typos as well as (I think) a weird way of
using the temporary auto-configuration file name postgresql.auto.conf.temp
in two different places, resulting in the attached patch.

It might be overkill to use a dedicated variable for the temporary
auto-configuration file (here PG_AUTOCONF_FILENAME_TEMP), but I found
the way things are currently done on the master branch somewhat odd. IMO,
it would reduce the possibility of bugs to have everything managed with a
single variable.
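
(In other words, something along these lines in pg_config_manual.h.
PG_AUTOCONF_FILENAME already exists there; the ".temp" suffix below is an
assumption based on the name currently hard-coded, not a line taken from the
attached patch:)

#define PG_AUTOCONF_FILENAME		"postgresql.auto.conf"
#define PG_AUTOCONF_FILENAME_TEMP	PG_AUTOCONF_FILENAME ".temp"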

Typos at least should be fixed :)
Regards,
-- 
Michael
commit 9886e788b27c8fce895ccea4c83ddd223bc2
Author: Michael Paquier mich...@otacoo.com
Date:   Sat Jan 18 13:38:59 2014 +0900

Fix typos and temporary file management in ALTER SYSTEM

Using a temporary file uniquely defined in a single place reduces the
occurrence of bugs related to it. Some typos are fixed at the same time.

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index fc35f5b..a31bcdf 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -834,11 +834,10 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces)
 
 		/* skip auto conf temporary file */
 		if (strncmp(de->d_name,
-					PG_AUTOCONF_FILENAME ".temp",
-					sizeof(PG_AUTOCONF_FILENAME) + 4) == 0)
+					PG_AUTOCONF_FILENAME_TEMP,
+					sizeof(PG_AUTOCONF_FILENAME_TEMP)) == 0)
 			continue;
 
-
 		/*
 		 * If there's a backup_label file, it belongs to a backup started by
 		 * the user with pg_start_backup(). It is *not* correct for this
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1217098..7da52bf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -6457,9 +6457,9 @@ flatten_set_variable_args(const char *name, List *args)
 }
 
 /*
- * Writes updated configuration parameter values into
- * postgresql.auto.conf.temp file. It traverses the list of parameters
- * and quote the string values before writing them to temporaray file.
+ * Write updated configuration parameter values into
+ * file PG_AUTOCONF_FILENAME_TEMP. It traverses the list of parameters
+ * and quotes the string values before writing them to temporary file.
  */
 static void
 write_auto_conf_file(int fd, const char *filename, ConfigVariable **head_p)
@@ -6468,8 +6468,8 @@ write_auto_conf_file(int fd, const char *filename, ConfigVariable **head_p)
 	StringInfoData buf;
 
 	initStringInfo(&buf);
-	appendStringInfoString(&buf, "# Do not edit this file manually! \n");
-	appendStringInfoString(&buf, "# It will be overwritten by ALTER SYSTEM command. \n");
+	appendStringInfoString(&buf, "# Do not edit this file manually!\n");
+	appendStringInfoString(&buf, "# It will be overwritten by ALTER SYSTEM command.\n");
 
 	/*
 	 * write the file header message before contents, so that if there is no
@@ -6596,7 +6596,7 @@ replace_auto_config_value(ConfigVariable **head_p, ConfigVariable **tail_p,
  *
  * The configuration parameters are written to a temporary
  * file then renamed to the final name. The template for the
- * temporary file is postgresql.auto.conf.temp.
+ * temporary file is PG_AUTOCONF_FILENAME_TEMP.
  *
  * An LWLock is used to serialize writing to the same file.
  *
@@ -6662,16 +6662,16 @@ AlterSystemSetConfigFile(AlterSystemStmt * altersysstmt)
 		ereport(ERROR,
 				(errmsg("invalid value for parameter \"%s\": \"%s\"", name, value)));
 
-
 	/*
 	 * Use data directory as reference path for postgresql.auto.conf and it's
-	 * corresponding temp file
+	 * corresponding temp file.
 	 */
-	join_path_components(AutoConfFileName, data_directory, PG_AUTOCONF_FILENAME);
+	join_path_components(AutoConfFileName, data_directory,
+		 PG_AUTOCONF_FILENAME);
 	canonicalize_path(AutoConfFileName);
-	snprintf(AutoConfTmpFileName, sizeof(AutoConfTmpFileName), "%s.%s",
-			 AutoConfFileName,
-			 "temp");
+	join_path_components(AutoConfTmpFileName, data_directory,
+		 PG_AUTOCONF_FILENAME_TEMP);
+	canonicalize_path(AutoConfTmpFileName);
 
 	/*
 	 * one backend is allowed to operate on postgresql.auto.conf file, to
@@ -6713,7 +6713,7 @@ AlterSystemSetConfigFile(AlterSystemStmt * altersysstmt)
 		 */
 		replace_auto_config_value(head, tail, AutoConfFileName, name, value);
 
-		/* Write and sync the New contents to postgresql.auto.conf.temp file */
+		/* Write and sync the New contents to file PG_AUTOCONF_FILENAME_TEMP */
 		write_auto_conf_file(Tmpfd, AutoConfTmpFileName, head);
 
 		close(Tmpfd);
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 20c5ff0..a4c2e3f 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -304,6 +304,12 @@
 /*
  * Automatic configuration file name for ALTER SYSTEM.
  * This file will be used to store values of configuration parameters
- * set by ALTER SYSTEM command
+ * set by ALTER SYSTEM command.
  */
 #define 

Re: [HACKERS] Deprecations in authentication

2014-01-18 Thread Andrew Dunstan


On 01/16/2014 08:01 AM, Magnus Hagander wrote:


On Wed, Jan 15, 2014 at 6:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:


Magnus Hagander mag...@hagander.net
writes:
 One thing I noticed - in MSVC, the config parameter krb5 (equivalent of
 the removed --with-krb5) enabled *both* krb5 and gssapi, and there is no
 separate config parameter for gssapi. Do we want to rename that one to
 gss, or do we want to keep it as krb5? Renaming it would break
 otherwise working environments, but it's kind of weird to leave it...

+1 for renaming --- anybody who's building with krb5 and expecting to,
you know, actually *get* krb5 would probably rather find out about this
change at build time instead of down the road a ways.

A compromise position would be to introduce a gss parameter while leaving
krb5 in place as a deprecated (perhaps undocumented?) synonym for it.
But I think that's basically confusing.


Yeah, I'm not sure it actually helps much.


Andrew - is this going to cause any issues wrt the buildfarm, by any 
chance?




None of my Windows buildfarm members builds with krb5. Mastodon does, 
although it seems to have gone quiet for 16 days (Dave - might be worth 
a check). Probably the result of renaming krb5 would be just that the 
build would proceed without it. From memory I don't think the config 
settings are sanity checked.


(We need some more, and more modern, Windows buildfarm members.)

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-18 Thread Craig Ringer
On 01/18/2014 09:31 PM, Robert Haas wrote:
 On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer cr...@2ndquadrant.com wrote:
 Anybody who actually uses SHIFT_JIS as an operational encoding, rather
 than as an input/output encoding, is into pain and suffering. Personally
 I'd be quite happy to see it supported as client_encoding, but forbidden
 as a server-side encoding. That's not the case right now - so since we
 support it, we'd better guard against its quirks.
 
 I think that *is* the case right now.  pg_wchar.h sayeth:
 
 /* followings are for client encoding only */
 PG_SJIS,						/* Shift JIS (Windows-932) */
 PG_BIG5,						/* Big5 (Windows-950) */
 PG_GBK,							/* GBK (Windows-936) */

Perfect - that makes ASCII-only just fine, IMO.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] currawong is not a happy animal

2014-01-18 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 Is there any specific reason why adding PGDLLIMPORT for exporting
 const variables is not good, when we are already doing for other variables
 like InterruptHoldoffCount, CritSectionCount?

 On searching in code, I found that for few const variables we do
 extern PGDLLIMPORT. For example:
 extern PGDLLIMPORT const int NumScanKeywords;

OK, that one is a direct counterexample to my worry.

I'm still unconvinced that having a contrib module checking stuff based
on the size of a struct it's not supposed to access represents a sane
division of responsibilities; but let's fix the build problem now and
then Robert can contemplate that question at his leisure.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer cr...@2ndquadrant.com wrote:
 Anybody who actually uses SHIFT_JIS as an operational encoding, rather
 than as an input/output encoding, is into pain and suffering. Personally
 I'd be quite happy to see it supported as client_encoding, but forbidden
 as a server-side encoding. That's not the case right now - so since we
 support it, we'd better guard against its quirks.

 I think that *is* the case right now.

SHIFT_JIS is not and never will be allowed as a server encoding,
precisely because it has multi-byte characters of which some bytes could
be taken for ASCII.  The same is true of our other client-only encodings.
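
A minimal illustration of the quirk being guarded against: the second byte of
some Shift-JIS characters is itself an ASCII code.  The bytes below encode
U+8868, whose trailing byte 0x5C is the ASCII backslash:

#include <stdio.h>

int
main(void)
{
	/* one Shift-JIS character (U+8868): 0x95 0x5C */
	const unsigned char sjis[] = {0x95, 0x5C, 0x00};
	const unsigned char *p;

	for (p = sjis; *p != 0; p++)
		printf("0x%02X%s\n", (unsigned) *p,
			   *p == 0x5C ? "  <- would be read as '\\' by a naive scanner" : "");
	return 0;
}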

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-01-18 Thread Jeremy Harris

On 13/01/14 18:01, Alexander Korotkov wrote:

Thanks. It's included into attached version of patch. As well as estimation
improvements, more comments and regression tests fix.


Would it be possible to totally separate the two sets of sort-keys,
only giving the non-index set to the tuplesort?  At present tuplesort
will, when it has a group to sort, make wasted compares on the
indexed set of keys before starting on the non-indexed set.
--
Cheers,
  Jeremy



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Make various variables read-only (const)

2014-01-18 Thread Tom Lane
I wrote:
 [ speculation about refactoring the API of transformRelOptions ]

This morning I remembered that there's another patch in the queue with
an interest in the API and behavior of transformRelOptions:
https://commitfest.postgresql.org/action/patch_view?id=1318

While I think that one has no chance of getting committed in exactly
the submitted form, some descendant patch may well make it in; if we do
anything to transformRelOptions right now it'll create merge issues for
any work in that area.

Moreover, the amount of .data space that'd be saved by twiddling
transformRelOptions' API is negligible, so it's not really that
interesting for the purposes of this patch.  So I think the right
thing to do for now is just drop the relevant parts of this patch.
We can revisit the issue, if still needed, after the extension-options
dust settles.

I'll keep looking at the rest of this patch.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-01-18 Thread Marti Raudsepp
Hi,

There's another small regression with this patch when used with expensive
comparison functions, such as long text fields.

If we go through all this trouble in cmpSortSkipCols to prove that the
first N sortkeys are equal, it would be nice if Tuplesort could skip their
comparisons entirely; that's another nice optimization this patch can
provide.

I've implemented that in the attached patch, which applies on top of your
partial-sort-5.patch

Should the Sort Key field in EXPLAIN output be changed as well? I'd say
no, I think that makes the partial sort steps harder to read.

Generate test data:
create table longtext as select (select repeat('a', 1000*100)) a,
generate_series(1,1000) i;
create index on longtext(a);

Unpatched (using your original partial-sort-5.patch):
=# explain analyze select * from longtext order by a, i limit 10;
 Limit  (cost=2.34..19.26 rows=10 width=1160) (actual time=13477.739..13477.756
rows=10 loops=1)
   ->  Partial sort  (cost=2.34..1694.15 rows=1000 width=1160) (actual time=
13477.737..13477.742 rows=10 loops=1)
 Sort Key: a, i
 Presorted Key: a
 Sort Method: top-N heapsort  Memory: 45kB
 ->  Index Scan using longtext_a_idx on longtext  (cost=0.65..1691.65
rows=1000 width=1160) (actual time=0.015..2.364 rows=1000 loops=1)
 Total runtime: 13478.158 ms
(7 rows)

=# set enable_indexscan=off;
=# explain analyze select * from longtext order by a, i limit 10;
 Limit  (cost=198.61..198.63 rows=10 width=1160) (actual
time=6970.439..6970.458 rows=10 loops=1)
   ->  Sort  (cost=198.61..201.11 rows=1000 width=1160) (actual
time=6970.438..6970.444 rows=10 loops=1)
 Sort Key: a, i
 Sort Method: top-N heapsort  Memory: 45kB
 ->  Seq Scan on longtext  (cost=0.00..177.00 rows=1000 width=1160)
(actual time=0.007..1.763 rows=1000 loops=1)
 Total runtime: 6970.491 ms

Patched:
=# explain analyze select * from longtext order by a, i ;
 Partial sort  (cost=2.34..1694.15 rows=1000 width=1160) (actual
time=0.024..4.603 rows=1000 loops=1)
   Sort Key: a, i
   Presorted Key: a
   Sort Method: quicksort  Memory: 27kB
   ->  Index Scan using longtext_a_idx on longtext  (cost=0.65..1691.65
rows=1000 width=1160) (actual time=0.013..2.094 rows=1000 loops=1)
 Total runtime: 5.418 ms

Regards,
Marti
From fbc6c31528018bca64dc54f65e1cd292f8de482a Mon Sep 17 00:00:00 2001
From: Marti Raudsepp ma...@juffo.org
Date: Sat, 18 Jan 2014 19:16:15 +0200
Subject: [PATCH] Batched sort: skip comparisons for known-equal columns

---
 src/backend/executor/nodeSort.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index cf1f79e..5abda1d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -125,10 +125,10 @@ ExecSort(SortState *node)
 	{
 		tuplesortstate = tuplesort_begin_heap(tupDesc,
 											  plannode->numCols - skipCols,
-											  &(plannode->sortColIdx)[skipCols],
-											  plannode->sortOperators,
-											  plannode->collations,
-											  plannode->nullsFirst,
+											  &(plannode->sortColIdx[skipCols]),
+											  &(plannode->sortOperators[skipCols]),
+											  &(plannode->collations[skipCols]),
+											  &(plannode->nullsFirst[skipCols]),
 											  work_mem,
 											  node->randomAccess);
 		if (node->bounded)
-- 
1.8.5.3


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-01-18 Thread Marti Raudsepp
Funny, I just wrote a patch to do that some minutes ago (didn't see your
email yet).

http://www.postgresql.org/message-id/CABRT9RCK=wmFUYZdqU_+MOFW5PDevLxJmZ5B=etjjnubvya...@mail.gmail.com

Regards,
Marti



On Sat, Jan 18, 2014 at 7:10 PM, Jeremy Harris j...@wizmail.org wrote:

 On 13/01/14 18:01, Alexander Korotkov wrote:

  Thanks. It's included into attached version of patch. As well as
 estimation
 improvements, more comments and regression tests fix.


 Would it be possible to totally separate the two sets of sort-keys,
 only giving the non-index set to the tuplesort?  At present tuplesort
 will, when it has a group to sort, make wasted compares on the
 indexed set of keys before starting on the non-indexed set.
 --
 Cheers,
   Jeremy




 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] new json funcs

2014-01-18 Thread Marko Tiikkaja

On 2014-01-17 03:08, Andrew Dunstan wrote:

In case anyone feels like really nitpicking, this version fixes some
pgindent weirdness due to an outdated typedef list in the previous version.


Is it possible for you to generate a diff that doesn't have all these 
unrelated changes in it (from a pgindent run, I presume)?  I just read 
through three pagefuls and I didn't see any relevant changes, but I'm 
not sure I want to keep doing that for the rest of the patch.




Regards,
Marko Tiikkaja


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Make various variables read-only (const)

2014-01-18 Thread Tom Lane
Oskari Saarenmaa o...@ohmu.fi writes:
 On Sun, Dec 22, 2013 at 09:43:57PM -0500, Robert Haas wrote:
 -#define DEF_ENC2NAME(name, codepage) { #name, PG_##name }
 +/* The extra NUL-terminator will make sure a warning is raised if the
 + * storage space for name is too small, otherwise when strlen(name) ==
 + * sizeof(pg_enc2name.name) the NUL-terminator would be silently dropped.
 + */
 +#define DEF_ENC2NAME(name, codepage) { #name "\0", PG_##name }
 
 - The above hunk is not related to the primary purpose of this patch.

 It sort-of is.  Without fixed size char-arrays it's not possible to move
 everything to .rodata, but fixed size char-arrays come with the drawback of
 silently dropping the NUL-terminator when strlen(str) == sizeof(array); by
 forcing a NUL-terminator in, we always get a warning if it would've been
 dropped, and the size of the array can then be increased.

AFAICT, this change and some similar ones are based on a false assumption.
It is *not* necessary to replace pointers by fixed-length arrays in order
to get things into .rodata.  For example, in chklocale.c where we already
had this coding:

struct encoding_match
{
enum pg_enc pg_enc_code;
const char *system_enc_name;
};

static const struct encoding_match encoding_match_list[] = {
	{PG_EUC_JP, "EUC-JP"},
	{PG_EUC_JP, "eucJP"},
...

what I see in gcc -S (on x86_64 Linux) is:

.section.rodata
.align 32
.type   encoding_match_list, @object
.size   encoding_match_list, 1776
encoding_match_list:
.long   1
.zero   4
.quad   .LC5
.long   1
.zero   4
.quad   .LC6
...
.section.rodata.str1.1
.LC5:
.string "EUC-JP"
.LC6:
.string "eucJP"
...

So it's all read-only data anyway, as can be confirmed with "size",
which shows that the .o file contains *no* data outside the text
segment.  There might be some platforms where this isn't the case,
but I don't think they're mainstream.
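
For concreteness, the two declaration styles being compared look like this
(encoding_match_arr is a hypothetical stand-in for the patch's fixed-length
members, with an arbitrary 16-byte field):

/* pointer form: const struct array plus string literals, as in chklocale.c */
struct encoding_match_ptr
{
	int			pg_enc_code;
	const char *system_enc_name;
};

static const struct encoding_match_ptr match_list_ptr[] = {
	{1, "EUC-JP"},
	{1, "eucJP"}
};

/*
 * fixed-length form: every entry is padded to the maximum, and the
 * terminating NUL is silently dropped if a name ever reaches 16 bytes.
 */
struct encoding_match_arr
{
	int			pg_enc_code;
	char		system_enc_name[16];
};

static const struct encoding_match_arr match_list_arr[] = {
	{1, "EUC-JP"},
	{1, "eucJP"}
};

(Under -fpic the pointer form also needs per-entry relocations, which is the
libpq wrinkle noted later in this thread.)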

Converting the pointer arrangement to a fixed-length struct member
as per the patch has the following implications:

* We save the pointer, and possibly some alignment padding, at the cost
that now every string has to be padded to the maximal length.  This
might produce a net space savings if all the strings in the table
are of similar length, but it's hardly a guaranteed win.  Note also
that we no longer get to share storage with any other uses of the
same strings as plain literals.

* Possibly there's some savings in executable startup time as a
consequence of eliminating pointer-relocation work.  This would be
platform dependent, and in any case I've never heard any complaints
suggesting that this is a metric we need to care about at all.

* We now have a risk of possible silent breakage if any string reaches
the maximum length.  This is exacerbated by the temptation to not leave
a lot of daylight in the chosen maximum length, so as to minimize the
amount of bloat created.

The hunk Robert is complaining about is attempting to deal with that last
drawback; but it seems like a mighty ugly technique to me, and there are
other places where you've replaced pointers by fixed-length arrays without
adding any such protection (because it's notationally impractical).

TBH I don't find this to be an improvement, especially in cases where
it forces code changes beyond just const-ifying the data declarations.
I don't have a lot of confidence for instance that you found all the
places that were checking for a NULL pointer as an array terminator.
Having to change existing code greatly weakens the argument that this
is a correctness-improving patch, because now you have to factor in a
risk of actively breaking something.

So I think the fixed-length-array changes in this patch are a bad
idea; it's sufficient to make sure we've const-ified both the pointers
and the containing structs, as in the existing chklocale.c coding.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-01-18 Thread Jeremy Harris

On 31/12/13 01:41, Andreas Karlsson wrote:

On 12/29/2013 08:24 AM, David Rowley wrote:

If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.


It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?

One thing which have to be fixed with my patch is that we probably want
to close the tuplesort once we have returned the last tuple from
ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.


How does this work in combination with randomAccess ?
--
Thanks,
   Jeremy



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Race condition in b-tree page deletion

2014-01-18 Thread Kevin Grittner
Heikki Linnakangas hlinnakan...@vmware.com wrote:

 Here's a patch implementing this scheme.

This patch fixes a race condition bug in btree entry deletion.  The
bug seems to rarely be encountered in practice; or at least it is
generally not recognized as the cause of index corruption when
that occurs.

The basic approach is sound.  The papers on which we based our
btree implementation did not discuss entry deletion, and our
current approach introduces new possible states into the on-disk
index structure which are not covered by the papers and are not
always handled correctly by the current code.  Specifically, entry
inserts can add new pages at the lower levels which were not (yet)
in any higher-level pages, but right-sibling pointers allow
the correct point in the index to be found.  Our existing deletion
approach turned this on its head by having missing pages at the
bottom, which introduces whole new classes of problems which
otherwise don't exist.  It is in handling these index states which
are not addressed in the academic papers that we have a race
condition.  This patch makes interim states in the deletion process
look almost exactly the same as interim states in the insertion
process, thus dodging the need to handle new states.

The approach to limiting the number of locks to a finite (and
small) number is clever, but seems quite sound.  The locks are
always taken in an order which cannot cause a deadlock.

The write-up lacks any explicit mention of what happens if the
branch being considered for deletion reaches all the way to the
root.  Although an astute reader will notice that since root page
will always be the only page at its level, it must always be the
rightmost page at its level, and therefore is correctly handled by
the logic dealing with that, I think it merits an explicit mention
in the README.

The general feeling is that this patch is not suitable for
back-patching.  I agree with this, but this is a bug which could
lead to index corruption which exists in all supported branches.
If nobody can suggest an alternative which we can back-patch, we
might want to reconsider this after this has been in production
releases for several months.

The patch still applies with a few offsets.  It compiles with these
warnings (which were also reported by Peter Eisentraut and should
be fixed):

nbtpage.c: In function ‘_bt_pagedel’:
nbtpage.c:1695:24: warning: ‘metad’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
nbtpage.c:1414:18: note: ‘metad’ was declared here
nbtpage.c:1730:4: warning: ‘metapg’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
nbtpage.c:1413:8: note: ‘metapg’ was declared here

I'm not a fan of the new retry label introduced in _bt_pagedel()
and the goto statement which references it.  I think a loop using
while() or for() would be cleaner.  The whole idea of describing
this logic recursively and then making a big deal about using tail
recursion as an optimization seems pointless and confusing.  It
could just as easily be described as an iterative approach and save
the extra logical step in explaining the code.  I think that would
be clearer and easier to understand.  It would also eliminate the
last parameter, which is always passed as NULL and only set to
something else internally.  I think it would be cleaner to make
that a local variable initialized to NULL as part of eliminating
the fiction of recursion here.
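
A sketch of that restructuring in isolation - an explicit loop with a local
variable replacing the retry label, the goto, and the always-NULL parameter.
The helper here is a stub, not the real _bt_pagedel logic:

#include <stdbool.h>
#include <stddef.h>

/* stand-in for the per-page deletion work */
static bool
delete_one_page(void **stack)
{
	(void) stack;
	return false;
}

static int
pagedel_loop(void)
{
	int			ndeleted = 0;
	void	   *stack = NULL;	/* local, not a parameter always passed as NULL */

	for (;;)
	{
		if (!delete_one_page(&stack))
			break;				/* stop, but still report pages already deleted */
		ndeleted++;
		if (stack == NULL)
			break;				/* nothing further up the branch to revisit */
	}
	return ndeleted;
}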

There is a point in this loop where we return 0, but I think that
is wrong; I think we should break so that the pages already deleted
will be counted.  On a related note, the caller doesn't actually
use the count, but assumes that any non-zero value should be
treated as 1.

Similar issues with faux recursion exist in the calling function,
btvacuumpage().  In that case the unnecessary parameter which the
caller must get right is orig_blkno, which must be the same as
blkno on an actual call.  That could be a local variable
initialized to blkno within the function.  I think this would be
better described as iteration directly rather than claiming
recursions and whining about how the compiler is too dumb to turn
it into iteration for us.  It's not the job of this patch to fix
that in the caller, but let's not emulate it.

Other than that, I have not found any problems.

Since this is fixing a bug which is a hard-to-hit race condition,
it is hard to test.  Pretty much you need to get into a debugger
and set breakpoints to cause the right timing to be sure of hitting
it.  I'm still playing with that, but I figured I should post these
notes from reviewing the source code while I continue with that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-01-18 Thread Marti Raudsepp
On Sat, Jan 18, 2014 at 7:22 PM, Marti Raudsepp ma...@juffo.org wrote:
 Total runtime: 5.418 ms

Oops, shouldn't have rushed this. Clearly the timings should have
tipped me off that it's broken. I didn't notice that cmpSortSkipCols
was re-using tuplesort's sortkeys.

Here's a patch that actually works; I added a new skipKeys attribute
to SortState. I had to extract the SortSupport-creation code from
tuplesort_begin_heap to a new function; but that's fine, because it
was already duplicated in ExecInitMergeAppend too.

I reverted the addition of tuplesort_get_sortkeys, which is not needed now.

Now the timings are:
Unpatched partial sort: 13478.158 ms
Full sort: 6802.063 ms
Patched partial sort: 6618.962 ms

Regards,
Marti
From 7d9f34c09e7836504725ff11be7e63a2fc438ae9 Mon Sep 17 00:00:00 2001
From: Marti Raudsepp ma...@juffo.org
Date: Mon, 13 Jan 2014 20:38:45 +0200
Subject: [PATCH] Partial sort: skip comparisons for known-equal columns

---
 src/backend/executor/nodeMergeAppend.c | 18 +-
 src/backend/executor/nodeSort.c| 26 +-
 src/backend/utils/sort/sortsupport.c   | 29 +
 src/backend/utils/sort/tuplesort.c | 31 +--
 src/include/nodes/execnodes.h  |  1 +
 src/include/utils/sortsupport.h|  3 +++
 src/include/utils/tuplesort.h  |  2 --
 7 files changed, 60 insertions(+), 50 deletions(-)

diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 74fa40d..db6ec23 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -126,19 +126,11 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	 * initialize sort-key information
 	 */
 	mergestate->ms_nkeys = node->numCols;
-	mergestate->ms_sortkeys = palloc0(sizeof(SortSupportData) * node->numCols);
-
-	for (i = 0; i < node->numCols; i++)
-	{
-		SortSupport sortKey = mergestate->ms_sortkeys + i;
-
-		sortKey->ssup_cxt = CurrentMemoryContext;
-		sortKey->ssup_collation = node->collations[i];
-		sortKey->ssup_nulls_first = node->nullsFirst[i];
-		sortKey->ssup_attno = node->sortColIdx[i];
-
-		PrepareSortSupportFromOrderingOp(node->sortOperators[i], sortKey);
-	}
+	mergestate->ms_sortkeys = MakeSortSupportKeys(mergestate->ms_nkeys,
+												  node->sortColIdx,
+												  node->sortOperators,
+												  node->collations,
+												  node->nullsFirst);
 
 	/*
 	 * initialize to show we have not run the subplans yet
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 55cdb05..7645645 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -28,20 +28,19 @@ static bool
 cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
 {
 	int n = ((Sort *) node->ss.ps.plan)->skipCols, i;
-	SortSupport sortKeys = tuplesort_get_sortkeys(node->tuplesortstate);
 
 	for (i = 0; i < n; i++)
 	{
 		Datum datumA, datumB;
 		bool isnullA, isnullB;
-		AttrNumber attno = sortKeys[i].ssup_attno;
+		AttrNumber attno = node->skipKeys[i].ssup_attno;
 
 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
 		datumB = slot_getattr(b, attno, &isnullB);
 
 		if (ApplySortComparator(datumA, isnullA,
-								datumB, isnullB,
-								&sortKeys[i]))
+								datumB, isnullB,
+								&node->skipKeys[i]))
 			return false;
 	}
 	return true;
@@ -123,12 +122,21 @@ ExecSort(SortState *node)
 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
 	else
 	{
+		/* Support structures for cmpSortSkipCols - already sorted columns */
+		if (skipCols)
+			node->skipKeys = MakeSortSupportKeys(skipCols,
+												 plannode->sortColIdx,
+												 plannode->sortOperators,
+												 plannode->collations,
+												 plannode->nullsFirst);
+
+		/* Only pass on remaining columns that are unsorted */
 		tuplesortstate = tuplesort_begin_heap(tupDesc,
-											  plannode->numCols,
-											  plannode->sortColIdx,
-											  plannode->sortOperators,
-											  plannode->collations,
-											  plannode->nullsFirst,
+											  plannode->numCols - skipCols,
+											  &(plannode->sortColIdx[skipCols]),
+											  &(plannode->sortOperators[skipCols]),
+											  &(plannode->collations[skipCols]),
+											  &(plannode->nullsFirst[skipCols]),
 											  work_mem,
 											  node->randomAccess);
 		if (node->bounded)
diff --git a/src/backend/utils/sort/sortsupport.c b/src/backend/utils/sort/sortsupport.c
index 347f448..df82f5f 100644
--- a/src/backend/utils/sort/sortsupport.c
+++ b/src/backend/utils/sort/sortsupport.c
@@ -85,6 +85,35 @@ PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup)
 }
 
 /*
+ * Build an array of SortSupportData structures from separated arrays.
+ */
+SortSupport
+MakeSortSupportKeys(int nkeys, AttrNumber *attNums,
+	Oid *sortOperators, Oid *sortCollations,
+	bool *nullsFirstFlags)
+{
+	SortSupport 

Re: [HACKERS] [PATCH] Make various variables read-only (const)

2014-01-18 Thread Tom Lane
I wrote:
 AFAICT, this change and some similar ones are based on a false assumption.
 It is *not* necessary to replace pointers by fixed-length arrays in order
 to get things into .rodata.

After further experimentation I find that this claim is true when
compiling normally, but apparently not when using -fpic, at least
not on RHEL6 x86_64.  Even when const-ified, the tables in encnames.o
and wchar.o end up in the data segment (though the underlying strings
are not).  So if we want a meaningful shrinkage in the size of the data
segment in libpq.so, we'd have to do something similar to what Oskari
proposes.

However, I'm still against doing so; the other points I made before
still apply, and I think on balance fixed-length arrays are still a
bad idea.  It seems like a particularly bad idea to expose a limit
on the length of an encoding name as part of the ABI defined by
pg_wchar.h, as the submitted patch did.  I'm on board with making
changes like this where they can be argued to improve correctness and
maintainability, but surely moving to fixed-length arrays is the
opposite of that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] new json funcs

2014-01-18 Thread Andrew Dunstan


On 01/18/2014 12:34 PM, Marko Tiikkaja wrote:

On 2014-01-17 03:08, Andrew Dunstan wrote:

In case anyone feels like really nitpicking, this version fixes some
pgindent weirdness due to an outdated typedef list in the previous 
version.


Is it possible for you to generate a diff that doesn't have all these 
unrelated changes in it (from a pgindent run, I presume)?  I just read 
through three pagefuls and I didn't see any relevant changes, but I'm 
not sure I want to keep doing that for the rest of the patch.







This seems to be quite overstated. The chunks in the version 3 patch 
that only contain pgindent effects are those at lines 751,771,866 and 
1808 of json.c, by my reckoning. All the other changes are more than 
formatting.


And undoing the pgindent changes and then individually applying all but 
those mentioned above would take me a lot of time.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] new json funcs

2014-01-18 Thread Marko Tiikkaja

On 1/18/14, 9:38 PM, Andrew Dunstan wrote:

On 01/18/2014 12:34 PM, Marko Tiikkaja wrote:

Is it possible for you to generate a diff that doesn't have all these
unrelated changes in it (from a pgindent run, I presume)?  I just read
through three pagefuls and I didn't see any relevant changes, but I'm
not sure I want to keep doing that for the rest of the patch.



This seems to be quite overstated. The chunks in the version 3 patch
that only contain pgindent effects are those at lines 751,771,866 and
1808 of json.c, by my reckoning. All the other changes are more than
formatting.


Oh I see, there's a version 3 which improves things by a lot.  I just 
took the latest patch from the CF app and that was the v2 patch.  Now 
looking at it again, I see that it actually added new lines around 
json.c:68, which I believe proves my point that reviewing such a patch 
is hard.



And undoing the pgindent changes and then individually applying all but
those mentioned above would take me a lot of time.


v3 looks ok.  I would have preferred a patch with no unrelated 
changes, but I can live with what we have there.


Something like the first three pagefuls of v2, however, would take *me* 
a lot of time, which I believe is not acceptable.  I don't care why a 
patch has lots of unrelated stuff in it, I'm not going to waste my time 
trying to figure out which parts are relevant and which aren't.  That's 
a lot of time wasted just to end up with a review possibly full of 
missed problems and misunderstood code.


But I'll continue with my review now that this has been sorted out.


Regards,
Marko Tiikkaja


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Make various variables read-only (const)

2014-01-18 Thread Tom Lane
Oskari Saarenmaa o...@ohmu.fi writes:
 This allows the variables to be moved from .data to .rodata section which
 means that more data can be shared by processes and makes sure that nothing
 can accidentally overwrite the read-only definitions.  On a x86-64 Linux
 system this moves roughly 9 kilobytes of previously writable data to the
 read-only data segment in the backend and 4 kilobytes in libpq.

Committed, with the changes mentioned upthread and some other minor
editorialization.  I also adopted Wim Lewis' suggestion to not export
pg_encname_tbl[] at all anymore, since it's hard to see what the point
is of doing anything besides pg_char_to_encoding() with it.
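
(For anyone wondering what callers are left with: the name-to-encoding lookup
remains available through pg_char_to_encoding().  A usage sketch, not code
from the commit:)

#include "mb/pg_wchar.h"

/* returns the PG_* encoding id, or -1 if the name is not recognized */
int
encoding_id_for(const char *name)
{
	return pg_char_to_encoding(name);
}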

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE TABLESPACE WITH

2014-01-18 Thread Vik Fearing
On 01/18/2014 03:21 PM, Michael Paquier wrote:
 So, except for the regression output problem, I think that the code is
 clean so marking it as Ready for committer on the commit fest app.

Thank you for the review.  Sorry about the regression thing.  I tried to
randomize it a bit so that I would catch it but it ended up breaking all
the other tablespace tests, so I decided it's not worth it.

-- 
Vik



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] array_length(anyarray)

2014-01-18 Thread Pavel Stehule
Hello

I checked it and I got a small issue

bash-4.1$ patch -p1  cardinality.patch
(Stripping trailing CRs from patch.)
patching file doc/src/sgml/array.sgml
(Stripping trailing CRs from patch.)
patching file doc/src/sgml/func.sgml
(Stripping trailing CRs from patch.)
patching file src/backend/utils/adt/arrayfuncs.c
(Stripping trailing CRs from patch.)
patching file src/include/catalog/pg_proc.h
(Stripping trailing CRs from patch.)
patching file src/include/utils/array.h
(Stripping trailing CRs from patch.)
patching file src/test/regress/expected/arrays.out
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/arrays.sql

I am not sure about the source of this problem.

The function works correctly, and I would like to have this feature.

Regards

Pavel




2014/1/18 Marko Tiikkaja ma...@joh.to

 On 1/12/14, 5:53 AM, I wrote:

 On 1/9/14, 2:57 PM, Dean Rasheed wrote:

 How it should behave for multi-dimensional arrays is less clear, but
 I'd argue that it should return the total number of elements, i.e.
 cardinality('{{1,2},{3,4}}'::int[][]) = 4. That would make it
 consistent with the choices we've already made for unnest() and
 ordinality:
- cardinality(foo) = (select count(*) from unnest(foo)).
- unnest with ordinality would always result in ordinals in the range
 [1, cardinality].


 Ignoring my proposal, this seems like the most reasonable option.  I'll
 send an updated patch along these lines.


 Here's the patch as promised.  Thoughts?


 Regards,
 Marko Tiikkaja
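
For reference, the count-every-element semantics settled on above needs very little code. A sketch of such a function, written in the style of the existing functions in arrayfuncs.c (an illustration, not necessarily the patch's actual code):

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

/*
 * Sketch only: cardinality() counting every element across all dimensions,
 * so that cardinality('{{1,2},{3,4}}'::int[][]) = 4 and, in general,
 * cardinality(foo) = (select count(*) from unnest(foo)).
 */
Datum
array_cardinality(PG_FUNCTION_ARGS)
{
	ArrayType  *v = PG_GETARG_ARRAYTYPE_P(0);

	/* ArrayGetNItems multiplies the lengths of all dimensions together */
	PG_RETURN_INT32(ArrayGetNItems(ARR_NDIM(v), ARR_DIMS(v)));
}

An empty array has zero dimensions, for which ArrayGetNItems() returns 0, so that case also matches the unnest()-based definition.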



Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-18 Thread Peter Geoghegan
On Sat, Jan 18, 2014 at 5:28 AM, Robert Haas robertmh...@gmail.com wrote:
 I was wondering if that might cause deadlocks if an existing index is
 changed from unique to non-unique, or vice versa, as the ordering would
 change. But we don't have a DDL command to change that, so the question is
 moot.

 It's not hard to imagine someone wanting to add such a DDL command.

Perhaps, but the burden of solving that problem ought to rest with
whoever eventually proposes the command. Certainly, if someone did so
today, I would object on the grounds that their patch precluded us
from ever prioritizing unique indexes to get them out of the way
during insertion, so I am not actually making such an effort any more
difficult than it already is. Moreover, avoiding entirely predictable
index bloat is more important than making it easier to implement this
yet-to-be-proposed feature. I was surprised when I learned that things
didn't already work this way.

The attached patch, broken off from my patch, has the relcache sort
indexes by (!indisunique, relindexid).

-- 
Peter Geoghegan
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*** typedef struct relidcacheent
*** 108,113 
--- 108,125 
  	Relation	reldesc;
  } RelIdCacheEnt;
  
+ /*
+  *		Representation of indexes for sorting purposes
+  *
+  *		We use this to sort indexes globally by a specific sort order, per
+  *		RelationGetIndexList().
+  */
+ typedef struct relidunq
+ {
+ 	bool		indisunique;
+ 	Oid			relindexid;
+ } relidunq;
+ 
  static HTAB *RelationIdCache;
  
  /*
*** static TupleDesc GetPgClassDescriptor(vo
*** 246,252 
  static TupleDesc GetPgIndexDescriptor(void);
  static void AttrDefaultFetch(Relation relation);
  static void CheckConstraintFetch(Relation relation);
! static List *insert_ordered_oid(List *list, Oid datum);
  static void IndexSupportInitialize(oidvector *indclass,
  	   RegProcedure *indexSupport,
  	   Oid *opFamily,
--- 258,264 
  static TupleDesc GetPgIndexDescriptor(void);
  static void AttrDefaultFetch(Relation relation);
  static void CheckConstraintFetch(Relation relation);
! static int relidunq_cmp(const void *a, const void *b);
  static void IndexSupportInitialize(oidvector *indclass,
  	   RegProcedure *indexSupport,
  	   Oid *opFamily,
*** CheckConstraintFetch(Relation relation)
*** 3445,3455 
   * Such indexes are expected to be dropped momentarily, and should not be
   * touched at all by any caller of this function.
   *
!  * The returned list is guaranteed to be sorted in order by OID.  This is
!  * needed by the executor, since for index types that we obtain exclusive
!  * locks on when updating the index, all backends must lock the indexes in
!  * the same order or we will get deadlocks (see ExecOpenIndices()).  Any
!  * consistent ordering would do, but ordering by OID is easy.
   *
   * Since shared cache inval causes the relcache's copy of the list to go away,
   * we return a copy of the list palloc'd in the caller's context.  The caller
--- 3457,3471 
   * Such indexes are expected to be dropped momentarily, and should not be
   * touched at all by any caller of this function.
   *
!  * The returned list is guaranteed to be in (!indisunique, OID) order.  This is
!  * needed by the executor, since for index types that we obtain exclusive locks
!  * on when updating the index, all backends must lock the indexes in the same
!  * order or we will get deadlocks (see ExecOpenIndices()).  For most purposes
!  * any consistent ordering would do, but there is further consideration, which
!  * is why we put unique indexes first: it is generally useful to get insertion
!  * into unique indexes out of the way, since unique violations are the cause of
!  * many aborted transactions.  We can always avoid bloating non-unique indexes
!  * of the same slot.
   *
   * Since shared cache inval causes the relcache's copy of the list to go away,
   * we return a copy of the list palloc'd in the caller's context.  The caller
*** RelationGetIndexList(Relation relation)
*** 3469,3475 
  	SysScanDesc indscan;
  	ScanKeyData skey;
  	HeapTuple	htup;
! 	List	   *result;
  	char		replident = relation->rd_rel->relreplident;
  	Oid			oidIndex = InvalidOid;
  	Oid			pkeyIndex = InvalidOid;
--- 3485,3495 
  	SysScanDesc indscan;
  	ScanKeyData skey;
  	HeapTuple	htup;
! 	relidunq   *indexTypes;
! 	int			nIndexType;
! 	int			i;
! 	Size		szIndexTypes;
! 	List	   *result = NIL;
  	char		replident = relation->rd_rel->relreplident;
  	Oid			oidIndex = InvalidOid;
  	Oid			pkeyIndex = InvalidOid;
*** RelationGetIndexList(Relation relation)
*** 3486,3494 
  	 * list into the relcache entry.  This avoids cache-context memory leakage
  	 * if we get some sort of error partway through.
  	 */
- 	result = NIL;
  	oidIndex = InvalidOid;
  
  	/* Prepare to scan pg_index for entries having indrelid = this rel. */
  	

Re: [HACKERS] array_length(anyarray)

2014-01-18 Thread Marko Tiikkaja

On 1/19/14, 12:21 AM, Pavel Stehule wrote:

I checked it and I got a small issue

bash-4.1$ patch -p1 < cardinality.patch
(Stripping trailing CRs from patch.)

not sure about source of this problem.


I can't reproduce the problem.  In fact, I don't see a single CR byte in 
the patch file on my disk, the file my email client claims to have sent, 
or a copy of the file I downloaded from the archives.  Are you sure this 
isn't a problem on your end?



Regards,
Marko Tiikkaja


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [patch] Potential relcache leak in get_object_address_attribute

2014-01-18 Thread Marti Raudsepp
Hi list,

It looks like the get_object_address_attribute() function has a
potential relcache leak. When missing_ok=true, the relation is found
but attribute is not, then the relation isn't closed, nor is it
returned to the caller.

I couldn't figure out any ways to trigger this, but it's best to fix anyway.

Regards,
Marti
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index f8fd4f8..e22aa66 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -1024,6 +1024,7 @@ get_object_address_attribute(ObjectType objtype, List *objname,
 		address.classId = RelationRelationId;
 		address.objectId = InvalidOid;
 		address.objectSubId = InvalidAttrNumber;
+		relation_close(relation, lockmode);
 		return address;
 	}
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Heavily modified big table bloat even in auto vacuum is running

2014-01-18 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 I am marking this (based on patch vacuum_fix_v7_nkeep.patch) as Ready
 For Committer.

I've reviewed and committed this patch, with one significant change.
If you look at the way that vacuumlazy.c computes new_rel_tuples, it's
based on scanned_tuples (lazy_scan_heap's num_tuples), which is the total
number of surviving tuples *including* the recently-dead ones counted in
nkeep.  This is the number that we want to put into pg_class.reltuples,
I think, but it's wrong for the pgstats stuff to use it as n_live_tuples
if we're going to count the recently-dead ones as dead.  That is, if we're
improving the approximation that n_dead_tuples is zero after a vacuum,
the fix should involve reducing the n_live_tuples estimate as well as
increasing the n_dead_tuples estimate.

Using your test script against the unpatched code, it's easy to see that
there's a large (and wrong) value of n_live_tup reported by an autovacuum,
which gets corrected by the next autoanalyze.  For instance note these
successive readouts from the pg_stat_all_tables query:

 n_live_tup | autovacuum_count | autoanalyze_count | n_dead_tup 
------------+------------------+-------------------+------------
     497365 |                9 |                 8 |    4958346
     497365 |                9 |                 8 |    5458346
    1186555 |               10 |                 8 |          0
    1186555 |               10 |                 8 |         50
     499975 |               10 |                 9 |    2491877

Since we know the true number of live tuples is always exactly 50
in this test, that jump is certainly wrong.  With the committed patch,
the behavior is significantly saner:

 n_live_tup | autovacuum_count | autoanalyze_count | n_dead_tup 
------------+------------------+-------------------+------------
     483416 |                2 |                 2 |    5759861
     483416 |                2 |                 2 |    6259861
     655171 |                3 |                 2 |     382306
     655171 |                3 |                 2 |     882306
     553942 |                3 |                 3 |    3523744

Still some room for improvement, but it's not so silly anymore.

It strikes me that there may be an obvious way to improve the number
further, based on the observation in this thread that nkeep doesn't need
to be scaled up because VACUUM should have scanned every page that could
contain dead tuples.  Namely, that we're arriving at new_rel_tuples by
scaling up num_tuples linearly --- but perhaps we should only scale up
the live-tuples fraction of that count, not the dead-tuples fraction.
By scaling up dead tuples too, we are presumably still overestimating
new_rel_tuples somewhat, and the behavior that I'm seeing with this test
script seems to confirm that.  I haven't time to pursue this idea at the
moment, but perhaps someone else would like to.
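
Concretely, the idea amounts to something like the sketch below (untested,
and assuming vac_estimate_reltuples() keeps its current signature; whether
it can sensibly be handed a live-only tuple count is part of what would
need rethinking):

static double
estimate_new_rel_tuples(Relation onerel, BlockNumber rel_pages,
						BlockNumber scanned_pages,
						double num_tuples, double nkeep)
{
	/* scale up only the live fraction of what the scan actually counted */
	double		live_scanned = num_tuples - nkeep;
	double		live_estimate = vac_estimate_reltuples(onerel, false,
													   rel_pages,
													   scanned_pages,
													   live_scanned);

	/* nkeep needs no scaling: vacuum visited every page that could hold them */
	return live_estimate + nkeep;
}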

The n_dead_tup values seem to still be on the high side (not the low
side) when I run this test.  Not too sure why that is.

Also, I don't see any particularly meaningful change in the rate of
autovacuuming or autoanalyzing when using default postgresql.conf
settings.  I see that your test case changes the autovac parameters quite
a bit, but I didn't bother installing those values here.  This may or may
not mean much; the fix is clearly correct on its own terms.

On the whole, this patch is not really addressing the question of
accounting for transactions changing the table concurrently with
VACUUM; it's only fixing the impedance mismatch between pgstats wanting
to count live and dead tuples separately while VACUUM wasn't telling
it that.  That's a good thing to do, but I think we still have some
issues here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Patch to add support of IF NOT EXISTS to others CREATE statements

2014-01-18 Thread Stephen Frost
All,

* Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, Jun 12, 2013 at 3:00 PM, Peter Eisentraut pete...@gmx.net wrote:
  On 6/12/13 1:29 PM, Fabrízio de Royes Mello wrote:
  - CREATE AGGREGATE [ IF NOT EXISTS ] ...
  - CREATE CAST [ IF NOT EXISTS ] ...
  - CREATE COLLATION [ IF NOT EXISTS ] ...
  - CREATE OPERATOR [ IF NOT EXISTS ] ...
  - CREATE TEXT SEARCH {PARSER | DICTIONARY | TEMPLATE | CONFIGURATION} [
  IF NOT EXISTS ] ...
  - CREATE TYPE [ IF NOT EXISTS ] ... [AS [{ENUM | RANGE}] (...)]
 
  I'm wondering where IF NOT EXISTS and OR REPLACE will meet.
 
 I kind of don't see the point of having IF NOT EXISTS for things that
 have OR REPLACE, and am generally in favor of implementing OR REPLACE
 rather than IF NOT EXISTS where possible.  The point is usually to get
 the object to a known state, and OR REPLACE will generally accomplish
 that better than IF NOT EXISTS.  However, if the object has complex
 structure (like a table that contains data) then replacing it is a
 bad plan, so IF NOT EXISTS is really the best you can do - and it's
 still useful, even if it does require more care.

This patch is in the most recent commitfest and marked as Ready for
Committer, so I started reviewing it and came across the above.

I find myself mostly agreeing with the above comments from Robert, but
it doesn't seem like we've really done a comprehensive review of the
various commands to make a 'command' decision on each as to if it should
have IF NOT EXISTS or OR REPLACE options.

The one difficulty that I do see with the 'OR REPLACE' option is when we
can't simply replace an existing object due to dependencies on the
existing definition of that object.  Still, if that's the case, wouldn't
you want an error?  The 'IF NOT EXISTS' case is the no-error option, but
then you might not know what kind of state the system is in.

Fabrízio, can you clarify the use-case for things like CREATE AGGREGATE
to have IF NOT EXISTS rather than OR REPLACE, or if there is a reason
why both should exist?  Complicating our CREATE options is not something
we really wish to do without good reason and we certainly don't want to
add something now that we'll wish to remove in another version or two.

Thanks!

Stephen




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-18 Thread Marti Raudsepp
On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby j...@nasby.net wrote:
 it's very common to create temporary file data that will never, ever, ever
 actually NEED to hit disk. Where I work being able to tell the kernel to
 avoid flushing those files unless the kernel thinks it's got better things
 to do with that memory would be EXTREMELY valuable

Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.

ISTR that there was discussion about implementing something analogous
in Linux when ext4 got delayed allocation support, but I don't think
it got anywhere and I can't find the discussion now. I think the
proposed interface was to create and then unlink the file immediately,
which serves as a hint that the application doesn't care about
persistence.
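
The create-then-unlink part is trivial; the open question is whether the
kernel would honor it as a write-back hint (today it does not). For
concreteness, the pattern is just:

#include <fcntl.h>
#include <unistd.h>

/*
 * Create a temp file and immediately unlink it.  The data stays usable
 * through fd until it is closed, but the file has no name; the proposal
 * was for the kernel to take that as a hint that the pages never need to
 * reach disk.  Stock Linux currently makes no such promise.
 */
static int
make_disposable_temp(const char *path)
{
	int		fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
		return -1;
	if (unlink(path) < 0)
	{
		close(fd);
		return -1;
	}
	return fd;
}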

Postgres is far from being the only application that wants this; many
people resort to tmpfs because of this:
https://lwn.net/Articles/499410/

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST support for inet datatypes

2014-01-18 Thread Andreas Karlsson

Hi,

I will review your two patches (gist support + selectivity). This is 
part 1 of my review. I will look more into the actual GiST 
implementation in a couple of days, but thought I could provide you with 
my initial input right away.


inet-gist
-

General:

I like the idea of the patch and think the && operator is useful for 
exclusion constraints, and that indexing the contains operator is useful 
for IP-address lookups. There is an extension, ip4r, which adds a 
GiST-indexed type for IP ranges, but since we have the inet type in core 
I think it should have GiST indexes.


I am not convinced an adjacent operator is useful for the inet type, but 
if it is included it should be indexed just like -|- of ranges. We 
should try to keep these lists of indexed operators the same.


Compilation:

Compiled without errors.

Regression tests:

One of the broken regression tests seems unrelated to the selectivity 
functions.


-- Make a list of all the distinct operator names being used in particular
-- strategy slots.

I think it would be fine just to add the newly indexed operators here, 
but the more indexed operators we get in the core the less useful this 
test becomes.


Source code:

The is adjacent to operator, -|-, does not seem to be doing the right 
thing. In the code it is implemented as the inverse of &&, which is not 
correct. The comparison below should return false.


SELECT '10.0.0.1'::inet -|- '127.0.0.1'::inet;
 ?column?
--
 t
(1 row)

I am a bit suspicious about your memcmp based optimization in 
bitncommon, but it could be good. Have you benchmarked it compared to 
doing the same thing with a loop?
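
For reference, the kind of memcmp() fast path being asked about would look
roughly like this; it is an illustration only, not the patch's actual
bitncommon():

#include <string.h>

/*
 * Count the leading bits that a and b have in common, considering at most
 * n bits.  Identical prefixes are detected with a single memcmp(); only a
 * mismatch requires per-byte and per-bit work.
 */
static int
common_prefix_bits(const unsigned char *a, const unsigned char *b, int n)
{
	int		nbytes = (n + 7) / 8;
	int		byteno = 0;
	int		result;
	unsigned char diff;

	if (memcmp(a, b, nbytes) == 0)
		return n;

	/* find the first differing byte; memcmp guarantees one exists */
	while (a[byteno] == b[byteno])
		byteno++;

	/* then the first differing bit within that byte */
	result = byteno * 8;
	diff = a[byteno] ^ b[byteno];
	while ((diff & 0x80) == 0)
	{
		diff <<= 1;
		result++;
	}

	return (result < n) ? result : n;
}

Whether the fast path actually beats a plain loop presumably depends on how
often the compared addresses share their whole prefix, hence the benchmark
question.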


Documentation:

Please use examples in the operator table which will evaluate to true, 
and for the  case an example where not both sides are the same.


I have not found a place in the documentation, either, where it is 
stated which operators are indexed. I would suggest just adding a 
short note after the operators table.


inet-selfuncs
-

Compilation:

There were some new warnings caused by this patch (with gcc 4.8.2). The 
warnings were all about the use of uninitialized variables (right, 
right_min_bits, right_order) in inet_hist_overlap_selectivity. Looking 
at the code I see that they are harmless but you should still rewrite 
the loop to silence the warnings.


Regression tests:

There are two broken tests:

-- Insist that all built-in pg_proc entries have descriptions

Here you should just add descriptions for the new functions in pg_proc.h.

-- Check that all opclass search operators have selectivity estimators.

Fails due to missing selectivity functions for the operators. I think 
you should at least for now do like the range types and use areajoinsel, 
contjoinsel, etc for the join selectivity estimation.


Source code:

The selectivity estimation functions, network_overlap_selectivity and 
network_adjacent_selectivity, should follow the naming conventions of 
the other selectivity estimation functions. They are named things like 
tsmatchsel, tsmatchjoinsel, and rangesel.


--
Andreas Karlsson


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-01-18 Thread Andreas Karlsson

On 01/18/2014 08:13 PM, Jeremy Harris wrote:

On 31/12/13 01:41, Andreas Karlsson wrote:

On 12/29/2013 08:24 AM, David Rowley wrote:

If it was possible to devise some way to reuse any
previous tuplesortstate, perhaps just by inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to look up the sort
functions from the syscache for each group and then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.


It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?

One thing which have to be fixed with my patch is that we probably want
to close the tuplesort once we have returned the last tuple from
ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.


How does this work in combination with randomAccess ?


As far as I can tell randomAccess was broken by the partial sort patch 
even before my change since it would not iterate over multiple 
tuplesorts anyway.


Alexander: Is this true or am I missing something?

--
Andreas Karlsson


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-18 Thread Peter Geoghegan
On Thu, Jan 16, 2014 at 6:31 PM, Peter Geoghegan p...@heroku.com wrote:
 I think we need to give this some more thought. I have not addressed
 the implications for MVCC snapshots here.

So I gave this some more thought, and this is what I came up with:

+ static bool
+ ExecLockHeapTupleForUpdateSpec(EState *estate,
+ 							   ResultRelInfo *relinfo,
+ 							   ItemPointer tid)
+ {
+ 	Relation		relation = relinfo->ri_RelationDesc;
+ 	HeapTupleData	tuple;
+ 	HeapUpdateFailureData hufd;
+ 	HTSU_Result		test;
+ 	Buffer			buffer;
+ 
+ 	Assert(ItemPointerIsValid(tid));
+ 
+ 	/* Lock tuple for update */
+ 	tuple.t_self = *tid;
+ 	test = heap_lock_tuple(relation, &tuple,
+ 						   estate->es_output_cid,
+ 						   LockTupleExclusive, false, /* wait */
+ 						   true, &buffer, &hufd);
+ 	ReleaseBuffer(buffer);
+ 
+ 	switch (test)
+ 	{
+ 		case HeapTupleInvisible:
+ 			/*
+ 			 * Tuple may have originated from this transaction, in which case
+ 			 * it's already locked.  However, to avoid having to consider the
+ 			 * case where the user locked an instantaneously invisible row
+ 			 * inserted in the same command, throw an error.
+ 			 */
+ 			if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)))
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_UNIQUE_VIOLATION),
+ 						 errmsg("could not lock instantaneously invisible tuple inserted in same transaction"),
+ 						 errhint("Ensure that no rows proposed for insertion in the same command have constrained values that duplicate each other.")));
+ 			if (IsolationUsesXactSnapshot())
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ 						 errmsg("could not serialize access due to concurrent update")));
+ 			/* Tuple became invisible due to concurrent update; try again */
+ 			return false;
+ 		case HeapTupleSelfUpdated:
+ 			/*

I'm just throwing an error when locking the tuple returns
HeapTupleInvisible, and the xmin of the tuple is our xid.

It's sufficient to just check
TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)),
because there is no way that _bt_check_unique() could consider the
tuple dirty visible + conclusively fit for a lock attempt if it came
from our xact, while at the same time for the same tuple
HeapTupleSatisfiesUpdate() indicated invisibility, unless the tuple
originated from the same command. Checking against subxacts or
ancestor xacts is at worst redundant.

I am happy with this. ISTM that it'd be hard to argue that any
reasonable and well-informed person would ever thank us for trying
harder here, although it took me a while to reach that position. To
understand what I mean, consider what MySQL does when in a similar
position. I didn't actually check, but based on the fact that their
docs don't consider this question I guess MySQL would go update the
tuple inserted by that same INSERT ... ON DUPLICATE KEY UPDATE
command. Most of the time the conflicting tuples proposed for
insertion by the user are in *some* way different (i.e. if the table
was initially empty and you did a regular insert, inserting those same
tuples would cause a unique constraint violation all on their own, but
without there being any fully identical tuples among these
hypothetical tuples proposed for insertion). It seems obvious that the
order in which each tuple is evaluated for insert-or-update on MySQL
is more or less undefined. And so by allowing this, they arguably
allow their users to miss something they should not: they don't end up
doing anything useful with the datums originally inserted in the
command, but then subsequently updated over with something else in the
same command.

MySQL users are not notified that this happened, and are probably
blissfully unaware that there has been a limited form of data loss. So
it's The Right Thing to say to Postgres users: if you inserted these
rows into the table when it was empty, there'd *still* definitely be a
unique constraint violation, and you need to sort that out before
asking Postgres to handle conflicts with concurrent sessions and
existing data, where rows that come from earlier commands in your xact
counts as existing data.

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-18 Thread Peter Geoghegan
On Sat, Jan 18, 2014 at 6:17 PM, Peter Geoghegan p...@heroku.com wrote:
 MySQL users are not notified that this happened, and are probably
 blissfully unaware that there has been a limited form of data loss. So
 it's The Right Thing to say to Postgres users: if you inserted these
 rows into the table when it was empty, there'd *still* definitely be a
 unique constraint violation, and you need to sort that out before
 asking Postgres to handle conflicts with concurrent sessions and
 existing data, where rows that come from earlier commands in your xact
 counts as existing data.

I Googled and found evidence indicating that a number of popular
proprietary systems' SQL MERGE implementations do much the same thing.
You may get an "attempted to UPDATE the same row twice" error on both
SQL Server and Oracle. I wouldn't like to speculate whether the standard
requires this of MERGE, but to require it seems very sensible.

 The only problem I can see with that is that
 we cannot complain consistently for practical reasons, as when we lock
 *some other* xact's tuple rather than inserting in the same command
 two or more times.

Actually, maybe it would be practical to complain that the same UPSERT
command attempted to lock a row twice with at least *almost* total
accuracy, and not just for the particularly problematic case where
tuple visibility is not assured.

Personally, I favor just making the "case HeapTupleSelfUpdated:" branch
within the patch's ExecLockHeapTupleForUpdateSpec() function complain when
hufd.cmax == estate->es_output_cid (currently there is a separate
complaint, but only when those two variables are unequal). That's
probably almost perfect in practice.

If we wanted perfection, which would be to always complain when two
rows were locked by the same UPSERT command, it would be a matter of
having heap_lock_tuple indicate to the patch's
ExecLockHeapTupleForUpdateSpec() caller that the row was already
locked, so that it could complain in a special way for the
locked-not-updated case. But that is hard, because there is no way for
it to know if the current *command* locked the tuple, and that's the
only case that we are justified in raising an error for.

But now that I think about it some more, maybe always complaining when
we lock but have not yet updated is not just not worth the trouble,
but is in fact bogus. It's not obvious what precise behavior is
correct here. I was worried about someone updating something twice,
but maybe it's fully sufficient to do what I've already proposed,
while in addition documenting that you cannot on-duplicate-key-lock a
tuple that has already been inserted or updated within the same
command. It will be very rare for anyone to trip up over that in
practice (e.g. by locking twice and spuriously updating the same row
twice or more in a later command). Users learn to not try this kind of
thing by having it break immediately; the fact that it doesn't break
with 100% reliability is good enough (plus it doesn't *really* fail to
break when it should because of how things are documented).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-18 Thread David Rowley
On Sun, Jan 19, 2014 at 3:22 AM, Florian Pflug f...@phlo.org wrote:


 BTW, this made me realize that MIN and MAX currently have the same issue -
 they'll rescan if the inputs are all equal. We could avoid that by doing
 what
 you did with dscale - track the number of times we've seen the maximum. I
 wonder if that would be worth it - it would, unfortunately, require us to
 use state type internal there too, and hence to add final functions for
 all
 the MIN/MAX aggregates. But that seems excessive. So for now, let's just
 live with that.


Yeah, it's an idea... I had actually talked a bit about it before when
first I posted about inverse transition functions.

http://www.postgresql.org/message-id/caaphdvqu+ygw0vbpbb+yxhrpg5vcy_kifyi8xmxfo8kyocz...@mail.gmail.com

But I've only realised today that it might be a no go. Let me explain...

I just finished implementing the inverse transition functions for bool_and
and bool_or. These aggregates had a sort operator, which I assume would have
allowed an index scan to be performed, but since I had to change the first
argument of these aggregates to internal, I had to get rid of the sort
operator. So I'm not actually sure that we really should implement inverse
transition functions for bool_and and bool_or because of this. Nevertheless,
I committed a patch to the GitHub repo which implements them. I guess this
sort operator problem completely writes off doing something similar for MAX
and MIN, as that would mean no index scan would be possible for these
aggregates!
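
For context, the reason the transition type has to become internal at all is
that an invertible bool_and has to count rather than just remember a single
boolean. A sketch of that counting idea, using names of my own invention
rather than whatever the actual patch calls things:

#include <stdbool.h>
#include <stdint.h>

/* sketch of an invertible bool_and state: counters instead of a plain bool */
typedef struct BoolAndState
{
	int64_t		nrows;			/* rows currently in the frame */
	int64_t		nfalse;			/* how many of them are false */
} BoolAndState;

/* forward transition: add one value to the frame */
static void
bool_and_forward(BoolAndState *state, bool value)
{
	state->nrows++;
	if (!value)
		state->nfalse++;
}

/* inverse transition: remove one value, no rescan of the frame needed */
static void
bool_and_inverse(BoolAndState *state, bool value)
{
	state->nrows--;
	if (!value)
		state->nfalse--;
}

/* final function: the frame's AND is true only if no false was seen */
static bool
bool_and_final(const BoolAndState *state)
{
	return state->nfalse == 0;
}

The price is exactly the internal state type, which is what collides with the
sort operator as described above.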



 If we really *do* want to optimize this case, we could
 come to it from a completely different angle. Aggregates already
 special-case
 MIN and MAX to be able to use an index to evalutate SELECT MAX(c) FROM t.
 If we provided a way for the transition function to call the sort operator
 specified by SORTOP in CREATE AGGREGATE, one generic triple of forward and
 inverse transition function and final function would work for all the
 MIN and MAX aggregates. But that's material for a separate patch for 9.5

  Note that after this fix the results for my quick benchmark now look
 like:
 
  create table num (num numeric(10,2) not null);
  insert into num (num) select * from generate_series(1,2);
  select sum(num) over(order by num rows between current row and unbounded
 following) from num; -- 113 ms
  select sum(num / 10) over(order by num rows between current row and
 unbounded following) from num; -- 156ms
  select sum(num / 1) over(order by num rows between current row and
 unbounded following) from num; -- 144 ms
 
  So it seems to be much less prone to falling back to brute force forward
  transitions.
  It also seems the / 10 version must have had to previously do 1 brute
  force rescan but now it looks like it can do it in 1 scan.
 
  I'm not set on it, and I'm willing to try the lazy zeroing of the scale
  tracker array, but I'm just not quite sure what extra real use cases it
  covers that the one above does not. Perhaps my way won't perform
 inverse
  transitions if the query did sum(numericCol / 10) or so.
 
  Dunno how many people SUM over numerics with different dscales. Easily
  possible that it's few enough to not be worth fussing over.
 
  Going by Tom's comments in the post above this is possible just by
 having an
  unconstrained numeric column, but I guess there's still a good chance
 that
  even those unconstrained numbers have the same scale or at least the
 scale
  will likely not vary wildly enough to make us have to perform brute force
  forward transitions for each row.

 Yeah, I'm convinced by now that your approach is the right trade-off there.
 Those who do have values with wildly different dscales in their columns can
 always add a cast to normalize them, if they experience a lot of restarts.

 So let's just add a sentence or two to the SUM(numeric) documentation about
 this, and be done.


I had a quick look and I couldn't decide the best place to write about
specific details on inverse transition functions. The best place I could
see was to add a note under the aggregates table here:
http://www.postgresql.org/docs/devel/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE



 
  I did think of this when I wrote them. I thought that the removing 0
 case might
  be quite common and worth it, but I thought the ~0 case would be less
 common,
  but I just thought it was weird to do one without the other.
  To do more tracking on these it looks like we'd need to change those
 aggregates
  to use an state type that is internal and I think the extra tracking
 would mean
  looping over a 8, 32 or 64 element array of int64's for each value, I
 just don't
  think that would be a winner performance wise since the code that's
 there is
  pretty much a handful of CPU cycles.

 Yeah, this is similar to the SUM(numeric) problem in that we *could* avoid
 all restarts, but the overhead of doing so is quite high. But whereas in
 the
 SUM(numeric) case we manage to reduce 

Re: [HACKERS] Heavily modified big table bloat even in auto vacuum is running

2014-01-18 Thread Amit Kapila
On Sun, Jan 19, 2014 at 5:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Amit Kapila amit.kapil...@gmail.com writes:
 I am marking this (based on patch vacuum_fix_v7_nkeep.patch) as Ready
 For Committer.

 I've reviewed and committed this patch, with one significant change.
 If you look at the way that vacuumlazy.c computes new_rel_tuples, it's
 based on scanned_tuples (lazy_scan_heap's num_tuples), which is the total
 number of surviving tuples *including* the recently-dead ones counted in
 nkeep.  This is the number that we want to put into pg_class.reltuples,
 I think, but it's wrong for the pgstats stuff to use it as n_live_tuples
 if we're going to count the recently-dead ones as dead.  That is, if we're
 improving the approximation that n_dead_tuples is zero after a vacuum,
 the fix should involve reducing the n_live_tuples estimate as well as
 increasing the n_dead_tuples estimate.

 Using your test script against the unpatched code, it's easy to see that
 there's a large (and wrong) value of n_live_tup reported by an autovacuum,
 which gets corrected by the next autoanalyze.  For instance note these
 successive readouts from the pg_stat_all_tables query:

  n_live_tup | autovacuum_count | autoanalyze_count | n_dead_tup 
 ------------+------------------+-------------------+------------
      497365 |                9 |                 8 |    4958346
      497365 |                9 |                 8 |    5458346
     1186555 |               10 |                 8 |          0
     1186555 |               10 |                 8 |         50
      499975 |               10 |                 9 |    2491877

 Since we know the true number of live tuples is always exactly 50
 in this test, that jump is certainly wrong.  With the committed patch,
 the behavior is significantly saner:

  n_live_tup | autovacuum_count | autoanalyze_count | n_dead_tup 
 ------------+------------------+-------------------+------------
      483416 |                2 |                 2 |    5759861
      483416 |                2 |                 2 |    6259861
      655171 |                3 |                 2 |     382306
      655171 |                3 |                 2 |     882306
      553942 |                3 |                 3 |    3523744

 Still some room for improvement, but it's not so silly anymore.

 It strikes me that there may be an obvious way to improve the number
 further, based on the observation in this thread that nkeep doesn't need
 to be scaled up because VACUUM should have scanned every page that could
 contain dead tuples.  Namely, that we're arriving at new_rel_tuples by
 scaling up num_tuples linearly --- but perhaps we should only scale up
 the live-tuples fraction of that count, not the dead-tuples fraction.
 By scaling up dead tuples too, we are presumably still overestimating
 new_rel_tuples somewhat, and the behavior that I'm seeing with this test
 script seems to confirm that.

After reading your analysis, the first thought that occurred to me is that
we could directly subtract nkeep from num_tuples to get better scaling of
live tuples. However, the scaling routine vac_estimate_reltuples() expects
scanned_tuples, and it is shared by both Analyze and Vacuum, where the
mechanisms for calculating live and dead tuples seem to be a bit different,
so directly passing num_tuples minus nkeep to this routine might create
some problems. Still, I think this idea is definitely worth pursuing to
improve the estimates of live tuples in Vacuum.

 I haven't time to pursue this idea at the moment, but perhaps someone else 
 would like to.

I think this idea is worth adding to the TODO list.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improve the help message about psql -F

2014-01-18 Thread Jov
Any comment?

Jov
blog: http://amutu.com/blog


2014/1/17 Jov am...@amutu.com

 In the official doc, -F says:

 Use separator as the field separator for unaligned output.


 but in psql --help, -F says:

 set field separator (default: |)


 If the user doesn't read the official doc carefully, he may use:

 psql -F , -c 'select ...'

 But he won't get what he wants.
 It is a bad user experience.

 I made a patch changing the help message for -F to:

 set field separator for unaligned output (default: |)


 patch for head attached.

 Jov
 blog: http://amutu.com/blog



Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-18 Thread Peter Geoghegan
On Sat, Jan 18, 2014 at 7:49 PM, Peter Geoghegan p...@heroku.com wrote:
 Personally, I favor just making the "case HeapTupleSelfUpdated:" branch within
 the patch's ExecLockHeapTupleForUpdateSpec() function complain when
 hufd.cmax == estate->es_output_cid (currently there is a separate
 complaint, but only when those two variables are unequal). That's
 probably almost perfect in practice.

Actually, there isn't really a need to do so, since I believe in
practice the tuple locked will always be instantaneously invisible
(when we have the scope to avoid this "updated the tuple twice in the
same command" problem by forbidding it in the style of SQL MERGE).
However, I think I'm going to propose that we still do something in
the ExecLockHeapTupleForUpdateSpec() HeapTupleSelfUpdated handler (in
addition to HeapTupleInvisible), because that'll still be illustrative
dead code.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] array_length(anyarray)

2014-01-18 Thread Pavel Stehule
2014/1/19 Marko Tiikkaja ma...@joh.to

 On 1/19/14, 12:21 AM, Pavel Stehule wrote:

 I checked it and I got a small issue

  bash-4.1$ patch -p1 < cardinality.patch
 (Stripping trailing CRs from patch.)

 not sure about source of this problem.


 I can't reproduce the problem.  In fact, I don't see a single CR byte in
 the patch file on my disk, the file my email clients claims to have sent or
 a copy of the file I downloaded from the archives.  Are you sure this isn't
 a problem on your end?


It can be a problem on my side - some strange combination of MIME types. I
have seen this issue before. I will recheck it tomorrow from another computer.

Regards

Pavel




 Regards,
 Marko Tiikkaja



Re: [HACKERS] Heavily modified big table bloat even in auto vacuum is running

2014-01-18 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 On Sun, Jan 19, 2014 at 5:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 It strikes me that there may be an obvious way to improve the number
 further, based on the observation in this thread that nkeep doesn't need
 to be scaled up because VACUUM should have scanned every page that could
 contain dead tuples.  Namely, that we're arriving at new_rel_tuples by
 scaling up num_tuples linearly --- but perhaps we should only scale up
 the live-tuples fraction of that count, not the dead-tuples fraction.
 By scaling up dead tuples too, we are presumably still overestimating
 new_rel_tuples somewhat, and the behavior that I'm seeing with this test
 script seems to confirm that.

 After reading your analysis, first thought occurred to me is that we can
 directly subtract nkeep from num_tuples to account for better scaling
 of live tuples, but I think the scaling routine vac_estimate_reltuples()
 is expecting scanned_tuples and this routine is shared by both
 Analyze and Vacuum where the mechanism to calculate the live
 and dead tuples seems to be bit different, so may be directly passing
 a subtract of num_tuples and nkeep to this routine might create some
 problem. However I think this idea is definitely worth pursuing to
 improve the estimates of live tuples in Vacuum.

Yeah, it seemed like it would require some rethinking of the way
vac_estimate_reltuples() works.  It's probably not that hard, but it
looked like it'd require more thought than I wanted to put into it on
a Saturday ;-)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers