[HACKERS] COPY and heap_sync

2014-08-30 Thread Jeff Janes
If you insert tuples with COPY into a table created or truncated in the
same transaction, at the end of the COPY it calls heap_sync.

But there are cases where people use COPY in a loop with a small amount of data
in each statement.  Then heap_sync is called many times, and if NBuffers
is large, doing that gets very slow.

Could the heap_sync be safely delayed until the end of the transaction,
rather than the end of the COPY?

Cheers,

Jeff


Re: [HACKERS] Misleading error message in logical decoding for binary plugins

2014-08-30 Thread Michael Paquier
On Fri, Aug 29, 2014 at 11:15 PM, Andres Freund and...@2ndquadrant.com
wrote:

  On top of that, a
  logical receiver receives data in textual form even if the decoder is
  marked as binary when getting a message with PQgetCopyData.

 I have no idea what you mean by that. The data is sent in the format the
 output plugin writes the data in.


Well, that's where we don't understand each other. From reading the docs I
understood that by setting output_type to OUTPUT_PLUGIN_BINARY_OUTPUT, all
the changes generated by an output plugin would automatically be converted
to binary and sent back to the client as binary data, but that's not the
case. For example, when using pg_recvlogical on a slot plugged with
test_decoding, the output received is unchanged and remains like this
even when using force-binary:
BEGIN 1000
table public.aa: INSERT: a[integer]:2
COMMIT 1000
Perhaps I didn't understand the docs well, but this is confusing, and it
should be clearly specified to the user that output_type only impacts the
SQL interface with the peek/get functions (as far as I tested). As things
stand now, by reading the docs a user can only learn that output_type can be
set to OUTPUT_PLUGIN_BINARY_OUTPUT or OUTPUT_PLUGIN_TEXTUAL_OUTPUT; there
is nothing about what each value means or how it impacts the output
format, and you need to look at the source code to discover that this
parameter is only used for some sanity checks within the logical functions.
That's not really user-friendly.
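To illustrate where the parameter comes from in the first place: the only place an
output plugin sets it is its startup callback. A minimal sketch (hypothetical plugin
name, based on the OutputPluginOptions struct shown in the patch below):

#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

/* Sketch of a startup callback for a hypothetical binary output plugin.
 * Per the discussion above, this flag is only checked by the SQL-level
 * pg_logical_slot_get/peek_changes() functions, which reject binary
 * plugins; pg_recvlogical and the walsender interface pass the plugin's
 * output through unchanged. */
static void
my_decoder_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
				   bool is_init)
{
	opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
}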

Attached is a patch improving the documentation regarding those comments.
Regards,
-- 
Michael
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index e6880d0..1472a6c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -403,9 +403,17 @@ typedef struct OutputPluginOptions
 OutputPluginOutputType output_type;
 } OutputPluginOptions;
 </programlisting>
-  <literal>output_type</literal> has to either be set to
-  <literal>OUTPUT_PLUGIN_TEXTUAL_OUTPUT</literal>
-  or <literal>OUTPUT_PLUGIN_BINARY_OUTPUT</literal>.
+  <literal>output_type</literal> is the parameter defining the output
+  format of the plugin; it can be one of
+  <literal>OUTPUT_PLUGIN_TEXTUAL_OUTPUT</literal> (support for output
+  format as text) and <literal>OUTPUT_PLUGIN_BINARY_OUTPUT</literal>
+  (support for output format as binary). This parameter influences only
+  the way output changes can be accessed by the set of SQL functions able
+  to query a replication slot. For example, the changes fetched by
+  <xref linkend="app-pgrecvlogical"> are not impacted by this parameter
+  and remain the same, while <function>pg_logical_slot_get_changes</>
+  is not able to query changes using output type
+  <literal>OUTPUT_PLUGIN_BINARY_OUTPUT</literal>.
  </para>
  <para>
   The startup callback should validate the options present in
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index a3a58e6..4f802ad 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -394,14 +394,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
-		 * Check whether the output pluggin writes textual output if that's
+		 * Check whether the output plugin writes textual output if that's
 		 * what we need.
 		 */
 		if (!binary &&
 			ctx->options.output_type != OUTPUT_PLUGIN_TEXTUAL_OUTPUT)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("output plugin cannot produce binary output")));
+					 errmsg("output plugin produces binary output but only textual output is accepted")));
 
 		ctx-output_writer_private = p;
 
diff --git a/src/include/catalog/objectaccess.h b/src/include/catalog/objectaccess.h
index 4fdd056..1e1cb13 100644
--- a/src/include/catalog/objectaccess.h
+++ b/src/include/catalog/objectaccess.h
@@ -13,7 +13,7 @@
 /*
  * Object access hooks are intended to be called just before or just after
  * performing certain actions on a SQL object.  This is intended as
- * infrastructure for security or logging pluggins.
+ * infrastructure for security or logging plugins.
  *
  * OAT_POST_CREATE should be invoked just after the object is created.
  * Typically, this is done after inserting the primary catalog records and



Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-08-30 Thread Amit Kapila
On Tue, Aug 26, 2014 at 9:49 PM, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:

 So my proposal is a bit more complicated.  First we introduce the notion
 of a single number, to enable sorting and computations: the delay
 equivalent, which is the cost_limit divided by cost_delay.  The higher
 the value is for any table, the faster it is vacuumed.  (It makes sense
 in physical terms: a higher cost_limit makes it faster, because vacuum
 sleeps less often; and a higher cost_delay makes it go slower, because
 vacuum sleeps for longer.)  Now, the critical issue is to notice that
 not all tables are equal; they can be split in two groups, those that go
 faster than the global delay equivalent
 (i.e. the effective values of GUC variables
 autovacuum_vacuum_cost_limit/autovacuum_vacuum_cost_delay), and those
 that go equal or slower.  For the latter group, the rebalancing
 algorithm distributes the allocated I/O by the global vars, in a
 pro-rated manner.  For the former group (tables vacuumed faster than
 global delay equiv), to rebalance we don't consider the global delay
 equiv but the delay equiv of the fastest table currently being vacuumed.

 Suppose we have two tables, delay_equiv=10 each (which is the default
 value).  If they are both vacuumed in parallel, then we distribute a
 delay_equiv of 5 to each (so set cost_limit=100, cost_delay=20).  As
 soon as one of them finishes, the remaining one is allowed to upgrade to
 delay_equiv=10 (cost_limit=200, cost_delay=20).

 Now add a third table, delay_equiv=500 (cost_limit=10000, cost_delay=20;
 this is Mark's volatile table).  If it's being vacuumed on its own, just
 assign cost_limit=10000 cost_delay=20, as normal.  If one of the other
 two tables are being vacuumed, that one will use delay_equiv=10, as per
 above.  To balance the volatile table, we take the delay_equiv of this
 one and subtract the already handed-out delay_equiv of 10; so we set the
 volatile table to delay_equiv=490 (cost_limit=9800, cost_delay=20).

 If we do it this way, the whole system is running at the full speed
 enabled by the fastest table we have set the per-table options, but also
 we have scaled things so that the slow tables go slow and the fast
 tables go fast.

 As a more elaborate example, add a fourth table with delay_equiv=50
 (cost_limit=1000, cost_delay=20).  This is also faster than the global
 vars, so we put it in the first group.  If all four tables are being
 vacuumed in parallel, we have the two slow tables going at delay_equiv=5
 each (cost_limit=100, cost_delay=20); then there are delay_equiv=490 to
 distribute among the remaining ones; pro-rating this we have
 delay_equiv=445 (cost_limit=8900, cost_delay=20) for the volatile table
 and delay_equiv=45 (cost_limit=900, cost_delay=20) for the other one.
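
A rough standalone illustration of the arithmetic in that example (hypothetical
names, not the patch's code):

#include <stdio.h>

/* delay equivalent = cost_limit / cost_delay; higher means faster vacuuming */
static double
delay_equiv(double cost_limit, double cost_delay)
{
	return cost_limit / cost_delay;
}

int
main(void)
{
	double	global = delay_equiv(200, 20);	/* effective global setting: 10 */
	double	fast[2] = {500, 50};			/* per-table settings above the global one */
	double	budget = fast[0] - global;		/* 490 left over for the fast group */
	double	fast_total = fast[0] + fast[1];

	printf("slow tables: delay_equiv %.0f each\n", global / 2);
	for (int i = 0; i < 2; i++)				/* prints 445 and 45, as above */
		printf("fast table %d: delay_equiv %.0f\n",
			   i, budget * fast[i] / fast_total);
	return 0;
}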

How will this calculation behave if the third table has delay_equiv = 30
and the fourth table has delay_equiv = 20?  Both are greater than the
default delay_equiv = 10, so they will participate in the fast group.  As
per my understanding of the above calculation both might get the same
delay_equiv, but I might be wrong, because your patch still has a FixMe
and I haven't yet fully understood its code.

In general, I have a feeling that distributing vacuuming speed is
a good way to tune the system; however, if the user wants to override
that by providing specific values for particular tables, we should
honour that setting.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-08-30 Thread Amit Kapila
On Thu, Aug 28, 2014 at 11:06 PM, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:

 Robert Haas wrote:

  Now, in the case where you are setting an overall limit, there is at
  least an argument to be made that you can determine the overall rate
  of autovacuum-induced I/O activity that the system can tolerate, and
  set your limits to stay within that budget, and then let the system
  decide how to divide that I/O up between workers.  But if you're
  overriding a per-table limit, I don't really see how that holds any
  water.  The system I/O budget doesn't go up just because one
  particular table is being vacuumed rather than any other.  The only
  plausible use case for setting a per-table rate that I can see is when
  you actually want the system to use that exact rate for that
  particular table.

 Yeah, this makes sense to me too -- at least as long as you only have
 one such table.  But if you happen to have more than one, and due to
 some bad luck they happen to be vacuumed concurrently, they will eat a
 larger share of your I/O bandwidth budget than you anticipated, which
 you might not like.

I think I/O bandwidth should be controlled by a separate mechanism
(maybe similar to what Simon proposed for WAL rate limiting) rather
than by internally changing user-specified values that the user might
specifically want to be used.  That would give more predictable
results which the user can control.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] COPY and heap_sync

2014-08-30 Thread Amit Kapila
On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 If you insert tuples with COPY into a table created or truncated in the
same transaction, at the end of the COPY it calls heap_sync.

 But there are cases where people use COPY in a loop with a small amount of
data in each statement.  Then heap_sync is called many times, and if
NBuffers is large, doing that gets very slow.

 Could the heap_sync be safely delayed until the end of the transaction,
rather than the end of the COPY?

Wouldn't unconditionally delaying the sync until end of transaction
lead to a burst of I/O at that time, especially if there are many
such COPY commands in a transaction, delaying other operations
that might be happening concurrently in the system?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] COPY and heap_sync

2014-08-30 Thread Atri Sharma
On Saturday, August 30, 2014, Amit Kapila amit.kapil...@gmail.com wrote:

 On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 
  If you insert tuples with COPY into a table created or truncated in the
 same transaction, at the end of the COPY it calls heap_sync.
 
  But there are cases where people use COPY in a loop with a small amount of
 data in each statement.  Then heap_sync is called many times, and if
 NBuffers is large, doing that gets very slow.
 
  Could the heap_sync be safely delayed until the end of the transaction,
 rather than the end of the COPY?

 Wouldn't unconditionally delaying the sync until end of transaction
 lead to a burst of I/O at that time, especially if there are many
 such COPY commands in a transaction, delaying other operations
 that might be happening concurrently in the system?




I agree with that, but then it can provide us the same kind of benefits as
group commit, especially when most of the COPY commands touch pages which
are nearby, hence reducing the seek-time overhead.

We could look at making it optional through a GUC, since it is useful,
albeit only for some specific use cases.

Regards,

Atri


-- 
Regards,

Atri
*l'apprenant*


[HACKERS] Make LWLockAcquireCommon() inline?

2014-08-30 Thread Andres Freund
Hi,

when profiling optimized builds (linux, gcc 4.9) it's currently
LWLockAcquireCommon() showing up, not its callers. Instruction-level
profiles show that the tests for valptr show up to some extent. Since
most callers don't need the valptr logic, it seems prudent to mark the
function inline, which will then eliminate the superfluous branches.
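
To spell out the mechanism with a generic sketch (not the actual lwlock.c
code): once the common function is inline, call sites that pass a constant
NULL valptr let the compiler drop those tests entirely.

#include <stddef.h>
#include <stdint.h>

typedef struct { volatile int locked; } FakeLock;	/* stand-in, not LWLock */

static inline void
fake_acquire_common(FakeLock *lock, uint64_t *valptr, uint64_t val)
{
	while (__sync_lock_test_and_set(&lock->locked, 1))
		;						/* spin; the real code sleeps instead */
	if (valptr != NULL)			/* folded away where valptr is a constant NULL */
		*valptr = val;
}

void
fake_acquire(FakeLock *lock)
{
	/* common case: nothing to publish, so no valptr branch after inlining */
	fake_acquire_common(lock, NULL, 0);
}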

Arguments against?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] delta relations in AFTER triggers

2014-08-30 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:
 Kevin Grittner kgri...@ymail.com wrote:

 executor-tuplestore-relations covers parse analysis,
 planner/optimizer, and executor layers.

 30 files changed, 786 insertions(+), 9 deletions(-)

Testing and further review found a few places I had missed that needed
lines added for the new RTE kind.  Delta patch attached.
7 files changed, 58 insertions(+)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 799242b..2e61edf 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2250,6 +2250,9 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 				APP_JUMB_STRING(rte->ctename);
 				APP_JUMB(rte->ctelevelsup);
 				break;
+			case RTE_TUPLESTORE:
+				APP_JUMB_STRING(rte->tsrname);
+				break;
 			default:
 				elog(ERROR, "unrecognized RTE kind: %d", (int) rte->rtekind);
 				break;
@@ -2638,6 +2641,13 @@ JumbleExpr(pgssJumbleState *jstate, Node *node)
 				JumbleQuery(jstate, (Query *) cte->ctequery);
 			}
 			break;
+		case T_TuplestoreRelation:
+			{
+				TuplestoreRelation *tsr = (TuplestoreRelation *) node;
+
+				APP_JUMB_STRING(tsr->name);
+			}
+			break;
 		case T_SetOperationStmt:
 			{
 SetOperationStmt *setop = (SetOperationStmt *) node;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..de31026 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -717,6 +717,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 		case T_FunctionScan:
 		case T_ValuesScan:
 		case T_CteScan:
+		case T_TuplestoreScan:
 		case T_WorkTableScan:
 		case T_ForeignScan:
 			*rels_used = bms_add_member(*rels_used,
@@ -930,6 +931,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_CteScan:
 			pname = sname = "CTE Scan";
 			break;
+		case T_TuplestoreScan:
+			pname = sname = "Tuplestore Scan";
+			break;
 		case T_WorkTableScan:
 			pname = sname = "WorkTable Scan";
 			break;
@@ -1300,6 +1304,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
+		case T_TuplestoreScan:
 		case T_WorkTableScan:
 		case T_SubqueryScan:
 			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
@@ -2160,6 +2165,11 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 			objectname = rte->ctename;
 			objecttag = "CTE Name";
 			break;
+		case T_TuplestoreScan:
+			Assert(rte->rtekind == RTE_TUPLESTORE);
+			objectname = rte->tsrname;
+			objecttag = "Tuplestore Name";
+			break;
 		case T_WorkTableScan:
 			/* Assert it's on a self-reference CTE */
 			Assert(rte->rtekind == RTE_CTE);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 93f3905..255348b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2442,6 +2442,13 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
 			WRITE_NODE_FIELD(ctecoltypmods);
 			WRITE_NODE_FIELD(ctecolcollations);
 			break;
+		case RTE_TUPLESTORE:
+			WRITE_STRING_FIELD(tsrname);
+			WRITE_OID_FIELD(relid);
+			WRITE_NODE_FIELD(ctecoltypes);
+			WRITE_NODE_FIELD(ctecoltypmods);
+			WRITE_NODE_FIELD(ctecolcollations);
+			break;
 		default:
 			elog(ERROR, "unrecognized RTE kind: %d", (int) node->rtekind);
 			break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 9f7f322..e11a116 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -287,6 +287,10 @@ print_rt(const List *rtable)
 				printf("%d\t%s\t[cte]",
 					   i, rte->eref->aliasname);
 				break;
+			case RTE_TUPLESTORE:
+				printf("%d\t%s\t[tuplestore]",
+					   i, rte->eref->aliasname);
+				break;
 			default:
 				printf("%d\t%s\t[unknown rtekind]",
 					   i, rte->eref->aliasname);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69d9989..bcaeeb0 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1240,6 +1240,13 @@ _readRangeTblEntry(void)
 			READ_NODE_FIELD(ctecoltypmods);
 			READ_NODE_FIELD(ctecolcollations);
 			break;
+		case RTE_TUPLESTORE:
+			READ_STRING_FIELD(tsrname);
+			READ_OID_FIELD(relid);
+			READ_NODE_FIELD(ctecoltypes);
+			READ_NODE_FIELD(ctecoltypmods);
+			READ_NODE_FIELD(ctecolcollations);
+			break;
 		default:
 			elog(ERROR, "unrecognized RTE kind: %d",
 				 (int) local_node->rtekind);
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index fb6c44c..518691a 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -2473,6 +2473,15 @@ transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
 							LCS_asString(lc->strength)),
 					 parser_errposition(pstate, thisrel->location)));
 			break;
+		case RTE_TUPLESTORE:
+			ereport(ERROR,
+	

Re: [HACKERS] pgbench throttling latency limit

2014-08-30 Thread Andres Freund
Hi,

I generally want to say that having a feature like this feels *very*
helpful to me. Lots of pg development hasn't really paid attention to
anything but the final pgbench results...

On 2014-08-29 19:48:43 +0200, Fabien COELHO wrote:
 + if (latency_limit)
 + printf("number of transactions above the %.1f ms latency limit: "
  INT64_FORMAT "\n",
 +latency_limit / 1000.0, latency_late);
 +

Any reason not to report a percentage here?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pgbench throttling latency limit

2014-08-30 Thread Fabien COELHO



+   if (latency_limit)
+   printf("number of transactions above the %.1f ms latency limit: "
INT64_FORMAT "\n",
+  latency_limit / 1000.0, latency_late);
+


Any reason not to report a percentage here?


Yes: I did not think of it.

Here is a v7, with a percentage. I also added a paragraph in the documentation
about how the latency is computed under throttling, and I tried to reorder
the reported stuff so that it is more logical.


--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..c102ad7 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */
+int64		latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +250,8 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
+	int64		latency_late; /* late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +264,8 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
+	int64		latency_late;
 } TResult;
 
 /*
@@ -367,6 +383,10 @@ usage(void)
 		   "  -f, --file=FILENAME  read transaction script from FILENAME\n"
 		 "  -j, --jobs=NUM       number of threads (default: 1)\n"
 		 "  -l, --log            write transaction times to log file\n"
+		 "  -L, --limit=NUM      count transactions lasting more than NUM ms.\n"
+		 "                       under throttling (--rate), transactions behind schedule\n"
+		 "                       more than NUM ms are skipped, and those finishing more\n"
+		 "                       than NUM ms after their scheduled start are counted.\n"
+		 "  -M, --protocol=simple|extended|prepared\n"
+		 "                       protocol for submitting queries (default: simple)\n"
+		 "  -n, --no-vacuum      do not run VACUUM before tests\n"
@@ -1016,6 +1036,24 @@ top:
 
 		thread->throttle_trigger += wait;
 
+		if (latency_limit)
+		{
+			instr_time	now;
+			int64		now_us;
+			INSTR_TIME_SET_CURRENT(now);
+			now_us = INSTR_TIME_GET_MICROSEC(now);
+			while (thread->throttle_trigger < now_us - latency_limit)
+			{
+				/* if too far behind, this slot is skipped, and we
+				 * iterate till the next nearly on time slot.
+				 */
+				int64 wait = (int64) (throttle_delay *
+					1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+				thread->throttle_trigger += wait;
+				thread->throttle_latency_skipped++;
+			}
+		}
+
 		st->until = thread->throttle_trigger;
 		st->sleeping = 1;
 		st->throttling = true;
@@ -1080,15 +1118,31 @@ top:
 			thread->exec_count[cnum]++;
 		}
 
-		/* transaction finished: record latency under progress or throttling */
-		if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+		/* transaction finished: record latency under progress or throttling,
+		 * or if latency limit is set
+		 */
+		if ((progress || throttle_delay || latency_limit) &&
+			commands[st->state + 1] == NULL)
 		{
 			instr_time	diff;
 			int64		latency;
 
 			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
-			latency = INSTR_TIME_GET_MICROSEC(diff);
+
+			if (throttle_delay)
+			{
+				/* Under throttling, compute latency wrt the initial schedule,
+				 * not the actual transaction start.
+				 */
+				int64 now = INSTR_TIME_GET_MICROSEC(diff);
+				latency = now - thread->throttle_trigger;
+			}
+			else
+			{
+				INSTR_TIME_SUBTRACT(diff, st->txn_begin);
+				latency = INSTR_TIME_GET_MICROSEC(diff);
+			}
+
 			st->txn_latencies += latency;
 
 			/*
@@ -1099,6 +1153,11 @@ top:
 			 * would take 256 hours.
 			 */
 			st->txn_sqlats += latency * latency;
+
+			/* record over the limit transactions if needed.
+			 */
+			if (latency_limit && latency > latency_limit)
+				thread->latency_late++;
 		}
 
 		/*
@@ -1294,7 +1353,7 @@ top:
 	}
 
 	/* Record transaction start time under logging, progress or throttling */
-	if ((logfile || progress || throttle_delay) && st->state == 0)
+	if ((logfile || progress || throttle_delay || latency_limit) && st->state == 0)
 		INSTR_TIME_SET_CURRENT(st->txn_begin);
 
 	/* Record statement start time if per-command latencies are requested */
@@ -2351,7 +2410,8 @@ 

Re: [HACKERS] [PATCH] empty xml values

2014-08-30 Thread Ali Akbar

 While looking into this report
 http://www.postgresql.org/message-id/cf48ccfb.65a9d%tim.k...@gmail.com I
 noticed that we don't accept empty values as xml content values, even
 though this should apparently be allowed by the spec.  Attached is a
 patch to fix it (which needs updates to xml_1.out, which I omit here for
 simplicity).


The patch works, although it must be applied with --ignore-whitespace. Attached
is the patch plus the xml_1.out updates.

I'm marking this as ready for committer.
-- 
Ali Akbar
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 422be69..7abe215 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -1400,11 +1400,14 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
 			doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
 			doc->standalone = standalone;
 
-			res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
-												   utf8string + count, NULL);
-			if (res_code != 0 || xmlerrcxt->err_occurred)
-				xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_CONTENT,
-							"invalid XML content");
+			if (*(utf8string + count))
+			{
+				res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
+													   utf8string + count, NULL);
+				if (res_code != 0 || xmlerrcxt->err_occurred)
+					xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_CONTENT,
+								"invalid XML content");
+			}
 		}
 	}
 	PG_CATCH();
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index 382f9df..6e6c673 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -194,6 +194,18 @@ SELECT xmlelement(name foo, xmlattributes( as funny, xml 'ba/r' as fun
  foo funny=lt;gt;amp;quot;' funnier=blt;a/gt;r/
 (1 row)
 
+SELECT xmlparse(content '');
+ xmlparse 
+--
+ 
+(1 row)
+
+SELECT xmlparse(content '  ');
+ xmlparse 
+--
+   
+(1 row)
+
 SELECT xmlparse(content 'abc');
  xmlparse 
 --
@@ -251,6 +263,22 @@ SELECT xmlparse(content '<nosuchprefix:tag/>');
  <nosuchprefix:tag/>
 (1 row)
 
+SELECT xmlparse(document '');
+ERROR:  invalid XML document
+DETAIL:  line 1: switching encoding : no input
+
+^
+line 1: Document is empty
+
+^
+line 1: Start tag expected, '<' not found
+
+^
+SELECT xmlparse(document '   ');
+ERROR:  invalid XML document
+DETAIL:  line 1: Start tag expected, '<' not found
+   
+   ^
 SELECT xmlparse(document 'abc');
 ERROR:  invalid XML document
 DETAIL:  line 1: Start tag expected, '<' not found
diff --git a/src/test/regress/expected/xml_1.out b/src/test/regress/expected/xml_1.out
index a34d1f4..b0e0067 100644
--- a/src/test/regress/expected/xml_1.out
+++ b/src/test/regress/expected/xml_1.out
@@ -164,6 +164,14 @@ SELECT xmlelement(name foo, xmlattributes( as funny, xml 'ba/r' as fun
 ERROR:  unsupported XML feature
 DETAIL:  This functionality requires the server to be built with libxml support.
 HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xmlparse(content '');
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xmlparse(content '  ');
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
 SELECT xmlparse(content 'abc');
 ERROR:  unsupported XML feature
 DETAIL:  This functionality requires the server to be built with libxml support.
@@ -196,6 +204,14 @@ SELECT xmlparse(content '<nosuchprefix:tag/>');
 ERROR:  unsupported XML feature
 DETAIL:  This functionality requires the server to be built with libxml support.
 HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xmlparse(document '');
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xmlparse(document '   ');
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
 SELECT xmlparse(document 'abc');
 ERROR:  unsupported XML feature
 DETAIL:  This functionality requires the server to be built with libxml support.
diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql
index 90d4d67..922ab7a 100644
--- a/src/test/regress/sql/xml.sql
+++ b/src/test/regress/sql/xml.sql
@@ -60,6 +60,8 @@ SELECT xmlelement(name foo, xmlattributes('infinity'::timestamp as bar));
 SELECT xmlelement(name foo, xmlattributes( as funny, xml 'ba/r' as funnier));
 
 
+SELECT xmlparse(content '');
+SELECT xmlparse(content '  ');
 SELECT xmlparse(content 'abc');
 SELECT xmlparse(content '<abc>x</abc>');
 SELECT xmlparse(content '<invalidentity>&</invalidentity>');
@@ -69,6 +71,8 @@ SELECT xmlparse(content '<relativens xmlns=''relative''/>');
 SELECT xmlparse(content 

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Andres Freund
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
 A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
 them out in order 
 (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
 The performance impact of that was inconclusive, but one thing that it
 allows nicely is to interleave the fsyncs, so that you write all the buffers
 for one file, then fsync it, then next file and so on. IIRC the biggest
 worry with that patch was that sorting the buffers requires a fairly large
 amount of memory, and making a large allocation in the checkpointer might
 cause an out-of-memory, which would be bad.
 
 I don't think anyone's seriously worked on this area since. If the impact on
 responsiveness or performance is significant, I'm pretty sure the OOM
 problem could be alleviated somehow.

I've dug up that patch (after a bit of fighting with the archives) and
refreshed it. It's *clearly* beneficial:

master:
andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 
180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 2847.6 tps, lat 53.862 ms stddev 49.219
...
progress: 67.0 s, 3435.4 tps, lat 52.920 ms stddev 48.720
progress: 68.2 s, 2586.9 tps, lat 57.793 ms stddev 64.228
progress: 69.1 s, 546.5 tps, lat 294.940 ms stddev 189.546
progress: 70.0 s, 1741.3 tps, lat 134.298 ms stddev 204.740
progress: 71.0 s, 3868.8 tps, lat 48.423 ms stddev 47.934
..
progress: 89.0 s, 4022.8 tps, lat 45.601 ms stddev 40.685
progress: 90.0 s, 2463.5 tps, lat 61.907 ms stddev 64.342
progress: 91.2 s, 856.3 tps, lat 211.610 ms stddev 149.916
progress: 92.0 s, 1026.9 tps, lat 177.103 ms stddev 144.448
progress: 93.0 s, 736.9 tps, lat 254.230 ms stddev 227.339
progress: 94.1 s, 766.9 tps, lat 208.031 ms stddev 181.340
progress: 95.1 s, 979.7 tps, lat 197.014 ms stddev 193.648
progress: 96.0 s, 868.9 tps, lat 214.060 ms stddev 198.297
progress: 97.1 s, 943.4 tps, lat 178.062 ms stddev 143.224
progress: 98.0 s, 934.5 tps, lat 192.435 ms stddev 197.901
progress: 99.6 s, 623.1 tps, lat 202.954 ms stddev 165.576
progress: 100.0 s, 464.7 tps, lat 683.600 ms stddev 376.520
progress: 101.1 s, 516.0 tps, lat 395.033 ms stddev 480.346
progress: 102.0 s, 364.9 tps, lat 507.933 ms stddev 499.670
progress: 103.3 s, 592.9 tps, lat 214.123 ms stddev 273.411
progress: 104.1 s, 930.2 tps, lat 316.487 ms stddev 335.096
progress: 105.4 s, 627.6 tps, lat 262.496 ms stddev 200.690
progress: 106.1 s, 788.6 tps, lat 235.510 ms stddev 202.366
progress: 107.5 s, 644.8 tps, lat 269.020 ms stddev 223.900
progress: 108.0 s, 725.0 tps, lat 262.692 ms stddev 218.774
progress: 109.0 s, 660.0 tps, lat 272.808 ms stddev 248.501
progress: 110.0 s, 604.3 tps, lat 303.727 ms stddev 264.921
progress: 111.0 s, 723.6 tps, lat 243.224 ms stddev 229.426
progress: 112.1 s, 668.6 tps, lat 257.026 ms stddev 190.247
progress: 113.1 s, 345.0 tps, lat 492.114 ms stddev 312.082
progress: 115.4 s, 390.9 tps, lat 416.708 ms stddev 391.577
progress: 115.4 s, 14598.5 tps, lat 551.617 ms stddev 539.611
progress: 116.1 s, 161.5 tps, lat 741.611 ms stddev 485.498
progress: 117.4 s, 269.1 tps, lat 697.978 ms stddev 576.970
progress: 118.8 s, 262.3 tps, lat 674.887 ms stddev 587.848
progress: 119.1 s, 195.2 tps, lat 833.959 ms stddev 733.592
progress: 120.0 s, 3000.6 tps, lat 104.272 ms stddev 291.851
progress: 121.0 s, 3167.7 tps, lat 56.576 ms stddev 51.976
progress: 122.0 s, 3398.1 tps, lat 51.322 ms stddev 48.057
progress: 123.0 s, 3721.9 tps, lat 50.355 ms stddev 46.994
progress: 124.0 s, 2929.3 tps, lat 50.996 ms stddev 45.553
progress: 125.0 s, 754.5 tps, lat 269.293 ms stddev 287.508
progress: 126.0 s, 3297.0 tps, lat 56.912 ms stddev 77.053
...
progress: 144.0 s, 4207.9 tps, lat 44.440 ms stddev 37.210
progress: 145.9 s, 2036.4 tps, lat 79.025 ms stddev 105.411
progress: 146.0 s, 1003.1 tps, lat 292.934 ms stddev 223.650
progress: 147.4 s, 520.8 tps, lat 318.670 ms stddev 244.596
progress: 148.0 s, 3557.3 tps, lat 71.626 ms stddev 143.174
progress: 149.0 s, 4106.1 tps, lat 43.557 ms stddev 36.444
progress: 150.0 s, 4132.3 tps, lat 43.185 ms stddev 34.611
progress: 151.0 s, 4233.3 tps, lat 43.239 ms stddev 39.121
progress: 152.0 s, 4178.2 tps, lat 43.242 ms stddev 40.377
progress: 153.0 s, 755.1 tps, lat 198.560 ms stddev 155.927
progress: 154.1 s, 773.6 tps, lat 240.044 ms stddev 192.472
progress: 155.0 s, 553.7 tps, lat 303.532 ms stddev 245.491
progress: 156.2 s, 772.7 tps, lat 242.925 ms stddev 226.754
progress: 157.1 s, 541.0 tps, lat 295.132 ms stddev 218.857
progress: 158.1 s, 716.8 tps, lat 281.823 ms stddev 227.488
progress: 159.1 s, 748.7 tps, lat 223.275 ms stddev 186.162
progress: 160.0 s, 503.0 tps, lat 311.621 ms stddev 215.952
progress: 161.1 s, 905.0 tps, lat 239.623 ms stddev 245.539
progress: 162.4 s, 360.4 tps, lat 329.583 ms stddev 250.094
progress: 163.3 s, 348.9 tps, lat 583.476 ms stddev 432.200
progress: 165.5 s, 186.1 tps, lat 765.542 ms stddev 

Re: [HACKERS] Built-in binning functions

2014-08-30 Thread Tom Lane
Pavel Stehule pavel.steh...@gmail.com writes:
 1. I am thinking that the reduction to only numeric types is not necessary -
 although we can live without it - but there are a lot of non-numeric
 categories: chars, dates, ...

I wasn't terribly happy about that either.  I still think we should
reduce this to a single polymorphic function, as in the attached.

 2. Still, I am strongly afraid about the search method used - it is not
 possible to check the validity of the input.  Did you check how much slower
 linear searching is - you said you have real data and a real use case?

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N).  That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()).  I think it's probably
sufficient to have a clear warning in the docs.
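
For reference, the O(log N) behaviour being defended is just a binary search
over the (assumed pre-sorted) threshold array; a standalone sketch of the
idea, not the proposed function itself:

#include <stdio.h>

/* Return the 1-based bucket for "operand" given "n" ascending thresholds
 * treated as inclusive lower bounds; 0 means below the first threshold.
 * No check that the array really is sorted -- the trade-off discussed. */
static int
bucket_of(double operand, const double *thresholds, int n)
{
	int		lo = 0,
			hi = n;

	while (lo < hi)
	{
		int		mid = (lo + hi) / 2;

		if (operand >= thresholds[mid])
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

int
main(void)
{
	double	edges[4] = {1.0, 5.0, 10.0, 50.0};

	printf("%d %d %d\n",
		   bucket_of(0.5, edges, 4),	/* 0: below all thresholds */
		   bucket_of(5.0, edges, 4),	/* 2: lower bounds are inclusive */
		   bucket_of(99.0, edges, 4));	/* 4: above the last threshold */
	return 0;
}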

 3. I am thinking about the name - I don't think varwidth_bucket is correct,
 in relation to the name width_bucket ... what about range_bucket or
 scope_bucket?

Neither of those seem like improvements from here.  I agree with the
objections to bin() as well.  bucket() might be all right but it still
seems a bit too generic.  width_bucket(), which at least shows there's
a relationship to the standard functions, still seems like the best
of the suggestions so far.

It occurs to me that there would be an advantage to using some other
name besides width_bucket: we could then mark the function as variadic,
which might be a notational advantage in some usages.  (We can't do
that if we call it width_bucket because the four-parameter case would
be ambiguous with the existing functions.)  I'm not sure that this is
important enough to justify changing the name though, especially if
we can't come up with a good name.  Also, doing that would put a very
large premium on picking a non-generic name, else we'd risk creating
ambiguities against user-defined functions.

Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as right bounds, which I didn't care
for and replaced with upper bounds.  However, looking at it again,
I wonder if we would not be better off describing them as lower bounds.
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing.  Thoughts?

Proposed patch with 1 polymorphic function attached.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 722640b..93cf1e7 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 901,925 ****
   <indexterm>
    <primary>width_bucket</primary>
   </indexterm>
! <literal><function>width_bucket(<parameter>op</parameter> <type>numeric</type>, <parameter>b1</parameter> <type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter> <type>int</type>)</function></literal>
 </entry>
 <entry><type>int</type></entry>
 <entry>return the bucket to which <parameter>operand</> would
 be assigned in an equidepth histogram with <parameter>count</>
!buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
 <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
 <entry><literal>3</literal></entry>
</row>
  
<row>
! <entry><literal><function>width_bucket(<parameter>op</parameter> <type>dp</type>, <parameter>b1</parameter> <type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter> <type>int</type>)</function></literal></entry>
 <entry><type>int</type></entry>
 <entry>return the bucket to which <parameter>operand</> would
 be assigned in an equidepth histogram with <parameter>count</>
!buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
 <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
 <entry><literal>3</literal></entry>
</row>
   </tbody>
  </tgroup>
 </table>
--- 901,936 ----
   <indexterm>
    <primary>width_bucket</primary>
   </indexterm>
! <literal><function>width_bucket(<parameter>operand</parameter> <type>numeric</type>, <parameter>b1</parameter> <type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter> <type>int</type>)</function></literal>
 </entry>
 <entry><type>int</type></entry>
 <entry>return the bucket to which <parameter>operand</> would
 be assigned in an equidepth histogram with <parameter>count</>
!buckets spanning the range <parameter>b1</> to <parameter>b2</></entry>
 <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
 <entry><literal>3</literal></entry>
</row>
  
<row>
! <entry><literal><function>width_bucket(<parameter>operand</parameter> <type>dp</type>, <parameter>b1</parameter> <type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter> <type>int</type>)</function></literal></entry>
 <entry><type>int</type></entry>
 <entry>return the bucket to which <parameter>operand</> would
 be assigned in an equidepth histogram with 

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
 A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
 them out in order 
 (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
 The performance impact of that was inconclusive, but one thing that it
 allows nicely is to interleave the fsyncs, so that you write all the buffers
 for one file, then fsync it, then next file and so on.

 ...
 So, *very* clearly sorting is a benefit.

pg_bench alone doesn't convince me on this.  The original thread found
cases where it was a loss, IIRC; you will need to test many more than
one scenario to prove the point.

Also, it does not matter how good it looks in test cases if it causes
outright failures due to OOM; unlike you, I am not prepared to just wave
away that risk.  A possible compromise is to sort a limited number of
buffers -- say, collect a few thousand dirty buffers then sort, dump and
fsync them, repeat as needed.

regards, tom lane




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Andres Freund
On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
  A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
  them out in order 
  (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
  The performance impact of that was inconclusive, but one thing that it
  allows nicely is to interleave the fsyncs, so that you write all the 
  buffers
  for one file, then fsync it, then next file and so on.
 
  ...
  So, *very* clearly sorting is a benefit.
 
 pg_bench alone doesn't convince me on this.  The original thread found
 cases where it was a loss, IIRC; you will need to test many more than
 one scenario to prove the point.

Sure. And I'm not claiming Itagaki/your old patch is immediately going
to be ready for commit. But our checkpoint performance has sucked for
years in the field. I don't think we can wave that away.

I think the primary reason it wasn't easily visible as being beneficial
back then was that only the throughput, not the latency and such were
looked at.

 Also, it does not matter how good it looks in test cases if it causes
 outright failures due to OOM; unlike you, I am not prepared to just wave
 away that risk.

I'm not waving away any risks.

If the sort buffer is allocated when the checkpointer is started, not
everytime we sort, as you've done in your version of the patch I think
that risk is pretty manageable. If we really want to be sure nothing is
happening at runtime, even if the checkpointer was restarted, we can put
the sort array in shared memory.

We're talking about (sizeof(BufferTag) + sizeof(int))/8192 ~= 0.3 %
overhead over shared_buffers here. If we decide to go that way, it's a
pretty darn small price to pay to not regularly have stalls that last
minutes.
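
Spelling that arithmetic out, under the usual assumptions about struct sizes
(RelFileNode is three 4-byte OIDs, ForkNumber and BlockNumber 4 bytes each):

#include <stdio.h>

int
main(void)
{
	double	buffer_tag = 3 * 4 + 4 + 4;		/* assumed BufferTag size: 20 bytes */
	double	per_buffer = buffer_tag + 4;	/* plus the int from the formula above */
	double	blcksz = 8192.0;				/* bytes pinned per shared buffer */

	printf("%.2f%%\n", 100.0 * per_buffer / blcksz);	/* prints 0.29% */
	return 0;
}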

 A possible compromise is to sort a limited number of
 buffers  say, collect a few thousand dirty buffers then sort, dump and
 fsync them, repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because
it robs us of the chance to fsync() a relfilenode immediately after
having synced all its buffers. Doing so is rather beneficial because
then fewer independently dirtied pages have to be flushed out - reducing
the impact of the fsync().

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
 A possible compromise is to sort a limited number of
 buffers -- say, collect a few thousand dirty buffers then sort, dump and
 fsync them, repeat as needed.

 Yea, that's what I suggested nearby. But I don't really like it, because
 it robs us of the chance to fsync() a relfilenode immediately after
 having synced all its buffers.

Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
once per sort-group and not just once per checkpoint, but it's not clear
that that's a loss as long as the group size isn't tiny.

regards, tom lane




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Andres Freund
On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
  A possible compromise is to sort a limited number of
  buffers -- say, collect a few thousand dirty buffers then sort, dump and
  fsync them, repeat as needed.
 
  Yea, that's what I suggested nearby. But I don't really like it, because
  it robs us of the chance to fsync() a relfilenode immediately after
  having synced all its buffers.
 
 Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
 once per sort-group and not just once per checkpoint, but it's not clear
 that that's a loss as long as the group size isn't tiny.

Because it wouldn't have the benefit of syncing the minimal amount of
data anymore. If lots of other relfilenodes have been synced in between,
the amount of newly dirtied pages in the OS's buffercache (written by
backends, bgwriter) for an individual relfilenode is much higher.

An fsync() on a file with dirty data often causes *serious* latency
spikes - we should try hard to avoid superfluous calls.

As an example: Calling fsync() on pgbench_accounts's underlying files,
from outside postgres, *before* postgres even started its first
checkpoint does this:
progress: 72.0 s, 4324.9 tps, lat 41.481 ms stddev 40.567
progress: 73.0 s, 4704.9 tps, lat 38.465 ms stddev 35.436
progress: 74.0 s, 4448.5 tps, lat 40.058 ms stddev 32.634
progress: 75.0 s, 4634.5 tps, lat 39.229 ms stddev 33.463
progress: 76.8 s, 2753.1 tps, lat 48.693 ms stddev 75.309
progress: 77.1 s, 126.6 tps, lat 773.433 ms stddev 222.667
progress: 78.0 s, 183.7 tps, lat 786.401 ms stddev 395.954
progress: 79.1 s, 170.3 tps, lat 975.949 ms stddev 596.751
progress: 80.0 s, 2116.6 tps, lat 168.608 ms stddev 398.933
progress: 81.0 s, 4436.1 tps, lat 40.313 ms stddev 34.198
progress: 82.0 s, 4383.9 tps, lat 41.811 ms stddev 37.241

Note the dip from 4k tps to 130 tps.

We can get a handle on that (on some platforms at least) for writes
issued during the buffer sync by forcing the kernel to write out the
pages in small increments; but I doubt we want to do that for writes by
backends themselves.
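
(On Linux the usual way to force that kind of incremental writeback is
sync_file_range(); a sketch of the idea only, nothing from the patch:)

#define _GNU_SOURCE
#include <fcntl.h>

/* Nudge the kernel to start writeback of one just-written 8 kB block so
 * dirty data doesn't pile up ahead of the eventual fsync().  Linux-only,
 * and purely an illustration of the "small increments" idea above. */
static void
hint_writeback(int fd, off_t offset)
{
	(void) sync_file_range(fd, offset, 8192, SYNC_FILE_RANGE_WRITE);
}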

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Selectivity estimation for inet operators

2014-08-30 Thread Tom Lane
Emre Hasegeli e...@hasegeli.com writes:
 I updated the patch to cover semi and anti joins with eqjoinsel_semi().
 I think it is better than returning a constant.

What you did there is utterly unacceptable from a modularity standpoint;
and considering that the values will be nowhere near right, the argument
that it's better than returning a constant seems pretty weak.  I think
you should just take that out again.

regards, tom lane




Re: [HACKERS] Selectivity estimation for inet operators

2014-08-30 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 * inet_mcv_join_selec() is O(n^2) where n is the number of entries in 
 the MCV lists. With the max statistics target of 1, a worst case 
 query on my laptop took about 15 seconds to plan. Maybe that's 
 acceptable, but you went through some trouble to make planning of MCV vs 
 histogram faster, by the log2 method to compare only some values, so I 
 wonder why you didn't do the same for the MCV vs MCV case?

Actually, what I think needs to be asked is the opposite question: why is
the other code ignoring some of the statistical data?  If the user asked
us to collect a lot of stats detail it seems reasonable that he's
expecting us to use it to get more accurate estimates.  It's for sure
not obvious why these estimators should take shortcuts that are not being
taken in the much-longer-established code for scalar comparison estimates.

I'm not exactly convinced that the math adds up in this logic, either.
The way in which it combines results from looking at the MCV lists and
at the histograms seems pretty arbitrary.

regards, tom lane




Re: [HACKERS] What in the world is happening with castoroides and protosciurus?

2014-08-30 Thread Noah Misch
On Tue, Aug 26, 2014 at 10:17:05AM +0100, Dave Page wrote:
 On Tue, Aug 26, 2014 at 1:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  For the last month or so, these two buildfarm animals (which I believe are
  the same physical machine) have been erratically failing with errors that
  reflect low-order differences in floating-point calculations.
 
  A recent example is at
 
  http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-08-25%2010%3A39%3A52
 
  where the only regression diff is
 
  *** 
  /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/expected/hash_index.out
 Mon Aug 25 11:41:00 2014
  --- 
  /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/results/hash_index.out
  Mon Aug 25 11:57:26 2014
  ***
  *** 171,179 
SELECT h.seqno AS i8096, h.random AS f1234_1234
   FROM hash_f8_heap h
   WHERE h.random = '-1234.1234'::float8;
  !  i8096 | f1234_1234
  ! ---+
  !   8906 | -1234.1234
(1 row)
 
UPDATE hash_f8_heap
  --- 171,179 
SELECT h.seqno AS i8096, h.random AS f1234_1234
   FROM hash_f8_heap h
   WHERE h.random = '-1234.1234'::float8;
  !  i8096 |f1234_1234
  ! ---+---
  !   8906 | -1234.12356777216
(1 row)
 
UPDATE hash_f8_heap
 
  ... a result that certainly makes no sense.  The results are not
  repeatable, failing in equally odd ways in different tests on different
  runs.  This is happening in all the back branches too, not just HEAD.

 I have
 no idea what is causing the current issue - the machine is stable
 software-wise, and only has private builds of dependency libraries
 updated periodically (which are not used for the buildfarm). If I had
 to hazard a guess, I'd suggest this is an early symptom of an old
 machine which is starting to give up.

Agreed.  Rerunning each animal against older commits would test that theory.
Say, run against the last 6 months of REL9_0_STABLE commits.  If those runs
show today's failure frequencies instead of historic failure frequencies, it's
not a PostgreSQL regression.  Not that I see a commit back-patched near the
time of the failure uptick (2014-08-06) that looks remotely likely to have
introduced such a regression.

It would be sad to lose our only buildfarm coverage of plain Solaris and of
the Sun Studio compiler, but having buildfarm members this unstable is a pain.
Perhaps have those animals retry the unreliable steps up to, say, 7 times?




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Heikki Linnakangas

On 08/30/2014 09:45 PM, Andres Freund wrote:

On 2014-08-30 14:16:10 -0400, Tom Lane wrote:

Andres Freund and...@2ndquadrant.com writes:

On 2014-08-30 13:50:40 -0400, Tom Lane wrote:

A possible compromise is to sort a limited number of
buffers -- say, collect a few thousand dirty buffers then sort, dump and
fsync them, repeat as needed.



Yea, that's what I suggested nearby. But I don't really like it, because
it robs us of the chance to fsync() a relfilenode immediately after
having synced all its buffers.


Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
once per sort-group and not just once per checkpoint, but it's not clear
that that's a loss as long as the group size isn't tiny.


Because it wouldn't have the benefit of syncing the minimal amount of
data anymore. If lots of other relfilenodes have been synced in between,
the amount of newly dirtied pages in the OS's buffercache (written by
backends, bgwriter) for an individual relfilenode is much higher.


I wonder how much of the benefit from sorting comes from sorting the 
pages within each file, and how much just from grouping all the writes 
of each file together. In other words, how much difference is there 
between sorting, and fsyncing between each file, or the crude patch I 
posted earlier.


If we're going to fsync between each file, there's no need to sort all 
the buffers at once. It's enough to pick one file as the target - like 
in my crude patch - and sort only the buffers for that file. Then fsync 
that file and move on to the next file. That requires scanning the 
buffers multiple times, but I think that's OK.
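
A bare-bones sketch of that flow (hypothetical helpers, nothing like the real
bufmgr/checkpointer code): sort the dirty buffers by file and block, write
them in that order, and fsync each file as soon as its last page has been
written.

#include <stdint.h>
#include <stdlib.h>

typedef struct
{
	uint32_t	relfilenode;	/* which file the dirty page belongs to */
	uint32_t	blocknum;		/* block within that file */
	int			buf_id;			/* shared-buffer slot holding the page */
} DirtyBuf;

/* Stand-ins for the real machinery; assumed, not actual PostgreSQL calls. */
static void write_buffer(int buf_id) { (void) buf_id; }
static void fsync_file(uint32_t relfilenode) { (void) relfilenode; }

static int
dirtybuf_cmp(const void *a, const void *b)
{
	const DirtyBuf *x = a;
	const DirtyBuf *y = b;

	if (x->relfilenode != y->relfilenode)
		return x->relfilenode < y->relfilenode ? -1 : 1;
	if (x->blocknum != y->blocknum)
		return x->blocknum < y->blocknum ? -1 : 1;
	return 0;
}

/* Write file by file in block order, fsyncing each file right after its
 * last page so few newly dirtied OS pages accumulate before the fsync. */
void
checkpoint_write_sorted(DirtyBuf *bufs, size_t n)
{
	qsort(bufs, n, sizeof(DirtyBuf), dirtybuf_cmp);
	for (size_t i = 0; i < n; i++)
	{
		write_buffer(bufs[i].buf_id);
		if (i + 1 == n || bufs[i + 1].relfilenode != bufs[i].relfilenode)
			fsync_file(bufs[i].relfilenode);
	}
}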


- Heikki





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Andres Freund
On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:
 On 08/30/2014 09:45 PM, Andres Freund wrote:
 On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
 A possible compromise is to sort a limited number of
 buffers -- say, collect a few thousand dirty buffers then sort, dump and
 fsync them, repeat as needed.
 
 Yea, that's what I suggested nearby. But I don't really like it, because
 it robs us of the chance to fsync() a relfilenode immediately after
 having synced all its buffers.
 
 Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
 once per sort-group and not just once per checkpoint, but it's not clear
 that that's a loss as long as the group size isn't tiny.
 
 Because it wouldn't have the benefit of syncing the minimal amount of
 data anymore. If lots of other relfilenodes have been synced in between,
 the amount of newly dirtied pages in the OS's buffercache (written by
 backends, bgwriter) for an individual relfilenode is much higher.
 
 I wonder how much of the benefit from sorting comes from sorting the pages
 within each file, and how much just from grouping all the writes of each
 file together. In other words, how much difference is there between sorting,
 and fsyncing between each file, or the crude patch I posted earlier.

I haven't implemented fsync()ing between files so far. From the I/O stats
I'm seeing, the performance improvements come from the OS being able to
write data back in bigger chunks, which seems entirely reasonable. If
the database and the write load are big enough that writeback will be
triggered repeatedly during one checkpoint, the OS's buffercache will
have lots of nonsequential data to flush, leading to much smaller I/Os,
more seeks and deeper queues (= latency increases).

 If we're going to fsync between each file, there's no need to sort all the
 buffers at once. It's enough to pick one file as the target - like in my
 crude patch - and sort only the buffers for that file. Then fsync that file
 and move on to the next file. That requires scanning the buffers multiple
 times, but I think that's OK.

I really can't see that working out. Production instances of postgres
with large shared_buffers settings (say 96GB in one case) have tens of
thousands of relations (~34500 in the same case). And that's a database
with a relatively simple schema. I've seen much worse.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Re: [BUGS] Re: BUG #9555: pg_dump for tables with inheritance recreates the table with the wrong order of columns

2014-08-30 Thread Bruce Momjian
On Wed, Aug 27, 2014 at 09:40:30PM -0400, Noah Misch wrote:
   3.  use the pg_dump binary-upgrade code when such cases happen
 
 +1.  We have the convention that, while --binary-upgrade can inject catalog
 hacks, regular pg_dump uses standard, documented DDL.  I like that convention
 on general aesthetic grounds and for its benefit to non-superusers.  Let's
 introduce the DDL needed to fix this bug while preserving that convention,
 namely DDL to toggle attislocal.

I have spend some time researching this, and I am not sure what to
recommend.  The basic issue is that CREATE TABLE INHERITS always puts
the inherited columns first, so to preserve column ordering, you have to
use CREATE TABLE and then ALTER TABLE INHERIT.  The problem there is
that ALTER TABLE INHERIT doesn't preserve attislocal, and it also has
problems with constraints not being marked local.  I am just not sure we
want to add SQL-level code to do that.  Would it be documented?

I decided to step back and consider some issues.  Basically, in
non-binary-upgrade mode, pg_dump is taking:

CREATE TABLE A(a int, b int, c int);
CREATE TABLE B(a int, c int);
ALTER TABLE a INHERIT B;

and dumping it as:

CREATE TABLE a (
a integer,
b integer,
c integer
)
INHERITS (b);

This looks perfect, but, of course, it isn't.  Column 'c' is going to be
placed before 'b'.  You do get a notice about merged columns, but no
mention of the column reordering:

NOTICE:  merging column "a" with inherited definition
NOTICE:  merging column "c" with inherited definition

I think there are two common cases for CREATE TABLE INHERITS, and then
there is this odd one.  The common cases are those where all inherited
columns are mentioned explicitly in CREATE TABLE INHERITS, in order,
and those where none of the inherited columns are mentioned.
The case above is the odd case where some are mentioned, but in a
different order.

I have developed the attached patch to warn about column reordering in
this odd case.  The patch mentions the reordering of c:

NOTICE:  merging column "a" with inherited definition
NOTICE:  merging column "c" with inherited definition;  column moved
earlier to match inherited column location

This, of course, will be emitted when the pg_dump output is restored.
This is probably the minimum we should do.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
new file mode 100644
index 3720a0f..4846d25
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** MergeAttributes(List *schema, List *supe
*** 1756,1767 ****
--- 1756,1771 ----
  	 */
  	if (inhSchema != NIL)
  	{
+ 		int		schema_attno = 0;
+ 
  		foreach(entry, schema)
  		{
  			ColumnDef  *newdef = lfirst(entry);
  			char	   *attributeName = newdef->colname;
  			int			exist_attno;
  
+ 			schema_attno++;
+ 
  			/*
  			 * Does it conflict with some previously inherited column?
  			 */
*** MergeAttributes(List *schema, List *supe
*** 1780,1788 ****
   * Yes, try to merge the two column definitions. They must
   * have the same type, typmod, and collation.
   */
! ereport(NOTICE,
!    (errmsg("merging column \"%s\" with inherited definition",
! 		   attributeName)));
  def = (ColumnDef *) list_nth(inhSchema, exist_attno - 1);
  typenameTypeIdAndMod(NULL, def->typeName, &defTypeId, &deftypmod);
  typenameTypeIdAndMod(NULL, newdef->typeName, &newTypeId, &newtypmod);
--- 1784,1797 ----
   * Yes, try to merge the two column definitions. They must
   * have the same type, typmod, and collation.
   */
!  if (exist_attno == schema_attno)
! 	ereport(NOTICE,
! 	   (errmsg("merging column \"%s\" with inherited definition",
! 			   attributeName)));
! else
! 	ereport(NOTICE,
! 	   (errmsg("merging column \"%s\" with inherited definition;  column moved earlier to match inherited column location",
! 			   attributeName)));
  def = (ColumnDef *) list_nth(inhSchema, exist_attno - 1);
  typenameTypeIdAndMod(NULL, def->typeName, &defTypeId, &deftypmod);
  typenameTypeIdAndMod(NULL, newdef->typeName, &newTypeId, &newtypmod);



Re: [HACKERS] COPY and heap_sync

2014-08-30 Thread Fabrízio de Royes Mello
On Sat, Aug 30, 2014 at 5:05 AM, Atri Sharma atri.j...@gmail.com wrote:


 On Saturday, August 30, 2014, Amit Kapila amit.kapil...@gmail.com wrote:

 On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes jeff.ja...@gmail.com
wrote:
 
  If you insert tuples with COPY into a table created or truncated in
the same transaction, at the end of the COPY it calls heap_sync.
 
  But there cases were people use COPY in a loop with a small amount of
data in each statement.  Now it is calling heap_sync many times, and if
NBuffers is large doing that gets very slow.
 
  Could the heap_sync be safely delayed until the end of the
transaction, rather than the end of the COPY?

 Wouldn't unconditionally delaying the sync until end of transaction
 lead to a burst of I/O at that time, especially if there are many
 such copy commands in a transaction, leading to delays in some
 other operations that might be happening concurrently in the
 system?




 I agree with that, but then it can provide us the same benefits as
group commit, especially when most of the copy commands touch pages which
are nearby, hence reducing the seek-time overhead.

 We could look at making it optional through a GUC, since it is useful,
albeit only for some specific use cases.


It's interesting... maybe something analogous to SET CONSTRAINTS
DEFERRED...

SET COPY COMMIT { IMMEDIATE | DEFERRED }

or

SET COPY MODE { IMMEDIATE | DEFERRED }

Just some thoughts!
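
Whatever the syntax ends up being, the deferral itself could look roughly
like this (just a sketch; pending_syncs, CopyDeferHeapSync and
AtEOXact_DeferredSyncs are invented names, not anything that exists today):

#include "postgres.h"
#include "access/heapam.h"
#include "nodes/pg_list.h"
#include "storage/lock.h"
#include "utils/memutils.h"
#include "utils/rel.h"

static List *pending_syncs = NIL;

/* called where COPY currently calls heap_sync() directly */
static void
CopyDeferHeapSync(Relation rel)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(TopTransactionContext);

	pending_syncs = list_append_unique_oid(pending_syncs,
										   RelationGetRelid(rel));
	MemoryContextSwitchTo(oldcxt);
}

/* called once at end of transaction, before the commit becomes final */
static void
AtEOXact_DeferredSyncs(void)
{
	ListCell   *lc;

	foreach(lc, pending_syncs)
	{
		Relation	rel = heap_open(lfirst_oid(lc), AccessShareLock);

		heap_sync(rel);			/* one sync per relation, not per COPY */
		heap_close(rel, AccessShareLock);
	}
	pending_syncs = NIL;		/* list memory goes away with the transaction */
}

That way a loop of small COPYs into the same freshly created table pays for
one heap_sync instead of one per statement.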

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello


Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-30 Thread Thomas Munro
On 28 August 2014 00:25, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Thomas Munro wrote:
 I haven't yet figured out how to get into a situation where
 heap_lock_updated_tuple_rec waits.

 Well, as I think I said in the first post I mentioned this, maybe there
 is no such situation.  In any case, like the EvalPlanQualFetch issue, we
 can fix it later if we find it.

I finally came up with a NOWAIT spec that reaches
heap_lock_updated_tuple_rec and then blocks.  I can't explain why exactly...
but please see attached.  The fix seems fairly straightforward.  Do
you think I should submit an independent patch to fix this case (well
there are really two cases, since there is a separate multixact path)
for the existing NOWAIT support and then tackle the SKIP LOCKED
equivalent separately?

Best regards,
Thomas Munro
# Test NOWAIT with an updated tuple chain (heap_lock_updated_tuple case)

setup
{
  CREATE TABLE foo (
	id int PRIMARY KEY,
	data text NOT NULL
  );
  INSERT INTO foo VALUES (1, 'x');
}

teardown
{
  DROP TABLE foo;
}

session s1
setup		{ BEGIN; }
step s1a	{ SELECT * FROM foo WHERE pg_advisory_lock(0) IS NOT NULL FOR KEY SHARE NOWAIT; }
step s1b	{ COMMIT; }

session s2
setup   { BEGIN; }
step s2a	{ SELECT pg_advisory_lock(0); }
step s2b	{ UPDATE foo SET id = id; UPDATE foo SET id = id + 10; }
step s2c	{ SELECT pg_advisory_unlock(0); }
step s2d	{ COMMIT; }

permutation s2a s1a s2b s2c s1b s2d



Re: [HACKERS] PATCH: Allow distdir to be overridden on make command line

2014-08-30 Thread Peter Eisentraut
On Fri, 2014-08-29 at 10:04 +0800, Craig Ringer wrote:
 Not just a one line patch, a one character patch.
 
 Use ?= instead of = in distdir assignment, so it can be overridden on
 the command line when building dist tarballs with patches.

This is already possible without this patch.

You can also override the VERSION variable.

Moreover, you might be interested in the configure option
--with-extra-version.





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Mitsumasa KONDO
Hi,

2014-08-31 8:10 GMT+09:00 Andres Freund and...@2ndquadrant.com:

 On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:

 If we're going to fsync between each file, there's no need to sort all the
  buffers at once. It's enough to pick one file as the target - like in my
  crude patch - and sort only the buffers for that file. Then fsync that
 file
  and move on to the next file. That requires scanning the buffers multiple
  times, but I think that's OK.

 I really can't see that working out. Production instances of postgres
 with large shared_buffers settings (say 96GB in one case) have tens of
 thousands of relations (~34500 in the same case). And that's a database
 with a relatively simple schema. I've seen much worse.

Yeah, it is impossible in one checkpointer process. The cost of searching
all the buffers is higher than we expect. We need a clever algorithm for
efficient, distributed buffer searching using multiple processes or threads.

Regards,
--
Mitsumasa KONDO


Re: [HACKERS] PATCH: Allow distdir to be overridden on make command line

2014-08-30 Thread Michael Paquier
On Sun, Aug 31, 2014 at 10:37 AM, Peter Eisentraut pete...@gmx.net wrote:

 On Fri, 2014-08-29 at 10:04 +0800, Craig Ringer wrote:
  Not just a one line patch, a one character patch.
 
  Use ?= instead of = in distdir assignment, so it can be overridden on
  the command line when building dist tarballs with patches.

 This is already possible without this patch.

 You can also override the VERSION variable.

 Moreover, you might be interested in the configure option
 --with-extra-version.


--with-extra-version is available as well with MSVC scripts on HEAD.
-- 
Michael


Re: [HACKERS] add line number as prompt option to psql

2014-08-30 Thread Sawada Masahiko
On Tue, Aug 26, 2014 at 10:23 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-08-21 11:43:48 +0900, Sawada Masahiko wrote:
 On Wed, Aug 20, 2014 at 9:00 PM, Jeevan Chalke
 jeevan.cha...@enterprisedb.com wrote:
  Hi,
 
  I have reviewed this:
 
  I have initialized cur_lineno to UINTMAX - 2, and then observed the following
  behaviour to check wrap-around.
 
  postgres=# \set PROMPT1 '%/[%l]%R%# '
  postgres[18446744073709551613]=# \set PROMPT2 '%/[%l]%R%# '
  postgres[18446744073709551613]=# select
  postgres[18446744073709551614]-# a
  postgres[18446744073709551615]-# ,
  postgres[0]-# b
  postgres[1]-# from
  postgres[2]-# dual;
 
  It is wrapping to 0, whereas line numbers always start with 1. Any issues?
 
  I would like to ignore this, as UINTMAX lines are too many for an input
  buffer to hold. There is almost no chance of hitting this.
 
 
  However, I think you need to use UINT64_FORMAT while printing a uint64
  number. Currently it is %u, which wraps around at UINT_MAX.
  See how pset.lineno is displayed.
 
  Apart from this, I didn't see any issues in my testing.
 
  Patch applies cleanly. make/make install/initdb/make check all are well.
 

 Thank you for reviewing the patch!
 Attached patch is latest version patch.
 I modified the output format of cur_lineno.

 I like the feature - and I wanted to commit it, but enough stuff turned
 up that I needed to fix that it warrants some new testing.

 Stuff I've changed:
 * removed include of limits.h - that probably was a remnant from a
   previous version
 * removed a trailing whitespace
 * expanded the documentation about %l. The current line number isn't
   very clear. Of a file? Of all lines ever entered in psql? It's now
   "The line number inside the current statement, starting from
   <literal>1</literal>".
 * Correspondingly I've changed the variable's name to stmt_lineno.
 * COPY FROM ... STDIN/PROMPT3 was broken because a) the prompt was only 
 generated
   once b) the lineno wasn't incremented.
 * CTRL-C didn't reset the line number.
 *  Unfortunately I've noticed here that the prompting is broken in some
 common cases:

 postgres[1]=# SELECT 1,
 postgres[2]-# '2
 postgres[2]'# 2b
 postgres[2]'# 2c
 postgres[2]'# 2d',
 postgres[3]-# 3;
 ┌──┬──┬──┐
 │ ?column? │ ?column? │ ?column? │
 ├──┼──┼──┤
 │1 │ 2   ↵│3 │
 │  │ 2b  ↵│  │
 │  │ 2c  ↵│  │
 │  │ 2d   │  │
 └──┴──┴──┘
 (1 row)

 postgres[1]=# SELECT 1,
 '2
 2b
 2c
 2d',
 3
 postgres[7]-#

 That's rather inconsistent...


 I've attached my version of the patch. Note that I've got rid of all the
 PSCAN_INCOMPLETE checks... Please test!


Thank you for review comment and improving the patch!
I tested it.
Your patch always increments the line number even if there is no input line,
as follows.

postgres[1]=#
postgres[2]=# select
postgres[3]-# ,
postgres[4]-# from
postgres[5]-# hoge;
ERROR:  syntax error at or near "," at character 8
STATEMENT:  select
,
from
hoge;
ERROR:  syntax error at or near ","
LINE 2: ,
^
Actually the syntax error is in line 2, as postgres reported,
but the prompt numbering is inconsistent with that.
The attached patch resolves the above behavior, based on your version of the patch.

Regards,

---
Sawada Masahiko


psql-line-number_v7.patch
Description: Binary data



Re: [HACKERS] pg_dump refactor patch to remove global variables

2014-08-30 Thread Peter Eisentraut
The general idea of this patch is not disputed, I think.

Some concerns about the details:

- Why is quote_all_identifiers left behind as a global variable?

- Shouldn't pg_dumpall also use DumpOptions?

- What about binary_upgrade?

- I think some of the variables in pg_dump's main() don't need to be
moved into DumpOptions.  For example, compressLevel is only looked at
once in main() and then forgotten.  We don't need to pass that around
everywhere. The same goes for dumpencoding and possibly others.

- The forward declaration of struct _dumpOptions in pg_backup.h seems
kind of useless.  You could move some things around so that that's not
necessary.

- NewDumpOptions() and NewRestoreOptions() are both in
pg_backup_archiver.c, but NewRestoreOptions() is in pg_backup.h whereas
NewDumpOptions() is being put into pg_backup_archiver.h.  None of that
makes too much sense, but it could be made more consistent.

- In dumpOptionsFromRestoreOptions() you are building the return value
in a local variable that is not zeroed.  You should probably use
NewDumpOptions() there.

Also, looking at dumpOptionsFromRestoreOptions() and related code makes
me think that these should perhaps really be the same structure.  Have
you investigated that?
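
For what it's worth, the shape I have in mind is roughly this (a sketch
only; the fields shown are just the ones mentioned above, and the real
code would presumably use pg_malloc0 rather than plain calloc):

#include <stdlib.h>

typedef struct _dumpOptions
{
	int			binary_upgrade;
	int			quote_all_identifiers;
	int			compressLevel;
	char	   *dumpencoding;
	/* ... the remaining former globals ... */
} DumpOptions;

DumpOptions *
NewDumpOptions(void)
{
	/* zero everything so every caller starts from well-defined defaults */
	DumpOptions *opts = calloc(1, sizeof(DumpOptions));

	if (opts != NULL)
		opts->compressLevel = -1;	/* "not specified on the command line" */
	return opts;
}

dumpOptionsFromRestoreOptions() would then start from NewDumpOptions() and
copy fields over, instead of filling in an unzeroed local.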






Re: [HACKERS] improving speed of make check-world

2014-08-30 Thread Peter Eisentraut
Updated, rebased patch.
diff --git a/.gitignore b/.gitignore
index 681af08..823d3ac 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,3 +34,4 @@ lib*.pc
 /pgsql.sln.cache
 /Debug/
 /Release/
+/tmp_install/
diff --git a/GNUmakefile.in b/GNUmakefile.in
index 69e0824..5667943 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -47,6 +47,7 @@ $(call recurse,distprep,doc src config contrib)
 # it's not built by default
 $(call recurse,clean,doc contrib src config)
 clean:
+	rm -rf tmp_install/
 # Garbage from autoconf:
 	@rm -rf autom4te.cache/
 
@@ -61,6 +62,8 @@ distclean maintainer-clean:
 # Garbage from autoconf:
 	@rm -rf autom4te.cache/
 
+check-world: temp-install
+
 check check-tests: all
 
 check check-tests installcheck installcheck-parallel installcheck-tests:
diff --git a/contrib/earthdistance/Makefile b/contrib/earthdistance/Makefile
index 93dcbe3..cde1ae6 100644
--- a/contrib/earthdistance/Makefile
+++ b/contrib/earthdistance/Makefile
@@ -7,7 +7,7 @@ DATA = earthdistance--1.0.sql earthdistance--unpackaged--1.0.sql
 PGFILEDESC = "earthdistance - calculate distances on the surface of the Earth"
 
 REGRESS = earthdistance
-REGRESS_OPTS = --extra-install=contrib/cube
+EXTRA_INSTALL = contrib/cube
 
 LDFLAGS_SL += $(filter -lm, $(LIBS))
 
diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 7bbd2c7..7d493d9 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
 if [ "$1" = '--install' ]; then
 	# use psql from the proper installation directory, which might
 	# be outdated or missing. But don't override anything else that's
 	# already in EXTRA_REGRESS_OPTS.
-	EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --psqldir='$bindir'"
+	EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --bindir='$bindir'"
 	export EXTRA_REGRESS_OPTS
 fi
 
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index d7f32c3..6210ddb 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -39,35 +39,33 @@ submake-test_decoding:
 
 REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact binary prepared
 
-regresscheck: all | submake-regress submake-test_decoding
+regresscheck: all | submake-regress submake-test_decoding temp-install
 	$(MKDIR_P) regression_output
 	$(pg_regress_check) \
 	--temp-config $(top_srcdir)/contrib/test_decoding/logical.conf \
-	--temp-install=./tmp_check \
-	--extra-install=contrib/test_decoding \
+	--temp-instance=./tmp_check \
 	--outputdir=./regression_output \
 	$(REGRESSCHECKS)
 
-regresscheck-install-force: | submake-regress submake-test_decoding
+regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	$(pg_regress_installcheck) \
-	--extra-install=contrib/test_decoding \
 	$(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup concurrent_ddl_dml
 
-isolationcheck: all | submake-isolation submake-test_decoding
+isolationcheck: all | submake-isolation submake-test_decoding temp-install
 	$(MKDIR_P) isolation_output
 	$(pg_isolation_regress_check) \
 	--temp-config $(top_srcdir)/contrib/test_decoding/logical.conf \
-	--extra-install=contrib/test_decoding \
 	--outputdir=./isolation_output \
 	$(ISOLATIONCHECKS)
 
-isolationcheck-install-force: all | submake-isolation submake-test_decoding
+isolationcheck-install-force: all | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_installcheck) \
-	--extra-install=contrib/test_decoding \
 	$(ISOLATIONCHECKS)
 
 PHONY: submake-test_decoding submake-regress check \
 	regresscheck regresscheck-install-force \
 	isolationcheck isolationcheck-install-force
+
+temp-install: EXTRA_INSTALL=contrib/test_decoding
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 0ffc1e8..3238c5c 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -41,6 +41,7 @@ MAJORVERSION = @PG_MAJORVERSION@
 
 # Support for VPATH builds
 vpath_build = @vpath_build@
+abs_top_builddir = @abs_top_builddir@
 abs_top_srcdir = @abs_top_srcdir@
 
 ifneq ($(vpath_build),yes)
@@ -296,6 +297,17 @@ BZIP2	= bzip2
 
 # Testing
 
+check: all temp-install
+
+.PHONY: temp-install
+temp-install:
+ifeq ($(MAKELEVEL),0)
+	rm -rf $(abs_top_builddir)/tmp_install
+	$(MKDIR_P) $(abs_top_builddir)/tmp_install/log
	$(MAKE) -C $(top_builddir) DESTDIR=$(abs_top_builddir)/tmp_install install >$(abs_top_builddir)/tmp_install/log/install.log 2>&1
+endif
	for extra in $(EXTRA_INSTALL); do $(MAKE) -C $(top_builddir)/$$extra DESTDIR=$(abs_top_builddir)/tmp_install install >>$(abs_top_builddir)/tmp_install/log/install.log 2>&1 || exit; done
+
 PROVE = @PROVE@
 PG_PROVE_FLAGS = --ext='.pl' -I $(top_srcdir)/src/test/perl/
 PROVE_FLAGS = --verbose
@@ -310,14 +322,16 @@ define ld_library_path_var
 $(if $(filter $(PORTNAME),darwin),DYLD_LIBRARY_PATH,$(if $(filter $(PORTNAME),aix),LIBPATH,LD_LIBRARY_PATH))
 endef
 
+define with_temp_install
+PATH=$(abs_top_builddir)/tmp_install$(bindir):$$PATH $(call 

[HACKERS] Tips/advice for implementing integrated RESTful HTTP API

2014-08-30 Thread Dobes Vandermeer
A while back I was working on a little proposal to create a RESTful HTTP
front-end for PostgreSQL and recently I had the inspiration to work on
this.  So I successfully created a background worker for PostgreSQL 9.3
that can use the SPI to list off the databases in a JSON response.

WIP on github: https://github.com/dobesv/restgres/

Now I'm getting into murkier waters and I'm wonder if I can get some
helpful tips to guide my R&D here.

1. Connecting to multiple databases

The background workers can apparently only connect to a single database at
a time, but I want to expose all the databases via the API.

I think I could use libpq to connect to PostgreSQL on localhost but this
might have weird side-effects in terms of authentication, pid use, stuff
like that.

I could probably manage a pool of dynamic workers (as of 9.4), one per
user/database combination or something along those lines.  Even one per
request?  Is there some kind of IPC system in place to help shuttle the
requests and responses between dynamic workers?  Or do I need to come up
with my own?

It seems like PostgreSQL itself has a way to shuttle requests out to
workers, is it possible to tap into that system instead?  Basically some
way to send the requests to a PostgreSQL backend from the background worker?

Or perhaps I shouldn't do this as a worker but rather modify PostgreSQL
itself and do it in a more integrated/destructive manner?
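
For 1., roughly what I have in mind is one dynamic worker per database
(just a sketch against the 9.4 dynamic background worker API; the
restgres_* names are placeholders for this project, nothing that exists):

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

static bool
restgres_launch_db_worker(const char *dbname)
{
	BackgroundWorker		worker;
	BackgroundWorkerHandle *handle;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "restgres");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "restgres_db_worker_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "restgres worker: %s", dbname);
	worker.bgw_main_arg = (Datum) 0;	/* pass dbname via shared memory */
	worker.bgw_notify_pid = MyProcPid;	/* so we hear when it has started */

	return RegisterDynamicBackgroundWorker(&worker, &handle);
}

/*
 * Inside restgres_db_worker_main(Datum), after BackgroundWorkerUnblockSignals(),
 * BackgroundWorkerInitializeConnection(dbname, NULL) attaches the worker to
 * the requested database, and normal SPI calls work from there.
 */

The shuttling of requests and responses between the listener and those
workers is the part I'd still need an IPC mechanism for (shm_mq, perhaps?).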

2. Authentication

I was trying to use a function md5_crypt_verify to authenticate the user
using their password, and I believe I am providing the right password but
it's not being accepted.

Any tips on authenticating users in a background worker?   Where should I
be looking for examples?

3. Parallelism

The regular PostgreSQL server can run many queries in parallel, but it
seems like if I am using SPI I could only run one query at a time - it's
not an asynchronous API.

This seems related to the multiple databases issue - either I could use
libpq to translate/forward requests onto PostgreSQL's own worker system or
setup my own little worker pool to run the requests in parallel and have a
way to send the request/response data to/from those workers.
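
For 3., if the worker ends up talking to regular backends over libpq anyway,
the asynchronous libpq calls would let a single worker multiplex several
in-flight queries.  A rough standalone sketch (the connection string and
query are placeholders, and a real worker would wait on the sockets rather
than busy-polling):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres host=localhost");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* queue the query without blocking */
	if (!PQsendQuery(conn, "SELECT datname FROM pg_database"))
		fprintf(stderr, "send failed: %s", PQerrorMessage(conn));

	/* poll until the result is ready; real code would select() on many sockets */
	while (PQisBusy(conn))
	{
		if (!PQconsumeInput(conn))
		{
			fprintf(stderr, "read failed: %s", PQerrorMessage(conn));
			break;
		}
	}

	for (PGresult *res; (res = PQgetResult(conn)) != NULL; PQclear(res))
	{
		for (int i = 0; i < PQntuples(res); i++)
			printf("%s\n", PQgetvalue(res, i, 0));
	}

	PQfinish(conn);
	return 0;
}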



Any help, sage advice, tips, and suggestions how to move forward in these
areas would be muchly appreciated!

Regards,

Dobes


Re: [HACKERS] [v9.5] Custom Plan API

2014-08-30 Thread Kohei KaiGai
2014-08-29 13:33 GMT-04:00 Robert Haas robertmh...@gmail.com:
 On Wed, Aug 27, 2014 at 6:51 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  I'd like to follow this direction, and start stripping the DDL support.

 ...please make it so.

 The attached patch eliminates DDL support.

 Instead of the new CREATE CUSTOM PLAN PROVIDER statement,
 it adds an internal function, register_custom_scan_provider,
 that takes a custom plan provider name and a callback function
 to add an alternative scan path (in the form of a CustomPath)
 while the query planner is finding the cheapest path to
 scan the target relation.
 Also, documentation stuff is revised according to the latest
 design.
 Any other stuff keeps the previous design.

 Comments:

 1. There seems to be no reason for custom plan nodes to have MultiExec
 support; I think this as an area where extensibility is extremely
 unlikely to work out.  The MultiExec mechanism is really only viable
 between closely-cooperating nodes, like Hash and HashJoin, or
 BitmapIndexScan, BitmapAnd, BitmapOr, and BitmapHeapScan; and arguably
 those things could have been written as a single, more complex node.
 Are we really going to want to support a custom plan that can
 substitute for a Hash or BitmapAnd node?  I really doubt that's very
 useful.

This is intended to allow a particular custom-scan provider to exchange
its internal data when multiple custom-scan nodes are stacked.
So it can be considered a facility to implement closely-cooperating nodes,
where both of them are managed by the same custom-scan provider.
An example is a GPU-accelerated version of hash-join that takes an underlying
custom-scan node returning a hash table in a GPU-preferable data structure,
rather than going through the row-by-row interface.
I believe it is valuable for some use cases, even though I couldn't find
such a use case in the ctidscan example.

 2. This patch is still sort of on the fence about whether we're
 implementing custom plans (of any type) or custom scans (thus, of some
 particular relation).  I previously recommended that we confine
 ourselves initially to the task of adding custom *scans* and leave the
 question of other kinds of custom plan nodes to a future patch.  After
 studying the latest patch, I'm inclined to suggest a slightly revised
 strategy.  This patch is really adding THREE kinds of custom objects:
 CustomPlanState, CustomPlan, and CustomPath. CustomPlanState inherits
 from ScanState, so it is not really a generic CustomPlan, but
 specifically a CustomScan; likewise, CustomPlan inherits from Scan,
 and is therefore a CustomScan, not a CustomPlan.  But CustomPath is
 different: it's just a Path.  Even if we only have the hooks to inject
 CustomPaths that are effectively scans at this point, I think that
 part of the infrastructure could be somewhat generic.  Perhaps
 eventually we have CustomPath which can generate either CustomScan or
 CustomJoin which in turn could generate CustomScanState and
 CustomJoinState.

The suggestion seems reasonable to me. The reason why CustomPlanState
inherits ScanState and CustomPlan inherits Scan is just convenience for
the implementation of extensions. Some useful internal APIs, like ExecScan(),
take a ScanState argument, so it was a better strategy to choose
Scan/ScanState instead of the bare Plan/PlanState.
Anyway, I'd like to follow the perspective that treats CustomScan as a
derivative of CustomPath. It is more flexible.

 For now, I propose that we rename CustomPlan and CustomPlanState to
 CustomScan and CustomScanState, because that's what they are; but that
 we leave CustomPath as-is.  For ease of review, I also suggest
 splitting this into a series of three patches: (1) add support for
 CustomPath; (2) add support for CustomScan and CustomScanState; (3)
 ctidscan.

OK, I'll do that.

 3. Is it really a good idea to invoke custom scan providers for RTEs
 of every type?  It's pretty hard to imagine that a custom scan
 provider can do anything useful with, say, RTE_VALUES.  Maybe an
 accelerated scan of RTE_CTE or RTE_SUBQUERY is practical somehow, but
 even that feels like an awfully big stretch.  At least until clear use
 cases emerge, I'd be inclined to restrict this to RTE_RELATION scans
 where rte->relkind != RELKIND_FOREIGN_TABLE; that is, put the logic in
 set_plain_rel_pathlist() rather than set_rel_pathlist().

I'd like to agree. Indeed, it's not easy to imagine a use case for
custom logic on non-plain relations.
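
Concretely, I'm thinking of something like the following sketch;
create_customscan_paths() is only a placeholder for however the provider
callbacks end up being invoked in the next revision of the patch:

#include "postgres.h"
#include "nodes/parsenodes.h"
#include "nodes/relation.h"

/* placeholder for the call into the registered custom-scan providers */
extern void create_customscan_paths(PlannerInfo *root, RelOptInfo *rel,
									RangeTblEntry *rte);

static void
set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
{
	/* ... existing seqscan / index / TID scan path generation ... */

	/*
	 * Being here already implies rte->rtekind == RTE_RELATION and
	 * rte->relkind != RELKIND_FOREIGN_TABLE, so providers only ever see
	 * plain relations.
	 */
	create_customscan_paths(root, rel, rte);
}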

 (We might even want to consider whether the hook in
 set_plain_rel_pathlist() ought to be allowed to inject a non-custom
 plan; e.g. substitute a scan of relation B for a scan of relation A.
 For example, imagine that B contains all rows from A that satisfy some
 predicate. This could even be useful for foreign tables; e.g.
 substitute a scan of a local copy of a foreign table for a reference
 to that table.  But I put all of these ideas in parentheses because
 they're only good ideas to the extent that they don't sidetrack