[HACKERS] COPY and heap_sync
If you insert tuples with COPY into a table created or truncated in the same transaction, heap_sync is called at the end of the COPY. But there are cases where people use COPY in a loop with a small amount of data in each statement. Then heap_sync is called many times, and if NBuffers is large, doing that gets very slow. Could the heap_sync be safely delayed until the end of the transaction, rather than the end of the COPY? Cheers, Jeff
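A toy model of the suggested change (a Python sketch with invented names — this illustrates the idea only, not PostgreSQL internals): instead of syncing at the end of each COPY, record the relation in a pending-sync set and issue one sync per relation at commit.

```python
class Transaction:
    """Toy model: coalesce per-COPY heap_sync calls into one sync
    per relation at commit time (hypothetical sketch, not PostgreSQL code)."""

    def __init__(self):
        self.pending_syncs = set()   # relations awaiting a sync
        self.syncs_performed = 0     # how many physical syncs were issued

    def copy_into_new_table(self, relation, rows):
        # ... load rows without WAL-logging them ...
        # Instead of syncing here (once per COPY), just remember the relation.
        self.pending_syncs.add(relation)

    def commit(self):
        # One sync per relation, no matter how many COPY statements ran.
        for relation in self.pending_syncs:
            self.syncs_performed += 1   # stand-in for heap_sync(relation)
        self.pending_syncs.clear()


txn = Transaction()
for _ in range(1000):                  # COPY in a loop, small batches
    txn.copy_into_new_table("my_table", rows=[])
txn.commit()
print(txn.syncs_performed)             # -> 1 sync instead of 1000
```

The trade-off raised later in the thread is visible in the model: the single commit-time sync concentrates all the I/O at one point instead of spreading it across the COPY statements.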
Re: [HACKERS] Misleading error message in logical decoding for binary plugins
On Fri, Aug 29, 2014 at 11:15 PM, Andres Freund and...@2ndquadrant.com wrote: On top of that, a logical receiver receives data in textual form even if the decoder is marked as binary when getting a message with PQgetCopyData. I have no idea what you mean by that. The data is sent in the format the output plugin writes the data in. Well, that's where we don't understand each other. By reading the docs I understood that by setting output_type to OUTPUT_PLUGIN_BINARY_OUTPUT, all the changes generated by an output plugin would be automatically converted to binary and sent back to a client as binary data, but that's not the case. For example, when using pg_recvlogical on a slot plugged with test_decoding, the output received is not changed and remains like the following even when using force-binary: BEGIN 1000 table public.aa: INSERT: a[integer]:2 COMMIT 1000 Perhaps I didn't understand the docs well, but this is confusing, and it should be clearly specified to the user that output_type only impacts the SQL interface with the peek/get functions (as far as I tested). As things stand now, by reading the docs a user can only learn that output_type can be set to OUTPUT_PLUGIN_BINARY_OUTPUT or OUTPUT_PLUGIN_TEXTUAL_OUTPUT, but there is nothing about what each value means and how it impacts the output format; you need to look at the source code to learn that this parameter is used only for some sanity checks within the logical functions. That's not really user-friendly. Attached is a patch improving the documentation regarding those comments.
Regards,
--
Michael

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index e6880d0..1472a6c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -403,9 +403,17 @@ typedef struct OutputPluginOptions
     OutputPluginOutputType output_type;
 } OutputPluginOptions;
 </programlisting>
-     <literal>output_type</literal> has to either be set to
-     <literal>OUTPUT_PLUGIN_TEXTUAL_OUTPUT</literal>
-     or <literal>OUTPUT_PLUGIN_BINARY_OUTPUT</literal>.
+     <literal>output_type</literal> is the parameter defining the output
+     format of the plugin; it can be one of
+     <literal>OUTPUT_PLUGIN_TEXTUAL_OUTPUT</literal> (support for output
+     format as text) and <literal>OUTPUT_PLUGIN_BINARY_OUTPUT</literal>
+     (support for output format as binary). This parameter influences only
+     the way output changes can be accessed by the set of SQL functions able
+     to query a replication slot. For example, the changes fetched by
+     <xref linkend="app-pgrecvlogical"> are not impacted by this parameter
+     and remain the same, while <function>pg_logical_slot_get_changes</>
+     is not able to query changes using output type
+     <literal>OUTPUT_PLUGIN_BINARY_OUTPUT</literal>.
    </para>
    <para>
     The startup callback should validate the options present in
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index a3a58e6..4f802ad 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -394,14 +394,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool binary)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Check whether the output pluggin writes textual output if that's
+	 * Check whether the output plugin writes textual output if that's
 	 * what we need.
 	 */
 	if (!binary && ctx->options.output_type != OUTPUT_PLUGIN_TEXTUAL_OUTPUT)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("output plugin cannot produce binary output")));
+				 errmsg("output plugin produces binary output but only textual output is accepted")));
 
 	ctx->output_writer_private = p;
diff --git a/src/include/catalog/objectaccess.h b/src/include/catalog/objectaccess.h
index 4fdd056..1e1cb13 100644
--- a/src/include/catalog/objectaccess.h
+++ b/src/include/catalog/objectaccess.h
@@ -13,7 +13,7 @@
 /*
  * Object access hooks are intended to be called just before or just after
  * performing certain actions on a SQL object. This is intended as
- * infrastructure for security or logging pluggins.
+ * infrastructure for security or logging plugins.
  *
  * OAT_POST_CREATE should be invoked just after the object is created.
  * Typically, this is done after inserting the primary catalog records and

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange
On Tue, Aug 26, 2014 at 9:49 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: So my proposal is a bit more complicated. First we introduce the notion of a single number, to enable sorting and computations: the delay equivalent, which is the cost_limit divided by cost_delay. The higher the value is for any table, the faster it is vacuumed. (It makes sense in physical terms: a higher cost_limit makes it faster, because vacuum sleeps less often; and a higher cost_delay makes it go slower, because vacuum sleeps for longer.) Now, the critical issue is to notice that not all tables are equal; they can be split in two groups: those that go faster than the global delay equivalent (i.e. the effective values of the GUC variables autovacuum_vacuum_cost_limit/autovacuum_vacuum_cost_delay), and those that go equal or slower. For the latter group, the rebalancing algorithm distributes the I/O allocated by the global vars in a pro-rated manner. For the former group (tables vacuumed faster than the global delay equiv), to rebalance we don't consider the global delay equiv but the delay equiv of the fastest table currently being vacuumed. Suppose we have two tables, delay_equiv=10 each (which is the default value). If they are both vacuumed in parallel, then we distribute a delay_equiv of 5 to each (so set cost_limit=100, cost_delay=20). As soon as one of them finishes, the remaining one is allowed to upgrade to delay_equiv=10 (cost_limit=200, cost_delay=20). Now add a third table, delay_equiv=500 (cost_limit=10000, cost_delay=20; this is Mark's volatile table). If it's being vacuumed on its own, just assign cost_limit=10000, cost_delay=20, as normal. If one of the other two tables is being vacuumed, that one will use delay_equiv=10, as per above. To balance the volatile table, we take the delay_equiv of this one and subtract the already handed-out delay_equiv of 10; so we set the volatile table to delay_equiv=490 (cost_limit=9800, cost_delay=20).
If we do it this way, the whole system runs at the full speed enabled by the fastest table for which we have set per-table options, but we have also scaled things so that the slow tables go slow and the fast tables go fast. As a more elaborate example, add a fourth table with delay_equiv=50 (cost_limit=1000, cost_delay=20). This is also faster than the global vars, so we put it in the first group. If all four tables are being vacuumed in parallel, we have the two slow tables going at delay_equiv=5 each (cost_limit=100, cost_delay=20); then there is delay_equiv=490 to distribute among the remaining ones; pro-rating this we have delay_equiv=445 (cost_limit=8900, cost_delay=20) for the volatile table and delay_equiv=45 (cost_limit=900, cost_delay=20) for the other one. How will this calculation behave if the third table has delay_equiv = 30 and the fourth table has delay_equiv = 20? Both are greater than the default delay_equiv = 10, so they will participate in the fast group. As per my understanding of the above calculation, both might get the same delay_equiv, but I might be wrong, because your patch still has a FixMe and I haven't yet fully understood its code. In general, I have a feeling that distributing vacuuming speed is a good way to tune the system; however, if the user wants to override that by providing specific values for particular tables, we should honour that setting. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
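The two-group rebalancing described above can be modeled as follows (a simplified Python sketch of the proposal, not the actual patch; the function name, dict representation, and the use of round() for pro-rating are my assumptions):

```python
def rebalance(tables, global_equiv=10):
    """tables: dict name -> delay_equiv (cost_limit / cost_delay).
    Returns dict name -> delay_equiv granted while all run in parallel."""
    slow = {n: e for n, e in tables.items() if e <= global_equiv}
    fast = {n: e for n, e in tables.items() if e > global_equiv}

    granted = {}
    # Slow group: pro-rate the global delay equivalent.
    total_slow = sum(slow.values())
    for n, e in slow.items():
        granted[n] = round(global_equiv * e / total_slow)

    # Fast group: the budget is the fastest table's delay equivalent,
    # minus what the slow group already received, pro-rated.
    if fast:
        budget = max(fast.values()) - sum(granted.values())
        total_fast = sum(fast.values())
        for n, e in fast.items():
            granted[n] = round(budget * e / total_fast)
    return granted

# The four-table example from the mail: two default tables, one
# volatile table (delay_equiv=500) and one at delay_equiv=50.
print(rebalance({"t1": 10, "t2": 10, "volatile": 500, "t4": 50}))
# -> {'t1': 5, 't2': 5, 'volatile': 445, 't4': 45}
```

Running it on the two-table case gives 5 each, and on the volatile table alone gives 500, matching the worked examples in the mail.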
Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange
On Thu, Aug 28, 2014 at 11:06 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Robert Haas wrote: Now, in the case where you are setting an overall limit, there is at least an argument to be made that you can determine the overall rate of autovacuum-induced I/O activity that the system can tolerate, and set your limits to stay within that budget, and then let the system decide how to divide that I/O up between workers. But if you're overriding a per-table limit, I don't really see how that holds any water. The system I/O budget doesn't go up just because one particular table is being vacuumed rather than any other. The only plausible use case for setting a per-table rate that I can see is when you actually want the system to use that exact rate for that particular table. Yeah, this makes sense to me too -- at least as long as you only have one such table. But if you happen to have more than one, and due to some bad luck they happen to be vacuumed concurrently, they will eat a larger share of your I/O bandwidth budget than you anticipated, which you might not like. I think that to control I/O bandwidth there should be a separate mechanism (maybe similar to what Simon proposed for WAL rate limiting), rather than changing user-specified values internally when the user might specifically want those values to be used. That would give more predictable results, which the user can control. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] COPY and heap_sync
On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes jeff.ja...@gmail.com wrote: If you insert tuples with COPY into a table created or truncated in the same transaction, heap_sync is called at the end of the COPY. But there are cases where people use COPY in a loop with a small amount of data in each statement. Then heap_sync is called many times, and if NBuffers is large, doing that gets very slow. Could the heap_sync be safely delayed until the end of the transaction, rather than the end of the COPY? Wouldn't unconditionally delaying the sync until the end of the transaction lead to a burst of I/O at that time, especially if there are many such COPY commands in a transaction, delaying other operations that might be happening concurrently in the system? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] COPY and heap_sync
On Saturday, August 30, 2014, Amit Kapila amit.kapil...@gmail.com wrote: On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes jeff.ja...@gmail.com wrote: If you insert tuples with COPY into a table created or truncated in the same transaction, heap_sync is called at the end of the COPY. But there are cases where people use COPY in a loop with a small amount of data in each statement. Then heap_sync is called many times, and if NBuffers is large, doing that gets very slow. Could the heap_sync be safely delayed until the end of the transaction, rather than the end of the COPY? Wouldn't unconditionally delaying the sync until the end of the transaction lead to a burst of I/O at that time, especially if there are many such COPY commands in a transaction, delaying other operations that might be happening concurrently in the system? I agree with that, but it can provide the same benefits as group commit, especially when most of the COPY commands touch pages that are nearby, hence reducing seek-time overhead. We could look at making it optional through a GUC, since it is useful albeit for some specific use cases. -- Regards, Atri *l'apprenant*
[HACKERS] Make LWLockAcquireCommon() inline?
Hi, when profiling optimized builds (linux, gcc 4.9) it's currently LWLockAcquireCommon() showing up, not its callers. Instruction-level profiles show that the tests for valptr show up in profiles to some extent. Since most callers don't need the valptr logic, it seems prudent to mark the function inline, which will then eliminate the superfluous branches. Arguments against? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] delta relations in AFTER triggers
Kevin Grittner kgri...@ymail.com wrote: Kevin Grittner kgri...@ymail.com wrote: executor-tuplestore-relations covers parse analysis, planner/optimizer, and executor layers. 30 files changed, 786 insertions(+), 9 deletions(-) Testing and further review found a few places that needed to add lines for the new RTE kind that I had missed. Delta patch attached. 7 files changed, 58 insertions(+)
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 799242b..2e61edf 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2250,6 +2250,9 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 				APP_JUMB_STRING(rte->ctename);
 				APP_JUMB(rte->ctelevelsup);
 				break;
+			case RTE_TUPLESTORE:
+				APP_JUMB_STRING(rte->tsrname);
+				break;
 			default:
 				elog(ERROR, "unrecognized RTE kind: %d", (int) rte->rtekind);
 				break;
@@ -2638,6 +2641,13 @@ JumbleExpr(pgssJumbleState *jstate, Node *node)
 				JumbleQuery(jstate, (Query *) cte->ctequery);
 			}
 			break;
+		case T_TuplestoreRelation:
+			{
+				TuplestoreRelation *tsr = (TuplestoreRelation *) node;
+
+				APP_JUMB_STRING(tsr->name);
+			}
+			break;
 		case T_SetOperationStmt:
 			{
 				SetOperationStmt *setop = (SetOperationStmt *) node;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..de31026 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -717,6 +717,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 		case T_FunctionScan:
 		case T_ValuesScan:
 		case T_CteScan:
+		case T_TuplestoreScan:
 		case T_WorkTableScan:
 		case T_ForeignScan:
 			*rels_used = bms_add_member(*rels_used,
@@ -930,6 +931,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_CteScan:
 			pname = sname = "CTE Scan";
 			break;
+		case T_TuplestoreScan:
+			pname = sname = "Tuplestore Scan";
+			break;
 		case T_WorkTableScan:
 			pname = sname = "WorkTable Scan";
 			break;
@@ -1300,6 +1304,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
+		case T_TuplestoreScan:
 		case T_WorkTableScan:
 		case T_SubqueryScan:
 			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
@@ -2160,6 +2165,11 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 			objectname = rte->ctename;
 			objecttag = "CTE Name";
 			break;
+		case T_TuplestoreScan:
+			Assert(rte->rtekind == RTE_TUPLESTORE);
+			objectname = rte->tsrname;
+			objecttag = "Tuplestore Name";
+			break;
 		case T_WorkTableScan:
 			/* Assert it's on a self-reference CTE */
 			Assert(rte->rtekind == RTE_CTE);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 93f3905..255348b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2442,6 +2442,13 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
 			WRITE_NODE_FIELD(ctecoltypmods);
 			WRITE_NODE_FIELD(ctecolcollations);
 			break;
+		case RTE_TUPLESTORE:
+			WRITE_STRING_FIELD(tsrname);
+			WRITE_OID_FIELD(relid);
+			WRITE_NODE_FIELD(ctecoltypes);
+			WRITE_NODE_FIELD(ctecoltypmods);
+			WRITE_NODE_FIELD(ctecolcollations);
+			break;
 		default:
 			elog(ERROR, "unrecognized RTE kind: %d", (int) node->rtekind);
 			break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 9f7f322..e11a116 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -287,6 +287,10 @@ print_rt(const List *rtable)
 				printf("%d\t%s\t[cte]",
 					   i, rte->eref->aliasname);
 				break;
+			case RTE_TUPLESTORE:
+				printf("%d\t%s\t[tuplestore]",
+					   i, rte->eref->aliasname);
+				break;
 			default:
 				printf("%d\t%s\t[unknown rtekind]",
 					   i, rte->eref->aliasname);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69d9989..bcaeeb0 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1240,6 +1240,13 @@ _readRangeTblEntry(void)
 			READ_NODE_FIELD(ctecoltypmods);
 			READ_NODE_FIELD(ctecolcollations);
 			break;
+		case RTE_TUPLESTORE:
+			READ_STRING_FIELD(tsrname);
+			READ_OID_FIELD(relid);
+			READ_NODE_FIELD(ctecoltypes);
+			READ_NODE_FIELD(ctecoltypmods);
+			READ_NODE_FIELD(ctecolcollations);
+			break;
 		default:
 			elog(ERROR, "unrecognized RTE kind: %d",
 				 (int) local_node->rtekind);
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index fb6c44c..518691a 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -2473,6 +2473,15 @@ transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
 							LCS_asString(lc->strength)),
 					 parser_errposition(pstate, thisrel->location)));
 			break;
+		case RTE_TUPLESTORE:
+			ereport(ERROR,
+
Re: [HACKERS] pgbench throttling latency limit
Hi, I generally want to say that having a feature like this feels *very* helpful to me. Lots of pg development hasn't really paid attention to anything but the final pgbench results... On 2014-08-29 19:48:43 +0200, Fabien COELHO wrote:

+	if (latency_limit)
+		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "\n",
+			   latency_limit / 1000.0, latency_late);
+

Any reason not to report a percentage here? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pgbench throttling latency limit
+	if (latency_limit)
+		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "\n",
+			   latency_limit / 1000.0, latency_late);
+

Any reason not to report a percentage here? Yes: I did not think of it. Here is a v7, with a percentage. I also added a paragraph in the documentation about how the latency is computed under throttling, and I tried to reorder the reported items so that they are more logical.
--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..c102ad7 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@ double sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */
+int64		latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +250,8 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
+	int64		latency_late;	/* late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +264,8 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
+	int64		latency_late;
 } TResult;
 
 /*
@@ -367,6 +383,10 @@ usage(void)
 		   "  -f, --file=FILENAME      read transaction script from FILENAME\n"
 		   "  -j, --jobs=NUM           number of threads (default: 1)\n"
 		   "  -l, --log                write transaction times to log file\n"
+		   "  -L, --limit=NUM          count transactions lasting more than NUM ms.\n"
+		   "                           under throttling (--rate), transactions behind schedule\n"
+		   "                           more than NUM ms are skipped, and those finishing more\n"
+		   "                           than NUM ms after their scheduled start are counted.\n"
 		   "  -M, --protocol=simple|extended|prepared\n"
 		   "                           protocol for submitting queries (default: simple)\n"
 		   "  -n, --no-vacuum          do not run VACUUM before tests\n"
@@ -1016,6 +1036,24 @@ top:
 			thread->throttle_trigger += wait;
 
+			if (latency_limit)
+			{
+				instr_time	now;
+				int64		now_us;
+				INSTR_TIME_SET_CURRENT(now);
+				now_us = INSTR_TIME_GET_MICROSEC(now);
+				while (thread->throttle_trigger < now_us - latency_limit)
+				{
+					/* if too far behind, this slot is skipped, and we
+					 * iterate till the next nearly on time slot.
+					 */
+					int64 wait = (int64) (throttle_delay *
+						1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+					thread->throttle_trigger += wait;
+					thread->throttle_latency_skipped++;
+				}
+			}
+
 			st->until = thread->throttle_trigger;
 			st->sleeping = 1;
 			st->throttling = true;
@@ -1080,15 +1118,31 @@ top:
 			thread->exec_count[cnum]++;
 		}
 
-		/* transaction finished: record latency under progress or throttling */
-		if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+		/* transaction finished: record latency under progress or throttling,
+		 * or if latency limit is set
+		 */
+		if ((progress || throttle_delay || latency_limit) &&
+			commands[st->state + 1] == NULL)
 		{
 			instr_time	diff;
 			int64		latency;
 
 			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
-			latency = INSTR_TIME_GET_MICROSEC(diff);
+
+			if (throttle_delay)
+			{
+				/* Under throttling, compute latency wrt the initial schedule,
+				 * not the actual transaction start.
+				 */
+				int64 now = INSTR_TIME_GET_MICROSEC(diff);
+				latency = now - thread->throttle_trigger;
+			}
+			else
+			{
+				INSTR_TIME_SUBTRACT(diff, st->txn_begin);
+				latency = INSTR_TIME_GET_MICROSEC(diff);
+			}
+
 			st->txn_latencies += latency;
 
 			/*
@@ -1099,6 +1153,11 @@ top:
 			 * would take 256 hours.
 			 */
 			st->txn_sqlats += latency * latency;
+
+			/* record over the limit transactions if needed.
+			 */
+			if (latency_limit && latency > latency_limit)
+				thread->latency_late++;
 		}
 
 		/*
@@ -1294,7 +1353,7 @@ top:
 		}
 
 		/* Record transaction start time under logging, progress or throttling */
-		if ((logfile || progress || throttle_delay) && st->state == 0)
+		if ((logfile || progress || throttle_delay || latency_limit) && st->state == 0)
 			INSTR_TIME_SET_CURRENT(st->txn_begin);
 
 		/* Record statement start time if per-command latencies are requested */
@@ -2351,7 +2410,8 @@
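The scheduling logic in the patch — Poisson-spaced transaction slots, where slots more than the limit behind "now" are skipped and counted rather than started late — can be modeled like this (a simplified Python re-implementation for illustration; function names and the RNG range clamp are my assumptions, the real code lives in pgbench.c):

```python
import math
import random

def next_wait(throttle_delay, rng):
    # Exponentially distributed inter-arrival time, as in pgbench's
    # --rate throttling (mean = throttle_delay microseconds).
    return throttle_delay * 1.00055271703 * -math.log(rng.uniform(1e-4, 1.0))

def advance_schedule(trigger, now_us, throttle_delay, latency_limit, rng):
    """Advance the scheduled start time past any slots that are already
    more than latency_limit behind now; count those slots as skipped."""
    skipped = 0
    trigger += next_wait(throttle_delay, rng)
    while trigger < now_us - latency_limit:
        trigger += next_wait(throttle_delay, rng)
        skipped += 1
    return trigger, skipped

rng = random.Random(42)
# Pretend the client stalled: "now" is 1 second past the last trigger,
# with a 100 ms latency limit and 10 ms average transaction spacing.
trigger, skipped = advance_schedule(0, 1_000_000, 10_000, 100_000, rng)
assert trigger >= 1_000_000 - 100_000   # schedule caught up to within the limit
assert skipped > 0                      # stale slots were abandoned, not run late
```

This shows why the measured latency is taken against the schedule (throttle_trigger) rather than the actual start: a transaction that starts late because the client fell behind is still "late" from the user's point of view.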
Re: [HACKERS] [PATCH] empty xml values
While looking into this report http://www.postgresql.org/message-id/cf48ccfb.65a9d%tim.k...@gmail.com I noticed that we don't accept empty values as xml content values, even though this should apparently be allowed by the spec. Attached is a patch to fix it (which needs updates to xml_1.out, which I omit here for simplicity). The patch works, albeit it must be applied with --ignore-whitespace. Attached are the patch + xml_1.out updates. I'm marking this ready for committer.
--
Ali Akbar

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 422be69..7abe215 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -1400,11 +1400,14 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
 			doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
 			doc->standalone = standalone;
 
-			res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
-												   utf8string + count, NULL);
-			if (res_code != 0 || xmlerrcxt->err_occurred)
-				xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_CONTENT,
-							"invalid XML content");
+			if (*(utf8string + count))
+			{
+				res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
+													   utf8string + count, NULL);
+				if (res_code != 0 || xmlerrcxt->err_occurred)
+					xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_CONTENT,
+								"invalid XML content");
+			}
 		}
 	}
 	PG_CATCH();
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index 382f9df..6e6c673 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -194,6 +194,18 @@ SELECT xmlelement(name foo, xmlattributes( as funny, xml 'ba/r' as fun
  foo funny=lt;gt;amp;quot;' funnier=blt;a/gt;r/
 (1 row)
 
+SELECT xmlparse(content '');
+ xmlparse 
+----------
+ 
+(1 row)
+
+SELECT xmlparse(content ' ');
+ xmlparse 
+----------
+  
+(1 row)
+
 SELECT xmlparse(content 'abc');
  xmlparse 
 ----------
 abc
(1 row)
@@ -251,6 +263,22 @@
 SELECT xmlparse(content 'nosuchprefix:tag/');
      xmlparse      
 -------------------
 nosuchprefix:tag/
(1 row)
 
+SELECT xmlparse(document '');
+ERROR:  invalid XML document
+DETAIL:  line 1: switching encoding : no input
+
+^
+line 1: Document is empty
+
+^
+line 1: Start tag expected, '<' not found
+
+^
+SELECT xmlparse(document ' ');
+ERROR:  invalid XML document
+DETAIL:  line 1: Start tag expected, '<' not found
+ 
+ ^
 SELECT xmlparse(document 'abc');
 ERROR:  invalid XML document
 DETAIL:  line 1: Start tag expected, '<' not found
diff --git a/src/test/regress/expected/xml_1.out b/src/test/regress/expected/xml_1.out
index a34d1f4..b0e0067 100644
--- a/src/test/regress/expected/xml_1.out
+++ b/src/test/regress/expected/xml_1.out
@@ -164,6 +164,14 @@ SELECT xmlelement(name foo, xmlattributes( as funny, xml 'ba/r' as fun
 ERROR:  unsupported XML feature
 DETAIL:  This functionality requires the server to be built with libxml support.
 HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xmlparse(content '');
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xmlparse(content ' ');
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
 SELECT xmlparse(content 'abc');
 ERROR:  unsupported XML feature
 DETAIL:  This functionality requires the server to be built with libxml support.
@@ -196,6 +204,14 @@ SELECT xmlparse(content 'nosuchprefix:tag/');
 ERROR:  unsupported XML feature
 DETAIL:  This functionality requires the server to be built with libxml support.
 HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xmlparse(document '');
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+SELECT xmlparse(document ' ');
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
 SELECT xmlparse(document 'abc');
 ERROR:  unsupported XML feature
 DETAIL:  This functionality requires the server to be built with libxml support.
diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql
index 90d4d67..922ab7a 100644
--- a/src/test/regress/sql/xml.sql
+++ b/src/test/regress/sql/xml.sql
@@ -60,6 +60,8 @@ SELECT xmlelement(name foo, xmlattributes('infinity'::timestamp as bar));
 
 SELECT xmlelement(name foo, xmlattributes( as funny, xml 'ba/r' as funnier));
 
+SELECT xmlparse(content '');
+SELECT xmlparse(content ' ');
 SELECT xmlparse(content 'abc');
 SELECT xmlparse(content 'abcx/abc');
 SELECT xmlparse(content 'invalidentity/invalidentity');
@@ -69,6 +71,8 @@ SELECT xmlparse(content 'relativens xmlns=''relative''/');
 
 SELECT xmlparse(content
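The distinction the patch implements — an empty or whitespace-only string is valid as XML *content* but not as a *document* — can be illustrated with Python's standard library (an analogy using xml.etree as a stand-in for libxml2, not the patched code path itself):

```python
import xml.etree.ElementTree as ET

def is_valid_document(s):
    """A document needs exactly one root element; '' or '  ' fail."""
    try:
        ET.fromstring(s)
        return True
    except ET.ParseError:
        return False

def is_valid_content(s):
    # Content is any sequence of character data and elements, so the
    # empty string qualifies; mimic that by wrapping in a dummy root.
    # (Crude sketch: it mishandles strings containing e.g. "]]>".)
    return is_valid_document("<r>" + s + "</r>")

assert is_valid_content("")        # accepted after the patch
assert is_valid_content(" ")
assert not is_valid_document("")   # still an error, matching the spec
assert not is_valid_document(" ")
```

This mirrors the patch exactly: xml_parse skips xmlParseBalancedChunkMemory when the content string is empty, while the document path still rejects it.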
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote: A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp). The performance impact of that was inconclusive, but one thing that it allows nicely is to interleave the fsyncs, so that you write all the buffers for one file, then fsync it, then next file and so on. IIRC the biggest worry with that patch was that sorting the buffers requires a fairly large amount of memory, and making a large allocation in the checkpointer might cause an out-of-memory, which would be bad. I don't think anyone's seriously worked on this area since. If the impact on responsiveness or performance is significant, I'm pretty sure the OOM problem could be alleviated somehow. I've dug up that patch (after a bit of fighting with the archives) and refreshed it. It's *clearly* beneficial: master: andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1 starting vacuum...end. progress: 1.0 s, 2847.6 tps, lat 53.862 ms stddev 49.219 ... progress: 67.0 s, 3435.4 tps, lat 52.920 ms stddev 48.720 progress: 68.2 s, 2586.9 tps, lat 57.793 ms stddev 64.228 progress: 69.1 s, 546.5 tps, lat 294.940 ms stddev 189.546 progress: 70.0 s, 1741.3 tps, lat 134.298 ms stddev 204.740 progress: 71.0 s, 3868.8 tps, lat 48.423 ms stddev 47.934 .. 
progress: 89.0 s, 4022.8 tps, lat 45.601 ms stddev 40.685 progress: 90.0 s, 2463.5 tps, lat 61.907 ms stddev 64.342 progress: 91.2 s, 856.3 tps, lat 211.610 ms stddev 149.916 progress: 92.0 s, 1026.9 tps, lat 177.103 ms stddev 144.448 progress: 93.0 s, 736.9 tps, lat 254.230 ms stddev 227.339 progress: 94.1 s, 766.9 tps, lat 208.031 ms stddev 181.340 progress: 95.1 s, 979.7 tps, lat 197.014 ms stddev 193.648 progress: 96.0 s, 868.9 tps, lat 214.060 ms stddev 198.297 progress: 97.1 s, 943.4 tps, lat 178.062 ms stddev 143.224 progress: 98.0 s, 934.5 tps, lat 192.435 ms stddev 197.901 progress: 99.6 s, 623.1 tps, lat 202.954 ms stddev 165.576 progress: 100.0 s, 464.7 tps, lat 683.600 ms stddev 376.520 progress: 101.1 s, 516.0 tps, lat 395.033 ms stddev 480.346 progress: 102.0 s, 364.9 tps, lat 507.933 ms stddev 499.670 progress: 103.3 s, 592.9 tps, lat 214.123 ms stddev 273.411 progress: 104.1 s, 930.2 tps, lat 316.487 ms stddev 335.096 progress: 105.4 s, 627.6 tps, lat 262.496 ms stddev 200.690 progress: 106.1 s, 788.6 tps, lat 235.510 ms stddev 202.366 progress: 107.5 s, 644.8 tps, lat 269.020 ms stddev 223.900 progress: 108.0 s, 725.0 tps, lat 262.692 ms stddev 218.774 progress: 109.0 s, 660.0 tps, lat 272.808 ms stddev 248.501 progress: 110.0 s, 604.3 tps, lat 303.727 ms stddev 264.921 progress: 111.0 s, 723.6 tps, lat 243.224 ms stddev 229.426 progress: 112.1 s, 668.6 tps, lat 257.026 ms stddev 190.247 progress: 113.1 s, 345.0 tps, lat 492.114 ms stddev 312.082 progress: 115.4 s, 390.9 tps, lat 416.708 ms stddev 391.577 progress: 115.4 s, 14598.5 tps, lat 551.617 ms stddev 539.611 progress: 116.1 s, 161.5 tps, lat 741.611 ms stddev 485.498 progress: 117.4 s, 269.1 tps, lat 697.978 ms stddev 576.970 progress: 118.8 s, 262.3 tps, lat 674.887 ms stddev 587.848 progress: 119.1 s, 195.2 tps, lat 833.959 ms stddev 733.592 progress: 120.0 s, 3000.6 tps, lat 104.272 ms stddev 291.851 progress: 121.0 s, 3167.7 tps, lat 56.576 ms stddev 51.976 progress: 122.0 s, 3398.1 
tps, lat 51.322 ms stddev 48.057 progress: 123.0 s, 3721.9 tps, lat 50.355 ms stddev 46.994 progress: 124.0 s, 2929.3 tps, lat 50.996 ms stddev 45.553 progress: 125.0 s, 754.5 tps, lat 269.293 ms stddev 287.508 progress: 126.0 s, 3297.0 tps, lat 56.912 ms stddev 77.053 ... progress: 144.0 s, 4207.9 tps, lat 44.440 ms stddev 37.210 progress: 145.9 s, 2036.4 tps, lat 79.025 ms stddev 105.411 progress: 146.0 s, 1003.1 tps, lat 292.934 ms stddev 223.650 progress: 147.4 s, 520.8 tps, lat 318.670 ms stddev 244.596 progress: 148.0 s, 3557.3 tps, lat 71.626 ms stddev 143.174 progress: 149.0 s, 4106.1 tps, lat 43.557 ms stddev 36.444 progress: 150.0 s, 4132.3 tps, lat 43.185 ms stddev 34.611 progress: 151.0 s, 4233.3 tps, lat 43.239 ms stddev 39.121 progress: 152.0 s, 4178.2 tps, lat 43.242 ms stddev 40.377 progress: 153.0 s, 755.1 tps, lat 198.560 ms stddev 155.927 progress: 154.1 s, 773.6 tps, lat 240.044 ms stddev 192.472 progress: 155.0 s, 553.7 tps, lat 303.532 ms stddev 245.491 progress: 156.2 s, 772.7 tps, lat 242.925 ms stddev 226.754 progress: 157.1 s, 541.0 tps, lat 295.132 ms stddev 218.857 progress: 158.1 s, 716.8 tps, lat 281.823 ms stddev 227.488 progress: 159.1 s, 748.7 tps, lat 223.275 ms stddev 186.162 progress: 160.0 s, 503.0 tps, lat 311.621 ms stddev 215.952 progress: 161.1 s, 905.0 tps, lat 239.623 ms stddev 245.539 progress: 162.4 s, 360.4 tps, lat 329.583 ms stddev 250.094 progress: 163.3 s, 348.9 tps, lat 583.476 ms stddev 432.200 progress: 165.5 s, 186.1 tps, lat 765.542 ms stddev
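The idea being benchmarked above — sort the dirty buffers so each file is written sequentially and fsynced immediately afterwards, instead of writing everything and fsyncing all files at the end — can be sketched as follows (an illustrative Python model, not Itagaki's or Andres's actual patch):

```python
from collections import defaultdict

def checkpoint(dirty_buffers, write, fsync):
    """dirty_buffers: iterable of (file, block_no) pairs in arbitrary order.
    Group by file, write each file's blocks in block order, then fsync
    that file right away, interleaving the fsyncs with the writes."""
    by_file = defaultdict(list)
    for f, block in dirty_buffers:
        by_file[f].append(block)
    for f in sorted(by_file):
        for block in sorted(by_file[f]):   # sequential writes per file
            write(f, block)
        fsync(f)                           # interleaved fsync per file

ops = []
checkpoint([("b", 2), ("a", 9), ("a", 1), ("b", 0)],
           write=lambda f, blk: ops.append(("write", f, blk)),
           fsync=lambda f: ops.append(("fsync", f)))
print(ops)
# -> [('write', 'a', 1), ('write', 'a', 9), ('fsync', 'a'),
#     ('write', 'b', 0), ('write', 'b', 2), ('fsync', 'b')]
```

The memory concern from the quoted mail is visible in the model: the grouping structure must hold an entry per dirty buffer, which for a large shared_buffers is a sizable allocation in the checkpointer.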
Re: [HACKERS] Built-in binning functions
Pavel Stehule pavel.steh...@gmail.com writes: 1. I think the restriction to only numeric types is not necessary - although we can live without lifting it - there are lots of non-numeric categories: chars, dates, ...

I wasn't terribly happy about that either. I still think we should reduce this to a single polymorphic function, as in the attached.

2. I am still quite worried about the search method used - it is not possible to check the validity of the input. Did you check how much slower linear searching is? You said you have real data and a real use case.

Well, if we check the input then we'll be doing O(N) comparisons instead of O(log N). That could be a serious cost penalty for large arrays and expensive comparison functions (such as strcoll()). I think it's probably sufficient to have a clear warning in the docs.

3. I am thinking about the name - I don't think varwidth_bucket is correct, in relation to the name width_bucket ... what about range_bucket or scope_bucket?

Neither of those seem like improvements from here. I agree with the objections to bin() as well. bucket() might be all right, but it still seems a bit too generic. width_bucket(), which at least shows there's a relationship to the standard functions, still seems like the best of the suggestions so far. It occurs to me that there would be an advantage to using some other name besides width_bucket: we could then mark the function as variadic, which might be a notational advantage in some usages. (We can't do that if we call it width_bucket, because the four-parameter case would be ambiguous with the existing functions.) I'm not sure that this is important enough to justify changing the name, though, especially if we can't come up with a good name. Also, doing that would put a very large premium on picking a non-generic name, else we'd risk creating ambiguities against user-defined functions.
Also, a documentation quibble: in Petr's patch the documentation and comments refer to the thresholds as right bounds, which I didn't care for and replaced with upper bounds. However, looking at it again, I wonder if we would not be better off describing them as lower bounds. They are exclusive bounds if seen as upper bounds, and inclusive if seen as lower bounds, and on the whole I think the latter viewpoint might be less confusing. Thoughts?

Proposed patch with 1 polymorphic function attached.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 722640b..93cf1e7 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 901,925 ****
       <indexterm>
        <primary>width_bucket</primary>
       </indexterm>
!      <literal><function>width_bucket(<parameter>op</parameter> <type>numeric</type>, <parameter>b1</parameter> <type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter> <type>int</type>)</function></literal>
      </entry>
      <entry><type>int</type></entry>
      <entry>return the bucket to which <parameter>operand</> would be assigned in an equidepth histogram with <parameter>count</>
!      buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
      <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
      <entry><literal>3</literal></entry>
     </row>

     <row>
!      <entry><literal><function>width_bucket(<parameter>op</parameter> <type>dp</type>, <parameter>b1</parameter> <type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter> <type>int</type>)</function></literal></entry>
      <entry><type>int</type></entry>
      <entry>return the bucket to which <parameter>operand</> would be assigned in an equidepth histogram with <parameter>count</>
!      buckets, in the range <parameter>b1</> to <parameter>b2</></entry>
      <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
      <entry><literal>3</literal></entry>
     </row>
    </tbody>
   </tgroup>
  </table>
--- 901,936 ----
       <indexterm>
        <primary>width_bucket</primary>
       </indexterm>
!      <literal><function>width_bucket(<parameter>operand</parameter> <type>numeric</type>, <parameter>b1</parameter> <type>numeric</type>, <parameter>b2</parameter> <type>numeric</type>, <parameter>count</parameter> <type>int</type>)</function></literal>
      </entry>
      <entry><type>int</type></entry>
      <entry>return the bucket to which <parameter>operand</> would be assigned in an equidepth histogram with <parameter>count</>
!      buckets spanning the range <parameter>b1</> to <parameter>b2</></entry>
      <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
      <entry><literal>3</literal></entry>
     </row>

     <row>
!      <entry><literal><function>width_bucket(<parameter>operand</parameter> <type>dp</type>, <parameter>b1</parameter> <type>dp</type>, <parameter>b2</parameter> <type>dp</type>, <parameter>count</parameter> <type>int</type>)</function></literal></entry>
      <entry><type>int</type></entry>
      <entry>return the bucket to which <parameter>operand</> would be assigned in an equidepth histogram with
Re: [HACKERS] postgresql latency bgwriter not doing its job
Andres Freund and...@2ndquadrant.com writes: On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote: A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp). The performance impact of that was inconclusive, but one thing that it allows nicely is to interleave the fsyncs, so that you write all the buffers for one file, then fsync it, then the next file, and so on.

... So, *very* clearly sorting is a benefit.

pg_bench alone doesn't convince me on this. The original thread found cases where it was a loss, IIRC; you will need to test many more than one scenario to prove the point. Also, it does not matter how good it looks in test cases if it causes outright failures due to OOM; unlike you, I am not prepared to just wave away that risk. A possible compromise is to sort a limited number of buffers -- say, collect a few thousand dirty buffers, then sort, dump and fsync them, repeat as needed.

regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 2014-08-30 13:50:40 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote: A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp). The performance impact of that was inconclusive, but one thing that it allows nicely is to interleave the fsyncs, so that you write all the buffers for one file, then fsync it, then the next file, and so on.

... So, *very* clearly sorting is a benefit.

pg_bench alone doesn't convince me on this. The original thread found cases where it was a loss, IIRC; you will need to test many more than one scenario to prove the point.

Sure. And I'm not claiming Itagaki's/your old patch is immediately going to be ready for commit. But our checkpoint performance has sucked for years in the field. I don't think we can wave that away. I think the primary reason it wasn't easily visible as being beneficial back then was that only the throughput, not the latency and such, was looked at.

Also, it does not matter how good it looks in test cases if it causes outright failures due to OOM; unlike you, I am not prepared to just wave away that risk.

I'm not waving away any risks. If the sort buffer is allocated when the checkpointer is started, not every time we sort, as you've done in your version of the patch, I think that risk is pretty manageable. If we really want to be sure nothing is happening at runtime, even if the checkpointer was restarted, we can put the sort array in shared memory. We're talking about (sizeof(BufferTag) + sizeof(int))/8192 ~= 0.3% overhead over shared_buffers here. If we decide to go that way, it's a pretty darn small price to pay to not regularly have stalls that last minutes.

A possible compromise is to sort a limited number of buffers -- say, collect a few thousand dirty buffers, then sort, dump and fsync them, repeat as needed.
Yea, that's what I suggested nearby. But I don't really like it, because it robs us of the chance to fsync() a relfilenode immediately after having synced all its buffers. Doing so is rather beneficial because then fewer independently dirtied pages have to be flushed out - reducing the impact of the fsync().

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] postgresql latency bgwriter not doing its job
Andres Freund and...@2ndquadrant.com writes: On 2014-08-30 13:50:40 -0400, Tom Lane wrote: A possible compromise is to sort a limited number of buffers -- say, collect a few thousand dirty buffers, then sort, dump and fsync them, repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because it robs us of the chance to fsync() a relfilenode immediately after having synced all its buffers.

Uh, how so exactly? You could still do that. Yeah, you might fsync a rel once per sort-group and not just once per checkpoint, but it's not clear that that's a loss as long as the group size isn't tiny.

regards, tom lane
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 2014-08-30 14:16:10 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-30 13:50:40 -0400, Tom Lane wrote: A possible compromise is to sort a limited number of buffers -- say, collect a few thousand dirty buffers, then sort, dump and fsync them, repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because it robs us of the chance to fsync() a relfilenode immediately after having synced all its buffers.

Uh, how so exactly? You could still do that. Yeah, you might fsync a rel once per sort-group and not just once per checkpoint, but it's not clear that that's a loss as long as the group size isn't tiny.

Because it wouldn't have the benefit of syncing the minimal amount of data anymore. If lots of other relfilenodes have been synced in between, the amount of newly dirtied pages in the OS's buffer cache (written by backends, bgwriter) for an individual relfilenode is much higher. A fsync() on a file with dirty data often causes *serious* latency spikes - we should try hard to avoid superfluous calls.

As an example: Calling fsync() on pgbench_accounts's underlying files, from outside postgres, *before* postgres even started its first checkpoint does this:

progress: 72.0 s, 4324.9 tps, lat 41.481 ms stddev 40.567
progress: 73.0 s, 4704.9 tps, lat 38.465 ms stddev 35.436
progress: 74.0 s, 4448.5 tps, lat 40.058 ms stddev 32.634
progress: 75.0 s, 4634.5 tps, lat 39.229 ms stddev 33.463
progress: 76.8 s, 2753.1 tps, lat 48.693 ms stddev 75.309
progress: 77.1 s, 126.6 tps, lat 773.433 ms stddev 222.667
progress: 78.0 s, 183.7 tps, lat 786.401 ms stddev 395.954
progress: 79.1 s, 170.3 tps, lat 975.949 ms stddev 596.751
progress: 80.0 s, 2116.6 tps, lat 168.608 ms stddev 398.933
progress: 81.0 s, 4436.1 tps, lat 40.313 ms stddev 34.198
progress: 82.0 s, 4383.9 tps, lat 41.811 ms stddev 37.241

Note the dip from 4k tps to 130 tps.
We can get a handle on that (on some platforms at least) for writes issued during the buffer sync by forcing the kernel to write out the pages in small increments; but I doubt we want to do that for writes by backends themselves.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Selectivity estimation for inet operators
Emre Hasegeli e...@hasegeli.com writes: I updated the patch to cover semi and anti joins with eqjoinsel_semi(). I think it is better than returning a constant.

What you did there is utterly unacceptable from a modularity standpoint; and considering that the values will be nowhere near right, the argument that it's better than returning a constant seems pretty weak. I think you should just take that out again.

regards, tom lane
Re: [HACKERS] Selectivity estimation for inet operators
Heikki Linnakangas hlinnakan...@vmware.com writes: * inet_mcv_join_selec() is O(n^2) where n is the number of entries in the MCV lists. With the max statistics target of 10000, a worst-case query on my laptop took about 15 seconds to plan. Maybe that's acceptable, but you went through some trouble to make planning of MCV vs histogram faster, by the log2 method to compare only some values, so I wonder why you didn't do the same for the MCV vs MCV case?

Actually, what I think needs to be asked is the opposite question: why is the other code ignoring some of the statistical data? If the user asked us to collect a lot of stats detail, it seems reasonable that he's expecting us to use it to get more accurate estimates. It's for sure not obvious why these estimators should take shortcuts that are not being taken in the much-longer-established code for scalar comparison estimates.

I'm not exactly convinced that the math adds up in this logic, either. The way in which it combines results from looking at the MCV lists and at the histograms seems pretty arbitrary.

regards, tom lane
Re: [HACKERS] What in the world is happening with castoroides and protosciurus?
On Tue, Aug 26, 2014 at 10:17:05AM +0100, Dave Page wrote: On Tue, Aug 26, 2014 at 1:46 AM, Tom Lane t...@sss.pgh.pa.us wrote: For the last month or so, these two buildfarm animals (which I believe are the same physical machine) have been erratically failing with errors that reflect low-order differences in floating-point calculations. A recent example is at http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurusdt=2014-08-25%2010%3A39%3A52 where the only regression diff is

*** /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/expected/hash_index.out Mon Aug 25 11:41:00 2014
--- /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/results/hash_index.out Mon Aug 25 11:57:26 2014
***************
*** 171,179 ****
  SELECT h.seqno AS i8096, h.random AS f1234_1234
     FROM hash_f8_heap h
     WHERE h.random = '-1234.1234'::float8;
!  i8096 | f1234_1234
! -------+------------
!   8906 | -1234.1234
  (1 row)

  UPDATE hash_f8_heap
--- 171,179 ----
  SELECT h.seqno AS i8096, h.random AS f1234_1234
     FROM hash_f8_heap h
     WHERE h.random = '-1234.1234'::float8;
!  i8096 |    f1234_1234
! -------+-------------------
!   8906 | -1234.12356777216
  (1 row)

  UPDATE hash_f8_heap

... a result that certainly makes no sense. The results are not repeatable, failing in equally odd ways in different tests on different runs. This is happening in all the back branches too, not just HEAD.

I have no idea what is causing the current issue - the machine is stable software-wise, and only has private builds of dependency libraries updated periodically (which are not used for the buildfarm). If I had to hazard a guess, I'd suggest this is an early symptom of an old machine which is starting to give up.

Agreed. Rerunning each animal against older commits would test that theory. Say, run against the last 6 months of REL9_0_STABLE commits. If those runs show today's failure frequencies instead of historic failure frequencies, it's not a PostgreSQL regression.
Not that I see a commit back-patched near the time of the failure uptick (2014-08-06) that looks remotely likely to have introduced such a regression. It would be sad to lose our only buildfarm coverage of plain Solaris and of the Sun Studio compiler, but having buildfarm members this unstable is a pain. Perhaps have those animals retry the unreliable steps up to, say, 7 times?
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 08/30/2014 09:45 PM, Andres Freund wrote: On 2014-08-30 14:16:10 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-30 13:50:40 -0400, Tom Lane wrote: A possible compromise is to sort a limited number of buffers -- say, collect a few thousand dirty buffers, then sort, dump and fsync them, repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because it robs us of the chance to fsync() a relfilenode immediately after having synced all its buffers.

Uh, how so exactly? You could still do that. Yeah, you might fsync a rel once per sort-group and not just once per checkpoint, but it's not clear that that's a loss as long as the group size isn't tiny.

Because it wouldn't have the benefit of syncing the minimal amount of data anymore. If lots of other relfilenodes have been synced in between, the amount of newly dirtied pages in the OS's buffer cache (written by backends, bgwriter) for an individual relfilenode is much higher.

I wonder how much of the benefit from sorting comes from sorting the pages within each file, and how much just from grouping all the writes of each file together. In other words, how much difference is there between sorting, and fsyncing between each file, or the crude patch I posted earlier.

If we're going to fsync between each file, there's no need to sort all the buffers at once. It's enough to pick one file as the target - like in my crude patch - and sort only the buffers for that file. Then fsync that file and move on to the next file. That requires scanning the buffers multiple times, but I think that's OK.

- Heikki
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote: On 08/30/2014 09:45 PM, Andres Freund wrote: On 2014-08-30 14:16:10 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-30 13:50:40 -0400, Tom Lane wrote: A possible compromise is to sort a limited number of buffers -- say, collect a few thousand dirty buffers, then sort, dump and fsync them, repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because it robs us of the chance to fsync() a relfilenode immediately after having synced all its buffers.

Uh, how so exactly? You could still do that. Yeah, you might fsync a rel once per sort-group and not just once per checkpoint, but it's not clear that that's a loss as long as the group size isn't tiny.

Because it wouldn't have the benefit of syncing the minimal amount of data anymore. If lots of other relfilenodes have been synced in between, the amount of newly dirtied pages in the OS's buffer cache (written by backends, bgwriter) for an individual relfilenode is much higher.

I wonder how much of the benefit from sorting comes from sorting the pages within each file, and how much just from grouping all the writes of each file together. In other words, how much difference is there between sorting, and fsyncing between each file, or the crude patch I posted earlier.

I haven't implemented fsync()ing between files so far. From the I/O stats I'm seeing, the performance improvements come from the OS being able to write data back in bigger chunks. Which seems entirely reasonable. If the database and the write load are big enough that writeback is triggered repeatedly during one checkpoint, the OS's buffer cache will have lots of nonsequential data to flush, leading to much smaller I/Os, more seeks, and deeper queues (= latency increases).

If we're going to fsync between each file, there's no need to sort all the buffers at once. It's enough to pick one file as the target - like in my crude patch - and sort only the buffers for that file. Then fsync that file and move on to the next file. That requires scanning the buffers multiple times, but I think that's OK.

I really can't see that working out. Production instances of postgres with large shared_buffers settings (say 96GB in one case) have tens of thousands of relations (~34500 in the same case). And that's a database with a relatively simple schema. I've seen much worse.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Re: [BUGS] Re: BUG #9555: pg_dump for tables with inheritance recreates the table with the wrong order of columns
On Wed, Aug 27, 2014 at 09:40:30PM -0400, Noah Misch wrote: 3. use the pg_dump binary-upgrade code when such cases happen

+1. We have the convention that, while --binary-upgrade can inject catalog hacks, regular pg_dump uses standard, documented DDL. I like that convention on general aesthetic grounds and for its benefit to non-superusers. Let's introduce the DDL needed to fix this bug while preserving that convention, namely DDL to toggle attislocal.

I have spent some time researching this, and I am not sure what to recommend. The basic issue is that CREATE TABLE ... INHERITS always puts the inherited columns first, so to preserve column ordering, you have to use CREATE TABLE and then ALTER TABLE ... INHERIT. The problem there is that ALTER TABLE ... INHERIT doesn't preserve attislocal, and it also has problems with constraints not being marked local. I am just not sure we want to add SQL-level code to do that. Would it be documented?

I decided to step back and consider some issues. Basically, in non-binary-upgrade mode, pg_dump takes:

CREATE TABLE A(a int, b int, c int);
CREATE TABLE B(a int, c int);
ALTER TABLE a INHERIT B;

and dumps it as:

CREATE TABLE a (
    a integer,
    b integer,
    c integer
)
INHERITS (b);

This looks perfect, but, of course, it isn't. Column c is going to be placed before b. You do get a notice about merged columns, but no mention of the column reordering:

NOTICE: merging column "a" with inherited definition
NOTICE: merging column "c" with inherited definition

I think there are two common cases for CREATE TABLE ... INHERITS, and then there is this odd one. The common cases are the one where all inherited columns are mentioned explicitly in CREATE TABLE ... INHERITS, in order, and the one where none of the inherited columns are mentioned. The case above is the odd case where some are mentioned, but in a different order. I have developed the attached patch to warn about column reordering in this odd case.
The patch mentions the reordering of c:

NOTICE: merging column "a" with inherited definition
NOTICE: merging column "c" with inherited definition; column moved earlier to match inherited column location

This, of course, will be emitted when the pg_dump output is restored. This is probably the minimum we should do.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
new file mode 100644
index 3720a0f..4846d25
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** MergeAttributes(List *schema, List *supe
*** 1756,1767 ****
--- 1756,1771 ----
  		 */
  		if (inhSchema != NIL)
  		{
+ 			int			schema_attno = 0;
+ 
  			foreach(entry, schema)
  			{
  				ColumnDef  *newdef = lfirst(entry);
  				char	   *attributeName = newdef->colname;
  				int			exist_attno;
  
+ 				schema_attno++;
+ 
  				/*
  				 * Does it conflict with some previously inherited column?
  				 */
*** MergeAttributes(List *schema, List *supe
*** 1780,1788 ****
  				 * Yes, try to merge the two column definitions. They must
  				 * have the same type, typmod, and collation.
  				 */
! 				ereport(NOTICE,
! 						(errmsg("merging column \"%s\" with inherited definition",
! 								attributeName)));
  				def = (ColumnDef *) list_nth(inhSchema, exist_attno - 1);
  				typenameTypeIdAndMod(NULL, def->typeName, &defTypeId, &deftypmod);
  				typenameTypeIdAndMod(NULL, newdef->typeName, &newTypeId, &newtypmod);
--- 1784,1797 ----
  				 * Yes, try to merge the two column definitions. They must
  				 * have the same type, typmod, and collation.
  				 */
! 				if (exist_attno == schema_attno)
! 					ereport(NOTICE,
! 							(errmsg("merging column \"%s\" with inherited definition",
! 									attributeName)));
! 				else
! 					ereport(NOTICE,
! 							(errmsg("merging column \"%s\" with inherited definition; column moved earlier to match inherited column location",
! 									attributeName)));
  				def = (ColumnDef *) list_nth(inhSchema, exist_attno - 1);
  				typenameTypeIdAndMod(NULL, def->typeName, &defTypeId, &deftypmod);
  				typenameTypeIdAndMod(NULL, newdef->typeName, &newTypeId, &newtypmod);
Re: [HACKERS] COPY and heap_sync
On Sat, Aug 30, 2014 at 5:05 AM, Atri Sharma atri.j...@gmail.com wrote: On Saturday, August 30, 2014, Amit Kapila amit.kapil...@gmail.com wrote: On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes jeff.ja...@gmail.com wrote: If you insert tuples with COPY into a table created or truncated in the same transaction, at the end of the COPY it calls heap_sync. But there are cases where people use COPY in a loop with a small amount of data in each statement. Now it is calling heap_sync many times, and if NBuffers is large doing that gets very slow. Could the heap_sync be safely delayed until the end of the transaction, rather than the end of the COPY?

Wouldn't unconditionally delaying the sync until the end of the transaction lead to a burst of I/O at that time, especially if there are many such COPY commands in a transaction, delaying other operations that might be happening concurrently in the system?

I agree with that, but then again, it can provide us the same benefits as group commit, especially when most of the COPY commands touch pages which are nearby, hence reducing the seek-time overhead. We could look at making it optional through a GUC, since it is useful albeit for some specific use cases.

It's interesting... maybe something analogous to SET CONSTRAINTS DEFERRED...

SET COPY COMMIT { IMMEDIATE | DEFERRED }

or

SET COPY MODE { IMMEDIATE | DEFERRED }

Just some thoughts!

Regards,

-- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On 28 August 2014 00:25, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Thomas Munro wrote: I haven't yet figured out how to get into a situation where heap_lock_updated_tuple_rec waits.

Well, as I think I said in the first post where I mentioned this, maybe there is no such situation. In any case, like the EvalPlanQualFetch issue, we can fix it later if we find it.

I finally came up with a NOWAIT spec that reaches heap_lock_updated_tuple_rec and then blocks. I can't explain why exactly... but please see attached. The fix seems fairly straightforward. Do you think I should submit an independent patch to fix this case (well, there are really two cases, since there is a separate multixact path) for the existing NOWAIT support and then tackle the SKIP LOCKED equivalent separately?

Best regards, Thomas Munro

# Test NOWAIT with an updated tuple chain (heap_lock_updated_tuple case)

setup
{
  CREATE TABLE foo (
    id int PRIMARY KEY,
    data text NOT NULL
  );
  INSERT INTO foo VALUES (1, 'x');
}

teardown
{
  DROP TABLE foo;
}

session "s1"
setup     { BEGIN; }
step "s1a" { SELECT * FROM foo WHERE pg_advisory_lock(0) IS NOT NULL FOR KEY SHARE NOWAIT; }
step "s1b" { COMMIT; }

session "s2"
setup     { BEGIN; }
step "s2a" { SELECT pg_advisory_lock(0); }
step "s2b" { UPDATE foo SET id = id; UPDATE foo SET id = id + 10; }
step "s2c" { SELECT pg_advisory_unlock(0); }
step "s2d" { COMMIT; }

permutation "s2a" "s1a" "s2b" "s2c" "s1b" "s2d"
Re: [HACKERS] PATCH: Allow distdir to be overridden on make command line
On Fri, 2014-08-29 at 10:04 +0800, Craig Ringer wrote: Not just a one-line patch, a one-character patch. Use ?= instead of = in the distdir assignment, so it can be overridden on the command line when building dist tarballs with patches.

This is already possible without this patch. You can also override the VERSION variable. Moreover, you might be interested in the configure option --with-extra-version.
Re: [HACKERS] postgresql latency bgwriter not doing its job
Hi,

2014-08-31 8:10 GMT+09:00 Andres Freund and...@2ndquadrant.com: On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote: If we're going to fsync between each file, there's no need to sort all the buffers at once. It's enough to pick one file as the target - like in my crude patch - and sort only the buffers for that file. Then fsync that file and move on to the next file. That requires scanning the buffers multiple times, but I think that's OK.

I really can't see that working out. Production instances of postgres with large shared_buffers settings (say 96GB in one case) have tens of thousands of relations (~34500 in the same case). And that's a database with a relatively simple schema. I've seen much worse.

Yeah, that seems impossible in a single checkpointer process. The cost of searching all the buffers is higher than we expect. We would need a clever algorithm for an efficient, distributed buffer search using multiple processes or threads.

Regards,
-- Mitsumasa KONDO
Re: [HACKERS] PATCH: Allow distdir to be overridden on make command line
On Sun, Aug 31, 2014 at 10:37 AM, Peter Eisentraut pete...@gmx.net wrote: On Fri, 2014-08-29 at 10:04 +0800, Craig Ringer wrote: Not just a one line patch, a one character patch. Use ?= instead of = in distdir assignment, so it can be overridden on the command line when building dist tarballs with patches. This is already possible without this patch. You can also override the VERSION variable. Moreover, you might be interested in the configure option --with-extra-version. --with-extra-version is available as well with MSVC scripts on HEAD. -- Michael
Re: [HACKERS] add line number as prompt option to psql
On Tue, Aug 26, 2014 at 10:23 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-08-21 11:43:48 +0900, Sawada Masahiko wrote: On Wed, Aug 20, 2014 at 9:00 PM, Jeevan Chalke jeevan.cha...@enterprisedb.com wrote: Hi, I have reviewed this: I have initialized cur_lineno to UINTMAX - 2 and then observed the following behaviour to check wrap-around.

postgres=# \set PROMPT1 '%/[%l]%R%# '
postgres[18446744073709551613]=# \set PROMPT2 '%/[%l]%R%# '
postgres[18446744073709551613]=# select
postgres[18446744073709551614]-# a
postgres[18446744073709551615]-# ,
postgres[0]-# b
postgres[1]-# from
postgres[2]-# dual;

It is wrapping to 0, whereas line numbers always start at 1. Any issues? I would like to ignore this, as UINTMAX lines are far too many for an input buffer to hold; there is almost no chance of hitting this. However, I think you need to use UINT64_FORMAT while printing a uint64 number. Currently it is %u, which wraps around at UINT_MAX. See how pset.lineno is displayed. Apart from this, I didn't see any issues in my testing. Patch applies cleanly. make/make install/initdb/make check all are well.

Thank you for reviewing the patch! Attached patch is the latest version. I modified the output format of cur_lineno.

I like the feature - and I wanted to commit it, but enough stuff turned up that I needed to fix that it warrants some new testing. Stuff I've changed:

* removed include of limits.h - that probably was a remnant from a previous version
* removed a trailing whitespace
* expanded the documentation about %l. "The current line number" isn't very clear. Of a file? Of all lines ever entered in psql? It's now "The line number inside the current statement, starting from <literal>1</literal>."
* Correspondingly I've changed the variable's name to stmt_lineno.
* COPY FROM ... STDIN/PROMPT3 was broken because a) the prompt was only generated once b) the lineno wasn't incremented.
* CTRL-C didn't reset the line number.
* Unfortunately I've noticed here that the prompting is broken in some common cases: postgres[1]=# SELECT 1, postgres[2]-# '2 postgres[2]'# 2b postgres[2]'# 2c postgres[2]'# 2d', postgres[3]-# 3; ┌──┬──┬──┐ │ ?column? │ ?column? │ ?column? │ ├──┼──┼──┤ │1 │ 2 ↵│3 │ │ │ 2b ↵│ │ │ │ 2c ↵│ │ │ │ 2d │ │ └──┴──┴──┘ (1 row) postgres[1]=# SELECT 1, '2 2b 2c 2d', 3 postgres[7]-# That's rather inconsistent... I've attached my version of the patch. Note that I've got rid of all the PSCAN_INCOMPLETE checks... Please test! Thank you for the review comments and for improving the patch! I tested it. Your patch always increments the line number, even if there is no input on the line, as follows. postgres[1]=# postgres[2]=# select postgres[3]-# , postgres[4]-# from postgres[5]-# hoge; ERROR: syntax error at or near "," at character 8 STATEMENT: select , from hoge; ERROR: syntax error at or near "," LINE 2: , ^ The syntax error is actually on line 2, as postgres reported, but the prompt numbering is inconsistent with that. The attached patch resolves the above behavior, based on your version of the patch. Regards, --- Sawada Masahiko psql-line-number_v7.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_dump refactor patch to remove global variables
The general idea of this patch is not disputed, I think. Some concerns about the details:
- Why is quote_all_identifiers left behind as a global variable?
- Shouldn't pg_dumpall also use DumpOptions?
- What about binary_upgrade?
- I think some of the variables in pg_dump's main() don't need to be moved into DumpOptions. For example, compressLevel is only looked at once in main() and then forgotten. We don't need to pass that around everywhere. The same goes for dumpencoding and possibly others.
- The forward declaration of struct _dumpOptions in pg_backup.h seems kind of useless. You could move some things around so that that's not necessary.
- NewDumpOptions() and NewRestoreOptions() are both in pg_backup_archiver.c, but NewRestoreOptions() is in pg_backup.h whereas NewDumpOptions() is being put into pg_backup_archiver.h. None of that makes too much sense, but it could be made more consistent.
- In dumpOptionsFromRestoreOptions() you are building the return value in a local variable that is not zeroed. You should probably use NewDumpOptions() there.
Also, looking at dumpOptionsFromRestoreOptions() and related code makes me think that these should perhaps really be the same structure. Have you investigated that?
Re: [HACKERS] improving speed of make check-world
Updated, rebased patch.

diff --git a/.gitignore b/.gitignore
index 681af08..823d3ac 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,3 +34,4 @@ lib*.pc
 /pgsql.sln.cache
 /Debug/
 /Release/
+/tmp_install/
diff --git a/GNUmakefile.in b/GNUmakefile.in
index 69e0824..5667943 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -47,6 +47,7 @@ $(call recurse,distprep,doc src config contrib)
 # it's not built by default
 $(call recurse,clean,doc contrib src config)
 clean:
+	rm -rf tmp_install/
 # Garbage from autoconf:
 	@rm -rf autom4te.cache/
@@ -61,6 +62,8 @@ distclean maintainer-clean:
 # Garbage from autoconf:
 	@rm -rf autom4te.cache/
 
+check-world: temp-install
+
 check check-tests: all
 check check-tests installcheck installcheck-parallel installcheck-tests:
diff --git a/contrib/earthdistance/Makefile b/contrib/earthdistance/Makefile
index 93dcbe3..cde1ae6 100644
--- a/contrib/earthdistance/Makefile
+++ b/contrib/earthdistance/Makefile
@@ -7,7 +7,7 @@
 DATA = earthdistance--1.0.sql earthdistance--unpackaged--1.0.sql
 PGFILEDESC = earthdistance - calculate distances on the surface of the Earth
 REGRESS = earthdistance
-REGRESS_OPTS = --extra-install=contrib/cube
+EXTRA_INSTALL = contrib/cube
 
 LDFLAGS_SL += $(filter -lm, $(LIBS))
diff --git a/contrib/pg_upgrade/test.sh b/contrib/pg_upgrade/test.sh
index 7bbd2c7..7d493d9 100644
--- a/contrib/pg_upgrade/test.sh
+++ b/contrib/pg_upgrade/test.sh
@@ -80,7 +80,7 @@
 if [ "$1" = '--install' ]; then
 	# use psql from the proper installation directory, which might
 	# be outdated or missing. But don't override anything else that's
 	# already in EXTRA_REGRESS_OPTS.
-	EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --psqldir='$bindir'"
+	EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --bindir='$bindir'"
 	export EXTRA_REGRESS_OPTS
 fi
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index d7f32c3..6210ddb 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -39,35 +39,33 @@ submake-test_decoding:
 REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact binary prepared
 
-regresscheck: all | submake-regress submake-test_decoding
+regresscheck: all | submake-regress submake-test_decoding temp-install
 	$(MKDIR_P) regression_output
 	$(pg_regress_check) \
 	    --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf \
-	    --temp-install=./tmp_check \
-	    --extra-install=contrib/test_decoding \
+	    --temp-instance=./tmp_check \
 	    --outputdir=./regression_output \
 	    $(REGRESSCHECKS)
 
-regresscheck-install-force: | submake-regress submake-test_decoding
+regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	$(pg_regress_installcheck) \
-	    --extra-install=contrib/test_decoding \
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup concurrent_ddl_dml
 
-isolationcheck: all | submake-isolation submake-test_decoding
+isolationcheck: all | submake-isolation submake-test_decoding temp-install
 	$(MKDIR_P) isolation_output
 	$(pg_isolation_regress_check) \
 	    --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf \
-	    --extra-install=contrib/test_decoding \
 	    --outputdir=./isolation_output \
 	    $(ISOLATIONCHECKS)
 
-isolationcheck-install-force: all | submake-isolation submake-test_decoding
+isolationcheck-install-force: all | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_installcheck) \
-	    --extra-install=contrib/test_decoding \
 	    $(ISOLATIONCHECKS)
 
 .PHONY: submake-test_decoding submake-regress check \
 	regresscheck regresscheck-install-force \
 	isolationcheck isolationcheck-install-force
+
+temp-install: EXTRA_INSTALL=contrib/test_decoding
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 0ffc1e8..3238c5c 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -41,6 +41,7 @@ MAJORVERSION = @PG_MAJORVERSION@
 
 # Support for VPATH builds
 vpath_build = @vpath_build@
+abs_top_builddir = @abs_top_builddir@
 abs_top_srcdir = @abs_top_srcdir@
 
 ifneq ($(vpath_build),yes)
@@ -296,6 +297,17 @@ BZIP2 = bzip2
 
 # Testing
 
+check: all temp-install
+
+.PHONY: temp-install
+temp-install:
+ifeq ($(MAKELEVEL),0)
+	rm -rf $(abs_top_builddir)/tmp_install
+	$(MKDIR_P) $(abs_top_builddir)/tmp_install/log
+	$(MAKE) -C $(top_builddir) DESTDIR=$(abs_top_builddir)/tmp_install install >$(abs_top_builddir)/tmp_install/log/install.log 2>&1
+endif
+	for extra in $(EXTRA_INSTALL); do $(MAKE) -C $(top_builddir)/$$extra DESTDIR=$(abs_top_builddir)/tmp_install install >>$(abs_top_builddir)/tmp_install/log/install.log 2>&1 || exit; done
+
 PROVE = @PROVE@
 PG_PROVE_FLAGS = --ext='.pl' -I $(top_srcdir)/src/test/perl/
 PROVE_FLAGS = --verbose
@@ -310,14 +322,16 @@ define ld_library_path_var
 $(if $(filter $(PORTNAME),darwin),DYLD_LIBRARY_PATH,$(if $(filter $(PORTNAME),aix),LIBPATH,LD_LIBRARY_PATH))
 endef
 
+define with_temp_install
+PATH=$(abs_top_builddir)/tmp_install$(bindir):$$PATH $(call
[HACKERS] Tips/advice for implementing integrated RESTful HTTP API
A while back I was working on a little proposal to create a RESTful HTTP front-end for PostgreSQL, and recently I had the inspiration to work on this. So I successfully created a background worker for PostgreSQL 9.3 that can use the SPI to list off the databases in a JSON response. WIP on github: https://github.com/dobesv/restgres/ Now I'm getting into murkier waters and I'm wondering if I can get some helpful tips to guide my R&D here.

1. Connecting to multiple databases

The background workers can apparently only connect to a single database at a time, but I want to expose all the databases via the API. I think I could use libpq to connect to PostgreSQL on localhost, but this might have weird side-effects in terms of authentication, pid use, stuff like that. I could probably manage a pool of dynamic workers (as of 9.4), one per user/database combination or something along those lines. Even one per request? Is there some kind of IPC system in place to help shuttle the requests and responses between dynamic workers? Or do I need to come up with my own? It seems like PostgreSQL itself has a way to shuttle requests out to workers; is it possible to tap into that system instead? Basically, some way to send the requests to a PostgreSQL backend from the background worker? Or perhaps I shouldn't do this as a worker but rather modify PostgreSQL itself and do it in a more integrated/destructive manner?

2. Authentication

I was trying to use the function md5_crypt_verify to authenticate the user using their password, and I believe I am providing the right password, but it's not being accepted. Any tips on authenticating users in a background worker? Where should I be looking for examples?

3. Parallelism

The regular PostgreSQL server can run many queries in parallel, but it seems like if I am using SPI I can only run one query at a time - it's not an asynchronous API.
This seems related to the multiple-databases issue - either I could use libpq to translate/forward requests onto PostgreSQL's own worker system, or set up my own little worker pool to run the requests in parallel and have a way to send the request/response data to/from those workers. Any help, sage advice, tips, and suggestions on how to move forward in these areas would be muchly appreciated! Regards, Dobes
Re: [HACKERS] [v9.5] Custom Plan API
2014-08-29 13:33 GMT-04:00 Robert Haas robertmh...@gmail.com: On Wed, Aug 27, 2014 at 6:51 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I'd like to follow this direction, and start stripping the DDL support. ...please make it so. The attached patch eliminates DDL support. Instead of the new CREATE CUSTOM PLAN PROVIDER statement, it adds an internal function, register_custom_scan_provider, that takes a custom plan provider name and a callback function to add an alternative scan path (in the form of a CustomPath) while the query planner is finding the cheapest path to scan the target relation. Also, the documentation is revised according to the latest design. Everything else keeps the previous design. Comments: 1. There seems to be no reason for custom plan nodes to have MultiExec support; I think this is an area where extensibility is extremely unlikely to work out. The MultiExec mechanism is really only viable between closely-cooperating nodes, like Hash and HashJoin, or BitmapIndexScan, BitmapAnd, BitmapOr, and BitmapHeapScan; and arguably those things could have been written as a single, more complex node. Are we really going to want to support a custom plan that can substitute for a Hash or BitmapAnd node? I really doubt that's very useful. This is intended to allow a particular custom-scan provider to exchange its internal data when multiple custom-scan nodes are stacked. So, it can be considered a facility to implement closely-cooperating nodes; both of them are managed by the same custom-scan provider. An example is a GPU-accelerated version of hash join that takes an underlying custom-scan node returning a hash table in a GPU-preferable data structure, which should not be part of the row-by-row interface. I believe it is valuable for some use cases, even though I couldn't find a use case in the ctidscan example. 2. This patch is still sort of on the fence about whether we're implementing custom plans (of any type) or custom scans (thus, of some particular relation).
I previously recommended that we confine ourselves initially to the task of adding custom *scans* and leave the question of other kinds of custom plan nodes to a future patch. After studying the latest patch, I'm inclined to suggest a slightly revised strategy. This patch is really adding THREE kinds of custom objects: CustomPlanState, CustomPlan, and CustomPath. CustomPlanState inherits from ScanState, so it is not really a generic CustomPlan, but specifically a CustomScan; likewise, CustomPlan inherits from Scan, and is therefore a CustomScan, not a CustomPlan. But CustomPath is different: it's just a Path. Even if we only have the hooks to inject CustomPaths that are effectively scans at this point, I think that part of the infrastructure could be somewhat generic. Perhaps eventually we have CustomPath which can generate either CustomScan or CustomJoin which in turn could generate CustomScanState and CustomJoinState. The suggestion seems reasonable to me. The reason CustomPlanState inherits ScanState and CustomPlan inherits Scan is just convenience for the implementation of extensions. Some useful internal APIs, like ExecScan(), take an argument of ScanState, so it was a better strategy to choose Scan/ScanState instead of the bare Plan/PlanState. Anyway, I'd like to follow the perspective that treats CustomScan as one derivative of CustomPath. It is more flexible. For now, I propose that we rename CustomPlan and CustomPlanState to CustomScan and CustomScanState, because that's what they are; but that we leave CustomPath as-is. For ease of review, I also suggest splitting this into a series of three patches: (1) add support for CustomPath; (2) add support for CustomScan and CustomScanState; (3) ctidscan. OK, I'll do that. 3. Is it really a good idea to invoke custom scan providers for RTEs of every type? It's pretty hard to imagine that a custom scan provider can do anything useful with, say, RTE_VALUES.
Maybe an accelerated scan of RTE_CTE or RTE_SUBQUERY is practical somehow, but even that feels like an awfully big stretch. At least until clear use cases emerge, I'd be inclined to restrict this to RTE_RELATION scans where rte->relkind != RELKIND_FOREIGN_TABLE; that is, put the logic in set_plain_rel_pathlist() rather than set_rel_pathlist(). I'd like to agree. Indeed, it's not easy to imagine a use case for custom logic on non-plain relations. (We might even want to consider whether the hook in set_plain_rel_pathlist() ought to be allowed to inject a non-custom plan; e.g. substitute a scan of relation B for a scan of relation A. For example, imagine that B contains all rows from A that satisfy some predicate. This could even be useful for foreign tables; e.g. substitute a scan of a local copy of a foreign table for a reference to that table. But I put all of these ideas in parentheses because they're only good ideas to the extent that they don't sidetrack