Corrected: Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
(2012/03/12 13:04), Etsuro Fujita wrote: (2012/03/09 23:48), Tom Lane wrote: Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes: 2. IMHO RelOptInfo.fdw_private seems confusing. How about renaming it to e.g., RelOptInfo.fdw_state? Why is that better? It seems just as open to confusion with another field (ie, the execution-time fdw_state). I thought about that risk. However, I feel that the naming of RelOptInfo.fdw_state is not so bad because it is used only at query planning time, not along with the execution-time fdw_private. I wrote the execution-time fdw_private by mistake. I meant the execution-time fdw_state. I'm sorry about that. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
On Thu, Mar 8, 2012 at 4:51 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: True. If (max count - min count + 1) is small, enumerating the frequencies is both a more compact and a more precise representation. Simultaneously, if (max count - min count + 1) is large, we can run out of statistics_target with such a representation. We can use the same representation of the count distribution as for a scalar column value: MCV and HISTOGRAM, but it would require an additional statkind and statistics slot. Probably you have better ideas? I wasn't thinking of introducing two different representations, but just trimming the histogram length when it's larger than necessary. On reflection my idea above is wrong; for example assume that we have a column with 900 arrays of length 1 and 100 arrays of length 2. Going by what I said, we'd reduce the histogram to {1,2}, which might accurately capture the set of lengths present but fails to show that 1 is much more common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries) would capture the situation perfectly in one-tenth the space that the current logic does. More generally, by limiting the histogram to statistics_target entries, we are already accepting errors of up to 1/(2*statistics_target) in the accuracy of the bin-boundary representation. What the above example shows is that sometimes we could meet the same accuracy requirement with fewer entries. I'm not sure how this could be mechanized but it seems worth thinking about. I can propose the following representation of the histogram. If (max_count - min_count + 1) <= statistics_target then 1) store max_count and min_count in stavalues 2) store frequencies from min_count to max_count in numvalues. If (max_count - min_count + 1) > statistics_target then store the histogram in the current manner. I think in this case it's unlikely to be many repeating values. I can propose a patch which changes the histogram representation to this. Comments?
-- With best regards, Alexander Korotkov.
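Alexander's proposed representation can be sketched in C. This is a hypothetical illustration only (all names are invented, not from the actual analyze code): when every distinct array length fits within statistics_target slots, store exact per-length frequencies; otherwise signal a fallback to the current histogram approach.

```c
/* Hypothetical sketch of the proposed length-statistics representation:
 * REP_FREQUENCIES stores an exact count for each length between min and
 * max; REP_HISTOGRAM means the caller should keep the current bin-based
 * histogram because there are too many distinct lengths. */
typedef enum { REP_FREQUENCIES, REP_HISTOGRAM } LengthRep;

static LengthRep
choose_length_rep(const int *lengths, int n, int stats_target,
                  int *freqs, int *min_count, int *max_count)
{
    int i, lo = lengths[0], hi = lengths[0];

    for (i = 1; i < n; i++)
    {
        if (lengths[i] < lo) lo = lengths[i];
        if (lengths[i] > hi) hi = lengths[i];
    }
    *min_count = lo;
    *max_count = hi;

    if (hi - lo + 1 > stats_target)
        return REP_HISTOGRAM;       /* too many distinct lengths */

    for (i = 0; i < hi - lo + 1; i++)
        freqs[i] = 0;
    for (i = 0; i < n; i++)
        freqs[lengths[i] - lo]++;   /* exact count for each length */
    return REP_FREQUENCIES;
}
```

With Tom's example (900 arrays of length 1, 100 of length 2), this stores just {900, 100} with min_count = 1, capturing the skew exactly in two slots.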
Re: [HACKERS] elegant and effective way for running jobs inside a database
On Sat, Mar 10, 2012 at 2:59 PM, Andrew Dunstan and...@dunslane.net wrote: The devil is in the details, though, pace Mies van der Rohe. In particular, it's the tight integration piece I'm worried about. What is the postmaster supposed to do if the daemon start fails? What if it gets a flood of failures? What access will the daemon have to Postgres internals? What OS privileges will it have, since this would have to run as the OS postgres user? In general I think we don't want arbitrary processes running as the OS postgres user. So why are the answers to those questions different for a daemon than for a C function executed from an external client? What additional exposure is there? I accept that cron might not be the best tool for the jobs, since a) its finest granularity is 1 minute and b) it would need a new connection for each job. But a well written external daemon that runs as a different user and is responsible for making its own connection to the database and re-establishing it if necessary, seems to me at least as clean a design for a job scheduler as one that is stopped and started by the postmaster. As of this thread, you can see that many people don't agree. Bear in mind that nobody is trying to prevent you from writing a program in that way if you believe that. That route will remain available. It's a key aspect of modular software we're talking about. People want to have programs that are intimately connected to the database, so that nobody needs to change the operational instructions when they start or stop the database. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
Tom Lane wrote: Shigeru Hanada shigeru.han...@gmail.com writes: Thanks for the review. Agreed to write our own deparser for pgsql_fdw which handles nodes which can be pushed down. Every SQL-based FDW which constructs SQL statements for each local query would need such a module inside. Yeah. That's kind of annoying, and the first thing you think of is that we ought to find a way to share that code somehow. But I think it's folly to try to design a shared implementation until we have some concrete implementations to compare. An Oracle FDW, for instance, would need to emit SQL code with many differences in detail from pgsql_fdw. It's not clear to me whether a shared implementation is even practical, but for sure I don't want to try to build it before we've done some prototype single-purpose implementations. Having written something like that for Oracle, I tend to share that opinion. Anything general-purpose enough to cater for every whim and oddity of the remote system would probably be so unwieldy that it wouldn't be much easier to use it than to write the whole thing from scratch. To illustrate this, a few examples from the Oracle case: - Empty strings have different semantics in Oracle (to wit, they mean NULL). So you can push down all string constants except empty strings. - Oracle can only represent intervals with just year and month or with just day of month and smaller fields. So you can either punt on intervals or translate only the ones that fit the bill. - You can push down "-" for date arithmetic except when both operands on the Oracle side are of type DATE, because that would result in a NUMERIC value (number of days between). Yours, Laurenz Albe
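Laurenz's first example can be sketched as a tiny deparser helper. This is purely illustrative (the function name and shape are invented, not taken from any real FDW): an Oracle-targeted deparser has to refuse to push down empty string constants, because Oracle treats '' as NULL and pushing it down would silently change the query's semantics.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch of a remote-specific pushdown check: an Oracle
 * deparser may push down any string constant except the empty string,
 * which Oracle would interpret as NULL. */
static bool
oracle_can_push_down_string_const(const char *value, bool isnull)
{
    if (isnull)
        return true;            /* NULL means NULL on both sides */
    return strlen(value) > 0;   /* '' means NULL in Oracle: keep it local */
}
```

A PostgreSQL-targeted deparser would have no such restriction, which is exactly why a shared deparser would end up as a pile of per-remote special cases.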
Re: [HACKERS] poll: CHECK TRIGGER?
2012/3/11 Petr Jelinek pjmo...@pjmodos.net: On 03/10/2012 04:36 PM, Pavel Stehule wrote: and here is a cleaned version. Reran my tests and adjusted docs a bit; tbh I considered the previous versions way more useful, but even in this form it's nice and useful functionality. Removed two lines of dead code. Regards Pavel Regards Petr Jelinek plpgsql_check_function2-2012-03-12.diff.gz Description: GNU Zip compressed data
[HACKERS] NOT NULL violation error handling in file_fdw
According to the following documentation on IterateForeignScan() in "50.2. Foreign Data Wrapper Callback Routines", I have created a patch to support the error handling in file_fdw. Please find attached a patch.

    Note that PostgreSQL's executor doesn't care whether the rows returned violate the NOT NULL constraints which were defined on the foreign table columns - but the planner does care, and may optimize queries incorrectly if NULL values are present in a column declared not to contain them. If a NULL value is encountered when the user has declared that none should be present, it may be appropriate to raise an error (just as you would need to do in the case of a data type mismatch).

Best regards, Etsuro Fujita

*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***************
*** 502,507 **** fileIterateForeignScan(ForeignScanState *node)
--- 502,510 ----
  {
  	FileFdwExecutionState *festate = (FileFdwExecutionState *) node->fdw_state;
  	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
+ 	Relation	rel = node->ss.ss_currentRelation;
+ 	TupleConstr *constr = rel->rd_att->constr;
+ 	bool		has_not_null = (constr != NULL) ? constr->has_not_null : false;
  	bool		found;
  	ErrorContextCallback errcontext;
***************
*** 528,534 **** fileIterateForeignScan(ForeignScanState *node)
--- 531,556 ----
  							   slot->tts_values, slot->tts_isnull, NULL);
  	if (found)
+ 	{
  		ExecStoreVirtualTuple(slot);
+ 		if (has_not_null)
+ 		{
+ 			int			natts = rel->rd_att->natts;
+ 			int			attrChk;
+ 
+ 			for (attrChk = 1; attrChk <= natts; attrChk++)
+ 			{
+ 				if (rel->rd_att->attrs[attrChk - 1]->attnotnull &&
+ 					slot_attisnull(slot, attrChk))
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_NOT_NULL_VIOLATION),
+ 							 errmsg("null value in column \"%s\" violates not-null constraint",
+ 									NameStr(rel->rd_att->attrs[attrChk - 1]->attname)),
+ 							 errdetail("Failing row contains %s.",
+ 									   ExecBuildSlotValueDescription(slot, 64))));
+ 			}
+ 		}
+ 	}
  
  	/* Remove error callback.
  	 */
  	error_context_stack = errcontext.previous;
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
***************
*** 86,93 ****
  static void ExecutePlan(EState *estate, PlanState *planstate,
  						DestReceiver *dest);
  static bool ExecCheckRTEPerms(RangeTblEntry *rte);
  static void ExecCheckXactReadOnly(PlannedStmt *plannedstmt);
- static char *ExecBuildSlotValueDescription(TupleTableSlot *slot,
- 						 int maxfieldlen);
  static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
  						Plan *planTree);
  static void OpenIntoRel(QueryDesc *queryDesc);
--- 86,91 ----
***************
*** 1604,1610 **** ExecConstraints(ResultRelInfo *resultRelInfo,
   * here since heap field values could be very long, whereas index entries
   * typically aren't so wide.
   */
! static char *
  ExecBuildSlotValueDescription(TupleTableSlot *slot, int maxfieldlen)
  {
  	StringInfoData buf;
--- 1602,1608 ----
   * here since heap field values could be very long, whereas index entries
   * typically aren't so wide.
   */
! char *
  ExecBuildSlotValueDescription(TupleTableSlot *slot, int maxfieldlen)
  {
  	StringInfoData buf;
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
***************
*** 184,189 **** extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
--- 184,191 ----
  extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
  extern void ExecConstraints(ResultRelInfo *resultRelInfo,
  				TupleTableSlot *slot, EState *estate);
+ extern char *ExecBuildSlotValueDescription(TupleTableSlot *slot,
+ 						 int maxfieldlen);
  extern ExecRowMark *ExecFindRowMark(EState *estate, Index rti);
  extern ExecAuxRowMark *ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist);
  extern TupleTableSlot *EvalPlanQual(EState *estate, EPQState *epqstate,
Re: [HACKERS] Bug: walsender and high CPU usage
On 09.03.2012 13:40, Fujii Masao wrote: I found the bug which causes walsender to enter into busy loop when replication connection is terminated. Walsender consumes lots of CPU resource (%sys), and this situation lasts until it has detected the termination of replication connection and exited. The cause of this bug is that the walsender loop doesn't call ResetLatch at all in the above case. Since the latch remains set, the walsender loop cannot sleep on the latch, i.e., WaitLatch always returns immediately. We can fix this bug by adding ResetLatch into the top of the walsender loop. Patch attached. This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has already been fixed by the commit (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit refactors and refines the walsender loop logic in addition to adding ResetLatch. So I'm tempted to backport this commit (except the deletion of wal_sender_delay) to 9.1 rather than applying the attached patch. OTOH, attached patch is quite simple, and its impact on 9.1 would be very small, so it's easy to backport that. Thought?

This patch makes the code that follows bogus:

	/*
	 * If we don't have any pending data in the output buffer, try to send
	 * some more.
	 */
	if (!pq_is_send_pending())
	{
		XLogSend(output_message, &caughtup);

		/*
		 * Even if we wrote all the WAL that was available when we started
		 * sending, more might have arrived while we were sending this
		 * batch. We had the latch set while sending, so we have not
		 * received any signals from that time. Let's arm the latch again,
		 * and after that check that we're still up-to-date.
		 */
		if (caughtup && !pq_is_send_pending())
		{
			ResetLatch(&MyWalSnd->latch);
			XLogSend(output_message, &caughtup);
		}
	}

The comment is no longer valid, and the calls to ResetLatch and XLogSend are no longer necessary, once you have the ResetLatch() call at the top of the loop. I also think we should backport commit cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of wal_sender_delay.
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
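The busy loop described above can be shown with a toy model. This is not the real latch API, just a simulation of its set/reset semantics: once the latch has been set, a loop that never resets it gets an immediate return from every wait, spinning at full speed, while a ResetLatch at the top of each iteration clears the pending wakeup first.

```c
#include <stdbool.h>

/* Toy model of latch semantics (invented names, not the real API). */
typedef struct { bool is_set; } FakeLatch;

static void fake_set_latch(FakeLatch *l)   { l->is_set = true; }
static void fake_reset_latch(FakeLatch *l) { l->is_set = false; }

/* Returns true if the wait actually slept (latch was clear), false if it
 * returned immediately because the latch was still set. */
static bool fake_wait_latch(const FakeLatch *l) { return !l->is_set; }

/* Count how many iterations returned immediately from the wait. */
static int
count_busy_iterations(FakeLatch *l, int iterations, bool reset_at_top)
{
    int busy = 0, i;

    for (i = 0; i < iterations; i++)
    {
        if (reset_at_top)
            fake_reset_latch(l);    /* clear any already-pending wakeup */
        if (!fake_wait_latch(l))
            busy++;                 /* immediate return: busy-looping */
    }
    return busy;
}
```

With the reset at the top, a single stale set-latch costs nothing; without it, every iteration spins, which is the %sys burn Fujii observed.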
Re: [HACKERS] Incorrect behaviour when using a GiST index on points
I believe that the attached version of the patch can be backpatched. It fixes this problem without altering the index building procedure. It just makes the checks in internal pages soft enough to compensate for the effect of the gist_box_same implementation. -- With best regards, Alexander Korotkov.

*** a/src/backend/access/gist/gistproc.c
--- b/src/backend/access/gist/gistproc.c
***************
*** 70,76 **** gist_box_consistent(PG_FUNCTION_ARGS)
  	if (DatumGetBoxP(entry->key) == NULL || query == NULL)
  		PG_RETURN_BOOL(FALSE);
! 
  	/*
  	 * if entry is not leaf, use rtree_internal_consistent, else use
  	 * gist_box_leaf_consistent
--- 70,76 ----
  	if (DatumGetBoxP(entry->key) == NULL || query == NULL)
  		PG_RETURN_BOOL(FALSE);
! 
  	/*
  	 * if entry is not leaf, use rtree_internal_consistent, else use
  	 * gist_box_leaf_consistent
***************
*** 80,88 **** gist_box_consistent(PG_FUNCTION_ARGS)
  								query, strategy));
  	else
! 		PG_RETURN_BOOL(rtree_internal_consistent(DatumGetBoxP(entry->key),
  												 query, strategy));
  }
  
  static void
--- 80,102 ----
  								query, strategy));
  	else
! 	{
! 		/*
! 		 * Box in internal page can be narrower than box in leaf page not
! 		 * more than EPSILON in each boundary. Do corresponding correction.
! 		 */
! 		BOX			key, *entrykey;
! 
! 		entrykey = DatumGetBoxP(entry->key);
! 		key.low.x = entrykey->low.x - EPSILON;
! 		key.low.y = entrykey->low.y - EPSILON;
! 		key.high.x = entrykey->high.x + EPSILON;
! 		key.high.y = entrykey->high.y + EPSILON;
! 
! 		PG_RETURN_BOOL(rtree_internal_consistent(&key, query, strategy));
+ 	}
  }
  
  static void
***************
*** 847,852 **** gist_box_same(PG_FUNCTION_ARGS)
--- 861,871 ----
  	BOX		   *b2 = PG_GETARG_BOX_P(1);
  	bool	   *result = (bool *) PG_GETARG_POINTER(2);
  
+ 	/*
+ 	 * box_same function allow difference between boxes limited by EPSILON.
+ 	 * Thus box in internal page can be narrower than box in leaf page not
+ 	 * more than EPSILON in each boundary.
+ 	 */
  	if (b1 && b2)
  		*result = DatumGetBool(DirectFunctionCall2(box_same,
  												   PointerGetDatum(b1),
***************
*** 1072,1077 **** gist_poly_consistent(PG_FUNCTION_ARGS)
--- 1091,1097 ----
  	GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
  	POLYGON    *query = PG_GETARG_POLYGON_P(1);
  	StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
+ 	BOX			key, *entrykey;
  	/* Oid subtype = PG_GETARG_OID(3); */
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);
***************
*** 1083,1094 **** gist_poly_consistent(PG_FUNCTION_ARGS)
  	if (DatumGetBoxP(entry->key) == NULL || query == NULL)
  		PG_RETURN_BOOL(FALSE);
  
  	/*
  	 * Since the operators require recheck anyway, we can just use
  	 * rtree_internal_consistent even at leaf nodes. (This works in part
  	 * because the index entries are bounding boxes not polygons.)
  	 */
! 	result = rtree_internal_consistent(DatumGetBoxP(entry->key),
  									   &(query->boundbox), strategy);
  
  	/* Avoid memory leak if supplied poly is toasted */
--- 1103,1128 ----
  	if (DatumGetBoxP(entry->key) == NULL || query == NULL)
  		PG_RETURN_BOOL(FALSE);
  
+ 	entrykey = DatumGetBoxP(entry->key);
+ 	if (!GIST_LEAF(entry))
+ 	{
+ 		/*
+ 		 * Box in internal page can be narrower than box in leaf page not
+ 		 * more than EPSILON in each boundary. Do corresponding correction.
+ 		 */
+ 		key.low.x = entrykey->low.x - EPSILON;
+ 		key.low.y = entrykey->low.y - EPSILON;
+ 		key.high.x = entrykey->high.x + EPSILON;
+ 		key.high.y = entrykey->high.y + EPSILON;
+ 		entrykey = &key;
+ 	}
+ 
  	/*
  	 * Since the operators require recheck anyway, we can just use
  	 * rtree_internal_consistent even at leaf nodes. (This works in part
  	 * because the index entries are bounding boxes not polygons.)
  	 */
! 	result = rtree_internal_consistent(entrykey, &(query->boundbox), strategy);
  
  	/* Avoid memory leak if supplied poly is toasted */
***************
*** 1152,1158 **** gist_circle_consistent(PG_FUNCTION_ARGS)
  	/* Oid subtype = PG_GETARG_OID(3); */
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);
! 	BOX			bbox;
  	bool		result;
  
  	/* All cases served by this function are inexact */
--- 1186,1192 ----
  	/* Oid subtype = PG_GETARG_OID(3); */
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);
! 	BOX			bbox, *entrykey, key;
  	bool		result;
  
  	/* All cases served by this function are inexact */
***************
*** 1170,1177 **** gist_circle_consistent(PG_FUNCTION_ARGS)
  	bbox.high.x = query->center.x + query->radius;
  	bbox.low.x = query->center.x - query->radius;
  	bbox.high.y = query->center.y + query->radius;
  	bbox.low.y = query->center.y - query->radius;
! 	result = rtree_internal_consistent(DatumGetBoxP(entry->key), &bbox, strategy);
  
  	PG_RETURN_BOOL(result);
--- 1204,1225 ----
  	bbox.high.x = query->center.x + query->radius;
  	bbox.low.x = query->center.x - query->radius;
  	bbox.high.y = query->center.y + query->radius;
  	bbox.low.y =
Re: [HACKERS] Bug: walsender and high CPU usage
On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 09.03.2012 13:40, Fujii Masao wrote: I found the bug which causes walsender to enter into busy loop when replication connection is terminated. Walsender consumes lots of CPU resource (%sys), and this situation lasts until it has detected the termination of replication connection and exited. The cause of this bug is that the walsender loop doesn't call ResetLatch at all in the above case. Since the latch remains set, the walsender loop cannot sleep on the latch, i.e., WaitLatch always returns immediately. We can fix this bug by adding ResetLatch into the top of the walsender loop. Patch attached. This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has already been fixed by the commit (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit refactors and refines the walsender loop logic in addition to adding ResetLatch. So I'm tempted to backport this commit (except the deletion of wal_sender_delay) to 9.1 rather than applying the attached patch. OTOH, attached patch is quite simple, and its impact on 9.1 would be very small, so it's easy to backport that. Thought?

This patch makes the code that follows bogus:

	/*
	 * If we don't have any pending data in the output buffer, try to send
	 * some more.
	 */
	if (!pq_is_send_pending())
	{
		XLogSend(output_message, &caughtup);

		/*
		 * Even if we wrote all the WAL that was available when we started
		 * sending, more might have arrived while we were sending this
		 * batch. We had the latch set while sending, so we have not
		 * received any signals from that time. Let's arm the latch again,
		 * and after that check that we're still up-to-date.
		 */
		if (caughtup && !pq_is_send_pending())
		{
			ResetLatch(&MyWalSnd->latch);
			XLogSend(output_message, &caughtup);
		}
	}

The comment is no longer valid, and the calls to ResetLatch and XLogSend are no longer necessary, once you have the ResetLatch() call at the top of the loop.

Right.
I also think we should backport commit cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of wal_sender_delay. Agreed. The attached patch is the same as cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of wal_sender_delay. Could you apply this? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Bug: walsender and high CPU usage
On Mon, Mar 12, 2012 at 10:27 PM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 09.03.2012 13:40, Fujii Masao wrote: I found the bug which causes walsender to enter into busy loop when replication connection is terminated. Walsender consumes lots of CPU resource (%sys), and this situation lasts until it has detected the termination of replication connection and exited. The cause of this bug is that the walsender loop doesn't call ResetLatch at all in the above case. Since the latch remains set, the walsender loop cannot sleep on the latch, i.e., WaitLatch always returns immediately. We can fix this bug by adding ResetLatch into the top of the walsender loop. Patch attached. This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has already been fixed by the commit (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit refactors and refines the walsender loop logic in addition to adding ResetLatch. So I'm tempted to backport this commit (except the deletion of wal_sender_delay) to 9.1 rather than applying the attached patch. OTOH, attached patch is quite simple, and its impact on 9.1 would be very small, so it's easy to backport that. Thought?

This patch makes the code that follows bogus:

	/*
	 * If we don't have any pending data in the output buffer, try to send
	 * some more.
	 */
	if (!pq_is_send_pending())
	{
		XLogSend(output_message, &caughtup);

		/*
		 * Even if we wrote all the WAL that was available when we started
		 * sending, more might have arrived while we were sending this
		 * batch. We had the latch set while sending, so we have not
		 * received any signals from that time. Let's arm the latch again,
		 * and after that check that we're still up-to-date.
		 */
		if (caughtup && !pq_is_send_pending())
		{
			ResetLatch(&MyWalSnd->latch);
			XLogSend(output_message, &caughtup);
		}
	}

The comment is no longer valid, and the calls to ResetLatch and XLogSend are no longer necessary, once you have the ResetLatch() call at the top of the loop. Right. I also think we should backport commit cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of wal_sender_delay. Agreed. The attached patch is the same as cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of wal_sender_delay. Could you apply this? Oh, I forgot to attach the patch. Patch attached, really. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center

*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 476,482 **** ProcessRepliesIfAny(void)
  {
  	unsigned char firstchar;
  	int			r;
! 	int			received = false;
  
  	for (;;)
  	{
--- 476,482 ----
  {
  	unsigned char firstchar;
  	int			r;
! 	bool		received = false;
  
  	for (;;)
  	{
***************
*** 700,705 **** WalSndLoop(void)
--- 700,708 ----
  	/* Loop forever, unless we get an error */
  	for (;;)
  	{
+ 		/* Clear any already-pending wakeups */
+ 		ResetLatch(&MyWalSnd->latch);
+ 
  		/*
  		 * Emergency bailout if postmaster has died. This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
***************
*** 718,777 **** WalSndLoop(void)
  
  		/* Normal exit from the walsender is here */
  		if (walsender_shutdown_requested)
  		{
! 			/* Inform the standby that XLOG streaming was done */
  			pq_puttextmessage('C', "COPY 0");
  			pq_flush();
  			proc_exit(0);
  		}
  
  		/*
  		 * If we don't have any pending data in the output buffer, try to send
! 		 * some more.
  		 */
  		if (!pq_is_send_pending())
- 		{
  			XLogSend(output_message, &caughtup);
  
  			/*
! 			 * Even if we wrote all the WAL that was available when we started
! 			 * sending, more might have arrived while we were sending this
! 			 * batch. We had the latch set while sending, so we have not
! 			 * received any signals from that time. Let's arm the latch again,
! 			 * and after that check that we're still up-to-date.
  			 */
! 			if (caughtup && !pq_is_send_pending())
  			{
! 				ResetLatch(&MyWalSnd->latch);
  				XLogSend(output_message, &caughtup);
  			}
  		}
- 
- 		/* Flush pending output to the client */
- 		if (pq_flush_if_writable() != 0)
- 			break;
- 
  		/*
! 		 * When SIGUSR2 arrives, we send any outstanding logs up to the
! 		 * shutdown checkpoint record (i.e., the latest record) and exit.
  		 */
! 		if (walsender_ready_to_stop && !pq_is_send_pending())
! 		{
! 			XLogSend(output_message,
[HACKERS] elegant and effective way for running jobs inside a database
Dear Developers, I am looking for an elegant and effective way of running jobs inside a database or cluster - for now I cannot find such a solution. OK, if you say "use cron or pgAgent", I say I know those solutions, but they are not effective and elegant. Compilation of pgAgent is a pain (especially the wxWidgets usage on a system with no X) - it can run jobs with a minimum period of 60s, but what about when someone needs to run a job faster, e.g. with a 5s period? Of course using cron I can do that, but it is not an effective and elegant solution. Why can PostgreSQL not have as elegant a solution as the Oracle database? I have been working with Oracle databases for many years, but I like the PostgreSQL project much more - except this one thing I cannot understand: the lack of jobs inside the database... Best regards, Artur
[HACKERS] query planner does not canonicalize infix operators
I created an index on an hstore function, fetchval(hstore, text), however when I use the -> infix operator, which resolves to the very same function, this index is not used. It should be used. I have included an example:

Table with hstore index:

de10keipt01939=> \d log_data
             Table "public.log_data"
 Column |           Type           |                       Modifiers
--------+--------------------------+-------------------------------------------------------
 id     | bigint                   | not null default nextval('log_data_id_seq'::regclass)
 time   | timestamp with time zone |
 data   | hstore                   |
Indexes:
    "index_log_data_by_time" btree ("time")
    "index_participant_id" btree (fetchval(data, 'participant_id'::text))

query with infix notation:

de10keipt01939=> explain ANALYZE select * from log_data where (data->'participant_id')='2851' order by id desc;
 Sort  (cost=16432.56..16433.36 rows=1583 width=315) (actual time=198.643..198.777 rows=183 loops=1)
   Sort Key: id
   Sort Method: quicksort  Memory: 119kB
   ->  Seq Scan on log_data  (cost=0.00..16415.74 rows=1583 width=315) (actual time=6.926..198.297 rows=183 loops=1)
         Filter: ((data -> 'participant_id'::text) = '2851'::text)
 Total runtime: 198.922 ms
(6 rows)

query with function notation:

de10keipt01939=> explain ANALYZE select * from log_data where fetchval(data,'participant_id')='2851' order by id desc;
 Sort  (cost=341.14..341.23 rows=179 width=315) (actual time=0.724..0.841 rows=183 loops=1)
   Sort Key: id
   Sort Method: quicksort  Memory: 119kB
   ->  Bitmap Heap Scan on log_data  (cost=2.35..339.80 rows=179 width=315) (actual time=0.091..0.489 rows=183 loops=1)
         Recheck Cond: (fetchval(data, 'participant_id'::text) = '2851'::text)
         ->  Bitmap Index Scan on index_participant_id  (cost=0.00..2.34 rows=179 width=0) (actual time=0.060..0.060 rows=183 loops=1)
               Index Cond: (fetchval(data, 'participant_id'::text) = '2851'::text)
 Total runtime: 1.010 ms
(8 rows)

—Will
[HACKERS] Requesting Ideas for project proposal.
I am looking for some project suggestions to make proposals for an OS course at Columbia University (4-6 weeks, single person), working with Scribe (kernel-based application record and replay), focusing on DBMS (I would like to focus on Postgres). I have some ideas along the lines of exploring its usage for debugging and profiling scenarios, but wanted to ask if someone can propose a better suggestion? Thanks, -Ali
Re: [HACKERS] Requesting Ideas for project proposal.
Sorry, here is the link to the Scribe project in reference; it's a kernel-based record and replay system. https://github.com/nviennot/linux-2.6-scribe On Wed, Mar 7, 2012 at 5:25 PM, Ali Ahmed ahmal...@gmail.com wrote: I am looking for some project suggestions to make proposals for an OS course at Columbia University (4-6 weeks, single person), working with Scribe (kernel-based application record and replay), focusing on DBMS (I would like to focus on Postgres). I have some ideas along the lines of exploring its usage for debugging and profiling scenarios, but wanted to ask if someone can propose a better suggestion? Thanks, -Ali -- -Ali
Re: [HACKERS] wal_buffers, redux
On Sun, Mar 11, 2012 at 11:51 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Mar 11, 2012 at 12:55 PM, Robert Haas robertmh...@gmail.com wrote: I've finally been able to run some more tests of the effect of adjusting wal_buffers to values higher than 16MB. I ran the test on the 16 core (x 4 hw threads/core) IBM POWER7 machine, with my usual configuration settings: shared_buffers = 8GB maintenance_work_mem = 1GB synchronous_commit = off checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.9 wal_writer_delay = 20ms I ran three 30-minute tests at scale factor 300 with wal_buffers set at various values from 16MB up to 160MB, in multiples of 16MB, using pgbench with 32 clients and 32 threads in each case. The short version is that 32MB seems to be significantly better than 16MB, by about 1000 tps, and after that it gets murky; full results are below. Currently the max of wal_buffers is 16MB (i.e., the size of one WAL file) when it's set to -1. Thanks to your result, we should increase the max to 32MB? I think that might be a good idea, although I'm not entirely convinced that we understand why increasing wal_buffers is helping as much as it is. I stuck an elog() into AdvanceXLInsertBuffer() to complain in the case that we were writing buffers while holding the insert lock. Then, I reran 30-minute tests 32 clients, one with wal_buffers=16MB and the other wal_buffers=32MB. On the 16MB test, the elog() fired 15 times in a single second shortly after the start of the test, and then 9 more times over the rest of the test. On the 32MB test, the elog() fired a total 6 times over the course of the test. The first test got 14320 tps, while the second got 15026 tps. I find that quite surprising, because although WAL buffer wraparound is certainly bad (on this test, it probably brings the system completely to a halt until fsync() finishes) it really shouldn't lock up the system for multiple seconds at a time. 
And yet that's what it would need to be doing to account for the tps discrepancy on this test, considering how rarely it occurs. Studying AdvanceXLInsertBuffer() a bit more, I'm wondering if the problem isn't so much - or isn't only - that it's expensive to write buffers while also holding WALInsertLock. Maybe it's too expensive even to acquire WALWriteLock in the first place - that is, the real problem isn't so much the wraparound condition itself, but the expense of testing whether a possible wraparound has actually occurred. A quick test suggests that we acquire WALWriteLock here much more often than we actually write anything while holding it, and that we get a big burst of WALWriteLock acquisitions here immediately after a checkpoint. I don't have any proof that this is what's causing the tps drop with smaller wal_buffers, but I think there has to be something other than an actual wraparound condition causing problems here, because that just doesn't seem to happen frequently enough to be an issue. Anyway, maybe none of that matters at the moment. Perhaps it's enough to know that wal_buffers > 16MB can help, and just bump up the maximum auto-tuned value a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
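The wraparound condition under discussion can be reduced to a toy model (invented names; the real check inside AdvanceXLInsertBuffer is considerably more involved): with a circular buffer of n_buffers WAL pages, inserting into a new page collides with a not-yet-written page whenever the insert position runs a full buffer's length ahead of the write position. Adding pages (larger wal_buffers) makes the condition rarer, which is consistent with the tps improvement above even though it doesn't explain its full size.

```c
#include <stdbool.h>

/* Toy model of WAL-buffer wraparound: page numbers advance monotonically;
 * each maps to slot (page % n_buffers) in the circular buffer. Inserting
 * into `next_page` would overwrite an unwritten page when the insert
 * position is at least a whole buffer ahead of the write position. */
static bool
would_wrap(long next_page, long written_up_to, int n_buffers)
{
    return next_page - written_up_to >= n_buffers;
}
```

In this model, doubling n_buffers halves nothing by itself; it just widens the window the WAL writer has to catch up before an inserter is forced to write (and fsync) WAL while holding WALInsertLock.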
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
2012/3/9 Robert Haas robertmh...@gmail.com: On Tue, Mar 6, 2012 at 9:14 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: [ new patch ] Are we absolutely certain that we want the semantics of sepgsql_setcon() to be transactional? Because if we made them non-transactional, this would be a whole lot simpler, and it would still meet the originally proposed use case, which is to allow the security context of a connection to be changed on a one-time basis before handing it off to a client application. I hesitate to implement sepgsql_setcon() as not-transaction-aware, because all the other functionality is transaction aware, such as the SECURITY LABEL statement. Even though the original use-case from Joshua does not require it being transaction-aware, I'd like to keep consistency of behavior with the other features. Is it really unacceptable complexity? It seems to me that it would make more sense to view the set of security labels in effect as a stack. When we enter a trusted procedure, it pushes a new label on the stack; when we exit a trusted procedure, it pops the top label off the stack. sepgsql_setcon() changes the top label on the stack. If we want to retain transactional semantics, then you can also have a toplevel label that doesn't participate in the stack; a commit copies the sole item on the stack into the toplevel label, and transaction start copies the toplevel label into an empty stack. In the current coding, I think client_label_peer is redundant with client_label_committed - once the latter is set, the former is unused, IIUC - and what I'm proposing is that client_label_func shouldn't be separate, but rather should mutate the stack of labels maintained by client_label_pending. I almost agree with your opinion. The semantics of stacking security labels would be more straightforward. One reason for the redundant client_label_peer is to support resetting the client label using NULL input to sepgsql_setcon(). We need to save the original value somewhere.
Indeed, in the case where sepgsql_setcon() is invoked inside a trusted procedure, the existing behavior is confusing, as you pointed out. If we changed the semantics to use a stack structure, an invocation of a trusted procedure would take a label transition from A to B; it could then invoke sepgsql_setcon() to take a label transition from B to C, and the label B would be popped from the stack at the end of the trusted procedure. Eventually, the label C would be applied on top of the stack. Do we share the same image of the whole idea? The docs need updating, both to reflect the version bump in sepgsql-regtest.te and to describe the actual feature. OK, please wait for a few days. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
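The stack semantics being debated in these messages (a label pushed on trusted-procedure entry and popped on exit, sepgsql_setcon() changing the top entry, plus a transactional toplevel label) can be modeled in a few lines. This is a hypothetical illustration of the proposal, not sepgsql code; the class and method names are invented for this sketch:

```python
# Toy model of the proposed security-label stack.  All names are
# invented; only the semantics described in the thread are modeled.

class LabelState:
    def __init__(self, initial_label):
        self.toplevel = initial_label        # committed, transaction-durable label
        self.stack = [initial_label]         # active labels; top entry is current

    def current(self):                       # what sepgsql_get_client_label() reports
        return self.stack[-1]

    def enter_trusted_proc(self, tp_label):  # trusted-procedure entry pushes its label
        self.stack.append(tp_label)

    def exit_trusted_proc(self):             # trusted-procedure exit pops it again
        self.stack.pop()

    def setcon(self, new_label):             # sepgsql_setcon() replaces the top entry
        self.stack[-1] = new_label

    def commit(self):                        # commit copies the sole stack item up
        assert len(self.stack) == 1
        self.toplevel = self.stack[0]

    def start_transaction(self):             # transaction start reloads the stack
        self.stack = [self.toplevel]

s = LabelState("connection_pooler_t")
s.setcon("foo_t")              # top-level label change...
s.commit()                     # ...survives commit
s.start_transaction()
s.enter_trusted_proc("tp_t")
s.setcon("bar_t")              # change made inside the trusted procedure
assert s.current() == "bar_t"
s.exit_trusted_proc()
assert s.current() == "foo_t"  # TP-scoped change ends with the TP, like a GUC
```

Under these replace-the-top semantics, a label change made inside a trusted procedure is scoped to that procedure, which is the behavior Robert argues is consistent with how GUCs behave.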
Re: [HACKERS] query planner does not canonicalize infix operators
Will Leinweber w...@heroku.com writes: I created an index on an hstore function, fetchval(hstore, text), however when I use the -> infix operator, which resolves to the very same function, this index is not used. It should be used. Don't hold your breath. Create an index on the expression you intend to use, not random respellings of it. regards, tom lane 
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
2012/3/11 Yeb Havinga yebhavi...@gmail.com: On 2012-03-10 10:39, I wrote: I can probably write some docs tomorrow. Attached is v5 of the patch, which is exactly equal to v4 but with added documentation. Thanks for your dedicated volunteer work. I'm checking the updates to the documentation. Some other notes. - Robert asked why sepgsql_setcon with NULL to reset the value to the initial client label was supported. Maybe this could be a leftover from the initial implementation as a GUC variable? There is a practical reason. In the case where httpd opens the connection to PG and sets a suitable security label according to the given credential prior to launching the user application, then keeps this connection for upcoming requests, it is worthwhile to be able to reset the security label of the client. - earlier I suggested preventing setting a new client label from a trusted procedure, however I just read in the original post that this was how the current use case of Joshua is set up. Suggestion withdrawn. Joshua's security policy does not allow httpd to issue SQL commands except for the trusted procedure that calls sepgsql_setcon() according to the given credential. (Thus, we have no way to set an arbitrary security label without credentials.) The security label being switched to is allowed to issue SQL commands with restricted privileges, and is also allowed to transition back into httpd's domain. If we did not support sepgsql_setcon() in a trusted procedure, we would have to allow httpd to transition to an arbitrary label, and it would also become impossible to keep the connection, because the client label should not be reverted to httpd's (that would mean restricted users could switch to anyone at all!) Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp 
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
On Sat, Mar 10, 2012 at 4:35 PM, Yeb Havinga yebhavi...@gmail.com wrote: The semantics are muddled because the client labels are mixed with labels from trusted procedures. The stack you described upthread may also exhibit surprising behaviour. If a trusted procedure is called, its label is pushed onto the stack. What I was proposing is that it would *replace* the label on top of the stack. Suppose it pushes another label by calling sepgsql_setcon(). In the stack case, this is now the top of the stack and the result of sepgsql_get_client_label(). The procedure exits. What should now be the result of sepgsql_get_client_label()? Since labels are managed by a stack, we can only pop what's on top and need to pop twice, The above avoids the need to pop twice. so we've lost the label pushed onto the stack by the trusted function, which means that trusted procedures cannot be used to change client labels beyond their own scope, but other functions would. That's true, but I'm not convinced it's bad. I mean, if you instruct the system to change security labels for the duration of a trusted procedure, then it shouldn't be surprising that you end up with the same security label after it exits that you had before entering it. At the very least, it has the virtue of being consistent with other things, like how GUCs behave. The current behavior is that you change the context and you don't see the results of your own context change, which seems far worse. Maybe this semantics muddling of trusted procedures and setting the client label can be avoided by denying changing the client label with sepgsql_setcon() from a trusted procedure, which would also make some sense: ERROR: cannot override the client label of a trusted procedure during execution. That also seems possibly reasonable. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
On Mon, Mar 12, 2012 at 10:58 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: It is a practical reason. In case when httpd open the connection to PG and set a suitable security label according to the given credential prior to launch of user application, then keep this connection for upcoming request, it is worthwhile to reset security label of the client. But wait a minute - how is that any good? That allows the client to pretty trivially circumvent the security restriction that we were trying to impose by doing sepgsql_setcon() in the first place. It doesn't matter how convenient it is if it's flagrantly insecure. Am I missing something here? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 
Re: [HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings
Excerpts from Noah Misch's message of Thu Mar 08 12:11:37 -0300 2012: On Wed, Mar 07, 2012 at 04:57:12PM -0500, Robert Haas wrote: As a side note, the documentation for PQexec() is misleading about what will happen if COPY is present in a multi-command string. It says: Note however that the returned PGresult structure describes only the result of the last command executed from the string. Should one of the commands fail, processing of the string stops with it and the returned PGresult describes the error condition. It does not explain that it also stops if it hits a COPY. I had to read the source code for libpq to understand why this psql logic was coded the way it is. Agreed; I went through a similar process. Awhile after reading the code, I found the behavior documented in section Functions Associated with the COPY Command: If a COPY command is issued via PQexec in a string that could contain additional commands, the application must continue fetching results via PQgetResult after completing the COPY sequence. Only when PQgetResult returns NULL is it certain that the PQexec command string is done and it is safe to issue more commands. I'm not quite sure what revision would help most here -- a cross reference, moving that content, duplicating that content. Offhand, I'm inclined to move it to the PQexec() documentation with some kind of reference back from its original location. Thoughts? I would vote for moving it and adding a reference in the COPY functions section. That way, the PQexec doc is complete by itself without having to duplicate anything. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support 
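The libpq behavior under discussion can be illustrated with a toy model: PQexec() on a multi-command string stops when it reaches a COPY result, and the application must then keep calling PQgetResult() until it returns NULL before the command string is truly finished. The mock below sketches that protocol logic only; the class and names are invented for illustration and are not libpq's actual API:

```python
# Toy model of the documented libpq behavior for a multi-command string
# containing COPY.  Result-status strings mirror libpq's constants.

TUPLES_OK, COPY_IN, COMMAND_OK = "PGRES_TUPLES_OK", "PGRES_COPY_IN", "PGRES_COMMAND_OK"

class MockConn:
    def __init__(self, results):
        self._pending = list(results)

    def get_result(self):               # models PQgetResult(): None when the string is done
        return self._pending.pop(0) if self._pending else None

    def exec_(self, command_string):    # models PQexec() on a multi-command string
        last = None
        while True:
            r = self.get_result()
            if r is None:
                return last             # normally: the last command's result
            last = r
            if r == COPY_IN:            # ...but PQexec stops here and hands control back
                return last

conn = MockConn([TUPLES_OK, COPY_IN, COMMAND_OK])
r = conn.exec_("SELECT 1; COPY t FROM stdin; CREATE TABLE u ();")
assert r == COPY_IN                     # not the last command's result!

# After sending the COPY data, the application must drain the remaining
# results until None, exactly as the COPY-functions section says:
drained = []
while True:
    r = conn.get_result()
    if r is None:
        break
    drained.append(r)
assert drained == [COMMAND_OK]          # only now is the command string done
```

This is why the psql logic referenced above has to special-case COPY rather than relying on PQexec() alone.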
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
2012/3/12 Robert Haas robertmh...@gmail.com: On Mon, Mar 12, 2012 at 10:58 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: It is a practical reason. In case when httpd open the connection to PG and set a suitable security label according to the given credential prior to launch of user application, then keep this connection for upcoming request, it is worthwhile to reset security label of the client. But wait a minute - how is that any good? That allows the client to pretty trivially circumvent the security restriction that we were trying to impose by doing sepgsql_setcon() in the first place. It doesn't matter how convenient it is if it's flagrantly insecure. Am I missing something here? There is a practical reason. If we did not support the reset feature, the application would have to know its own security label, as the target label to revert to. However, I'm not certain about the status of the script-language bindings of the libselinux feature to obtain one's own label, although it is supported on Perl, Ruby and PHP (with an extension by myself) at least. It seems to me a reasonable cost to track the original label, to eliminate a restriction on the application side when it tries to revert a security label once switched. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp 
[HACKERS] Partitioning triggers doc patch
Attached is a doc patch based on a suggestion by Rural Hunter to address something he found confusing while setting up partitioning. Original thread is here: http://archives.postgresql.org/pgsql-admin/2012-03/msg00066.php -Kevin

*** a/doc/src/sgml/ddl.sgml
--- b/doc/src/sgml/ddl.sgml
***************
*** 2411,2416 ****
--- 2411,2420 ----
  VALUES ('New York', NULL, NULL, 'NY');

       constraints on this table, unless you intend them to be applied
       equally to all partitions.  There is no point in defining any
       indexes or unique constraints on it, either.
+      Because triggers on this table will not be fired for operations
+      on the related partitions, there is usually no reason to create
+      triggers on it other than the optional trigger described below
+      to redirect operations to the appropriate partitions.
      </para>
     </listitem>
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
On Mon, Mar 12, 2012 at 11:13 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: 2012/3/12 Robert Haas robertmh...@gmail.com: On Mon, Mar 12, 2012 at 10:58 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: It is a practical reason. In case when httpd open the connection to PG and set a suitable security label according to the given credential prior to launch of user application, then keep this connection for upcoming request, it is worthwhile to reset security label of the client. But wait a minute - how is that any good? That allows the client to pretty trivially circumvent the security restriction that we were trying to impose by doing sepgsql_setcon() in the first place. It doesn't matter how convenient it is if it's flagrantly insecure. Am I missing something here? It is a practical reason. If we would not support the reset feature, the application has to know the security label of itself, as a target label to be reverted. However, I'm not certain the status of script- language binding of libselinux feature to obtain the self label, although it is supported on Perl, Ruby and PHP (with extension by myself) at least. You're still missing my point. The issue isn't the particular choice of mechanism for reverting to the original security label; it's the fact that such a thing would be possible at all. Suppose that the connection starts out in context connection_pooler_t. Based on the identity of the user, we transition to foo_t, bar_t, or baz_t. If it's possible, by any method, for one of those contexts to get back to connection_pooler_t, then we've got a problem. We give a connection to user foo which is in foo_t; he transitions it back to connection_pooler_t, then to bar_t, and usurps user bar's privileges. Unless there's some way to prevent that, the only way to make this secure is to make the transition to foo_t irreversible. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
2012/3/12 Robert Haas robertmh...@gmail.com: On Mon, Mar 12, 2012 at 11:13 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: 2012/3/12 Robert Haas robertmh...@gmail.com: On Mon, Mar 12, 2012 at 10:58 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: It is a practical reason. In case when httpd open the connection to PG and set a suitable security label according to the given credential prior to launch of user application, then keep this connection for upcoming request, it is worthwhile to reset security label of the client. But wait a minute - how is that any good? That allows the client to pretty trivially circumvent the security restriction that we were trying to impose by doing sepgsql_setcon() in the first place. It doesn't matter how convenient it is if it's flagrantly insecure. Am I missing something here? It is a practical reason. If we would not support the reset feature, the application has to know the security label of itself, as a target label to be reverted. However, I'm not certain the status of script- language binding of libselinux feature to obtain the self label, although it is supported on Perl, Ruby and PHP (with extension by myself) at least. You're still missing my point. The issue isn't the particular choice of mechanism for reverting to the original security label; it's the fact that such a thing would be possible at all. Suppose that the connection starts out in context connection_pooler_t. Based on the identity of the user, we transition to foo_t, bar_t, or baz_t. If it's possible, by any method, for one of those contexts to get back to connection_pooler_t, then we've got a problem. We give a connection to user foo which is in foo_t; he transitions it back to connection_pooler_t, then to bar_t, and usurps user bar's privileges. Unless there's some way to prevent that, the only way to make this secure is to make the transition to foo_t irreversible. It is the reason why I advocate the idea to allow sepgsql_setcon() inside of trusted-procedures. 
The original use case of Joshua does not allow connection_pooler_t to execute any SQL commands except for invocation of a particular trusted procedure; that procedure takes a secret credential as an argument, then switches the client label to foo_t, bar_t or baz_t according to the supplied credential. These labels are allowed to switch back to the original connection_pooler_t, but it is impossible to switch to an arbitrary label without a suitable credential. Please also note that the reset of the security label is handled as a switch from the current label to the original one, and undergoes a permission check in the normal manner. So, in an environment without connection pooling, it is an option to prevent resetting the client label to the original one, since the original label is allowed to switch to arbitrary labels. Do we still have a problematic scenario here? Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp 
Re: [HACKERS] wal_buffers, redux
On Sat, Mar 10, 2012 at 7:55 PM, Robert Haas robertmh...@gmail.com wrote: I've finally been able to run some more tests of the effect of adjusting wal_buffers to values higher than 16MB. I ran the test on the 16 core (x 4 hw threads/core) IBM POWER7 machine, with my usual configuration settings:

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms

I ran three 30-minute tests at scale factor 300 with wal_buffers set at various values from 16MB up to 160MB, in multiples of 16MB, using pgbench with 32 clients and 32 threads in each case. The short version is that 32MB seems to be significantly better than 16MB, by about 1000 tps, and after that it gets murky; full results are below. On Nate Boley's machine, the difference was ~100% increase rather than ~10%. Do you think the difference is in the CPU architecture, or the IO subsystem? Also, do you have the latency numbers? Cheers, Jeff 
Re: [HACKERS] [COMMITTERS] pgfiledump - pg_filedump: Updates for latest Postgres 9.2 sources.
Excerpts from tgl's message of Mon Mar 12 12:08:33 -0400 2012: Log Message: --- Updates for latest Postgres 9.2 sources. We no longer need pg_crc.c, and hence not a source tree, yay. FWIW I had some fun last week compiling this to 9.1 sources because I was using a VPATH build. I had to add another -I to LDFLAGS or something like that. Also, what do you think about adding the ability to dump pg_filenode.map files? Do you think it belongs in pg_filedump, or should we look at doing that elsewhere? (Hmm, I just realized we might also conceivably want to dump VM and FSM forks, too) -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support 
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Fri, Mar 9, 2012 at 2:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Thanks! BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm planning to do a longer run with his test script - I only run it for about 1000 iterations - to see if I can reproduce the PANIC with both the earlier patch version he tested, and this new one. Hi Heikki, I've run the v12 patch for 17,489 rounds of crash and recovery, and detected no assertion failures or other problems. Cheers, Jeff 
Re: [HACKERS] query planner does not canonicalize infix operators
On Mon, Mar 12, 2012 at 7:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: Will Leinweber w...@heroku.com writes: I created an index on an hstore function, fetchval(hstore, text), however when I use the -> infix operator which resolves to the very same function, this index is not used. It should be used. Don't hold your breath. Create an index on the expression you intend to use, not random respellings of it. Is this saying there's no need for that, or no one is working on it (and I certainly don't intend to), or definitely not in the next version, or something else entirely? I'm in disagreement with the former, and in total understanding of the latter two. -- fdr 
[HACKERS] pg_filedump improvements
Alvaro Herrera alvhe...@commandprompt.com writes: Also, what do you think about adding the ability to dump pg_filenode.map files? Do you think it belongs in pg_filedump, or should we look at doing that elsewhere? Not sure. It does already contain the ability to dump pg_control, but that seems like rather a wart (not to mention redundant with pg_controldata) because you have to explicitly say -c to get it to realize that's what you want. Map files would have to be another special switch I think. (Hmm, I just realized we might also conceivably want to dump VM and FSM forks, too) Yeah, that's been on the to-do list for awhile. There's not a way that pg_filedump could automatically recognize such files, is there? regards, tom lane 
Re: [HACKERS] foreign key locks, 2nd attempt
On Sun, Feb 26, 2012 at 9:47 PM, Robert Haas robertmh...@gmail.com wrote: Regarding performance, the good thing about this patch is that if you have an operation that used to block, it might now not block. So maybe multixact-related operation is a bit slower than before, but if it allows you to continue operating rather than sit waiting until some other transaction releases you, it's much better. That's probably true, although there is some deferred cost that is hard to account for. You might not block immediately, but then later somebody might block either because the mxact SLRU now needs fsyncs or because they've got to decode an mxid long after the relevant segment has been evicted from the SLRU buffers. In general, it's hard to bound that latter cost, because you only avoid blocking once (when the initial update happens) but you might pay the extra cost of decoding the mxid as many times as the row is read, which could be arbitrarily many. How much of a problem that is in practice, I'm not completely sure, but it has worried me before and it still does. In the worst case scenario, a handful of frequently-accessed rows with MXIDs all of whose members are dead except for the UPDATE they contain could result in continual SLRU cache-thrashing. From a performance standpoint, we really need to think not only about the cases where the patch wins, but also, and maybe more importantly, the cases where it loses. There are some cases where the current mechanism, use SHARE locks for foreign keys, is adequate. In particular, it's adequate whenever the parent table is not updated at all, or only very lightly. I believe that those people will pay somewhat more with this patch, and especially in any case where backends end up waiting for fsyncs in order to create new mxids, but also just because I think this patch will have the effect of increasing the space consumed by each individual mxid, which imposes a distributed cost of its own. 
I spent some time thinking about this over the weekend, and I have an observation, and an idea. Here's the observation: I believe that locking a tuple whose xmin is uncommitted is always a noop, because if it's ever possible for a transaction to wait for an XID that is part of its own transaction (exact same XID, or sub-XIDs of the same top XID), then a transaction could deadlock against itself. I believe that this is not possible: if a transaction were to wait for an XID assigned to that same backend, then the lock manager would observe that an ExclusiveLock on the xid is already held, so the request for a ShareLock would be granted immediately. I also don't believe there's any situation in which the existence of an uncommitted tuple fails to block another backend, but a lock on that same uncommitted tuple would have caused another backend to block. If any of that sounds wrong, you can stop reading here (but please tell me why it sounds wrong). If it's right, then here's the idea: what if we stored mxids using xmin rather than xmax? This would mean that, instead of making mxids contain the tuple's original xmax, they'd need to instead contain the tuple's original xmin. This might seem like rearranging the deck chairs on the titanic, but I think it actually works out substantially better, because if we can assume that the xmin is committed, then we only need to know its exact value until it becomes older than RecentGlobalXmin. This means that a tuple can be both updated and locked at the same time without the MultiXact SLRU needing to be crash-safe, because if we crash and restart, any mxids that are still around from before the crash are known to contain only xmins that are now all-visible. We therefore don't need their exact values, so it doesn't matter if that data actually made it to disk. 
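The lookup-avoidance part of this idea can be sketched as a toy model: once the xmin stored in the mxact falls behind RecentGlobalXmin, a hint bit cuts off any further SLRU probes for that tuple. All names below are invented for illustration; this models only the logic described in the paragraph above, not actual PostgreSQL code:

```python
# Toy model: a multixact remembers the tuple's original xmin, and readers
# probe the (mock) SLRU only until that xmin becomes all-visible.

MXACT_STORE = {1: 100}      # mxid 1 recorded original xmin 100 at lock time
slru_lookups = 0            # counts probes, to show when they stop

def slru_fetch_xmin(mxid):
    global slru_lookups
    slru_lookups += 1
    return MXACT_STORE[mxid]

class Tuple:
    def __init__(self, mxid):
        self.mxid = mxid
        self.xmin_all_visible = False      # the proposed hint bit

def read_tuple(tup, recent_global_xmin):
    if tup.xmin_all_visible:
        return "visible"                   # hint set: no SLRU probe needed
    xmin = slru_fetch_xmin(tup.mxid)       # probe while the exact xmin may matter
    if xmin < recent_global_xmin:
        tup.xmin_all_visible = True        # future readers skip the SLRU
        return "visible"
    return "check-xmin"                    # still need the exact value

t = Tuple(mxid=1)
read_tuple(t, recent_global_xmin=50)       # xmin 100 not yet all-visible: probe
read_tuple(t, recent_global_xmin=50)       # still probing on each read
assert slru_lookups == 2
read_tuple(t, recent_global_xmin=200)      # now all-visible: final probe, set hint
read_tuple(t, recent_global_xmin=200)      # hint bit set: no probe
assert slru_lookups == 3
```

The crash-safety claim follows the same line: after a restart, any surviving mxid's xmin is known to be all-visible, so losing the exact stored value does not matter.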
Furthermore, in the case where a previously-locked tuple is read repeatedly, we only need to keep doing SLRU lookups until the xmin becomes all-visible; after that, we can set a hint bit indicating that the tuple's xmin is all-visible, and any future readers (or writers) can use that to skip the SLRU lookup. In the degenerate (and probably common) case where a tuple is already all-visible at the time it's locked, we don't really need to record the original xmin at all; we can still do so if convenient, but we can set the xmin-all-visible hint right away, so nobody needs to probe the SLRU just to get xmin. In other words, we'd entirely avoid needing to make mxacts crash-safe, and we'd avoid most of the extra SLRU lookups that the current implementation requires; they'd only be needed when (and for as long as) the locked tuple was not yet all-visible. This also seems like it would make the anti-wraparound issues simpler to handle - once an mxid is old enough that any xmin it contains must be all-visible, we can simply overwrite the tuple's xmin with FrozenXID, which is pretty much what we're already doing anyway. It's already the case that a
Re: [HACKERS] Partitioning triggers doc patch
On Mon, Mar 12, 2012 at 10:41 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Attached is a doc patch based on a suggestion by Rural Hunter to address something he found confusing while setting up partitioning. Original thread is here: http://archives.postgresql.org/pgsql-admin/2012-03/msg00066.php You can create a trigger that executes before the trigger that redirects to the partition, for example to stamp the time or the user... so it should state something like what is said about constraints: Do not define any check constraints on this table, unless you intend them to be applied equally to all partitions. -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación 
Re: [HACKERS] Partitioning triggers doc patch
Jaime Casanova ja...@2ndquadrant.com wrote: On Mon, Mar 12, 2012 at 10:41 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Attached is a doc patch based on a suggestion by Rural Hunter to address something he found confusing while setting up partitioning. Original thread is here: http://archives.postgresql.org/pgsql-admin/2012-03/msg00066.php you can create trigger that execute before the trigger that redirect to the partition, for example to stamp the time or the user... so it should state something like i said about constraints Do not define any check constraints on this table, unless you intend them to be applied equally to all partitions. Yeah, I thought about that issue, but the problem is that the rule or trigger that routes the operation to another partition might prevent the trigger from firing at all. I figured that if I went as far as what you describe, I would need to get into those timing issues, too; and I wasn't sure whether that might be too much information for that particular spot. But if you think we should go there, I could draft something which pointed out that an AFTER trigger is never useful and that a BEFORE trigger to enforce things globally must sort ahead of the routing trigger, and should not be used at all if a RULE is used to route operations to the partitions. Do we really want to get into all that in the middle of step-by-step instructions for the simple case? -Kevin 
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
On Mon, Mar 12, 2012 at 12:30 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote: Suppose that the connection starts out in context connection_pooler_t. Based on the identity of the user, we transition to foo_t, bar_t, or baz_t. If it's possible, by any method, for one of those contexts to get back to connection_pooler_t, then we've got a problem. We give a connection to user foo which is in foo_t; he transitions it back to connection_pooler_t, then to bar_t, and usurps user bar's privileges. Unless there's some way to prevent that, the only way to make this secure is to make the transition to foo_t irreversible. It is the reason why I advocate the idea to allow sepgsql_setcon() inside of trusted-procedures. The original use-case of Joshua does not allow connection_pooler_t to execute any SQL commands except for invocation of a particular trusted-procedures; that takes a secret credential as an argument, then it switches the client label to foo_t, bar_t or baz_t according to the supplied credential. These labels are allowed to switch back to the original connection_pooler_t, but it is unavailable to switch arbitrary label without suitable credential. Oh, I get it. Given that that's the intended use case, the current design does make sense, but it seems awfully special-purpose. Not knowing that this is what you had in mind, I never would have guessed the reason for all this complexity. I worry that this is too much of a purpose-built mechanism, and that nobody will ever be able to use it for much of anything beyond the extremely specific use case that you've laid out here. I think that, at the very least, the comments and documentation need to make it clear that this is very deliberately intended to modify only the toplevel security context of the session, which may be different from the currently active context if a TP is in use; and also that the change will apply to future transactions only if the current transaction commits. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Mar 12, 2012 at 5:28 PM, Robert Haas robertmh...@gmail.com wrote: In other words, we'd entirely avoid needing to make mxacts crash-safe, and we'd avoid most of the extra SLRU lookups that the current implementation requires; they'd only be needed when (and for as long as) the locked tuple was not yet all-visible. The current implementation only requires additional lookups in the update/check case, which is the case that doesn't do anything other than block right now. Since we're replacing lock contention with physical access contention, even the worst case situation is still better than what we have now. Please feel free to point out worst case situations and show that isn't true. I've also pointed out how to avoid the overhead of making mxacts crash safe when the new facilities are not in use, so I don't see problems with the proposed mechanism, given that I am still myself reviewing the actual code. So those things are not something we need to avoid. My feeling is that overwriting xmin is a clever idea, but arrives too late to require sensible analysis in this stage of the CF. It's not solving a problem, it's just an alternate mechanism and at best an optimisation of the mechanism. Were we to explore it now, it seems certain that another person would observe that design was taking place and so the patch should be rejected, which would be unnecessary and wasteful. I also think it would alter our ability to diagnose problems, not least the normal test that xmax matches xmin across an update. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services 
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 12:32 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Nate Boley's machine, the difference was ~100% increase rather than ~10%. Oh, right. I had forgotten how dramatic the changes were in those test runs. I guess I should be happy that the absolute numbers on this machine were as high as they were. This machine seems to be beating that one on every metric. Do you think the difference is in the CPU architecture, or the IO subsystem? That is an excellent question. I tried looking at vmstat output, but a funny thing kept happening: periodically, the iowait column would show a gigantic negative number instead of a number between 0 and 100. This makes me a little chary of believing any of it. Even if I did, I'm not sure that would fully answer the question. So I guess the short answer is that I don't know, and I'm not even sure how I might go about figuring it out. Any ideas? Also, do you have the latency numbers? Not at the moment, but I'll generate them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partitioning triggers doc patch
On Mon, Mar 12, 2012 at 12:39 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Do we really want to get into all that in the middle of step-by-step instructions for the simple case? Is there any other place where we can add a more detailed explanation? Then, in your text, after saying "usually there is no reason for", you could add a parenthesis saying "more info in link". -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: 24x7 support and training
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Mar 12, 2012 at 1:50 PM, Simon Riggs si...@2ndquadrant.com wrote: On Mon, Mar 12, 2012 at 5:28 PM, Robert Haas robertmh...@gmail.com wrote: In other words, we'd entirely avoid needing to make mxacts crash-safe, and we'd avoid most of the extra SLRU lookups that the current implementation requires; they'd only be needed when (and for as long as) the locked tuple was not yet all-visible. The current implementation only requires additional lookups in the update/check case, which is the case that doesn't do anything other than block right now. Since we're replacing lock contention with physical access contention even the worst case situation is still better than what we have now. Please feel free to point out worst case situations and show that isn't true. I think I already have: http://archives.postgresql.org/pgsql-hackers/2012-02/msg01258.php The case I'm worried about is where we are allocating mxids quickly, and we end up having to wait for fsyncs on mxact segments. That might be very slow, but you could argue that it could *possibly* be still worthwhile if it avoids blocking. That doesn't strike me as a slam-dunk, though, because we've already seen and fixed cases where too many fsyncs causes the performance of the entire system to go down the tubes (cf. commit 7f242d880b5b5d9642675517466d31373961cf98). But it's really bad if there are no updates on the parent table - then, whatever extra overhead there is will be all for naught, since the more fine-grained locking doesn't help anyway. I've also pointed out how to avoid overhead of making mxacts crash safe when the new facilities are not in use, so I don't see problems with the proposed mechanism. Given that I am still myself reviewing the actual code. The closest thing I can find to a proposal from you in that regard is this comment: # I was really thinking we could skip the fsync of a page if we've not # persisted anything important on that page, since that was one of # Robert's performance points. 
It might be possible to do something with that idea, but at the moment I'm not seeing how to make it work. So those things are not something we need to avoid. My feeling is that overwriting xmin is a clever idea, but arrives too late to permit sensible analysis in this stage of the CF. It's not solving a problem; it's just an alternate mechanism, and at best an optimisation of the mechanism. Were we to explore it now, it seems certain that another person would observe that design was taking place and so the patch should be rejected, which would be unnecessary and wasteful. Considering that nobody's done any work to resolve the uncertainty about whether the worst-case performance characteristics of this patch are acceptable, and considering further that it was undergoing massive code churn for more than a month after the final CommitFest, I think it's not that unreasonable to think it might not be ready for prime time at this point. In any event, your argument is exactly backwards: we need to first decide whether the patch needs a redesign and then, if it does, postpone it. Deciding that we don't want to postpone it first, and therefore that we're not going to redesign it even if that is what's really needed, makes no sense. I also think it would alter our ability to diagnose problems, not least the normal test that xmax matches xmin across an update. There's nothing stopping the new tuple from being frozen before the old one, even today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partitioning triggers doc patch
Jaime Casanova ja...@2ndquadrant.com wrote: Is there any other place where we can add a more detailed explanation? Then, in your text, after saying "usually there is no reason for", you could add a parenthesis saying "more info in link". Sure, I'll look around for a good spot. -Kevin
Re: [HACKERS] query planner does not canonicalize infix operators
Daniel Farina dan...@heroku.com writes: On Mon, Mar 12, 2012 at 7:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: Will Leinweber w...@heroku.com writes: I created an index on an hstore function, fetchval(hstore, text), however when I use the -> infix operator which resolves to the very same function, this index is not used. It should be used. Don't hold your breath. Create an index on the expression you intend to use, not random respellings of it. Is this saying there's no need for that, or no one is working on it (and I certainly don't intend to), or definitely not in the next version, or something else entirely? The reason this is difficult is that the bulk of the planner's optimization knowledge is attached to operators, not functions. It'd be easy enough to smash operators down to their underlying functions during expression preprocessing, sure. But then we would fail to recognize index applicability at all --- for instance an index on an integer column can get matched to "indexcol = 42", but not to "int4eq(indexcol, 42)". And we'd have little clue about the selectivity of the expression, either, since selectivity estimators are attached to operators not functions. Arguably the Berkeley guys got this wrong 25 years ago, and they should have defined indexes and selectivity in terms of functions not operators. However I really don't foresee us reinventing all the operator class infrastructure to make that happen; the amount of work required is far out of proportion to the benefit, even without considering the number of external projects that would get impacted. (Inventing selectivity estimators for functions is a less daunting task, and we might do that sometime ... but I think it would likely live alongside operator selectivity rather than replace it.) More generally, I'm not prepared to buy into the idea that the planner should be expected to recognize alternate spellings of the same expression.
There are too many variants of that idea that are infeasible either because the planner doesn't have the necessary knowledge, or it does but trying to recognize the equivalence would cost an impractical number of planning cycles. Neither of those objections apply to "replace an operator by its underlying function"; but they do apply to other comparable requests we've gotten, such as "recognize that x + 0 is the same as x" or "x + 1 is the same as 1 + x". regards, tom lane
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Mar 12, 2012 at 6:14 PM, Robert Haas robertmh...@gmail.com wrote: Considering that nobody's done any work to resolve the uncertainty about whether the worst-case performance characteristics of this patch are acceptable, and considering further that it was undergoing massive code churn for more than a month after the final CommitFest, I think it's not that unreasonable to think it might not be ready for prime time at this point. Thank you for cutting to the chase. The uncertainty of which you speak is a theoretical point you raised. It has been explained, but nobody has yet shown the performance numbers to illustrate the point, but only because they seemed so clear. I would point out that you haven't demonstrated the existence of a problem either, so redesigning something without any proof of a problem seems strange. Let me explain again what this patch does and why it has such a major performance benefit. This feature gives us a step change in lock reductions from FKs. A real-world best case might be to examine the benefit this patch has on a large batch load that inserts many new orders for existing customers. In my example case the orders table has a FK to the customer table. At the same time as the data load, we attempt to update a customer's additional details, address or current balance etc. The large load takes locks on the customer table and keeps them for the whole transaction. So the customer updates are locked out for multiple seconds, minutes or maybe hours, depending upon how far you want to stretch the example. With this patch the customer updates don't cause lock conflicts, but they require mxact lookups in *some* cases, so they might take 1-10ms extra, rather than 1-10 minutes more: 1000x faster. The only case that causes the additional lookups is the case that otherwise would have been locked. So producing best case results is trivial, and the benefit can be as enormous as you like. I agree with you that some worst case performance tests should be done.
Could you please say what you think the worst cases would be, so those can be tested? That would avoid wasting time or getting anything backwards. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On 09.03.2012 12:04, Heikki Linnakangas wrote: I've been doing some performance testing with this, using a simple C function that just inserts a dummy WAL record of given size. I'm not totally satisfied. Although the patch helps with scalability at 3-4 concurrent backends doing WAL insertions, it seems to slow down the single-client case with small WAL records by about 5-10%. This is what Robert also saw with an earlier version of the patch (http://archives.postgresql.org/pgsql-hackers/2011-12/msg01223.php). I tested this with the data directory on a RAM drive; unfortunately I don't have a server with a hard drive that can sustain the high insertion rate. I'll post more detailed results, once I've refined the tests a bit. So, here are more detailed test results, using Greg Smith's excellent pgbench-tools test suite: http://community.enterprisedb.com/xloginsert-scale-tests/ The workload in all of these tests was a simple C function that writes a lot of very small WAL records, with 16 bytes of payload each. I ran the tests with the data directory on a regular hard drive, on an SSD, and on a RAM drive (/dev/shm). With HDD, I also tried fsync=off and synchronous_commit=off. For each of those, I ran the tests with 1-16 concurrent backends. Summary: The patch hurts single-backend performance by about 10%, except for the synchronous_commit=off test. Between 2-6 clients, it either helps, doesn't make any difference, or hurts. With > 6 clients, it hurts. So, that's quite disappointing. The patch has two problems: the 10% slowdown in the single-client case, and the slowdown with > 6 clients. I don't know where exactly the single-client slowdown comes from, although I'm not surprised that the bookkeeping with slots etc. has some overhead. Hopefully that overhead can be made smaller, if not eliminated completely. The slowdown with > 6 clients seems to be spinlock contention.
I ran perf record for a short duration during one of the ramdrive tests, and saw the spinlock acquisition in ReserveXLogInsertLocation() consuming about 80% of all CPU time. I then hacked the patch a little bit, removing the check in XLogInsert for fullPageWrites and forcePageWrites, as well as the check for "did a checkpoint just happen" (see http://community.enterprisedb.com/xloginsert-scale-tests/disable-fpwcheck.patch). My hunch was that accessing those fields causes cache line stealing, making the cache line containing the spinlock even more busy. That hunch seems to be correct; when I reran the tests with that patch, the performance with a high # of clients became much better. See the results with xloginsert-scale-13.patch. With that change, the single-client case is still about 10% slower than current code, but the performance with 8 clients is almost as good as with current code. Between 2-6 clients, the patch is a win. The hack that restored the > 6 clients performance to the current level is not safe, of course, so I'll have to figure out a safe way to get that effect. Also, even when the performance is as good as current code, it's not good to spend all the CPU time spinning on the spinlock. I didn't measure the CPU usage with current code, but I would expect it to be sleeping, not spinning, when not doing useful work. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
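[Editorial aside: the cache-line-stealing effect Heikki describes can be sketched in a few lines of C. The struct and field names below are hypothetical, loosely echoing the fields mentioned in the mail, and the 64-byte line size is an assumption (POWER7 actually uses 128-byte lines); the point is only the layout, not the real XLogCtl structure.]

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64      /* assumed line size; POWER7 uses 128 */

/*
 * Unpadded layout: the insertion spinlock shares a cache line with
 * read-mostly flags.  Every lock acquisition invalidates the line, so
 * backends that merely *read* the flags pull the line away from the
 * lock holder, and vice versa.
 */
typedef struct
{
    volatile int insert_lck;     /* hot: taken for every WAL insertion */
    int          fullPageWrites; /* read-mostly */
    int          forcePageWrites;/* read-mostly */
} UnpaddedCtl;

/*
 * Padded layout: the spinlock gets a cache line to itself, so reading
 * the flags no longer contends with lock traffic.
 */
typedef struct
{
    volatile int insert_lck;
    char         pad[CACHE_LINE_SIZE - sizeof(int)];
    int          fullPageWrites;
    int          forcePageWrites;
} PaddedCtl;
```

The tradeoff, of course, is memory: padding every contended lock this way costs most of a cache line per lock.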
Re: [HACKERS] query planner does not canonicalize infix operators
On Mon, Mar 12, 2012 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: Daniel Farina dan...@heroku.com writes: On Mon, Mar 12, 2012 at 7:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: Will Leinweber w...@heroku.com writes: I created an index on an hstore function, fetchval(hstore, text), however when I use the -> infix operator which resolves to the very same function, this index is not used. It should be used. Don't hold your breath. Create an index on the expression you intend to use, not random respellings of it. Is this saying there's no need for that, or no one is working on it (and I certainly don't intend to), or definitely not in the next version, or something else entirely? The reason this is difficult is that the bulk of the planner's optimization knowledge is attached to operators, not functions. It'd be easy enough to smash operators down to their underlying functions during expression preprocessing, sure. But then we would fail to recognize index applicability at all --- for instance an index on an integer column can get matched to "indexcol = 42", but not to "int4eq(indexcol, 42)". And we'd have little clue about the selectivity of the expression, either, since selectivity estimators are attached to operators not functions. Argh. That is very sad. Given that an operator definitely resolves to a function (right?) in the function catalog, it seems a little insane. Arguably the Berkeley guys got this wrong 25 years ago, and they should have defined indexes and selectivity in terms of functions not operators. However I really don't foresee us reinventing all the operator class infrastructure to make that happen; the amount of work required is far out of proportion to the benefit, even without considering the number of external projects that would get impacted. (Inventing selectivity estimators for functions is a less daunting task, and we might do that sometime ... but I think it would likely live alongside operator selectivity rather than replace it.) I see. That is ugly.
More generally, I'm not prepared to buy into the idea that the planner should be expected to recognize alternate spellings of the same expression. There are too many variants of that idea that are infeasible either because the planner doesn't have the necessary knowledge, or it does but trying to recognize the equivalence would cost an impractical number of planning cycles. Neither of those objections apply to "replace an operator by its underlying function"; but they do apply to other comparable requests we've gotten, such as "recognize that x + 0 is the same as x" or "x + 1 is the same as 1 + x". I think that the case of infix-operator-to-function stands out in that it doesn't really rely on knowing any algebraic properties of the underlying function, such as commutativity, associativity, the notion of a nil-value in the algebra, et al. It's a bit concerning because ORMs are starting to pick up on the extra data types in Postgres, and they'll all have to use the preferred spelling of the functionality to properly compose. If SQLAlchemy generates fetchval and ActiveRecord 4 uses ->, tough times are ahead. Thanks for your detailed response. As a small point of criticism, I think your response to Will came off as a bit strong, even though there are good reasons why this deceptively small request turns into a snarl that may only be of interest to optimizer hackers. On the more constructive side, if I were to till the fields to change this aspect of the optimizer, is there any interest in rectifying the operator-function confusion? -- fdr
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 10:55 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Mar 12, 2012 at 12:32 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Nate Boley's machine, the difference was ~100% increase rather than ~10%. Oh, right. I had forgotten how dramatic the changes were in those test runs. I guess I should be happy that the absolute numbers on this machine were as high as they were. This machine seems to be beating that one on every metric. Do you think the difference is in the CPU architecture, or the IO subsystem? That is an excellent question. I tried looking at vmstat output, but a funny thing kept happening: periodically, the iowait column would show a gigantic negative number instead of a number between 0 and 100. On which machine was that happening? This makes me a little chary of believing any of it. Even if I did, I'm not sure that would fully answer the question. So I guess the short answer is that I don't know, and I'm not even sure how I might go about figuring it out. Any ideas? Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. If it increases the TPS of Nate@16MB, but doesn't change the other 3 situations much, then that suggests the IO system is driving it. Basically, moving up to 32MB is partially inoculating against slow fsyncs upon log switch on that machine. Does the POWER7 have a nonvolatile cache? What happened with synchronous_commit=on? Also, since all data fits in shared_buffers, making checkpoint_segments and checkpoint_timeout be larger than the benchmark period should remove the only other source of writing from the system. With no checkpoints, no evictions, and no fsyncs, it is unlikely for the remaining IO to be the bottleneck. Cheers, Jeff
[HACKERS] GitHub mirror is not updating
Hi list, I don't know who's maintaining the PostgreSQL GitHub mirror, but it hasn't been updated for 6 days now: https://github.com/postgres/postgres Just letting you know. Regards, Marti
Re: [HACKERS] GitHub mirror is not updating
On Mon, Mar 12, 2012 at 21:53, Marti Raudsepp ma...@juffo.org wrote: Hi list, I don't know who's maintaining the PostgreSQL GitHub mirror, but it hasn't been updated for 6 days now: https://github.com/postgres/postgres Just letting you know. Thanks for letting us know. I think it's because we missed approving an SSH key after the GitHub security hole a week ago. I've approved it now; hopefully the mirror should kick back to life shortly. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] pg_upgrade --logfile option documentation
On Thu, Mar 08, 2012 at 08:33:48PM -0500, Bruce Momjian wrote: OK, it seems people do care about keeping log files from multiple runs so I went with Tom's idea and have:

------------------------------------------
pg_upgrade run on Thu Mar 8 19:30:12 2012
------------------------------------------
Performing Consistency Checks
------------------------------------------

Updated patch attached. FYI, in retain mode, these are the files left in the current directory:

	delete_old_cluster.sh
	pg_upgrade_dump_all.sql
	pg_upgrade_dump_db.sql
	pg_upgrade_dump_globals.sql
	pg_upgrade_internal.log
	pg_upgrade_restore.log
	pg_upgrade_server.log
	pg_upgrade_utility.log

I will address the idea of using /tmp in another email. OK, attached pg_upgrade patch applied, with this list of improvements in logging:

	add ability to control permissions of created files
	have psql echo its queries for easier debugging
	output four separate log files, and delete them on success
	add -r/--retain option to keep log files after success
	make log files append-only
	remove -g/-G/-l logging options
	suggest tailing appropriate log file on failure
	enhance -v/--verbose behavior

Thanks for all the feedback. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +

diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
new file mode 100644
index a5f63eb..cf43384
*** a/contrib/pg_upgrade/check.c
--- b/contrib/pg_upgrade/check.c
*** issue_warnings(char *sequence_script_fil
*** 165,176 ****
  	if (sequence_script_file_name)
  	{
  		prep_status("Adjusting sequences");
! 		exec_prog(true,
! 				  SYSTEMQUOTE "\"%s/psql\" --set ON_ERROR_STOP=on --no-psqlrc --port %d --username \"%s\" "
! 				  "-f \"%s\" --dbname template1 >> \"%s\"" SYSTEMQUOTE,
  				  new_cluster.bindir, new_cluster.port, os_info.user,
! 				  sequence_script_file_name, log_opts.filename2);
  		unlink(sequence_script_file_name);
  		check_ok();
  	}
--- 165,177 ----
  	if (sequence_script_file_name)
  	{
  		prep_status("Adjusting sequences");
! 		exec_prog(true, true, UTILITY_LOG_FILE,
! 				  SYSTEMQUOTE "\"%s/psql\" --echo-queries "
! 				  "--set ON_ERROR_STOP=on --no-psqlrc --port %d --username \"%s\" "
! 				  "-f \"%s\" --dbname template1 >> \"%s\" 2>&1" SYSTEMQUOTE,
  				  new_cluster.bindir, new_cluster.port, os_info.user,
! 				  sequence_script_file_name, UTILITY_LOG_FILE);
  		unlink(sequence_script_file_name);
  		check_ok();
  	}
*** create_script_for_old_cluster_deletion(c
*** 393,402 ****
  	prep_status("Creating script to delete old cluster");
! 	snprintf(*deletion_script_file_name, MAXPGPATH, "%s/delete_old_cluster.%s",
! 			 os_info.cwd, SCRIPT_EXT);
! 	if ((script = fopen(*deletion_script_file_name, "w")) == NULL)
  		pg_log(PG_FATAL, "Could not open file \"%s\": %s\n",
  			   *deletion_script_file_name, getErrorText(errno));
--- 394,403 ----
  	prep_status("Creating script to delete old cluster");
! 	snprintf(*deletion_script_file_name, MAXPGPATH, "delete_old_cluster.%s",
! 			 SCRIPT_EXT);
! 	if ((script = fopen_priv(*deletion_script_file_name, "w")) == NULL)
  		pg_log(PG_FATAL, "Could not open file \"%s\": %s\n",
  			   *deletion_script_file_name, getErrorText(errno));
*** check_for_isn_and_int8_passing_mismatch(
*** 541,548 ****
  		return;
  	}
! 	snprintf(output_path, sizeof(output_path), "%s/contrib_isn_and_int8_pass_by_value.txt",
! 			 os_info.cwd);
  	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
  	{
--- 542,549 ----
  		return;
  	}
! 	snprintf(output_path, sizeof(output_path),
! 			 "contrib_isn_and_int8_pass_by_value.txt");
  	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
  	{
*** 569,575 ****
  		for (rowno = 0; rowno < ntups; rowno++)
  		{
  			found = true;
! 			if (script == NULL && (script = fopen(output_path, "w")) == NULL)
  				pg_log(PG_FATAL, "Could not open file \"%s\": %s\n",
  					   output_path, getErrorText(errno));
  			if (!db_used)
--- 570,576 ----
  		for (rowno = 0; rowno < ntups; rowno++)
  		{
  			found = true;
! 			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
  				pg_log(PG_FATAL, "Could not open file \"%s\": %s\n",
  					   output_path, getErrorText(errno));
  			if (!db_used)
*** check_for_reg_data_type_usage(ClusterInf
*** 628,635 ****
  	prep_status("Checking for reg* system OID user data types");
! 	snprintf(output_path, sizeof(output_path), "%s/tables_using_reg.txt",
! 			 os_info.cwd);
  	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
  	{
--- 629,635 ----
  	prep_status("Checking for reg* system OID user data types");
Re: [HACKERS] pg_upgrade and umask
On Fri, Mar 09, 2012 at 11:33:36AM -0500, Bruce Momjian wrote: On Fri, Mar 09, 2012 at 10:41:53AM -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: The problem is that these files are being created often by shell redirects, e.g. pg_dump -f out 2> log_file. There is no clean way to control the file creation permissions in this case --- only umask gives us a process-level setting. Actually, one crafty idea would be to do the umask only when I exec something, and when I create the initial files with the new banner you suggested. Let me look into that. You could create empty log files with the desired permissions, and then do the execs with >> log_file, and thereby not have to globally change umask. Yes, that is what I have done, with the attached patch. I basically wrapped the fopen call with umask calls, and have the system() call wrapped too. That takes care of all the files pg_upgrade creates. Frankly, the permissions are already being modified by the default umask, e.g. 0022. Do we want a zero umask? I'm not so worried about the default umask; nobody's complained yet about wrong permissions on pg_upgrade output files. But umask 077 would be likely to do things like get rid of group access to postgresql.conf, which some people intentionally set. Yes, that was my conclusion too, but I wanted to ask. FYI, this doesn't affect the install itself, just what pg_upgrade changes, and it doesn't touch postgresql.conf, but, like you, I am worried there might be long-term problems with an aggressive umask that covered the entire executable. I ended up creating fopen_priv to centralize the umask calls in a single function, and added an is_priv boolean to exec_prog for the same purpose. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
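[Editorial aside: a minimal sketch of the fopen_priv()-style wrapper Bruce describes might look like the following. The body is an assumption reconstructed from the description above, not the actual pg_upgrade code.]

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Sketch of fopen_priv(): temporarily force a restrictive umask (077)
 * around the fopen() call so the file is created owner-only, then
 * restore the caller's umask so nothing else is affected.
 * Illustrative only; the real pg_upgrade helper may differ.
 */
FILE *
fopen_priv(const char *path, const char *mode)
{
    mode_t  old_mask = umask(S_IRWXG | S_IRWXO);    /* i.e. umask 077 */
    FILE   *fp = fopen(path, mode);

    umask(old_mask);            /* restore the caller's umask */
    return fp;
}
```

With the usual 0666 creation mode, a file opened with "w" through this wrapper ends up mode 0600, while files opened elsewhere in the program still honor the user's own umask.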
Re: [HACKERS] xlog location arithmetic
On Fri, Mar 09, 2012 at 03:04:23PM -0500, Tom Lane wrote: The main actual simplification would be in getting rid of the hole at the end of each 4GB worth of WAL, cf this bit in xlog_internal.h:

/*
 * We break each logical log file (xlogid value) into segment files of the
 * size indicated by XLOG_SEG_SIZE.  One possible segment at the end of each
 * log file is wasted, to ensure that we don't have problems representing
 * last-byte-position-plus-1.
 */
#define XLogSegSize		((uint32) XLOG_SEG_SIZE)
#define XLogSegsPerFile	(((uint32) 0xffffffff) / XLogSegSize)
#define XLogFileSize	(XLogSegsPerFile * XLogSegSize)

If we can't get rid of that and have a continuous 64-bit WAL address space then it's unlikely we can actually simplify any logic. Now, doing that doesn't break the naming convention exactly; what it changes is that there will be WAL files numbered xxx (for some number of trailing-1-bits I'm too lazy to work out at the moment) where before there were not. So the question really is how much external code there is that is aware of that specific noncontiguous numbering behavior and would be broken if things stopped being that way. Our current WAL naming is hopelessly arcane, and we would certainly be benefitting users by simplifying it. Is this a TODO? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
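[Editorial aside: the arithmetic behind the "hole" in the quoted excerpt can be checked with a small C sketch. The macros mirror the xlog_internal.h definitions above, assuming the default 16 MB XLOG_SEG_SIZE; the helper functions are purely illustrative.]

```c
#include <stdint.h>

/* Definitions from xlog_internal.h, with the default 16 MB segment size. */
#define XLOG_SEG_SIZE   (16 * 1024 * 1024)
#define XLogSegSize     ((uint32_t) XLOG_SEG_SIZE)
#define XLogSegsPerFile (((uint32_t) 0xffffffff) / XLogSegSize)
#define XLogFileSize    (XLogSegsPerFile * XLogSegSize)

/*
 * Segments per 4 GB logical file: 0xffffffff / 16MB truncates to 255,
 * not 256, because 0xffffffff is one byte short of a full 4 GB.
 */
uint32_t
segs_per_logical_file(void)
{
    return XLogSegsPerFile;
}

/*
 * Bytes of WAL address space never used at the end of each logical
 * file ("the hole"): the difference between a full 2^32 bytes and
 * XLogFileSize, which works out to exactly one segment's worth.
 */
uint64_t
wasted_bytes_per_logical_file(void)
{
    return (UINT64_C(1) << 32) - (uint64_t) XLogFileSize;
}
```

So each 4 GB logical file contains segments 0 through 254, and the address range that would have been segment 255 is the wasted hole Tom refers to.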
Re: [HACKERS] foreign key locks, 2nd attempt
On Mon, Mar 12, 2012 at 01:28:11PM -0400, Robert Haas wrote: I spent some time thinking about this over the weekend, and I have an observation, and an idea. Here's the observation: I believe that locking a tuple whose xmin is uncommitted is always a noop, because if it's ever possible for a transaction to wait for an XID that is part of its own transaction (exact same XID, or sub-XIDs of the same top XID), then a transaction could deadlock against itself. I believe that this is not possible: if a transaction were to wait for an XID assigned to that same backend, then the lock manager would observe that an ExclusiveLock on the xid is already held, so the request for a ShareLock would be granted immediately. I also don't believe there's any situation in which the existence of an uncommitted tuple fails to block another backend, but a lock on that same uncommitted tuple would have caused another backend to block. If any of that sounds wrong, you can stop reading here (but please tell me why it sounds wrong). When we lock an update-in-progress row, we walk the t_ctid chain and lock all descendant tuples. They may all have uncommitted xmins. This is essential to ensure that the final outcome of the updating transaction does not affect whether the locking transaction has its KEY SHARE lock. Similarly, when we update a previously-locked tuple, we copy any locks (always KEY SHARE locks) to the new version. That new tuple is both uncommitted and has locks, and we cannot easily sacrifice either property. Do you see a way to extend your scheme to cover these needs?
[HACKERS] pg_upgrade and statistics
As noted in the manual, pg_statistic is not ported in any way when performing pg_upgrade. I have been investigating what it would take to (even via just a connected SQL superuser client running UPDATE or INSERT against pg_statistic) get at least some baseline statistics into the database as quickly as possible, since in practice the underlying implementation of the statistics and cost estimation does not change so dramatically between releases as to make the old statistics useless (AFAIK). I eventually used a few contortions to be able to update the anyarray elements in pg_statistic: UPDATE pg_statistic SET stavalues1=array_in(anyarray_out('{thearrayliteral}'::concrete_type[]), 'concrete_type'::regtype, atttypmod) WHERE staattnum = attnum and starelid = therelation; Notably, the type analysis phase is a bit too smart for me to simply cast to anyarray from a concrete type, so I run it through a deparse/reparse phase instead to fool it. Now I'm stuck trying to ensure that autoanalyze will run at least once after we have committed the old statistics to the new catalogs, regardless of how much activity has taken place on the table, regardless of how cold it is (thus, tuning the GUC thresholds is not attractive, because at what point should I tune them back to normal settings?). One idea I had was to jigger pg_stat to indicate that a lot of tuples have changed since the last analyze (which will be automatically fixed after autoanalyze on a relation completes), but because this is not a regular table it doesn't look too easy unless I break out a new C extension. You probably are going to ask: why not just run ANALYZE and be done with it?
The reasons are:

* ANALYZE can take a sufficiently long time on large databases that the downtime of switching versions is not attractive
* If we don't run ANALYZE and have no old statistics, then the plans can be disastrously bad for the user
* If we do run the ANALYZE statement on a user's behalf as part of the upgrade, any compatibility fixups that require an exclusive lock (such as some ALTER TABLE statements) would have to block on this relatively long ANALYZE. autoanalyze/autovacuum, by comparison, backs off frequently, so disaster is averted.

If anyone has any insightful comments as to how to meet these requirements, I'd appreciate them, otherwise I can consider it an interesting area for improvement and will eat the ANALYZE and salt the documentation with caveats. -- fdr
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote: Do you think the difference is in the CPU architecture, or the IO subsystem? That is an excellent question. I tried looking at vmstat output, but a funny thing kept happening: periodically, the iowait column would show a gigantic negative number instead of a number between 0 and 100. On which machine was that happening? The IBM server. This makes me a little chary of believing any of it. Even if I did, I'm not sure that would fully answer the question. So I guess the short answer is that I don't know, and I'm not even sure how I might go about figuring it out. Any ideas? Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. If it increases the TPS of Nate@16MB, but doesn't change the other 3 situations much, then that suggests the IO system is driving it. Basically moving up to 32MB is partially inoculating against slow fsyncs upon log switch on that machine. Mmm, yeah. Although, I think it might have been 64MB rather than 32MB that I tested on that machine. Does the POWER7 have a nonvolatile cache? What happened with synchronous_commit=on? Haven't tried that yet. Also, since all data fits in shared_buffers, making checkpoint_segments and checkpoint_timeout be larger than the benchmark period should remove the only other source of writing from the system. With no checkpoints, no evictions, and no fsyncs, it is unlikely for the remaining IO to be the bottleneck. Another thing to test. Meanwhile, here are some TPS graphs at 16MB and 32MB on the IBM POWER7 machine. 32 clients, 1800 seconds, scale factor 300, synchronous commit off.
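Jeff's proposed follow-up run boils down to a handful of GUC changes. A sketch of the settings (the specific values below are illustrative assumptions, not from the thread except where quoted):

```ini
# postgresql.conf sketch for the proposed follow-up benchmark run
wal_buffers = 32MB            # the setting under test (16MB for the baseline run)
synchronous_commit = off      # as in the runs already posted
fsync = off                   # Jeff's suggestion: does the 16MB/32MB gap persist?
checkpoint_segments = 1000    # large enough that the 1800 s run consumes no checkpoint
checkpoint_timeout = 1h       # so no timed checkpoint fires during the benchmark
```

If the 16MB configuration catches up once fsync is off, that points at fsync stalls on log switch rather than the CPU architecture.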
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company attachment: tps-16MB.png attachment: tps-32MB.png
Re: [HACKERS] xlog location arithmetic
On Sat, Mar 10, 2012 at 3:23 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Mar 9, 2012 at 12:38 PM, Magnus Hagander mag...@hagander.net wrote: On Fri, Mar 9, 2012 at 18:18, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Fri, Mar 9, 2012 at 15:37, Tom Lane t...@sss.pgh.pa.us wrote: Why would it be useful to use pg_size_pretty on xlog locations? -1 because of the large expense of bigint-numeric-whatever conversion that would be added to existing uses. Given the expense, perhaps we need two different (overloaded) functions instead? Agreed. Attached patch introduces the overloaded function pg_size_pretty(numeric). That would be a workable solution, but I continue to not believe that this is useful enough to be worth the trouble. There's certainly some use to being able to prettify it. Wouldn't a pg_size_pretty(numeric) also be useful if you want to pg_size_pretty() a sum() of something? Used on files it doesn't make too much sense, given how big those files have to be, but it can be used on other things as well... I can see a usecase for having a pg_size_pretty(numeric) as an option. Not necessarily a very big one, but a >0 one. +1. +1, too. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center

*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** 14989,14995 ****
  postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
      </row>
      <row>
       <entry>
!       <literal><function>pg_size_pretty(<type>bigint</type>)</function></literal>
       </entry>
       <entry><type>text</type></entry>
       <entry>Converts a size in bytes into a human-readable format with size units</entry>
--- 14989,14995 ----
      </row>
      <row>
       <entry>
!       <literal><function>pg_size_pretty(<type>bigint</type> or <type>numeric</type>)</function></literal>
       </entry>
       <entry><type>text</type></entry>
       <entry>Converts a size in bytes into a human-readable format with size units</entry>

*** a/src/backend/utils/adt/dbsize.c
--- b/src/backend/utils/adt/dbsize.c
*** 24,29 ****
--- 24,30 ----
  #include storage/fd.h
  #include utils/acl.h
  #include utils/builtins.h
+ #include utils/numeric.h
  #include utils/rel.h
  #include utils/relmapper.h
  #include utils/syscache.h
*** 550,555 **** pg_size_pretty(PG_FUNCTION_ARGS)
--- 551,652 ----
  	PG_RETURN_TEXT_P(cstring_to_text(buf));
  }
+
+ Datum
+ pg_size_pretty_numeric(PG_FUNCTION_ARGS)
+ {
+ 	Numeric		size = PG_GETARG_NUMERIC(0);
+ 	Numeric		limit, limit2;
+ 	char	   *buf, *result;
+
+ 	limit = DatumGetNumeric(DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) (10 * 1024))));
+ 	limit2 = DatumGetNumeric(DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) (10 * 1024 * 2 - 1))));
+
+ 	if (DatumGetBool(DirectFunctionCall2(numeric_lt, NumericGetDatum(size), NumericGetDatum(limit))))
+ 	{
+ 		buf = DatumGetCString(DirectFunctionCall1(numeric_out, NumericGetDatum(size)));
+ 		result = palloc(strlen(buf) + 7);
+ 		strcpy(result, buf);
+ 		strcat(result, " bytes");
+ 	}
+ 	else
+ 	{
+ 		Numeric		arg2;
+
+ 		/* keep one extra bit for rounding */
+ 		/* size >>= 9 */
+ 		arg2 = DatumGetNumeric(DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) pow(2, 9))));
+ 		size = DatumGetNumeric(DirectFunctionCall2(numeric_div_trunc, NumericGetDatum(size), NumericGetDatum(arg2)));
+
+ 		if (DatumGetBool(DirectFunctionCall2(numeric_lt, NumericGetDatum(size), NumericGetDatum(limit2))))
+ 		{
+ 			/* size = (size + 1) / 2 */
+ 			size = DatumGetNumeric(DirectFunctionCall2(numeric_add, NumericGetDatum(size),
+ 							DirectFunctionCall1(int8_numeric, Int64GetDatum(1))));
+ 			size = DatumGetNumeric(DirectFunctionCall2(numeric_div_trunc, NumericGetDatum(size),
+ 							DirectFunctionCall1(int8_numeric, Int64GetDatum(2))));
+ 			buf = DatumGetCString(DirectFunctionCall1(numeric_out, NumericGetDatum(size)));
+ 			result = palloc(strlen(buf) + 4);
+ 			strcpy(result, buf);
+ 			strcat(result, " kB");
+ 		}
+ 		else
+ 		{
+ 			Numeric		arg3;
+
+ 			/* size >>= 10 */
+ 			arg3 = DatumGetNumeric(DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) pow(2, 10))));
+ 			size = DatumGetNumeric(DirectFunctionCall2(numeric_div_trunc, NumericGetDatum(size), NumericGetDatum(arg3)));
+
+ 			if (DatumGetBool(DirectFunctionCall2(numeric_lt, NumericGetDatum(size), NumericGetDatum(limit2))))
+ 			{
+ 				/* size = (size + 1) / 2 */
+ 				size = DatumGetNumeric(DirectFunctionCall2(numeric_add, NumericGetDatum(size),
+ 								DirectFunctionCall1(int8_numeric, Int64GetDatum(1))));
+ 				size = DatumGetNumeric(DirectFunctionCall2(numeric_div_trunc, NumericGetDatum(size),
+ 								DirectFunctionCall1(int8_numeric, Int64GetDatum(2))));
+ 				buf = DatumGetCString(DirectFunctionCall1(numeric_out, NumericGetDatum(size)));
+ 				result = palloc(strlen(buf) + 4);
+ 				strcpy(result, buf);
+ 				strcat(result, " MB");
+ 			}
Re: [HACKERS] Measuring relation free space
On Fri, Mar 09, 2012 at 02:18:02AM -0500, Jaime Casanova wrote: On Wed, Feb 22, 2012 at 12:27 AM, Noah Misch n...@leadboat.com wrote: On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote: 1) pgstattuple-gin_spgist.patch This first patch adds gin and spgist support to pgstattuple, also makes pgstattuple use a ring buffer when reading tables or indexes. The buffer access strategy usage bits look fine to commit. ok. i extracted that part. which basically makes pgstattuple usable in production (i mean, by not bloating shared buffers when using the function) I created a CF entry for this and marked it Ready for Committer. You left the bstrategy variable non-static, but that didn't seem important enough to justify another round trip.
[HACKERS] SPGiST versus hot standby - question about conflict resolution rules
There is one more (known) stop-ship problem in SPGiST, which I'd kind of like to get out of the way now before I let my knowledge of that code get swapped out again. This is that SPGiST is unsafe for use by hot standby slaves. The problem comes from redirect tuples, which are short-lifespan objects that replace a tuple that's been moved to another page. A redirect tuple can be recycled as soon as no active indexscan could be in flight from the parent index page to the moved tuple. SPGiST implements this by marking each redirect tuple with the XID of the creating transaction, and assuming that the tuple can be recycled once that XID is below the OldestXmin horizon (implying that all active transactions started after it ended). This is fine as far as transactions on the master are concerned, but there is no guarantee that the recycling WAL record couldn't be replayed on a hot standby slave while there are still HS transactions that saw the old state of the parent index tuple. Now, btree has a very similar problem with deciding when it's safe to recycle a deleted index page: it has to wait out transactions that could be in flight to the page, and it does that by marking deleted pages with XIDs. I see that the problem has been patched for btree by emitting a special WAL record just before a page is recycled. However, I'm a bit nervous about copying that solution, because the details are a bit different. In particular, I see that btree marks deleted pages with ReadNewTransactionId() --- that is, the next-to-be-assigned XID --- rather than the XID of the originating transaction, and then it subtracts one from the XID before sending it to the WAL stream. The comments about this are not clear enough for me, and so I'm wondering whether it's okay to use the originating transaction XID in a similar way, or if we need to modify SPGiST's rule for how to mark redirection tuples. 
I think that the use of ReadNewTransactionId is because btree page deletion happens in VACUUM, which does not have its own XID; this is unlike the situation for SPGiST where creation of redirects is caused by index tuple insertion, so there is a surrounding transaction with a real XID. But it's not clear to me how GetConflictingVirtualXIDs makes use of the limitXmin and whether a live XID is okay to pass to it, or whether we actually need next XID - 1. Info appreciated. regards, tom lane
Re: [HACKERS] xlog location arithmetic
On Sat, Mar 10, 2012 at 5:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Mar 9, 2012 at 2:34 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Mar 9, 2012 at 2:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hm. I think thousands is an overestimate, but yeah the logic could be greatly simplified. However, I'm not sure we could avoid breaking the existing naming convention for WAL files. How much do we care about that? Probably not very much, since WAL files aren't portable across major versions anyway. But I don't see why you couldn't keep the naming convention - there's nothing to prevent you from converting a 64-bit integer back into two 32-bit integers if and where needed. On further reflection, this seems likely to break quite a few third-party tools. Maybe it'd be worth it anyway, but it definitely seems like it would be worth going to at least some minor trouble to avoid it. The main actual simplification would be in getting rid of the hole at the end of each 4GB worth of WAL, cf this bit in xlog_internal.h:

/*
 * We break each logical log file (xlogid value) into segment files of the
 * size indicated by XLOG_SEG_SIZE.  One possible segment at the end of each
 * log file is wasted, to ensure that we don't have problems representing
 * last-byte-position-plus-1.
 */
#define XLogSegSize		((uint32) XLOG_SEG_SIZE)
#define XLogSegsPerFile	(((uint32) 0xffffffff) / XLogSegSize)
#define XLogFileSize	(XLogSegsPerFile * XLogSegSize)

If we can't get rid of that and have a continuous 64-bit WAL address space then it's unlikely we can actually simplify any logic. Now, doing that doesn't break the naming convention exactly; what it changes is that there will be WAL files numbered xxx (for some number of trailing-1-bits I'm too lazy to work out at the moment) where before there were not.
So the question really is how much external code there is that is aware of that specific noncontiguous numbering behavior and would be broken if things stopped being that way. A page header contains WAL location, so getting rid of hole seems to break pg_upgrade. No? Unless pg_upgrade converts noncontinuous location to continuous one, we still need to handle noncontinuous one after upgrade. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] xlog location arithmetic
Fujii Masao masao.fu...@gmail.com writes: On Sat, Mar 10, 2012 at 5:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: The main actual simplification would be in getting rid of the hole at the end of each 4GB worth of WAL, cf this bit in xlog_internal.h: If we can't get rid of that and have a continuous 64-bit WAL address space then it's unlikely we can actually simplify any logic. ... Now, doing that doesn't break the naming convention exactly; what it changes is that there will be WAL files numbered xxx (for some number of trailing-1-bits I'm too lazy to work out at the moment) where before there were not. So the question really is how much external code there is that is aware of that specific noncontiguous numbering behavior and would be broken if things stopped being that way. A page header contains WAL location, so getting rid of hole seems to break pg_upgrade. No? No, why would it do that? The meaning and ordering of WAL addresses is the same as before. The only difference is that after the upgrade, the system will stop skipping over 16MB of potentially usable WAL addresses at the end of each subsequently-used 4GB of space. The holes before the switchover point are still holes, but that doesn't matter. regards, tom lane
Re: [HACKERS] pg_upgrade and statistics
On Mon, Mar 12, 2012 at 06:38:30PM -0700, Daniel Farina wrote: You probably are going to ask: why not just run ANALYZE and be done with it? The reasons are:

* ANALYZE can take a sufficiently long time on large databases that the downtime of switching versions is not attractive
* If we don't run ANALYZE and have no old statistics, then the plans can be disastrously bad for the user
* If we do run the ANALYZE statement on a user's behalf as part of the upgrade, any compatibility fixups that require an exclusive lock (such as some ALTER TABLE statements) would have to block on this relatively long ANALYZE. autoanalyze/autovacuum, by comparison, backs off frequently, so disaster is averted.

If anyone has any insightful comments as to how to meet these requirements, I'd appreciate them, otherwise I can consider it an interesting area for improvement and will eat the ANALYZE and salt the documentation with caveats.

Copying the statistics from the old server is on the pg_upgrade TODO list. I have avoided it because it will add an additional requirement that will make pg_upgrade more fragile in case of major version changes. Does anyone have a sense of how often we change the statistics data between major versions? Ideally, pg_dump/pg_dumpall would add the ability to dump statistics, and pg_upgrade could use that. To answer your specific question, I think clearing the last-analyzed fields should cause autovacuum to run ANALYZE on those tables. What I don't know is whether not clearing the last-vacuum datetime will cause the table not to be analyzed. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Measuring relation free space
On Mon, Mar 12, 2012 at 9:41 PM, Noah Misch n...@leadboat.com wrote: I created a CF entry for this and marked it Ready for Committer. i wasn't sure if create an entry this late was a good idea or not... but now i feel better because is less probable that it will fall out on the cracks, thanks You left the bstrategy variable non-static, but that didn't seem important enough to justify another round trip. ah! i forgot that... -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación
Re: [HACKERS] query planner does not canonicalize infix operators
On Mon, 2012-03-12 at 13:04 -0700, Daniel Farina wrote: On the more constructive side, if I were to till the fields to change this aspect of the optimizer, is there any interest in rectifying the operator-function confusion? I once proposed to do that [1], but there was not much enthusiasm elsewhere. I still think it'd be worthwhile, for the reasons you mentioned. But it would be a lot of detail work. [1] http://archives.postgresql.org/message-id/1309640525.26660.22.ca...@vanquo.pezone.net
Re: [HACKERS] xlog location arithmetic
On Tue, Mar 13, 2012 at 12:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: Fujii Masao masao.fu...@gmail.com writes: On Sat, Mar 10, 2012 at 5:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: The main actual simplification would be in getting rid of the hole at the end of each 4GB worth of WAL, cf this bit in xlog_internal.h: If we can't get rid of that and have a continuous 64-bit WAL address space then it's unlikely we can actually simplify any logic. ... Now, doing that doesn't break the naming convention exactly; what it changes is that there will be WAL files numbered xxx (for some number of trailing-1-bits I'm too lazy to work out at the moment) where before there were not. So the question really is how much external code there is that is aware of that specific noncontiguous numbering behavior and would be broken if things stopped being that way. A page header contains WAL location, so getting rid of hole seems to break pg_upgrade. No? No, why would it do that? The meaning and ordering of WAL addresses is the same as before. The only difference is that after the upgrade, the system will stop skipping over 16MB of potentially usable WAL addresses at the end of each subsequently-used 4GB of space. The holes before the switchover point are still holes, but that doesn't matter. Oh, I see. You're right. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] initdb and fsync
On Sun, 2012-02-05 at 17:56 -0500, Noah Misch wrote: I meant primarily to illustrate the need to be comprehensive, not comment on which executable should fsync a particular file. Bootstrap-mode backends do not sync anything during an initdb run on my system. With your patch, we'll fsync a small handful of files and leave nearly everything else vulnerable. Thank you for pointing that out. With that in mind, I have a new version of the patch which just recursively fsync's the whole directory (attached). I also introduced a new option --nosync (-N) to disable this behavior. The bad news is that it introduces a lot more time to initdb -- it goes from about 1s to about 10s on my machine. I tried fsync'ing the whole directory twice just to make sure that the second was a no-op, and indeed it didn't make much difference (still about 10s). That's pretty inefficient considering that "initdb -D data --nosync && sync" only takes a couple seconds. Clearly batching the operation is a big help. Maybe there's some more efficient way to fsync a lot of files/directories? Or maybe I can mitigate it by avoiding files that don't really need to be fsync'd? Regards, Jeff Davis

attachment: initdb-fsync-20120312.patch.gz
Re: [HACKERS] pg_upgrade and statistics
Bruce Momjian br...@momjian.us writes: Copying the statistics from the old server is on the pg_upgrade TODO list. I have avoided it because it will add an additional requirement that will make pg_upgrade more fragile in case of major version changes. Does anyone have a sense of how often we change the statistics data between major versions? I don't think pg_statistic is inherently any more stable than any other system catalog. We've whacked it around significantly just last week, which might color my perception a bit, but there are other changes on the to-do list. (For one example, see nearby complaints about estimating TOAST-related costs, which we could not fix without adding more stats data.) regards, tom lane
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Tue, Mar 13, 2012 at 1:59 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Mar 9, 2012 at 2:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Thanks! BTW, I haven't forgotten about the recovery bugs Jeff found earlier. I'm planning to do a longer run with his test script - I only run it for about 1000 iterations - to see if I can reproduce the PANIC with both the earlier patch version he tested, and this new one. Hi Heikki, I've run the v12 patch for 17,489 rounds of crash and recovery, and detected no assertion failures or other problems. With the v12 patch, I could no longer reproduce the assertion failure either. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center