Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On 08/03/2015 04:25 PM, Ildus Kurbangaliev wrote:
On 07/28/2015 10:28 PM, Heikki Linnakangas wrote:
On 07/27/2015 01:20 PM, Ildus Kurbangaliev wrote:
Hello. In the attached patch I've made a refactoring for tranches. The prefix for them was extended, and I've split LWLockAssign into two functions (one with a tranche and a second for user-defined LWLocks).

This needs some work in order to be maintainable:

* The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync with the list of individual locks in lwlock.h. Sooner or later someone will add an LWLock and forget to update the names-array. That needs to be made less error-prone, so that the names are maintained in the same place as the #defines. Perhaps something like rmgrlist.h.

* The base tranches are a bit funny. They all have the same array_base, pointing to MainLWLockArray. If there are e.g. 5 clog buffer locks, I would expect T_NAME() to return ClogBufferLocks for all of them, and T_ID() to return numbers between 0-4. But in reality, T_ID() will return something like 55-59. Instead of passing a tranche-id to LWLockAssign(), I think it would be more clear to have a new function to allocate a contiguous block of lwlocks as a new tranche. It could then set the base correctly.

* Instead of having LWLOCK_INDIVIDUAL_NAMES to name individual locks, how about just giving each one of them a separate tranche?

* User manual needs to be updated to explain the new column in pg_stat_activity.

- Heikki

Hello. Thanks for the review. I attached a new version of the patch. It adds a new field in pg_stat_activity that shows the current wait in the backend.

I've done some refactoring of the LWLock tranche mechanism. In lwlock.c only individual and user-defined LWLocks are created; other LWLocks are created by the modules that need them. I think that is more logical (the user knows about the module, not the module about all of its users). It also simplifies LWLock acquisition.

Now each individual LWLock and each other group of LWLocks has its own tranche, and can be easily identified. If somebody adds a new individual LWLock and forgets to add its name, postgres will show a message. Individual LWLocks are still allocated in one array, but their tranches point to the particular index in the main array.

Sample:

b1=# select pid, wait_event from pg_stat_activity; \watch 1

 pid  | wait_event
------+-------------------------
 7722 | Storage: READ
 7653 |
 7723 | Network: WRITE
 7725 | Network: READ
 7727 | Locks: Transaction
 7731 | Storage: READ
 7734 | Network: READ
 7735 | Storage: READ
 7739 | LWLocks: WALInsertLocks
 7738 | Locks: Transaction
 7740 | LWLocks: BufferMgrLocks
 7741 | Network: READ
 7742 | Network: READ
 7743 | Locks: Transaction

Attached new version of the patch with some small fixes in code.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..2e4e67e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -630,6 +630,11 @@ postgres  27093  0.0  0.0  30096  2752 ?  Ss  11:34  0:00 postgres: ser
      <entry>True if this backend is currently waiting on a lock</entry>
     </row>
     <row>
+     <entry><structfield>wait_event</></entry>
+     <entry><type>text</></entry>
+     <entry>Name of a current wait event in backend</entry>
+    </row>
+    <row>
      <entry><structfield>state</></entry>
      <entry><type>text</></entry>
      <entry>Current overall state of this backend.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3a58f1e..10c25cf 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -457,7 +457,8 @@ CLOGShmemInit(void)
 {
 	ClogCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(ClogCtl, "CLOG Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  CLogControlLock, "pg_clog");
+				  CLogControlLock, "pg_clog",
+				  CLogBufferLocks);
 }
 
 /*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5ad35c0..dd7578f 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -466,7 +466,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
-				  CommitTsControlLock, "pg_commit_ts");
+				  CommitTsControlLock, "pg_commit_ts",
+				  CommitTSBufferLocks);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 									 sizeof(CommitTimestampShared),
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1933a87..b905c59 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1842,10 +1842,12 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl, "MultiXactOffset Ctl",
 				  NUM_MXACTOFFSET_BUFFERS, 0,
-				  MultiXactOffsetControlLock, "pg_multixact/offsets");
+
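For context, all three hunks above add the same trailing argument to SimpleLruInit(). A sketch of what the extended declaration would look like, assuming the existing slru.h signature; the new parameter's name and type are assumptions inferred from the call sites (CLogBufferLocks etc.), not the actual patch:

/*
 * Sketch of the extended SimpleLruInit(); only the final "tranche"
 * parameter is new, and it is an assumption based on the added
 * arguments above, not the patch's actual definition.
 */
extern void SimpleLruInit(SlruCtl ctl, const char *name,
						  int nslots, int nlsns,
						  LWLock *ctllock, const char *subdir,
						  int tranche_id);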
Re: [HACKERS] GROUP BY before JOIN
On Tue, Aug 4, 2015 at 4:00 PM, David Rowley david.row...@2ndquadrant.com wrote:
On 4 August 2015 at 21:56, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote:

This looks like one example of the general problem of finding the optimal order for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A, B), sql_op4(C, D)), sql_op5(E, F)) where sql_op can be an SQL operation like join, grouping, aggregation, projection, union, intersection, limit, ordering etc. and A, B, C, D, E, F are tables. Given such a representation, how do we find an optimal order of operation? In your example the query which is defined like GROUPBY(JOIN(A, B)) might have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint groups, i.e. GROUPBY is distributive over UNION.

I'm not really sure that I understand why it's GROUPBY(A) and GROUPBY(B). If the GROUP BY contained columns from relations A and B, then it wouldn't be possible to group each one individually. We would have to wait until the join was performed in that case.

Sorry for not being clear enough. GROUPBY(A) implies GROUPBY on relation A, using some columns (unspecified but common to both A and B here). GROUPBY is used as an operator over relations, in the classic sense of operator. This representation does not have any columns or join conditions specified, for simplicity. It specifies SQL operators on relations as operands.

Right now, we use dynamic programming to find the optimal order for the join tree (make_one_rel). In the language above, we find the optimal order for evaluating an expression consisting of JOIN operators by exploiting their commutative and associative properties. If we could extend that to the SQL expression tree representing the query, we would be able to solve many such re-ordering problems using the current path infrastructure. The dynamic program for this would look something like below:

root = top operator of the query
child = the operands of the root
cost of query = minimum of
1. cost of query without any mutation
2. cost of root and its child operator commuted, if there is only one child for the root operator and the root and child operators are commutable
3. cost of root distributed over children operands, if there are multiple operands and the root operator is distributive over the child operators
4. ...
5. ...

Each operator root will have a RelOptInfo to which we add the many paths possible for performing that operation using different orders of evaluation. The current add_path should eliminate the expensive ones.

So in a query such as:

select s.product_id from sale s inner join product p on s.product_id = p.product_id group by s.product_id;

Do you mean that there would be 2 RelOptInfos for the sale relation, one without the GROUP BY, and one with? That would likely solve the issue I have with join row estimates, but which of the 2 RelOptInfos would all of the Vars in the parse tree be pointing to?

This query would be represented as GROUPBY(JOIN(sales, product)), where grouping happens on s.product_id and JOIN on s.product_id = p.product_id. The dynamic programming would consider GROUPBY(JOIN(sales, product)) and JOIN(GROUPBY(sales), GROUPBY(product)) as possible evaluation orders, since GROUPBY is distributive over JOIN in this case, and choose the order with the least cost.
This is going to explode the number of path (plan) trees considered by the planner, so we need to be judicious about when to apply this technique, or have a rigorous pruning technique to discard costly paths early enough. This approach might be better than trying to solve each re-ordering problem separately, and might go well with the upper planner pathification being discussed at http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com .

It's certainly going to increase that, but I'm not sure if 'explode' is the best word, as we at least require all of the rels seen in the GROUP BY expressions and aggregate function parameters to be joined before we can consider a GroupingPath. The patch uses bms_is_subset(root->minimumGroupingRels, rel->relids) to determine that.

On Tue, Aug 4, 2015 at 1:57 PM, David Rowley david.row...@2ndquadrant.com wrote:

== Overview ==

As of today we always perform GROUP BY at the final stage, after each relation in the query has been joined. This of course works, but it's not always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not null);
create table sale (sale_id int primary key, product_id int not null, qty int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100) x(x);
insert into sale select x.x,x.x%100+1, 10 from
Re: [HACKERS] [PATCH] postgres_fdw extension support
Thanks so much Michael! Let me know when you have further feedback I should incorporate. ATB, P.

-- http://postgis.net http://cleverelephant.ca

On July 25, 2015 at 4:52:11 AM, Michael Paquier (michael.paqu...@gmail.com) wrote:
On Sat, Jul 25, 2015 at 2:19 AM, Paul Ramsey pram...@cleverelephant.ca wrote:
On Thu, Jul 23, 2015 at 7:48 AM, Paul Ramsey pram...@cleverelephant.ca wrote:
Here's an updated patch that clears the cache on changes to foreign wrappers and servers. Any chance one of you folks could be my official commitfest reviewer? Appreciate all the feedback so far! https://commitfest.postgresql.org/6/304/

That's an interesting topic, I will register as such.
-- Michael
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On 08/04/2015 03:15 PM, Robert Haas wrote:
On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
* The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync with the list of individual locks in lwlock.h. Sooner or later someone will add an LWLock and forget to update the names-array. That needs to be made less error-prone, so that the names are maintained in the same place as the #defines. Perhaps something like rmgrlist.h.

This is a good idea, but it's not easy to do in the style of rmgrlist.h, because I don't believe there's any way to define a macro that expands to a preprocessor directive. Attached is a patch that instead generates the list of macros from a text file, and also generates an array inside lwlock.c with the lock names that gets used by the Trace_lwlocks stuff where applicable. Any objections to this solution to the problem? If not, I'd like to go ahead and push this much. I can't test the Windows changes locally, though, so it would be helpful if someone could check that out.

In my latest patch I still have an array with names, but postgres will show a message if somebody adds an individual LWLock and forgets to add its name. Code generation is also a solution, and if committers support it I'll merge it into the main patch.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FSM versus GIN pending list bloat
On 4 August 2015 at 06:03, Jeff Janes jeff.ja...@gmail.com wrote:
The attached proof of concept patch greatly improves the bloat for both the insert and the update cases. You need to turn on both features: adding the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3). The first of those two things could probably be adopted for real, but the second probably is not acceptable. What is the right way to do this? Could a variant of RecordFreeIndexPage bubble the free space up the map immediately rather than waiting for a vacuum? It would only have to move up until it found a page with freespace already recorded in it, which the vast majority of the time would mean observing up one level and then not writing to it, assuming the pending list pages remain well clustered.

You make a good case for action here, since insert-only tables with GIN indexes on text are a common use case for GIN.

Why would vacuuming the FSM be unacceptable? With a large gin_pending_list_limit it makes sense. If it is unacceptable, perhaps we can avoid calling it every time, or simply have FreeSpaceMapVacuum() terminate more quickly on some kind of 80/20 heuristic for this case.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)
Hi Ashutosh,

Sorry for leaving the thread.

2015-07-20 16:09 GMT+09:00 Ashutosh Bapat ashutosh.ba...@enterprisedb.com:
In find_user_mapping(), if the first cache search returns a valid tuple, it is checked twice for validity, unnecessarily. Instead, if the first search returns a valid tuple, it should be returned immediately. I see that this code was copied from GetUserMapping(), which had the same problem as well.

Oops. I changed find_user_mapping to exit immediately when any valid cache entry was found.

In build_join_rel(), we check whether the user mappings of the two joining relations are the same. If the user mappings are the same, doesn't that imply that the foreign server and local user are the same too?

Yes, the validity of umid is identical to serverid. We can remove the check for serverid to save some cycles. One idea is to put an Assert for serverid inside the if-statement block.

Rest of the patch looks fine.

Thanks
--
Shigeru HANADA

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cost_agg() with AGG_HASHED does not account for startup costs
David Rowley david.row...@2ndquadrant.com writes:
While working on allowing the planner to perform GROUP BY before joining, I've noticed that cost_agg() completely ignores input_startup_cost when aggstrategy == AGG_HASHED.

Isn't your proposed patch double-counting the input startup cost? input_total_cost already includes that charge. The calculation reflects the fact that we have to read all of the input before we can deliver any aggregated results, so the time to get the first input row isn't really interesting. If this were wrong, the PLAIN costing path would also be wrong, but I don't think that either one is.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
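To make Tom's point concrete, here is a paraphrase of the AGG_HASHED arithmetic in cost_agg() (a simplified sketch for illustration, not a verbatim copy of the source): the input's total cost is folded into the aggregate's startup cost, because no group can be returned until every input row has been consumed.

/* Paraphrased sketch of cost_agg() for AGG_HASHED: input_startup_cost
 * needs no separate charge because it is already included in
 * input_total_cost, which all becomes startup cost here. */
startup_cost = input_total_cost;
startup_cost += aggcosts->transCost.startup;
startup_cost += aggcosts->transCost.perTuple * input_tuples;
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
total_cost = startup_cost;
total_cost += aggcosts->finalCost * numGroups;
total_cost += cpu_tuple_cost * numGroups;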
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On 08/04/2015 03:15 PM, Robert Haas wrote:
On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
* The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync with the list of individual locks in lwlock.h. Sooner or later someone will add an LWLock and forget to update the names-array. That needs to be made less error-prone, so that the names are maintained in the same place as the #defines. Perhaps something like rmgrlist.h.

This is a good idea, but it's not easy to do in the style of rmgrlist.h, because I don't believe there's any way to define a macro that expands to a preprocessor directive. Attached is a patch that instead generates the list of macros from a text file, and also generates an array inside lwlock.c with the lock names that gets used by the Trace_lwlocks stuff where applicable. Any objections to this solution to the problem? If not, I'd like to go ahead and push this much. I can't test the Windows changes locally, though, so it would be helpful if someone could check that out.

A more low-tech solution would be to do something like this in lwlocknames.c:

static char *MainLWLockNames[NUM_INDIVIDUAL_LWLOCKS];

/* Turn a pointer to one of the LWLocks in the main array into an index number */
#define NAME_LWLOCK(l, name) MainLWLockNames[(l - MainLWLockArray)] = name

InitLWLockNames()
{
	NAME_LWLOCK(ShmemIndexLock, "ShmemIndexLock");
	NAME_LWLOCK(OidGenLock, "OidGenLock");
	...
}

That would not be auto-generated, so you'd need to keep that list in sync with lwlock.h, but it would be much better than the original patch, because if you forgot to add an entry in the names-array, the numbering of all the other locks would not go wrong. And you could have a runtime check that complains if there's an entry missing, like Ildus did in his latest patch.

I have no particular objection to your perl script either, though. I'll leave it up to you.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
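A minimal sketch of the runtime check Heikki mentions, assuming the MainLWLockNames array from the snippet above (the function name and message wording are illustrative, taken from neither patch):

/*
 * Illustrative completeness check: warn if an individual LWLock was
 * added to lwlock.h without a matching NAME_LWLOCK() call.
 */
static void
CheckLWLockNames(void)
{
	int			i;

	for (i = 0; i < NUM_INDIVIDUAL_LWLOCKS; i++)
	{
		if (MainLWLockNames[i] == NULL)
			elog(WARNING, "individual LWLock %d has no name assigned", i);
	}
}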
Re: [HACKERS] FSM versus GIN pending list bloat
On 08/04/2015 04:35 PM, Simon Riggs wrote: This and the OP seem like 9.5 open items to me. Why? This is nothing new in 9.5. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FSM versus GIN pending list bloat
On 4 August 2015 at 14:55, Heikki Linnakangas hlinn...@iki.fi wrote:
On 08/04/2015 04:35 PM, Simon Riggs wrote:
This and the OP seem like 9.5 open items to me.

Why? This is nothing new in 9.5.

gin_pending_list_limit is new in 9.5. We're in Alpha, so if something has been added and isn't very usable, we should change it while we can.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] FSM versus GIN pending list bloat
On 2015-08-04 14:59:11 +0100, Simon Riggs wrote: On 4 August 2015 at 14:55, Heikki Linnakangas hlinn...@iki.fi wrote: On 08/04/2015 04:35 PM, Simon Riggs wrote: This and the OP seem like 9.5 open items to me. Why? This is nothing new in 9.5. gin_pending_list_limit is new in 9.5 We're in Alpha, so if something has been added and isn't very usable, we should change it while we can. The only thing that variable does is change what the pending size limit is determined by. Previously it was work_mem, now it's gin_pending_list_limit. Imo that has pretty much nothing to do with not registering pages as free. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Sun, Aug 2, 2015 at 8:06 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
Amit, let me ask three more detailed questions.

Why does Funnel have a valid qual of the subplan? The 2nd argument of make_funnel() is the qualifier of the subplan (PartialSeqScan); it is initialized at ExecInitFunnel, but never executed at run time. Why does the Funnel node have a useless qualifier expression here (even though it is harmless)?

The idea is that if in some case the qualification can't be pushed down (consider the case where the qualification contains parallel-restricted functions, i.e. functions that can only be executed in the master backend) and needs to be executed only in the master backend, then we need it in the Funnel node, so that it can be executed for tuples passed by worker backends. It is currently not used, but I think we should retain it as it is, because it can be used in some cases either as part of this patch itself or in future. As of now, it is used in other places in the patch (like during Explain) as well; although we might want to optimize the same, overall I think it is required.

Why is Funnel derived from Scan? Even though it constructs a compatible target-list with the underlying partial-scan node, that does not require the node to also be derived from Scan.

It needs its own target-list due to the reason mentioned above for qual, and yet another reason is that the same is required for FunnelState, which in turn is required for the ScanSlot used to retrieve tuples from workers. Also it is not exactly the same as PartialSeqScan, because in the case when the PartialSeqScan node is executed by a worker, we modify the targetlist as well; refer create_parallel_worker_plannedstmt().

Does ExecFunnel() need to have a special code path to handle EvalPlanQual()? Probably, it just calls the underlying node in the local context. ExecScan() of PartialSeqScan will check its qualifier against estate->es_epqTuple[].

Isn't EvalPlanQual() called for the ModifyTable node, which won't be allowed in parallel mode? So I think EvalPlanQual() is not required for the ExecFunnel path.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Support for N synchronous standby servers - take 2
Michael Paquier wrote:
And this is the case of any format as well. String format validation for a GUC occurs when the server is reloaded or restarted; one advantage of JSON is that the parser/validator is already there, so we don't need to reinvent new machinery for that.

IIUC, we would also have to add additional code to check that the given JSON has the required keys and entries. For example: a group mentioned in s_s_names should be defined in the groups section, etc.

-
Beena Emerson

--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860758.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablecmds.c and lock hierarchy
On 4 August 2015 at 05:56, Michael Paquier michael.paqu...@gmail.com wrote:
Hi all,

As mentioned in the thread related to lowering locks of autovacuum reloptions in ALTER TABLE SET (http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com), I have noticed the following code in AlterTableGetLockLevel@tablecmds.c:

/*
 * Take the greatest lockmode from any subcommand
 */
if (cmd_lockmode > lockmode)
	lockmode = cmd_lockmode;

The thing is that, as mentioned by Alvaro and Andres on this thread, we have no guarantee that the different relation locks compared have a monotone hierarchy, and we may finish by taking a lock that does not behave as you would like to. We are now lucky enough that ALTER TABLE only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and AccessExclusiveLock, which actually have a hierarchy, so this is not a problem yet. However it may become a problem if we add in the future more lock modes that are used by ALTER TABLE.

Please provide the link to the discussion of this. I don't see a problem here right now that can't be solved by saying

Assert(locklevel == ShareUpdateExclusiveLock || locklevel >= ShareRowExclusiveLock);

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Tue, Aug 4, 2015 at 3:27 PM, Beena Emerson memissemer...@gmail.com wrote: Michael Paquier wrote: And this is the case of any format as well. String format validation for a GUC occurs when server is reloaded or restarted, one advantage of JSON is that the parser validator is already here, so we don't need to reinvent a new machinery for that. IIUC correctly, we would also have to add additional code to check that that given JSON has the required keys and entries. For ex: The group mentioned in the s_s_names should be definied in the groups section, etc. Yep, true as well. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablecmds.c and lock hierarchy
On Tue, Aug 4, 2015 at 2:43 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
Michael Paquier wrote:
As mentioned in the thread related to lowering locks of autovacuum reloptions in ALTER TABLE SET (http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com), I have noticed the following code in AlterTableGetLockLevel@tablecmds.c:

/*
 * Take the greatest lockmode from any subcommand
 */
if (cmd_lockmode > lockmode)
	lockmode = cmd_lockmode;

The thing is that, as mentioned by Alvaro and Andres on this thread, we have no guarantee that the different relation locks compared have a monotone hierarchy, and we may finish by taking a lock that does not behave as you would like to.

Maybe the solution to this is to add the concept of addition of two lock modes, where the result is another lock mode that conflicts with any lock that would conflict with either of the two operand lock modes. For instance, if you add a lock mode to itself, you get the same lock mode; if you add anything to AccessExclusive, you get AccessExclusive; if you add anything to AccessShare, you end up with that other lock mode. The most interesting case in our current lock table is if you add ShareUpdateExclusive to Share, where you end up with ShareRowExclusive. In essence, holding the result of that operation is the same as holding both locks, which AFAICS is the behavior we want.

That's commutative, as this is basically looking at the conflict table to get the union of the bits indicating all the locks conflicting with lock A and lock B, and then selecting a lock mode whose conflict set covers the whole union with a minimum number of extra conflicts. Now, let's take for example this case with locks A, B, C, D:
- Lock A conflicts with A, C and D
- B conflicts with B, C and D
- C conflicts with itself
- D conflicts with itself
What would you choose as a result of add(C, D)? A or B? Or the super-lock conflicting with all of them?

--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
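For concreteness, a C sketch of that "addition", assuming access to a per-mode conflict mask (the lock manager keeps such a conflict table internally; conflict_mask() below is a hypothetical stand-in for that lookup, not an existing function):

/*
 * Hypothetical "lock addition": return a mode that conflicts with
 * everything conflicting with either input, preferring the candidate
 * with the fewest extra conflicts.  AccessExclusiveLock always
 * qualifies, so the search cannot fail.
 */
static int
mask_weight(LOCKMASK m)
{
	int			n = 0;

	while (m)
	{
		n += m & 1;
		m >>= 1;
	}
	return n;
}

static LOCKMODE
LockModeUnion(LOCKMODE a, LOCKMODE b)
{
	LOCKMASK	wanted = conflict_mask(a) | conflict_mask(b);
	LOCKMODE	best = AccessExclusiveLock;
	LOCKMODE	mode;

	for (mode = AccessShareLock; mode <= AccessExclusiveLock; mode++)
	{
		LOCKMASK	mask = conflict_mask(mode);

		/* must cover the union, and be no more conflicting than needed */
		if ((mask & wanted) == wanted &&
			mask_weight(mask) < mask_weight(conflict_mask(best)))
			best = mode;
	}
	return best;
}

Under this definition, Michael's add(C, D) would return whichever of A or B has the smaller conflict set (with the tie broken arbitrarily), since either one covers the union of C's and D's conflicts.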
Re: [HACKERS] FSM versus GIN pending list bloat
On 08/04/2015 08:03 AM, Jeff Janes wrote:
For a GIN index with fastupdate turned on, both the user backends and the autoanalyze routine will clear out the pending list, pushing the entries into the normal index structure and deleting the pages used by the pending list. But those deleted pages will not get added to the free space map until a vacuum is done. This leads to horrible bloat on insert-only tables, as they are never vacuumed and so the pending list space is never reused. And the pending list is very inefficient in space usage to start with, even compared to the old-style posting lists and especially compared to the new compressed ones. (If the pages were aggressively recycled, this inefficient use wouldn't be much of a problem.)

Good point.

The attached proof of concept patch greatly improves the bloat for both the insert and the update cases. You need to turn on both features: adding the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3). The first of those two things could probably be adopted for real, but the second probably is not acceptable. What is the right way to do this? Could a variant of RecordFreeIndexPage bubble the free space up the map immediately rather than waiting for a vacuum? It would only have to move up until it found a page with freespace already recorded in it, which the vast majority of the time would mean observing up one level and then not writing to it, assuming the pending list pages remain well clustered.

Yep, that sounds good.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
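A sketch of what such a variant might look like, with heavy caveats: the fsm_* helpers below are invented stand-ins for the (static) internals of freespace.c, so this shows the shape of the idea rather than a drop-in patch; only RecordPageWithFreeSpace() is the real public API.

/*
 * Hypothetical RecordFreeIndexPage() variant that bubbles the new free
 * space up the FSM tree immediately, stopping as soon as a parent slot
 * already advertises free space (searchers can already find their way
 * down from there).  fsm_leaf_for_block(), fsm_parent_of(),
 * fsm_slot_value() and fsm_recompute_slot() are illustrative names.
 */
void
RecordFreeIndexPageBubble(Relation rel, BlockNumber freeBlock)
{
	FSMAddress	addr = fsm_leaf_for_block(freeBlock);

	RecordPageWithFreeSpace(rel, freeBlock, BLCKSZ - 1);

	while (addr.level < FSM_ROOT_LEVEL)
	{
		FSMAddress	parent = fsm_parent_of(addr);

		if (fsm_slot_value(rel, parent) > 0)
			break;				/* path above is already visible */

		fsm_recompute_slot(rel, parent);
		addr = parent;
	}
}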
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Tue, Aug 4, 2015 at 2:57 PM, Masahiko Sawada wrote:
On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson wrote:
Since there will not be much nesting and grouping, I still prefer a new language to JSON. I understand one can easily modify/add groups in JSON using built-in functions, but I think changes will not be done too often.

If we decided to use a dedicated language, a syntax checker for that language would be needed, via SQL or something.

Well, sure, both approaches have downsides.

Otherwise we will not be able to know whether that value will be parsed correctly until reloading or restarting the server.

And this is the case of any format as well. String format validation for a GUC occurs when the server is reloaded or restarted; one advantage of JSON is that the parser/validator is already here, so we don't need to reinvent new machinery for that.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: SCRAM authentication
On Mon, Mar 30, 2015 at 7:52 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
There have been numerous threads on replacing our MD5 authentication method, so I started hacking on that to see what it might look like. Just to be clear, this is 9.6 material. Attached is a WIP patch series that adds support for SCRAM. There's no need to look at the details yet, but it demonstrates what the protocol changes and the code structure would be like. I'm not wedded to SCRAM - SRP or JPAKE or something else might be better. But replacing the algorithm, or adding more of them, should be straightforward with this.

Agreed. We need such a facility.

There is no negotiation of the authentication mechanism. SCRAM is just added as a new one, alongside all the existing ones. If the server requests SCRAM authentication, but the client doesn't support it, the attempt will fail. We might want to do something about that, to make the transition easier, but it's an orthogonal feature and not absolutely required.

There are four patches in the series. The first two are just refactoring: moving the SHA-1 implementation from pgcrypto to src/common, and some refactoring in src/backend/auth.c that IMHO would make sense anyway.

The first two patches of the series look good to me. Patches three and four are the interesting ones; I have not looked in detail yet at number four, implementing SCRAM.

3. Allow storing multiple verifiers in pg_authid

Replace the pg_authid.rolpassword text field with an array, and rename it to 'rolverifiers'. This allows storing multiple password hashes: an MD5 hash for MD5 authentication, and a SCRAM salt and stored key for SCRAM authentication, etc. Each element in the array is a string that begins with the method's name. For example "md5:<MD5 hash>", or "password:<plaintext>". For dump/reload, and for clients that wish to create the hashes on the client side, there is a new option to CREATE/ALTER USER commands: PASSWORD VERIFIERS '{ ... }', that allows replacing the array. The old ENCRYPTED/UNENCRYPTED PASSWORD 'foo' options are still supported for backwards-compatibility, but it's not clear what they should mean now.

TODO:
* Password-checking hook needs to be redesigned, to allow for more kinds of hashes.
* With CREATE USER PASSWORD 'foo', which hashes/verifiers should be generated by default? We currently have a boolean password_encryption setting for that. Needs to be a list.

I have been looking more in depth at this one, which adds essential infrastructure to support multiple authentication hashes for more protocols. Here are some comments:

- Docs are missing (not a big issue for a WIP).

- Instead of an array with an identifier embedded, let's add a new catalog pg_authid_hashes that stores all the hashes for a user (idea by Heikki):
-- hashrol, role Oid associated with the hash
-- hashmet, hash method
-- hashval, value of the hash

- New password-checking hook (contrib/passwordcheck will need a refresh). As of now, we have that:

void (*check_password_hook_type) (const char *username, const char *password, int password_type, Datum validuntil_time, bool validuntil_null);

We need to switch to something that checks a list of hashes:

void (*check_password_hook_type) (const char *username, List *passwd, Datum validuntil_time, bool validuntil_null);

passwd is a structure containing the password type and the hash value. The password type can then be "plain" (or "password" to match pg_hba.conf) or "md5" for now.

- When password_encryption is switched to a list, true means md5, and false means plain.
At the addition of SCRAM, we could think harder about the default value; true may be worth meaning 'md5,scram'.

- For CREATE ROLE/ALTER ROLE, it is necessary to be able to define the list of hashes that need to be generated, with something like this for example:
[ ENCRYPTED [(md5[, scram])] | UNENCRYPTED ] PASSWORD 'password'
When UNENCRYPTED is used, we could simply store the password as plain. When only ENCRYPTED is used, we store it for all the methods available, except plain. ENCRYPTED and plain are not an allowed combination.

- Also, do we really want an option at the SQL level to allow storing custom hashes generated on the client side as a first step? We could have something like WITH (md5 = 'blah', scram = 'blah2') appended after PASSWORD, for example.

- rolpassword is removed from pg_authid.

I am willing to write a patch for the next CF following more or less those lines, depending of course on the outcome of the discussion we can have here, so feel free to comment. I'll have a look more in-depth at the SCRAM patch as well.

Regards,
--
Michael
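To make the proposed hook change concrete, here is one possible shape for the list element and the new hook; everything below is an illustrative assumption, not committed API:

/*
 * Hypothetical list element for the reworked password-check hook: one
 * entry per verifier attached to the role.
 */
typedef struct PasswordVerifier
{
	char	   *method;			/* "plain", "md5", "scram", ... */
	char	   *value;			/* the hash, or the plaintext for "plain" */
} PasswordVerifier;

/*
 * The hook would then receive a List of PasswordVerifier instead of a
 * single (password, password_type) pair.
 */
typedef void (*check_password_hook_type) (const char *username,
										  List *verifiers,
										  Datum validuntil_time,
										  bool validuntil_null);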
Re: [HACKERS] tablecmds.c and lock hierarchy
On Tue, Aug 4, 2015 at 3:35 PM, Simon Riggs si...@2ndquadrant.com wrote:
Please provide the link to the discussion of this. I don't see a problem here right now that can't be solved by saying

Thread: http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com
Particularly those messages:
http://www.postgresql.org/message-id/20150731022857.gc11...@alap3.anarazel.de
http://www.postgresql.org/message-id/20150731200012.gc2...@postgresql.org
http://www.postgresql.org/message-id/CAB7nPqSK-hSZG7T1tAJ_=HEYsi6P1ejgX2x5LW3LYXJ7=9c...@mail.gmail.com

Assert(locklevel == ShareUpdateExclusiveLock || locklevel >= ShareRowExclusiveLock);

Yep, true as things stand now. But this would get broken if we add a new lock level between ShareRowExclusiveLock and AccessExclusiveLock that does not respect the current monotone hierarchy between those.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:
On Sat, Aug 1, 2015 at 9:20 PM, Andres Freund wrote:
On August 1, 2015 2:17:24 PM GMT+02:00, Michael Paquier wrote:
For instance, if you told me to choose between ShareLock and ShareUpdateExclusiveLock I wouldn't know which one is strongest.

I don't think it's sensible to have the lock mode compare primitive, honestly. I don't have any great ideas to offer ATM, sadly.

Yes, the thing is that lowering the lock levels is good for concurrency, but the non-monotony of the lock levels makes it impossible to choose an intermediate state correctly.

How about simply acquiring all the locks individually if they're different types? These few acquisitions won't matter.

As long as this only applies on master, this may be fine... We could basically pass a LOCKMASK to the multiple layers of tablecmds.c instead of LOCKMODE to track all the locks that need to be taken, and all the relations open during operations.

This sounds far too complicated to me. Just LockRelationOid() the relation with the appropriate level every time you pass through the function?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
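A minimal sketch of Andres' suggestion, assuming the ALTER TABLE code has collected the distinct lock modes its subcommands want (the list name below is hypothetical; LockRelationOid() is the real lmgr call):

/*
 * Illustrative only: instead of computing one "greatest" mode, take each
 * distinct mode the subcommands require.  Holding all of them blocks
 * whatever any one of them would block, with no hierarchy assumption.
 */
ListCell   *lc;

foreach(lc, subcommandLockModes)
{
	LOCKMODE	mode = lfirst_int(lc);

	LockRelationOid(relid, mode);
}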
[HACKERS] GROUP BY before JOIN
== Overview ==

As of today we always perform GROUP BY at the final stage, after each relation in the query has been joined. This of course works, but it's not always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not null);
create table sale (sale_id int primary key, product_id int not null, qty int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100) x(x);
insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000) x(x);

Here we have a table with 100 products and another with 1 million sales records. If we wanted to see the total sales for each product we may write a query such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping happened after the join took place:

                   QUERY PLAN
------------------------------------------------
 HashAggregate
   Group Key: s.product_id
   ->  Hash Join
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s
         ->  Hash
               ->  Seq Scan on product p

The join here would produce exactly 1 million rows, then HashAggregate will group 1 million rows. If it were possible for PostgreSQL to perform the GROUP BY before the join, we could change that to grouping 1 million rows, then joining 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale group by product_id) s inner join product p on s.product_id = p.product_id;

Which would do exactly that, and in this particular case it is much faster, but it's not all rainbows and unicorns: if the join happened to filter out many of the groups, then it might not be a win at all. Consider:

select s.product_id from sale s inner join product p on s.product_id = p.product_id where p.code = 'ABC1' group by s.product_id;

Since product.code has no equivalence member in the sale table, the predicate cannot be applied to the sale table, so in this case if we'd written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale group by product_id) s inner join product p on s.product_id = p.product_id where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.

== Proposal ==

I've been working on allowing the planner to properly cost which option is better and choose to perform the GROUP BY at the estimated most efficient join level. So far I've not gotten as far as supporting queries which contain aggregate functions, as I need the ability to combine aggregate states (Simon and I proposed a patch for that here https://commitfest.postgresql.org/5/131/).

Performance of the above test case is looking quite good so far:

select s.product_id from sale s inner join product p on s.product_id = p.product_id group by s.product_id;
-- Master = 312.294 ms
-- Patched = 95.328 ms

So a 327% performance increase in this particular case.

== Implementation ==

I've had to make some choices about how exactly to implement this feature. As I explained above, we can't just always do the grouping at the first possible opportunity, due to the possibility of joins eliminating groups and needless groups being created. To support this in the planner I've invented a GroupingPath type and I've also added 'cheapest_sorted_group_path' and 'cheapest_group_path' to RelOptInfo.
This part I wasn't too sure about, and I wondered if these GroupingPaths should just be added to the normal list with add_path(), but since these paths perform more work than normal paths, that would give them an unfair disadvantage and they could be thrown out. These paths are only possibly cheaper after a subsequent JOIN has taken place.

There's also an issue with row estimates being wrong on joinrels, as the row estimate is using the base rel's row count rather than the number of groups. As yet I'm unsure of the best way to fix this.

I wanted to post this early so as to register the fact that I'm working on this, but I'm also posting in the hope of getting some early feedback on what I have so far. Of course there's lots more work to do here: aggregates need to be supported, and functional dependencies and transitive closures will also need to be detected in a more complete implementation.

Here's what I have so far (attached). Comments are most welcome.

Thanks.

Regards

David Rowley

--
David Rowley  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

groupbeforejoin_v0.1.patch
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
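For readers following the design, a hypothetical shape for the new path type (field names are illustrative guesses, not the patch's actual definition):

/*
 * Hypothetical GroupingPath: a path that groups the rows of a cheaper
 * subpath before further joining takes place.  The real field layout in
 * the patch may differ.
 */
typedef struct GroupingPath
{
	Path		path;			/* costs include grouping the subpath */
	Path	   *subpath;		/* input rows to be grouped */
	List	   *groupClause;	/* grouping columns */
	double		numGroups;		/* estimated number of output groups */
} GroupingPath;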
Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On 2015-08-04 PM 05:58, Geoff Winkless wrote:
Although it seems Amit has defined the problem better than I could, so this is a bit late to the party (!), yes, the table had been ALTERed after it was created (looking back through the history, that modification included at least one DROP COLUMN).

It seems using any columns that come after a dropped column causes the EXCLUDED pseudo-relation to misbehave. For example, I observed another symptom:

test=# CREATE TABLE upsert_fail_test(a int, b int, c int, d smallint);
CREATE TABLE
test=# ALTER TABLE upsert_fail_test DROP b;
ALTER TABLE
test=# ALTER TABLE upsert_fail_test ADD PRIMARY KEY (a, c, d);
ALTER TABLE
test=# INSERT INTO upsert_fail_test(a, c, d) VALUES (1, 2, 3) ON CONFLICT (a, c, d) DO UPDATE SET c = EXCLUDED.c;
INSERT 0 1
test=# INSERT INTO upsert_fail_test(a, c, d) VALUES (1, 2, 3) ON CONFLICT (a, c, d) DO UPDATE SET c = EXCLUDED.c;
ERROR:  null value in column "c" violates not-null constraint
DETAIL:  Failing row contains (1, null, 3).

That is, the EXCLUDED pseudo-rel failed to deliver the '2' produced by the subplan and instead produced a null, which I guess was caused by the dropped column 'b'. Perhaps it has to do with how the EXCLUDED pseudo-rel's targetlist is manipulated through the parse/plan stages?

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
Robert Haas wrote:
Maybe shoehorning this into the GUC mechanism is the wrong thing, and what we really need is a new config file for this. The information we're proposing to store seems complex enough to justify that.

I think the consensus is that JSON is better, and using a new file with multi-line support would be good.

Name of the file: how about pg_syncinfo.conf?

Backward compatibility: synchronous_standby_names will be supported. synchronous_standby_names='pg_syncinfo' indicates use of the new file.

JSON format: it would contain 2 main keys: "sync_info" and "groups". The sync_info would consist of quorum/priority with the count, and nodes/group with the group name or node list. The optional groups key would list all the groups mentioned within sync_info along with their node lists.

Ex:

1.
{
    "sync_info": {
        "quorum": 2,
        "nodes": ["node1", "node2", "node3"]
    }
}

2.
{
    "sync_info": {
        "quorum": 2,
        "nodes": [
            {"priority": 1, "group": "cluster1"},
            {"quorum": 2, "group": "cluster2"},
            "node99"
        ]
    },
    "groups": {
        "cluster1": ["node11", "node12"],
        "cluster2": ["node21", "node22", "node23"]
    }
}

Thoughts?

-
Beena Emerson

--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860791.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
Yeah, something like that. To paraphrase, if I'm now understanding it correctly, Peter's idea is: When all the tuples have been fed to tuplesort, and it's time to perform the sort, quicksort all the tuples currently in the heap, ignoring the run numbers, and turn the resulting array into another tape. That tape is special: it's actually stored completely in memory. It is merged with the real tapes when tuples are returned from the tuplesort, just like regular tapes in TSS_FINALMERGE.

Yeah. I imagine that we'll want to add memory prefetch hints for the new case, since I've independently shown that that works well for the in-memory case, which this can be very close to.

My next patch will also include quicksorting of runs after we give up on heapification (after there is more than one run and it is established that we cannot use my quicksort with spillover optimization, so there are two or more real runs on tape). Once there is clearly not going to be one huge run (which can happen due to everything largely being in order, even when work_mem is small), and once incremental spilling does not end in time to do a quicksort with spillover, then the replacement selection thing isn't too valuable. Especially with large memory sizes but memory bandwidth + latency as a bottleneck, which is the norm these days.

This seems simpler than my earlier idea of reusing half the memtuples array only, and resorting the entire array each time, to have something that consistently approximates replacement selection but with quicksorting + batching, which I discussed before.

I have this working, and it takes a good chunk of the runtime off a sort that merges 3 runs in one reasonable case tested where work_mem was 300MB. It went from about 56.6 seconds with master to 35.8 seconds with this new approach when tested just now (this approach saves no writing of tuples, so it's not as effective as the original quicksort with spillover patch can be, but covers a fundamentally different case). I just need to clean up the patch, and see if I missed any further optimizations, but this feels like the way forward multi-run wise. I think it's worth giving up on replacement selection style runs after the first run is produced, because that's where the benefits are, if anywhere.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Autonomous Transaction is back
On 03 August 2015 18:40, Merlin Moncure [mailto:mmonc...@gmail.com] wrote:
On Sun, Aug 2, 2015 at 11:37 PM, Rajeev Rastogi rajeev.rast...@huawei.com wrote:
On 31 July 2015 23:10, Robert Haas wrote:
I think we're going entirely down the wrong path here. Why is it ever useful for a backend's lock requests to conflict with themselves, even with autonomous transactions? That seems like an artifact of somebody else's implementation that we should be happy we don't need to copy.

IMHO, since most locking is managed at the transaction level, not the backend level, and we consider the main autonomous transaction to be an independent transaction, they may practically conflict, right. It is also right, as you said, that there are no really useful use-cases where an autonomous transaction conflicts with the main (parent) transaction. But we cannot take it for granted, as a user might make a mistake. So at least we should have some mechanism to handle this rare case, for which as of now I think throwing an error from the autonomous transaction is one solution. Once the error is thrown from the autonomous transaction, the main transaction may continue as it is (or abort the main transaction also??).

hm. OK, what's the behavior of:

BEGIN
  UPDATE foo SET x = x + 1 WHERE foo_id = 1;

  BEGIN WITH AUTONOMOUS TRANSACTION
    UPDATE foo SET x = x + 1 WHERE foo_id = 1;
  END;

  RAISE EXCEPTION ...;
EXCEPTION
  ...
END;

It should throw an error (or something equivalent), as the second update will wait for the record lock to get released, which in this case will not happen till the second update finishes. So, catch-22.

Also,
*) What do the other candidate implementations do? IMO, compatibility should be the underlying design principle.

Oracle throws an error in such cases. But we can decide on what behavior we want to keep.

*) What will the SQL-only feature look like?

Similar to PL as mentioned in your example, we can provide the SQL-only feature also.

*) Is the SPI interface going to be extended to expose AT?

I don't think at this point that there is any need to expose the SPI interface for this.

Thanks and Regards,
Kumar Rajeev Rastogi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GROUP BY before JOIN
This looks like one example of the general problem of finding the optimal order for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A, B), sql_op4(C, D)), sql_op5(E, F)) where sql_op can be an SQL operation like join, grouping, aggregation, projection, union, intersection, limit, ordering etc. and A, B, C, D, E, F are tables. Given such a representation, how do we find an optimal order of operation? In your example the query which is defined like GROUPBY(JOIN(A, B)) might have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint groups, i.e. GROUPBY is distributive over UNION.

Right now, we use dynamic programming to find the optimal order for the join tree (make_one_rel). In the language above, we find the optimal order for evaluating an expression consisting of JOIN operators by exploiting their commutative and associative properties. If we could extend that to the SQL expression tree representing the query, we would be able to solve many such re-ordering problems using the current path infrastructure. The dynamic program for this would look something like below:

root = top operator of the query
child = the operands of the root
cost of query = minimum of
1. cost of query without any mutation
2. cost of root and its child operator commuted, if there is only one child for the root operator and the root and child operators are commutable
3. cost of root distributed over children operands, if there are multiple operands and the root operator is distributive over the child operators
4. ...
5. ...

Each operator root will have a RelOptInfo to which we add the many paths possible for performing that operation using different orders of evaluation. The current add_path should eliminate the expensive ones.

This is going to explode the number of path (plan) trees considered by the planner, so we need to be judicious about when to apply this technique, or have a rigorous pruning technique to discard costly paths early enough. This approach might be better than trying to solve each re-ordering problem separately, and might go well with the upper planner pathification being discussed at http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com .

On Tue, Aug 4, 2015 at 1:57 PM, David Rowley david.row...@2ndquadrant.com wrote:

== Overview ==

As of today we always perform GROUP BY at the final stage, after each relation in the query has been joined. This of course works, but it's not always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not null);
create table sale (sale_id int primary key, product_id int not null, qty int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100) x(x);
insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000) x(x);

Here we have a table with 100 products and another with 1 million sales records.
If we wanted to see the total sales for each product we may write a query such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping happened after the join took place:

                   QUERY PLAN
------------------------------------------------
 HashAggregate
   Group Key: s.product_id
   ->  Hash Join
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s
         ->  Hash
               ->  Seq Scan on product p

The join here would produce exactly 1 million rows, then HashAggregate will group 1 million rows. If it were possible for PostgreSQL to perform the GROUP BY before the join, we could change that to grouping 1 million rows, then joining 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale group by product_id) s inner join product p on s.product_id = p.product_id;

Which would do exactly that, and in this particular case it is much faster, but it's not all rainbows and unicorns: if the join happened to filter out many of the groups, then it might not be a win at all. Consider:

select s.product_id from sale s inner join product p on s.product_id = p.product_id where p.code = 'ABC1' group by s.product_id;

Since product.code has no equivalence member in the sale table, the predicate cannot be applied to the sale table, so in this case if we'd written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale group by product_id) s inner join product p on s.product_id = p.product_id where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.

== Proposal ==
Re: [HACKERS] FSM versus GIN pending list bloat
On 4 August 2015 at 09:39, Simon Riggs si...@2ndquadrant.com wrote:
On 4 August 2015 at 06:03, Jeff Janes jeff.ja...@gmail.com wrote:
The attached proof of concept patch greatly improves the bloat for both the insert and the update cases. You need to turn on both features: adding the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3). The first of those two things could probably be adopted for real, but the second probably is not acceptable. What is the right way to do this? Could a variant of RecordFreeIndexPage bubble the free space up the map immediately rather than waiting for a vacuum? It would only have to move up until it found a page with freespace already recorded in it, which the vast majority of the time would mean observing up one level and then not writing to it, assuming the pending list pages remain well clustered.

You make a good case for action here since insert-only tables with GIN indexes on text are a common use case for GIN.

Why would vacuuming the FSM be unacceptable? With a large gin_pending_list_limit it makes sense. If it is unacceptable, perhaps we can avoid calling it every time, or simply have FreeSpaceMapVacuum() terminate more quickly on some kind of 80/20 heuristic for this case.

Couple of questions here...

* The docs say it's desirable to have pending-list cleanup occur in the background, but there is no way to invoke that, except via VACUUM. I think we need a separate function to be able to call this as a background action. If we had that, we wouldn't need much else, would we?
* Why do we have two parameters: gin_pending_list_limit and fastupdate? What happens if we set gin_pending_list_limit but don't set fastupdate?
* How do we know how to set that parameter? Is there a way of knowing gin_pending_list_limit has been reached?

This and the OP seem like 9.5 open items to me.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
Takashi Horikawa t-horik...@aj.jp.nec.com writes:
Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value.
As far as I experimented with my own evaluation environment using PostgreSQL-9.4.4 on an x86_64 Linux, this problem can be fixed with the patch attached.

I'm unsure whether this represents a complete fix ... but even if it does, it would be awfully easy to re-introduce similar bugs in future code changes, and who would notice? Josh's approach of restricting the buffer size seems a lot more robust.

If there were any practical use-case for such large WAL buffers then it might be worth spending some effort/risk here. But AFAICS, there is not. Indeed, capping wal_buffers might be argued to be a good thing in itself, because it would prevent users from wasting shared memory foolishly.

So my vote is for the original approach. (I've not read Josh's patch, so there might be something wrong with it in detail, but I like the basic approach.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
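For reference, a sketch of what the cap looks like in guc.c terms; the exact maximum below is an assumption (chosen so that wal_buffers * XLOG_BLCKSZ cannot overflow an int), not necessarily what Josh's patch uses:

/*
 * Sketch of a capped wal_buffers GUC entry.  The upper bound is an
 * assumption for illustration; -1 keeps the "auto-tune from
 * shared_buffers" behavior.
 */
{
	{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
		gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
		NULL,
		GUC_UNIT_XBLOCKS
	},
	&XLOGbuffers,
	-1, -1, (INT_MAX / XLOG_BLCKSZ),
	check_wal_buffers, NULL, NULL
},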
Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
On 2015-08-04 09:49:58 -0400, Tom Lane wrote: Takashi Horikawa t-horik...@aj.jp.nec.com writes: Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value. As far as I experiment with my own evaluation environment using PostgreSQL-9.4.4 on a x86_64 Linux, this problem can be fixed with the patch attached. I'm unsure whether this represents a complete fix ... but even if it does, it would be awfully easy to re-introduce similar bugs in future code changes, and who would notice? Josh's approach of restricting the buffer size seems a lot more robust. If there were any practical use-case for such large WAL buffers then it might be worth spending some effort/risk here. But AFAICS, there is not. Indeed, capping wal_buffers might be argued to be a good thing in itself because it would prevent users from wasting shared memory foolishly. So my vote is for the original approach. (I've not read Josh's patch, so there might be something wrong with it in detail, but I like the basic approach.) +1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Tue, Aug 4, 2015 at 5:55 AM, Andres Freund and...@anarazel.de wrote: On 2015-08-03 14:15:27 +0900, Michael Paquier wrote: On Sat, Aug 1, 2015 at 9:20 PM, Andres Freund wrote: On August 1, 2015 2:17:24 PM GMT+02:00, Michael Paquier wrote: For instance, if you told me to choose between ShareLock and ShareUpdateExclusiveLock I wouldn't know which one is strongest. I don't think it's sensible to have the lock mode compare primitive, honestly. I don't have any great ideas to offer ATM sadly. Yes, the thing is that lowering the lock levels is good for concurrency, but the non-monotonicity of the lock levels makes it impossible to choose an intermediate state correctly. How about simply acquiring all the locks individually if they're different types? These few acquisitions won't matter. As long as this only applies on master, this may be fine... We could basically pass a LOCKMASK to the multiple layers of tablecmds.c instead of LOCKMODE to track all the locks that need to be taken, and all the relations open during operations. This sounds far too complicated to me. Just LockRelationOid() the relation with the appropriate level every time you pass through the function? Hi all, IMHO it is simpler if we just fall back to AccessExclusiveLock when there are different lockmodes in reloptions, as Michael suggested before. Look at the new version attached. (A sketch of this fallback follows after the patch excerpt below.) Regards, *** This work is funded by Zenvia Mobile Results (http://www.zenvia.com.br) -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 1c1c181..ad985cd 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -543,6 +543,10 @@ ALTER TABLE ALL IN TABLESPACE replaceable class=PARAMETERname/replaceable of commandALTER TABLE/ that forces a table rewrite. /para + para + Changing autovacuum storage parameters acquires a literalSHARE UPDATE EXCLUSIVE/literal lock. + /para + note para While commandCREATE TABLE/ allows literalOIDS/ to be specified diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 180f529..e0f9f09 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] = { autovacuum_enabled, Enables autovacuum in this relation, - RELOPT_KIND_HEAP | RELOPT_KIND_TOAST + RELOPT_KIND_HEAP | RELOPT_KIND_TOAST, + ShareUpdateExclusiveLock }, true }, @@ -65,7 +66,8 @@ static relopt_bool boolRelOpts[] = { user_catalog_table, Declare a table as an additional catalog table, e.g.
for the purpose of logical replication, - RELOPT_KIND_HEAP + RELOPT_KIND_HEAP, + AccessExclusiveLock }, false }, @@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] = { fastupdate, Enables \fast update\ feature for this GIN index, - RELOPT_KIND_GIN + RELOPT_KIND_GIN, + AccessExclusiveLock }, true }, @@ -81,7 +84,8 @@ static relopt_bool boolRelOpts[] = { security_barrier, View acts as a row security barrier, - RELOPT_KIND_VIEW + RELOPT_KIND_VIEW, + AccessExclusiveLock }, false }, @@ -95,7 +99,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs table pages only to this percentage, - RELOPT_KIND_HEAP + RELOPT_KIND_HEAP, + AccessExclusiveLock }, HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100 }, @@ -103,7 +108,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs btree index pages only to this percentage, - RELOPT_KIND_BTREE + RELOPT_KIND_BTREE, + AccessExclusiveLock }, BTREE_DEFAULT_FILLFACTOR, BTREE_MIN_FILLFACTOR, 100 }, @@ -111,7 +117,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs hash index pages only to this percentage, - RELOPT_KIND_HASH + RELOPT_KIND_HASH, + AccessExclusiveLock }, HASH_DEFAULT_FILLFACTOR, HASH_MIN_FILLFACTOR, 100 }, @@ -119,7 +126,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs gist index pages only to this percentage, - RELOPT_KIND_GIST + RELOPT_KIND_GIST, + AccessExclusiveLock }, GIST_DEFAULT_FILLFACTOR, GIST_MIN_FILLFACTOR, 100 }, @@ -127,7 +135,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs spgist index pages only to this percentage, - RELOPT_KIND_SPGIST + RELOPT_KIND_SPGIST, + AccessExclusiveLock }, SPGIST_DEFAULT_FILLFACTOR, SPGIST_MIN_FILLFACTOR, 100 }, @@ -135,7 +144,8 @@ static relopt_int intRelOpts[] = { autovacuum_vacuum_threshold, Minimum number of tuple updates or deletes prior to vacuum, - RELOPT_KIND_HEAP | RELOPT_KIND_TOAST +
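To make the suggested fallback concrete, a minimal sketch under simplified types; the lock-mode values and the option table below are stand-ins for lockdefs.h and reloptions.c, and lockmode_for_options() is a hypothetical helper, not the patch's API:

/*
 * When the options being altered do not all agree on one lock level,
 * take the strongest lock instead of trying to rank intermediate modes.
 */
#include <stddef.h>
#include <string.h>

typedef int LOCKMODE;
#define NoLock                   0
#define ShareUpdateExclusiveLock 4
#define AccessExclusiveLock      8

struct opt_lock
{
	const char *name;
	LOCKMODE	lockmode;
};

static const struct opt_lock opt_locks[] = {
	{"autovacuum_enabled", ShareUpdateExclusiveLock},
	{"fillfactor", AccessExclusiveLock},
};

static LOCKMODE
lockmode_for_options(const char *const *names, int n)
{
	LOCKMODE	result = NoLock;
	int			i;
	size_t		j;

	for (i = 0; i < n; i++)
		for (j = 0; j < sizeof(opt_locks) / sizeof(opt_locks[0]); j++)
			if (strcmp(names[i], opt_locks[j].name) == 0)
			{
				if (result == NoLock)
					result = opt_locks[j].lockmode;
				else if (result != opt_locks[j].lockmode)
					return AccessExclusiveLock;	/* mixed: take strongest */
			}

	return result;
}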
Re: [HACKERS] Sharing aggregate states between different aggregate functions
On 08/03/2015 08:53 AM, David Rowley wrote: Attached is a delta patched which is based on sharing_aggstate-heikki-2.patch to fix up the out-dated comments and also a few more test scenarios which test the sharing works with matching INITCOND and that it does not when they don't match. What do you think? I committed this, after some more cleanup of the comments. Thanks! - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 08/03/2015 11:36 PM, Robert Haas wrote: On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan p...@heroku.com wrote: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. You're talking about a new thing here, that happens when it is necessary to dump everything and do a conventional merging of on-tape runs. IOW, we cannot fit a significant fraction of overall tuples in memory, and we need much of the memtuples array for the next run (usually this ends as a TSS_FINALMERGE). That being the case (...right?), I don't think that's what Heikki is talking about. He can correct me if I'm wrong, but what I think he's saying is that we should try to exploit the fact that we've already determined which in-memory tuples can be part of the current run and which in-memory tuples must become part of the next run. Suppose half the tuples in memory can become part of the current run and the other half must become part of the next run. If we throw all of those tuples into a single bucket and quicksort it, we're losing the benefit of the comparisons we did to figure out which tuples go in which runs. Instead, we could quicksort the current-run tuples and, separately, quick-sort the next-run tuples. Ignore the current-run tuples completely until the tape is empty, and then merge them with any next-run tuples that remain. Yeah, something like that. To paraphrase, if I'm now understanding it correctly, Peter's idea is: When all the tuples have been fed to tuplesort, and it's time to perform the sort, quicksort all the tuples currently in the heap, ignoring the run numbers, and turn the resulting array into another tape. That tape is special: it's actually stored completely in memory. It is merged with the real tapes when tuples are returned from the tuplesort, just like regular tapes in TSS_FINALMERGE. And my idea is: When all the tuples have been fed to tuplesort, and it's time to perform the sort, take all the tuples in the heap belonging to currentRun, quicksort them, and make them part of the current tape. They're not just pushed to the tape as usual, however, but attached as in-memory tail of the current tape. The logical tape abstraction will return them after all the tuples already in the tape, as if they were pushed to the tape as usual. Then take all the remaining tuples in the heap (if any), belonging to next tape, and do the same for them. They become an in-memory tail of the next tape. I'm not sure if there's any reason to believe that would be faster than your approach. In general, sorting is O(n lg n) so sorting two arrays that are each half as large figures to be slightly faster than sorting one big array. But the difference may not amount to much. Yeah, I don't think there's a big performance difference between the two approaches. I'm not wedded to either approach. Whichever approach we use, my main point was that it would be better to handle this in the logical tape abstraction. In my approach, you would have those in-memory tails attached to the last two tapes. In Peter's approach, you would have one tape that's completely in memory, backed by the array. 
In either case, the tapes would look like normal tapes to most of tuplesort.c. There would be no extra TSS state, it would be TSS_SORTEDONTAPE or TSS_FINALMERGE as usual. The logical tape abstraction is currently too low-level for that. It's just a tape of bytes, and tuplesort.c determines where a tuple begins and ends. That needs to be changed so that the logical tape abstraction works tuple-at-a-time instead. For example, instead of LogicalTapeRead(N) which reads N bytes, you would have LogicalTapeReadTuple(), which reads next tuple, and returns its length and the tuple itself. But that would be quite sensible anyway. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
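As a sketch of what that tuple-at-a-time interface could look like, layered over the existing byte-oriented primitive; the signature and the length-word layout (a uint32 length followed by the payload, roughly how tuplesort.c writes tapes today) are assumptions, not an agreed design:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct LogicalTape LogicalTape;

/* Existing primitive (assumed): returns the number of bytes read. */
extern size_t LogicalTapeRead(LogicalTape *tape, void *ptr, size_t size);

/*
 * Read the next tuple from the tape.  Returns false at end of tape; on
 * success the caller owns the returned buffer and must free() it.
 */
static bool
LogicalTapeReadTuple(LogicalTape *tape, void **tuple, uint32_t *len)
{
	uint32_t	tuplen;

	if (LogicalTapeRead(tape, &tuplen, sizeof(tuplen)) != sizeof(tuplen))
		return false;			/* clean end of tape */

	*tuple = malloc(tuplen);
	if (LogicalTapeRead(tape, *tuple, tuplen) != tuplen)
	{
		free(*tuple);
		return false;			/* truncated tape: treat as end/error */
	}

	*len = tuplen;
	return true;
}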
Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On 4 August 2015 at 09:30, Amit Langote langote_amit...@lab.ntt.co.jp wrote: On 2015-08-04 AM 02:57, Peter Geoghegan wrote: On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless pgsqlad...@geoff.dj wrote: If I create a copy of the table using CREATE TABLE mytab (LIKE brokentab INCLUDING ALL); INSERT INTO mytab SELECT * FROM brokentab; Also, did you drop any columns from the original brokentab table where the bug can be reproduced? This seems to be the case. I could reproduce the reported problem: Although it seems Amit has defined the problem better than I could (so this is a bit late to the party!): yes, the table had been ALTERed after it was created (looking back through the history, that modification included at least one DROP COLUMN). Thanks Geoff
Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On 2015-08-04 AM 02:57, Peter Geoghegan wrote: On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless pgsqlad...@geoff.dj wrote: If I create a copy of the table using CREATE TABLE mytab (LIKE brokentab INCLUDING ALL); INSERT INTO mytab SELECT * FROM brokentab; Also, did you drop any columns from the original brokentab table where the bug can be reproduced? This seems to be the case. I could reproduce the reported problem: test=# CREATE TABLE upsert_fail_test(a int, b int, c smallint); CREATE TABLE test=# ALTER TABLE upsert_fail_test DROP c; ALTER TABLE test=# ALTER TABLE upsert_fail_test ADD c smallint; ALTER TABLE test=# ALTER TABLE upsert_fail_test ADD PRIMARY KEY (a, b, c); ALTER TABLE test=# INSERT INTO upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT (a, b, c) DO UPDATE SET c = EXCLUDED.c; INSERT 0 1 test=# INSERT INTO upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT (a, b, c) DO UPDATE SET b = EXCLUDED.b; INSERT 0 1 test=# INSERT INTO upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT (a, b, c) DO UPDATE SET c = EXCLUDED.c; ERROR: attribute 3 has wrong type DETAIL: Table has type integer, but query expects smallint. FWIW, I tried to look into why that happens. It seems that during set_plan_refs(), fix_join_expr on the splan->onConflictSet targetlist, using the EXCLUDED pseudo-rel's targetlist as the inner targetlist, causes column c's varattno to be changed to 3, whereas in the actual tuple descriptor it's 4 (dropped and re-added). Eventually, ExecEvalScalarVar() complains when it finds that attno 3 is a dropped attribute. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GROUP BY before JOIN
On 4 August 2015 at 21:56, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote: This looks like one example of a general problem of finding the optimal order for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A, B), sql_op4(C, D), sql_op5(E, F))), where sql_op can be a SQL operation like join, grouping, aggregation, projection, union, intersection, limit, ordering etc. and A, B, C, D are tables. Given such a representation, how do we find an optimal order of operation? In your example, the query which is defined like GROUPBY(JOIN(A, B)) might have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint groups, i.e. UNION is distributive over GROUPBY. I'm not really sure that I understand why it's GROUPBY(A) and GROUPBY(B). If the GROUP BY contained columns from relations A and B, then it wouldn't be possible to group each one individually. We would have to wait until the join was performed in that case. Right now, we use dynamic programming to find the optimal order for the join tree (make_one_rel). In the language above, we find the optimal order for evaluating an expression consisting of JOIN operators by exploiting their commutative and associative properties. If we could extend that to the SQL expression tree representing the query, we would be able to solve many such re-ordering problems using the current path infrastructure. The dynamic program for this would look something like: root = top operator of the query; child = the operands of the root; cost of query = minimum of 1. cost of query without any mutation, 2. cost of root and its child operator commuted, if there is only one child for the root operator and the root and child operators are commutable, 3. cost of root distributed over children operands, if there are multiple operands and the root operator is distributive over the child operators, 4. ... 5. ... Each operator root will have a RelOptInfo to which we add the many paths possible for performing that operation using different orders of evaluation. The current add_path should eliminate the expensive ones. So in a query such as: select s.product_id from sale s inner join product p on s.product_id = p.product_id group by s.product_id; do you mean that there would be 2 RelOptInfos for the sale relation, one without the GROUP BY, and one with? That would likely solve the issue I have with join row estimates, but which of the 2 RelOptInfos would all of the Vars in the parse tree be pointing to? This is going to explode the number of path (plan) trees considered by the planner, so we need to be judicious about when to apply this technique, or have a rigorous pruning technique to discard costly paths early enough. This approach might be better than trying to solve each re-ordering problem separately, and might go well with the upper planner pathification being discussed at http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com . It's certainly going to increase that, but I'm not sure if 'explode' is the best word, as we at least require all of the rels seen in the GROUP BY expressions and aggregate function parameters to be joined before we can consider a GroupingPath. The patch uses bms_is_subset(root->minimumGroupingRels, rel->relids) to determine that (a small sketch of this check follows below).
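Pared-down illustration of that gating test: bms_is_subset() is the real bitmapset primitive, while the types below and consider_grouping_paths() are hypothetical stand-ins, not the patch's actual code.

#include <stdbool.h>

typedef struct Bitmapset Bitmapset;
typedef struct PlannerInfo { Bitmapset *minimumGroupingRels; } PlannerInfo;
typedef struct RelOptInfo { Bitmapset *relids; } RelOptInfo;

extern bool bms_is_subset(const Bitmapset *a, const Bitmapset *b);
extern void consider_grouping_paths(PlannerInfo *root, RelOptInfo *rel);

/*
 * Grouped paths are only considered once every relation referenced by
 * the GROUP BY clause and aggregate arguments is inside this joinrel.
 */
static void
maybe_add_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
{
	if (bms_is_subset(root->minimumGroupingRels, rel->relids))
		consider_grouping_paths(root, rel);
}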
On Tue, Aug 4, 2015 at 1:57 PM, David Rowley david.row...@2ndquadrant.com wrote: == Overview == As of today we always perform GROUP BY at the final stage, after each relation in the query has been joined. This of course works, but it's not always the optimal way of executing the query. Consider the following two relations: create table product (product_id int primary key, code varchar(32) not null); create table sale (sale_id int primary key, product_id int not null, qty int not null); create index on sale (product_id); Populated with: insert into product select x.x,'ABC' || x.x from generate_series(1,100) x(x); insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000) x(x); Here we have a table with 100 products and another with 1 million sales records. If we wanted to see the total sales for each product we may write a query such as: select s.product_id,sum(qty) as qty from sale s inner join product p on s.product_id = p.product_id group by s.product_id; If we look at the explain for this query we can see that the grouping happened after the join took place: QUERY PLAN -- HashAggregate Group Key: s.product_id -> Hash Join Hash Cond: (s.product_id = p.product_id) -> Seq Scan on sale s -> Hash -> Seq Scan on product p The join here would produce exactly 1 million rows, then HashAggregate will group 1 million rows. If it were possible
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas hlinn...@iki.fi wrote: * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync with the list of individual locks in lwlock.h. Sooner or later someone will add an LWLock and forget to update the names-array. That needs to be made less error-prone, so that the names are maintained in the same place as the #defines. Perhaps something like rmgrlist.h. This is a good idea, but it's not easy to do in the style of rmgrlist.h, because I don't believe there's any way to define a macro that expands to a preprocessor directive. Attached is a patch that instead generates the list of macros from a text file, and also generates an array inside lwlock.c with the lock names that gets used by the Trace_lwlocks stuff where applicable. Any objections to this solution to the problem? If not, I'd like to go ahead and push this much. I can't test the Windows changes locally, though, so it would be helpful if someone could check that out. * Instead of having LWLOCK_INDIVIDUAL_NAMES to name individual locks, how about just giving each one of them a separate tranche? I don't think it's good to split things up to that degree; standardizing on one name per fixed lwlock and one per tranche otherwise seems like a good compromise to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/Makefile b/src/backend/Makefile index 98b978f..d16ab10 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -106,7 +106,7 @@ endif endif # aix # Update the commonly used headers before building the subdirectories -$(SUBDIRS:%=%-recursive): $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/catalog/schemapg.h $(top_builddir)/src/include/utils/fmgroids.h $(top_builddir)/src/include/utils/errcodes.h $(top_builddir)/src/include/utils/probes.h +$(SUBDIRS:%=%-recursive): $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/catalog/schemapg.h $(top_builddir)/src/include/storage/lwlocknames.h $(top_builddir)/src/include/utils/fmgroids.h $(top_builddir)/src/include/utils/errcodes.h $(top_builddir)/src/include/utils/probes.h # run this unconditionally to avoid needing to know its dependencies here: submake-schemapg: @@ -135,6 +135,9 @@ postgres.o: $(OBJS) parser/gram.h: parser/gram.y $(MAKE) -C parser gram.h +storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt + $(MAKE) -C storage/lmgr lwlocknames.h + utils/fmgroids.h: utils/Gen_fmgrtab.pl catalog/Catalog.pm $(top_srcdir)/src/include/catalog/pg_proc.h $(MAKE) -C utils fmgroids.h @@ -165,6 +168,11 @@ $(top_builddir)/src/include/catalog/schemapg.h: catalog/schemapg.h cd '$(dir $@)' rm -f $(notdir $@) \ $(LN_S) $$prereqdir/$(notdir $) . +$(top_builddir)/src/include/storage/lwlocknames.h: storage/lmgr/lwlocknames.h + prereqdir=`cd '$(dir $)' /dev/null pwd` \ + cd '$(dir $@)' rm -f $(notdir $@) \ + $(LN_S) $$prereqdir/$(notdir $) . 
+ $(top_builddir)/src/include/utils/errcodes.h: utils/errcodes.h prereqdir=`cd '$(dir $)' /dev/null pwd` \ cd '$(dir $@)' rm -f $(notdir $@) \ @@ -192,6 +200,7 @@ distprep: $(MAKE) -C bootstrap bootparse.c bootscanner.c $(MAKE) -C catalog schemapg.h postgres.bki postgres.description postgres.shdescription $(MAKE) -C replication repl_gram.c repl_scanner.c + $(MAKE) -C storage lwlocknames.h $(MAKE) -C utils fmgrtab.c fmgroids.h errcodes.h $(MAKE) -C utils/misc guc-file.c $(MAKE) -C utils/sort qsort_tuple.c @@ -307,6 +316,7 @@ maintainer-clean: distclean catalog/postgres.shdescription \ replication/repl_gram.c \ replication/repl_scanner.c \ + storage/lmgr/lwlocknames.h \ utils/fmgroids.h \ utils/fmgrtab.c \ utils/errcodes.h \ diff --git a/src/backend/storage/lmgr/.gitignore b/src/backend/storage/lmgr/.gitignore new file mode 100644 index 000..9355cae --- /dev/null +++ b/src/backend/storage/lmgr/.gitignore @@ -0,0 +1,2 @@ +/lwlocknames.c +/lwlocknames.h diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile index e12a854..e941f2d 100644 --- a/src/backend/storage/lmgr/Makefile +++ b/src/backend/storage/lmgr/Makefile @@ -24,8 +24,17 @@ s_lock_test: s_lock.c $(top_builddir)/src/port/libpgport.a $(CC) $(CPPFLAGS) $(CFLAGS) -DS_LOCK_TEST=1 $(srcdir)/s_lock.c \ $(TASPATH) -L $(top_builddir)/src/port -lpgport -o s_lock_test +# see explanation in ../../parser/Makefile +lwlocknames.c: lwlocknames.h ; + +lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt generate-lwlocknames.pl + $(PERL) $(srcdir)/generate-lwlocknames.pl $ + check: s_lock_test ./s_lock_test -clean distclean maintainer-clean: +clean distclean: rm -f s_lock_test + +maintainer-clean: clean + rm -f lwlocknames.h diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl new
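To make the idea concrete, here is an illustrative guess at what the generated files might contain; the exact shape is whatever generate-lwlocknames.pl emits, so treat the details below as assumptions rather than the patch's output.

/*
 * Both files derive from the same lwlocknames.txt line, e.g.
 * "ShmemIndexLock 1", so the #defines and the names array can no
 * longer drift apart the way a hand-maintained array could.
 */

/* lwlocknames.h (generated) */
#define ShmemIndexLock (&MainLWLockArray[1].lock)
#define OidGenLock     (&MainLWLockArray[2].lock)

/* lwlocknames.c (generated) */
const char *const MainLWLockNames[] = {
	"<unassigned:0>",
	"ShmemIndexLock",
	"OidGenLock",
};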
Re: [HACKERS] Reduce ProcArrayLock contention
On 2015-08-04 11:29:39 -0400, Robert Haas wrote: On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila amit.kapil...@gmail.com wrote: 1. I got rid of all of the typecasts. You're supposed to treat pg_atomic_u32 as a magic data type that is only manipulated via the primitives provided, not just cast back and forth between that and u32. Absolutely. Otherwise no fallbacks can work. 2. I got rid of the memory barriers. System calls are full barriers, and so are compare-and-exchange operations. Between those two facts, we should be fine without these. Actually by far not all system calls are full barriers? Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] brin index vacuum versus transaction snapshots
On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I think the real solution to this problem is to avoid use of GetTransactionSnapshot(), and instead use GetLatestSnapshot(). As far as I can see, that should completely close the hole. This requires patching IndexBuildHeapRangeScan() to allow for that. Actually I think there's another problem: if a transaction starts and inserts a tuple into the page range, then goes to sleep, and then another session does the summarization of the page range, session 1 is seen as in progress by session 2 (so the scan does not see the new tuple), but the placeholder tuple was not modified either because it was inserted later than the snapshot. So the update is lost. I think the only way to close this hole is to have summarize_range() sleep until all open snapshots are gone after inserting the placeholder tuple and before acquiring the snapshot, similarly to how CREATE INDEX CONCURRENTLY does it. That's gonna be really slow, though, right? Even if you rework things so that vacuum inserts all the placeholder tuples first, then waits, then does all the summarization, that could easily turn a vacuum that would have finished in a second into one that instead takes multiple hours. During that time an AV worker is pinned down, and all sorts of badness can ensue. Maybe I'm misunderstanding, but this whole thing seems like a pretty serious problem for BRIN. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
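For reference, the CREATE INDEX CONCURRENTLY-style wait under discussion looks roughly like the following sketch, modeled on what DefineIndex() does for concurrent builds (simplified; error handling and the multi-phase bookkeeping are omitted, and the function name is hypothetical):

#include "postgres.h"

#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/procarray.h"

static void
wait_for_old_snapshots(TransactionId limitXmin)
{
	VirtualTransactionId *old_snapshots;
	int			n_old_snapshots;
	int			i;

	/* Collect VXIDs of transactions whose snapshots may predate ours. */
	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, true,
										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
										  &n_old_snapshots);

	for (i = 0; i < n_old_snapshots; i++)
	{
		if (!VirtualTransactionIdIsValid(old_snapshots[i]))
			continue;
		/* Sleeps until the transaction holding this VXID ends. */
		VirtualXactLock(old_snapshots[i], true);
	}
}

This is the pattern Robert is worried about: each VirtualXactLock() call can sleep for as long as the blocking transaction runs, which is what could pin down an autovacuum worker for hours.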
Re: [HACKERS] Reduce ProcArrayLock contention
On 2015-08-04 11:43:45 -0400, Robert Haas wrote: On Tue, Aug 4, 2015 at 11:33 AM, Andres Freund and...@anarazel.de wrote: Actually by far not all system calls are full barriers? How do we know which ones are and which ones are not? Good question. Reading the source code of all implementations I suppose :( E.g. gettimeofday()/clock_gettime(), getpid() on linux aren't barriers. I can't believe PGSemaphoreUnlock isn't a barrier. That would be cruel. Yea, I think that's a pretty safe bet. I mean even if you'd implement it locklessly in the kernel, that'd still employ significant enough barriers/atomic ops itself. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upgrade failure from 9.5 to head
On Sun, Aug 2, 2015 at 8:20 PM, Stephen Frost sfr...@snowman.net wrote: +1. I was doing testing the other day and ran into the pg_dump doesn't support shell types issue and it was annoyingly confusing. Is anyone working on this? Should it be added to the open items list? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FSM versus GIN pending list bloat
On 4 August 2015 at 15:18, Andres Freund and...@anarazel.de wrote: On 2015-08-04 14:59:11 +0100, Simon Riggs wrote: On 4 August 2015 at 14:55, Heikki Linnakangas hlinn...@iki.fi wrote: On 08/04/2015 04:35 PM, Simon Riggs wrote: This and the OP seem like 9.5 open items to me. Why? This is nothing new in 9.5. gin_pending_list_limit is new in 9.5. We're in Alpha, so if something has been added and isn't very usable, we should change it while we can. The only thing that variable does is change what the pending-list size limit is determined by. Previously it was work_mem, now it's gin_pending_list_limit. IMO that has pretty much nothing to do with not registering pages as free. We've made a change to the way GIN fast update works, and that needs to be an effective change, not one that immediately triggers a secondary annoyance for the user. I've asked some questions; they may turn into actions, or not, but they are open items. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Autonomous Transaction is back
On Tue, Aug 4, 2015 at 4:12 AM, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 03 August 2015 18:40, Merlin Moncure [mailto:mmonc...@gmail.com] Wrote: On Sun, Aug 2, 2015 at 11:37 PM, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 31 July 2015 23:10, Robert Haas Wrote: I think we're going entirely down the wrong path here. Why is it ever useful for a backend's lock requests to conflict with themselves, even with autonomous transactions? That seems like an artifact of somebody else's implementation that we should be happy we don't need to copy. IMHO, since most locking is managed at the transaction level, not the backend level, and we consider the main and autonomous transactions to be independent transactions, then practically they may conflict, right? It is also true, as you said, that there are no really useful use-cases where an autonomous transaction conflicts with its main (parent) transaction. But we cannot take that for granted, as a user might make a mistake. So at least we should have some mechanism to handle this rare case, for which as of now I think throwing an error from the autonomous transaction is one solution. Once the error is thrown from the autonomous transaction, the main transaction may continue as it is (or should we abort the main transaction also??). hm. OK, what's the behavior of: BEGIN UPDATE foo SET x = x + 1 WHERE foo_id = 1; BEGIN WITH AUTONOMOUS TRANSACTION UPDATE foo SET x = x + 1 WHERE foo_id = 1; END; RAISE EXCEPTION ...; EXCEPTION ... END; It should throw an error (or something equivalent), as the second update will wait for the record lock to be released, which in this case will not happen until the second update itself finishes. So, catch-22. Yeah. Point being, from my point of view autonomous transactions have to conflict with the master transaction (or any transaction really). I agree the right course of action is to error out immediately...what else could you do? There isn't even a deadlock in the classic sense, and allowing control to continue would result in indeterminate behavior FWICT. Also, *) What do the other candidate implementations do? IMO, compatibility should be the underlying design principle. Oracle throws an error in such a case. But we can decide on what behavior we want to keep. gotcha. makes sense. *) What will the SQL-only feature look like? Similar to PL as mentioned in your example, we can provide the SQL-only feature also. *) Is the SPI interface going to be extended to expose AT? I don't think at this point that there is any need to expose the SPI interface for this. Ok, how does AT work in a non-plpgsql (SQL-only) scenario? Are you going to similarly extend BEGIN TRANSACTION? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reduce ProcArrayLock contention
On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila amit.kapil...@gmail.com wrote: I agree, and modified the patch to use 32-bit atomics based on the idea suggested by Robert, and didn't modify lwlock.c. While looking at the patch, I found that the way it was initialising the list to be empty was wrong: it was using pgprocno 0 to initialise the list, but 0 is a valid pgprocno. I think we should use a number greater than PROCARRAY_MAXPROC (the maximum number of procs in the proc array). Apart from the above fix, I have modified src/backend/access/transam/README to include information about the improvement this patch brings to reduce ProcArrayLock contention. I spent some time looking at this patch today and found that a good deal of cleanup work seemed to be needed. Attached is a cleaned-up version which makes a number of changes: 1. I got rid of all of the typecasts. You're supposed to treat pg_atomic_u32 as a magic data type that is only manipulated via the primitives provided, not just cast back and forth between that and u32. 2. I got rid of the memory barriers. System calls are full barriers, and so are compare-and-exchange operations. Between those two facts, we should be fine without these. 3. I removed the clearXid field you added to PGPROC. Since that's just a sentinel, I used nextClearXidElem for that purpose. There doesn't seem to be a need for two fields. 4. I factored out the actual XID-clearing logic into a new function ProcArrayEndTransactionInternal instead of repeating it twice. On the flip side, I merged PushProcAndWaitForXidClear with PopProcsAndClearXids and renamed the result to ProcArrayGroupClearXid, since there seemed to be no need to separate them. 5. I renamed PROC_NOT_IN_PGPROCARRAY to INVALID_PGPROCNO, which I think is more consistent with what we do elsewhere, and made it a compile-time constant instead of a value that must be computed every time it's used. 6. I overhauled the comments and README updates. I'm not entirely happy with the name nextClearXidElem but apart from that I'm fairly happy with this version. We should probably test it to make sure I haven't broken anything; on a quick 3-minute pgbench test on cthulhu (128-way EDB server) with 128 clients I got 2778 tps with master and 4330 tps with this version of the patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index bc68b47..f6db580 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -252,6 +252,9 @@ implementation of this is that GetSnapshotData takes the ProcArrayLock in shared mode (so that multiple backends can take snapshots in parallel), but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode while clearing MyPgXact->xid at transaction end (either commit or abort). +(To reduce context switching, when multiple transactions commit nearly +simultaneously, we have one backend take ProcArrayLock and clear the XIDs +of multiple processes at once.) ProcArrayEndTransaction also holds the lock while advancing the shared latestCompletedXid variable.
This allows GetSnapshotData to use diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 4f3c5c9..14d1219 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -167,6 +167,9 @@ static int KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, static TransactionId KnownAssignedXidsGetOldestXmin(void); static void KnownAssignedXidsDisplay(int trace_level); static void KnownAssignedXidsReset(void); +static void ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact, +TransactionId latestXid); +static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid); /* * Report shared-memory space needed by CreateSharedProcArray. @@ -370,7 +373,6 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid) elog(LOG, failed to find proc %p in ProcArray, proc); } - /* * ProcArrayEndTransaction -- mark a transaction as no longer running * @@ -399,26 +401,18 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid) */ Assert(TransactionIdIsValid(allPgXact[proc-pgprocno].xid)); - LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); - - pgxact-xid = InvalidTransactionId; - proc-lxid = InvalidLocalTransactionId; - pgxact-xmin = InvalidTransactionId; - /* must be cleared with xid/xmin: */ - pgxact-vacuumFlags = ~PROC_VACUUM_STATE_MASK; - pgxact-delayChkpt = false; /* be sure this is cleared in abort */ - proc-recoveryConflictPending = false; - - /* Clear the subtransaction-XID cache too while holding the lock */ - pgxact-nxids = 0; - pgxact-overflowed = false; - - /* Also advance global latestCompletedXid while holding the lock */ - if (TransactionIdPrecedes(ShmemVariableCache-latestCompletedXid, - latestXid)) - ShmemVariableCache-latestCompletedXid =
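In outline, the group-clear mechanism works as sketched below; the array names and helper are illustrative stand-ins (the real logic lives in ProcArrayGroupClearXid), and locking, error paths and follower wakeups are omitted:

/*
 * The leader detaches the whole list of waiting backends with one
 * atomic swap, then clears every XID under a single ProcArrayLock
 * acquisition.  atomic_swap_u32() stands in for pg_atomic_exchange_u32().
 */
#include <stdint.h>

#define INVALID_PGPROCNO 0xFFFFFFFFu	/* sentinel above any valid pgprocno */

extern uint32_t atomic_swap_u32(volatile uint32_t *ptr, uint32_t newval);
extern uint32_t nextClearXidElem[];		/* per-backend list links (assumed) */
extern void clear_xid_for(uint32_t pgprocno);	/* hypothetical helper */

static void
group_clear_xids(volatile uint32_t *listhead)
{
	/* Detach all queued backends at once; newcomers start a fresh list. */
	uint32_t	nextidx = atomic_swap_u32(listhead, INVALID_PGPROCNO);

	/* ... LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE) would go here ... */
	while (nextidx != INVALID_PGPROCNO)
	{
		clear_xid_for(nextidx);				/* reset xid, xmin, flags */
		nextidx = nextClearXidElem[nextidx];
	}
	/* ... LWLockRelease(ProcArrayLock) would go here ... */
}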
Re: [HACKERS] FSM versus GIN pending list bloat
Simon Riggs si...@2ndquadrant.com writes: On 4 August 2015 at 15:18, Andres Freund and...@anarazel.de wrote: The only thing that variable does is change what the pending size limit is determined by. Previously it was work_mem, now it's gin_pending_list_limit. Imo that has pretty much nothing to do with not registering pages as free. We've made a change to the way GIN fast update works and that needs to be an effective change, not one that immediately triggers a secondary annoyance for the user. Said annoyance has been there since day one, no? I've asked some questions; they may turn into actions, or not, but they are open items. If it's not a new problem introduced by 9.5, I don't think it's appropriate to insist that 9.5 must solve it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reduce ProcArrayLock contention
On Tue, Aug 4, 2015 at 11:33 AM, Andres Freund and...@anarazel.de wrote: On 2015-08-04 11:29:39 -0400, Robert Haas wrote: On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila amit.kapil...@gmail.com wrote: 1. I got rid of all of the typecasts. You're supposed to treat pg_atomic_u32 as a magic data type that is only manipulated via the primitives provided, not just cast back and forth between that and u32. Absolutely. Otherwise no fallbacks can work. 2. I got rid of the memory barriers. System calls are full barriers, and so are compare-and-exchange operations. Between those two facts, we should be fine without these. Actually by far not all system calls are full barriers? How do we know which ones are and which ones are not? I can't believe PGSemaphoreUnlock isn't a barrier. That would be cruel. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reduce ProcArrayLock contention
On Tue, Aug 4, 2015 at 8:59 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila amit.kapil...@gmail.com wrote: I agree, and modified the patch to use 32-bit atomics based on the idea suggested by Robert, and didn't modify lwlock.c. While looking at the patch, I found that the way it was initialising the list to be empty was wrong: it was using pgprocno 0 to initialise the list, but 0 is a valid pgprocno. I think we should use a number greater than PROCARRAY_MAXPROC (the maximum number of procs in the proc array). Apart from the above fix, I have modified src/backend/access/transam/README to include information about the improvement this patch brings to reduce ProcArrayLock contention. I spent some time looking at this patch today and found that a good deal of cleanup work seemed to be needed. Attached is a cleaned-up version which makes a number of changes: 1. I got rid of all of the typecasts. You're supposed to treat pg_atomic_u32 as a magic data type that is only manipulated via the primitives provided, not just cast back and forth between that and u32. 2. I got rid of the memory barriers. System calls are full barriers, and so are compare-and-exchange operations. Between those two facts, we should be fine without these. I have kept the barriers based on the comment on top of the atomic read; see the code below: * No barrier semantics. */ STATIC_IF_INLINE uint32 pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr) Note - The function header comments on pg_atomic_read_u32 and the corresponding write call seem to be reversed, but that is something separate. I'm not entirely happy with the name nextClearXidElem but apart from that I'm fairly happy with this version. We should probably test it to make sure I haven't broken anything; Okay, will look into it tomorrow. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Incorrect comment about abbreviated keys
On Sun, Aug 2, 2015 at 5:11 AM, Peter Geoghegan p...@heroku.com wrote: Attached patch fixes this issue. This was missed by 78efd5c1edb59017f06ef96773e64e6539bfbc86 Committed and back-patched to 9.5. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reduce ProcArrayLock contention
On 2015-08-04 21:20:20 +0530, Amit Kapila wrote: I have kept barriers based on comments on top of atomic read, refer below code: * No barrier semantics. */ STATIC_IF_INLINE uint32 pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr) Note - The function header comments on pg_atomic_read_u32 and corresponding write call seems to be reversed, but that is something separate. Well, the question is whether you *need* barrier semantics in that place. If you just have a retry loop around a compare/exchange there's no point in having one, it'll just cause needless slowdown due to another bus-lock. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
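Concretely, the retry loop in question looks something like this minimal sketch against PostgreSQL's 32-bit atomics API; since pg_atomic_compare_exchange_u32() is itself a full barrier, wrapping it in explicit pg_memory_barrier() calls only buys another bus lock, and the plain read before the loop needs no barrier either because a stale value is simply corrected by a CAS failure (the list layout is illustrative):

#include "postgres.h"

#include "port/atomics.h"

static void
push_onto_list(pg_atomic_uint32 *head, uint32 my_pgprocno, uint32 *links)
{
	uint32		oldhead = pg_atomic_read_u32(head);

	for (;;)
	{
		links[my_pgprocno] = oldhead;	/* link ourselves to current head */

		/* On failure, oldhead is refreshed to the current value; retry. */
		if (pg_atomic_compare_exchange_u32(head, &oldhead, my_pgprocno))
			break;
	}
}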
Re: [HACKERS] tablecmds.c and lock hierarchy
On Wed, Aug 5, 2015 at 3:05 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Aug 4, 2015 at 2:41 AM, Michael Paquier michael.paqu...@gmail.com wrote: Yep, true as things stand now. But this would get broken if we add a new lock level between ShareRowExclusiveLock and AccessExclusiveLock that does not respect the current monotone hierarchy between those. But we're probably not going to do that, so it doesn't matter; and if we do do it, we can worry about it then. I don't think this is worth getting concerned about now. OK. Then let me suggest the attached Assert safeguard. It ensures that all the lock modes used follow a monotone hierarchy, per the definition of the conflict table. That looks like cheap insurance... In any case, this means as well that we should move on with the current logic presented by Fabrizio on the other thread. -- Michael diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 970abd4..3aff387 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -3057,8 +3057,13 @@ AlterTableGetLockLevel(List *cmds) } /* - * Take the greatest lockmode from any subcommand + * Take the greatest lockmode from any subcommand following a monotone + * hierarchy. */ + Assert(cmd_lockmode == ShareUpdateExclusiveLock || + cmd_lockmode == ShareRowExclusiveLock || + cmd_lockmode == ExclusiveLock || + cmd_lockmode == AccessExclusiveLock); if (cmd_lockmode > lockmode) lockmode = cmd_lockmode; } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Make timestamptz_out less slow.
On 29 July 2015 at 03:25, Andres Freund and...@anarazel.de wrote: On 2015-07-29 03:10:41 +1200, David Rowley wrote: timestamp_out() = 2015-07-29 02:24:33.34 in 3.506000 timestamp_out_old() = 2015-07-29 02:24:33.034 in 64.518000 timestamp_out_af() = 2015-07-29 02:24:33.034 in 2.981000 timestamp_out_old is master's version, timestamp_out_af() is yours, and timestamp_out() is mine. Times are in seconds to perform 100 million calls. That looks good. So it appears your version is a bit faster than mine, but we're both about 20 times faster than the current one. Also mine needs fixing up, as the fractional part is not padded the same as yours, but I doubt that'll affect the performance by much. Worthwhile to finish that bit and try ;) My view: It's probably not worth going quite as far as you've gone for a handful of nanoseconds per call, but perhaps something along the lines of mine can be fixed up. Yes, I agree that yours is probably going to be fast enough. Have you thought about what to do when HAVE_INT64_TIMESTAMP is not defined? I don't think it's actually important. The only difference vs float timestamps is that in the latter case we set fsecs to zero BC. Unless we want to slow down the common case, it seems not unlikely that we're going to end up with a separate slow path anyway. E.g. neither your version nor mine handles 5-digit years (which is why I fell back to the slow path in that case in my patch). It occurred to me that handling the 5-digit year is quite a simple change to my code:

static char *
pg_uint2str_padding(char *str, unsigned int value, unsigned int padding)
{
	char	   *start = str;
	char	   *end = &str[padding];
	unsigned int num = value;

	//Assert(padding > 0);
	*end = '\0';
	while (padding--)
	{
		str[padding] = num % 10 + '0';
		num /= 10;
	}

	/*
	 * if value was too big for the specified padding then rebuild the whole
	 * number again without padding. Conveniently pg_uint2str() does exactly
	 * this.
	 */
	if (num > 0)
		return pg_uint2str(str, value);

	return end;
}

We simply just need to check whether there was any 'num' left after consuming the given space. If there's any left then just use pg_uint2str(). This keeps things very fast for the likely most common case, where the year is 4 digits long. I've not thought about negative years. The whole function should perhaps take signed ints instead of unsigned. Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
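For completeness, a minimal sketch of the pg_uint2str() helper the code above falls back to; this is an assumption about its contract (write the decimal digits of 'value' at 'str' and return a pointer to the terminating NUL), not the patch's actual implementation:

static char *
pg_uint2str(char *str, unsigned int value)
{
	char	   *start = str;
	char	   *end;

	/* Emit digits least-significant first. */
	do
	{
		*str++ = (char) (value % 10) + '0';
		value /= 10;
	} while (value > 0);

	end = str;
	*end = '\0';

	/* Reverse into most-significant-first order. */
	for (str--; start < str; start++, str--)
	{
		char		tmp = *start;

		*start = *str;
		*str = tmp;
	}
	return end;
}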
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On 2015-08-05 AM 06:11, Robert Haas wrote: On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: On 2015-08-03 PM 09:24, Ashutosh Bapat wrote: For postgres_fdw it's a boolean server-level option 'twophase_compliant' (suggestions for name welcome). How about just 'twophase'? How about two_phase_commit? Much cleaner, +1 Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Wed, Aug 5, 2015 at 2:15 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Andres Freund wrote: On 2015-08-03 14:15:27 +0900, Michael Paquier wrote: As long as this only applies on master, this may be fine... We could basically pass a LOCKMASK to the multiple layers of tablecmds.c instead of LOCKMODE to track all the locks that need to be taken, and all the relations open during operations. This sounds far too complicated to me. Just LockRelationOid() the relation with the appropriate level every time you pass through the function? That opens the door to lock escalation and deadlocks, doesn't it? You are probably thinking that it's okay to ignore those, but I don't necessarily agree with that. Yes, I think so, as long as we would try to take multiple locks... And we go back to the lock hierarchy then, because there is no way to define in some cases which lock is strictly stronger than the other. Honestly I think that we should go with Fabrizio's version doing the lock comparison, with an assert safeguard in AlterTableGetLockLevel. The logic would only get broken if ALTER TABLE begins to use a lock level that cannot be ordered relative to the others, and that's not the case now. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablecmds.c and lock hierarchy
On Wed, Aug 5, 2015 at 11:47 AM, Noah Misch n...@leadboat.com wrote: On Tue, Aug 04, 2015 at 07:35:43AM +0100, Simon Riggs wrote: On 4 August 2015 at 05:56, Michael Paquier michael.paqu...@gmail.com wrote: The thing is that, as mentioned by Alvaro and Andres on this thread, we have no guarantee that the different relation locks compared have a monotone hierarchy, and we may end up taking a lock that does not behave as you would like. We are now lucky that ALTER TABLE only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and AccessExclusiveLock, which actually do have a hierarchy, so this is not a problem yet. However it may become a problem if we add more lock modes in the future that are used by ALTER TABLE. Please provide the link to the discussion of this. I don't see a problem here right now that can't be solved by saying Assert(locklevel==ShareUpdateExclusiveLock || locklevel>ShareRowExclusiveLock); Agreed; that addresses the foreseeable future of this threat. Some sub-commands are using ShareRowExclusiveLock... So this one is better :) Assert(locklevel==ShareUpdateExclusiveLock || locklevel>=ShareRowExclusiveLock); Or we simply list all the locks allowed individually... But that's a minor point. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sharing aggregate states between different aggregate functions
On 5 August 2015 at 03:03, Heikki Linnakangas hlinn...@iki.fi wrote: On 08/03/2015 08:53 AM, David Rowley wrote: Attached is a delta patched which is based on sharing_aggstate-heikki-2.patch to fix up the out-dated comments and also a few more test scenarios which test the sharing works with matching INITCOND and that it does not when they don't match. What do you think? I committed this, after some more cleanup of the comments. Thanks! Great! Thanks for doing the cleanups and committing it. -- David Rowley http://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: On 2015-08-03 PM 09:24, Ashutosh Bapat wrote: On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas robertmh...@gmail.com wrote: OK, sure. But let's make sure postgres_fdw gets a server-level option to control this. For postgres_fdw it's a boolean server-level option 'twophase_compliant' (suggestions for name welcome). How about just 'twophase'? How about two_phase_commit? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] brin index vacuum versus transaction snapshots
Alvaro Herrera wrote: Thankfully I found another way to solve it, which is to forgo usage of MVCC here and instead use SnapshotAny. There's already a mode in IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks to do what we need. I'm going to propose a patch along those lines shortly. Here's the promised patch. Includes a new isolation test (which fails with the old code in two ways: first when using VACUUM it causes the error message Kevin didn't like to be raised; second when using brin_summarize_new_values it causes the inserted tuple to not be part of the summary, which is wrong). This test leaves the pageinspect extension installed in the isolationtest database, but that seems fine to me. (No other isolation test installs an extension.) The error message Kevin complained about is gone, as is some related commentary. This is replaced by tweaks in IndexBuildHeapRangeScan that know to do a SnapshotAny scan without being noisy about in progress insertions/deletions. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 360b26e..de32cca 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -951,9 +951,13 @@ summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel, * Execute the partial heap scan covering the heap blocks in the specified * page range, summarizing the heap tuples in it. This scan stops just * short of brinbuildCallback creating the new index entry. + * + * Note that it is critical we use the any visible mode of + * IndexBuildHeapRangeScan here: otherwise, we would miss tuples inserted + * by transactions that are still in progress, among other corner cases. */ state-bs_currRangeStart = heapBlk; - IndexBuildHeapRangeScan(heapRel, state-bs_irel, indexInfo, false, + IndexBuildHeapRangeScan(heapRel, state-bs_irel, indexInfo, false, true, heapBlk, state-bs_pagesPerRange, brinbuildCallback, (void *) state); @@ -1058,36 +1062,6 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized, state = initialize_brin_buildstate(index, revmap, pagesPerRange); indexInfo = BuildIndexInfo(index); - -/* - * We only have ShareUpdateExclusiveLock on the table, and - * therefore other sessions may insert tuples into the range - * we're going to scan. This is okay, because we take - * additional precautions to avoid losing the additional - * tuples; see comments in summarize_range. Set the - * concurrent flag, which causes IndexBuildHeapRangeScan to - * use a snapshot other than SnapshotAny, and silences - * warnings emitted there. - */ -indexInfo-ii_Concurrent = true; - -/* - * If using transaction-snapshot mode, it would be possible - * for another transaction to insert a tuple that's not - * visible to our snapshot if we have already acquired one, - * when in snapshot-isolation mode; therefore, disallow this - * from running in such a transaction unless a snapshot hasn't - * been acquired yet. - * - * This code is called by VACUUM and - * brin_summarize_new_values. Have the error message mention - * the latter because VACUUM cannot run in a transaction and - * thus cannot cause this issue. 
- */ -if (IsolationUsesXactSnapshot() FirstSnapshotSet) - ereport(ERROR, - (errcode(ERRCODE_INVALID_TRANSACTION_STATE), - errmsg(brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot))); } summarize_range(indexInfo, state, heapRel, heapBlk); @@ -,7 +1085,10 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized, /* free resources */ brinRevmapTerminate(revmap); if (state) + { terminate_brin_buildstate(state); + pfree(indexInfo); + } } /* diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 69f35c9..e59b163 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -2161,6 +2161,7 @@ IndexBuildHeapScan(Relation heapRelation, { return IndexBuildHeapRangeScan(heapRelation, indexRelation, indexInfo, allow_sync, + false, 0, InvalidBlockNumber, callback, callback_state); } @@ -2170,12 +2171,17 @@ IndexBuildHeapScan(Relation heapRelation, * number of blocks are scanned. Scan to end-of-rel can be signalled by * passing InvalidBlockNumber as numblocks. Note that restricting the range * to scan cannot be done when requesting syncscan. + * + * When anyvisible mode is requested, all tuples visible to any transaction + * are considered, including those inserted or deleted by transactions that are + * still in progress. */ double IndexBuildHeapRangeScan(Relation heapRelation, Relation indexRelation,
Re: [HACKERS] [DOCS] max_worker_processes on the standby
Adding CC to hackers, since this is clearly not just a docs issue. Also CCing Petr and Craig since they are the ones that know how this is used in BDR. Robert Haas wrote: On Tue, Aug 4, 2015 at 12:41 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: The alternative is to turn the feature on automatically if it sees that the master also has it on, i.e. the value would not be what the config file says it is. Doing this would be a bit surprising IMO, but given the behavior above maybe it's better than the current behavior. I think it's totally reasonable for the standby to follow the master's behavior rather than the config file. That should be documented, but otherwise, no problem. If it were technologically possible for the standby to follow the config file rather than the master in all cases, that would be fine, too. But the current behavior is somewhere in the middle, and that doesn't seem like a good plan. So I discussed this with Petr. He points out that if we make the standby follow the master, then the problem would be the misbehavior that results once the standby is promoted: at that point the standby would no longer follow the master and would start with the feature turned off, which could be disastrous (depending on what you are using the commit timestamps for). To solve that problem, you could suggest that if we see the setting turned on in pg_control then we should follow that instead of the config file; but then the problem is that there's no way to turn the feature off. And things are really crazy by then. Given this, we're leaning towards the idea that the standby should not try to follow the master at all. Instead, an extension that wants to use this stuff can check the value for itself, and raise a fatal error if it's not already turned on in the config file (a sketch of such a check follows below). That way, a lot of the strange corner cases disappear. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
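A sketch of the kind of check being suggested, assuming the feature in question is track_commit_timestamp (the commit-timestamp GUC this thread concerns); _PG_init() is the standard extension load hook and the error wording is illustrative:

#include "postgres.h"

#include "access/commit_ts.h"	/* declares track_commit_timestamp */

void
_PG_init(void)
{
	/* Refuse to load if the feature this extension depends on is off. */
	if (!track_commit_timestamp)
		ereport(FATAL,
				(errmsg("this extension requires track_commit_timestamp to be enabled")));
}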
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
Piotr Stefaniak postg...@piotr-stefaniak.me writes: On 08/03/2015 09:18 PM, Tom Lane wrote: ... but I can't reproduce it on HEAD with either of these queries. Not clear why you're getting different results. I'm terribly sorry, but I didn't notice that postgresql.conf was modified... Set join_collapse_limit = 32 and you should see the error. Ah ... now I get that error on the smaller query, but the larger one (that you put in an attachment) still doesn't show any problem. Do you have any other nondefault settings? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
Hi there, I've been following the sqlsmith work and wanted to jump in and try it out. I took Peter's idea and tried building postgres with the flags suggested but it was hard to get anything working. I'm on commit 85e5e222b1dd02f135a8c3bf387d0d6d88e669bd (Tue Aug 4 14:55:32 2015 -0400) Configure arguments: ./configure --prefix=$HOME/pkg CC=clang CFLAGS='-O1 -g -fsanitize=address -fno-omit-frame-pointer -fno-optimize-sibling-calls' --enable-cassert I had to make a simple leak suppression file:

$ cat leak.supp
leak:save_ps_display_args
leak:__GI___strdup
$ export LSAN_OPTIONS=suppressions=leak.supp

And then I could run postgres. After 50,000 queries, I'm left with the following report:

queries: 50514
AST stats (avg): height = 7.29877 nodes = 37.8156
 296 ERROR: canceling statement due to statement timeout
 166 ERROR: invalid regular expression: quantifier operand invalid
  26 ERROR: could not determine which collation to use for string comparison
  23 ERROR: cannot compare arrays of different element types
  12 ERROR: invalid regular expression: brackets [] not balanced
   5 ERROR: cache lookup failed for index 2619
   2 ERROR: invalid regular expression: parentheses () not balanced
error rate: 0.0104921

AddressSanitizer didn't fire except for the suppressed leaks. The suppressed leaks were only hit at the beginning:

Suppressions used:
  count      bytes template
      1        520 save_ps_display_args
      1         10 __GI___strdup

sqlsmith is a cool little piece of kit and I see a lot of room for ongoing work (performance bumps for more queries per second; more db backends; different fuzzers). Yours, Ewan Higgs From: Peter Geoghegan p...@heroku.com To: Andreas Seltenreich seltenre...@gmx.de Cc: Pg Hackers pgsql-hackers@postgresql.org Sent: Sunday, 2 August 2015, 10:39 Subject: Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c On Fri, Jul 31, 2015 at 5:56 PM, Andreas Seltenreich seltenre...@gmx.de wrote: sqlsmith triggered the following assertion in master (c188204). Thanks for writing sqlsmith. It seems like a great tool. I wonder, are you just running the tool with assertions enabled when PostgreSQL is built? If so, it might make sense to make various problems more readily detected. As you may know, Clang has a pretty decent option called AddressSanitizer that can detect memory errors as they occur with an overhead that is not excessive. One might use the following configure arguments when building PostgreSQL to use AddressSanitizer: ./configure CC=clang CFLAGS='-O1 -g -fsanitize=address -fno-omit-frame-pointer -fno-optimize-sibling-calls' --enable-cassert Of course, it remains to be seen if this pays for itself. Apparently the tool has about a 2x overhead [1]. I'm really not sure that you'll find any more bugs this way, but it's certainly possible that you'll find a lot more. Given your success in finding bugs without using AddressSanitizer, introducing it may be premature. [1] http://clang.llvm.org/docs/AddressSanitizer.html#introduction -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On Tue, Aug 4, 2015 at 2:29 AM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: Perhaps, it may have to do with how EXCLUDED pseudo-rel's targetlist is manipulated through parse-plan stage? I think so, yes. I'll look into writing a fix for this later in the week. Thanks for the report, Geoff, and thanks for the analysis Amit. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablecmds.c and lock hierarchy
On Wed, Aug 5, 2015 at 2:23 AM, Alvaro Herrera wrote: Michael Paquier wrote: On Tue, Aug 4, 2015 at 2:43 PM, Alvaro Herrera wrote: Now, let's take for example this case with locks A, B, C, D:
- Lock A conflicts with ACD
- B with BCD
- C with itself
- D with itself
What would you choose as a result of add(C,D)? A or B? Or the super lock conflicting with all of them? Actually the answer is the sum of all the potential candidates. This converges to AccessExclusiveLock. This appears to me a hypothetical case that I don't think occurs in our conflicts table, so I wouldn't worry about it. No it doesn't. I am using a huge hypothetical "if" here :) -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reduce ProcArrayLock contention
On Tue, Aug 4, 2015 at 11:50 AM, Amit Kapila amit.kapil...@gmail.com wrote: I have kept barriers based on the comments on top of the atomic read; see the code below:

 * No barrier semantics.
 */
STATIC_IF_INLINE uint32
pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)

Note - the function header comments on pg_atomic_read_u32 and the corresponding write call seem to be reversed, but that is something separate. That doesn't matter, because the compare-and-exchange *is* a barrier. Putting a barrier between a store and an operation that is already a barrier doesn't do anything useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
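To illustrate the point: the usual pattern reads the value without any barrier and relies on the compare-and-exchange, which is a full barrier, to catch any staleness. A minimal sketch using the pg_atomic_* API from port/atomics.h; the wrapping function is invented for illustration:

#include "port/atomics.h"

static void
clear_bits(volatile pg_atomic_uint32 *state, uint32 mask)
{
	/* Plain read; no barrier needed, any staleness is caught by the CAS */
	uint32		oldval = pg_atomic_read_u32(state);

	while (true)
	{
		uint32		newval = oldval & ~mask;

		/*
		 * The compare-and-exchange is itself a full memory barrier, so an
		 * explicit barrier before it would be redundant.  On failure it
		 * refreshes oldval with the current value and we simply retry.
		 */
		if (pg_atomic_compare_exchange_u32(state, &oldval, newval))
			break;
	}
}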
Re: [HACKERS] upgrade failure from 9.5 to head
Robert Haas robertmh...@gmail.com writes: On Sun, Aug 2, 2015 at 8:20 PM, Stephen Frost sfr...@snowman.net wrote: +1. I was doing testing the other day and ran into the pg_dump doesn't support shell types issue and it was annoyingly confusing. Is anyone working on this? Should it be added to the open items list? I plan to work on it as soon as I'm done fixing the planner issue Andreas found (which I'm almost done with). Not sure whether we should consider it a back-patchable bug fix or something to do only in HEAD, though --- comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upgrade failure from 9.5 to head
Robert Haas robertmh...@gmail.com writes: On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund and...@anarazel.de wrote: On 2015-08-04 13:52:54 -0400, Tom Lane wrote: Not sure whether we should consider it a back-patchable bug fix or something to do only in HEAD, though --- comments? Tentatively I'd say it's a bug and should be back-patched. Agreed. If investigation turns up reasons to worry about back-patching it, I'd possibly back-track on that position, but I think we should start with the notion that it is back-patchable and retreat from that position only at need. OK. Certainly we can fix 9.5 the same way as HEAD; the pg_dump code hasn't diverged much yet. I'll plan to push it as far back as convenient, but I won't expend any great effort on making the older branches do it if they turn out to be too different. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
Andres Freund wrote: On 2015-08-03 14:15:27 +0900, Michael Paquier wrote: As long as this only applies on master, this may be fine... We could basically pass a LOCKMASK to the multiple layers of tablecmds.c instead of LOCKMODE to track all the locks that need to be taken, and all the relations open during operations. This sounds far too complicated to me. Just LockRelationOid() the relation with the appropriate level every time you pass through the function? That opens the door to lock escalation and deadlocks, doesn't it? You are probably thinking that it's okay to ignore those, but I don't necessarily agree with that. -- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablecmds.c and lock hierarchy
Michael Paquier wrote: On Tue, Aug 4, 2015 at 2:43 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Maybe the solution to this is to add the concept of addition of two lock modes, where the result is another lock mode that conflicts with any lock that would conflict with either of the two operand lock modes. That's commutative, as this basically takes the union of the conflict-table bits for lock A and lock B, and then selects the lock mode whose conflict set covers that whole union with the minimum number of extra conflicts. Yes. Now, let's take for example this case with locks A, B, C, D:
- Lock A conflicts with ACD
- B with BCD
- C with itself
- D with itself
What would you choose as a result of add(C,D)? A or B? Or the super lock conflicting with all of them? This appears to me a hypothetical case that I don't think occurs in our conflicts table, so I wouldn't worry about it. -- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
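The hypothetical case is easy to play with in isolation. Below is a standalone sketch of the "addition" using the A-D example from this thread, with a conflicts table in the style of lock.c's LockConflicts[]; everything here is illustrative, not tree code. Note that both A and B cover add(C,D), and a weakest-first scan simply returns whichever comes first, which is exactly the ambiguity being pointed at when the modes have no monotone hierarchy:

#include <stdio.h>

#define LOCKBIT(m) (1 << (m))

enum { LOCK_A = 1, LOCK_B, LOCK_C, LOCK_D, NMODES };

/* conflicts[m] is the bitmask of modes that conflict with mode m */
static const int conflicts[NMODES] = {
	0,
	LOCKBIT(LOCK_A) | LOCKBIT(LOCK_C) | LOCKBIT(LOCK_D),	/* A: ACD */
	LOCKBIT(LOCK_B) | LOCKBIT(LOCK_C) | LOCKBIT(LOCK_D),	/* B: BCD */
	LOCKBIT(LOCK_C),										/* C: itself */
	LOCKBIT(LOCK_D)											/* D: itself */
};

/* Return the first mode whose conflict set covers both operands' sets */
static int
add_lock_modes(int a, int b)
{
	int			needed = conflicts[a] | conflicts[b];
	int			mode;

	for (mode = 1; mode < NMODES; mode++)
		if ((conflicts[mode] & needed) == needed)
			return mode;
	return -1;				/* no single mode covers the union */
}

int
main(void)
{
	/* Both A and B conflict with C and D; the scan happens to pick A */
	printf("add(C,D) = %d\n", add_lock_modes(LOCK_C, LOCK_D));
	return 0;
}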
Re: [HACKERS] Reduce ProcArrayLock contention
On Tue, Aug 4, 2015 at 11:29 AM, Robert Haas robertmh...@gmail.com wrote: 4. I factored out the actual XID-clearing logic into a new function ProcArrayEndTransactionInternal instead of repeating it twice. On the flip side, I merged PushProcAndWaitForXidClear with PopProcsAndClearXids and renamed the result to ProcArrayGroupClearXid, since there seemed to be no need to separate them. Thinking about this a bit more, it's probably worth sticking an inline designation on ProcArrayEndTransactionInternal. Keeping the time for which we hold ProcArrayLock in exclusive mode down to the absolute minimum possible number of instructions seems like a good plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
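For illustration, the factored-out function would then look roughly like this - a sketch based on the description in this thread, with the body abbreviated to the XID-clearing steps; not the committed code:

/*
 * Inline, so the exclusive ProcArrayLock hold stays as short as possible.
 */
static inline void
ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
								TransactionId latestXid)
{
	pgxact->xid = InvalidTransactionId;
	proc->lxid = InvalidLocalTransactionId;
	pgxact->xmin = InvalidTransactionId;
	proc->recoveryConflictPending = false;

	/* ... clear vacuum flags, subxid cache, etc. ... */

	/* Also advance global latestCompletedXid while holding the lock */
	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
							  latestXid))
		ShmemVariableCache->latestCompletedXid = latestXid;
}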
Re: [HACKERS] Tab completion for CREATE SEQUENCE
On Mon, Aug 3, 2015 at 2:17 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Jul 7, 2015 at 9:14 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-19 06:41:19 +0000, Brendan Jurd wrote: I'm marking this Waiting on Author. Once the problems have been corrected, it should be ready for a committer. Vik, are you going to update the patch? Seeing no activity on this thread, and as it would be a waste not to finish it, I completed the patch as attached. The following things are done:
- Fixed code indentation
- Removed RESTART, SET SCHEMA, OWNER TO, and RENAME TO in CREATE SEQUENCE
- Added START WITH
- Added TEMP/TEMPORARY to the set of keywords tracked
At the same time, I am switching this patch to Ready for Committer. You didn't remove RESTART, so I did that. And this needed some more parentheses to make my compiler happy. Committed with those changes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED
On 2015-08-04 13:16:52 -0400, Robert Haas wrote: On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao masao.fu...@gmail.com wrote: track_commit_timestamp tracks COMMIT PREPARED as expected in standby server, but not in master server. Is this intentional? It should track COMMIT PREPARED even in master? Otherwise, we cannot use commit_timestamp feature to check the replication lag properly while we use 2PC. That sounds like it must be a bug. I think you should add it to the open items list. Yea, completely agreed. It's probably because twophase.c duplicates a good bit of xact.c. How about we also add a warning to the respective places that one should always check the others? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upgrade failure from 9.5 to head
On 2015-08-04 13:52:54 -0400, Tom Lane wrote: Not sure whether we should consider it a back-patchable bug fix or something to do only in HEAD, though --- comments? Tentatively I'd say it's a bug and should be back-patched. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FSM versus GIN pending list bloat
Jeff Janes wrote: The attached proof of concept patch greatly improves the bloat for both the insert and the update cases. You need to turn on both features: adding the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3). The first of those two things could probably be adopted for real, but the second probably is not acceptable. What is the right way to do this? I think that's the right way, only that I wouldn't call FSMVacuum() inside the loop but only once you're out of it. FWIW I had to add a FSMVacuum call somewhere in the BRIN code also, which I didn't like at all. If there's a way to optimize this bubbling up of some individual page all the way to the root level, that would probably be a lot better than what it's currently doing. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED
On Tue, Aug 4, 2015 at 1:18 PM, Andres Freund and...@anarazel.de wrote: On 2015-08-04 13:16:52 -0400, Robert Haas wrote: On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao masao.fu...@gmail.com wrote: track_commit_timestamp tracks COMMIT PREPARED as expected in standby server, but not in master server. Is this intentional? It should track COMMIT PREPARED even in master? Otherwise, we cannot use commit_timestamp feature to check the replication lag properly while we use 2PC. That sounds like it must be a bug. I think you should add it to the open items list. Yea, completely agreed. It's probably because twophase.c duplicates a good bit of xact.c. How about we also add a warning to the respective places that one should always check the others? Sounds like a good idea. Or we could try to, heh heh, refactor away the duplication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
On Tue, Aug 4, 2015 at 9:52 AM, Andres Freund and...@anarazel.de wrote: So my vote is for the original approach. (I've not read Josh's patch, so there might be something wrong with it in detail, but I like the basic approach.) +1 OK, committed and back-patched that all the way back to 9.0. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablecmds.c and lock hierarchy
On Tue, Aug 4, 2015 at 2:41 AM, Michael Paquier michael.paqu...@gmail.com wrote: Yep, true as things stand now. But this would get broken if we add a new lock level between ShareRowExclusiveLock and AccessExclusiveLock that does not respect the current monotone hierarchy between those. But we're probably not going to do that, so it doesn't matter; and if we do do it, we can worry about it then. I don't think this is worth getting concerned about now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED
On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao masao.fu...@gmail.com wrote: track_commit_timestamp tracks COMMIT PREPARED as expected in standby server, but not in master server. Is this intentional? It should track COMMIT PREPARED even in master? Otherwise, we cannot use commit_timestamp feature to check the replication lag properly while we use 2PC. That sounds like it must be a bug. I think you should add it to the open items list. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upgrade failure from 9.5 to head
On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund and...@anarazel.de wrote: On 2015-08-04 13:52:54 -0400, Tom Lane wrote: Not sure whether we should consider it a back-patchable bug fix or something to do only in HEAD, though --- comments? Tentatively I'd say it's a bug and should be back-patched. Agreed. If investigation turns up reasons to worry about back-patching it, I'd possibly back-track on that position, but I think we should start with the notion that it is back-patchable and retreat from that position only at need. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] brin index vacuum versus transaction snapshots
Robert Haas wrote: On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I think the only way to close this hole is to have summarize_range() sleep until all open snapshots are gone after inserting the placeholder tuple and before acquiring the snapshot, similarly to how CREATE INDEX CONCURRENTLY does it. That's gonna be really slow, though, right? Even if you rework things so that vacuum inserts all the placeholder tuples first, then waits, then does all the summarization, that could easily turn a vacuum that would have finished in a second into one that instead takes multiple hours. During that time an AV worker is pinned down, and all sorts of badness can ensue. Yeah, it is bad and I was concerned about that too. Thankfully I found another way to solve it, which is to forgo usage of MVCC here and instead use SnapshotAny. There's already a mode in IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks to do what we need. I'm going to propose a patch along those lines shortly. Maybe I'm misunderstanding, but this whole thing seems like a pretty serious problem for BRIN. :-( With this new approach it shouldn't be. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] subplan variable reference / unassigned NestLoopParams (was: [sqlsmith] Failed assertion in joinrels.c)
Andreas Seltenreich seltenre...@gmx.de writes: Tom Lane writes: Well, I certainly think all of these represent bugs: 3 | ERROR: plan should not reference subplan's variable 2 | ERROR: failed to assign all NestLoopParams to plan nodes These appear to be related. The following query produces the former, but if you replace the very last reference of provider with the literal 'bar', it raises the latter error. Fixed that, thanks for the test case!

,[ FWIW: git bisect run ]
| first bad commit: [e83bb10d6dcf05a666d4ada00d9788c7974ad378]
| Adjust definition of cheapest_total_path to work better with LATERAL.
`

There's still something fishy about your git bisect results; they don't have much to do with what seems to me to be the triggering condition. I suspect the problem is that git bisect doesn't allow for the possibility that the symptom might appear and disappear over time, ie it might have been visible at some early stage of the LATERAL work but been fixed later, and then reintroduced by still-later optimization efforts. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FSM versus GIN pending list bloat
On Tue, Aug 4, 2015 at 1:39 AM, Simon Riggs si...@2ndquadrant.com wrote: On 4 August 2015 at 06:03, Jeff Janes jeff.ja...@gmail.com wrote: The attached proof of concept patch greatly improves the bloat for both the insert and the update cases. You need to turn on both features: adding the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3). The first of those two things could probably be adopted for real, but the second probably is not acceptable. What is the right way to do this? Could a variant of RecordFreeIndexPage bubble the free space up the map immediately rather than waiting for a vacuum? It would only have to move up until it found a page with freespace already recorded in it, which the vast majority of the time would mean observing up one level and then not writing to it, assuming the pending list pages remain well clustered. You make a good case for action here since insert-only tables with GIN indexes on text are a common use case for GIN. Why would vacuuming the FSM be unacceptable? With a large gin_pending_list_limit it makes sense. But with a smallish gin_pending_list_limit (like the default 4MB) this could be called a lot (multiple times a second during some spurts), and would read the entire fsm each time. If it is unacceptable, perhaps we can avoid calling it every time, or simply have FreeSpaceMapVacuum() terminate more quickly on some kind of 80/20 heuristic for this case. Or maybe it could be passed a range of blocks which need vacuuming, so it concentrates on that range. But from the README file, it sounds like it is already supposed to be bubbling up. I'll have to see just what's going on there when I get a chance. Cheers, Jeff
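For reference, the two calls under discussion have simple public signatures, so the proof-of-concept shape is roughly the following. This is a sketch, not the actual patch; the wrapping function and its page list are invented:

#include "postgres.h"
#include "storage/freespace.h"
#include "storage/indexfsm.h"
#include "utils/rel.h"

/* Sketch: after flushing the pending list, hand the freed pages to the FSM */
static void
report_recycled_pages(Relation index, BlockNumber *freed, int nfreed)
{
	int			i;

	for (i = 0; i < nfreed; i++)
		RecordFreeIndexPage(index, freed[i]);	/* the "JJ_GIN=1" half */

	/*
	 * The "JJ_GIN=2" half: vacuum the whole FSM so the new entries bubble
	 * up to the upper levels immediately.  This is the part flagged above
	 * as too expensive to call this often; a targeted bubble-up of just
	 * these pages is the alternative being discussed.
	 */
	FreeSpaceMapVacuum(index);
}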
Re: [HACKERS] more-helpful-izing a debug message
On Wed, Jul 8, 2015 at 5:38 AM, Marko Tiikkaja ma...@joh.to wrote: One of the debug messages related to logical replication could be more helpful than it currently is. The attached patch reorders the two operations to make it so. Please consider patching and back-patching. Andres, this looks like a bug fix to me. What do you think? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev i.kurbangal...@postgrespro.ru wrote: A new version of the patch. I used your idea with macros, and with tranches that allowed us to remove array with names (they can be written directly to the corresponding tranche). You seem not to have addressed a few of the points I brought up here: http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirupjs...@mail.gmail.com -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On Tue, Aug 4, 2015 at 3:55 PM, Andres Freund and...@anarazel.de wrote: On 2015-08-04 15:45:44 -0400, Tom Lane wrote: I'm not sure that there's any great urgency about changing the instances that exist now; the real point of this discussion is that we will allow new code to use static inlines in headers. I agree that we don't have to (and probably shouldn't) make the required configure changes and change definitions. But I do think some of the current macro usage deserves to be cleaned up at some point. I, somewhere during 9.4 or 9.5, redefined some of the larger macros into static inlines and it both reduced the binary size and gave minor performance benefits. We typically recommend that people write their new code like the existing code. If we say that the standards for new code are now different from old code in this one way, I don't think that's going to be very helpful to anyone. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching
On Fri, Jul 3, 2015 at 8:33 PM, Peter Geoghegan p...@heroku.com wrote: Since apparently we're back to development work, I thought it was time to share a patch implementing a few additional simple tricks to make sorting text under a non-C locale even faster than in 9.5. These techniques are mostly effective when values are physically clustered together. This might be because there is a physical/logical correlation, but cases involving any kind of clustering of values are helped significantly. Interesting work. Some comments: 1. My biggest gripe with this patch is that the comments are not easy to understand. For example:

+ * Attempt to re-use buffers across calls.  Also, avoid work in the event
+ * of clustered together identical items by exploiting temporal locality.
+ * This works well with divide-and-conquer, comparison-based sorts like
+ * quicksort and mergesort.
+ *
+ * With quicksort, there is, in general, a pretty strong chance that the
+ * same buffer contents can be used repeatedly for pivot items -- early
+ * pivot items will account for large number of total comparisons, since
+ * they must be compared against many (possibly all other) items.

Well, what I would have written is something like: We're likely to be asked to compare the same strings repeatedly, and memcmp() is so much cheaper than memcpy() that it pays to attempt a memcmp() in the hopes of avoiding a memcpy(). This doesn't seem to slow things down measurably even if it doesn't work out very often.

+ * While it is worth going to the trouble of trying to re-use buffer
+ * contents across calls, ideally that will lead to entirely avoiding a
+ * strcoll() call by using a cached return value.
+ *
+ * This optimization can work well again and again for the same set of
+ * clustered together identical attributes; as they're relocated to new
+ * subpartitions, only one strcoll() is required for each pivot (in respect
+ * of that clump of identical values, at least).  Similarly, the final
+ * N-way merge of a mergesort can be effectively accelerated if each run
+ * has its own locally clustered values.

And here I would have written something like: If we're comparing the same two strings that we compared last time, we can return the same answer without calling strcoll() again. This is more likely than it seems, because quicksort compares the same pivot against many values, and some of those values might be duplicates. Of course everybody may prefer something different here; I'm just telling you what I think. 2. I believe the change to bttextcmp_abbrev() should be pulled out into a separate patch and committed separately. That part seems like a slam dunk. 3. What is the worst case for the strxfrm()-reuse stuff? I suppose it's the case where we have many strings that are long, all equal-length, and all different, but only in the last few characters. Then the memcmp() is as expensive as possible but never works out. How does the patch hold up in that case? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
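To make the caching concrete, here is a deliberately simplified standalone sketch of the second trick. The names are invented, the strings are assumed NUL-terminated, and the cache struct is assumed zero-initialized; the real patch additionally reuses buffers and handles abbreviated keys:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
	char	   *last1;			/* copy of last left-hand string */
	char	   *last2;			/* copy of last right-hand string */
	int			last_result;	/* cached strcoll() result */
	bool		valid;
} StrcollCache;

static int
strcoll_cached(StrcollCache *c, const char *a, const char *b)
{
	/*
	 * strcmp()/memcmp() is far cheaper than strcoll(), so probe the cache
	 * first; quicksort compares each pivot against many values, so exact
	 * repeats are common.
	 */
	if (c->valid && strcmp(c->last1, a) == 0 && strcmp(c->last2, b) == 0)
		return c->last_result;

	/* Cache miss: remember this pair and its locale-aware comparison */
	free(c->last1);
	free(c->last2);
	c->last1 = strdup(a);
	c->last2 = strdup(b);
	c->last_result = strcoll(a, b);
	c->valid = true;
	return c->last_result;
}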
Re: [HACKERS] FSM versus GIN pending list bloat
On Tue, Aug 4, 2015 at 6:35 AM, Simon Riggs si...@2ndquadrant.com wrote: On 4 August 2015 at 09:39, Simon Riggs si...@2ndquadrant.com wrote: On 4 August 2015 at 06:03, Jeff Janes jeff.ja...@gmail.com wrote: The attached proof of concept patch greatly improves the bloat for both the insert and the update cases. You need to turn on both features: adding the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3). The first of those two things could probably be adopted for real, but the second probably is not acceptable. What is the right way to do this? Could a variant of RecordFreeIndexPage bubble the free space up the map immediately rather than waiting for a vacuum? It would only have to move up until it found a page with freespace already recorded in it, which the vast majority of the time would mean observing up one level and then not writing to it, assuming the pending list pages remain well clustered. You make a good case for action here since insert-only tables with GIN indexes on text are a common use case for GIN. Why would vacuuming the FSM be unacceptable? With a large gin_pending_list_limit it makes sense. If it is unacceptable, perhaps we can avoid calling it every time, or simply have FreeSpaceMapVacuum() terminate more quickly on some kind of 80/20 heuristic for this case. Couple of questions here...

* the docs say it's desirable to have pending-list cleanup occur in the background, but there is no way to invoke that, except via VACUUM. I think we need a separate function to be able to call this as a background action. If we had that, we wouldn't need much else, would we?

I thought maybe the new bgworker framework would be a way to have a backend signal a bgworker to do the cleanup when it notices the pending list is getting large. But that wouldn't directly fix this issue, because the bgworker still wouldn't recycle that space (without further changes); only vacuum workers do that currently. But I don't think this could be implemented as an extension, because the signalling code has to be in core, so (not having studied the matter at all) I don't know if it is a good fit for bgworker.

* why do we have two parameters: gin_pending_list_limit and fastupdate? What happens if we set gin_pending_list_limit but don't set fastupdate?

Fastupdate is on by default. If it were turned off, then gin_pending_list_limit would be mostly irrelevant for those tables. Fastupdate could have been implemented as a magic value (0 or -1) of gin_pending_list_limit, but that would break backwards compatibility (and arguably would not be a better way of doing things anyway).

* how do we know how to set that parameter? Is there a way of knowing gin_pending_list_limit has been reached?

I don't think there is an easy answer to that. The trade-offs are complex and depend on things like how well cached the parts of the index needing insertions are, how many lexemes/array elements are in an average document, and how many documents inserted near the same time as each other share lexemes in common. And of course what you need to optimize for: latency or throughput, and if latency, search latency or insert latency.

This and the OP seem like 9.5 open items to me.

I don't think so. Freeing gin_pending_list_limit from being forcibly tied to work_mem is a good thing. Even if I don't know exactly how to set gin_pending_list_limit, I know I don't want it to be 4GB just because work_mem was set there for some temporary reason.
I'm happy to leave it at its default and let its fine tuning be a topic for people who really care about every microsecond of performance. Cheers, Jeff
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Tue, Aug 4, 2015 at 4:47 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev i.kurbangal...@postgrespro.ru wrote: A new version of the patch. I used your idea with macros, and with tranches that allowed us to remove array with names (they can be written directly to the corresponding tranche). You seem not to have addressed a few of the points I brought up here: http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirupjs...@mail.gmail.com More generally, I'd like to stop smashing all the things that need to be done here into one patch. We need to make some changes, such as the one I proposed earlier today, to make it easier to properly identify locks. Let's talk about how to do that and agree on the details. Then, once that's done, let's do the main part of the work afterwards, in a separate commit. We're running through patch versions at light speed here, but I'm not sure we're really building consensus around how to do things. The actual technical work here isn't really the problem; that part is easy. The hard part is agreeing on the details of how it should work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
Andres Freund and...@anarazel.de writes: On 2015-08-04 15:20:14 -0400, Robert Haas wrote: OK, so do we want to rip out all instances of the static inline dance in favor of more straightforward coding? Do we then shut pandemelon and any other affected buildfarm members down as unsupported, or what? I think all that happens is that they'll log a couple more warnings about defined but unused static functions. configure already defines inline away if not supported. Right. We had already concluded that this would be safe to do, it's just a matter of somebody being motivated to do it. I'm not sure that there's any great urgency about changing the instances that exist now; the real point of this discussion is that we will allow new code to use static inlines in headers. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
Ildus Kurbangaliev wrote: A new version of the patch. I used your idea with macros, and with tranches, which allowed us to remove the array with names (they can be written directly to the corresponding tranche). Just a bystander here, I haven't reviewed this patch at all, but I have two questions: 1. have you tested this under -DEXEC_BACKEND? I wonder if those initializations are going to work on Windows. 2. why keep the SLRU control locks as individual locks? Surely we could put them in the SlruCtl struct and get rid of a few individual lwlocks? Also, I wonder if we shouldn't be doing this in two parts, one that changes the underlying lwlock structure and another one to change pg_stat_activity. We have CppAsString() in c.h IIRC, which we use instead of # (we do use ## in a few places though). I wonder if that stuff has any value anymore. -- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On 2015-08-04 15:20:14 -0400, Robert Haas wrote: OK, so do we want to rip out all instances of the static inline dance in favor of more straightforward coding? Do we then shut pandemelon and any other affected buildfarm members down as unsupported, or what? I think all that happens is that they'll log a couple more warnings about defined but unused static functions. configure already defines inline away if not supported. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas robertmh...@gmail.com wrote: Interesting work. Thanks. 1. My biggest gripe with this patch is that the comments are not easy to understand. Of course everybody may prefer something different here; I'm just telling you what I think. I have struggled with trying to put just the right amount of exposition on the theory behind a particular optimization in source code comments, and things like that. Since no one is telling me that I need to write more, clearly I don't have the balance right yet. To a certain extent, it is a matter of personal style, but I'll try and be more terse. 2. I believe the change to bttextcmp_abbrev() should be pulled out into a separate patch and committed separately. That part seems like a slam dunk. Makes sense. 3. What is the worst case for the strxfrm()-reuse stuff? I suppose it's the case where we have many strings that are long, all equal-length, and all different, but only in the last few characters. Then the memcmp() is as expensive as possible but never works out. How does the patch hold up in that case? I haven't tested it. I'll get around to it at some point in the next couple of weeks. I imagine that it's exactly the same as the memcmp() equality thing because of factors like speculative execution, and the fact that we need both strings in cache anyway. It's almost exactly the same story, although unlike the memcmp() opportunistic equality pre-check thing, this check happens only n times, not n log n times. I'm quite sure that the cost needs to be virtually zero to go ahead with the idea. I think it probably is. Note that like the memcmp() thing, we check string length first, before a memcmp(). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching
On Tue, Aug 4, 2015 at 1:30 PM, Peter Geoghegan p...@heroku.com wrote: 2. I believe the change to bttextcmp_abbrev() should be pulled out into a separate patch and committed separately. That part seems like a slam dunk. Makes sense. BTW, I want to put the string_uint() macro in a common header now. It can be used for other types. I've written a SortSupport + abbreviated keys patch for the UUID type which will use it, too, so that it too can use simple unsigned integer comparisons within its abbreviated comparator. I haven't posted the UUID patch yet only because everyone is up to their ears in my sorting patches. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On 2015-08-04 15:45:44 -0400, Tom Lane wrote: I'm not sure that there's any great urgency about changing the instances that exist now; the real point of this discussion is that we will allow new code to use static inlines in headers. I agree that we don't have to (and probably shouldn't) make the required configure changes and change definitions. But I do think some of the current macro usage deserves to be cleaned up at some point. I, somewhere during 9.4 or 9.5, redefined some of the larger macros into static inlines and it both reduced the binary size and gave minor performance benefits. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On Thu, Jul 2, 2015 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: So far this thread is all about the costs of desupporting compilers that don't have these features, and you're making a good argument (that I think we all agree with) that the cost is small. But you haven't really said what the benefits are. I made the same remark with respect to varargs macros, and I continue to think that the cost-benefit ratio there is pretty marginal. However, I do think that there's a case to be made for adopting static inline. The INCLUDE_DEFINITIONS dance is very inconvenient, so it discourages people from using static inlines over macros, even in cases where the macro approach is pretty evil (usually, because of multiple- evaluation hazards). If we allowed static inlines then we could work towards having a safer coding standard along the lines of thou shalt not write macros that evaluate their arguments more than once. So the benefits here are easier development and less risk of bugs. On the other side of the ledger, the costs are pretty minimal; we will not actually break any platforms, at worst we'll make them unpleasant to develop on because of lots of warning messages. We have some platforms like that already, and it's not been a huge problem. OK, so do we want to rip out all instances of the static inline dance in favor of more straightforward coding? Do we then shut pandemelon and any other affected buildfarm members down as unsupported, or what? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
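The multiple-evaluation hazard is easy to demonstrate with a toy example (not from the tree):

/* Classic macro: evaluates its arguments more than once */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* static inline: each argument is evaluated exactly once */
static inline int
max_inline(int a, int b)
{
	return (a > b) ? a : b;
}

/*
 * MAX_MACRO(*p++, 0) increments p either once or twice depending on the
 * comparison's outcome; max_inline(*p++, 0) always increments it once.
 */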
Re: [HACKERS] Dependency between bgw_notify_pid and bgw_flags
On Tue, Jul 7, 2015 at 4:34 AM, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote: With that notion of backend, to fix the original problem I reported, PostmasterMarkPIDForWorkerNotify() should also look at the BackgroundWorkerList. As per the comments in the prologue of this function, it assumes that only backends need to be notified about worker state change, which looks too restrictive. Any backend or background worker, which wants to be notified about a background worker it requested to be spawned, should be allowed to do so. Yeah. I'm wondering if we should fix this problem by just insisting that all workers have an entry in BackendList. At first look, this seems like it would make the code a lot simpler and have virtually no downside. As it is, we're repeatedly reinventing new and different ways for unconnected background workers to do things that work fine in all other cases, and I don't see the point of that. I haven't really tested the attached patch - sadly, we have no testing of background workers without shared memory access in the tree - but it shows what I have in mind. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 000524d..1818f7c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -155,8 +155,7 @@
  * they will never become live backends.  dead_end children are not assigned a
  * PMChildSlot.
  *
- * Background workers that request shared memory access during registration are
- * in this list, too.
+ * Background workers are in this list, too.
  */
 typedef struct bkend
 {
@@ -404,13 +403,11 @@ static long PostmasterRandom(void);
 static void RandomSalt(char *md5Salt);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
-static bool SignalUnconnectedWorkers(int signal);
 static void TerminateChildren(int signal);

 #define SignalChildren(sig) SignalSomeChildren(sig, BACKEND_TYPE_ALL)

 static int	CountChildren(int target);
-static int	CountUnconnectedWorkers(void);
 static void maybe_start_bgworker(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(AuxProcType type);
@@ -2414,7 +2411,6 @@ SIGHUP_handler(SIGNAL_ARGS)
 				(errmsg("received SIGHUP, reloading configuration files")));
 		ProcessConfigFile(PGC_SIGHUP);
 		SignalChildren(SIGHUP);
-		SignalUnconnectedWorkers(SIGHUP);
 		if (StartupPID != 0)
 			signal_child(StartupPID, SIGHUP);
 		if (BgWriterPID != 0)
@@ -2491,7 +2487,6 @@ pmdie(SIGNAL_ARGS)
 				/* and bgworkers too; does this need tweaking? */
 				SignalSomeChildren(SIGTERM,
 								   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
-				SignalUnconnectedWorkers(SIGTERM);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
@@ -2543,11 +2538,11 @@ pmdie(SIGNAL_ARGS)
 				signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
-			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
+				SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
 				/*
-				 * Only startup, bgwriter, walreceiver, unconnected bgworkers,
+				 * Only startup, bgwriter, walreceiver, possibly bgworkers,
 				 * and/or checkpointer should be active in this state; we just
 				 * signaled the first four, and we don't want to kill
 				 * checkpointer yet.
@@ -2999,25 +2994,21 @@ CleanupBackgroundWorker(int pid,
 	}

 	/* Get it out of the BackendList and clear out remaining data */
-	if (rw->rw_backend)
-	{
-		Assert(rw->rw_worker.bgw_flags & BGWORKER_BACKEND_DATABASE_CONNECTION);
-		dlist_delete(&rw->rw_backend->elem);
+	dlist_delete(&rw->rw_backend->elem);
 #ifdef EXEC_BACKEND
-		ShmemBackendArrayRemove(rw->rw_backend);
+	ShmemBackendArrayRemove(rw->rw_backend);
 #endif
-		/*
-		 * It's possible that this background worker started some OTHER
-		 * background worker and asked to be notified when that worker
-		 * started or stopped.  If so, cancel any notifications destined
-		 * for the now-dead backend.
-		 */
-		if (rw->rw_backend->bgworker_notify)
-			BackgroundWorkerStopNotifications(rw->rw_pid);
-		free(rw->rw_backend);
-		rw->rw_backend = NULL;
-	}
+	/*
+	 * It's possible that this background worker started some OTHER
+	 * background worker and asked to be notified when that worker
+	 * started or stopped.  If so, cancel any notifications destined
+	 * for the now-dead backend.
+	 */
+	if (rw->rw_backend->bgworker_notify)
+		BackgroundWorkerStopNotifications(rw->rw_pid);
+	free(rw->rw_backend);
+	rw->rw_backend = NULL;
 	rw->rw_pid = 0;
 	rw->rw_child_slot = 0;
 	ReportBackgroundWorkerPID(rw);	/* report child death */
@@ -3160,15 +3151,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		 * Found entry for
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Aug 4, 2015, at 4:54 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 08/04/2015 03:15 PM, Robert Haas wrote: On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas hlinn...@iki.fi wrote: * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync with the list of individual locks in lwlock.h. Sooner or later someone will add an LWLock and forget to update the names-array. That needs to be made less error-prone, so that the names are maintained in the same place as the #defines. Perhaps something like rmgrlist.h. This is a good idea, but it's not easy to do in the style of rmgrlist.h, because I don't believe there's any way to define a macro that expands to a preprocessor directive. Attached is a patch that instead generates the list of macros from a text file, and also generates an array inside lwlock.c with the lock names that gets used by the Trace_lwlocks stuff where applicable. Any objections to this solution to the problem? If not, I'd like to go ahead and push this much. I can't test the Windows changes locally, though, so it would be helpful if someone could check that out. A more low-tech solution would be to do something like this in lwlocknames.c:

static char *MainLWLockNames[NUM_INDIVIDUAL_LWLOCKS];

/* Turn a pointer to one of the LWLocks in the main array into an index number */
#define NAME_LWLOCK(l, name) MainLWLockNames[(l) - MainLWLockArray] = (name)

InitLWLockNames()
{
	NAME_LWLOCK(ShmemIndexLock, "ShmemIndexLock");
	NAME_LWLOCK(OidGenLock, "OidGenLock");
	...
}

That would not be auto-generated, so you'd need to keep that list in sync with lwlock.h, but it would be much better than the original patch because if you forgot to add an entry to the names-array, the numbering of all the other locks would not go wrong. And you could have a runtime check that complains if an entry is missing, like Ildus did in his latest patch. I have no particular objection to your perl script either, though. I'll leave it up to you. - Heikki A new version of the patch. I used your idea with macros, and with tranches, which allowed us to remove the array with names (they can be written directly to the corresponding tranche). Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company extend_pg_stat_activity_v7.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upgrade failure from 9.5 to head
On 08/04/2015 02:09 PM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund and...@anarazel.de wrote: On 2015-08-04 13:52:54 -0400, Tom Lane wrote: Not sure whether we should consider it a back-patchable bug fix or something to do only in HEAD, though --- comments? Tentatively I'd say it's a bug and should be back-patched. Agreed. If investigation turns up reasons to worry about back-patching it, I'd possibly back-track on that position, but I think we should start with the notion that it is back-patchable and retreat from that position only at need. OK. Certainly we can fix 9.5 the same way as HEAD; the pg_dump code hasn't diverged much yet. I'll plan to push it as far back as convenient, but I won't expend any great effort on making the older branches do it if they turn out to be too different. From my POV 9.5 is the one that's most critical, because it's the one that introduced a regression test that leaves a shell type lying around. But as far back as convenient works for me. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FSM versus GIN pending list bloat
On 4 August 2015 at 21:04, Jeff Janes jeff.ja...@gmail.com wrote: Couple of questions here... * the docs say it's desirable to have pending-list cleanup occur in the background, but there is no way to invoke that, except via VACUUM. I think we need a separate function to be able to call this as a background action. If we had that, we wouldn't need much else, would we? I thought maybe the new bgworker framework would be a way to have a backend signal a bgworker to do the cleanup when it notices the pending list is getting large. But that wouldn't directly fix this issue, because the bgworker still wouldn't recycle that space (without further changes); only vacuum workers do that currently. But I don't think this could be implemented as an extension, because the signalling code has to be in core, so (not having studied the matter at all) I don't know if it is a good fit for bgworker.

We need to expose 2 functions:
1. a function to perform the recycling directly (BRIN has an equivalent function)
2. a function to see how big the pending list is for a particular index, i.e. do we need to run function 1?
We can then build a bgworker that polls the pending list and issues a recycle if and when needed - which is how autovac started.

* why do we have two parameters: gin_pending_list_limit and fastupdate? What happens if we set gin_pending_list_limit but don't set fastupdate? Fastupdate is on by default. If it were turned off, then gin_pending_list_limit would be mostly irrelevant for those tables. Fastupdate could have been implemented as a magic value (0 or -1) of gin_pending_list_limit, but that would break backwards compatibility (and arguably would not be a better way of doing things anyway).

* how do we know how to set that parameter? Is there a way of knowing gin_pending_list_limit has been reached? I don't think there is an easy answer to that. The trade-offs are complex and depend on things like how well cached the parts of the index needing insertions are, how many lexemes/array elements are in an average document, and how many documents inserted near the same time as each other share lexemes in common. And of course what you need to optimize for: latency or throughput, and if latency, search latency or insert latency.

So we also need a way to count the number of times the pending list is flushed. Perhaps record that on the metapage, so we can see how often it has happened - and another function to view the stats on that.

This and the OP seem like 9.5 open items to me. I don't think so. Freeing gin_pending_list_limit from being forcibly tied to work_mem is a good thing. Even if I don't know exactly how to set gin_pending_list_limit, I know I don't want it to be 4GB just because work_mem was set there for some temporary reason. I'm happy to leave it at its default and let its fine tuning be a topic for people who really care about every microsecond of performance.

OK, I accept this. -- Simon Riggs    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] tablecmds.c and lock hierarchy
On Tue, Aug 04, 2015 at 07:35:43AM +0100, Simon Riggs wrote: On 4 August 2015 at 05:56, Michael Paquier michael.paqu...@gmail.com wrote: The thing is that, as mentioned by Alvaro and Andres on this thread, we have no guarantee that the different relation locks compared have a monotone hierarchy, and we may finish by taking a lock that does not behave as you would like. We are now lucky enough that ALTER TABLE only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and AccessExclusiveLock, which actually have a hierarchy, so this is not a problem yet. However it may become a problem if we add more lock modes in the future that are used by ALTER TABLE. Please provide the link to the discussion of this. I don't see a problem here right now that can't be solved by saying Assert(locklevel == ShareUpdateExclusiveLock || locklevel > ShareRowExclusiveLock); Agreed; that addresses the foreseeable future of this threat. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
... Josh's approach of restricting the buffer size seems a lot more robust. I understand that the capping approach of restricting the buffer size is much more robust and is suitable in this case. I, however, think that the change from 'page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];' to 'page = &XLogCtl->pages[firstIdx * (Size) XLOG_BLCKSZ];' does no harm even when restricting the wal buffer size. It is in harmony with the usage of 'XLogCtl->pages' found in, for example, 'cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;' in GetXLogBuffer(XLogRecPtr ptr) and 'NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);' in AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic), etc. The only exception is 'page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];' in StartupXLOG(void). -- Takashi Horikawa NEC Corporation Knowledge Discovery Research Laboratories -----Original Message----- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane Sent: Tuesday, August 04, 2015 10:50 PM To: Horikawa Takashi(堀川 隆) Cc: PostgreSQL-development Subject: Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes Takashi Horikawa t-horik...@aj.jp.nec.com writes: Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value. As far as I experiment with my own evaluation environment using PostgreSQL-9.4.4 on a x86_64 Linux, this problem can be fixed with the patch attached. I'm unsure whether this represents a complete fix ... but even if it does, it would be awfully easy to re-introduce similar bugs in future code changes, and who would notice? Josh's approach of restricting the buffer size seems a lot more robust. If there were any practical use-case for such large WAL buffers then it might be worth spending some effort/risk here. But AFAICS, there is not. Indeed, capping wal_buffers might be argued to be a good thing in itself because it would prevent users from wasting shared memory foolishly. So my vote is for the original approach. (I've not read Josh's patch, so there might be something wrong with it in detail, but I like the basic approach.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers smime.p7s Description: S/MIME cryptographic signature
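For the record, the arithmetic behind both observations: with the default XLOG_BLCKSZ of 8192, firstIdx * XLOG_BLCKSZ is an int-by-int multiplication, which overflows once firstIdx exceeds INT_MAX / 8192 = 262143, i.e. once wal_buffers crosses 2GB. A standalone illustration, assuming a 64-bit size_t as the (Size) cast relies on:

#include <stdio.h>
#include <stddef.h>

#define XLOG_BLCKSZ 8192

int
main(void)
{
	int			firstIdx = 300000;	/* an index implied by wal_buffers > 2GB */

	/* int * int: overflows (undefined behaviour, typically wraps) */
	long long	bad = firstIdx * XLOG_BLCKSZ;

	/* widen one operand first, as in 'firstIdx * (Size) XLOG_BLCKSZ' */
	long long	good = firstIdx * (size_t) XLOG_BLCKSZ;

	printf("%lld vs %lld\n", bad, good);
	return 0;
}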
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Tue, Aug 4, 2015 at 8:37 PM, Beena Emerson memissemer...@gmail.com wrote: Robert Haas wrote: Maybe shoehorning this into the GUC mechanism is the wrong thing, and what we really need is a new config file for this. The information we're proposing to store seems complex enough to justify that. I think the consensus is that JSON is better. I guess so as well. Thanks for brainstorming the whole thread in a single post. And using a new file with multi-line support would be good. This file just contains a JSON blob, hence we just need to fetch its content entirely and then let the server parse it using the existing facilities. Name of the file: how about pg_syncinfo.conf? Backward compatibility: synchronous_standby_names will be supported. synchronous_standby_names='pg_syncinfo' indicates use of the new file. This strengthens the fact that parsing is done at SIGHUP, so that sounds fine to me. We may still find an application_name that uses pg_syncinfo, but well, that's unlikely to happen... JSON format: It would contain 2 main keys: sync_info and groups. The sync_info would consist of quorum/priority with the count and nodes/group with the group name or node list. The optional groups key would list out all the groups mentioned within sync_info along with their node lists. [...] Thoughts? Yes, I think that's the idea. I would leave a couple of days to give people time to share their opinions and objections regarding this approach, though. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
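To make the shape concrete, here is a purely hypothetical pg_syncinfo.conf following the outline above. The key names come from the description in this thread; the node and group names are invented:

{
	"sync_info":
	{
		"quorum": 2,
		"nodes":
		[
			{"priority": 1, "group": "cluster1"},
			"node3"
		]
	},
	"groups":
	{
		"cluster1": ["node1", "node2"]
	}
}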