Re: [HACKERS] row_security GUC, BYPASSRLS
On Tue, Sep 15, 2015 at 03:18:21PM -0400, Adam Brightwell wrote:
> On Tue, Sep 15, 2015 at 2:26 PM, Joe Conway wrote:
> > Joe Conway writes:
> > > There are use cases where row_security=force will be set in
> > > production environments, not only in testing.
> >
> > Noah's suggestion of using a per table attribute would work -- in
> > fact I like the idea of that better than using the current GUC.
>
> FWIW, I also concur with a per table attribute for this purpose. In
> fact, I think I really like the per-table flexibility over an
> 'all-or-nothing' approach better too.

Great. Robert, does that work for you, too? If so, this sub-thread is
looking at three patches:

1. remove row_security=force
2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to
   policies
3. add DDL-controlled, per-table policy forcing

They ought to land in that order. PostgreSQL 9.5 would need at least
(1) and (2); would RLS experts find it beneficial for me to take care
of those?

Thanks,
nm

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] creating extension including dependencies
On Sep 17, 2015 7:52 PM, "Petr Jelinek" wrote:
>
> On 2015-09-17 17:31, Jeff Janes wrote:
>>
>> Also, it would be nice to have psql tab complete the word CASCADE.
>
> Hmm, it already does?

Indeed it does. Oops. I need to run the program I just compiled, and
not some other version that happens to be in my $PATH. I've learned
that for pg_ctl mostly, but still forget for psql.

Cheers,

Jeff
Re: [HACKERS] Use pg_rewind when target timeline was switched
On Fri, Sep 18, 2015 at 12:34 PM, Michael Paquier wrote:
> On Fri, Sep 18, 2015 at 9:00 AM, Alexander Korotkov wrote:
>> BTW, it would be an option to generate a system_identifier for each
>> new timeline, by analogy with what initdb does for the whole WAL.
>> Having such system_identifiers, we can distinguish different
>> timelines which have been assigned the same ids.
>> Any thoughts?
>
> If you mean a new field incorporated in XLogLongPageHeader and
> ControlFile to ensure that a new timeline generated is unique across
> the same installation identified with system_identifier, then I'm not
> against it for something up to 2 bytes (even 1 byte would be fine),

Er, 2 bytes may be a bad idea as well: even a 1-in-65536 chance of
collision looks dangerous when rewinding a node. It could cause silent
data corruption on the standby being rewound.
--
Michael
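The risk Michael points at is the classic birthday problem: in an
n-value identifier space, the chance that some pair among k generated
timeline identifiers collides grows quickly. A quick sketch (assuming,
for illustration only, that identifiers are assigned uniformly at
random):

```python
# Birthday-problem estimate: probability that at least two of k
# uniformly random identifiers collide in a space of n values.
def collision_probability(k, n):
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct

# With a 1-byte identifier, ten timelines already collide roughly 16%
# of the time; 2 bytes shrinks that, but not to a level where silent
# corruption during rewind would be acceptable.
p1 = collision_probability(10, 256)      # ~0.16
p2 = collision_probability(10, 65536)    # ~0.0007
```

For exactly two timelines the collision chance is 1/n, which is where
the 1-in-65536 figure for a 2-byte field comes from.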
Re: [HACKERS] Use pg_rewind when target timeline was switched
On Fri, Sep 18, 2015 at 6:25 PM, Michael Paquier wrote:
> The refactoring of getTimelineHistory as you propose looks like a good
> idea to me. I tried to remove by myself the difference between source
> and target in copy_fetch.c and friends, but this gets uglier,
> particularly because of datadir_source in copy_file_range. Not worth
> it.

Forgot that:

    if (ControlFile_target.state != DB_SHUTDOWNED)
        pg_fatal("target server must be shut down cleanly\n");

We may want to allow a target node that was shut down while in recovery
here as well.
--
Michael
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
On 7/23/15 6:39 PM, Tom Lane wrote:
> +2202H    E    ERRCODE_INVALID_TABLESAMPLE_ARGUMENT    invalid_tablesample_argument
> +2202G    E    ERRCODE_INVALID_TABLESAMPLE_REPEAT      invalid_tablesample_repeat

Where did you get these error codes from? The constants in the SQL
standard would map to

    ERRCODE_INVALID_SAMPLE_SIZE
    ERRCODE_INVALID_REPEAT_ARGUMENT_IN_A_SAMPLE_CLAUSE

Were you looking at a different standard, or did you intentionally
choose to rephrase?
Re: [HACKERS] row_security GUC, BYPASSRLS
On Fri, Sep 18, 2015 at 09:01:15AM -0400, Adam Brightwell wrote:
>> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to
>> policies
>
> I believe this one has already been addressed by Stephen
> (20150910192313.gt3...@tamriel.snowman.net)? Are there further
> considerations for his proposed changes?

Right. I'll use that patch, minus the bits examining
InLocalUserIdChange().
Re: [HACKERS] extend pgbench expressions with functions
Hello Kyotaro-san,

> My description should have been obscure. Indeed the call tree is
> finite for a *sane* expression node, but it makes an infinite call for
> a value of expr->etype unknown by both evalDouble and evalInt.

Such an issue would be detected if the function is actually tested;
hopefully this should be the case... :-) However, I agree that relying
implicitly on the "default" case is not very good practice, so I
updated the code in the attached v11 to fail explicitly on such errors.

I also attached a small test script, which exercises most (all?)
functions:

    ./pgbench -f functions.sql -t 1

> By the way, the complexity comes from separating integer and double.
> If there is no serious reason to separate them, handling all values as
> double makes things far simpler.

Yep, but no.

> Could you let me know the reason why it strictly separates integer and
> double? I see no problem in possible errors of floating point
> calculations for this purpose. Is there any?

Indeed it would make things simpler, but it would break large integers,
as the int64 -> double -> int64 casts would result in approximations.
The integer type is the important one here because it is used for
primary keys, and you do not want a key to be approximated in any way,
so the int64 type must be fully and exactly supported.

--
Fabien.

[Attachment: v11 of the patch. It extends pgbench \set expressions with
double constants and function calls, documents the available functions
(abs, ddebug, double, exporand, idebug, int, gaussrand, min, max,
random, sqrt) in pgbench.sgml, and rewrites the built-in TPC-B-like
script to use random() instead of \setrandom. The diff is garbled and
truncated in the archive.]
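Fabien's objection can be demonstrated directly. An IEEE-754 double has
a 53-bit significand, so int64 values above 2^53 cannot all be
represented exactly, and a round-trip through double silently changes
some of them. A quick check (Python floats are C doubles, so this
mirrors the int64 -> double -> int64 cast):

```python
# 2**53 is the last power of two up to which every integer is exactly
# representable as a double; 2**53 + 1 is the first integer that is not.
exact = 2 ** 53
lossy = 2 ** 53 + 1

assert int(float(exact)) == exact   # round-trips exactly
assert int(float(lossy)) != lossy   # silently rounded back to 2**53
assert int(float(lossy)) == exact
```

For a primary key, that silent rounding would mean two distinct keys
mapping to the same value, which is exactly what the separate int64
path avoids.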
Re: [HACKERS] On-demand running query plans using auto_explain and signals
2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr:
> On Thu, Sep 17, 2015 at 10:13 PM, Robert Haas wrote:
>> On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule wrote:
>>>> Second, using a shm_mq manipulates the state of the process latch.
>>>> I don't think you can make the assumption that it's safe to reset
>>>> the process latch at any and every place where we check for
>>>> interrupts. For example, suppose the process is already using a
>>>> shm_mq and the CHECK_FOR_INTERRUPTS() call inside that code then
>>>> discovers that somebody has activated this mechanism and you now go
>>>> try to send and receive from a new shm_mq. But even if that and
>>>> every other CHECK_FOR_INTERRUPTS() in the code can tolerate a
>>>> process latch reset today, it's a new coding rule that could easily
>>>> trip people up in the future.
>>>
>>> It is valid, and probably most important. But if we introduce our
>>> own mechanism, we will play with the process latch too (although we
>>> can use LWLocks)
>>
>> With the design I proposed, there is zero need to touch the process
>> latch, which is good, because I'm pretty sure that is going to be a
>> problem. I don't think there is any need to use LWLocks here either.
>> When you get a request for data, you can just publish a DSM segment
>> with the data and that's it. Why do you need anything more? You
>> could set the requestor's latch if it's convenient; that wouldn't be
>> a problem. But the process supplying the data can't end up in a
>> different state than it was before supplying that data, or stuff WILL
>> break.
>
> There is still the whole problem of where exactly the backend being
> queried for the status should publish that DSM segment, and when to
> free it.
>
> If it's a location shared between all backends, there should be
> locking around it. Probably this is not a big problem, if you don't
> expect all the backends to start querying each other rapidly. That is
> how it was implemented in the first versions of this patch actually.
>
> If we take the per-backend slot approach the locking seems unnecessary
> and there are principally two options:
>
> 1) The backend puts the DSM handle in its own slot and notifies the
> requester to read it.
> 2) The backend puts the DSM handle in the slot of the requester (and
> notifies it).
>
> If we go with the first option, the backend that has created the DSM
> will not know when it's OK to free it, so this has to be the
> responsibility of the requester. If the latter exits before reading
> and freeing the DSM, we have a leak. Even bigger is the problem that
> the sender backend can no longer send responses to a number of
> concurrent requestors: if its slot is occupied by a DSM handle, it can
> not send a reply to another backend until the slot is freed.
>
> With the second option we have all the same problems with not knowing
> when to free the DSM and potentially leaking it, but we can handle
> concurrent requests.

That should not be true: the data sender creates the DSM and fills it.
Then it sets the caller's slot and sends a signal to the caller. The
caller can free the DSM at any time, because the data sender never
touches it again.

> The current approach where the requester creates and frees the DSM
> doesn't suffer from these problems, so if we pre-allocate the segment
> just big enough we can avoid the use of shm_mq. That will take
> another GUC for the segment size. Certainly no one expects a query
> plan to weigh a bloody megabyte, but this is what happens to Pavel
> apparently.

It is plan C - the last variant from my point of view. Yet another
GUC :(

Pavel
Re: [HACKERS] [patch] Proposal for \rotate in psql
> 3. When data are multiattribute - then merging together with space
> separator is not practical
>
> * important information is lost
> * the same transformation can be done as an expression, so this
> feature is useless
>
> Is it possible to use one cell per attribute (i.e., don't merge)?
>
> DATA QUERY: SELECT dim1, dim2, sum(x), avg(x) FROM .. GROUP BY dim1, dim2
>
> and the result header of rotate can be
>
> DIM1 | dim2_val1/sum | dim2_val1/avg | dim2_val2/sum | dim2_val2/avg | ...

The last point can wait - we don't need to show a pivot table with all
details perfectly in the first step.

The main issue of this patch is the name - "rotate" is really pretty
strange to me. Please, change it :) - crosstab is much better.

Regards

Pavel
Re: [HACKERS] checkpointer continuous flushing
Hello,

> [...] If you make the sorting criterion include the tablespace id you
> wouldn't need the lookahead loop in NextBufferToWrite().

I'm considering this precise point, i.e. including the tablespace as a
sorting criterion.

Currently the array used for sorting is 16 bytes per buffer (although I
wrote 12 in another mail, I was wrong...). The data include the bufid
(4 bytes), the relation & fork num (8 bytes, but really 4 bytes + 2
bits are used), and the block number (4 bytes), which is the offset
within the relation. These 3 combined data make it possible to find
the file, and the offset within that file, for the given buffer id.
I'm concerned that these 16 bytes are already significant, and I do not
want to extend them any more. I was already pretty happy with the
previous version with 4 bytes per buffer.

Now, as the number of tablespaces is expected to be very small (1, 2,
maybe 3), there is no problem in packing it within the unused 30 bits
in forknum. That would mean some masking and casts here and there, so
it would not be very beautiful, but it would make it easy to find the
buffers for a given tablespace, and indeed remove the lookahead stuff
in the next-buffer function, as you suggest.

My question is: would that be acceptable, or would someone object to
the use of masks and things like that? The benefit would be a
simpler/more direct next-buffer function, but some more tinkering
around the sorting criterion to use a packed representation.

Note that I do not think that it would have any actual impact on
performance... it would only make a difference if there were really
many tablespaces (the scanning complexity would be Nbuffer instead of
Nbuffer*Ntablespace, but as Ntablespace is small...).

My motivation is rather to help the patch get through, so I'm fine if
this is not needed.

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
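The packing Fabien describes can be sketched abstractly (field names
and bit widths here are illustrative, not the actual PostgreSQL
layout): the tablespace number goes into otherwise-unused high bits of
the field whose low bits hold the fork number, so a plain integer sort
on the packed key naturally groups buffers by tablespace first:

```python
FORKNUM_BITS = 2                     # real fork numbers fit in 2 bits
FORKNUM_MASK = (1 << FORKNUM_BITS) - 1

def pack(tablespace, forknum):
    # tablespace occupies the high bits, forknum the low bits
    return (tablespace << FORKNUM_BITS) | forknum

def unpack(key):
    return key >> FORKNUM_BITS, key & FORKNUM_MASK

# Sorting packed keys groups all buffers of one tablespace together,
# which is what removes the need for the lookahead loop.
keys = [pack(2, 1), pack(1, 3), pack(2, 0), pack(1, 0)]
assert [unpack(k)[0] for k in sorted(keys)] == [1, 1, 2, 2]
```

In C this would be the same shifts and masks applied to the existing
forknum field, which is where the "not very beautiful" casts come from.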
Re: [HACKERS] Freeze avoidance of very large table.
On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada wrote:
> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao wrote:
>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada wrote:
>>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada wrote:
>>>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera wrote:
>>>>> Jim Nasby wrote:
>>>>>
>>>>>> I think things like pageinspect are very different; I really
>>>>>> can't see any use for those beyond debugging (and debugging by an
>>>>>> expert at that).
>>>>>
>>>>> I don't think that necessarily means it must continue to be in
>>>>> contrib. Quite the contrary, I think it is a tool critical enough
>>>>> that it should not be relegated to be a second-class citizen as it
>>>>> is now (let's face it, being in contrib *is* second-class
>>>>> citizenship).
>>>>
>>>> Attached patch is the latest patch.
>>>
>>> The previous patch lacked some files for the regression test.
>>> Attached fixed v12 patch.
>>
>> The patch could be applied cleanly. "make check" could pass
>> successfully. But "make check-world -j 2" failed.
>
> Thank you for looking at this patch.
> Could you tell me which test failed for you?
> make check-world -j 2 or more finishes successfully in my environment.

I tried to do the test again, but initdb failed with the following
error.

    creating template1 database in data/base/1 ... FATAL:  invalid
    input syntax for type oid: "f"

This error didn't happen when I tested before, so a commit which was
applied recently might interfere with the patch.

Regards,

--
Fujii Masao
Re: [HACKERS] numbering plan nodes
> On Thu, Sep 17, 2015 at 9:01 PM, Kouhei Kaigai wrote:
>> I entirely agree with the idea of a plan-node identifier; however,
>> I'm uncertain whether the node-id should represent a physical
>> location on the dynamic shared memory segment, because
>> (1) a relatively small number of node types need shared state, thus
>> most of the array items are empty.
>> (2) an extension that tries to modify the plan tree using
>> planner_hook may need to adjust node-ids also.
>>
>> Even though shm_toc_lookup() has to walk the toc entries to find the
>> node-id, it happens only once, at the beginning of the executor on
>> the background worker side. I don't think it makes a significant
>> problem.
>
> Yes, I was thinking that what would make sense is to have each
> parallel-aware node call shm_toc_insert() using its ID as the key.
> Then, we also need Instrumentation nodes. For those, I thought we
> could use some fixed, high-numbered key, and Tom's idea.

Hmm, indeed, run-time statistics are needed for every node. If the
array indexed by node-id were instead a hash table, we could handle
non-contiguous node-ids with no trouble.

> Are there extensions that use planner_hook to do surgery on the plan
> tree? What do they do, exactly?

(Even though it will not work under Funnel,) PG-Strom often injects a
preprocessor node under an Agg node to produce a partial aggregation,
to reduce the number of rows to be processed by the CPU.

Also, I have seen a paper published by Fujitsu folks. Their module
modifies the plan tree to replace the built-in scan node with their own
columnar storage scan node.
http://db-event.jpn.org/deim2015/paper/195.pdf
This paper is written in Japanese; however, figure 3 on page 4 shows
what I explain above.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] On-demand running query plans using auto_explain and signals
On Thu, Sep 17, 2015 at 10:13 PM, Robert Haas wrote:
> On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule wrote:
>>> Second, using a shm_mq manipulates the state of the process latch.
>>> I don't think you can make the assumption that it's safe to reset
>>> the process latch at any and every place where we check for
>>> interrupts. For example, suppose the process is already using a
>>> shm_mq and the CHECK_FOR_INTERRUPTS() call inside that code then
>>> discovers that somebody has activated this mechanism and you now go
>>> try to send and receive from a new shm_mq. But even if that and
>>> every other CHECK_FOR_INTERRUPTS() in the code can tolerate a
>>> process latch reset today, it's a new coding rule that could easily
>>> trip people up in the future.
>>
>> It is valid, and probably most important. But if we introduce own
>> mechanism, we will play with process latch too (although we can use
>> LWlocks)
>
> With the design I proposed, there is zero need to touch the process
> latch, which is good, because I'm pretty sure that is going to be a
> problem. I don't think there is any need to use LWLocks here either.
> When you get a request for data, you can just publish a DSM segment
> with the data and that's it. Why do you need anything more? You
> could set the requestor's latch if it's convenient; that wouldn't be a
> problem. But the process supplying the data can't end up in a
> different state than it was before supplying that data, or stuff WILL
> break.

There is still the whole problem of where exactly the backend being
queried for the status should publish that DSM segment, and when to
free it.

If it's a location shared between all backends, there should be locking
around it. Probably this is not a big problem, if you don't expect all
the backends to start querying each other rapidly. That is how it was
implemented in the first versions of this patch actually.

If we take the per-backend slot approach the locking seems unnecessary
and there are principally two options:

1) The backend puts the DSM handle in its own slot and notifies the
requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and
notifies it).

If we go with the first option, the backend that has created the DSM
will not know when it's OK to free it, so this has to be the
responsibility of the requester. If the latter exits before reading
and freeing the DSM, we have a leak. Even bigger is the problem that
the sender backend can no longer send responses to a number of
concurrent requestors: if its slot is occupied by a DSM handle, it can
not send a reply to another backend until the slot is freed.

With the second option we have all the same problems with not knowing
when to free the DSM and potentially leaking it, but we can handle
concurrent requests.

The current approach where the requester creates and frees the DSM
doesn't suffer from these problems, so if we pre-allocate the segment
just big enough we can avoid the use of shm_mq. That will take another
GUC for the segment size. Certainly no one expects a query plan to
weigh a bloody megabyte, but this is what happens to Pavel apparently.

--
Alex
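Alex's preferred arrangement, where the requester owns the segment's
whole lifetime and the queried backend only attaches and fills it, can
be sketched with Python's multiprocessing.shared_memory standing in for
DSM. This is a single-process sketch of the protocol only, not
PostgreSQL code; the payload and segment size are made up:

```python
import threading
from multiprocessing import shared_memory

SEGMENT_SIZE = 1024          # stand-in for the proposed segment-size GUC
reply_ready = threading.Event()

def responder(segment_name):
    # The queried backend only attaches, writes its "plan", and
    # detaches; it never owns the segment, so it cannot leak it.
    seg = shared_memory.SharedMemory(name=segment_name)
    payload = b"Seq Scan on t  (cost=0.00..1.00 rows=1 width=4)"
    seg.buf[:len(payload)] = payload
    seg.close()
    reply_ready.set()        # stand-in for setting the requester's latch

# The requester pre-allocates the segment and owns its whole lifetime.
seg = shared_memory.SharedMemory(create=True, size=SEGMENT_SIZE)
threading.Thread(target=responder, args=(seg.name,)).start()
reply_ready.wait()
plan = bytes(seg.buf).rstrip(b"\x00").decode()
seg.close()
seg.unlink()                 # freed by the requester: no leak either way
```

The design point this illustrates is that the side which blocks waiting
for the reply is also the side best placed to clean up, whether the
exchange succeeds or times out.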
Re: [HACKERS] On-demand running query plans using auto_explain and signals
On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule wrote:
> 2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr:
>>
>> If we take the per-backend slot approach the locking seems
>> unnecessary and there are principally two options:
>>
>> 1) The backend puts the DSM handle in its own slot and notifies the
>> requester to read it.
>> 2) The backend puts the DSM handle in the slot of the requester (and
>> notifies it).
>>
>> If we go with the first option, the backend that has created the DSM
>> will not know when it's OK to free it, so this has to be the
>> responsibility of the requester. If the latter exits before reading
>> and freeing the DSM, we have a leak. Even bigger is the problem that
>> the sender backend can no longer send responses to a number of
>> concurrent requestors: if its slot is occupied by a DSM handle, it
>> can not send a reply to another backend until the slot is freed.
>>
>> With the second option we have all the same problems with not knowing
>> when to free the DSM and potentially leaking it, but we can handle
>> concurrent requests.
>
> It should not be true - the data sender creates the DSM and fills it.
> Then it sets the caller's slot and sends a signal to the caller. The
> caller can free the DSM any time, because the data sender never
> touches it again.

But the requester can time out while waiting for a reply and exit
before it sees the reply DSM.

Actually, I now don't even think a backend can free a DSM it has not
created. First it will need to attach it, effectively increasing the
refcount, and upon detach it will only decrease the refcount, but not
actually release the segment... So this has to be the responsibility
of the reply-sending backend in the end: to create and release the DSM
*at some point*.

--
Alex
Re: [HACKERS] synchronous_commit = apply
On Fri, Sep 18, 2015 at 7:06 PM, Kyotaro HORIGUCHI wrote:
> Hello, I have some random comments.

Thanks for the feedback! I have fixed several of the things that you
found in the attached new version -- see comments inline below.

However, I now know that Simon has a better patch in development to do
this, so I won't be developing this further. (Until that work is
available, this patch is temporarily useful as a prerequisite for
something else that I'm working on, so I'll still be using it...)

> At Wed, 16 Sep 2015 23:07:03 +1200, Thomas Munro wrote in
Re: [HACKERS] [COMMITTERS] pgsql: Add pages deleted from pending list to FSM
On Fri, Sep 18, 2015 at 6:27 AM, Tom Lane wrote:
> [ moving thread to -hackers ]
>
> Fujii Masao writes:
>> So autoanalyze still doesn't call IndexFreeSpaceMapVacuum(). That
>> is, only the backend can clean the list in an INSERT-only workload.
>> I don't think that this is desirable, because the backend will
>> periodically take a big performance penalty.

Calling IndexFreeSpaceMapVacuum is only needed to reuse the space freed
up by cleaning the pending list. The list is still getting cleaned and
truncated by autoanalyze; it is just that the space is not getting
marked for reuse.

When I wrote this, my thought was that a table vacuum is going to call
IndexFreeSpaceMapVacuum() from another place in its code and so should
not call it here as well. I neglected to consider the autoanalyze
case.

>> So I'm thinking that even autoanalyze should call
>> IndexFreeSpaceMapVacuum() to clean the list in the background, in
>> order to avoid such spikes in INSERT response time. Thought?

But it already is cleaning the list. If that space is not marked as
reusable, that causes index bloat, but doesn't cause latency spikes.

> It seems quite bizarre for auto-analyze to do that. auto-vacuum,
> sure, but I do not think this should get plugged into ANALYZE.

It may be odd that autoanalyze cleans the list at all (which this patch
doesn't change), but given that it does clean the list I don't see why
it would be bizarre to make the cleaned-up space re-usable.

Cheers,

Jeff
Re: [HACKERS] Parallel Seq Scan
On Fri, Sep 18, 2015 at 12:56 PM, Robert Haas wrote:
> On Thu, Sep 17, 2015 at 11:44 PM, Amit Kapila wrote:
>> Okay, but I think the same can be achieved with this as well. The
>> basic idea is that each worker will work on one planned statement at
>> a time, and in the above case there will be two different planned
>> statements and they will store partial seq scan related information
>> in two different locations in the toc, although the key
>> (PARALLEL_KEY_SCAN) would be the same, and I think this will be quite
>> similar to what we are already doing for response queues. The worker
>> will work on one of those keys based on the planned statement which
>> it chooses to execute. I have explained this in somewhat more detail
>> in one of my previous mails [1].
>
> shm_toc keys are supposed to be unique. If you added more than one
> with the same key, there would be no way to look up the second one.
> That was intentional, and I don't want to revise it.
>
> I don't want to have multiple PlannedStmt objects in any case. That
> doesn't seem like the right approach. I think passing down an Append
> tree with multiple Partial Seq Scan children to be run in order is
> simple and clear, and I don't see why we would do it any other way.
> The master should be able to generate a plan and then copy the part of
> it below the Funnel and send it to the worker. But there's clearly
> never more than one PlannedStmt in the master, so where would the
> other ones come from in the worker? There's no reason to introduce
> that complexity.

Also, as KaiGai pointed out on the other thread, even if you DID pass
two PlannedStmt nodes to the worker, you would still need to know which
one goes with which ParallelHeapScanDesc. If both of the
ParallelHeapScanDesc nodes are stored under the same key, then you
can't do that. That's why, as discussed in the other thread, we need
some way of uniquely identifying a plan node.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [patch] Proposal for \rotate in psql
Pavel Stehule wrote:
> 2. Data column are not well aligned - numbers are aligned as text

Thanks for spotting that; it's fixed in the attached new iteration of
the patch.

> 3. When data are multiattribute - then merging together with space
> separator is not practical
>
> * important information is lost
> * same transformation can be done as expression, so this feature is
> useless

The primary use case is for queries with 3 output columns (A,B,C) where
A and B are dimensions and C is uniquely determined by (A,B). How
columns 4 and above get displayed is not essential to the feature, as
it's somewhat of a degenerate case. As you note, it could be avoided
by the user limiting the query to 3 columns, and providing whatever
expression fits for the 3rd column.

Still, if the query has more than 3 columns, it has to be dealt with.
The choices I've considered:

a- Just error out.

b- Force the user to specify which single column should be taken as the
value. That would be an additional argument to the command, or the
fixed 3rd column in the invocation without an argument.

c- Stack the values horizontally in the same cell with a separator. As
the query implies f(A,B)=(C,D,E), we display C D E in the cell at
coordinates (A,B). It's what it does currently.

[a] is not very user friendly. [b] seems acceptable; it discards
columns, but the user decides which. [c] is meant as a best effort at
not discarding anything.

When [c] gives poor results, the next step from my point of view would
be for the user to rework the query, just like in the general case when
a query is not satisfying.

You're suggesting a [d] choice: subdividing the horizontal headers. It
seems to me like a pretty radical change, multiplying the number of
columns, and it also has the potential to give poor results visually.
Let's see if more feedback comes.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

[Attachment: updated patch. It adds documentation for \rotate [ colV
[-]colH ] to psql-ref.sgml, describing how the vertical and horizontal
headers are built from the two designated columns and how the cell at
each (x,y) intersection is filled (empty, the N-2 remaining columns
separated by spaces, an X when N=2, or newline-stacked values for
multiple matching rows). The diff is garbled and truncated in the
archive.]
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
> On 18 Sep 2015, at 20:16, Robert Haas wrote:
>
> On Fri, Sep 18, 2015 at 4:08 AM, Vladimir Borodin wrote:
>> For both scenarios on linux we got approximately the same results - version
>> with timings was faster than version with sampling (sampling was done every
>> 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps and version
>> with timings gave ~14500 tps while version with sampling gave ~13800 tps. In
>> all cases processor was 100% utilized. Comparing vanilla PostgreSQL and
>> version with timings on constant workload (12000 tps) gave the following
>> results in latencies for queries:
>
> If the timing is speeding things up, that's most likely a sign that
> the spinlock contention on that workload is so severe that you are
> spending a lot of time in s_lock. Adding more things for the system
> to do that don't require that lock will speed the system up by
> reducing the contention. Instead of inserting gettimeofday() calls,
> you could insert a for loop that counts to some large number without
> doing any useful work, and that would likely have a similar effect.
>
> In any case, I think your experiment clearly proves that the presence
> or absence of this instrumentation *is* performance-relevant and that
> we *do* need to worry about what it costs. If the system gets 20%
> faster when you call gettimeofday() a lot, does that mean we should
> insert gettimeofday() calls all over the system in random places to
> speed it up?

No, you probably misunderstood the results, so let me explain one more time. Unpatched PostgreSQL from REL9_4_STABLE gave 15500 tps. The version with timings gave 14500 tps, which is 6.5% worse. The version with sampling of wait events every 10 ms gave 13800 tps (11% worse than unpatched and 5% worse than with timings). We also made a test with a stable workload of 12000 tps for the unpatched version and the version with timings. In that test we saw that response times are a bit worse in the version with timings, as shown in the table below.
You should read this table as follows: 99% of all queries in the unpatched version fit in 79 ms, while in the version with timings 99% of all queries fit in 97 ms, which is 18 ms slower, and so on. That test also showed that the version with timings consumes an extra 7% of CPU to handle the same workload as the unpatched version. So this is the cost of waits monitoring with timings on an lwlock stress workload - 6.5% less throughput, slightly worse response times, and an extra 7% of CPU. If you insert gettimeofday() calls all over the system in random places, you will not speed up as one might expect - you will get slower.

q'th  vanilla  timing
99%    79.0     97.0  (+18.0)
98%    64.0     76.0  (+12.0)
95%    38.0     47.0  (+9.0)
90%    16.0     21.0  (+5.0)
85%     7.0     11.0  (+4.0)
80%     5.0      7.0  (+2.0)
75%     4.0      5.0  (+1.0)
50%     2.0      3.0  (+1.0)

> I do agree that if we're going to include support for timings, having
> them be controlled by a GUC is a good idea.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
May the force be with you…
https://simply.name
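As an aside, a percentile table like the one above is usually produced with the nearest-rank method from raw per-query latencies; a quick sketch (Python, illustrative only - not the tooling actually used for these numbers):

```python
import math

def nearest_rank_percentile(latencies, q):
    """Nearest-rank percentile: the q'th percentile is the smallest
    measured value such that at least q% of the measurements are <= it."""
    vals = sorted(latencies)
    rank = max(1, math.ceil(q / 100.0 * len(vals)))
    return vals[rank - 1]
```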
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Fri, Sep 18, 2015 at 11:08 PM, Jesper Pedersen < jesper.peder...@redhat.com> wrote:

> On 09/11/2015 10:31 AM, Amit Kapila wrote:
>
>> Updated comments and the patch (increate_clog_bufs_v2.patch)
>> containing the same is attached.
>>
>
> I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
> RAID10 SSD (data + xlog) with Min(64,).
>

The benefit with this patch can be seen at somewhat higher client counts, as you can see in my initial mail. Can you please try once with a client count > 64?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reducing ClogControlLock contention
On Fri, Sep 18, 2015 at 11:31 PM, Jesper Pedersen < jesper.peder...@redhat.com> wrote:

> On 08/31/2015 07:34 AM, Amit Kapila wrote:
>
>> I have updated the patch (attached with mail) to show
>> you what I have in mind.
>>
>
> I haven't been able to get a successful run with _v5 using pgbench.
>

This patch is still not ready for performance tests; as you might have seen in the comments, we are still discussing locking issues.

> TransactionIdSetStatusBit assumes an exclusive lock on CLogControlLock
> when called, but that part is removed from TransactionIdSetPageStatus now.
>

It doesn't seem to be a necessary condition; that's why this patch introduces a smaller-granularity lock.

> I tried an
>
> if (!LWLockHeldByMe(CLogControlLock))
> {
>     LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
>     mode = LW_EXCLUSIVE;
> }
>
> approach, but didn't get further.

I suspect the problem is something else.

> Plus that probably isn't the best way, since we will traverse all LWLock's,

Yes, but it is stored in such a way that the caller will probably find it within the first few initial entries of the array.

Thank you very much for doing valuable performance tests with the CLog patches.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
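The conditional-acquire pattern above relies on tracking which locks the current backend already holds; schematically, that bookkeeping looks like this (a Python sketch with hypothetical names, mirroring the idea of the held-locks array that LWLockHeldByMe() scans - not PostgreSQL's actual implementation):

```python
import threading

# Per-thread list of locks currently held, standing in for the
# held_lwlocks array that LWLockHeldByMe() walks through.
_held = threading.local()

def held_by_me(lock):
    """Scan this thread's held-locks list for the given lock."""
    return lock in getattr(_held, "locks", [])

def acquire(lock):
    lock.acquire()
    if not hasattr(_held, "locks"):
        _held.locks = []
    # The most recently taken lock goes first, so a conditional
    # re-check tends to find it within the first few entries.
    _held.locks.insert(0, lock)

def release(lock):
    _held.locks.remove(lock)
    lock.release()
```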
Re: [HACKERS] [patch] Proposal for \rotate in psql
> You're suggesting a [d] choice, subdividing the horizontal headers.
> It seems to me like a pretty radical change, multiplying the number
> of columns, and it has also the potential to give poor results visually.
> Let's see if more feedback comes.
>

Yes, I know - plan @d needs a lot of new code, and the described feature is of the "nice to have" kind. The prerequisite is enhancing the drawing system in psql to support multiattribute (record) cells, which could be a nice feature in general. Something like:

 id |   C1    |      C2       |
+---+----+----+---+---+---+---+
    | A1 | A2 | A3| A4| A5| A6|
+===+====+====+===+===+===+===+
    |    |    |   |   |   |   |

Without this possibility plan @d is impossible.

Regards Pavel

>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>
Re: [HACKERS] Parallel Seq Scan
On Fri, Sep 18, 2015 at 9:45 PM, Amit Kapila wrote:
> On Fri, Sep 18, 2015 at 1:33 PM, Haribabu Kommi wrote:
>>
>> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila wrote:
>> >
>> > Attached, find the rebased version of patch.
>>
>> Here are the performance test results:
>>
>> Query        selectivity  HashAgg +      HashAgg + parallel seq scan (ms)
>>              (million)    seqscan (ms)   2 workers   4 workers   8 workers
>> $1 <= '001'  0.1          16717.00        7086.00     4459.00     2912.00
>> $1 <= '004'  0.4          17962.00        7410.00     4651.00     2977.00
>> $1 <= '008'  0.8          18870.00        7849.00     4868.00     3092.00
>> $1 <= '016'  1.5          21368.00        8645.00     6800.00     3486.00
>> $1 <= '030'  2.7          24622.00       14796.00    13108.00     9981.00
>> $1 <= '060'  5.4          31690.00       29839.00    26544.00    23814.00
>> $1 <= '080'  7.2          37147.00       40485.00    35763.00    32679.00
>>
>
> I think here probably when the selectivity is more than 5, it should
> not have selected the Funnel plan. Have you by any chance changed
> cpu_tuple_comm_cost? If not, then you can try setting the value of
> parallel_setup_cost (maybe 10) and then see if it selects the Funnel
> plan. Is it possible for you to check the cost difference of the Sequence
> and Funnel plans? Hopefully explain or explain analyze should be sufficient.

Yes, I changed cpu_tuple_comm_cost to zero to observe how parallel seq scan performs at high selectivity. I forgot to mention that in the earlier mail. Overall the parallel seq scan performance is good.

>> And also attached perf results for selectivity of 0.1 million and 5.4
>> million cases for analysis.
>>
>
> I have checked the perf reports and it seems that when selectivity is higher,
> time is being spent in some kernel calls, which could be due to communication
> of tuples.

Yes. Also, at low selectivity, usage of the tas and s_lock functions increases with the number of workers. Maybe these are also among the reasons for the scaling problem.
Regards, Hari Babu Fujitsu Australia -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Use pg_rewind when target timeline was switched
On Wed, Sep 16, 2015 at 9:47 AM, Alexander Korotkov wrote: > On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier wrote: > OK, I get it. I tried to get rid of code duplication in the attached patch. > There is getTimelineHistory() function which gets control file as argument > and fetches history from appropriate path. Cool, thanks! > Imagine following configuration of server. > 1 > / \ > 2 3 > | > 4 > > Initially, node 1 is master, nodes 2 and 3 are standbys for node 1. node 4 > is cascaded standby for node 3. > Then node 2 and node 3 are promoted. They both got timeline number 2. Then > node 3 is promoted and gets timeline number 3. Typo. You mean node 4 getting timeline 3 here. > Then we could try to rewind node 4 with node 2 as source. How pg_rewind > could know that timeline number 2 for those nodes is not the same? > We can only check that those timelines are forked from timeline 1 at the > same place. But coincidence is still possible. OK, I see your point and you are right. This additional check allows pg_rewind to switch one timeline back and make the scan of blocks begin at the real origin of both timelines. I had in mind the case where you tried to actually rewind node 2 to node 3 actually which would not be possible with your patch, and by thinking more about that I think that it could be possible to remove the check I am listing below and rely on the checks in the history files, basically what is in findCommonAncestorTimeline: if (ControlFile_target.checkPointCopy.ThisTimeLineID == ControlFile_source.checkPointCopy.ThisTimeLineID) pg_fatal("source and target cluster are on the same timeline\n"); Alexander, what do you think about that? I think that we should be able to rewind with for example node 2 as target and node 3 as source, and vice-versa as per the example you gave even if both nodes are on the same timeline, just that they forked at a different point. Could you test this particular case? 
Using my script, simply be sure to switch archive_mode to on/off depending on the node, aka only 3 and 4 do archiving.

+   /*
+    * Since incomplete segments are copied into next timelines, find the
+    * lastest timeline holding required segment. Assuming we can move
+    * in xlog both forward and backward, consider also switching timeline
+    * back.
+    */

s/lastest/correct? "lastest" does not sound very English to me, even if what it means is clear.

+ * Find minimum from two xlog pointers assuming invalid pointer (0) means
+ * infinity as timeline.h states.

This needs an update; timeline.h now points out correctly that this is not 0, but InvalidXLogRecPtr.

+   /* Retreive timelines for both source and target */

s/Retreive/Retrieve.

The refactoring of getTimelineHistory as you propose looks like a good idea to me. I tried to remove by myself the difference between source and target in copy_fetch.c and friends, but this gets uglier, particularly because of datadir_source in copy_file_range. Not worth it.

--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
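The ambiguity discussed above is exactly what timeline history files resolve: two timelines that happen to share an id are told apart by where they forked. A rough sketch of the comparison (Python, with a hypothetical data layout - each history is a list of (timeline_id, switchpoint_lsn) pairs, oldest first, with None for the live timeline's open end; this is an illustration, not pg_rewind's actual findCommonAncestorTimeline code):

```python
def find_common_ancestor(target_hist, source_hist):
    """Walk both histories from the oldest entry; the last entry that is
    identical in both (same timeline id AND same switchpoint LSN) is the
    fork point.  Two independently promoted nodes may share a timeline
    id, but their switchpoint LSNs will differ, so they are told apart."""
    ancestor = None
    for t, s in zip(target_hist, source_hist):
        if t != s:
            break
        ancestor = t
    if ancestor is None:
        raise ValueError("servers have no common timeline history")
    return ancestor
```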
Re: [HACKERS] Parallel Seq Scan
On Thu, Sep 17, 2015 at 2:28 AM, Robert Haas wrote:
>
> On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila wrote:
> +/*
> + * We expect each worker to populate the BufferUsage structure
> + * allocated by master backend and then master backend will aggregate
> + * all the usage along with it's own, so account it for each worker.
> + */
>
> This also needs improvement. Especially because...
>
> +/*
> + * We expect each worker to populate the instrumentation structure
> + * allocated by master backend and then master backend will aggregate
> + * all the information, so account it for each worker.
> + */
>
> ...it's almost identical to this one.

I think we can combine them and have one comment.

> +GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
>
> Could this be a static function? Will it really be needed outside this file?

It is already declared static, but I will add static to the function definition as well.

> And is there any use case for letting some of the arguments be NULL?

In earlier versions of the patch this API was used from other places, but now there is no such use, so I will change it accordingly.

> +bool
> +ExecParallelBufferUsageAccum(Node *node)
> +{
> +    if (node == NULL)
> +        return false;
> +
> +    switch (nodeTag(node))
> +    {
> +        case T_FunnelState:
> +        {
> +            FinishParallelSetupAndAccumStats((FunnelState*)node);
> +            return true;
> +        }
> +        break;
> +        default:
> +            break;
> +    }
> +
> +    (void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
> +                                 ExecParallelBufferUsageAccum, 0);
> +    (void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
> +                                 ExecParallelBufferUsageAccum, 0);
> +    return false;
> +}
>
> This seems wacky. I mean, isn't the point of planstate_tree_walker()
> that the callback itself doesn't have to handle recursion like this?
> And if not, then this wouldn't be adequate anyway, because some
> planstate nodes have children that are not in lefttree or righttree
> (cf. explain.c).
> Will change according to recent commit for planstate_tree_walker > +currentRelation = ExecOpenScanRelation(estate, > + ((SeqScan *) > node->ss.ps.plan)->scanrelid, > + eflags); > > I can't see how this can possibly be remotely correct. The funnel > node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist). > This is mainly used for generating tuple descriptor and that tuple descriptor will be used for forming scanslot, funnel node itself won't do any scan. However, we can completely eliminate this InitFunnel() function and use ExecAssignProjectionInfo() instead of ExecAssignScanProjectionInfo() to form the projection info. > > +buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * > sizeof(BufferUsage))); > > Cast it to a BufferUsage * first. Then you can use [i] to find > the i'th element. > Do you mean to say that the way code is written won't work? Values of structure BufferUsage for each worker is copied into string buffer_usage which I believe could be fetched in above way. > +/* > + * Re-initialize the parallel context and workers to perform > + * rescan of relation. We want to gracefully shutdown all the > + * workers so that they should be able to propagate any error > + * or other information to master backend before dying. > + */ > +FinishParallelSetupAndAccumStats(node); > > Somehow, this makes me feel like that function is badly named. > I think here comment seems to be slightly misleading, shall we change the comment as below: Destroy the parallel context to gracefully shutdown all the workers so that they should be able to propagate any error or other information to master backend before dying. > +/* > + * _readPlanInvalItem > + */ > +static PlanInvalItem * > +_readPlanInvalItem(void) > +{ > +READ_LOCALS(PlanInvalItem); > + > +READ_INT_FIELD(cacheId); > +READ_UINT_FIELD(hashValue); > + > +READ_DONE(); > +} > > I don't see why we should need to be able to copy PlanInvalItems. In > fact, it seems like a bad idea. 
> We are not copying PlanInvalItems, so I don't think this is required. Now I don't exactly remember why I have added this at first place, one reason could be that in earlier versions PlanInvalItems might have been copied. Anyway, I will verify it once more and if not required, I will remove it. > +#parallel_setup_cost = 0.0 # same scale as above > +#define DEFAULT_PARALLEL_SETUP_COST 0.0 > > This value is probably a bit on the low side. > How about keeping it as 10.0? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
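The point about planstate_tree_walker() in this exchange is that recursion belongs in the walker, not the callback; schematically (Python, with made-up minimal node types - not the real planner structures):

```python
class Node:
    """Minimal stand-in for a planstate node (illustrative only)."""
    def __init__(self, tag, lefttree=None, righttree=None):
        self.tag = tag
        self.lefttree = lefttree
        self.righttree = righttree

def planstate_tree_walker(node, callback):
    """The walker owns the recursion: a callback returning True stops
    the walk, and never needs to recurse into children by itself."""
    if node is None:
        return False
    if callback(node):
        return True
    return (planstate_tree_walker(node.lefttree, callback)
            or planstate_tree_walker(node.righttree, callback))
```

(The real walker must also visit children that live outside lefttree/righttree, which is exactly Robert's objection.)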
Re: [HACKERS] Parallel Seq Scan
On Fri, Sep 18, 2015 at 1:33 PM, Haribabu Kommi wrote:
> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila wrote:
> >
> > Attached, find the rebased version of patch.
>
> Here are the performance test results:
>
> Query        selectivity  HashAgg +      HashAgg + parallel seq scan (ms)
>              (million)    seqscan (ms)   2 workers   4 workers   8 workers
> $1 <= '001'  0.1          16717.00        7086.00     4459.00     2912.00
> $1 <= '004'  0.4          17962.00        7410.00     4651.00     2977.00
> $1 <= '008'  0.8          18870.00        7849.00     4868.00     3092.00
> $1 <= '016'  1.5          21368.00        8645.00     6800.00     3486.00
> $1 <= '030'  2.7          24622.00       14796.00    13108.00     9981.00
> $1 <= '060'  5.4          31690.00       29839.00    26544.00    23814.00
> $1 <= '080'  7.2          37147.00       40485.00    35763.00    32679.00
>

I think here probably when the selectivity is more than 5, it should not have selected the Funnel plan. Have you by any chance changed cpu_tuple_comm_cost? If not, then you can try setting the value of parallel_setup_cost (maybe 10) and then see if it selects the Funnel plan. Is it possible for you to check the cost difference of the Sequence and Funnel plans? Hopefully explain or explain analyze should be sufficient.

> And also attached perf results for selectivity of 0.1 million and 5.4
> million cases for analysis.
>

I have checked the perf reports and it seems that when selectivity is higher, time is being spent in some kernel calls, which could be due to communication of tuples.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
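To see why a high fraction of returned tuples can make the Funnel plan lose, here is a toy cost model (illustrative only - these are not the planner's real formulas, and the constants are made up): scanning is divided among workers, but every tuple sent back to the master pays a per-tuple communication cost.

```python
def seqscan_cost(rows, per_tuple_cost=0.01):
    """Toy cost of a plain sequential scan."""
    return rows * per_tuple_cost

def funnel_cost(rows, workers, per_tuple_cost=0.01,
                parallel_setup_cost=10.0, cpu_tuple_comm_cost=0.1):
    """Toy cost of a Funnel plan: the scan is divided among the workers
    plus the leader, but every returned tuple pays a communication cost,
    so high selectivity (many returned rows) erodes the benefit."""
    scan = rows * per_tuple_cost / (workers + 1)
    comm = rows * cpu_tuple_comm_cost
    return parallel_setup_cost + scan + comm
```

With cpu_tuple_comm_cost forced to zero (as in the test above), the Funnel plan looks cheaper even when every row is returned, which matches the behavior Haribabu observed.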
Re: [HACKERS] [patch] Proposal for \rotate in psql
Pavel Stehule wrote:

> my comments:
>
> 1. I don't understand why you use two methods for sorting columns
> (qsort, and query with ORDER BY)

qsort (with strcmp as the comparator) is only used to determine the set of distinct values for the vertical and horizontal headers. In fact this is just to allow the use of bsearch() when doing that, instead of a slower sequential search.

Once the values for the horizontal headers are known, they are passed to the server for sorting with server-side semantics according to their type. The order will differ from what qsort/strcmp produce, as the comparison semantics are different.

The values for the vertical header are not sorted server-side because we keep the order of the query for displaying top to bottom. In the typical use case, it will have ORDER BY 1[,2], possibly with one/two DESC qualifiers.

> 2. Data column are not well aligned - numbers are aligned as text

Yes. I'll look shortly into fixing that.

Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
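The qsort()+bsearch() idea for collecting distinct header values can be sketched like this (Python's bisect standing in for bsearch; illustrative only, not the patch's C code):

```python
import bisect

def distinct_sorted(values):
    """Collect distinct header values, kept sorted so that each
    membership test is a binary search rather than a sequential scan -
    the role qsort()+bsearch() play in the patch."""
    headers = []
    for v in values:
        i = bisect.bisect_left(headers, v)
        if i == len(headers) or headers[i] != v:
            headers.insert(i, v)
    return headers
```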
Re: [HACKERS] On-demand running query plans using auto_explain and signals
On Fri, Sep 18, 2015 at 12:59 PM, Pavel Stehule wrote:

> 2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr < oleksandr.shul...@zalando.de>:
>
>> On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule wrote:
>>
>>> It should not be true - the data sender creates the DSM and fills it. Then
>>> it sets the caller's slot and sends a signal to the caller. The caller can
>>> free the DSM at any time, because the data sender will never touch it again.
>>
>> But the requester can timeout on waiting for a reply and exit before it
>> sees the reply DSM. Actually, I now don't even think a backend can free
>> the DSM it has not created. First it will need to attach it, effectively
>> increasing the refcount, and upon detach it will only decrease the
>> refcount, but not actually release the segment...
>
> I am afraid this has no simple and nice solution - if the data sender waits
> for the moment when the data are received, then we have the same complexity
> as if we used shm_mq.
>
> Isn't it better to introduce a new background worker with the responsibility
> to clean up orphaned DSM?

I'm not thrilled by this idea. We don't even need a worker to do that, as leaked segments are reported by the backend itself upon transaction commit (see ResourceOwnerReleaseInternal), e.g:

WARNING: dynamic shared memory leak: segment 808539725 still referenced

Still, I believe relying on some magic cleanup mechanism and not caring about managing the shared memory properly is not the way to go.

-- Alex
Re: [HACKERS] [patch] Proposal for \rotate in psql
Pavel Stehule wrote: > in the help inside your last patch, you are using "crosstab". Cannto be > crosstab the name for this feature? If it wasn't taken already by contrib/tablefunc, that would be a first choice. But now, when searching for crosstab+postgresql, pages of results come out concerning the crosstab() function. So not using \crosstab is deliberate; it's to prevent confusion with the server-side function. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas wrote:
> On Thu, Sep 17, 2015 at 2:54 AM, Amit Kapila wrote:
> >
> > Here the subplan is generated before the top level plan and while generation
> > of subplan we can't predict whether it is okay to generate it as Funnel or
> > not, because it might be that top level plan is non-Funnel. Also if such a
> > subplan is actually an InitPlan, then we are safe (as we execute the
> > InitPlans in master backend and then pass the result to parallel worker)
> > even if top level plan is Funnel. I think the place where we can catch this
> > is during the generation of Funnel path, basically we can evaluate if any
> > nodes beneath Funnel node has 'filter' or 'targetlist' as another Funnel
> > node, then we have two options to proceed:
> > a. Mark such a filter or target list as non-pushable which will indicate
> > that they need to be executed only in master backend. If we go with this
> > option, then we have to make Funnel node capable of evaluating Filter
> > and Targetlist which is not a big thing.
> > b. Don't choose the current path as Funnel path.
> >
> > I prefer second one as that seems to be simpler as compare to first and
> > there doesn't seem to be much benefit in going by first.
> >
> > Any better ideas?
>
> I haven't studied the planner logic in enough detail yet to have a
> clear opinion on this. But what I do think is that this is a very
> good reason why we should bite the bullet and add outfuncs/readfuncs
> support for all Plan nodes. Otherwise, we're going to have to scan
> subplans for nodes we're not expecting to see there, which seems
> silly. We eventually want to allow all of those nodes in the worker
> anyway.

Makes sense to me. There are 39 plan nodes, and it seems we have support for all of them in outfuncs but need to add support for most of them in readfuncs.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On-demand running query plans using auto_explain and signals
2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr:

> On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule wrote:
>
>> 2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr < oleksandr.shul...@zalando.de>:
>>
>>> If we take the per-backend slot approach the locking seems unnecessary
>>> and there are principally two options:
>>>
>>> 1) The backend puts the DSM handle in its own slot and notifies the
>>> requester to read it.
>>> 2) The backend puts the DSM handle in the slot of the requester (and
>>> notifies it).
>>>
>>> If we go with the first option, the backend that has created the DSM
>>> will not know when it's OK to free it, so this has to be responsibility of
>>> the requester. If the latter exits before reading and freeing the DSM, we
>>> have a leak. Even bigger is the problem that the sender backend can no
>>> longer send responses to a number of concurrent requestors: if its slot is
>>> occupied by a DSM handle, it can not send a reply to another backend until
>>> the slot is freed.
>>>
>>> With the second option we have all the same problems with not knowing
>>> when to free the DSM and potentially leaking it, but we can handle
>>> concurrent requests.
>>
>> It should not be true - the data sender creates the DSM and fills it. Then
>> it sets the caller's slot and sends a signal to the caller. The caller can
>> free the DSM at any time, because the data sender will never touch it again.
>
> But the requester can timeout on waiting for a reply and exit before it sees
> the reply DSM. Actually, I now don't even think a backend can free the DSM
> it has not created. First it will need to attach it, effectively
> increasing the refcount, and upon detach it will only decrease the
> refcount, but not actually release the segment...

I am afraid this has no simple and nice solution - if the data sender waits for the moment when the data are received, then we have the same complexity as if we used shm_mq.

Isn't it better to introduce a new background worker with the responsibility to clean up orphaned DSM?
Regards Pavel > > So this has to be the responsibility of the reply sending backend in the > end: to create and release the DSM *at some point*. > > -- > Alex > >
Re: [HACKERS] Freeze avoidance of very large table.
On Fri, Sep 18, 2015 at 6:13 PM, Fujii Masaowrote: > On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada wrote: >> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao wrote: >>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada >>> wrote: On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada wrote: > On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera > wrote: >> Jim Nasby wrote: >> >>> I think things like pageinspect are very different; I really can't see >>> any >>> use for those beyond debugging (and debugging by an expert at that). >> >> I don't think that necessarily means it must continue to be in contrib. >> Quite the contrary, I think it is a tool critical enough that it should >> not be relegated to be a second-class citizen as it is now (let's face >> it, being in contrib *is* second-class citizenship). >> > > Attached patch is latest patch. The previous patch lacks some files for regression test. Attached fixed v12 patch. >>> >>> The patch could be applied cleanly. "make check" could pass successfully. >>> But "make check-world -j 2" failed. >>> >> >> Thank you for looking at this patch. >> Could you tell me what test you got failed? >> make check-world -j 2 or more is done successfully in my environment. > > I tried to do the test again, but initdb failed with the following error. > > creating template1 database in data/base/1 ... FATAL: invalid > input syntax for type oid: "f" > > This error didn't happen when I tested before. So the commit which was > applied recently might interfere with the patch. > Thank you for testing! Attached fixed version patch. Regards, -- Masahiko Sawada 000_add_frozen_bit_into_visibilitymap_v13.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On-demand running query plans using auto_explain and signals
2015-09-18 13:22 GMT+02:00 Shulgin, Oleksandr:

> On Fri, Sep 18, 2015 at 12:59 PM, Pavel Stehule wrote:
>
>> 2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr < oleksandr.shul...@zalando.de>:
>>
>>> On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule wrote:
>>>
>>>> It should not be true - the data sender creates the DSM and fills it.
>>>> Then it sets the caller's slot and sends a signal to the caller. The
>>>> caller can free the DSM at any time, because the data sender will never
>>>> touch it again.
>>>
>>> But the requester can timeout on waiting for a reply and exit before it
>>> sees the reply DSM. Actually, I now don't even think a backend can free
>>> the DSM it has not created. First it will need to attach it, effectively
>>> increasing the refcount, and upon detach it will only decrease the
>>> refcount, but not actually release the segment...
>>
>> I am afraid this has no simple and nice solution - if the data sender
>> waits for the moment when the data are received, then we have the same
>> complexity as if we used shm_mq.
>>
>> Isn't it better to introduce a new background worker with the
>> responsibility to clean up orphaned DSM?
>
> I'm not thrilled by this idea.
>
> We don't even need a worker to do that, as leaked segments are reported by
> the backend itself upon transaction commit (see
> ResourceOwnerReleaseInternal), e.g:
>
> WARNING: dynamic shared memory leak: segment 808539725 still referenced
>
> Still I believe relying on some magic cleanup mechanism and not caring
> about managing the shared memory properly is not the way to go.

It was my idea too, to check shared memory on exit. The disadvantage is clear - sometimes you can wait too long - although the practical limit for a session is about 2h.
Re: [HACKERS] [patch] Proposal for \rotate in psql
2015-09-18 13:36 GMT+02:00 Daniel Verite:

> Pavel Stehule wrote:
>
>> in the help inside your last patch, you are using "crosstab". Cannot
>> crosstab be the name for this feature?
>
> If it wasn't taken already by contrib/tablefunc, that would be a first
> choice. But now, when searching for crosstab+postgresql, pages of
> results come out concerning the crosstab() function.
>
> So not using \crosstab is deliberate; it's to prevent confusion with
> the server-side function.

I'm not afraid of this - crosstab is a function in an extension. Psql backslash commands live in a different world. When we introduce the new command, Google will adapt to it. For this use case the correct keywords are "crosstab psql postgres".

Regards Pavel
Re: [HACKERS] On-demand running query plans using auto_explain and signals
On Thu, Sep 17, 2015 at 12:47 PM, Shulgin, Oleksandrwrote: > But if we make the sender backend create the DSM with the response payload, > it suddenly becomes really unclear at which point and who should make the > final detach of that DSM. We're getting back to the problem of > synchronization between these processes all over again. Yes, some thought is needed here. There's not really a problem if somebody asks for information just once: you could just have the publishing process keep it attached until process exit, and nothing bad would happen. The real problem comes when multiple people ask for information in quick succession: if you detach the old segment too quickly after publishing it, the guy who requested that data might not have attached it yet, and then when he gets around to looking at it, it won't be there. This seems like a pretty solvable problem, though. For example, when somebody asks for information, they store in the main shared segment a pointer to their PGPROC and then they signal the target process, which responds by publishing a dsm_segment_id. When the requesting process has accessed it, or when it errors out or exits, it clears the PGPROC and the dsm_segment_id. The publisher avoids unmapping it until that happens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
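The handshake Robert describes can be sketched as follows (a Python sketch with threads standing in for backends and made-up names; the real thing would live in the main shared segment and use process signals, not an Event):

```python
import threading

class ProgressSlot:
    """One per-backend slot: the requester publishes its identity, the
    target publishes a segment id, and the publisher keeps the segment
    mapped until the requester clears the slot."""
    def __init__(self):
        self.lock = threading.Lock()
        self.requester = None     # stands in for the PGPROC pointer
        self.segment_id = None    # stands in for the dsm_segment_id
        self.published = threading.Event()

    def request(self, who):
        """Requester registers itself, then would signal the target."""
        with self.lock:
            self.requester = who

    def publish(self, seg_id):
        """Target backend publishes the DSM handle for the requester."""
        with self.lock:
            self.segment_id = seg_id
        self.published.set()

    def consume(self):
        """Requester reads the handle and clears the slot, which tells
        the publisher it is now safe to unmap the segment."""
        self.published.wait()
        with self.lock:
            seg, self.requester, self.segment_id = self.segment_id, None, None
        self.published.clear()
        return seg

    def can_unmap(self):
        """Publisher polls this before detaching its mapping."""
        with self.lock:
            return self.requester is None
```

The same clearing path would also run from the requester's error/exit cleanup, which is what prevents the publisher from keeping the segment mapped forever.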
Re: [HACKERS] [patch] Proposal for \rotate in psql
2015-09-18 13:59 GMT+02:00 Daniel Verite: > Pavel Stehule wrote: > > > > my comments: > > > > 1. I don't understand why you are use two methods for sorting columns > > (qsort, and query with ORDER BY) > > qsort (with strcmp as the comparator) is only used to determine the > set of distinct values for the vertical and horizontal headers. > In fact this is just to allow the use of bsearch() when doing that > instead of a slower sequential search. > > Once the values for the horizontal headers are known, they > are passed to the server for sorting with server-side semantics > according to their type. The order will differ from qsort/strcmp > found as the comparison semantics are different. > > The values for the vertical header are not sorted server-side > because we keep the order of the query for displaying > top to bottom. > In the typical use case , it will have ORDER BY 1[,2] possibly > with one/two DESC qualifiers. > ok > > > 2. Data column are not well aligned - numbers are aligned as text > > Yes. I'll look shortly into fixing that. > > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite >
Re: [HACKERS] synchronous_commit = apply
Hello, I have some random comments. At Wed, 16 Sep 2015 23:07:03 +1200, Thomas Munro wrote in
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
> On 16 Sep 2015, at 20:52, Robert Haas wrote:
>
> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov wrote:
>> Yes, the major question is cost. But I think we should validate our thoughts
>> by experiments assuming there are more possible synchronization protocols.
>> Ildus posted an implementation of the double buffering approach that showed
>> quite low cost.
>
> I'm not sure exactly which email you are referring to, but I don't
> believe that anyone has done experiments that are anywhere near
> comprehensive enough to convince ourselves that this won't be a
> problem. If a particular benchmark doesn't show an issue, that can
> just mean that the benchmark isn't hitting the case where there is a
> problem. For example, EDB has had customers who have severe
> contention apparently on the buffer content lwlocks, resulting in big
> slowdowns. You don't see that in, say, a pgbench run. But for people
> who have certain kinds of queries, it's really bad. Those sorts of
> loads, where the lwlock system really gets stressed, are cases where
> adding overhead seems likely to pinch.

Alexander and Ildus gave us two patches for REL9_4_STABLE - one with sampling every N milliseconds and one with measuring timings. We have tested both of them on two kinds of our workload besides pgbench runs.
One workload is OLTP where all the data fits in shared buffers, synchronous_commit is off and the bottleneck is ProcArrayLock (all backtraces look something like the following):

[Thread debugging using libthread_db enabled]
0x7f10e01d4f27 in semop () from /lib64/libc.so.6
#0 0x7f10e01d4f27 in semop () from /lib64/libc.so.6
#1 0x0061fe27 in PGSemaphoreLock (sema=0x7f11f2d4f430, interruptOK=0 '\000') at pg_sema.c:421
#2 0x006769ba in LWLockAcquireCommon (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:626
#3 LWLockAcquire (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:467
#4 0x00667862 in ProcArrayEndTransaction (proc=0x7f11f2d4f420, latestXid=182562881) at procarray.c:404
#5 0x004b579b in CommitTransaction () at xact.c:1957
#6 0x004b6ae5 in CommitTransactionCommand () at xact.c:2727
#7 0x006819d9 in finish_xact_command () at postgres.c:2437
#8 0x00684f05 in PostgresMain (argc=, argv=, dbname=0x21e1a70 "xivadb", username=) at postgres.c:4270
#9 0x00632d7d in BackendRun (argc=, argv=) at postmaster.c:4155
#10 BackendStartup (argc=, argv=) at postmaster.c:3829
#11 ServerLoop (argc=, argv=) at postmaster.c:1597
#12 PostmasterMain (argc=, argv=) at postmaster.c:1244
#13 0x005cadb8 in main (argc=3, argv=0x21e0aa0) at main.c:228

Another workload is when all the data fits in RAM but does not fit in shared buffers, and the bottleneck is mostly BufFreelistLock, with sensible contention around buffer partition locking and buffer locking.
Taking several backtraces with GDB of several backends and doing some bash magic gives the following:

root@pgload01g ~ # grep '^#4 ' /tmp/bt | awk '{print $2, $4, $NF}' | sort | uniq -c | sort -rn
126 0x0065db61 BufferAlloc bufmgr.c:591
67 0x0065e03a BufferAlloc bufmgr.c:760
43 0x005c8c3b pq_getbyte pqcomm.c:899
39 0x0065dd93 BufferAlloc bufmgr.c:765
6 0x004b52bb RecordTransactionCommit xact.c:1194
4 0x0065da0e ReadBuffer_common bufmgr.c:476
1 ReadBuffer_common relpersistence=112 bufmgr.c:340
1 exec_eval_expr expr=0x166e908, pl_exec.c:4796
1 0x7f78b8cb217b ?? /usr/pgsql-9.4/lib/pg_stat_statements.so
1 0x005d4cbb _copyList copyfuncs.c:3849
root@pgload01g ~ #

For both scenarios on Linux we got approximately the same results - the version with timings was faster than the version with sampling (sampling was done every 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps, the version with timings gave ~14500 tps, while the version with sampling gave ~13800 tps. In all cases the processor was 100% utilized. Comparing vanilla PostgreSQL and the version with timings on a constant workload (12000 tps) gave the following query latencies:

q'th  vanilla  timing
99%   79.0     97.0 (+18.0)
98%   64.0     76.0 (+12.0)
95%   38.0     47.0 (+9.0)
90%   16.0     21.0 (+5.0)
85%    7.0     11.0 (+4.0)
80%    5.0      7.0 (+2.0)
75%    4.0      5.0 (+1.0)
50%    2.0      3.0 (+1.0)

And in that test the version with timings consumed about 7% more CPU. Does that look like results from a workload where the lwlock system is stressed? And when the data does not fit in RAM you really don't see much difference between all three versions, because contention moves from the lwlock system to I/O, even with the newest NVMe SSDs, or at least is divided between lwlocks and other wait events.

>
>> Yes, but some competing products also provides
Re: [HACKERS] row_security GUC, BYPASSRLS
On 09/18/2015 09:25 AM, Adam Brightwell wrote:
>>> 1. remove row_security=force
>>> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to
>>> policies
>>> 3. add DDL-controlled, per-table policy forcing
>>>
>>> They ought to land in that order. PostgreSQL 9.5 would need at least (1)
>>> and
>>> (2); would RLS experts find it beneficial for me to take care of those?
>>
>> That would be awesome, but I would say that if we do #1 & 2 for 9.5, we
>> also need #3.
>
> Agreed. Please let me know if there is anything I can do to help.

Yes, same here.

Joe

-- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: [HACKERS] [COMMITTERS] pgsql: Add pages deleted from pending list to FSM
Reply to both emails in one Fujii: > So autoanalyze still doesn't call IndexFreeSpaceMapVacuum(). Fixed, see patch > ginvacuumcleanup calls RecordFreeIndexPage() twice for the same block. fixed too Tom: It seems quite bizarre for auto-analyze to do that. auto-vacuum, sure, but I do not think this should get plugged into ANALYZE. I agree, but we want to have pending list shorter as possible. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c index 76bebad..7be9362 100644 --- a/src/backend/access/gin/ginfast.c +++ b/src/backend/access/gin/ginfast.c @@ -434,7 +434,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector) END_CRIT_SECTION(); if (needCleanup) - ginInsertCleanup(ginstate, false, NULL); + ginInsertCleanup(ginstate, false, true, NULL); } /* @@ -505,7 +505,7 @@ ginHeapTupleFastCollect(GinState *ginstate, */ static bool shiftList(Relation index, Buffer metabuffer, BlockNumber newHead, - IndexBulkDeleteResult *stats) + bool fill_fsm, IndexBulkDeleteResult *stats) { Pagemetapage; GinMetaPageData *metadata; @@ -732,13 +732,19 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka, * action of removing a page from the pending list really needs exclusive * lock. * - * vac_delay indicates that ginInsertCleanup is called from vacuum process, - * so call vacuum_delay_point() periodically. + * vac_delay indicates that ginInsertCleanup should so call + * vacuum_delay_point() periodically. + * + * fill_fsm indicates that ginInsertCleanup should add deleted pages + * to FSM otherwise caller is responsible to put deleted pages into + * FSM. + * * If stats isn't null, we count deleted pending pages into the counts. 
*/ void ginInsertCleanup(GinState *ginstate, -bool vac_delay, IndexBulkDeleteResult *stats) +bool vac_delay, bool fill_fsm, +IndexBulkDeleteResult *stats) { Relationindex = ginstate->index; Buffer metabuffer, @@ -899,7 +905,7 @@ ginInsertCleanup(GinState *ginstate, * remove read pages from pending list, at this point all * content of read pages is in regular structure */ - if (shiftList(index, metabuffer, blkno, stats)) + if (shiftList(index, metabuffer, blkno, fill_fsm, stats)) { /* another cleanup process is running concurrently */ LockBuffer(metabuffer, GIN_UNLOCK); @@ -948,7 +954,7 @@ ginInsertCleanup(GinState *ginstate, * desirable to recycle them immediately to the FreeSpace Map when * ordinary backends clean the list. */ - if (fsm_vac && !vac_delay) + if (fsm_vac && fill_fsm) IndexFreeSpaceMapVacuum(index); diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c index 1315762..323cfb0 100644 --- a/src/backend/access/gin/ginvacuum.c +++ b/src/backend/access/gin/ginvacuum.c @@ -544,7 +544,7 @@ ginbulkdelete(PG_FUNCTION_ARGS) /* Yes, so initialize stats to zeroes */ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult)); /* and cleanup any pending inserts */ - ginInsertCleanup(, true, stats); + ginInsertCleanup(, true, false, stats); } /* we'll re-count the tuples each time */ @@ -659,7 +659,7 @@ ginvacuumcleanup(PG_FUNCTION_ARGS) if (IsAutoVacuumWorkerProcess()) { initGinState(, index); - ginInsertCleanup(, true, stats); + ginInsertCleanup(, true, true, stats); } PG_RETURN_POINTER(stats); } @@ -672,7 +672,7 @@ ginvacuumcleanup(PG_FUNCTION_ARGS) { stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult)); initGinState(, index); - ginInsertCleanup(, true, stats); + ginInsertCleanup(, true, false, stats); } memset(, 0, sizeof(idxStat)); diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h index acbe36a..5021887 100644 --- a/src/include/access/gin_private.h +++ 
b/src/include/access/gin_private.h @@ -936,7 +936,7 @@ extern void ginHeapTupleFastCollect(GinState *ginstate, OffsetNumber attnum, Datum value, bool isNull, ItemPointer ht_ctid); extern void
Re: [HACKERS] tsvector work with citext
> Fixable?

Fixed (9acb9007de30b3daaa9efc16763c3bc6e3e0a92d), but didn't backpatch because it isn't a critical bug. Thank you for the report!

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [HACKERS] row_security GUC, BYPASSRLS
>> 1. remove row_security=force
>> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to policies
>> 3. add DDL-controlled, per-table policy forcing
>>
>> They ought to land in that order. PostgreSQL 9.5 would need at least (1) and
>> (2); would RLS experts find it beneficial for me to take care of those?
>
> That would be awesome, but I would say that if we do #1 & 2 for 9.5, we
> also need #3.

Agreed. Please let me know if there is anything I can do to help.

-Adam

-- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
Jan Wieck writes:
> Attached is a complete rework of the fix from scratch, based on Tom's
> suggestion.
> The code now maintains a double linked list as suggested, but only uses
> it to mark all currently valid entries as invalid when hashvalue == 0.
> If a specific entry is invalidated it performs a hash lookup for maximum
> efficiency in that case.

That does not work; you're ignoring the possibility of hashvalue collisions, and for that matter not considering that the hash value is not equal to the hash key. I don't think your code is invalidating the right entry at all during a foreign key constraint update, and it certainly cannot do the right thing if there's a hash value collision.

Attached is something closer to what I was envisioning; can you do performance testing on it?

regards, tom lane

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c index 61edde9..db1787a 100644 *** a/src/backend/utils/adt/ri_triggers.c --- b/src/backend/utils/adt/ri_triggers.c *** *** 40,45 --- 40,46 #include "commands/trigger.h" #include "executor/executor.h" #include "executor/spi.h" + #include "lib/ilist.h" #include "parser/parse_coerce.h" #include "parser/parse_relation.h" #include "miscadmin.h" *** typedef struct RI_ConstraintInfo *** 125,130 --- 126,132 * PK) */ Oid ff_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (FK = * FK) */ + dlist_node valid_link; /* Link in list of valid entries */ } RI_ConstraintInfo; *** typedef struct RI_CompareHashEntry *** 185,190 --- 187,194 static HTAB *ri_constraint_cache = NULL; static HTAB *ri_query_cache = NULL; static HTAB *ri_compare_cache = NULL; + static dlist_head ri_constraint_cache_valid_list; + static int ri_constraint_cache_valid_count = 0; /* -- *** ri_LoadConstraintInfo(Oid constraintOid) *** 2924,2929 --- 2928,2940 ReleaseSysCache(tup); + /* + * For efficient processing of invalidation messages below, we keep a + * doubly-linked list, and a count, of all currently valid entries. 
+ */ + dlist_push_tail(_constraint_cache_valid_list, >valid_link); + ri_constraint_cache_valid_count++; + riinfo->valid = true; return riinfo; *** ri_LoadConstraintInfo(Oid constraintOid) *** 2936,2956 * gets enough update traffic that it's probably worth being smarter. * Invalidate any ri_constraint_cache entry associated with the syscache * entry with the specified hash value, or all entries if hashvalue == 0. */ static void InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue) { ! HASH_SEQ_STATUS status; ! RI_ConstraintInfo *hentry; Assert(ri_constraint_cache != NULL); ! hash_seq_init(, ri_constraint_cache); ! while ((hentry = (RI_ConstraintInfo *) hash_seq_search()) != NULL) { ! if (hentry->valid && ! (hashvalue == 0 || hentry->oidHashValue == hashvalue)) ! hentry->valid = false; } } --- 2947,2987 * gets enough update traffic that it's probably worth being smarter. * Invalidate any ri_constraint_cache entry associated with the syscache * entry with the specified hash value, or all entries if hashvalue == 0. + * + * Note: at the time a cache invalidation message is processed there may be + * active references to the cache. Because of this we never remove entries + * from the cache, but only mark them invalid, which is harmless to active + * uses. (Any query using an entry should hold a lock sufficient to keep that + * data from changing under it --- but we may get cache flushes anyway.) */ static void InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue) { ! dlist_mutable_iter iter; Assert(ri_constraint_cache != NULL); ! /* ! * If the list of currently valid entries gets excessively large, we mark ! * them all invalid so we can empty the list. This arrangement avoids ! * O(N^2) behavior in situations where a session touches many foreign keys ! * and also does many ALTER TABLEs, such as a restore from pg_dump. ! */ ! if (ri_constraint_cache_valid_count > 1000) ! hashvalue = 0; /* pretend it's a cache reset */ ! ! 
dlist_foreach_modify(iter, _constraint_cache_valid_list) { ! RI_ConstraintInfo *riinfo = dlist_container(RI_ConstraintInfo, ! valid_link, iter.cur); ! ! if (hashvalue == 0 || riinfo->oidHashValue == hashvalue) ! { ! riinfo->valid = false; ! /* Remove invalidated entries from the list, too */ ! dlist_delete(iter.cur); ! ri_constraint_cache_valid_count--; ! } } } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
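The structure of Tom's fix — an intrusive doubly-linked list threaded through only the currently-valid cache entries, with a threshold that degrades to a full reset — can be modelled in a few lines of standalone C. This is a toy, not the ri_triggers.c code; only the 1000-entry threshold mirrors the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Toy cache entry with intrusive dlist links, used only while valid. */
typedef struct CacheEntry
{
	uint32_t	hashValue;
	int			valid;
	struct CacheEntry *prev;
	struct CacheEntry *next;
} CacheEntry;

static CacheEntry head = {0, 0, &head, &head};	/* circular list header */
static int	valid_count = 0;

#define VALID_LIST_MAX 1000

void
entry_make_valid(CacheEntry *e, uint32_t hash)
{
	e->hashValue = hash;
	e->valid = 1;
	e->prev = head.prev;		/* push_tail onto the valid list */
	e->next = &head;
	head.prev->next = e;
	head.prev = e;
	valid_count++;
}

void
invalidate(uint32_t hashvalue)
{
	CacheEntry *e;

	if (valid_count > VALID_LIST_MAX)
		hashvalue = 0;			/* pretend it's a full cache reset */

	/* Walk only the valid entries, never the whole hash table. */
	for (e = head.next; e != &head;)
	{
		CacheEntry *next = e->next;

		if (hashvalue == 0 || e->hashValue == hashvalue)
		{
			e->valid = 0;
			e->prev->next = e->next;	/* unlink: the walk stays short */
			e->next->prev = e->prev;
			valid_count--;
		}
		e = next;
	}
}
```

The O(N^2) behavior disappears because each invalidation message touches only entries that are still valid; a session that touches many foreign keys and then runs many ALTER TABLEs pays for each entry at most once.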
Re: [HACKERS] row_security GUC, BYPASSRLS
> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to policies

I believe this one has already been addressed by Stephen (20150910192313.gt3...@tamriel.snowman.net)? Are there further considerations for his proposed changes?

-Adam

-- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com
Re: [HACKERS] row_security GUC, BYPASSRLS
On 09/18/2015 01:07 AM, Noah Misch wrote:
> Great. Robert, does that work for you, too? If so, this sub-thread is
> looking at three patches:
>
> 1. remove row_security=force
> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to policies
> 3. add DDL-controlled, per-table policy forcing
>
> They ought to land in that order. PostgreSQL 9.5 would need at least (1) and
> (2); would RLS experts find it beneficial for me to take care of those?

That would be awesome, but I would say that if we do #1 & 2 for 9.5, we also need #3.

Joe

-- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: [HACKERS] [COMMITTERS] pgsql: Add pages deleted from pending list to FSM
[ moving thread to -hackers ]

Fujii Masao writes:
> So autoanalyze still doesn't call IndexFreeSpaceMapVacuum().
> That is, only backend can clean the list in INSERT-only workload.
> I don't think that this is desirable. Because the backend will
> periodically take a big performance penalty.
> So I'm thinking that even autoanalyze should call IndexFreeSpaceMapVacuum()
> to clean the list in a background, in order to avoid such spikes in
> INSERT response time. Thought?

It seems quite bizarre for auto-analyze to do that. auto-vacuum, sure, but I do not think this should get plugged into ANALYZE. Maybe we need to rethink the rules for what is autovacuum vs. autoanalyze, or come up with some third thing that the autovac daemon does for indexes.

regards, tom lane
Re: [HACKERS] row_security GUC, BYPASSRLS
On Fri, Sep 18, 2015 at 2:07 AM, Noah Misch wrote:
> On Tue, Sep 15, 2015 at 03:18:21PM -0400, Adam Brightwell wrote:
>> On Tue, Sep 15, 2015 at 2:26 PM, Joe Conway wrote:
>> >> Joe Conway writes:
>> >>> There are use cases where row_security=force will be set in production
>> >>> environments, not only in testing.
>
>> > Noah's suggestion of using a per table attribute
>> > would work -- in fact I like the idea of that better than using the
>> > current GUC.
>>
>> FWIW, I also concur with a per table attribute for this purpose. In
>> fact, I think I really like the per-table flexibility over an
>> 'all-or-nothing' approach better too.
>
> Great. Robert, does that work for you, too?

Yes, that seems like a fine design from my point of view.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Use pg_rewind when target timeline was switched
On Wed, Sep 16, 2015 at 7:47 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier <michael.paqu...@gmail.com> wrote:
>> On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
>> > I also added an additional check in findCommonAncestorTimeline(). Two standbys
>> > could be independently promoted and get the same new timeline ID. Now, it's
>> > checked that timelines, that we assume to be the same, have equal begins.
>> > Begins could still match by coincidence. But the same risk exists in
>> > original pg_rewind, though.
>>
>> Really? pg_rewind blocks attempts to rewind two nodes that are already
>> on the same timeline before checking if their timeline histories map at
>> some point or not:
>> /*
>>  * If both clusters are already on the same timeline, there's nothing to
>>  * do.
>>  */
>> if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
>>     ControlFile_source.checkPointCopy.ThisTimeLineID)
>>     pg_fatal("source and target cluster are on the same timeline\n");
>> And this seems really justified to me. Imagine that you have one
>> master, with two standbys linked to it. If both standbys are promoted
>> to the same timeline, you could easily replug them to the master, but
>> I fail to see the point of being able to replug one promoted standby with
>> the other in this case: those nodes have segment and history files
>> that map with each other, an operator could easily mess up things in
>> such a configuration.
>
> Imagine the following server configuration:
>
>     1
>    / \
>   2   3
>       |
>       4
>
> Initially, node 1 is the master, nodes 2 and 3 are standbys for node 1, and node 4
> is a cascaded standby for node 3.
> Then node 2 and node 3 are promoted. They both get timeline number 2. Then
> node 3 is promoted and gets timeline number 3.
> Then we could try to rewind node 4 with node 2 as source. How could pg_rewind
> know that timeline number 2 for those nodes is not the same?
> We can only check that those timelines are forked from timeline 1 at the
> same place. But coincidence is still possible.

BTW, it would be an option to generate a system_identifier for each new timeline, by analogy with what initdb does for the whole WAL. Having such system_identifiers we could distinguish different timelines which have been assigned the same id. Any thoughts?

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
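The collision risk under discussion can be put in rough numbers with a birthday-style estimate. The arithmetic below is my own, not from the thread: it models n independently promoted nodes each tagging a new timeline with a random k-byte identifier.

```c
#include <assert.h>

/*
 * First-order birthday approximation: with N = 2^(8 * tag_bytes) possible
 * tags, the chance that n independently chosen random tags collide is
 * roughly n * (n - 1) / (2 * N) when that value is small.
 */
double
collision_prob(int n, int tag_bytes)
{
	double		N = 1.0;
	int			i;

	for (i = 0; i < 8 * tag_bytes; i++)
		N *= 2.0;				/* N = 2^(8 * tag_bytes) */
	return (double) n * (n - 1) / (2.0 * N);
}
```

In this model, two independently promoted standbys collide on a 1-byte tag about 0.4% of the time and on a 2-byte tag about 0.0015% of the time — small, but arguably not negligible when a collision means rewinding against the wrong history.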
Re: [HACKERS] Parallel Seq Scan
On Thu, Sep 17, 2015 at 11:44 PM, Amit Kapila wrote:
> Okay, but I think the same can be achieved with this as well. The basic idea
> is that each worker will work on one planned statement at a time, and in
> the above case there will be two different planned statements which will
> store partial seq scan related information in two different locations in
> the toc, although the key (PARALLEL_KEY_SCAN) would be the same, and I think this
> will be quite similar to what we are already doing for response queues.
> The worker will work on one of those keys based on the planned statement
> which it chooses to execute. I have explained this in somewhat more detail
> in one of my previous mails [1].

shm_toc keys are supposed to be unique. If you added more than one with the same key, there would be no way to look up the second one. That was intentional, and I don't want to revise it.

I don't want to have multiple PlannedStmt objects in any case. That doesn't seem like the right approach. I think passing down an Append tree with multiple Partial Seq Scan children to be run in order is simple and clear, and I don't see why we would do it any other way. The master should be able to generate a plan and then copy the part of it below the Funnel and send it to the worker. But there's clearly never more than one PlannedStmt in the master, so where would the other ones come from in the worker? There's no reason to introduce that complexity.

>> Each partial sequential scan needs to have a *separate* key, which
>> will need to be stored in either the Plan or the PlanState or both
>> (not sure exactly). Each partial seq scan needs to get assigned a
>> unique key there in the master, probably starting from 0 or 100 or
>> something and counting up, and then this code needs to extract that
>> value and use it to look up the correct data for that scan. 
> In that case also, multiple workers can work on the same key, assuming
> in your above example, multiple workers will be required to execute
> each partial seq scan. In this case we might need to see how to map
> instrumentation information for a particular execution.

That was discussed on the nearby thread about numbering plan nodes.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] extend pgbench expressions with functions
> -1. double is an inexact type, whereas integer is an exact type.

Sure. I already argued along that very line.

> The typical way to handle this sort of thing is to define a struct whose
> first member is a type field and whose second field is a union of all the
> types you need to care about.

Yep.

> Then that gets passed around everywhere. This patch should be designed in
> such a way that if we eventually end up with functions that have 10
> different return types instead of 2 different return types, we don't need
> to add 8 more parameters to any functions. Instead, those still return
> PgBench_Value (or whatever we call it) which is the aforementioned struct,
> but there are more options for what that can contain.

I just put in the double type as a proof of concept, but for pgbench only integers really matter.

What you suggest would work, but it would also result in ugly and lengthy code, as I argued in another mail, because you have to decide for overloaded operators and functions which actual typed operator must be called, and then perform the necessary type conversions depending on the actual type of the operands. The implicit descendent typing used in the patch hides this, and is more than enough for pgbench, IMO.

If this is a blocker, I would rather remove the support for doubles than write verbose and inelegant code.

-- Fabien.
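The struct Robert describes, together with the operator dispatch Fabien calls verbose, can be sketched as follows. The names are invented for illustration, not taken from the patch; the dispatch is written out once, for "+", promoting int to double when the operands are mixed.

```c
#include <assert.h>

/* Type tag plus a union: adding a value type later costs one tag and one
 * union member, not extra parameters on every evaluation function. */
typedef enum
{
	PGBT_INT,
	PGBT_DOUBLE
} PgBenchValueType;

typedef struct
{
	PgBenchValueType type;
	union
	{
		long		ival;
		double		dval;
	}			u;
} PgBenchValue;

/* Overloaded "+": choose the typed operator from the operand tags. */
PgBenchValue
pgbench_add(PgBenchValue a, PgBenchValue b)
{
	PgBenchValue r;

	if (a.type == PGBT_INT && b.type == PGBT_INT)
	{
		r.type = PGBT_INT;
		r.u.ival = a.u.ival + b.u.ival;
	}
	else
	{
		/* mixed or double operands: promote to double */
		double		x = (a.type == PGBT_INT) ? (double) a.u.ival : a.u.dval;
		double		y = (b.type == PGBT_INT) ? (double) b.u.ival : b.u.dval;

		r.type = PGBT_DOUBLE;
		r.u.dval = x + y;
	}
	return r;
}
```

Each additional operator needs the same two-way (later n-way) dispatch, which is exactly the verbosity being objected to — but it stays local to the operator implementations rather than spreading through every function signature.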
Re: [HACKERS] tsvector work with citext
"David E. Wheeler" writes:
> On Sep 18, 2015, at 7:29 AM, Teodor Sigaev wrote:
>> Fixed (9acb9007de30b3daaa9efc16763c3bc6e3e0a92d), but didn't backpatch
>> because it isn't a critical bug.
> Great, thank you!
> For those on older versions, what's the simplest workaround?

FWIW, I thought this would be a reasonable thing to back-patch. It's not as though contrib/citext hasn't been around for awhile.

regards, tom lane
Re: [HACKERS] vacuumdb sentence
On Fri, Sep 18, 2015 at 12:18 AM, Euler Taveira wrote:
> Yeah, I know. [Too sleepy to be writing emails...] What I want to say is:
> when we want to refer to an option, we usually add "option" after the quoted
> name (in this case, it won't make sense). I propose to remove the quotation
> because the way the sentence is written it seems we are referring to the
> task instead of the option.

+1.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel Seq Scan
On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi wrote:
> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila wrote:
>>
>> Attached, find the rebased version of patch.
>
> Here are the performance test results:

Thanks, this is really interesting. I'm very surprised by how much kernel overhead this shows. I wonder where that's coming from. The writes to and reads from the shm_mq shouldn't need to touch the kernel at all except for page faults; that's why I chose this form of IPC. It could be that the signals which are sent for flow control are chewing up a lot of cycles, but if that's the problem, it's not very clear from here. copy_user_generic_string doesn't sound like something related to signals. And why all the kernel time in _spin_lock? Maybe perf -g would help us tease out where this kernel time is coming from.

Some of this may be due to rapid context switching. Suppose the master process is the bottleneck. Then each worker will fill up the queue and go to sleep. When the master reads a tuple, the worker has to wake up and write a tuple, and then it goes back to sleep. This might be an indication that we need a bigger shm_mq size. I think that would be worth experimenting with: if we double or quadruple or increase by 10x the queue size, what happens to performance?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
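The context-switching argument can be made concrete with a deterministic toy model — not shm_mq itself, just bookkeeping I made up: the worker writes until the queue is full and sleeps, the master drains the queue and wakes the worker, until every tuple has been transferred. Each outer-loop iteration stands in for one signal-plus-context-switch round trip.

```c
#include <assert.h>

/* Count worker wakeups needed to move `tuples` tuples through a bounded
 * queue of `queue_capacity` entries, under strict fill/drain alternation. */
int
wakeups_to_transfer(int tuples, int queue_capacity)
{
	int			queued = 0;
	int			sent = 0;
	int			wakeups = 0;

	while (sent < tuples)
	{
		while (sent < tuples && queued < queue_capacity)
		{
			queued++;			/* worker writes a tuple */
			sent++;
		}
		queued = 0;				/* master drains the queue... */
		wakeups++;				/* ...and wakes the sleeping worker */
	}
	return wakeups;
}
```

In this model the wakeup count scales inversely with queue capacity, which is the intuition behind trying a 2x, 4x or 10x larger shm_mq: the per-tuple cost stays the same, but the signalling overhead is amortized over more tuples.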
Re: [HACKERS] pageinspect patch, for showing tuple data
On 11 September 2015 at 15:12:04, you wrote:
> > Ok. Let's come to the final decision with tuple_data_parse, and I will add
> > this switch there and to the pure SQL heap_page_item_attrs
>
> Fine for me.

So I've modified the code; now we have:

heap_page_items - has a column with the raw tuple data.

tuple_data_split - takes oid, raw tuple data, infomask, infomask2 and the bits parsed as a string, and returns bytea[] with the attributes' raw values. It also has two optional arguments: do_detoast, which forces the function to detoast attributes, and warning_mode, which puts the function into warning mode so it does not stop working if some inconsistency is found.

There is also a pure SQL function heap_page_item_attrs that takes page data and a table oid, and returns the same data as heap_page_items, but with bytea[] of attributes instead of one whole piece of raw data. It also has an optional argument do_detoast that allows getting bytea[] of detoasted attribute data. I've decided that there is no real need for warning_mode in heap_page_item_attrs, so there is no such argument there.

So for now it is still an RFC. 
Final patch with documentation will come soon -- Nikolay Shaplov Postgres Professional: http://www.postgrespro.com Russian Postgres Companydiff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index aec5258..e4bc1af 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -5,7 +5,7 @@ OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \ brinfuncs.o ginfuncs.o $(WIN32RES) EXTENSION = pageinspect -DATA = pageinspect--1.3.sql pageinspect--1.2--1.3.sql \ +DATA = pageinspect--1.4.sql pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \ pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \ pageinspect--unpackaged--1.0.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c index 8d1666c..d42ca48 100644 --- a/contrib/pageinspect/heapfuncs.c +++ b/contrib/pageinspect/heapfuncs.c @@ -29,7 +29,11 @@ #include "funcapi.h" #include "utils/builtins.h" #include "miscadmin.h" +#include "utils/array.h" +#include "utils/rel.h" +#include "catalog/pg_type.h" +Datum split_tuple_data(char *tuple_data, uint16 tuple_data_len, TupleDesc tuple_desc, uint16 t_infomask, uint16 t_infomask2, bits8 *t_bits, bool do_detoast, int error_level); /* * bits_to_text @@ -122,8 +126,8 @@ heap_page_items(PG_FUNCTION_ARGS) HeapTuple resultTuple; Datum result; ItemId id; - Datum values[13]; - bool nulls[13]; + Datum values[14]; + bool nulls[14]; uint16 lp_offset; uint16 lp_flags; uint16 lp_len; @@ -155,6 +159,8 @@ heap_page_items(PG_FUNCTION_ARGS) { HeapTupleHeader tuphdr; int bits_len; + bytea *tuple_data_bytea; + int tuple_data_len; /* Extract information from the tuple header */ @@ -168,6 +174,13 @@ heap_page_items(PG_FUNCTION_ARGS) values[9] = UInt32GetDatum(tuphdr->t_infomask); values[10] = UInt8GetDatum(tuphdr->t_hoff); + /* Copy raw tuple data into bytea attribute */ + tuple_data_len = lp_len - tuphdr->t_hoff; + tuple_data_bytea = (bytea *) 
palloc(tuple_data_len + VARHDRSZ); + SET_VARSIZE(tuple_data_bytea, tuple_data_len + VARHDRSZ); + memcpy(VARDATA(tuple_data_bytea), (char *) tuphdr + tuphdr->t_hoff, tuple_data_len); + values[13] = PointerGetDatum(tuple_data_bytea); + nulls[13] = false; /* * We already checked that the item is completely within the raw * page passed to us, with the length given in the line pointer. @@ -208,7 +221,7 @@ heap_page_items(PG_FUNCTION_ARGS) */ int i; - for (i = 4; i <= 12; i++) + for (i = 4; i <= 13; i++) nulls[i] = true; } @@ -223,3 +236,210 @@ heap_page_items(PG_FUNCTION_ARGS) else SRF_RETURN_DONE(fctx); } + +PG_FUNCTION_INFO_V1(tuple_data_split); +Datum +tuple_data_split(PG_FUNCTION_ARGS) +{ + Oidrel_oid; + bytea *raw_data; + uint16 t_infomask; + uint16 t_infomask2; + text *t_bits_str; + RelationData *rel; + TupleDesc tuple_desc; + bool do_detoast = false; + interror_level = ERROR; + + bits8 *t_bits = NULL; + Datum res; + + rel_oid = PG_GETARG_OID(0); + raw_data = PG_GETARG_BYTEA_P(1); + t_infomask = PG_GETARG_INT16(2); + t_infomask2 = PG_GETARG_INT16(3); + t_bits_str = PG_ARGISNULL(4) ? NULL : PG_GETARG_TEXT_P(4); + if (PG_NARGS()>=6) do_detoast = PG_GETARG_BOOL(5); + if (PG_NARGS()>=7) error_level = PG_GETARG_BOOL(6)?WARNING:ERROR; + + if (!superuser()) + ereport(ERROR, +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + (errmsg("must be superuser to use raw page functions"; + + /* + * Here we converting t_bits string back to the bits8 array, + * as it really is in the tuple header + */ + if (t_infomask & HEAP_HASNULL) + { + int bits_str_len; + int bits_len; + char * p; + int off; + char byte = 0; + bits_len = (t_infomask2 & HEAP_NATTS_MASK) / 8 + 1; + if (! t_bits_str) + { +
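The fiddliest part of tuple_data_split, visible where the patch breaks off above, is converting the t_bits display string back into the on-disk null bitmap. The standalone sketch below is my own code mirroring what that conversion loop has to do — heap_page_items renders the bitmap as a '0'/'1' string, and the split function must rebuild a bits8 array laid out as in the tuple header, least-significant bit first within each byte:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t bits8;

/* Rebuild the null bitmap from its '0'/'1' text form; -1 on bad input. */
int
text_to_bits(const char *str, bits8 *bits, int nbytes)
{
	int			i;

	memset(bits, 0, nbytes);
	for (i = 0; str[i] != '\0'; i++)
	{
		if (i / 8 >= nbytes)
			return -1;			/* string longer than the bitmap */
		if (str[i] == '1')
			bits[i / 8] |= (bits8) (1 << (i % 8));
		else if (str[i] != '0')
			return -1;			/* only '0' and '1' are legal */
	}
	return 0;
}

/* att_isnull-style check: a clear bit means the attribute is NULL. */
int
att_is_null(int attnum, const bits8 *bits)
{
	return (bits[attnum >> 3] & (1 << (attnum & 0x07))) == 0;
}
```

Validating the string (length against the bitmap size, characters against '0'/'1') is what makes the function's warning mode possible: malformed input from a user-supplied page can be reported instead of crashing the backend.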
Re: [HACKERS] tsvector work with citext
On Sep 18, 2015, at 7:29 AM, Teodor Sigaev wrote:
>> Fixable?
>
> Fixed (9acb9007de30b3daaa9efc16763c3bc6e3e0a92d), but didn't backpatch
> because it isn't a critical bug.

Great, thank you! For those on older versions, what’s the simplest workaround?

Best,

David
Re: [HACKERS] On-demand running query plans using auto_explain and signals
On Fri, Sep 18, 2015 at 6:59 AM, Pavel Stehule wrote:
> I am afraid so it has not simple and nice solution - when data sender will
> wait for to moment when data are received, then we have same complexity like
> we use shm_mq.
>
> Isn't better to introduce new background worker with responsibility to clean
> orphaned DSM?

That won't work, or at least not easily. On Windows, the DSM is
cleaned up by the operating system as soon as nobody has it mapped.

Frankly, I think you guys are making this out to be way more
complicated than it really is. Basically, I think the process being
queried should publish a DSM via a slot it owns. The recipient is
responsible for reading it and then notifying the sender. If a second
process requests data before the first process reads its data, the
second process can either (1) become an additional reader of the
already-published data or (2) wait for the first process to finish,
and then send its own inquiry. There are some problems to solve here,
but they hardly seem impossible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] numbering plan nodes
On Fri, Sep 18, 2015 at 5:24 AM, Kouhei Kaigai wrote:
>> On Thu, Sep 17, 2015 at 9:01 PM, Kouhei Kaigai wrote:
>> > I entirely agree with the idea of plan-node identifier, however,
>> > uncertain whether the node-id shall represent physical location on
>> > the dynamic shared memory segment, because
>> > (1) Relatively smaller number of node type needs shared state,
>> >     thus most of array items are empty.
>> > (2) Extension that tries to modify plan-tree using planner_hook
>> >     may need to adjust node-id also.
>> >
>> > Even though shm_toc_lookup() has to walk on the toc entries to find
>> > out the node-id, it happens at once on beginning of the executor at
>> > background worker side. I don't think it makes a significant problem.
>>
>> Yes, I was thinking that what would make sense is to have each
>> parallel-aware node call shm_toc_insert() using its ID as the key.
>> Then, we also need Instrumentation nodes. For those, I thought we
>> could use some fixed, high-numbered key, and Tom's idea.
>>
> Hmm, indeed, run-time statistics are needed for every node.
> If an array indexed by node-id would be a hash slot, we can treat
> non-contiguous node-id with no troubles.

Yeah, we could do that, but it seems more complex than we really need.

>> Are there extensions that use planner_hook to do surgery on the plan
>> tree? What do they do, exactly?
>>
> (Even though it will not work under Funnel,) PG-Strom often inject
> a preprocessor node under Agg-node to produce partial aggregation
> to reduce number of rows to be processed by CPU.

OK. So, if you wanted to make it work under a Funnel, you'd need to
stick a node ID on it that was higher than any assigned to any plan
node thus far. Doesn't seem like a big deal.

> Also, I have seen a paper published by Fujitsu folks. Their module
> modifies plan-tree to replace built-in scan node with their own
> columnar storage scan node.
> http://db-event.jpn.org/deim2015/paper/195.pdf
> This paper is written in Japanese, however, figure-3 in page.4 shows
> what I explain above.

If you replaced a node, you'd just copy the ID from the existing node.
That seems simple enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] extend pgbench expressions with functions
On Thu, Sep 17, 2015 at 10:58 PM, Kyotaro HORIGUCHI wrote:
> By the way, the complexity comes from separating integer and
> double. If there is no serious reason to separate them, handling
> all values as double makes things far simpler.

-1. double is an inexact type, whereas integer is an exact type.

The typical way to handle this sort of thing is to define a struct
whose first member is a type field and whose second field is a union
of all the types you need to care about. Then that gets passed around
everywhere. This patch should be designed in such a way that if we
eventually end up with functions that have 10 different return types
instead of 2 different return types, we don't need to add 8 more
parameters to any functions. Instead, those still return PgBench_Value
(or whatever we call it), which is the aforementioned struct, but
there are more options for what that can contain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Sep 18, 2015 at 4:08 AM, Vladimir Borodin wrote:
> For both scenarios on linux we got approximately the same results - version
> with timings was faster than version with sampling (sampling was done every
> 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps and version
> with timings gave ~14500 tps while version with sampling gave ~13800 tps. In
> all cases processor was 100% utilized. Comparing vanilla PostgreSQL and
> version with timings on constant workload (12000 tps) gave the following
> results in latencies for queries:

If the timing is speeding things up, that's most likely a sign that
the spinlock contention on that workload is so severe that you are
spending a lot of time in s_lock. Adding more things for the system to
do that don't require that lock will speed the system up by reducing
the contention. Instead of inserting gettimeofday() calls, you could
insert a for loop that counts to some large number without doing any
useful work, and that would likely have a similar effect.

In any case, I think your experiment clearly proves that the presence
or absence of this instrumentation *is* performance-relevant and that
we *do* need to worry about what it costs. If the system gets 20%
faster when you call gettimeofday() a lot, does that mean we should
insert gettimeofday() calls all over the system in random places to
speed it up?

I do agree that if we're going to include support for timings, having
them be controlled by a GUC is a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel Seq Scan
On Fri, Sep 18, 2015 at 6:55 AM, Amit Kapila wrote:
>> +    currentRelation = ExecOpenScanRelation(estate,
>> +                                           ((SeqScan *) node->ss.ps.plan)->scanrelid,
>> +                                           eflags);
>>
>> I can't see how this can possibly be remotely correct.  The funnel
>> node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).
>>
> This is mainly used for generating tuple descriptor and that tuple
> descriptor will be used for forming scanslot, funnel node itself won't
> do any scan. However, we can completely eliminate this InitFunnel()
> function and use ExecAssignProjectionInfo() instead of
> ExecAssignScanProjectionInfo() to form the projection info.

That sounds like a promising approach.

>> +    buffer_usage_worker = (BufferUsage *) (buffer_usage + (i * sizeof(BufferUsage)));
>>
>> Cast it to a BufferUsage * first.  Then you can use [i] to find
>> the i'th element.
>
> Do you mean to say that the way code is written won't work?
> Values of structure BufferUsage for each worker is copied into string
> buffer_usage which I believe could be fetched in above way.

I'm just complaining about the style.  If bar is a char *, then these
are all equivalent:

foo = (Quux *) (bar + (i * sizeof(Quux)));
foo = ((Quux *) bar) + i;
foo = &((Quux *) bar)[i];
baz = (Quux *) bar; foo = &baz[i];

>> +    /*
>> +     * Re-initialize the parallel context and workers to perform
>> +     * rescan of relation.  We want to gracefully shutdown all the
>> +     * workers so that they should be able to propagate any error
>> +     * or other information to master backend before dying.
>> +     */
>> +    FinishParallelSetupAndAccumStats(node);
>>
>> Somehow, this makes me feel like that function is badly named.
>
> I think here comment seems to be slightly misleading, shall we
> change the comment as below:
>
> Destroy the parallel context to gracefully shutdown all the
> workers so that they should be able to propagate any error
> or other information to master backend before dying.

Well, why does a function that destroys the parallel context have a
name that starts with FinishParallelSetup?  It sounds like it is
tearing things down, not finishing setup.

>> +#parallel_setup_cost = 0.0  # same scale as above
>> +#define DEFAULT_PARALLEL_SETUP_COST  0.0
>>
>> This value is probably a bit on the low side.
>
> How about keeping it as 10.0?

Really?  I would have guessed that the correct value was in the tens
of thousands.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Use pg_rewind when target timeline was switched
On Fri, Sep 18, 2015 at 9:00 AM, Alexander Korotkov wrote:
> BTW, it would be an option to generate system_identifier to each new
> timeline, by analogy of initdb do for the whole WAL.
> Having such system_identifiers we can distinguish different timeline which
> have assigned same ids.
> Any thoughts?

If you mean a new field incorporated in XLogLongPageHeader and
ControlFile to ensure that a new timeline generated is unique across
the same installation identified with system_identifier, then I'm not
against it for something up to 2 bytes (even 1 byte would be fine),
though it seems that there is a very narrow need for it. Do you have
cases in mind that could use it?

Even in the case of pg_rewind, it seems to me we would not need this
extra timeline-based ID. Per your case above, it is possible to rewind
node 4 on timeline 3 using node 2 on timeline 2 when both timelines
have forked exactly at the same point from timeline 1, by having node
4 beginning to replay from timeline 1 (last checkpoint record before
WAL forked), and if I am reading your patch correctly that's what you
do.

-- 
Michael