Re: [HACKERS] Add generate_series(numeric, numeric)
2014-12-18 19:35 GMT+07:00 Fujii Masao masao.fu...@gmail.com: On Mon, Dec 15, 2014 at 2:38 PM, Andrew Gierth and...@tao11.riddles.org.uk wrote: I was thinking something like this, added just after that para: <warning> <para> While the actual arguments to the function remain unchanged between calls, if you detoast the argument values (which is normally done transparently by the <function>PG_GETARG_<replaceable>xxx</replaceable></function> macro) in the transient context then the detoasted copies will be freed on each cycle. Accordingly, if you keep references to such values in your <structfield>user_fctx</>, you must either copy them into the <structfield>multi_call_memory_ctx</> after detoasting, or ensure that you detoast the values only in that context. </para> </warning> I'm OK with this. Wrapping the doc changes in a patch. Will add to next commitfest so it won't be lost. -- Ali Akbar *** a/doc/src/sgml/xfunc.sgml --- b/doc/src/sgml/xfunc.sgml *** *** 2986,2991 SRF_RETURN_DONE(funcctx) --- 2986,3005 <structfield>multi_call_memory_ctx</> while doing the first-call setup. </para> + <warning> + <para> + While the actual arguments to the function remain unchanged between + calls, if you detoast the argument values (which is normally done + transparently by the + <function>PG_GETARG_<replaceable>xxx</replaceable></function> macro) + in the transient context then the detoasted copies will be freed on + each cycle. Accordingly, if you keep references to such values in + your <structfield>user_fctx</>, you must either copy them into the + <structfield>multi_call_memory_ctx</> after detoasting, or ensure + that you detoast the values only in that context. + </para> + </warning> + <para> A complete pseudo-code example looks like the following: <programlisting> -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Thu, Jan 8, 2015 at 1:00 PM, Kevin Grittner kgri...@ymail.com wrote: Clearly, all the nodes other than the local one need to use 2PC. I am unconvinced that the local node must write a 2PC state file only to turn around and remove it again almost immediately thereafter. The key point is that the distributed transaction data must be flagged as needing to commit rather than roll back between the prepare phase and the final commit. If you try to avoid the PREPARE, flagging, COMMIT PREPARED sequence by building the flagging of the distributed transaction metadata into the COMMIT process, you still have the problem of what to do on crash recovery. You really need to use 2PC to keep that clean, I think. I don't really follow this. You need to write a list of the transactions that you're going to prepare to stable storage before preparing any of them. And then you need to write something to stable storage when you've definitively determined that you're going to commit. But we have no current mechanism for the first thing (so reusing 2PC doesn't help) and we already have the second thing (it's the commit record itself). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Minor configure tweak to simplify adjusting gcc warnings
Would anyone object to modifying configure.in like this: if test $GCC = yes -a $ICC = no; then - CFLAGS=$CFLAGS -Wall -Wmissing-prototypes -Wpointer-arith + CFLAGS=-Wall $CFLAGS -Wmissing-prototypes -Wpointer-arith # These work in some but not all gcc versions PGAC_PROG_CC_CFLAGS_OPT([-Wdeclaration-after-statement]) The reason I got interested in this is that I attempted to pass in CFLAGS=-Wno-format to configure, to suppress format warnings on buildfarm member gaur (whose gcc is too old to recognize z modifiers). That doesn't work because -Wall turns the warnings right back on again. If the user-supplied CFLAGS were inserted after -Wall then it would work. A slightly more complicated change could be applied to make sure that *all* of the CFLAGS forcibly inserted by configure appear before any externally-sourced CFLAGS, allowing any of them to be overridden from the environment variable. I'm not sure if it's worth the trouble to do that, but if there's interest I could make it happen. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Tue, Jan 13, 2015 at 4:55 PM, John Gorman johngorm...@gmail.com wrote: On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost sfr...@snowman.net wrote: So, for my 2c, I've long expected us to parallelize at the relation-file level for these kinds of operations. This goes back to my other thoughts on how we should be thinking about parallelizing inbound data for bulk data loads, but it seems appropriate to consider it here also. One of the issues there is that 1G still feels like an awful lot for a minimum work size for each worker, and it would mean we don't parallelize for relations less than that size. Yes, I think that's a killer objection. One approach that has worked well for me is to break big jobs into much smaller bite-size tasks. Each task is small enough to complete quickly. Here we have to decide what the strategy should be and how much each worker should scan. As an example, one strategy could be: if the table size is X MB and there are 8 workers, then divide the work as X/8 MB for each worker (which I have currently used in the patch). Another could be that each worker scans one block at a time and then checks some global structure to see which block it needs to scan next; according to me, this could lead to random scan. I have read that some other databases also divide the work based on partitions or segments (the size of a segment is not very clear). We add the tasks to a task queue and spawn a generic worker pool which eats through the task queue items. This solves a lot of problems. - Small to medium jobs can be parallelized efficiently. - No need to split big jobs perfectly. - We don't get into a situation where we are waiting around for a worker to finish chugging through a huge task while the other workers sit idle. - Worker memory footprint is tiny so we can afford many of them. - Worker pool management is a well known problem. - Worker spawn time disappears as a cost factor. 
- The worker pool becomes a shared resource that can be managed and reported on and becomes considerably more predictable. Yeah, it is good idea to maintain shared worker pool, but it seems to me that for initial version even if the workers are not shared, then also it is meaningful to make parallel sequential scan work. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
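The two division strategies Amit contrasts — carving the table into one X/8 range per worker up front, versus workers claiming the next block from a shared counter — can be sketched like this (a toy model for illustration only, not the patch code; all names are made up):

```python
import itertools

def fixed_ranges(total_blocks, n_workers):
    """Strategy 1: split the relation into one contiguous block range
    per worker, decided up front (the X/8 scheme from the patch)."""
    per_worker = total_blocks // n_workers
    ranges = []
    start = 0
    for i in range(n_workers):
        # The last worker absorbs any remainder blocks.
        end = total_blocks if i == n_workers - 1 else start + per_worker
        ranges.append((start, end))
        start = end
    return ranges

def next_block_dispenser():
    """Strategy 2: a shared counter; each worker asks for the next block.
    Load balances perfectly, but each worker's reads are no longer
    sequential -- the 'random scan' concern raised above."""
    counter = itertools.count()
    return lambda: next(counter)
```

The tradeoff is visible even in the toy: strategy 1 gives each worker sequential I/O but can leave workers idle if their ranges differ in cost; strategy 2 never leaves a worker idle but sacrifices sequential access.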
Re: [HACKERS] Parallel Seq Scan
On Wed, Jan 14, 2015 at 9:12 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 13, 2015 at 4:55 PM, John Gorman johngorm...@gmail.com wrote: On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost sfr...@snowman.net wrote: So, for my 2c, I've long expected us to parallelize at the relation-file level for these kinds of operations. This goes back to my other thoughts on how we should be thinking about parallelizing inbound data for bulk data loads but it seems appropriate to consider it here also. One of the issues there is that 1G still feels like an awful lot for a minimum work size for each worker and it would mean we don't parallelize for relations less than that size. Yes, I think that's a killer objection. One approach that I has worked well for me is to break big jobs into much smaller bite size tasks. Each task is small enough to complete quickly. Here we have to decide what should be the strategy and how much each worker should scan. As an example one of the the strategy could be if the table size is X MB and there are 8 workers, then divide the work as X/8 MB for each worker (which I have currently used in patch) and another could be each worker does scan 1 block at a time and then check some global structure to see which next block it needs to scan, according to me this could lead to random scan. I have read that some other databases also divide the work based on partitions or segments (size of segment is not very clear). A block can contain useful tuples, i.e tuples which are visible and fulfil the quals + useless tuples i.e. tuples which are dead, invisible or that do not fulfil the quals. Depending upon the contents of these blocks, esp. the ratio of (useful tuples)/(unuseful tuples), even though we divide the relation into equal sized runs, each worker may take different time. 
So, instead of dividing the relation into number of run = number of workers, it might be better to divide them into fixed sized runs with size (total number of blocks/ number of workers), and let a worker pick up a run after it finishes with the previous one. The smaller the size of runs the better load balancing but higher cost of starting with the run. So, we have to strike a balance. We add the tasks to a task queue and spawn a generic worker pool which eats through the task queue items. This solves a lot of problems. - Small to medium jobs can be parallelized efficiently. - No need to split big jobs perfectly. - We don't get into a situation where we are waiting around for a worker to finish chugging through a huge task while the other workers sit idle. - Worker memory footprint is tiny so we can afford many of them. - Worker pool management is a well known problem. - Worker spawn time disappears as a cost factor. - The worker pool becomes a shared resource that can be managed and reported on and becomes considerably more predictable. Yeah, it is good idea to maintain shared worker pool, but it seems to me that for initial version even if the workers are not shared, then also it is meaningful to make parallel sequential scan work. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
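The fixed-size-run idea above — many runs smaller than (blocks / workers), with each worker pulling the next run when it finishes its current one — is essentially self-scheduling. A minimal single-process sketch (illustrative only; names invented, no shared memory or real workers):

```python
from queue import Queue, Empty

def make_run_queue(total_blocks, run_size):
    """Split the relation into fixed-size runs of blocks. Smaller runs
    balance load better but pay the per-run startup cost more often."""
    runs = Queue()
    for start in range(0, total_blocks, run_size):
        runs.put((start, min(start + run_size, total_blocks)))
    return runs

def worker(runs, scan_run):
    """A worker keeps claiming runs until the queue is drained, so a
    worker that hits cheap runs simply does more of them."""
    done = []
    while True:
        try:
            run = runs.get_nowait()
        except Empty:
            break
        done.append(scan_run(run))
    return done
```

With real parallel workers, several `worker` calls would drain the same queue concurrently; the queue is what strikes the balance Ashutosh describes between load balancing and per-run overhead.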
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Sun, Jan 11, 2015 at 3:36 AM, Michael Paquier michael.paqu...@gmail.com wrote: My understanding is that once you get a successful PREPARE that should mean that it's basically impossible for the transaction to fail to commit. If that's not the case, I fail to see how you can get any decent level of sanity out of this... When giving the responsability of a group of COMMIT PREPARED to a set of nodes in a network, there could be a couple of problems showing up, of the type split-brain for example. I think this is just confusing the issue. When a machine reports that a transaction is successfully prepared, any future COMMIT PREPARED operation *must* succeed. If it doesn't, the machine has broken its promises, and that's not OK. Period. It doesn't matter whether that's due to split-brain or sunspots or Oscar Wilde having bad breath. If you say that it's prepared, then you're not allowed to change your mind later and say that it can't be committed. If you do, then you have a broken 2PC implementation and, as Jim says, all bets are off. Now of course nothing is certain in life except death and taxes. If you PREPARE a transaction, and then go into the data directory and corrupt the 2PC state file using dd, and then try to commit it, it might fail. But no system can survive that sort of thing, whether 2PC is involved or not; in such extraordinary situations, of course operator intervention will be required. But in a more normal situation where you just have a failover, if the failover causes your prepared transaction to come unprepared, that means your failover mechanism is broken. If you're using synchronous replication, this shouldn't happen. There could be as well failures at hardware-level, so you would need a mechanism ensuring that WAL is consistent among all the nodes, with for example the addition of a common restore point on all the nodes once PREPARE is successfully done with for example XLOG_RESTORE_POINT. 
That's a reason why I think that the local Coordinator should use 2PC as well, to ensure a consistency point once all the remote nodes have successfully PREPAREd, and a reason why things can get complicated for either the DBA or the upper application in charge of ensuring the DB consistency even in case of critical failures. It's up to the DBA to decide whether they care about surviving complete loss of a node while having 2PC still work. If they do, they should use sync rep, and they should be fine -- the machine on which the transaction is prepared shouldn't acknowledge the PREPARE as having succeeded until the WAL is safely on disk on the standby. Most probably don't, though; that's a big performance penalty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
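The invariant Robert insists on — once a node acknowledges PREPARE it must be able to COMMIT PREPARED, and the commit decision must be durable before any node commits — can be sketched as follows (a simplified illustration, not PostgreSQL code; `FakeNode` and `decision_log` are invented stand-ins for foreign-server connections and the local commit record):

```python
class FakeNode:
    """Stand-in for a foreign server connection (illustrative only)."""
    def __init__(self, name, fail_on_prepare=False):
        self.name, self.fail_on_prepare, self.log = name, fail_on_prepare, []
    def prepare(self, gid):
        if self.fail_on_prepare:
            raise RuntimeError(f"{self.name}: PREPARE failed")
        self.log.append(("prepare", gid))
    def commit_prepared(self, gid):
        self.log.append(("commit_prepared", gid))
    def rollback_prepared(self, gid):
        self.log.append(("rollback_prepared", gid))

def distributed_commit(nodes, gid, decision_log):
    """PREPARE everywhere, durably flag the decision, then COMMIT PREPARED.
    After a node acknowledges prepare() it has promised it can commit; once
    the decision is logged, the only legal outcome is commit everywhere."""
    prepared = []
    try:
        for node in nodes:
            node.prepare(gid)          # node now guarantees commit is possible
            prepared.append(node)
    except Exception:
        for node in prepared:          # no decision logged: roll back instead
            node.rollback_prepared(gid)
        raise
    decision_log.append(("commit", gid))   # durable "going to commit" flag
    for node in prepared:
        node.commit_prepared(gid)      # after a crash, retried from the log
```

On crash recovery, a gid present in `decision_log` means every node must eventually see COMMIT PREPARED; a gid absent from it means every prepared copy is rolled back — which is exactly the flagging-between-phases point Kevin makes above.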
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
On 01/13/2015 07:24 PM, Tom Lane wrote: It is, but FDWs are not at risk here: they merely reference ExplainStates that were allocated by core backend code. So as long as we add the new field at the end it's not a problem for them. Problematic usage would be like what auto_explain does: ExplainState es; ExplainInitState(es); ... In hindsight, that's a bad API and we should change it to something like ExplainState *es = NewExplainState(); so that the sizeof the struct isn't embedded in extension code. But we definitely can't do that in back branches. Actually, it would make sense to do exactly that, to break any extensions that are doing the unsafe thing in an obvious way. The downside would be that an extension using the new API would then not work on an old server. We could repurpose one of the existing fields in ExplainState to point to another struct that contains more fields. Something like this: *** a/src/include/commands/explain.h --- b/src/include/commands/explain.h *** *** 40,48 List *rtable; /* range table */ List *rtable_names; /* alias names for RTEs */ int indent; /* current indentation level */ ! List *grouping_stack; /* format-specific grouping state */ } ExplainState; /* Hook for plugins to get control in ExplainOneQuery() */ typedef void (*ExplainOneQuery_hook_type) (Query *query, IntoClause *into, --- 40,55 List *rtable; /* range table */ List *rtable_names; /* alias names for RTEs */ int indent; /* current indentation level */ ! ExplainStateExtraFields *extra; /* more fields */ } ExplainState; + typedef struct ExplainStateExtraFields + { + List *grouping_stack; /* format-specific grouping state */ + ... + } + That's pretty ugly, but it would work even if there are ExplainState structs embeded in extensions. As long as they don't try to look at the grouping_stack field; I think that's fairly safe assumption. But do we really need to backpatch any of this? 
- Heikki
Re: [HACKERS] [PATCH] explain sortorder
On 01/13/2015 06:04 PM, Timmer, Marius wrote: -malloc() (StringInfo is used as suggested now). There really shouldn't be any snprintf() calls in the patch, when StringInfo is used correctly... @@ -1187,6 +1187,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id; Sort Output: matest0.id, matest0.name, ((1 - matest0.id)) Sort Key: ((1 - matest0.id)) + Sort Order: ASC NULLS LAST - Result Output: matest0.id, matest0.name, (1 - matest0.id) - Append This patch isn't going to be committed with this output format. Please change per my suggestion earlier: I don't like this output. If there are a lot of sort keys, it gets difficult to match the right ASC/DESC element to the sort key it applies to. (Also, there seems to be double-spaces in the list) I would suggest just adding the information to the Sort Key line. As long as you don't print the modifiers when they are defaults (ASC and NULLS LAST), we could print the information even in non-VERBOSE mode. So it would look something like: Sort Key: sortordertest.n1 NULLS FIRST, sortordertest.n2 DESC Or if you don't agree with that, explain why. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
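Heikki's suggested output — modifiers folded into the Sort Key line, printed only when non-default — corresponds to formatting logic like this (an illustrative Python sketch of the rule, not the patch; PostgreSQL's defaults are ASC, with NULLS LAST for ASC and NULLS FIRST for DESC):

```python
def format_sort_key(expr, desc=False, nulls_first=None):
    """Render one sort-key item, omitting ASC/DESC and NULLS FIRST/LAST
    when they match the defaults, so the common case stays uncluttered."""
    parts = [expr]
    if desc:
        parts.append("DESC")
    default_nulls_first = desc  # DESC defaults to NULLS FIRST
    if nulls_first is not None and nulls_first != default_nulls_first:
        parts.append("NULLS FIRST" if nulls_first else "NULLS LAST")
    return " ".join(parts)

def format_sort_keys(keys):
    """Produce the single Sort Key line Heikki proposes."""
    return "Sort Key: " + ", ".join(format_sort_key(*k) for k in keys)
```

Feeding in Heikki's example — n1 ascending with NULLS FIRST, n2 descending — yields exactly the line he shows: `Sort Key: sortordertest.n1 NULLS FIRST, sortordertest.n2 DESC`.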
Re: [HACKERS] Check that streaming replica received all data after master shutdown
On 01/13/2015 12:11 PM, Vladimir Borodin wrote: On 05 Jan 2015, at 18:15, Vladimir Borodin r...@simply.name wrote: Hi all. I have a simple script for planned switchover of a PostgreSQL (9.3 and 9.4) master to one of its replicas. This script checks a lot of things before doing it, and one of them is that all data from the master has been received by the replica that is going to be promoted. Right now the check is done like below: On the master: postgres@pgtest03d ~ $ psql -t -A -c 'select pg_current_xlog_location();' 0/3390 postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_ctl stop -m fast waiting for server to shut down done server stopped postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_controldata | head pg_control version number: 937 Catalog version number: 201306121 Database system identifier: 6061800518091528182 Database cluster state: shut down pg_control last modified: Mon 05 Jan 2015 06:47:57 PM MSK Latest checkpoint location: 0/3428 Prior checkpoint location: 0/3328 Latest checkpoint's REDO location: 0/3428 Latest checkpoint's REDO WAL file: 001B0034 Latest checkpoint's TimeLineID: 27 postgres@pgtest03d ~ $ On the replica (after shutdown of master): postgres@pgtest03g ~ $ psql -t -A -c "select pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/3428');" 104 postgres@pgtest03g ~ $ These 104 bytes seem to be the size of the shutdown checkpoint record (as I can understand from pg_xlogdump output). 
postgres@pgtest03g ~/9.3/data/pg_xlog $ /usr/pgsql-9.3/bin/pg_xlogdump -s 0/3390 -t 27 rmgr: XLOG len (rec/tot): 0/32, tx: 0, lsn: 0/3390, prev 0/3328, bkp: , desc: xlog switch rmgr: XLOG len (rec/tot): 72/104, tx: 0, lsn: 0/3428, prev 0/3390, bkp: , desc: checkpoint: redo 0/3428; tli 27; prev tli 27; fpw true; xid 0/6010; oid 54128; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown pg_xlogdump: FATAL: error in WAL record at 0/3428: record with zero length at 0/3490 postgres@pgtest03g ~/9.3/data/pg_xlog $ I'm not sure that these 104 bytes will always be 104 bytes, to have a strict equality while checking. Could it change in the future? Or is there a better way to understand that a streaming replica received all data after master shutdown? The check that pg_xlog_location_diff returns 104 bytes seems a bit strange. Don't rely on it being 104 bytes. It can vary across versions, and across different architectures. You could simply check that the standby's pg_last_xlog_replay_location() >= the master's "Latest checkpoint location", and not care about the exact difference. - Heikki
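The arithmetic behind Heikki's advice is straightforward once the `X/Y` LSN notation is decoded: an LSN is a 64-bit byte position written as two hex halves, and pg_xlog_location_diff() is just subtraction of those positions. A small sketch of the >= check (illustrative helper names, not a PostgreSQL API):

```python
def parse_lsn(lsn):
    """Decode a PostgreSQL LSN of the form 'X/Y' (two hex halves)
    into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replica_caught_up(replay_lsn, checkpoint_lsn):
    """The version-independent check: the standby's replay location must
    be at or past the master's latest checkpoint location, with no
    assumption about the exact size of the shutdown checkpoint record."""
    return parse_lsn(replay_lsn) >= parse_lsn(checkpoint_lsn)
```

The numbers in the session above check out: the replay location 0/3490 minus the checkpoint location 0/3428 is 0x68 = 104 bytes, the shutdown checkpoint record — which is exactly the quantity the >= comparison avoids hard-coding.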
Re: [HACKERS] [PATCH] explain sortorder
Hi, we removed: - malloc() (StringInfo is used as suggested now) - leftover commented-out code - the splitting of existing declaration and initialization in show_group_keys(). Missing tests and documentation are WIP and will follow with the next patch version. Best regards Marius --- Marius Timmer Zentrum für Informationsverarbeitung Westfälische Wilhelms-Universität Münster Einsteinstraße 60 mtimm...@uni-muenster.de Attachment: explain_sortorder-v7.patch
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: A difficulty with either your patch or my idea is that they require adding another field to ExplainState, which is an ABI break for any third-party code that might be declaring variables of that struct type. That's fine for HEAD but would be risky to back-patch. Any thoughts about whether we can get away with that (ie, anybody have an idea if there are third-party extensions that call explain.c)? codesearch.debian.net shows a couple of hits for ExplainState in multicorn (an extension for FDW from Python data sources); I didn't look but it seems that the FDW API is using that struct. It is, but FDWs are not at risk here: they merely reference ExplainStates that were allocated by core backend code. So as long as we add the new field at the end it's not a problem for them. Problematic usage would be like what auto_explain does: ExplainState es; ExplainInitState(es); ... In hindsight, that's a bad API and we should change it to something like ExplainState *es = NewExplainState(); so that the sizeof the struct isn't embedded in extension code. But we definitely can't do that in back branches. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
Tom Lane wrote: A difficulty with either your patch or my idea is that they require adding another field to ExplainState, which is an ABI break for any third-party code that might be declaring variables of that struct type. That's fine for HEAD but would be risky to back-patch. Any thoughts about whether we can get away with that (ie, anybody have an idea if there are third-party extensions that call explain.c)? codesearch.debian.net shows a couple of hits for ExplainState in multicorn (an extension for FDW from Python data sources); I didn't look but it seems that the FDW API is using that struct. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
[HACKERS] EXEC_BACKEND + logging_collector=on is broken
Hi, Currently the combination from $subject fails for me with "could not read from backend variables file". The origin of that problem seems to be b94ce6e80, which moved RemovePgTempFiles() to after SysLogger_Start(). Unless the syslogger starts up very quickly, RemovePgTempFiles() will have deleted the server variables file. I think moving RemovePgTempFiles() to just above SysLogger_Start() should preserve the faster reporting for pg_ctl intended by b94ce6e80 and fix the startup issue? I'm rather surprised that this hasn't caused more problems. Apparently Windows users don't use the logging_collector? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
Heikki Linnakangas hlinnakan...@vmware.com writes: On 01/13/2015 07:24 PM, Tom Lane wrote: In hindsight, that's a bad API and we should change it to something like ExplainState *es = NewExplainState(); so that the sizeof the struct isn't embedded in extension code. But we definitely can't do that in back branches. Actually, it would make sense to do exactly that, to break any extensions that are doing the unsafe thing in an obvious way. The downside would be that an extension using the new API would then not work on an old server. I guess that's a possibility ... We could repurpose one of the existing fields in ExplainState to point to another struct that contains more fields. Something like this: ... That's pretty ugly, but it would work even if there are ExplainState structs embedded in extensions. As long as they don't try to look at the grouping_stack field; I think that's a fairly safe assumption. Yeah, I was thinking the same thing, but it's *mighty* ugly and would also create a back-patch hazard, since presumably we'd not do that in HEAD. But do we really need to backpatch any of this? Alexey's example consumes only a couple hundred MB in 9.2, vs about 7GB peak in 9.3 and up. That seems like a pretty nasty regression. regards, tom lane
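The shape of the fix discussed in this thread (see the patch later in the digest) is to build the deparse context once per EXPLAIN and hang it off ExplainState, instead of rebuilding it for every plan node. The structural change can be illustrated with a toy sketch (invented names; not the committed C code):

```python
def explain_plan(nodes, build_context, deparse):
    """Hoist the expensive per-plan setup out of the per-node loop:
    the deparse context is built once and shared by every node, rather
    than reconstructed (and its memory retained) once per node."""
    context = build_context()                 # once per EXPLAIN
    return [deparse(node, context) for node in nodes]
```

With thousands of plan nodes, the difference between one context build and one per node is exactly the difference between the ~300 MB and ~7 GB figures Tom measured.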
Re: [HACKERS] WITH CHECK and Column-Level Privileges
On Mon, Jan 12, 2015 at 05:16:40PM -0500, Stephen Frost wrote: Alright, here's an updated patch which doesn't return any detail if no values are visible or if only a partial key is visible. I browsed this patch. There's been no mention of foreign key constraints, but ri_ReportViolation() deserves similar hardening. If a user has only DELETE privilege on a PK table, FK violation messages should not leak the PK values. Modifications to the foreign side are less concerning, since the user will often know the attempted value; still, I would lock down both sides. Please add a comment explaining the safety of each row-emitting message you haven't changed. For example, the one in refresh_by_match_merge() is safe because getting there requires MV ownership. --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -388,18 +388,30 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, { Datum values[INDEX_MAX_KEYS]; bool isnull[INDEX_MAX_KEYS]; + char *key_desc; index_deform_tuple(itup, RelationGetDescr(rel), values, isnull); - ereport(ERROR, - (errcode(ERRCODE_UNIQUE_VIOLATION), - errmsg(duplicate key value violates unique constraint \%s\, - RelationGetRelationName(rel)), - errdetail(Key %s already exists., - BuildIndexValueDescription(rel, - values, isnull)), - errtableconstraint(heapRel, - RelationGetRelationName(rel; + + key_desc = BuildIndexValueDescription(rel, values, + isnull); + + if (!strlen(key_desc)) + ereport(ERROR, + (errcode(ERRCODE_UNIQUE_VIOLATION), + errmsg(duplicate key value violates unique constraint \%s\, + RelationGetRelationName(rel)), + errtableconstraint(heapRel, + RelationGetRelationName(rel; + else + ereport(ERROR, + (errcode(ERRCODE_UNIQUE_VIOLATION), + errmsg(duplicate key value violates unique constraint \%s\, + RelationGetRelationName(rel)), + errdetail(Key %s already exists., + key_desc), + errtableconstraint(heapRel, + RelationGetRelationName(rel; Instead of duplicating an entire ereport() to 
change whether you include an errdetail, use condition ? errdetail(...) : 0.
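Noah's suggestion — one ereport() call with a conditional errdetail rather than two duplicated calls — corresponds to logic like this (a Python stand-in for the C pattern `key_desc ? errdetail(...) : 0`; the message text mirrors the patch, the function name is invented):

```python
def unique_violation(constraint, key_desc):
    """Build the error text, including key values only when the caller is
    allowed to see them: key_desc is empty when no columns are visible."""
    lines = [f'ERROR:  duplicate key value violates unique constraint "{constraint}"']
    if key_desc:  # the C equivalent: key_desc[0] != '\0' ? errdetail(...) : 0
        lines.append(f"DETAIL:  Key {key_desc} already exists.")
    return "\n".join(lines)
```

The security property being reviewed is visible in the branch: a user without SELECT privilege on the key columns gets the constraint name but never the conflicting values.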
[HACKERS] Typo fix in alter_table.sgml
Hi all, I noticed that SET STATISTICS was not in a <literal> block in alter_table.sgml: <para> - SET STATISTICS acquires a <literal>SHARE UPDATE EXCLUSIVE</literal> lock. + <literal>SET STATISTICS</literal> acquires a + <literal>SHARE UPDATE EXCLUSIVE</literal> lock. </para> That's a small detail, still.. Patch is attached. Regards, -- Michael diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index b5ef09e..b3a4970 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -190,7 +190,8 @@ ALTER TABLE ALL IN TABLESPACE <replaceable class="PARAMETER">name</replaceable> <xref linkend="planner-stats">. </para> <para> - SET STATISTICS acquires a <literal>SHARE UPDATE EXCLUSIVE</literal> lock. + <literal>SET STATISTICS</literal> acquires a + <literal>SHARE UPDATE EXCLUSIVE</literal> lock. </para> </listitem> </varlistentry>
Re: [HACKERS] EXEC_BACKEND + logging_collector=on is broken
On Tue, Jan 13, 2015 at 10:23 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Currently the combination from $subject fails for me with could not read from backend variables file The origin for that problem seems to be b94ce6e80 which moved RemovePgTempFiles() to after SysLogger_Start(). Unless the syslogger starts up very quickly RemovePgTempFiles() will have deleted the server variables file. I think moving the RemovePgTempFiles() to just above SysLogger_Start() should preserve the faster reporting for pg_ctl intended by b94ce6e80 and fix the startup issue? I'm rather surprised that this hasn't caused more problems. Apparently windows users don't use the logging_collector? I haven't looked at the actual code. But AFAIK, logging_collector=on and storing it in pg_log is the default log behaviour for pg on Windows installed by the edb installers afaik. Always has been. Surely we didn't break logging on Windows completely back in 9.2?! Any chance it's working on Windows, just not EXEC_BACKEND on Unix? (Dave or someone else from the EDB team can probably confirm that this is still the default in the installers?) EXEC_BACKEND on Windows doesn't actually use a tempfile though, so I'm guessing that's it. It does an anonymous memory mapping (see top of internal_forkexec) out of the swapfile and passes a handle down. And only closes it's own handle once it's been inherited into the subprocess. I'm pretty sure that means it's not actually broken on Windows. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] EXEC_BACKEND + logging_collector=on is broken
Andres Freund and...@2ndquadrant.com writes: On 2015-01-13 11:10:06 -0800, Magnus Hagander wrote: EXEC_BACKEND on Windows doesn't actually use a tempfile though, so I'm guessing that's it. Ah! Then this really is fairly harmless. Will fix and backpatch anyway, but the number of affected people should be pretty much zero. Agreed. We want this case to work for our own testing purposes, but it's unlikely to be hurting anybody in the field. regards, tom lane
Re: [HACKERS] EXEC_BACKEND + logging_collector=on is broken
On 2015-01-13 11:10:06 -0800, Magnus Hagander wrote: EXEC_BACKEND on Windows doesn't actually use a tempfile though, so I'm guessing that's it. Ah! Then this really is fairly harmless. Will fix and backpatch anyway, but the number of affected people should be pretty much zero. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 3:54 PM, Andres Freund and...@2ndquadrant.com wrote: I don't remember seeing _bt_moveright() or _bt_compare() figuring so prominently, where _bt_binsrch() is nowhere to be seen. I can't see a reference to _bt_binsrch() in either profile. Well, we do a _bt_moveright pretty early on, so that actually might be cache misses we're primarily seeing. Still, I see zero references to _bt_binsrch(). Not one. Even if there was only one non-meta page (i.e. pre-first-root-split), and regardless of whether this was a read or a write to the B-Tree, there'd still be some call there as the B-Tree was initially scanned (before _bt_next() calls, of which I also see none). -- Peter Geoghegan
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Some more information on what's happening: This is a ghetto logical replication engine that migrates data from SQL Server to postgres, consolidating a sharded database into a single set of tables (of which there are only two). There is only one index on the destination table, and it's composite int,int in both cases. Does the logical replication engine perform dynamic DDL at all? Does it dynamically add columns to the table that everything is being consolidated to? -- Peter Geoghegan
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 4:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: I'm not convinced that Peter is barking up the right tree. I'm noticing that the profiles seem rather skewed towards parser/planner work; so I suspect the contention is probably on access to system catalogs. No idea exactly why though. I see no int4cmp() calls at all, but plenty of _bt_compare(), and some FunctionCall2Coll(). And yet, no _bt_binsrch(). I can see btoidcmp(), so if I'm right then I guess it's a system catalog index. I too would like to see a stack trace. Trivia: pg_attribute_relid_attnam_index has deleted B-Tree pages after a fresh initdb. -- Peter Geoghegan
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
I wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: But do we really need to backpatch any of this? Alexey's example consumes only a couple hundred MB in 9.2, vs about 7GB peak in 9.3 and up. That seems like a pretty nasty regression. I did a bit more measurement of the time and backend memory consumption for Alexey's example EXPLAIN: 9.2: 0.9 sec, circa 200 MB HEAD: 56 sec, circa 7300 MB with patch below: 3.7 sec, circa 300 MB So while this doesn't get us all the way back down to where we were before we started trying to guarantee unique table/column identifiers in EXPLAIN printouts, it's at least a lot closer. Not sure whether to just commit this to HEAD and call it a day, or to risk back-patching. It occurred to me that we could use your idea of a secondary data structure and minimize the code impact with a macro layer, ie #define grouping_stack pointer_field-groupingstack But I'm not sure if it's worth the trouble. 9.3 has been like this right along, and this is the first complaint. One compromise idea would be to back-patch only as far as 9.4. If there are extensions out there that have this ABI issue, expecting them to rebuild for 9.4.1 might be OK. regards, tom lane diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 8a0be5d..39ceaf2 100644 *** a/src/backend/commands/explain.c --- b/src/backend/commands/explain.c *** ExplainPrintPlan(ExplainState *es, Query *** 563,568 --- 563,570 es-rtable = queryDesc-plannedstmt-rtable; ExplainPreScanNode(queryDesc-planstate, rels_used); es-rtable_names = select_rtable_names_for_explain(es-rtable, rels_used); + es-deparse_cxt = deparse_context_for_plan_rtable(es-rtable, + es-rtable_names); ExplainNode(queryDesc-planstate, NIL, NULL, NULL, es); } *** show_plan_tlist(PlanState *planstate, Li *** 1678,1687 return; /* Set up deparsing context */ ! context = deparse_context_for_planstate((Node *) planstate, ! ancestors, ! es-rtable, ! 
es-rtable_names); useprefix = list_length(es-rtable) 1; /* Deparse each result column (we now include resjunk ones) */ --- 1680,1688 return; /* Set up deparsing context */ ! context = set_deparse_context_planstate(es-deparse_cxt, ! (Node *) planstate, ! ancestors); useprefix = list_length(es-rtable) 1; /* Deparse each result column (we now include resjunk ones) */ *** show_expression(Node *node, const char * *** 1710,1719 char *exprstr; /* Set up deparsing context */ ! context = deparse_context_for_planstate((Node *) planstate, ! ancestors, ! es-rtable, ! es-rtable_names); /* Deparse the expression */ exprstr = deparse_expression(node, context, useprefix, false); --- 1711,1719 char *exprstr; /* Set up deparsing context */ ! context = set_deparse_context_planstate(es-deparse_cxt, ! (Node *) planstate, ! ancestors); /* Deparse the expression */ exprstr = deparse_expression(node, context, useprefix, false); *** show_sort_group_keys(PlanState *planstat *** 1855,1864 return; /* Set up deparsing context */ ! context = deparse_context_for_planstate((Node *) planstate, ! ancestors, ! es-rtable, ! es-rtable_names); useprefix = (list_length(es-rtable) 1 || es-verbose); for (keyno = 0; keyno nkeys; keyno++) --- 1855,1863 return; /* Set up deparsing context */ ! context = set_deparse_context_planstate(es-deparse_cxt, ! (Node *) planstate, ! ancestors); useprefix = (list_length(es-rtable) 1 || es-verbose); for (keyno = 0; keyno nkeys; keyno++) diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index dd748ac..cacd72c 100644 *** a/src/backend/utils/adt/ruleutils.c --- b/src/backend/utils/adt/ruleutils.c *** deparse_context_for(const char *aliasnam *** 2520,2526 } /* ! * deparse_context_for_planstate - Build deparse context for a plan * * When deparsing an expression in a Plan tree, we might have to resolve * OUTER_VAR, INNER_VAR, or INDEX_VAR references. To do this, the caller must --- 2520,2562 } /* ! 
* deparse_context_for_plan_rtable - Build deparse context for a plan's rtable ! * ! * When deparsing an expression in a Plan tree, we use the plan's rangetable ! * to resolve names of simple Vars. The initialization of column names for ! * this is rather expensive if the rangetable is large, and it'll be the same ! * for every expression in the Plan tree; so we do it just once and re-use ! * the result of this function for each expression. (Note that the result ! * is not usable until set_deparse_context_planstate() is applied to it.) ! * ! * In addition to the plan's
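The cost pattern behind Tom's fix is easy to demonstrate outside PostgreSQL: building the deparse context is proportional to the number of columns, so rebuilding it for every plan node makes EXPLAIN quadratic overall, while building it once per plan is linear. A minimal Python sketch of the build-once-and-reuse idea (hypothetical names, not the actual C code):

```python
def build_context(columns):
    # Stands in for deparse_context_for_plan_rtable(): one entry per column,
    # so each call costs O(columns).
    return {col: "t." + col for col in columns}

def deparse_per_node(nodes, columns):
    # Old behaviour: rebuild the context for every node -> O(nodes * columns).
    builds = 0
    output = []
    for node in nodes:
        context = build_context(columns)
        builds += 1
        output.append(context[node])
    return output, builds

def deparse_shared(nodes, columns):
    # Patched behaviour: build once, reuse for every node -> O(columns + nodes).
    context = build_context(columns)
    return [context[node] for node in nodes], 1
```

Both paths produce identical deparsed output; only the number of context builds (and hence peak memory churn) differs, which mirrors the 7 GB vs. 300 MB numbers above.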
Re: [HACKERS] EXEC_BACKEND + logging_collector=on is broken
On 2015-01-13 20:21:55 +0100, Andres Freund wrote: On 2015-01-13 11:10:06 -0800, Magnus Hagander wrote: EXEC_BACKEND on Windows doesn't actually use a tempfile though, so I'm guessing that's it. Ah! Then this really is fairly harmless. Will fix and backpatch anyway, but the number of affected people should be pretty much zero. Done that now. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Sequence Access Method WIP
On 13/01/15 13:24, Tomas Vondra wrote: On 12.1.2015 22:33, Petr Jelinek wrote: On 15/12/14 11:36, Petr Jelinek wrote: On 10/12/14 03:33, Petr Jelinek wrote: On 24/11/14 12:16, Heikki Linnakangas wrote: About the rough edges: - The AlterSequence is not the prettiest code around, as we now have to create a new relation when the sequence AM is changed and I don't know how to do that nicely - I am not sure if I did the locking, command order and dependency handling in AlterSequence correctly This version does AlterSequence differently and better. Didn't attach the gapless sequence again as that one is unchanged. And another version, separated into a patch-set of 3 patches. First patch contains the seqam patch itself, not many changes there, mainly docs/comments related. What I wrote in the previous email for version 3 still applies. I did a review of the first part today - mostly by reading through the diff. I plan to do a bit more thorough testing in a day or two. I'll also look at the two (much smaller) patches. Thanks! comments/questions/general nitpicking: (1) Why treat an empty string as equal to 'local'? Isn't enforcing the actual value a better approach? Álvaro mentioned this on IM also; I already changed it to a saner normal GUC with 'local' as the default value in my working copy. (2) NITPICK: Maybe we could use access_method in the docs (instead of sequence_access_method), as the 'sequence' part is clear from the context? Yes. (3) Why index_reloptions / sequence_reloptions when both do exactly the same thing (call to common_am_reloptions)? I'd understand this if the patch(es) then changed sequence_reloptions(), but that's not happening. Maybe that's expected to happen? That's a leftover from the original design where the AM was supposed to call this; since this is not exposed to the AM anymore, I think it can be a single function now. (4) DOCS: Each sequence can only use one access method at a time. Does that mean a sequence can change the access method during its lifetime?
My understanding is that the AM is fixed after creating the sequence? Oh, I forgot to add ALTER SEQUENCE USING into the docs; you can change the AM even though you probably don't want to do it often, but for migrations it's useful. (8) check_default_seqam without a transaction * If we aren't inside a transaction, we cannot do database access so * cannot verify the name. Must accept the value on faith. In which situation does this happen? Wouldn't it be better to simply enforce the transaction and fail otherwise? Reading postgresql.conf during startup, I don't think we want to fail in that scenario ;) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] POLA violation with \c service=
On Sat, Jan 10, 2015 at 04:41:16PM -0800, David Fetter wrote: On Sat, Jan 10, 2015 at 09:30:57AM +0100, Erik Rijkers wrote: On Fri, January 9, 2015 20:15, David Fetter wrote: [psql_fix_uri_service_003.patch] Applies on master; the feature (switching services) works well but a \c without any parameters produces a segfault: (centos 6.6, 4.9.2, 64-bit) $ echo -en $PGSERVICEFILE\n$PGSERVICE\n$PGPORT\n /home/aardvark/.pg_service service_pola 6968 $ psql Timing is on. psql (9.5devel_service_pola_20150109_2340_ac7009abd228) Type help for help. testdb=# \c service=HEAD You are now connected to database testdb as user aardvark via socket in /tmp at port 6545. testdb=# \c service=service_pola You are now connected to database testdb as user aardvark via socket in /tmp at port 6968. testdb=# \c Segmentation fault (core dumped) Fixed by running that function only if the argument exists. More C cleanups, too. Added to the upcoming commitfest. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
[HACKERS] authentication_timeout ineffective for replication connections
Hi, I just noticed that authentication_timeout is ineffective for replication=true type connections. That's because walsender doesn't register a SIGINT handler and authentication_timeout relies on having one. There's no problem with reading the initial startup packet (ProcessStartupPacket/BackendInitialize) because we use a separate handler there. But once that's done, before finishing authentication, WalSndSignals() will have set SIGINT's handler to SIG_IGN. Demo python program attached. You'll only see the problem if the authentication method requires a password/additional packets. I think we could fix this by simply mapping SIGINT to die() instead of SIG_IGN. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

import socket
import time
import struct
import sys
import os

if len(sys.argv) != 6:
    print("wrong number of params: host port database user replication")
    sys.exit(1)

print("connecting")
conn = socket.create_connection((sys.argv[1], sys.argv[2]))
data = bytes()
# send protocol version (3.0), in network byte order
data = struct.pack('!I', 3 << 16)
# database
data += "database"
data += struct.pack('B', 0)
data += sys.argv[3]
data += struct.pack('B', 0)
# user
data += "user"
data += struct.pack('B', 0)
data += sys.argv[4]
data += struct.pack('B', 0)
# replication
data += "replication"
data += struct.pack('B', 0)
data += sys.argv[5]
data += struct.pack('B', 0)
# start packet terminator
data += struct.pack('B', 0)

print("sending startup packet")
# length of the length + length of the startup packet
conn.send(struct.pack('!i', 4 + len(data)))
# version
conn.send(data)

print("and going idle")
# and now sleep
time.sleep(3600)
Re: [HACKERS] WITH CHECK and Column-Level Privileges
On 13 January 2015 at 13:50, Stephen Frost sfr...@snowman.net wrote: * Dean Rasheed (dean.a.rash...@gmail.com) wrote: On 12 January 2015 at 22:16, Stephen Frost sfr...@snowman.net wrote: Alright, here's an updated patch which doesn't return any detail if no values are visible or if only a partial key is visible. Please take a look. I'm not thrilled with simply returning an empty string and then checking that for BuildIndexValueDescription and ExecBuildSlotValueDescription, but I figured we might have other users of BuildIndexValueDescription and I wasn't seeing a particularly better solution. Suggestions welcome, of course. Actually I'm starting to wonder if it's even worth bothering about the index case. To leak information, you'd need to have a composite key for which you only had access to a subset of the key columns (which in itself seems like a pretty rare case), and then you'd need to specify new values to make the entire key conflict with an existing value, at which point the fact that an exception is thrown tells you that the values in the index must be the same as your new values. I'm struggling to imagine a realistic scenario where this would be a problem. I'm not sure that I follow.. From re-reading the above a couple times, I take it you're making an argument that people would be silly to set their database up that way, but that's not an argument we can stand on when it comes to privileges. Additionally, as the regression tests hopefully show, if you have update on one column of a composite key, you could find out the entire key today by issuing an update against that column to set it to the same value throughout. You don't need to know what the rest of the key is but only that two records somewhere have the same values except for the one column you have update rights on. Hmm, yes I guess that's right. One improvement we could trivially make is to only do this for multi-column indexes. 
If there is only one column there is no danger of information leakage, right? Also, if we change BuildIndexValueDescription() in this way, it's going to make it more or less useless for updatable views, since in the most common case the user won't have permissions on the underlying table. That's certainly something to consider. I looked at trying to get which columns the user had provided down to BuildIndexValueDescription, but I couldn't find a straight-forward way to do that as it involves the index AMs which we can't change (and I'm not really sure we'd want to anyway). Yeah I couldn't see any easy way of doing it. 2 possibilities sprung to mind -- (1) wrap the index update in a PG_TRY() and add the detail in the catch block, or (2) track the currently active EState and make GetModifiedColumns() into an exported function taking a single EState argument (the EState has the currently active ResultRelInfo on it). Neither of those alternatives seems particularly attractive to me though. Regards, Dean
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
On 13.01.2015 16:47, Heikki Linnakangas wrote: Hmm, something like the attached? Seems reasonable... - Heikki Yes, I have tested something like this; it stopped eating memory. Just one small note on the patch you attached: maybe it would be safer to switch to oldcxt before calling ExplainPropertyText/ExplainPropertyList? Otherwise, are you sure that ExplainPropertyText and ExplainPropertyList will never perform any pallocs without immediate pfrees in the current memory context, even in the future? Regards, Alexey Bashtanov
Re: [HACKERS] Check that streaming replica received all data after master shutdown
On 5 Jan 2015, at 18:15, Vladimir Borodin r...@simply.name wrote: Hi all. I have a simple script for planned switchover of a PostgreSQL (9.3 and 9.4) master to one of its replicas. This script checks a lot of things before doing it, and one of them is that all data from the master has been received by the replica that is going to be promoted. Right now the check is done like below:

On the master:
postgres@pgtest03d ~ $ psql -t -A -c 'select pg_current_xlog_location();'
0/3390
postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_ctl stop -m fast
waiting for server to shut down done
server stopped
postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_controldata | head
pg_control version number: 937
Catalog version number: 201306121
Database system identifier: 6061800518091528182
Database cluster state: shut down
pg_control last modified: Mon 05 Jan 2015 06:47:57 PM MSK
Latest checkpoint location: 0/3428
Prior checkpoint location: 0/3328
Latest checkpoint's REDO location: 0/3428
Latest checkpoint's REDO WAL file: 001B0034
Latest checkpoint's TimeLineID: 27
postgres@pgtest03d ~ $

On the replica (after shutdown of master):
postgres@pgtest03g ~ $ psql -t -A -c "select pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/3428');"
104
postgres@pgtest03g ~ $

These 104 bytes seem to be the size of the shutdown checkpoint record (as I can understand from pg_xlogdump output).
postgres@pgtest03g ~/9.3/data/pg_xlog $ /usr/pgsql-9.3/bin/pg_xlogdump -s 0/3390 -t 27
rmgr: XLOG len (rec/tot): 0/32, tx: 0, lsn: 0/3390, prev 0/3328, bkp: , desc: xlog switch
rmgr: XLOG len (rec/tot): 72/104, tx: 0, lsn: 0/3428, prev 0/3390, bkp: , desc: checkpoint: redo 0/3428; tli 27; prev tli 27; fpw true; xid 0/6010; oid 54128; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
pg_xlogdump: FATAL: error in WAL record at 0/3428: record with zero length at 0/3490
postgres@pgtest03g ~/9.3/data/pg_xlog $

I'm not sure that this record will always be 104 bytes, so a strict equality check seems fragile. Could it change in the future? Or is there a better way to verify that a streaming replica has received all data after the master's shutdown? Checking that pg_xlog_location_diff returns exactly 104 bytes seems a bit strange. +hackers Could anyone help? Thanks. -- May the force be with you... http://simply.name
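The arithmetic that pg_xlog_location_diff() performs on the server can also be reproduced client-side in a switchover script. A minimal sketch of LSN parsing and subtraction, assuming the standard 'high/low' hexadecimal text format (helper names are my own, not PostgreSQL APIs):

```python
def parse_lsn(text):
    # Convert an LSN like '0/34000028' into an absolute byte position:
    # high 32 bits before the slash, low 32 bits after it.
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff(a, b):
    # Client-side equivalent of pg_xlog_location_diff(a, b):
    # number of bytes between two write-ahead log positions.
    return parse_lsn(a) - parse_lsn(b)
```

For example, two LSNs 0x68 bytes apart yield `lsn_diff(...) == 104`, the size observed above for the shutdown checkpoint record.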
Re: [HACKERS] Memory leak in receivelog.c when receiving stream
On Tue, Jan 13, 2015 at 5:45 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: This looks like a false positive to me. PQgetCopyData() will only return a buffer if its return value is greater than 0. Right. Sorry for the noise. -- Michael
Re: [HACKERS] Removing INNER JOINs
On 12 January 2015 at 15:57, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 8, 2015 at 6:31 AM, David Rowley dgrowle...@gmail.com wrote: I'd be keen to know what people's thoughts are about the nodeAlternativePlan only surviving until the plan is initialised. I find it scary, although sometimes I am easily frightened. OK, remember I'm not actually modifying the plan like I was in the earlier version of the patch. The Alternative Plan node simply initialises the correct plan and, instead of returning its own initialised state, it returns the initialised state of the selected plan's root node. I have to admit, it didn't really become clear in my head whether the frenzy of discussion above gave any sort of indication that this Alternative Plan node would remain and be shown in the EXPLAIN output, or whether the appropriate plan would be selected during plan initialisation and the new root node would become that of the selected plan. When I was implementing what was discussed, I decided that it would be better to choose the correct plan during initialisation rather than transitioning through the Alternative Plan node for each row. As proved already on this thread, transitioning through needless nodes adds needless executor time overhead. Also, if we kept the Alternative Plan node, then I think the EXPLAIN output would look rather weird and frighteningly complex, as it would effectively be 2 plans in 1. I'm actually quite happy with how simple the executor changes became. It's far simpler and cleaner than the node stripping code that I had in an earlier version. The part of the patch that I'm concerned might raise a few eyebrows is the change to query_planner(), as it now returns a List of RelOptInfo instead of a RelOptInfo. Regards David Rowley
Re: [HACKERS] [RFC] LSN Map
On 08/01/15 20:18, Jim Nasby wrote: On 1/7/15, 3:50 AM, Marco Nenciarini wrote: The current implementation tracks only heap LSNs. It currently does not track any kind of indexes, but this can easily be added later. Would it make sense to do this at the buffer level, instead of at the heap level? That means it would handle both heap and indexes. I don't know if the LSN is visible that far down though. Where exactly are you thinking of handling it? Also, this pattern is repeated several times; it would be good to put it in its own function: + lsnmap_pin(reln, blkno, lmbuffer); + lsnmap_set(reln, blkno, lmbuffer, lsn); + ReleaseBuffer(lmbuffer); Right. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
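The LSN map idea under discussion — remembering the newest LSN that touched each range of heap blocks, so a later incremental backup can skip unchanged ranges — can be sketched as follows. The granularity constant and function names here are illustrative assumptions, not taken from the patch:

```python
BLOCKS_PER_ENTRY = 2048  # assumed granularity: one map slot per 2048 heap blocks

def lsnmap_set(lsn_map, blkno, lsn):
    # Record that block blkno was modified at the given LSN; each slot
    # keeps only the highest LSN seen for its block range.
    slot = blkno // BLOCKS_PER_ENTRY
    if lsn > lsn_map.get(slot, 0):
        lsn_map[slot] = lsn

def slots_changed_since(lsn_map, since_lsn):
    # Slots whose contents may have changed after since_lsn; everything
    # else can be skipped by an incremental backup.
    return sorted(slot for slot, lsn in lsn_map.items() if lsn > since_lsn)
```

The coarse granularity is the trade-off: one slot covers many blocks, so the map stays tiny, at the cost of occasionally re-reading blocks that did not actually change.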
Re: [HACKERS] Parallel Seq Scan
On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost sfr...@snowman.net wrote: So, for my 2c, I've long expected us to parallelize at the relation-file level for these kinds of operations. This goes back to my other thoughts on how we should be thinking about parallelizing inbound data for bulk data loads, but it seems appropriate to consider it here also. One of the issues there is that 1G still feels like an awful lot for a minimum work size for each worker, and it would mean we don't parallelize for relations less than that size. Yes, I think that's a killer objection. One approach that has worked well for me is to break big jobs into much smaller bite-size tasks. Each task is small enough to complete quickly. We add the tasks to a task queue and spawn a generic worker pool which eats through the task queue items. This solves a lot of problems.
- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker to finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and reported on and becomes considerably more predictable.
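The bite-size-task pattern described above is easy to demonstrate outside PostgreSQL. A minimal sketch with a shared queue and a fixed worker pool — illustrative only, not a proposal for the executor:

```python
import queue
import threading

def run_pool(tasks, num_workers, handler):
    # Fill a thread-safe queue with small tasks, then let a fixed pool of
    # workers drain it; fast workers naturally pick up more tasks, so no
    # worker sits idle while another chugs through a huge chunk.
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            result = handler(task)
            with lock:
                results.append(result)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

Because task pickup is dynamic, the job splits itself: there is no need to partition the input perfectly up front, which is exactly the property Stephen's 1 GB-per-worker scheme lacks.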
Re: [HACKERS] Unused variables in hstore_to_jsonb
On Tue, Jan 13, 2015 at 5:36 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: You'll need to use (void) pushJsonbValue(...), otherwise you'll just get a different warning. See commit c8315930. Oh, I see. So this portion in contrib/ has been visibly missing. Attached is a new patch. -- Michael *** a/contrib/hstore/hstore_io.c --- b/contrib/hstore/hstore_io.c *** *** 1338,1344 hstore_to_jsonb(PG_FUNCTION_ARGS) JsonbParseState *state = NULL; JsonbValue *res; ! res = pushJsonbValue(state, WJB_BEGIN_OBJECT, NULL); for (i = 0; i count; i++) { --- 1338,1344 JsonbParseState *state = NULL; JsonbValue *res; ! (void) pushJsonbValue(state, WJB_BEGIN_OBJECT, NULL); for (i = 0; i count; i++) { *** *** 1349,1355 hstore_to_jsonb(PG_FUNCTION_ARGS) key.val.string.len = HS_KEYLEN(entries, i); key.val.string.val = HS_KEY(entries, base, i); ! res = pushJsonbValue(state, WJB_KEY, key); if (HS_VALISNULL(entries, i)) { --- 1349,1355 key.val.string.len = HS_KEYLEN(entries, i); key.val.string.val = HS_KEY(entries, base, i); ! (void) pushJsonbValue(state, WJB_KEY, key); if (HS_VALISNULL(entries, i)) { *** *** 1361,1367 hstore_to_jsonb(PG_FUNCTION_ARGS) val.val.string.len = HS_VALLEN(entries, i); val.val.string.val = HS_VAL(entries, base, i); } ! res = pushJsonbValue(state, WJB_VALUE, val); } res = pushJsonbValue(state, WJB_END_OBJECT, NULL); --- 1361,1367 val.val.string.len = HS_VALLEN(entries, i); val.val.string.val = HS_VAL(entries, base, i); } ! (void) pushJsonbValue(state, WJB_VALUE, val); } res = pushJsonbValue(state, WJB_END_OBJECT, NULL); *** *** 1385,1391 hstore_to_jsonb_loose(PG_FUNCTION_ARGS) initStringInfo(tmp); ! res = pushJsonbValue(state, WJB_BEGIN_OBJECT, NULL); for (i = 0; i count; i++) { --- 1385,1391 initStringInfo(tmp); ! (void) pushJsonbValue(state, WJB_BEGIN_OBJECT, NULL); for (i = 0; i count; i++) { *** *** 1396,1402 hstore_to_jsonb_loose(PG_FUNCTION_ARGS) key.val.string.len = HS_KEYLEN(entries, i); key.val.string.val = HS_KEY(entries, base, i); ! 
res = pushJsonbValue(state, WJB_KEY, key); if (HS_VALISNULL(entries, i)) { --- 1396,1402 key.val.string.len = HS_KEYLEN(entries, i); key.val.string.val = HS_KEY(entries, base, i); ! (void) pushJsonbValue(state, WJB_KEY, key); if (HS_VALISNULL(entries, i)) { *** *** 1471,1477 hstore_to_jsonb_loose(PG_FUNCTION_ARGS) val.val.string.val = HS_VAL(entries, base, i); } } ! res = pushJsonbValue(state, WJB_VALUE, val); } res = pushJsonbValue(state, WJB_END_OBJECT, NULL); --- 1471,1477 val.val.string.val = HS_VAL(entries, base, i); } } ! (void) pushJsonbValue(state, WJB_VALUE, val); } res = pushJsonbValue(state, WJB_END_OBJECT, NULL); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Escaping from blocked send() reprised.
On 2015-01-12 00:40:50 +0100, Andres Freund wrote: Fixed in what I've since pushed (as Heikki basically was ok with the patch sent a couple months back, modulo some fixes)... I'd not actually pushed that patch... I had pushed some patches (barriers, atomics), but had decided to hold off on this. I've now done so. I've mentioned the portability concerns over select() bugs in the commit message and a comment. ATM I'm not inclined to add a relatively elaborate test for the bug on pretty fringe platforms. Thanks for looking at this! I plan to continue with committing 1) Commonalize process startup code 2) Add a default local latch for use in signal handlers 3) Use a nonblocking socket for FE/BE communication and block using latches pretty soon. As we already seem to assume that WaitLatch() is signal safe/reentrant (c.f. walsender.c), I'm fine with committing 3) in isolation, without 4). I need a test that properly exercises catchup interrupts before committing that. Once I have that test I plan to commit 4) Introduce and use infrastructure for interrupt processing during client reads. I'd like some input from others on what they think about the problem that 5) Process 'die' interrupts while reading/writing from a socket. can reduce the likelihood of clients getting the error message. I personally think that's more than outweighed by not having backends stuck (save quickdie) for a long time when the client is gone/stuck. I think the middle ground in the patch — only processing die events when actually blocked in writes — reduces the likelihood of this sufficiently. I have hacks on top of this to get rid of ImmediateInterrupt altogether, although I'm not sure how well this will work for some parts of auth/crypt.c. Everything else, including the deadlock checker, seems quite doable. Andres
Re: [HACKERS] Unused variables in hstore_to_jsonb
On 01/13/2015 01:09 PM, Michael Paquier wrote: On Tue, Jan 13, 2015 at 5:36 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: You'll need to use (void) pushJsonbValue(...), otherwise you'll just get a different warning. See commit c8315930. Oh, I see. So this portion in contrib/ has been visibly missing. Attached is a new patch. Thanks, committed. - Heikki
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
On 01/13/2015 02:08 PM, Alexey Bashtanov wrote:
> I found that the EXPLAIN command takes very much memory to execute when huge unions are used. For example the following sql
>
> -- begin sql
> create table t (a000 int, a001 int, ... a099 int);
> explain select * from (
>     select a001 a from t
>     union all select a001 a from t
>     union all ... (1000 times) ...
>     union all select a001 a from t
> ) _ where a = 1;
> -- end sql
>
> took more than 1GB of memory to execute. It is the conversion of the plan to a human-readable form that causes the excessive memory usage, not the planning itself. By varying the parameters and reading the source code I determined that memory usage depends linearly on (plan node count)*(overall column count), and thus quadratically on the number of tables unionized.
>
> To remove this excessive memory usage I propose to run deparse_context_for_planstate + deparse_expression in a separate memory context and free it after each plan node is generated.

Hmm, something like the attached? Seems reasonable...

- Heikki

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8a0be5d..4509aab 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -27,6 +27,7 @@
 #include "utils/builtins.h"
 #include "utils/json.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
@@ -263,6 +264,15 @@ ExplainInitState(ExplainState *es)
 	es->costs = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
+
+	/*
+	 * Create a temporary memory context, used for deparsing expressions.
+	 */
+	es->tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
+									   "Temporary EXPLAIN context",
+									   ALLOCSET_DEFAULT_MINSIZE,
+									   ALLOCSET_DEFAULT_INITSIZE,
+									   ALLOCSET_DEFAULT_MAXSIZE);
 }
 
 /*
@@ -1664,6 +1674,7 @@ show_plan_tlist(PlanState *planstate, List *ancestors, ExplainState *es)
 	List	   *result = NIL;
 	bool		useprefix;
 	ListCell   *lc;
+	MemoryContext oldcxt;
 
 	/* No work if empty tlist (this occurs eg in bitmap indexscans) */
 	if (plan->targetlist == NIL)
@@ -1677,6 +1688,9 @@ show_plan_tlist(PlanState *planstate, List *ancestors, ExplainState *es)
 	if (IsA(plan, RecursiveUnion))
 		return;
 
+	/* Switch to a temporary memory context to limit memory usage */
+	oldcxt = MemoryContextSwitchTo(es->tmpCxt);
+
 	/* Set up deparsing context */
 	context = deparse_context_for_planstate((Node *) planstate,
 											ancestors,
@@ -1696,6 +1710,10 @@ show_plan_tlist(PlanState *planstate, List *ancestors, ExplainState *es)
 
 	/* Print results */
 	ExplainPropertyList("Output", result, es);
+
+	/* Clean up */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextReset(es->tmpCxt);
 }
 
 /*
@@ -1708,6 +1726,10 @@ show_expression(Node *node, const char *qlabel,
 {
 	List	   *context;
 	char	   *exprstr;
+	MemoryContext oldcxt;
+
+	/* Switch to a temporary memory context to limit memory usage */
+	oldcxt = MemoryContextSwitchTo(es->tmpCxt);
 
 	/* Set up deparsing context */
 	context = deparse_context_for_planstate((Node *) planstate,
@@ -1720,6 +1742,10 @@ show_expression(Node *node, const char *qlabel,
 
 	/* And add to es->str */
 	ExplainPropertyText(qlabel, exprstr, es);
+
+	/* Clean up */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextReset(es->tmpCxt);
 }
 
 /*
@@ -1850,10 +1876,14 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 	bool		useprefix;
 	int			keyno;
 	char	   *exprstr;
+	MemoryContext oldcxt;
 
 	if (nkeys <= 0)
 		return;
 
+	/* Switch to a temporary memory context to limit memory usage */
+	oldcxt = MemoryContextSwitchTo(es->tmpCxt);
+
 	/* Set up deparsing context */
 	context = deparse_context_for_planstate((Node *) planstate,
 											ancestors,
@@ -1877,6 +1907,10 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+
+	/* Clean up */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextReset(es->tmpCxt);
 }
 
 /*
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 6e26950..0a819e9 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -41,6 +41,8 @@ typedef struct ExplainState
 	List	   *rtable_names;	/* alias names for RTEs */
 	int			indent;			/* current indentation level */
 	List	   *grouping_stack; /* format-specific grouping state */
+
+	MemoryContext tmpCxt;		/* working context for deparsing expressions */
 } ExplainState;
 
 /* Hook for plugins to get control in ExplainOneQuery() */
Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC
Hi Marco, could you please send an updated version of the patch against the current HEAD in order to facilitate reviewers? Thanks, Gabriele

-- Gabriele Bartolini - 2ndQuadrant Italia - Managing Director PostgreSQL Training, Services and Support gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it

2015-01-07 11:00 GMT+01:00 Marco Nenciarini marco.nenciar...@2ndquadrant.it:

On 06/01/15 14:26, Robert Haas wrote: I suggest leaving this out altogether for the first version. I can think of three possible ways that we can determine which blocks need to be backed up. One, just read every block in the database and look at the LSN of each one. Two, maintain a cache of LSN information on a per-segment (or smaller) basis, as you suggest here. Three, scan the WAL generated since the incremental backup and summarize it into a list of blocks that need to be backed up. This last idea could either be done when the backup is requested, or it could be done as the WAL is generated and used to populate the LSN cache. In the long run, I think some variant of approach #3 is likely best, but in the short run, approach #1 (scan everything) is certainly easiest. While it doesn't optimize I/O, it still gives you the benefit of reducing the amount of data that needs to be transferred and stored, and that's not nothing. If we get that much working, we can improve things more later.

Hi, the patch now uses approach #1, but I've just sent a patch that uses the #2 approach. 54ad016e.9020...@2ndquadrant.it

Regards, Marco

-- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
[HACKERS] OOM on EXPLAIN with lots of nodes
Hello!

I found that the EXPLAIN command takes very much memory to execute when huge unions are used. For example the following sql

-- begin sql
create table t (a000 int, a001 int, ... a099 int);
explain select * from (
    select a001 a from t
    union all select a001 a from t
    union all ... (1000 times) ...
    union all select a001 a from t
) _ where a = 1;
-- end sql

took more than 1GB of memory to execute. It is the conversion of the plan to a human-readable form that causes the excessive memory usage, not the planning itself. By varying the parameters and reading the source code I determined that memory usage depends linearly on (plan node count)*(overall column count), and thus quadratically on the number of tables unionized.

To remove this excessive memory usage I propose to run deparse_context_for_planstate + deparse_expression in a separate memory context and free it after each plan node is generated. Are there any reasons to treat this idea as bad?

BTW, in this case EXPLAIN execution is also quite slow (I got tens of seconds), but I have no immediate ideas on how to improve that.

Regards, Alexey Bashtanov
Re: [HACKERS] Parallel Seq Scan
On Tue, Jan 13, 2015 at 7:25 AM, John Gorman johngorm...@gmail.com wrote: On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost sfr...@snowman.net wrote: So, for my 2c, I've long expected us to parallelize at the relation-file level for these kinds of operations. This goes back to my other thoughts on how we should be thinking about parallelizing inbound data for bulk data loads, but it seems appropriate to consider it here also. One of the issues there is that 1G still feels like an awful lot for a minimum work size for each worker, and it would mean we don't parallelize for relations less than that size.

Yes, I think that's a killer objection.

One approach that has worked well for me is to break big jobs into much smaller bite-size tasks. Each task is small enough to complete quickly. We add the tasks to a task queue and spawn a generic worker pool which eats through the task-queue items. This solves a lot of problems.

- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker to finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and reported on, and becomes considerably more predictable.

I forgot to mention that a running task queue can provide metrics such as current utilization, current average throughput, current queue length and estimated queue wait time. These can become dynamic cost factors in deciding whether to parallelize.
Re: [HACKERS] Sequence Access Method WIP
On 12.1.2015 22:33, Petr Jelinek wrote: On 15/12/14 11:36, Petr Jelinek wrote: On 10/12/14 03:33, Petr Jelinek wrote: On 24/11/14 12:16, Heikki Linnakangas wrote: About the rough edges: - The AlterSequence is not the prettiest code around, as we now have to create a new relation when the sequence AM is changed, and I don't know how to do that nicely. - I am not sure if I did the locking, command order and dependency handling in AlterSequence correctly. This version does AlterSequence differently and better. Didn't attach the gapless sequence again as that one is unchanged. And another version, separated into a patch set of 3 patches. The first patch contains the seqam patch itself; not many changes there, mainly docs/comments related. What I wrote in the previous email for version 3 still applies.

I did a review of the first part today - mostly by reading through the diff. I plan to do a bit more thorough testing in a day or two. I'll also look at the two (much smaller) patches.

comments/questions/general nitpicking:

(1) Why treat an empty string as equal to 'local'? Isn't enforcing the actual value a better approach?

(2) NITPICK: Maybe we could use access_method in the docs (instead of sequence_access_method), as the 'sequence' part is clear from the context?

(3) Why index_reloptions / sequence_reloptions when both do exactly the same thing (call to common_am_reloptions)? I'd understand this if the patch(es) then changed sequence_reloptions(), but that's not happening. Maybe that's expected to happen?

(4) DOCS: "Each sequence can only use one access method at a time." Does that mean a sequence can change the access method during its lifetime? My understanding is that the AM is fixed after creating the sequence?

(5) DOCS/NITPICK: SeqAM / SeqAm ... (probably should be the same?)

(6) "cache lookup failed for relation %u"

I believe this should reference 'sequence access method' instead of a relation.

(7) seqam_init

I believe this is a bug (not setting nulls[4] but setting nulls[3] twice):

+ fcinfo.argnull[0] = seqparams == NULL;
+ fcinfo.argnull[1] = reloptions_parsed == NULL;
+ fcinfo.argnull[2] = false;
+ fcinfo.argnull[3] = false;
+ fcinfo.argnull[3] = false;

(8) check_default_seqam without a transaction

 * If we aren't inside a transaction, we cannot do database access so
 * cannot verify the name. Must accept the value on faith.

In which situation does this happen? Wouldn't it be better to simply enforce the transaction and fail otherwise?

regards

-- Tomas Vondra   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Async execution of postgres_fdw.
Hello. This is a version 3 patch.

- PgFdwConnState is removed
- PgFdwConn is isolated as a separate module.
- The state transitions have been simplified, I think.
- A comment about multiple scans on a connection has been added.
- The issue of PREPARE is not addressed yet.
- It is meant to show how the new style looks, so comments are still missing for the individual PgFdwConn functions.
- Rebased to current master.

===

I added a comment describing the logic and meaning of the states just above the enum declaration.

This needs to be clarified further. But that can wait till we finalise the approach and the patch. Esp. the comment below is confusing:

1487 + * PGFDW_CONN_SYNC_RUNNING is rather an internal state in
1488 + * fetch_more_data(). It indicates that the function shouldn't send the next
1489 + * fetch request after getting the result.

I couldn't get the meaning of the second sentence, esp. its connection with synchronous-ness.

In this version, I removed PgFdwConnState. Now what indicates whether an async fetch is running is the existence of async_fetch. I think the complicated state transitions have disappeared.

The if condition if (entry->conn == NULL) in GetConnection(), used to track whether there is a PGconn active for the given entry, now tracks whether it has a PgFdwConn for the same.

After some consideration, I decided to make PgFdwConn be handled more similarly to PGconn. This patch has shrunk as a result and became clearer.

I think it's still prone to segfaults considering two pointer indirections.

PGconn itself already makes two-level indirection, and PgFdwConn has hidden its details mainly using macros. I may have misunderstood you, but if you're worried that PgFdwConn.pgconn can be set from anywhere, we should separate PgFdwConn into another C module and hide all the details as PGconn does. It is shown as the separate patch. But I feel it is a bit overdone because it is not an end-user interface.

I wrote the outline of the logic in the comment for enum PgFdwConnState in postgres_fdw.h. Does it make sense?

The point about two different ForeignScan nodes using the same connection needs some clarification, I guess. It's not very clear why there would be more queries run on the same connection. I know why this happens, but it's important to mention it somewhere. If it's already mentioned somewhere in the file, sorry for not paying attention to that.

Yeah. It is just what I stumbled on. I changed the comment in fetch_more_data() like below. Does it make sense?

| /*
|  * In the current postgres_fdw implementation, multiple PgFdwScanStates
|  * on the same foreign server and mapped user share the same
|  * connection to the remote server (see GetConnection() in
|  * connection.c) and individual scans on it are separated using
|  * cursors. Since one connection cannot accept two or more
|  * asynchronous queries simultaneously, we should stop the async
|  * fetching if another scan comes.
|  */
|
| if (PFCgetNscans(conn) > 1)
| 	PFCsetAsyncScan(conn, NULL);

I think there is more chance that there will be more than one ForeignScan node trying to interact with a foreign server, even after the push-down work. The current solution doesn't address that. We actually need parallel querying in two cases: 1. Querying multiple servers in parallel. 2. Querying the same server (by two querists) in parallel within the same query, e.g. an un-pushable join. We need a solution which works in both cases.

The first point is realized by this patch, with some limitations. The second point is what my old patch did: it made a dedicated connection for individual scans up to some fixed number aside from the main connection; the overflowed scans then went to the main connection and were done in the manner the unpatched postgres_fdw uses.

I was thinking that the 'some fixed number' could be set by a parameter of the foreign server, but I got a comment that it could fill up the remote server without reasonable load and/or bandwidth management. So I abandoned the multiple-connection solution and decided to do everything on the first connection. That is how the current patch came about.

Is it possible to use the parallel query infrastructure being built by Robert, or to do something like a parallel seq scan? That will work not just for the postgres FDW but for all FDWs.

I haven't looked closely at the patch, but if my understanding from a glance is correct, the parallel scan divides the target table into multiple parts and then runs subscans on every part in parallel. It might allocate dedicated processes for every child scan on a partitioned table. But, I think, from the performance view, each of multiple foreign scans doesn't need a corresponding local process. Still, if the parallel scan infrastructure makes the mechanism simpler, using it is a very promising choice.

In case of Prepared statements, ExecInit is called at the end of planning, without subsequent execution like the case of
Re: [HACKERS] WITH CHECK and Column-Level Privileges
On 12 January 2015 at 22:16, Stephen Frost sfr...@snowman.net wrote: Alright, here's an updated patch which doesn't return any detail if no values are visible or if only a partial key is visible. Please take a look. I'm not thrilled with simply returning an empty string and then checking that for BuildIndexValueDescription and ExecBuildSlotValueDescription, but I figured we might have other users of BuildIndexValueDescription and I wasn't seeing a particularly better solution. Suggestions welcome, of course. Actually I'm starting to wonder if it's even worth bothering about the index case. To leak information, you'd need to have a composite key for which you only had access to a subset of the key columns (which in itself seems like a pretty rare case), and then you'd need to specify new values to make the entire key conflict with an existing value, at which point the fact that an exception is thrown tells you that the values in the index must be the same as your new values. I'm struggling to imagine a realistic scenario where this would be a problem. Also, if we change BuildIndexValueDescription() in this way, it's going to make it more or less useless for updatable views, since in the most common case the user won't have permissions on the underlying table. For CHECK constraints/options, the change looks more reasonable, and I guess that approach could be extended to RLS by only including the modified columns, not the ones with select permissions, if RLS is enabled on the table. One minor comment on the code -- you could save a few cycles by only calling GetModifiedColumns() in the exceptional case. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WITH CHECK and Column-Level Privileges
* Dean Rasheed (dean.a.rash...@gmail.com) wrote: On 12 January 2015 at 22:16, Stephen Frost sfr...@snowman.net wrote: Alright, here's an updated patch which doesn't return any detail if no values are visible or if only a partial key is visible. Please take a look. I'm not thrilled with simply returning an empty string and then checking that for BuildIndexValueDescription and ExecBuildSlotValueDescription, but I figured we might have other users of BuildIndexValueDescription and I wasn't seeing a particularly better solution. Suggestions welcome, of course. Actually I'm starting to wonder if it's even worth bothering about the index case. To leak information, you'd need to have a composite key for which you only had access to a subset of the key columns (which in itself seems like a pretty rare case), and then you'd need to specify new values to make the entire key conflict with an existing value, at which point the fact that an exception is thrown tells you that the values in the index must be the same as your new values. I'm struggling to imagine a realistic scenario where this would be a problem. I'm not sure that I follow.. From re-reading the above a couple times, I take it you're making an argument that people would be silly to set their database up that way, but that's not an argument we can stand on when it comes to privileges. Additionally, as the regression tests hopefully show, if you have update on one column of a composite key, you could find out the entire key today by issuing an update against that column to set it to the same value throughout. You don't need to know what the rest of the key is but only that two records somewhere have the same values except for the one column you have update rights on. Also, if we change BuildIndexValueDescription() in this way, it's going to make it more or less useless for updatable views, since in the most common case the user won't have permissions on the underlying table. That's certainly something to consider. 
I looked at trying to get which columns the user had provided down to BuildIndexValueDescription, but I couldn't find a straightforward way to do that, as it involves the index AMs, which we can't change (and I'm not really sure we'd want to anyway).

For CHECK constraints/options, the change looks more reasonable, and I guess that approach could be extended to RLS by only including the modified columns, not the ones with select permissions, if RLS is enabled on the table. One minor comment on the code -- you could save a few cycles by only calling GetModifiedColumns() in the exceptional case.

Agreed, that sounds like a good approach for how to address the RLS concern. Also, good point about GetModifiedColumns(). There are a few other minor changes that I want to make on re-reading it also. I should be able to post a new version later today.

Thanks!

Stephen
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
Heikki Linnakangas hlinnakan...@vmware.com writes: On 01/13/2015 02:08 PM, Alexey Bashtanov wrote: By varying the parameters and reading source code I determined that memory usage linearly depends on (plan nodes count)*(overall columns count), thus it quadratically depends on number of tables unionized. To remove this excessive memory usage I propose to run deparse_context_for_planstate+deparse_expression in a separate memory context and free it after a plan node is generated. Hmm, something like the attached? Seems reasonable... This looks pretty unsafe to me: it assumes, without much justification, that there is no memory allocated during show_expression() that will be needed later. I suspect the O(N^2) consumption comes directly from some workspace allocated for variable deparsing in ruleutils.c. A safer fix would be to do retail pfrees on that. regards, tom lane
Re: [HACKERS] Safe memory allocation functions
I wrote: Michael Paquier michael.paqu...@gmail.com writes: Attached is a patch adding the following set of functions for frontend and backends returning NULL instead of reporting ERROR when allocation fails: - palloc_safe - palloc0_safe - repalloc_safe Unimpressed with this naming convention. _unsafe would be nearer the mark ;-) Less snarkily: _noerror would probably fit better with existing precedents in our code. However, there is a larger practical problem with this whole concept, which is that experience should teach us to be very wary of the assumption that asking for memory the system can't give us will just lead to nice neat malloc-returns-NULL behavior. Any small perusal of the mailing list archives will remind you that very often the end result will be SIGSEGV, OOM kills, unrecoverable trap-on-write when the kernel realizes it can't honor a copy-on-write promise, yadda yadda. Agreed that it's arguable that these only occur in misconfigured systems ... but misconfiguration appears to be the default in a depressingly large fraction of systems. (This is another reason for _safe not being the mot juste :-() In that light, I'm not really convinced that there's a safe use-case for a behavior like this. I certainly wouldn't risk asking for a couple of gigabytes on the theory that I could just ask for less if it fails. regards, tom lane
Re: [HACKERS] [BUGS] BUG #12070: hstore extension: hstore_to_json_loose produces invalid JSON
On 11/30/14 11:45 AM, Tom Lane wrote: The API break isn't a big issue imo. The net effect would be that e.g. hstore 9.3.6 wouldn't work against a 9.3.5 server. We do that sort of thing *all the time* --- at least twice in the past year, according to a quick scan of the commit logs. If you were changing or removing a function that third-party code might depend on, it'd be problematic, but an addition has no such risk.

This sort of thing is actually a bit of an annoyance, because it means that for minor-version upgrades you need to stop the server before unpacking the new version; otherwise the old running server will try to load the new hstore module and fail with a symbol lookup error. This can increase the downtime significantly. Yes, we've done this before, and people have gotten bitten by it before.
Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC
On 13/01/15 12:53, Gabriele Bartolini wrote:
> Hi Marco, could you please send an updated version of the patch against the current HEAD in order to facilitate reviewers?

Here is the updated patch for incremental file-based backup. It is based on the current HEAD. I'm now working on the client tool to rebuild a full backup starting from a file-based incremental backup.

Regards, Marco

-- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it

From 50ff0872d3901a30b6742900170052eabe0e06dd Mon Sep 17 00:00:00 2001
From: Marco Nenciarini marco.nenciar...@2ndquadrant.it
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup v4

Add backup profile to pg_basebackup
INCREMENTAL option implementation
---
 src/backend/access/transam/xlog.c      |   7 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 335 +++++++++++++++++++++++++++++---
 src/backend/replication/repl_gram.y    |   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  |  83 ++++++--
 src/include/access/xlog.h              |   3 +-
 src/include/replication/basebackup.h   |   4 +
 8 files changed, 409 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 839ea7c..625a5df 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*** XLogFileNameP(TimeLineID tli, XLogSegNo
*** 9249,9255 ****
   * permissions of the calling user!
   */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
  				   char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
--- 9249,9256 ----
   * permissions of the calling user!
   */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast,
! 				   XLogRecPtr incremental_startpoint,
  				   TimeLineID *starttli_p, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
*** do_pg_start_backup(const char *backupids
*** 9468,9473 ****
--- 9469,9478 ----
  					 (uint32) (startpoint >> 32), (uint32) startpoint,
  					 xlogfilename);
  	appendStringInfo(labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
  					 (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
+ 	if (incremental_startpoint > 0)
+ 		appendStringInfo(labelfbuf, "INCREMENTAL FROM LOCATION: %X/%X\n",
+ 						 (uint32) (incremental_startpoint >> 32),
+ 						 (uint32) incremental_startpoint);
  	appendStringInfo(labelfbuf, "BACKUP METHOD: %s\n",
  					 exclusive ? "pg_start_backup" : "streamed");
  	appendStringInfo(labelfbuf, "BACKUP FROM: %s\n",
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 2179bf7..ace84d8 100644
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
*** pg_start_backup(PG_FUNCTION_ARGS)
*** 59,65 ****
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  				 errmsg("must be superuser or replication role to run a backup")));
  
! 	startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
  
  	PG_RETURN_LSN(startpoint);
  }
--- 59,65 ----
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  				 errmsg("must be superuser or replication role to run a backup")));
  
! 	startpoint = do_pg_start_backup(backupidstr, fast, 0, NULL, NULL);
  
  	PG_RETURN_LSN(startpoint);
  }
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 07030a2..05b19c5 100644
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***************
*** 30,40 ****
--- 30,42 ----
  #include "replication/basebackup.h"
  #include "replication/walsender.h"
  #include "replication/walsender_private.h"
+ #include "storage/bufpage.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
  #include "utils/builtins.h"
  #include "utils/elog.h"
  #include "utils/ps_status.h"
+ #include "utils/pg_lsn.h"
  #include "utils/timestamp.h"
*** typedef struct
*** 46,56 ****
  	bool		nowait;
  	bool		includewal;
  	uint32		maxrate;
  } basebackup_options;
  
! static int64 sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces);
! static int64 sendTablespace(char *path, bool sizeonly);
  static bool sendFile(char *readfilename, char *tarfilename,
  		 struct stat * statbuf, bool missing_ok);
  static void sendFileWithContent(const char
Re: [HACKERS] WITH CHECK and Column-Level Privileges
* Dean Rasheed (dean.a.rash...@gmail.com) wrote: One improvement we could trivially make is to only do this for multi-column indexes. If there is only one column there is no danger of information leakage, right?

That's an interesting thought. If there's only one column then to have a conflict you must be changing it and providing a new value with either a constant, through a column on which you must have select rights, or with a function you have execute rights on. So, no, I can't think of a way that would leak information. I'm still on the fence about it though, as it might be confusing to have single-column indexes behave differently, and I'm a bit worried that, even if there isn't a way to exploit this now, there might be in the future.

Yeah, I couldn't see any easy way of doing it. Two possibilities sprang to mind -- (1) wrap the index update in a PG_TRY() and add the detail in the catch block, or (2) track the currently active EState and make GetModifiedColumns() into an exported function taking a single EState argument (the EState has the currently active ResultRelInfo on it). Neither of those alternatives seems particularly attractive to me though.

The EState is available when dealing with exclusion constraints, but it's not easily available to _bt_check_unique, which is the bigger issue. GetModifiedColumns() could (and probably should, really) be moved into a .h somewhere, as it's also in trigger.c (actually, that's where I pulled it from :).

Thanks,

Stephen
Re: [HACKERS] OOM on EXPLAIN with lots of nodes
I wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: Hmm, something like the attached? Seems reasonable... This looks pretty unsafe to me: it assumes, without much justification, that there is no memory allocated during show_expression() that will be needed later. I suspect the O(N^2) consumption comes directly from some workspace allocated for variable deparsing in ruleutils.c. A safer fix would be to do retail pfrees on that. I looked at this a bit more, and realized that we've got worse problems than just O(N^2) space consumption: there's also O(N^2) time consumption in ruleutils' set_simple_column_names(). The constant factor is not too bad in this example but would be horrible with 1000 regular relations in the rtable :-(. Safe or not, the extra-context solution doesn't improve the code's time consumption. Really the issue here is that explain.c assumes that deparse_context_for_planstate() is negligibly cheap, which was once true but is so no longer. What's more, most of the work is dependent only on the rtable and so there is no good reason at all to be doing it more than once per plan tree. I think we can fix this by refactoring so that we construct a deparse context once in ExplainPrintPlan(), and then just repoint it at different planstate nodes as we work through the plan tree. A difficulty with either your patch or my idea is that they require adding another field to ExplainState, which is an ABI break for any third-party code that might be declaring variables of that struct type. That's fine for HEAD but would be risky to back-patch. Any thoughts about whether we can get away with that (ie, anybody have an idea if there are third-party extensions that call explain.c)? regards, tom lane
[HACKERS] hung backends stuck in spinlock heavy endless loop
On my workstation today (running vanilla 9.4.0) I was testing some new code that does aggressive parallel loading to a couple of tables. It ran ok several dozen times and froze up with no external trigger. There were at most 8 active backends that were stuck (the loader is threaded to a cap) -- each query typically resolves in a few seconds but they were hung for 30 minutes+. Had to do restart immediate as backends were not responding to cancel...but I snapped a 'perf top' before I did so. The results were interesting so I'm posting them here. So far I have not been able to reproduce...FYI 61.03% postgres [.] s_lock 13.56% postgres [.] LWLockRelease 10.11% postgres [.] LWLockAcquire 4.02% perf [.] 0x000526d3 1.65% postgres [.] _bt_compare 1.60% libc-2.17.so [.] 0x00081069 0.66% [kernel] [k] kallsyms_expand_symbol.constprop.1 0.60% [kernel] [k] format_decode 0.57% [kernel] [k] number.isra.1 0.47% [kernel] [k] memcpy 0.44% postgres [.] ReleaseAndReadBuffer 0.44% postgres [.] FunctionCall2Coll 0.41% [kernel] [k] vsnprintf 0.41% [kernel] [k] module_get_kallsym 0.32% postgres [.] _bt_relandgetbuf 0.31% [kernel] [k] string.isra.5 0.31% [kernel] [k] strnlen 0.31% postgres [.] _bt_moveright 0.28% libc-2.17.so [.] getdelim 0.22% postgres [.] LockBuffer 0.16% [kernel] [k] seq_read 0.16% libc-2.17.so [.] __libc_calloc 0.13% postgres [.] _bt_checkpage 0.09% [kernel] [k] pointer.isra.15 0.09% [kernel] [k] update_iter 0.08% plugin_host [.] PyObject_GetAttr 0.06% [kernel] [k] strlcpy 0.06% [kernel] [k] seq_vprintf 0.06% [kernel] [k] copy_user_enhanced_fast_string 0.06% libc-2.17.so [.] _IO_feof 0.06% postgres [.] btoidcmp 0.06% [kernel] [k] page_fault 0.06% libc-2.17.so [.] free 0.06% libc-2.17.so [.] memchr 0.06% libpthread-2.17.so [.] __pthread_mutex_unlock_usercnt merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
Hi, On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote: On my workstation today (running vanilla 9.4.0) I was testing some new code that does aggressive parallel loading to a couple of tables. It ran ok several dozen times and froze up with no external trigger. There were at most 8 active backends that were stuck (the loader is threaded to a cap) -- each query typically resolves in a few seconds but they were hung for 30 minutes+. Interesting. Had to do restart immediate as backends were not responding to cancel...but I snapped a 'perf top' before I did so. The results were interesting so I'm posting them here. So far I have not been able to reproduce...FYI Can you compile postgres with -fno-omit-frame-pointer? Then, when this happens the next time, you can take a perf record -g, which will tell us which lock the contention is at. 61.03% postgres [.] s_lock 13.56% postgres [.] LWLockRelease 10.11% postgres [.] LWLockAcquire That profile looks like it might end up being helped by the lwlock and/or freelist changes in 9.5. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 4:33 PM, Andres Freund and...@2ndquadrant.com wrote: Hi, On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote: On my workstation today (running vanilla 9.4.0) I was testing some new code that does aggressive parallel loading to a couple of tables. It ran ok several dozen times and froze up with no external trigger. There were at most 8 active backends that were stuck (the loader is threaded to a cap) -- each query typically resolves in a few seconds but they were hung for 30 minutes+. Interesting. Had to do restart immediate as backends were not responding to cancel...but I snapped a 'perf top' before I did so. The results were interesting so I'm posting them here. So far I have not been able to reproduce...FYI Can you compile postgres with -fno-omit-frame-pointer? Then, when this happens the next time, you can take a perf record -g, which will tell us which lock the contention is at. will do, and I'll loop it for a while and see if I can get it to re-occur. merlin
Re: [HACKERS] pg_basebackup fails with long tablespace paths
On 1/7/15 3:19 PM, Robert Haas wrote: On Tue, Jan 6, 2015 at 4:33 PM, Peter Eisentraut pete...@gmx.net wrote: Currently, when you unpack a tarred basebackup with tablespaces, the symlinks will tell you whether you have unpacked the tablespace tars at the right place. Otherwise, how do you know? Secondly, you also have the option of putting the tablespaces somewhere else by changing the symlinks. That's a good argument for making the tablespace-map file human-readable and human-editable, but I don't think it's an argument for duplicating its contents inaccurately in the filesystem. One way to address this would be to do away with the symlinks altogether and have pg_tblspc/12345 be a text file that contains the tablespace location. Kind of like symlinks implemented in user space. Well, that's just spreading the tablespace-map file out into several files, and maybe keeping it around after we've restored from backup. I think the key point I'm approaching is that the information should only ever be in one place, all the time. This is not dissimilar from why we took the tablespace location out of the system catalogs. Users might have all kinds of workflows for how they back up, restore, and move their tablespaces. This works pretty well right now, because the authoritative configuration information is always in plain view. The proposal is essentially that we add another location for this information, because the existing location is incompatible with some operating system tools. And, when considered by a user, that second location might or might not collide with or overwrite the first location at some mysterious times. So I think the preferable fix is not to add a second location, but to make the first location compatible with said operating system tools, possibly in the way I propose above.
Re: [HACKERS] pg_rewind in contrib
There are two ways in which access control for replication connections is separate:
- replication pseudo-database in pg_hba.conf
- replication role attribute

If someone has a restrictive setup for replication and pg_basebackup, and then pg_rewind enters, they will have to
- add a line to pg_hba.conf to allow non-replication access
- connect as superuser

The first is an inconvenience, although it might be useful for this and other reasons to have a variant of "all" that includes replication. The second, however, is a problem, because you have to change from a restricted, quasi-read-only user to full superuser access. One way to address that might be if we added backend functions that are variants of pg_ls_dir() and pg_stat_file() that are accessible only with the replication attribute. Or we allow calling pg_ls_dir() and pg_stat_file() with relative paths with the replication attribute. (While we're at it, add a recursive variant of pg_ls_dir().) Or some variant of that.
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 2:29 PM, Merlin Moncure mmonc...@gmail.com wrote: On my workstation today (running vanilla 9.4.0) I was testing some new code that does aggressive parallel loading to a couple of tables. Could you give more details, please? For example, I'd like to see representative data, or at least the table definitions involved. Do you have any idea what index is involved with the _bt_compare() calls above? I'm curious as to what its definition is. You might also consider using my B-Tree verification tool on a preserved data directory: http://www.postgresql.org/message-id/cam3swzrk2yxkegj94_2wks_vbemjehc1ye23ecss01wrakz...@mail.gmail.com I'm inclined to think that this is a livelock, and so the problem isn't evident from the structure of the B-Tree, but it can't hurt to check. The tool is bleeding edge, so don't use it in production. Thanks -- Peter Geoghegan
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote: I'm inclined to think that this is a livelock, and so the problem isn't evident from the structure of the B-Tree, but it can't hurt to check. My guess is rather that it's contention on the freelist lock via StrategyGetBuffer's. I've seen profiles like this due to exactly that before - and it fits to parallel loading quite well. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote: I'm inclined to think that this is a livelock, and so the problem isn't evident from the structure of the B-Tree, but it can't hurt to check. My guess is rather that it's contention on the freelist lock via StrategyGetBuffer's. I've seen profiles like this due to exactly that before - and it fits to parallel loading quite well. I think I've got it to pop again. s_lock is only showing 35% (increasing very slowly if at all) but performance is mostly halted. Frame pointer is compiled out. perf report attached. merlin

35.82% postgres [.] s_lock
23.71% postgres [.] tas
14.01% postgres [.] tas
 6.82% postgres [.] spin_delay
 5.93% postgres [.] LWLockRelease
 4.36% postgres [.] LWLockAcquireCommon
 2.34% postgres [.] FunctionCall2Coll
 1.79% postgres [.] _bt_compare
 0.80% postgres [.] LockBuffer
 0.52% postgres [.] btoidcmp
 0.49% postgres [.] ReleaseAndReadBuffer
 0.47% postgres [.] _bt_moveright
 0.36% postgres [.] _bt_checkpage
 0.36% postgres [.] _bt_relandgetbuf
 0.20% perf [.] 0x0004312a
 0.19% postgres [.] LWLockAcquire
 0.13% libv8.so [.] 0x001bbbe0
 0.11% libc-2.17.so [.] 0x00151134
 0.09% libwebkit.so [.] 0x0106ccb7
 0.05% libgdk_pixbuf-2.0.so.0.2800.1 [.] 0x000139c7
 0.04% Xorg [.] 0x000efb00
 0.03% libglib-2.0.so.0.3800.1 [.] 0x000876a2
 0.03% [kernel] [k] __ticket_spin_lock
 0.02% perf-12966.map [.] 0x0625db944621
 0.02% libskia.so [.] 0x0021d15f
 0.02% libcairo.so.2.11200.16 [.] 0x0002440b
 0.02% libpthread-2.17.so [.] __pthread_mutex_unlock_usercnt
 0.02% [kernel] [k] copy_user_enhanced_fast_string
 0.02% radeon_drv.so [.] 0x00045002
 0.02% libpthread-2.17.so [.] pthread_mutex_lock
 0.02% [kernel] [k] fget_light
 0.01% [kernel] [k] __schedule
 0.01% libexa.so [.] 0x6e1d
 0.01% libgdk-x11-2.0.so.0.2400.20 [.] 0x0002f0b0
 0.01% libvte.so.9.2800.2 [.] 0x00039028
 0.01% [radeon] [k] r100_mm_rreg
 0.01% [kernel] [k] xhci_irq
 0.01% [kernel] [k] native_write_msr_safe
 0.01% [kernel] [k] update_cfs_rq_blocked_load
 0.01% libglib-2.0.so.0.3800.1 [.] g_hash_table_lookup
 0.01% libgobject-2.0.so.0.3800.1 [.] g_type_check_instance_is_a
Re: [HACKERS] Safe memory allocation functions
Tom Lane writes: [blah] (This is another reason for _safe not being the mot juste :-() My wording was definitely incorrect, but I'm sure you got it: I should have said safe on error. noerror or error_safe would definitely be more correct. In that light, I'm not really convinced that there's a safe use-case for a behavior like this. I certainly wouldn't risk asking for a couple of gigabytes on the theory that I could just ask for less if it fails. That's also a matter of documentation. We could add a couple of lines in, for example, xfunc.sgml to describe the limitations of such APIs. -- Michael
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote: On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote: I'm inclined to think that this is a livelock, and so the problem isn't evident from the structure of the B-Tree, but it can't hurt to check. My guess is rather that it's contention on the freelist lock via StrategyGetBuffer's. I've seen profiles like this due to exactly that before - and it fits to parallel loading quite well. I think I've got it to pop again. s_lock is only showing 35% (increasing very slowly if at all) but performance is mostly halted. Frame pointer is compiled out. perf report attached. 35.82% postgres [.] s_lock 23.71% postgres [.] tas 14.01% postgres [.] tas 6.82% postgres [.] spin_delay 5.93% postgres [.] LWLockRelease 4.36% postgres [.] LWLockAcquireCommon Interesting. This profile looks quite different? What kind of hardware is this on? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote: My guess is rather that it's contention on the freelist lock via StrategyGetBuffer's. I've seen profiles like this due to exactly that before - and it fits to parallel loading quite well. I'm not saying you're wrong, but the breakdown of _bt_moveright() relative to _bt_relandgetbuf() calls seems a bit fishy to me. I don't remember seeing _bt_moveright() or _bt_compare() figuring so prominently, where _bt_binsrch() is nowhere to be seen. I can't see a reference to _bt_binsrch() in either profile. -- Peter Geoghegan
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 5:49 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote: My guess is rather that it's contention on the freelist lock via StrategyGetBuffer's. I've seen profiles like this due to exactly that before - and it fits to parallel loading quite well. I'm not saying you're wrong, but the breakdown of _bt_moveright() relative to _bt_relandgetbuf() calls seems a bit fishy to me. I don't remember seeing _bt_moveright() or _bt_compare() figuring so prominently, where _bt_binsrch() is nowhere to be seen. I can't see a reference to _bt_binsrch() in either profile. hm, this is hand compiled now, I bet the symbols are wrong. merlin
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On 2015-01-13 15:49:33 -0800, Peter Geoghegan wrote: On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote: My guess is rather that it's contention on the freelist lock via StrategyGetBuffer's. I've seen profiles like this due to exactly that before - and it fits to parallel loading quite well. I'm not saying you're wrong, but the breakdown of _bt_moveright() relative to _bt_relandgetbuf() calls seems a bit fishy to me. The hierarchical profile definitely doesn't look like my guess was right. I don't remember seeing _bt_moveright() or _bt_compare() figuring so prominently, where _bt_binsrch() is nowhere to be seen. I can't see a reference to _bt_binsrch() in either profile. Well, we do a _bt_moveright pretty early on, so that actually might be cache misses we're primarily seeing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure mmonc...@gmail.com wrote: I don't remember seeing _bt_moveright() or _bt_compare() figuring so prominently, where _bt_binsrch() is nowhere to be seen. I can't see a reference to _bt_binsrch() in either profile. hm, this is hand compiled now, I bet the symbols are wrong. In case it isn't clear, I think that the proximate cause here may well be either one (or both) of commits efada2b8e920adfdf7418862e939925d2acd1b89 and/or 40dae7ec537c5619fc93ad602c62f37be786d161. Probably the latter. I think that the profile is roughly consistent with that, although I may well be wrong. -- Peter Geoghegan
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure mmonc...@gmail.com wrote: I don't remember seeing _bt_moveright() or _bt_compare() figuring so prominently, where _bt_binsrch() is nowhere to be seen. I can't see a reference to _bt_binsrch() in either profile. hm, this is hand compiled now, I bet the symbols are wrong. In case it isn't clear, I think that the proximate cause here may well be either one (or both) of commits efada2b8e920adfdf7418862e939925d2acd1b89 and/or 40dae7ec537c5619fc93ad602c62f37be786d161. Probably the latter. I think that the profile is roughly consistent with that, although I may well be wrong. I'm out of time for the day, but I'm fairly confident I can reproduce. I'll see if I can reverse those commits tomorrow and retest (I'm on development box). merlin
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote: On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote: I'm inclined to think that this is a livelock, and so the problem isn't evident from the structure of the B-Tree, but it can't hurt to check. My guess is rather that it's contention on the freelist lock via StrategyGetBuffer's. I've seen profiles like this due to exactly that before - and it fits to parallel loading quite well. I think I've got it to pop again. s_lock is only showing 35% (increasing very slowly if at all) but performance is mostly halted. Frame pointer is compiled out. perf report attached. 35.82% postgres[.] s_lock 23.71% postgres[.] tas 14.01% postgres[.] tas 6.82% postgres[.] spin_delay 5.93% postgres[.] LWLockRelease 4.36% postgres[.] LWLockAcquireCommon Interesting. This profile looks quite different? yep, it's very stable, and the database is approximately frozen. What kind of hardware is this on? linux on crap workstation box, except I have good ssd (intel 3500). 
model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz

top - 17:44:00 up 11 days, 6:25, 15 users, load average: 6.94, 6.97, 5.73
Tasks: 259 total, 8 running, 250 sleeping, 0 stopped, 1 zombie
%Cpu0 : 95.7 us, 0.0 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 93.5 us, 2.2 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 95.7 us, 0.0 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 93.6 us, 2.1 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 8131028 total, 7977608 used, 153420 free, 21424 buffers
KiB Swap: 8340476 total, 1884900 used, 6455576 free, 5201188 cached

  PID USER     PR NI VIRT  RES  SHR  S %CPU %MEM   TIME+  COMMAND
32492 mmoncure 20  0 175m  12m 9040 R 64.0  0.2  9:28.42 postgres
32367 mmoncure 20  0 175m  12m 9128 R 57.3  0.2  9:35.87 postgres
31792 mmoncure 20  0 175m  92m  88m R 52.9  1.2  9:57.50 postgres
32725 mmoncure 20  0 175m  13m 9852 R 52.9  0.2  9:08.50 postgres
31899 mmoncure 20  0 179m 104m  97m R 50.7  1.3  9:49.51 postgres
32368 mmoncure 20  0 175m  12m 9200 R 50.7  0.2  9:33.77 postgres
31282 mmoncure 20  0 184m 152m 140m R 46.3  1.9 10:26.29 postgres
  666 mmoncure 20  0 20640 1684 1116 R 2.2   0.0  0:00.01 top

Some more information on what's happening: This is a ghetto logical replication engine that migrates data from SQL Server to Postgres, consolidating a sharded database into a single set of tables (of which there are only two). There is only one index on the destination table, and it's composite (int, int) in both cases. A single source table is replicated in a single transaction, and it's staged to a set of two temp tables, one for deletes, one for inserts. I log each step in the database. Here are the times:

cds2=# select cdsruntableid, finished - started, inserted from cdsruntable where cdsrunid = 62 order by started;
 cdsruntableid │    ?column?     │ inserted
───────────────┼─────────────────┼──────────
           833 │ 00:00:33.24044  │    11860
           834 │ 00:00:35.44981  │    19099
           835 │ 00:02:01.725186 │   530923
           836 │ 00:01:29.578811 │   211737
           838 │ 00:01:17.393461 │    64258
           837 │ 00:00:56.849106 │   227844
           839 │ 00:02:12.016504 │   630309
           840 │ 00:00:00.00111  │
           841 │ 00:01:09.058806 │   155309
           842 │ 00:01:54.723747 │   704422
           843 │ 00:01:14.965304 │
           844 │ 00:00:45.410953 │    59934
           845 │ 00:00:34.302632 │    14848
           846 │ 00:00:46.913832 │    89893
           848 │ 00:05:22.155189 │  2410622
           847 │ 00:01:25.199141 │   268157
           849 │ 00:01:16.168352 │   117511
           850 │ 00:00:34.809862 │     1175
           851 │ 00:01:12.565397 │    67214
           852 │ 00:01:03.734776 │    20129
           853 │ 00:00:41.038456 │    62765
           854 │ 00:01:03.478917 │    14967
           855 │ 00:00:33.88803  │     6901
           856 │ 00:00:36.381886 │     6670
           857 │ 00:00:36.626623 │     8268
           858 │ 00:01:14.391584 │   363448
           859 │ 00:01:44.619005 │   533395
           860 │ 00:01:02.042255 │   212202
           861 │ 00:00:00.001065 │
           863 │ 00:00:58.265506 │   215234
           862 │ 00:02:45.572635 │   827941
           865 │ 00:01:28.049165 │   241020
           864 │ 00:01:51.465643 │   531012
           866 │ 00:01:20.273391 │   419357
           868 │ 00:01:25.479443 │   294262
           869 │ 00:00:46.400825 │    46136
           870 │ 00:00:42.337286 │    25934
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
Merlin Moncure mmonc...@gmail.com writes: On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote: In case it isn't clear, I think that the proximate cause here may well be either one (or both) of commits efada2b8e920adfdf7418862e939925d2acd1b89 and/or 40dae7ec537c5619fc93ad602c62f37be786d161. Probably the latter. I think that the profile is roughly consistent with that, although I may well be wrong. I'm out of time for the day, but I'm fairly confident I can reproduce. I'll see if I can reverse those commits tomorrow and retest (I'm on development box). I'm not convinced that Peter is barking up the right tree. I'm noticing that the profiles seem rather skewed towards parser/planner work; so I suspect the contention is probably on access to system catalogs. No idea exactly why though. It would be good to collect a few stack traces from the hung backends, rather than relying on perf statistics. regards, tom lane
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On 2015-01-13 19:05:10 -0500, Tom Lane wrote: Merlin Moncure mmonc...@gmail.com writes: On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote: In case it isn't clear, I think that the proximate cause here may well be either one (or both) of commits efada2b8e920adfdf7418862e939925d2acd1b89 and/or 40dae7ec537c5619fc93ad602c62f37be786d161. Probably the latter. I think that the profile is roughly consistent with that, although I may well be wrong. I'm out of time for the day, but I'm fairly confident I can reproduce. I'll see if I can reverse those commits tomorrow and retest (I'm on development box). I'm not convinced that Peter is barking up the right tree. I'm noticing that the profiles seem rather skewed towards parser/planner work; so I suspect the contention is probably on access to system catalogs. No idea exactly why though. The plan contains plpgsql and exec_stmt_dynexecute(). So it might just be executing crazy amounts of dynamic statements. I'm still wondering if this isn't a different issue to the first one, the plans do look different. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services