Re: [HACKERS] Timing events WIP v1
2012/12/12 Greg Smith <g...@2ndquadrant.com>:
> On 11/20/12 8:02 PM, Craig Ringer wrote:
>> On 11/20/2012 04:48 PM, Pavel Stehule wrote:
>>> I don't like to see current hstore in core - it doesn't support
>>> nesting, it doesn't support different datatypes, it is not well
>>> supported by plpgsql ...
>>
>> ... or by many client libraries, though converting hstore values to
>> json notation for output usually takes care of that.
>
> The more I look at this, the less comfortable I am moving forward with
> hand-waving this issue away. The obvious but seemingly all troublesome
> options:
>
> - Store things internally using hstore. This seems to require pulling
>   hstore fully into core, which has this list of issues. Outputting
>   through a JSON filter seems like a lot of overhead. And even if the
>   output concerns are addressed, if there's a problem processing the
>   raw data with PL/pgSQL, that would be bad. I didn't think about that
>   at all until Pavel brought it up.
>
> - Go back to a simpler composite type idea for storing this data. It
>   feels wrong to create something that looks a lot like hstore data
>   but isn't. There are probably subtle display/client/processing
>   issues in there I haven't thought of, too.
>
> - Try to store the data in a JSON-friendly way in the first place.
>
> That last one seems to lead right to...
>
>> I can't help but suspect that hstore will be subsumed by native
>> storage and indexing of json objects, since these are already widely
>> supported in KV or document stores. Of course, that requires that we
>> come to some agreement on how to represent literal scalars for
>> interaction with json, which hasn't seen much progress.
>
> If this is the Right Way to solve this problem, that puts a serious
> snag in doing something useful with Timing Events in the near future.
> I know better than to try and fight against following the correct path
> once it's identified. I can't claim any good expertise on client
> encoding/transfer issues though; that's not a strong area for me.

I didn't watch this discussion the whole time, but I don't understand
why you need a complex datatype. Maybe you can do normalization, and
the complex data types can be held only in memory.

Regards

Pavel

> -- 
> Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
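As a concrete illustration of the two formats under discussion, here is
the same hypothetical timing-event payload in both types (a sketch only;
the payload layout is made up, not the patch's actual format):

    -- hstore: flat key/value pairs, all values are text
    SELECT 'event=>checkpoint, sync_ms=>12.5'::hstore;

    -- json: nesting and typed scalars come for free
    SELECT '{"event": "checkpoint", "timing": {"sync_ms": 12.5}}'::json;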
Re: [HACKERS] [WIP PATCH] for Performance Improvement in Buffer Management
On Wednesday, December 12, 2012 5:23 AM Greg Smith wrote:
> On 11/23/12 5:57 AM, Amit kapila wrote:
>> Let us try to see by example:
>> Total RAM - 22G
>> Database size - 16G
>> ...
>> Case-2 (Shared Buffers - 10G)
>> a. Load all the files in OS buffers. In the best case OS buffers can
>>    contain 10-12G of data, as the OS has 12G of memory available.
>> b. Try to load all in shared buffers. The last 10G will be there in
>>    shared buffers.
>> c. Now as there is no direct correlation of data between shared
>>    buffers and OS buffers, whenever PG has to access any data which
>>    is not there in shared buffers, good chances are that it can lead
>>    to I/O.
>
> I don't think either of these examples is very representative of
> real-world behavior. The idea of "load all the files in OS buffers"
> assumes someone has used a utility like pg_prewarm or pgfincore. It's
> not something that happens in normal use. Being able to re-populate
> all of RAM using those utilities isn't realistic anyway. Anyone who
> tries to load more than (memory - shared_buffers) that way is likely
> to be disappointed by the result.

True, I also think nobody will directly try to do it in this way, but
similar situations can arise after a long run; something like if we
assume the most used pages fall within the range of RAM.

> Similarly, the problem you're describing here has been described as
> the "double buffering" one for a while now. The old suggestion that
> shared_buffers not be set above 25% of RAM comes from this sort of
> concern. If triggering a problem requires doing that, essentially
> misconfiguring the server, it's hard to get too excited about it.
>
> Anyway, none of that impacts on me mixing testing for this into what
> I'm working on. The way most pgbench tests happen, it's hard to *not*
> have the important data in cache. Once you run the init step, you
> have to either reboot or drop the OS cache to get those pages out of
> RAM. That means the sort of cached setup you're using pg_prewarm to
> simulate--things are in the OS cache, but not the PostgreSQL one--is
> one that anyone running an init/test pair will often create. You
> don't need pg_prewarm to do it.

The way I have run the tests is to just try to simulate scenarios where
invalidating buffers by bgwriter/checkpoint can have an advantage.

With Regards,
Amit Kapila.
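For anyone wanting to set up the cached-but-not-in-shared-buffers state
explicitly, a sketch using the pg_prewarm module mentioned above
(assuming it is installed as an extension with its 'read' and 'buffer'
modes; the table name is pgbench's):

    CREATE EXTENSION pg_prewarm;
    -- pull the table into the OS cache only:
    SELECT pg_prewarm('pgbench_accounts', 'read');
    -- or pull it all the way into shared_buffers:
    SELECT pg_prewarm('pgbench_accounts', 'buffer');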
Re: [HACKERS] Logical decoding exported base snapshot
On 2012-12-11 21:05:51 -0500, Joachim Wieland wrote:
> On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> One problem I see is that while exporting a snapshot solves the
>> visibility issues of the table's contents it does not protect against
>> schema changes. I am not sure whether that's a problem. If somebody
>> runs a CLUSTER or something like that, the table's contents will be
>> preserved, including MVCC semantics. That's fine. The more
>> problematic cases I see are TRUNCATE, DROP and ALTER TABLE.
>
> This is why the pg_dump master process executes a "lock table <table>
> in access share mode" for every table, so your commands would all
> block. In fact it's even more complicated, because the pg_dump worker
> processes also need to lock the table. They try to get a similar lock
> in NOWAIT mode right before dumping the table. If they don't get the
> lock, that means that somebody else is waiting for an exclusive lock
> (this is the case you describe) and the backup will fail.

[Tom explains why this is problematic way better than I do, see his
email.] A trivial example - you can play nastier games than that
though:

S1: CREATE TABLE test(id int);
S1: INSERT INTO test VALUES(1), (2);
S1: BEGIN;
S1: ALTER TABLE test RENAME COLUMN id TO id2;
S2: pg_dump
S2: waits
S1: COMMIT;

pg_dump: [archiver (db)] query failed: ERROR: column "id" of relation "test" does not exist
pg_dump: [archiver (db)] query was: COPY public.test (id) TO stdout;

>> Problem 2: To be able to do a SET TRANSACTION SNAPSHOT the source
>> transaction needs to be alive. That's currently solved by exporting
>> the snapshot in the walsender connection that did the
>> INIT_LOGICAL_REPLICATION. The question is how long should we preserve
>> that snapshot?
>
> You lost me on this one after the first sentence... But in general the
> snapshot isn't so much a magical thing: As long as the exporting
> transaction is open, it guarantees that there is a transaction out
> there that is holding off vacuum from removing data, and it's also
> guaranteeing that the snapshot as-is has existed at some time in the
> past. Once it is applied to another transaction you now have two
> transactions that will hold off vacuum because they share the same
> xmin,xmax values. You could also just end the exporting transaction at
> that point.

I should probably have given a bit more context here... All this is in
the context of decoding WAL back into something useful for the purpose
of replication. This is done in the already existing walsender process,
using the commands I explained in the original post.

What we do there is walk the WAL from some point until we could
assemble a snapshot for exactly that LSN. Several preconditions need to
be met there (like a low enough xmin horizon, so the old catalog
entries are still around). Using that snapshot we can now decode the
entries stored in the WAL.

If you set up a new replica though, getting all changes from one point
where you're consistent isn't all that interesting if you cannot also
get a backup from exactly that point in time, because otherwise it's
very hard to start with something sensible on the standby without
locking everything down. So the idea is that once we have found that
consistent snapshot we also export it.

Only we are in a walsender connection which doesn't expose
transactional semantics, so the snapshot cannot be deallocated by
COMMIT;ing. So the question is basically about how to design the user
interface of that deallocation.

Makes slightly more sense now?
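To make the importing side concrete, the connection taking the base
backup would do roughly this (a sketch; '0333-1' stands in for whatever
snapshot identifier INIT_LOGICAL_REPLICATION handed back):

    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '0333-1';  -- only works while the exporting transaction is alive
    -- run the consistent base backup / COPY queries here
    COMMIT;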
Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] bulk_multi_insert infinite loops with large rows and small fill factors
COPY IN loops in heap_multi_insert(), extending the table until it
fills the disk, when trying to insert a wide row into a table with a
low fillfactor. Internally fillfactor is implemented by reserving some
space on a page. For large enough rows and a small enough fillfactor,
heap_multi_insert() can't fit the row even on a new empty page, so it
keeps allocating new pages but is never able to place the row. It
should always put at least one row on an empty page.

In the excerpt below, saveFreeSpace is the space reserved for the
fillfactor.

    while (ndone < ntuples)
    {
        ...
        /*
         * Find buffer where at least the next tuple will fit. If the page is
         * all-visible, this will also pin the requisite visibility map page.
         */
        buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
        ...
        /* Put as many tuples as fit on this page */
        for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
        {
            HeapTuple heaptup = heaptuples[ndone + nthispage];

            if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
                break;

            RelationPutHeapTuple(relation, buffer, heaptup);
        }
        ... Do a bunch of dirtying and logging etc ...
    }

This was introduced in 9.2 as part of the bulk insert speedup.

One more point: in the case where we don't insert any rows, we still do
all the dirtying and logging work even though we did not modify the
page. I have tried to skip all this if no rows are added (nthispage ==
0), but my access method foo is sadly out of date, so someone should
take a skeptical look at that.

A test case and patch against 9.2.2 are attached. It fixes the problem
and passes make check. Most of the diff is just indentation changes.
Whoever tries this will want to test it on a small partition by itself.

-dg

-- 
David Gould              510 282 0869         da...@sonic.net
If simplicity worked, the world would be overrun with insects.
\set ECHO all
\timing on

drop table if exists big_sparse;

CREATE TABLE big_sparse (
    u_id integer NOT NULL,
    c_id integer DEFAULT 0 NOT NULL,
    a_id integer DEFAULT 0 NOT NULL,
    cr_id integer DEFAULT 0 NOT NULL,
    b integer,
    c_t_u text,
    l_date abstime DEFAULT now(),
    c_b double precision DEFAULT 0.0,
    d_b double precision,
    c_s text,
    c_s_u abstime,
    c_e_u abstime,
    co_b_r_ts abstime,
    e_a_s integer,
    c_c_i_l text,
    c_c_e_l text,
    ag_c_i_l text,
    ag_c_e_l text,
    c_c_t_l text,
    c_r_t_l text,
    c_m_t_l text,
    c_ci_t_l text,
    ag_c_t_l text,
    ag_r_t_l text,
    ag_m_t_l text,
    ag_ci_t_l text,
    c_l_t_l text,
    ag_l_t_l text,
    r_s_id smallint DEFAULT 1 NOT NULL,
    r_s text NOT NULL,
    i_u_l text,
    e_u_l text,
    f_l text,
    f smallint,
    e_s_l_e text,
    r_h_f_n text,
    cr_h smallint,
    cr_w smallint,
    n_b integer,
    cr_a_l text,
    e_id bigint NOT NULL,
    a_ex text,
    ca_b_a real[],
    su smallint DEFAULT 0 NOT NULL,
    h_b_f real[],
    p_id integer,
    ca_b_aj real[],
    ca_b_aj_i integer[]
) WITH (fillfactor=10);

COPY big_sparse FROM stdin;
1688 700032834 700580073 704810483 25 foobar.grotxbarb/xdexyzzyen.xyzzyv=xxixsrc=201002RAA9 2011-11-12 05:00:01+00 0 500 i 2010-10-31 04:00:00+00 \N 2011-11-12 05:00:01+00 113 \N \N \N \N US \N \N \N \N \N \N \N en \N 5 Rxyzzyn \N foobar./|foobar.xyzzyatxyzzy.xyzzyk/|foobar.grotxyzzyroupbarb/|grotxyzzxyzzybarb|foobar.grotxyxyzzybarb/|foobar.xyzzyxyzzyily.xyzzy/|grotxyzzyancebarb|grotxyzzydog|foobar.grotxyzzyxyzzysbarbdog/|foobar.grotxyzzyaobarb/|grotxyzzxyzzywsbarb|foobar.xyzzyun.xyzzy/|foobar.grotogejbarb/|foobar.grotxyzzxyzzybarb/|foobar.grotogcdbarbdog/|grotxyzzyetbarb|foobar.grotxyzzyipobarb/|foobar.grotxyzzyaobarb/|grotxyzzyxyzzyiabarb|grotxyzzybarbdog xyzzyign 2 1:1291|0|0 ABC DEF 36|0|0 OR 1290|0|0 ABC DEF 36|0|0 OR 13|0|0 ABC DEF 36|0|0 OR 84|2592000|0 ABC DEF 36|0|0 OR 83|2592000|0 ABC DEF 36|0|0 OR 82|2592000|0 ABC DEF 36|0|0 OR 12|0|0 ABC DEF 36|0|0 \N 0 0 25 \N 0 \N \N 0 {1,1,1,1,1,0.20003,0.40006,0.60024,0.80012,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} \N \N \N
\.

*** postgresql-9.2.2/src/backend/access/heap/heapam.c	2012-12-03 12:16:10.0 -0800
--- postgresql-9.2.2dg/src/backend/access/heap/heapam.c	2012-12-12 01:55:58.174653706 -0800
***************
*** 2158,2163 ****
--- 2158,2164 ----
  	Buffer		buffer;
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
+ 	bool		page_is_empty;
  	int			nthispage;

  	/*
***************
*** 2173,2299 ****
  	START_CRIT_SECTION();

  	/* Put as many tuples as fit on this page */
  	for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
  	{
  		HeapTuple	heaptup = heaptuples[ndone + nthispage];

! 		if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
  			break;

! 		RelationPutHeapTuple(relation, buffer, heaptup);
  	}
Re: [HACKERS] Logical decoding exported base snapshot
On 2012-12-11 22:20:18 -0500, Tom Lane wrote:
> Joachim Wieland <j...@mcknight.de> writes:
>> On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>>> One problem I see is that while exporting a snapshot solves the
>>> visibility issues of the table's contents it does not protect
>>> against schema changes. I am not sure whether that's a problem. If
>>> somebody runs a CLUSTER or something like that, the table's contents
>>> will be preserved, including MVCC semantics. That's fine. The more
>>> problematic cases I see are TRUNCATE, DROP and ALTER TABLE.
>
>> This is why the pg_dump master process executes a "lock table <table>
>> in access share mode" for every table, so your commands would all
>> block.
>
> A lock doesn't protect against schema changes made before the lock was
> taken. The reason that the described scenario is problematic is that
> pg_dump is going to be expected to work against a snapshot made before
> it gets a chance to take those table locks. Thus, there's a window
> where DDL is dangerous, and will invalidate the dump --- perhaps
> without any warning.

Exactly.

> Now, we have this problem today, in that pg_dump has to read pg_class
> before it can take table locks, so some window exists already.

Yea. And if somebody else already has conflicting locks it's pretty
damn easy to hit, as shown in my answer to Joachim... I am pretty sure
there are lots of nasty silent variants.

> What's bothering me about what Andres describes is that the window for
> trouble seems to be getting much bigger.

Me too. If it only were clearly visible errors I wouldn't be bothered
too much, but ...

This morning I wondered whether we couldn't protect against that by
acquiring share locks on the catalog rows pg_dump reads. That would
result in "could not serialize access due to concurrent update" type of
errors, which would be easily discernible/translatable. While pretty
damn ugly, that should take care of most of those issues, shouldn't it?

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Logical decoding exported base snapshot
On 2012-12-11 22:39:14 -0500, Steve Singer wrote:
> On 12-12-11 06:52 PM, Andres Freund wrote:
>> Hi,
>>
>> Problem 1:
>>
>> One problem I see is that while exporting a snapshot solves the
>> visibility issues of the table's contents it does not protect against
>> schema changes. I am not sure whether that's a problem.
>>
>> If somebody runs a CLUSTER or something like that, the table's
>> contents will be preserved, including MVCC semantics. That's fine.
>> The more problematic cases I see are TRUNCATE, DROP and ALTER TABLE.
>>
>> Imagine the following:
>>
>> S1: INIT_LOGICAL_REPLICATION
>> S1: get snapshot: 0333-1
>> S2: ALTER TABLE foo ALTER COLUMN blub text USING (blub::text);
>> S3: pg_dump --snapshot 0333-1
>> S1: START_LOGICAL_REPLICATION
>>
>> In that case the pg_dump would dump foo using the schema *after* the
>> ALTER TABLE but showing only rows visible to our snapshot. After
>> START_LOGICAL_REPLICATION all changes after the xlog position from
>> INIT_LOGICAL_REPLICATION will be returned though - including all the
>> tuples from the ALTER TABLE and potentially - if some form of schema
>> capturing was in place - the ALTER TABLE itself. The copied schema
>> would have the new format already though.
>>
>> Does anybody see that as a problem, or is it just a case of PEBKAC?
>> One argument for the latter is that that's already a problematic case
>> for normal pg_dumps. It's just that the window is a bit larger here.
>
> Is there any way to detect this situation as part of the pg_dump? If I
> could detect this, abort my pg_dump, and then go and get a new
> snapshot, then I don't see this as a big deal. I can live with telling
> users "don't do DDL-like things while subscribing a new node; if you
> do, the subscription will restart". I am less keen on telling users
> "don't do DDL-like things while subscribing a new node or the results
> will be unpredictable". I am trying to think of an unintrusive way to
> detect this.
>
> I'm trying to convince myself whether I will be able to take a pg_dump
> from an exported snapshot, plus the changes made in between the
> snapshot and some later time, and turn the results into a consistent
> database. What if something like this comes along ... I'm worried that
> it will be difficult to programmatically stitch together the
> inconsistent snapshot from the pg_dump plus the logical records
> generated in between the snapshot and the dump (along with any record
> of the DDL, if it exists).

I think trying to solve that in the replication solution is a bad idea.
There are too many possible scenarios, and some of them are very subtle
and hard to detect. So I think it's either:

1) find something that tests for this, and abort if so
2) acquire locks earlier, preventing DDL altogether till pg_dump starts
3) don't care. The problem exists today and not many people have
   complained.

>> Problem 2: Control Flow.
>>
>> To be able to do a SET TRANSACTION SNAPSHOT the source transaction
>> needs to be alive. That's currently solved by exporting the snapshot
>> in the walsender connection that did the INIT_LOGICAL_REPLICATION.
>> The question is how long should we preserve that snapshot?
>>
>> There is no requirement - and there *cannot* be one in the general
>> case, the slot needs to be usable after a restart - that
>> START_LOGICAL_REPLICATION has to be run in the same replication
>> connection as INIT.
>>
>> Possible solutions:
>>
>> 1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
>>    confirms that logical replication initialization is finished.
>>    Before that the walsender connection cannot be used for anything
>>    else.
>>
>> 2) we remove the snapshot as soon as any other command is received;
>>    this way the replication connection stays usable, e.g.
>>    to issue a START_LOGICAL_REPLICATION in parallel to the initial
>>    data dump. In that case the snapshot would have to be imported
>>    *before* the next command was received, as SET TRANSACTION
>>    SNAPSHOT requires the source transaction to be still open.
>
> Option 2 sounds more flexible. Is it more difficult to implement?

No, I don't think so. It's a bit more intrusive in that it requires
knowledge about logical replication in more parts of walsender, but it
should be OK.

Note btw, that my description of 1) was easy to misunderstand. The
"that" in "Before that the walsender connection cannot be used for
anything else." is the answer from the client, not the usage of the
exported snapshot. Once the snapshot has been imported into other
session(s) the source doesn't need to be alive anymore.

Does that explanation change anything?

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Re: bulk_multi_insert infinite loops with large rows and small fill factors
On 2012-12-12 03:04:19 -0800, David Gould wrote:
> COPY IN loops in heap_multi_insert(), extending the table until it
> fills the disk, when trying to insert a wide row into a table with a
> low fillfactor. Internally fillfactor is implemented by reserving some
> space on a page. For large enough rows and a small enough fillfactor,
> heap_multi_insert() can't fit the row even on a new empty page, so it
> keeps allocating new pages but is never able to place the row. It
> should always put at least one row on an empty page.

Heh. Nice one. Did you hit that in practice?

> One more point: in the case where we don't insert any rows, we still
> do all the dirtying and logging work even though we did not modify the
> page. I have tried to skip all this if no rows are added (nthispage ==
> 0), but my access method foo is sadly out of date, so someone should
> take a skeptical look at that.
>
> A test case and patch against 9.2.2 are attached. It fixes the problem
> and passes make check. Most of the diff is just indentation changes.
> Whoever tries this will want to test it on a small partition by
> itself.

ISTM this would be fixed with a smaller footprint by just making

    if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)

into

    if (!PageIsEmpty(page) &&
        PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)

I think that should work?

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Re: bulk_multi_insert infinite loops with large rows and small fill factors
On 12.12.2012 13:27, Andres Freund wrote:
> On 2012-12-12 03:04:19 -0800, David Gould wrote:
>> One more point: in the case where we don't insert any rows, we still
>> do all the dirtying and logging work even though we did not modify
>> the page. I have tried to skip all this if no rows are added
>> (nthispage == 0), but my access method foo is sadly out of date, so
>> someone should take a skeptical look at that.
>>
>> A test case and patch against 9.2.2 are attached. It fixes the
>> problem and passes make check. Most of the diff is just indentation
>> changes. Whoever tries this will want to test it on a small partition
>> by itself.
>
> ISTM this would be fixed with a smaller footprint by just making
>
>     if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
>
> into
>
>     if (!PageIsEmpty(page) &&
>         PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
>
> I think that should work?

Yeah, seems that it should, although PageIsEmpty() is no guarantee that
the tuple fits, because even though PageIsEmpty() returns true, there
might be dead line pointers consuming so much space that the tuple at
hand doesn't fit. However, RelationGetBufferForTuple() won't return
such a page; it guarantees that the first tuple does indeed fit on the
page it returns. For the same reason, the later check that at least one
tuple was actually placed on the page is not necessary.

I committed a slightly different version, which unconditionally puts
the first tuple on the page, and only applies the free-space check to
the subsequent tuples. Since RelationGetBufferForTuple() guarantees
that the first tuple fits, we can trust that, like heap_insert does.

--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2172,8 +2172,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
 
-		/* Put as many tuples as fit on this page */
-		for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
+		/*
+		 * RelationGetBufferForTuple has ensured that the first tuple fits.
+		 * Put that on the page, and then as many other tuples as fit.
+		 */
+		RelationPutHeapTuple(relation, buffer, heaptuples[ndone]);
+		for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
 		{
 			HeapTuple	heaptup = heaptuples[ndone + nthispage];

Thanks for the report!

- Heikki
Re: [HACKERS] Re: bulk_multi_insert infinite loops with large rows and small fill factors
On Wed, 12 Dec 2012 12:27:11 +0100 Andres Freund <and...@2ndquadrant.com> wrote:
> On 2012-12-12 03:04:19 -0800, David Gould wrote:
>> COPY IN loops in heap_multi_insert(), extending the table until it
>> fills the disk ...
>
> Heh. Nice one. Did you hit that in practice?

Yeah, with a bunch of hosts that run postgres on a ramdisk, where that
copy happens late in the initial setup script for new hosts. The first
batch of new hosts to be set up with 9.2 filled the ramdisk, oomed and
fell over within a minute. Since the script sets up a lot of stuff we
had no idea at first who oomed.

> ISTM this would be fixed with a smaller footprint by just making
>
>     if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
>
> into
>
>     if (!PageIsEmpty(page) &&
>         PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
>
> I think that should work?

I like PageIsEmpty() better (and would have used it if I had known
about it), but I'm not so crazy about the negation.

-dg

-- 
David Gould              510 282 0869         da...@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: [HACKERS] Re: bulk_multi_insert infinite loops with large rows and small fill factors
On 12.12.2012 14:17, David Gould wrote:
> On Wed, 12 Dec 2012 12:27:11 +0100 Andres Freund <and...@2ndquadrant.com> wrote:
>> On 2012-12-12 03:04:19 -0800, David Gould wrote:
>>> COPY IN loops in heap_multi_insert(), extending the table until it
>>> fills the disk ...
>>
>> Heh. Nice one. Did you hit that in practice?
>
> Yeah, with a bunch of hosts that run postgres on a ramdisk, where that
> copy happens late in the initial setup script for new hosts. The first
> batch of new hosts to be set up with 9.2 filled the ramdisk, oomed and
> fell over within a minute. Since the script sets up a lot of stuff we
> had no idea at first who oomed.

The bug's been fixed now, but note that huge tuples like this will
always cause the table to be extended, even if there are completely
empty pages in the table after a vacuum. Even a completely empty
existing page is not considered spacious enough in this case, because
it's still too small once you take fillfactor into account, so the
insertion will always extend the table. If you regularly run into this
situation, you might want to raise your fillfactor (see the sketch
below).

- Heikki
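For example, something like this (a sketch; big_sparse is the table
from the test case, the new value is arbitrary, and the setting only
affects pages used after the change, so existing pages need a rewrite):

    ALTER TABLE big_sparse SET (fillfactor = 50);
    VACUUM FULL big_sparse;  -- rewrite existing pages with the new fillfactor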
Re: [HACKERS] Re: bulk_multi_insert infinite loops with large rows and small fill factors
On Wed, 12 Dec 2012 13:56:08 +0200 Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> However, RelationGetBufferForTuple() won't return such a page; it
> guarantees that the first tuple does indeed fit on the page it
> returns. For the same reason, the later check that at least one tuple
> was actually placed on the page is not necessary.
>
> I committed a slightly different version, which unconditionally puts
> the first tuple on the page, and only applies the free-space check to
> the subsequent tuples. Since RelationGetBufferForTuple() guarantees
> that the first tuple fits, we can trust that, like heap_insert does.
>
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -2172,8 +2172,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
>  		/* NO EREPORT(ERROR) from here till changes are logged */
>  		START_CRIT_SECTION();
>  
> -		/* Put as many tuples as fit on this page */
> -		for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
> +		/*
> +		 * RelationGetBufferForTuple has ensured that the first tuple fits.
> +		 * Put that on the page, and then as many other tuples as fit.
> +		 */
> +		RelationPutHeapTuple(relation, buffer, heaptuples[ndone]);
> +		for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
>  		{
>  			HeapTuple	heaptup = heaptuples[ndone + nthispage];

I don't know if this is the same thing. At least in the comments I was
reading trying to figure this out, there was some concern that someone
else could change the space on the page. Does
RelationGetBufferForTuple() guarantee against this too?

-dg

-- 
David Gould              510 282 0869         da...@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: [HACKERS] Re: bulk_multi_insert infinite loops with large rows and small fill factors
On 12.12.2012 14:24, David Gould wrote:
> I don't know if this is the same thing. At least in the comments I was
> reading trying to figure this out, there was some concern that someone
> else could change the space on the page. Does
> RelationGetBufferForTuple() guarantee against this too?

Yeah, RelationGetBufferForTuple grabs a lock on the page before
returning it. For comparison, plain heap_insert does simply this:

    buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
                                       InvalidBuffer, options, bistate,
                                       &vmbuffer, NULL);

    /* NO EREPORT(ERROR) from here till changes are logged */
    START_CRIT_SECTION();

    RelationPutHeapTuple(relation, buffer, heaptup);

- Heikki
Re: [HACKERS] Re: bulk_multi_insert infinite loops with large rows and small fill factors
On Wed, 12 Dec 2012 14:23:12 +0200 Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> The bug's been fixed now, but note that huge tuples like this will
> always cause the table to be extended, even if there are completely
> empty pages in the table after a vacuum. Even a completely empty
> existing page is not considered spacious enough in this case, because
> it's still too small once you take fillfactor into account, so the
> insertion will always extend the table. If you regularly run into this
> situation, you might want to raise your fillfactor..

Actually, we'd like it lower. Ideally, one row per page. We lose
noticeable performance when we raise fillfactor above 10. Even 20 is
slower.

During busy times these hosts sometimes fall into a stable state with
very high cpu use, mostly in s_lock() and LWLockAcquire() and I think
PinBuffer, plus very high system cpu in the scheduler (I don't have the
perf trace in front of me, so take this with a grain of salt). In this
mode they fall from the normal 7000 queries per second to below 3000.
Once in this state they tend to stay that way. If we turn down the
number of incoming requests they go back to normal.

Our conjecture is that most requests are for only a few keys, and so we
have multiple sessions contending for few pages and convoying in the
buffer manager. The table is under 20k rows, but the hot items are
probably only a couple hundred different rows. The busy processes are
doing reads only, but there is some update activity on this table too.

Ah, found an email with the significant part of the perf output:

... set number of client threads = number of postgres backends = 70.
That way all my threads have constant access to a backend and they just
spin in a tight loop running the same query over and over (with
different values). ... this seems to have tapped into 9.2's resonant
frequency, right now we're spending almost all our time spin locking.
...

    762377.00 71.0% s_lock        /usr/local/bin/postgres
     22279.00  2.1% LWLockAcquire /usr/local/bin/postgres
     18916.00  1.8% LWLockRelease /usr/local/bin/postgres

I was trying to resurrect the pthread s_lock() patch to see if that
helps, but it did not apply at all and I have not had time to pursue
it. We have tried lots of different numbers of processes and get the
best result with about ten fewer active postgresql backends than HT
cores. The system is 128GB with:

    processor  : 79
    vendor_id  : GenuineIntel
    cpu family : 6
    model      : 47
    model name : Intel(R) Xeon(R) CPU E7-L8867 @ 2.13GHz
    stepping   : 2
    cpu MHz    : 2128.478
    cache size : 30720 KB

-dg

-- 
David Gould              510 282 0869         da...@sonic.net
If simplicity worked, the world would be overrun with insects.
Re: [HACKERS] Use of systable_beginscan_ordered in event trigger patch
Tom Lane <t...@sss.pgh.pa.us> writes:
> Yeah, I had forgotten about that loose end, but we definitely need
> something of the sort. Committed with additional documentation. (I put
> something in the CREATE EVENT TRIGGER ref page, but do we need
> anything about it in chapter 37?)

Thanks! I guess we could add a note at the end of the "Overview of
Event Trigger Behavior" section. Then maybe we should explain in that
section that you can't set an event trigger to fire on event trigger
commands.

> BTW, is the example in the CREATE EVENT TRIGGER ref page meant as an
> IQ test for unwary readers? You do know there are people who will
> copy-and-paste just about any example that's put in front of them.
> Perhaps we just want to make sure they internalize the advice about
> how to get out of a broken-event-trigger situation.

For those following at home, the example is:

    CREATE OR REPLACE FUNCTION abort_any_command()
      RETURNS event_trigger
     LANGUAGE plpgsql
      AS $$
    BEGIN
      RAISE EXCEPTION 'command % is disabled', tg_tag;
    END;
    $$;

    CREATE EVENT TRIGGER abort_ddl ON ddl_command_start
       EXECUTE PROCEDURE abort_any_command();

Maybe we could just append to it how to remove such an event trigger,
which is easy to do because you can't target an event-trigger-related
command with event triggers, so the following is not disabled:

    DROP EVENT TRIGGER abort_ddl;

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] [v9.3] writable foreign tables
On Wed, December 12, 2012 14:45, Kohei KaiGai wrote:
> pgsql-v9.3-writable-fdw-poc.v8.part-2.patch  151 k
> pgsql-v9.3-writable-fdw-poc.v8.part-1.patch   70 k

I wanted to have a look at this, and tried to apply part 1, and then
part 2 on top of that (that's the idea, right?).

Part 1 applies, and then part 2 does not.

Erik Rijkers
Re: [HACKERS] [v9.3] writable foreign tables
Erik Rijkers wrote:
>> pgsql-v9.3-writable-fdw-poc.v8.part-2.patch  151 k
>> pgsql-v9.3-writable-fdw-poc.v8.part-1.patch   70 k
>
> I wanted to have a look at this, and tried to apply part 1, and then
> part 2 on top of that (that's the idea, right?).
>
> Part 1 applies, and then part 2 does not.

Part 2 needs this prerequisite:
http://archives.postgresql.org/message-id/CAEZqfEcQtxn1JSjhC5usqhL4_n+Zck3mqo=rzedfpz+dawf...@mail.gmail.com

Yours,
Laurenz Albe
Re: [HACKERS] Use of systable_beginscan_ordered in event trigger patch
Dimitri Fontaine <dimi...@2ndquadrant.fr> writes:
> Tom Lane <t...@sss.pgh.pa.us> writes:
>> BTW, is the example in the CREATE EVENT TRIGGER ref page meant as an
>> IQ test for unwary readers?
>
> Maybe we could just append to it how to remove such an event trigger,
> which is easy to do because you can't target an event-trigger-related
> command with event triggers, so the following is not disabled:
>
>     DROP EVENT TRIGGER abort_ddl;

Oh really? The explanation of the example certainly doesn't say that.

			regards, tom lane
Re: [HACKERS] Multiple --table options for other commands
On 12/11/2012 10:25:43 PM, Josh Kupershmidt wrote:
> On Tue, Dec 11, 2012 at 11:46 AM, Karl O. Pinc <k...@meme.com> wrote:
>> Yes, the current pg_restore silently ignores multiple --table
>> arguments, and seems to use the last one. You are introducing a
>> backwards-incompatible change here.
>
> Agreed with Robert that this change should be reasonable in a major
> version (i.e. 9.3).

Good by me. Seemed worth a mention.

>> I believe you need ellipses behind --table in the syntax summaries of
>> the command reference docs.
>
> Hrm, I was following pg_dump's lead here for the .sgml docs, and
> didn't see anywhere that pg_dump makes the multiple --table syntax
> explicit other than in the explanatory text underneath the option.

Yes. I see. I didn't look at all the commands' reference pages, but did
happen to look at clusterdb, which does have --table in the syntax
summary. I just checked, and you need to fix clusterdb, reindexdb, and
vacuumdb.

>> I also note that the pg_dump --help output says "table(s)", so you
>> probably want to have pg_restore say the same now that it takes
>> multiple tables.
>
> Good catch, will fix, and ditto reindexdb's --index help output. (It
> is possible that the --help output for pg_dump was worded to say
> "table(s)" because one can use a pattern --table specification with
> pg_dump, though IMO it's helpful to mention "table(s)" in the --help
> output for the rest of these programs as well, as a little reminder to
> the user.)

Agreed.

Karl <k...@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
Re: [HACKERS] [BUG?] lag of minRecoveryPont in archive recovery
On Wed, Dec 12, 2012 at 2:03 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 11.12.2012 17:04, Fujii Masao wrote:
>> The patch looks good. I confirmed that the PANIC error didn't happen,
>> with the patch.
>
> Ok, committed, and also moved the CheckRecoveryConsistency call.
> Please take a look to double-check that I didn't miss anything.

Looks good to me. Thanks a lot!

Regards,

-- 
Fujii Masao
Re: [HACKERS] Re: bulk_multi_insert infinite loops with large rows and small fill factors
Heikki Linnakangas <hlinnakan...@vmware.com> writes:
> The bug's been fixed now, but note that huge tuples like this will
> always cause the table to be extended, even if there are completely
> empty pages in the table after a vacuum. Even a completely empty
> existing page is not considered spacious enough in this case, because
> it's still too small when you take fillfactor into account, so the
> insertion will always extend the table.

Seems like that's a bug in itself: there's no reason to reject an empty
existing page.

			regards, tom lane
Re: [HACKERS] PageIsAllVisible()'s trustworthiness in Hot Standby
On Wed, Dec 12, 2012 at 1:35 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Dec 4, 2012 at 12:10 PM, Pavan Deolasee <pavan.deola...@gmail.com> wrote:
>> Hmm. Yeah, I do not have guts to prove that either. I'll probably
>> write up a comment for your consideration to explain why we don't
>> trust PD_ALL_VISIBLE in Hot Standby for seq scans, but still trust
>> the VM for index-only scans.
>
> Sure.

Here is a small patch that adds comments to the heap scan code
explaining why we don't trust the all-visible flag in the page, yet
still continue to support index-only scans on hot standby.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Attachment: hot-standby-pd-all-visible-heapscan.patch (binary data)
Re: [HACKERS] [PATCH] PL/Python: Add spidata to all spiexceptions
Though not the original patch author, I did modify and submit it to the
CF app, so I'll reply here :)

On 10/12/12 19:20, Karl O. Pinc wrote:
> On 12/09/2012 10:33:59 PM, Karl O. Pinc wrote:
>> I've gone ahead and signed up to review this patch.

Thanks!

> There were 2 outstanding issues raised:
>
> Is this useful enough/proceeding in the right direction to commit now?

I believe the process would be to mark it as Ready for Committer if you
feel like it's ready, and then a committer would make the final call.

> Should some of the logic be moved out of a subroutine and into the
> calling code?

I think I structured PLy_get_spi_sqlerrcode to accept the same kind of
arguments as PLy_get_spi_error_data, which means an SPIException object
and a pointer to whatever values it can fill in. That said, I haven't
really thought about that too much, so I'm perfectly fine with moving
that bit of logic to the caller.

Cheers,
Jan
[HACKERS] PRIVATE columns
Currently, ANALYZE collects data on all columns and stores these
samples in pg_statistic, where they can be seen via the view pg_stats.

In some cases we have data that is private, and we do not wish others
to see it, such as patient names. This becomes more important when we
have row security.

Perhaps that data can be protected, but it would be even better if we
simply didn't store value-revealing statistics data at all. Such
private data is seldom the target of searches, and if it is, it is
mostly evenly distributed anyway.

It would be good if we could collect the overall stats

* NULL fraction
* average width
* ndistinct

yet without storing either the MFVs or the histogram. Doing that would
avoid inadvertent leaking of potentially private information. (SET
STATISTICS 0 simply skips collection of statistics altogether.)

To implement this, one way would be to allow

    ALTER TABLE foo ALTER COLUMN foo1 SET STATISTICS PRIVATE;

Or we could use another magic value like -2 to request this case.

(Yes, I am aware we could use a custom datatype with a custom
typanalyze for this, but that breaks other things.)

Thoughts?

-- 
Simon Riggs                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
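To see which per-column entries hold live values, one can query
pg_stats directly; something like this (a sketch; 'patients'/'name' are
hypothetical) shows the aggregate stats alongside the value-revealing
ones the proposal would suppress:

    SELECT null_frac, avg_width, n_distinct,    -- aggregate, would be kept
           most_common_vals, histogram_bounds   -- value-revealing, would be dropped
      FROM pg_stats
     WHERE tablename = 'patients' AND attname = 'name';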
Re: [HACKERS] PRIVATE columns
On 12/12/2012 1:12 PM, Simon Riggs wrote:
> Currently, ANALYZE collects data on all columns and stores these
> samples in pg_statistic, where they can be seen via the view pg_stats.
>
> In some cases we have data that is private, and we do not wish others
> to see it, such as patient names. This becomes more important when we
> have row security.
>
> Perhaps that data can be protected, but it would be even better if we
> simply didn't store value-revealing statistics data at all. Such
> private data is seldom the target of searches, and if it is, it is
> mostly evenly distributed anyway.

Would protecting it the same way we protect the passwords in pg_authid
be sufficient?

Jan

> It would be good if we could collect the overall stats
>
> * NULL fraction
> * average width
> * ndistinct
>
> yet without storing either the MFVs or the histogram. Doing that would
> avoid inadvertent leaking of potentially private information. (SET
> STATISTICS 0 simply skips collection of statistics altogether.)
>
> To implement this, one way would be to allow
>
>     ALTER TABLE foo ALTER COLUMN foo1 SET STATISTICS PRIVATE;
>
> Or we could use another magic value like -2 to request this case.
>
> (Yes, I am aware we could use a custom datatype with a custom
> typanalyze for this, but that breaks other things.)
>
> Thoughts?

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Re: [HACKERS] [PATCH] PL/Python: Add spidata to all spiexceptions
Hi Jan and Oskari,

On 12/12/2012 11:36:54 AM, Jan Urbański wrote:
> On 10/12/12 19:20, Karl O. Pinc wrote:
>> On 12/09/2012 10:33:59 PM, Karl O. Pinc wrote:
>> There were 2 outstanding issues raised:
>>
>> Is this useful enough/proceeding in the right direction to commit
>> now?
>
> I believe the process would be to mark it as Ready for Committer if
> you feel like it's ready, and then a committer would make the final
> call.

It looks acceptable to me. My concern is that there's nothing
introduced here that precludes attaching additional attributes to the
exception object to carry message, detail, and hint. I don't see a
problem, and can't see how there could be a problem, but also feel like
a novice. I was looking for some assurance from you that there's no
concern here, but am confident enough regardless to pass this aspect
through to a committer for final review.

>> Should some of the logic be moved out of a subroutine and into the
>> calling code?
>
> I think I structured PLy_get_spi_sqlerrcode to accept the same kind of
> arguments as PLy_get_spi_error_data, which means an SPIException
> object and a pointer to whatever values it can fill in. That said, I
> haven't really thought about that too much, so I'm perfectly fine with
> moving that bit of logic to the caller.

I can see arguments to be made for both sides. I'm asking that you
think it through to the extent you deem necessary and make changes or
not. At that point it should be ready to send to a committer. (I'll
re-test first, if you make any changes.)

Regards,

Karl <k...@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
Re: [HACKERS] PRIVATE columns
On 12 December 2012 19:13, Jan Wieck <janwi...@yahoo.com> wrote:
> On 12/12/2012 1:12 PM, Simon Riggs wrote:
>> Currently, ANALYZE collects data on all columns and stores these
>> samples in pg_statistic, where they can be seen via the view
>> pg_stats.
>>
>> In some cases we have data that is private, and we do not wish others
>> to see it, such as patient names. This becomes more important when we
>> have row security.
>>
>> Perhaps that data can be protected, but it would be even better if we
>> simply didn't store value-revealing statistics data at all. Such
>> private data is seldom the target of searches, and if it is, it is
>> mostly evenly distributed anyway.
>
> Would protecting it the same way we protect the passwords in pg_authid
> be sufficient?

The user backend does need to be able to access the stats data during
optimization. It's hard to have data accessible and yet impose limits
on the uses to which it can be put.

If we have row security on the table but no equivalent capability on
the stats, then we'll have leakage: e.g. set statistics 10000, ANALYZE,
then leak 10000 credit card numbers (a sketch of that attack follows
below).

Selectivity functions are not marked leakproof, nor do people think
they can easily be made so. Which means the data might be leaked by
various means through error messages, plan selection, skullduggery
etc.

If it ain't in the bucket, the bucket can't leak it.

-- 
Simon Riggs                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
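For illustration, the leak works roughly like this (a sketch; the table
and column are hypothetical, and 10000 is the maximum per-column
statistics target):

    ALTER TABLE cards ALTER COLUMN number SET STATISTICS 10000;
    ANALYZE cards;
    SELECT most_common_vals, histogram_bounds
      FROM pg_stats
     WHERE tablename = 'cards' AND attname = 'number';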
Re: [HACKERS] PRIVATE columns
On 12/12/2012 12:12 PM, Simon Riggs wrote:
>> Would protecting it the same way we protect the passwords in
>> pg_authid be sufficient?
>
> The user backend does need to be able to access the stats data during
> optimization. It's hard to have data accessible and yet impose limits
> on the uses to which it can be put.
>
> If we have row security on the table but no equivalent capability on
> the stats, then we'll have leakage: e.g. set statistics 10000,
> ANALYZE, then leak 10000 credit card numbers.
>
> Selectivity functions are not marked leakproof, nor do people think
> they can easily be made so. Which means the data might be leaked by
> various means through error messages, plan selection, skullduggery
> etc.
>
> If it ain't in the bucket, the bucket can't leak it.

I accidentally responded to Simon off-list about this. I understand the
need and think it would be a good thing to have. However, the real
opportunity here is to make statistics non-user-visible; I can't think
of any reason that they need to be visible to the standard user. And
when we set the statistics private, it could make just that column
non-visible.

Sincerely,

Joshua D. Drake

-- 
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579
Re: [HACKERS] Use of systable_beginscan_ordered in event trigger patch
Tom Lane <t...@sss.pgh.pa.us> writes:
>> Maybe we could just append to it how to remove such an event trigger,
>> which is easy to do because you can't target an event-trigger-related
>> command with event triggers, so the following is not disabled:
>>
>>     DROP EVENT TRIGGER abort_ddl;
>
> Oh really? The explanation of the example certainly doesn't say that.

I remember that in my proposals somewhere we had support for a very
limited form of command triggers for any command in the system. Of
course, as that included transaction control commands, we made that
better.

I'm quite tired, so maybe my memory is playing tricks on me, but I kind
of remember that we also had quite a discussion about just that example
and its phrasing in the docs, and did come out with something
satisfactory.

Robert, does that ring a bell to you? I'm going to crawl the archives
tomorrow if not…

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] PRIVATE columns
Simon Riggs <si...@2ndquadrant.com> writes:
> Currently, ANALYZE collects data on all columns and stores these
> samples in pg_statistic, where they can be seen via the view pg_stats.

Only if you have appropriate privileges.

> In some cases we have data that is private, and we do not wish others
> to see it, such as patient names. This becomes more important when we
> have row security. Perhaps that data can be protected, but it would be
> even better if we simply didn't store value-revealing statistics data
> at all.

SET STATISTICS 0 seems like a sufficient solution for people who don't
trust the has_column_privilege() protection in the pg_stats view. In
practice I think this is a waste of time, though. Anyone who can bypass
the view restriction can probably just read the original table.

(I suppose we could consider marking pg_stats as a security_barrier
view to make this even safer. Not sure it's worth the trouble though;
the interesting columns are anyarray, so it's hard to do much with them
mechanically.)

> It would be good if we could collect the overall stats
> * NULL fraction
> * average width
> * ndistinct
> yet without storing either the MFVs or histogram.

Do you have any evidence whatsoever that that's worth the trouble? I'd
bet against it. And if we're being paranoid, who's to say that those
numbers couldn't reveal useful data in themselves?

			regards, tom lane
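For reference, the per-row protection in the pg_stats view is roughly
this predicate in its WHERE clause (paraphrased from the view
definition, not quoted verbatim):

    ... WHERE NOT a.attisdropped
          AND has_column_privilege(c.oid, a.attnum, 'select')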
[HACKERS] Use gcc built-in atomic inc/dec in lock.c
Hi,

I noticed an XXX comment in lock.c: "It might be worth considering
using an atomic fetch-and-add instruction here, on architectures where
that is supported."

Here is my first try at using it. The patch adds a configure check for
the gcc 4.7 __atomic_add_fetch as well as the older
__sync_add_and_fetch built-ins. If either is available, they are used
instead of the mutex.

The changes do the following:
- declare atomic_inc and atomic_dec inline functions (implemented
  either using the GCC built-in functions or the old mutex code) in a
  new atomics.h
- change lock.c to use the atomic_* functions instead of an explicit
  mutex
- move all other assignments inside the mutex to occur before the
  atomic operation, so that the barrier of the atomic operation can
  guarantee the stores are visible when the function ends
- remove one assert that could not easily be implemented with
  atomic_dec in RemoveLocalLock

Using the method AbortStrongLockAcquire as an example, when compiling
with Fedora GCC 4.7.2 the following assembly code is generated:

Original code before the patch: 136 bytes
Mutex code with the patch: 124 bytes
Code with gcc built-ins: 56 bytes

I think moving the extra assignments outside the mutex has allowed the
compiler to optimize the code more, even when the mutex is used.

Questions:
1) Is it safe to move the assignments to
   locallock->holdsStrongLockCount and StrongLockInProgress outside the
   FastPathStrongRelationLocks mutex?
2) With the built-ins, FastPathStrongRelationLockData becomes
   uint32[1024]; should we add back some padding to reduce the
   cacheline collisions? For a modern cpu using 64-byte cache lines
   there can be only max 16 concurrent updates at a time, and the
   probability of collision is quite big.
3) What kind of pgbench test would utilise the code path the most?
   (One possibility is sketched after the listings below.)

TODO:
1) add a check in configure.in to ensure that the built-ins are not
   converted to external calls by gcc (on architectures that use the
   gcc generic version)
2) other architectures / compilers

--

Original code before the patch:

0e10 <AbortStrongLockAcquire>:
 e10: 48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
 e15: 48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
 e1a: 48 83 ec 18             sub    $0x18,%rsp
 e1e: 48 8b 1d 00 00 00 00    mov    0x0(%rip),%rbx    # e25 <AbortStrongLockAcquire+0x15>
 e25: 48 85 db                test   %rbx,%rbx
 e28: 74 41                   je     e6b <AbortStrongLockAcquire+0x5b>
 e2a: 8b 6b 28                mov    0x28(%rbx),%ebp
 e2d: b8 01 00 00 00          mov    $0x1,%eax
 e32: 48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx    # e39 <AbortStrongLockAcquire+0x29>
 e39: 81 e5 ff 03 00 00       and    $0x3ff,%ebp
 e3f: f0 86 02                lock xchg %al,(%rdx)
 e42: 84 c0                   test   %al,%al
 e44: 75 3a                   jne    e80 <AbortStrongLockAcquire+0x70>
 e46: 48 8b 05 00 00 00 00    mov    0x0(%rip),%rax    # e4d <AbortStrongLockAcquire+0x3d>
 e4d: 48 c7 05 00 00 00 00    movq   $0x0,0x0(%rip)    # e58 <AbortStrongLockAcquire+0x48>
 e54: 00 00 00 00
 e58: 83 6c a8 04 01          subl   $0x1,0x4(%rax,%rbp,4)
 e5d: c6 43 40 00             movb   $0x0,0x40(%rbx)
 e61: 48 8b 05 00 00 00 00    mov    0x0(%rip),%rax    # e68 <AbortStrongLockAcquire+0x58>
 e68: c6 00 00                movb   $0x0,(%rax)
 e6b: 48 8b 5c 24 08          mov    0x8(%rsp),%rbx
 e70: 48 8b 6c 24 10          mov    0x10(%rsp),%rbp
 e75: 48 83 c4 18             add    $0x18,%rsp
 e79: c3                      retq
 e7a: 66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
 e80: 48 8b 3d 00 00 00 00    mov    0x0(%rip),%rdi    # e87 <AbortStrongLockAcquire+0x77>
 e87: ba a8 05 00 00          mov    $0x5a8,%edx
 e8c: be 00 00 00 00          mov    $0x0,%esi
 e91: e8 00 00 00 00          callq  e96 <AbortStrongLockAcquire+0x86>
 e96: eb ae                   jmp    e46 <AbortStrongLockAcquire+0x36>
 e98: 0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 e9f: 00

Mutex code with the patch:

0e00 <AbortStrongLockAcquire>:
 e00: 48 8b 05 00 00 00 00    mov    0x0(%rip),%rax    # e07 <AbortStrongLockAcquire+0x7>
 e07: 48 85 c0                test   %rax,%rax
 e0a: 74
      55                      je     e61 <AbortStrongLockAcquire+0x61>
 e0c: 48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
 e11: 48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
 e16: 48 83 ec 18             sub    $0x18,%rsp
 e1a: 8b 68 28                mov    0x28(%rax),%ebp
 e1d: c6 40 40 00             movb   $0x0,0x40(%rax)
 e21: b8 01 00 00 00          mov    $0x1,%eax
 e26: 48 c7 05 00 00 00 00    movq   $0x0,0x0(%rip)
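Regarding question 3 above: the FastPathStrongRelationLocks counters
are only touched for "strong" table locks (modes above
ShareUpdateExclusiveLock), so one way to exercise the path might be a
custom pgbench script that repeatedly takes such a lock (a sketch; the
table choice and lock mode are arbitrary):

    -- strong_lock.sql, run via: pgbench -f strong_lock.sql -c 64 -T 60
    BEGIN;
    LOCK TABLE pgbench_history IN SHARE ROW EXCLUSIVE MODE;
    COMMIT;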
Re: [HACKERS] Use gcc built-in atomic inc/dec in lock.c
On 12 December 2012 22:11, Mikko Tiihonen <mikko.tiiho...@nitorcreations.com> wrote:
> I noticed an XXX comment in lock.c: "It might be worth considering
> using an atomic fetch-and-add instruction here, on architectures where
> that is supported."
>
> Here is my first try at using it.

That's interesting, but I have to wonder if there is any evidence that
this *is* actually helpful to performance.

-- 
Peter Geoghegan                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Enabling Checksums
On 12/5/12 6:49 PM, Simon Riggs wrote:
> * Zeroing pages, making pages all 1s
> * Transposing pages
> * Moving chunks of data sideways in a block
> * Flipping bits randomly
> * Flipping data endianness
> * Destroying particular catalog tables or structures

I can take this on, as part of the QA around checksums working as
expected. The result would be a Python program; I don't have quite
enough time to write this in C or re-learn Perl to do it right now.
But this won't be a lot of code. If it's tossed one day as simply a
prototype for something more permanent, I think it's still worth doing
now.

The UI I'm thinking of for what I'm going to call pg_corrupt is a CLI
that asks for:

- A relation name
- Corruption type (an entry from this list)
- How many blocks to touch

I'll just loop based on the count, randomly selecting a block each time
and messing with it in that way. The randomness seed should be printed
as part of the output, so that it's possible to re-create the damage
exactly later. If the server doesn't handle it correctly, we'll want to
be able to replicate the condition it choked on exactly later, just
based on the tool's log output.

Any other requests?

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Logical decoding exported base snapshot
On 2012-12-12 12:13:44 +0100, Andres Freund wrote: On 2012-12-11 22:20:18 -0500, Tom Lane wrote: Joachim Wieland j...@mcknight.de writes: On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund and...@2ndquadrant.com wrote: One problem I see is that while exporting a snapshot solves the visibility issues of the table's contents, it does not protect against schema changes. I am not sure whether that's a problem. If somebody runs a CLUSTER or something like that, the table's contents will be preserved, including MVCC semantics. That's fine. The more problematic cases I see are TRUNCATE, DROP and ALTER TABLE. This is why the pg_dump master process executes LOCK TABLE ... IN ACCESS SHARE MODE for every table, so your commands would all block. A lock doesn't protect against schema changes made before the lock was taken. The reason that the described scenario is problematic is that pg_dump is going to be expected to work against a snapshot made before it gets a chance to take those table locks. Thus, there's a window where DDL is dangerous, and will invalidate the dump --- perhaps without any warning. Now, we have this problem today, in that pg_dump has to read pg_class before it can take table locks, so some window exists already. What's bothering me about what Andres describes is that the window for trouble seems to be getting much bigger. This morning I wondered whether we couldn't protect against that by acquiring share locks on the catalog rows pg_dump reads; that would result in "could not serialize access due to concurrent update" type errors, which would be easy enough to discern/translate. While pretty damn ugly, that should take care of most of those issues, shouldn't it? After a quick look it doesn't seem too hard to add this; does anybody have an opinion on whether it's something worthwhile? And possibly suggest an option name? I can't come up with anything better than --recheck-locks or something, but that's awful. I don't think there's too much point in locking anything but pg_class, pg_attribute and pg_type; nearly everything else is read using an MVCC snapshot anyway, or isn't all that critical. Other candidates? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Logical decoding exported base snapshot
Andres Freund and...@2ndquadrant.com writes: On 2012-12-12 12:13:44 +0100, Andres Freund wrote: This morning I wondered whether we couldn't protect against that by acquiring share locks on the catalog rows pg_dump reads; that would result in "could not serialize access due to concurrent update" type errors, which would be easy enough to discern/translate. While pretty damn ugly, that should take care of most of those issues, shouldn't it? How would it fix anything? The problem is with DDL that's committed and gone before pg_dump ever gets to the table's pg_class row. Once it does, and takes AccessShareLock on the relation, it's safe. Adding a SELECT FOR SHARE step just adds more time before we can get that lock. Also, locking the pg_class row doesn't provide protection against DDL that doesn't modify the relation's pg_class row, of which there is plenty. regards, tom lane
Re: [HACKERS] Logical decoding exported base snapshot
On 2012-12-12 18:52:33 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2012-12-12 12:13:44 +0100, Andres Freund wrote: This morning I wondered whether we couldn't protect against that by acquiring share locks on the catalog rows pg_dump reads; that would result in "could not serialize access due to concurrent update" type errors, which would be easy enough to discern/translate. While pretty damn ugly, that should take care of most of those issues, shouldn't it? How would it fix anything? The problem is with DDL that's committed and gone before pg_dump ever gets to the table's pg_class row. Once it does, and takes AccessShareLock on the relation, it's safe. Adding a SELECT FOR SHARE step just adds more time before we can get that lock. Getting a FOR SHARE lock ought to error out with a serialization failure if the row was updated since our snapshot started, as pg_dump uses repeatable read/serializable. Now that obviously doesn't fix the situation, but making it detectable in a safe way seems to be good enough for me. Also, locking the pg_class row doesn't provide protection against DDL that doesn't modify the relation's pg_class row, of which there is plenty. Well, that's why I thought of pg_class, pg_attribute, pg_type. Maybe that list needs to get extended a bit, but I think just those 3 should detect most dangerous situations. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
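The serialization-failure behavior being relied on here is easy to demonstrate stand-alone. Below is a minimal sketch using libpq, run against an ordinary table t with an integer column id rather than a catalog, since only the REPEATABLE READ plus FOR SHARE semantics matter for the illustration:

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");     /* libpq connection defaults */
        PGresult   *res;

        res = PQexec(conn, "BEGIN ISOLATION LEVEL REPEATABLE READ");
        PQclear(res);
        res = PQexec(conn, "SELECT 1");         /* first query takes the snapshot */
        PQclear(res);

        /* ... another session now updates and commits the row with id = 1 ... */

        res = PQexec(conn, "SELECT id FROM t WHERE id = 1 FOR SHARE");
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            /* expected: could not serialize access due to concurrent update */
            fprintf(stderr, "%s", PQerrorMessage(conn));
        PQclear(res);

        PQfinish(conn);
        return 0;
    }

Whether the same row-marking machinery can be pointed at pg_class, pg_attribute and pg_type from within pg_dump is exactly the open question in this subthread; the sketch only shows the error that would make the DDL race detectable.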
Re: [HACKERS] logical changeset generation v3 - git repository
On 9 December 2012 19:14, Andres Freund and...@2ndquadrant.com wrote: I pushed a new version which
- is rebased on top of master
- is based on top of the new xlogreader (biggest part)
- is based on top of the new binaryheap.h
- some fixes
- some more comments

I decided to take another look at this, following my earlier reviews of a substantial subset of earlier versions of this patch, including my earlier reviews of WAL decoding [1] and a focused review of snapshot building [2] (itself a subset of WAL decoding). I think it was the right time to consolidate your multiple earlier patches, because some of the earlier BDR patches were committed (including "Rearrange storage of data in xl_running_xacts" [3], "Basic binary heap implementation" [4], "Embedded list interface" [5], and, though it isn't touched on here and is technically entirely distinct functionality, "Background worker processes" [6]). Furthermore, now that we've gotten past some early rounds of reviewing, it makes sense to build a *perfectly* (rather than just approximately) atomic unit, as we work towards something that is actually committable. So what's the footprint of this big, newly rebased feature branch? Well, though some of these changes are uncommitted stuff from Heikki (i.e. XLogReader, which you've modified), and some of this is README documentation, the footprint is very large. I merged master with your dev branch (last commit of yours, 743f3af081209f784a30270bdf49301e9e242b78, made on Mon 10th Dec 15:35), and the stats are:

91 files changed, 9736 insertions(+), 956 deletions(-)

Note that there is a relatively large number of files affected in part because the tqual interface was bashed around a bit for the benefit of logical decoding - a lot of the changes to each of those 91 files are completely trivial. I'm very glad that you followed my earlier recommendation of splitting your demo logical changeset consumer into a contrib module, in the spirit of contrib/spi, etc. This module, test_decoding, represents a logical entry point, if you will, for the entire patch. As unwieldy as it may appear to be, the patch is (or at least *should be*) ultimately reducible to some infrastructural changes to core to facilitate this example logical change-set consumer.

test_decoding contrib module:
 contrib/Makefile                      |   1 +
 contrib/test_decoding/Makefile        |  16 +
 contrib/test_decoding/test_decoding.c | 192

Once again, because test_decoding is a kind of entry point, it gives me a nice point to continually refer back to when talking about this patch. (Incidentally, maybe test_decoding should be called pg_decoding?). The regression tests pass, though this isn't all that surprising, since frankly the test coverage of this patch appears to be quite low. I know that you're working with Abhijit on improvements to the isolation tester to verify the correctness of the patch as it relates to supporting actual, practical logical replication systems. I would very much welcome any such test coverage (even if it wasn't in a committable state), since in effect you're asking me to take a leap of faith in respect of how well this infrastructure will support such systems; previously, I obliged you and didn't focus on concurrency and serializability concerns (it was sufficient to print out values/do some decoding in a toy function), but it's time to take a closer look at those now, I feel.
test_decoding is a client of the logical change-set producing infrastructure, and there appears to be broad agreement that that infrastructure needs to treat such consumers in a way that is maximally abstract. My question is, just how abstract does this interface have to be, really? How well are you going to support the use-case of a real logical replication system? Now, maybe it's just that I haven't been paying attention (in particular, to the discussion surrounding [3], though that commit doesn't appear to have been justified in terms of commit ordering in BDR at all), but I would like you to be more demonstrative of certain things, like:

1. Just what does a logical change-set consumer look like? What things are always true of one, and never true of one?
2. Please describe in as much detail as possible the concurrency issues with respect to logical replication systems. Please make verifiable, testable claims as to how well these issues are considered here, perhaps with reference to the previous remarks of subject-matter experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9] following my earlier review.

I'm not all that impressed with where test_decoding is at right now. There is still essentially no documentation. I think it's notable that you don't really touch the ReorderBufferTXN passed by the core system in the test_decoding plugin.

test_decoding and pg_receivellog

I surmised that the way that the test_decoding module is intended to be used is as a client of
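To make question 1 concrete: in rough terms, a change-set consumer boils down to a handful of callbacks, invoked per transaction and per decoded change, that format the change stream for some transport. The sketch below is purely illustrative; the header, type and callback names are assumptions, and the interface in the patch under review may well differ:

    /* Hypothetical skeleton of an output plugin / change-set consumer. */
    #include "postgres.h"
    #include "fmgr.h"
    #include "lib/stringinfo.h"
    #include "replication/output_plugin.h"

    PG_MODULE_MAGIC;

    static void
    my_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
              Relation relation, ReorderBufferChange *change)
    {
        /* one decoded INSERT/UPDATE/DELETE; a real consumer forwards it */
        OutputPluginPrepareWrite(ctx, true);
        appendStringInfo(ctx->out, "change in transaction %u", txn->xid);
        OutputPluginWrite(ctx, true);
    }

    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        cb->change_cb = my_change;  /* begin/commit callbacks elided */
    }

On this sketch, what is "always true" of a consumer is that it sees changes per transaction, in commit order, after the reorder buffer has assembled them; everything transport-specific lives behind the callbacks.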
Re: [HACKERS] [PERFORM] encouraging index-only scans
On Wed, Dec 12, 2012 at 05:27:39PM -0500, Andrew Dunstan wrote: On 12/12/2012 05:12 PM, Andrew Dunstan wrote: On 12/12/2012 04:32 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: A client is testing a migration from 9.1 to 9.2, and has found that a large number of queries run much faster if they use index-only scans. However, the only way he has found to get such a plan is by increasing the seq_page_cost to insanely high levels (3.5). Is there any approved way to encourage such scans that's a bit less violent than this? Is the pg_class.relallvisible estimate for the table realistic? They might need a few more VACUUM and ANALYZE cycles to get it into the neighborhood of reality, if not. That was the problem - I didn't know this hadn't been done. Actually, the table had been analysed but not vacuumed, so this kinda begs the question: what will happen to this value on pg_upgrade? Will people's queries suddenly get slower until autovacuum kicks in on the table? [ moved to hackers list ] Yes, this does seem like a problem for upgrades from 9.2 to 9.3. We can have pg_dump --binary-upgrade set these, or have ANALYZE set it. I would prefer the latter. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] [PERFORM] encouraging index-only scans
Bruce Momjian br...@momjian.us writes: On Wed, Dec 12, 2012 at 05:27:39PM -0500, Andrew Dunstan wrote: Actually, the table had been analysed but not vacuumed, so this kinda begs the question: what will happen to this value on pg_upgrade? Will people's queries suddenly get slower until autovacuum kicks in on the table? [ moved to hackers list ] Yes, this does seem like a problem for upgrades from 9.2 to 9.3. We can have pg_dump --binary-upgrade set these, or have ANALYZE set it. I would prefer the latter. ANALYZE does not set that value, and is not going to start doing so, because it doesn't scan enough of the table to derive a trustworthy value. It's been clear for some time that pg_upgrade ought to do something about transferring the statistics columns in pg_class to the new cluster. This is just another example of why. regards, tom lane
Re: [HACKERS] PRIVATE columns
On 12/12/2012 3:12 PM, Simon Riggs wrote: On 12 December 2012 19:13, Jan Wieck janwi...@yahoo.com wrote: On 12/12/2012 1:12 PM, Simon Riggs wrote: Currently, ANALYZE collects data on all columns and stores these samples in pg_statistic, where they can be seen via the view pg_stats. In some cases we have data that is private and we do not wish others to see it, such as patient names. This becomes more important when we have row security. Perhaps that data can be protected, but it would be even better if we simply didn't store value-revealing statistics data at all. Such private data is seldom the target of searches, or if it is, it is mostly evenly distributed anyway. Would protecting it the same way we protect the passwords in pg_authid be sufficient? The user backend does need to be able to access the stats data during optimization. It's hard to have data accessible and yet impose limits on the uses to which it can be put. If we have row security on the table but no equivalent capability on the stats, then we'll have leakage, e.g. set statistics 1, ANALYZE, then leak 1 credit card number. Like for the encrypted password column of pg_authid, I don't see any reason why the values in the stats columns need to be readable for anyone but a superuser at all. Do you? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] [PERFORM] encouraging index-only scans
On Thu, Dec 13, 2012 at 9:21 AM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: On Wed, Dec 12, 2012 at 05:27:39PM -0500, Andrew Dunstan wrote: Actually, the table had been analysed but not vacuumed, so this kinda begs the question: what will happen to this value on pg_upgrade? Will people's queries suddenly get slower until autovacuum kicks in on the table? [ moved to hackers list ] Yes, this does seem like a problem for upgrades from 9.2 to 9.3. We can have pg_dump --binary-upgrade set these, or have ANALYZE set it. I would prefer the latter. ANALYZE does not set that value, and is not going to start doing so, because it doesn't scan enough of the table to derive a trustworthy value. Should we do that, though? I.e., scan the entire map and count the number of bits at the end of ANALYZE, like we do at the end of VACUUM? I recently tried to optimize that code path by not recounting at the end of the vacuum and instead tracking the number of all-visible bits while scanning them in the earlier phases of vacuum. But it turned out that it's so fast to count even a million bits that it's probably not worth doing so. It's been clear for some time that pg_upgrade ought to do something about transferring the statistics columns in pg_class to the new cluster. This is just another example of why. +1. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
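For scale, on the point that counting is cheap: the visibility map stores one bit per heap page, so a 1 GB heap (131072 pages of 8 kB) is covered by just 16 kB of map, and counting the all-visible bits is a popcount over that. A generic sketch of the counting loop (not the actual visibilitymap.c code):

    #include <stddef.h>

    /* count set bits in a visibility-map buffer: one bit per heap page */
    static long
    count_set_bits(const unsigned char *map, size_t len)
    {
        long        n = 0;

        for (size_t i = 0; i < len; i++)
            n += __builtin_popcount(map[i]);    /* GCC built-in, 8 bits at a time */
        return n;
    }

Even a 1 TB heap has a map of only about 16 MB, which supports the conclusion that recounting at the end of VACUUM (or ANALYZE) is not the expensive part.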
Re: [HACKERS] Multiple --table options for other commands
On Wed, Dec 12, 2012 at 8:14 AM, Karl O. Pinc k...@meme.com wrote: On 12/11/2012 10:25:43 PM, Josh Kupershmidt wrote: On Tue, Dec 11, 2012 at 11:46 AM, Karl O. Pinc k...@meme.com wrote: I believe you need ellipses behind --table in the syntax summaries of the command reference docs. Hrm, I was following pg_dump's lead here for the .sgml docs, and didn't see anywhere that pg_dump makes the multiple --table syntax explicit other than in the explanatory text underneath the option. Yes, I see. I didn't look at all the commands' reference pages, but did happen to look at clusterdb, which does have --table in the syntax summary. I just checked, and you need to fix clusterdb, reindexdb, and vacuumdb. OK, I made some changes to the command synopses for these three commands. For some reason, reindexdb.sgml was slightly different from the other two, in that the --table and --index bits were underneath a <group> node instead of <arg>. I've changed that one to be like the other two, and made them all use <arg choice="opt" rep="repeat"> to include the ellipses after the table -- I hope that matches what you had in mind. Josh

Attachment: multiple_tables.v3.diff (binary data)
Re: [HACKERS] Multiple --table options for other commands
Hi Josh, The good news is that there's only this little bit of documentation formatting left to fix. On 12/12/2012 11:04:53 PM, Josh Kupershmidt wrote: On Wed, Dec 12, 2012 at 8:14 AM, Karl O. Pinc k...@meme.com wrote: On 12/11/2012 10:25:43 PM, Josh Kupershmidt wrote: On Tue, Dec 11, 2012 at 11:46 AM, Karl O. Pinc k...@meme.com wrote: I believe you need ellipses behind --table in the syntax summaries of the command reference docs. Yes, I see. I didn't look at all the commands' reference pages, but did happen to look at clusterdb, which does have --table in the syntax summary. I just checked, and you need to fix clusterdb, reindexdb, and vacuumdb. OK, I made some changes to the command synopses for these three commands. For some reason, reindexdb.sgml was slightly different from the other two, in that the --table and --index bits were underneath a <group> node instead of <arg>. I've changed that one to be like the other two, and made them all use <arg choice="opt" rep="repeat"> to include the ellipses after the table -- I hope that matches what you had in mind.

Sadly, the ... is in the wrong place in all 3. Yours look like (two examples):

vacuumdb [connection-option...] [option...] [ --table | -t table [( column [,...] )] ...] [dbname]
reindexdb [connection-option...] [ --table | -t table ...] [ --index | -i index ...] [dbname]

I want the ... outside the square brackets, because the --table (or -t) must repeat for each table specified. Like:

vacuumdb [connection-option...] [option...] [ --table | -t table [( column [,...] )] ]... [dbname]
reindexdb [connection-option...] [ --table | -t table ]... [ --index | -i index ]... [dbname]

Sorry, I don't see any examples in the existing docs of how this is done. It seems like what you have should work, although perhaps somehow that's the difference between <arg> and <group>, and you need to switch them back to how reindexdb used to be. Otherwise, there are the docs at docbook.org, and the docbook mailing lists. Or maybe the stylesheets are broken. (Always blame the tool! It's never _our_ fault! ;-) Have you had trouble getting this to work? Maybe someone who knows what they're doing will step in and give us the answer. Regards, Karl k...@meme.com Free Software: You don't pay back, you pay forward. -- Robert A. Heinlein