Re: [HACKERS] Tab completion for view triggers in psql
On Wed, Nov 24, 2010 at 12:21, David Fetter da...@fetter.org wrote:
> Please find attached a patch changing both this and "updateable" to "updatable", also per the very handy git grep I just learned about :)

I think the patch has two issues to be fixed.

It expands all tables (and views) in tab-completion after INSERT, UPDATE, and DELETE FROM. Is it an intended change? IMHO, I don't want to expand any schema, because my console is filled with system tables and duplicated tables with or without schema names :-( .

(original)
=# INSERT INTO [tab]
information_schema.  pg_temp_1.  pg_toast_temp_1.  tbl  pg_catalog.  pg_toast.  public.

(patched)
=# INSERT INTO [tab]
Display all 113 possibilities? (y or n)

Also, event names are not completed after INSTEAD OF:

=# CREATE TRIGGER trg BEFORE [tab]
DELETE    INSERT    TRUNCATE  UPDATE
=# CREATE TRIGGER trg INSTEAD OF [tab]
(no candidates)

--
Itagaki Takahiro

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tab completion for view triggers in psql
On Tue, Nov 30, 2010 at 05:48:04PM +0900, Itagaki Takahiro wrote:
> On Wed, Nov 24, 2010 at 12:21, David Fetter da...@fetter.org wrote:
> > Please find attached a patch changing both this and "updateable" to "updatable", also per the very handy git grep I just learned about :)
>
> I think the patch has two issues to be fixed. It expands all tables (and views) in tab-completion after INSERT, UPDATE, and DELETE FROM. Is it an intended change? IMHO, I don't want to expand any schema, because my console is filled with system tables and duplicated tables with or without schema names :-( .
>
> (original)
> =# INSERT INTO [tab]
> information_schema.  pg_temp_1.  pg_toast_temp_1.  tbl  pg_catalog.  pg_toast.  public.
>
> (patched)
> =# INSERT INTO [tab]
> Display all 113 possibilities? (y or n)

Yes. I believe that the question of having INSERT INTO [tab] check for permissions is a separate issue from this patch. This patch does bring the question of whether it *should* check such permissions into more focus, though. Whether it should is probably a matter for a separate thread; I could create arguments in both directions.

> Also, event names are not completed after INSTEAD OF:
>
> =# CREATE TRIGGER trg BEFORE [tab]
> DELETE    INSERT    TRUNCATE  UPDATE
> =# CREATE TRIGGER trg INSTEAD OF [tab]
> (no candidates)

This one's fixed in the attached patch, although subsequent events, as in

=# CREATE TRIGGER trg BEFORE INSERT OR [tab]

are not. Thanks for your review :)

Cheers,
David.
--
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 4c468a8..0aba07c 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -303,6 +303,57 @@ static const SchemaQuery Query_for_list_of_tables = {
 	NULL
 };
 
+/* The bit masks for the following three functions come from
+ * src/include/catalog/pg_trigger.h.
+ */
+static const SchemaQuery Query_for_list_of_insertables = {
+	/* catname */
+	"pg_catalog.pg_class c",
+	/* selcondition */
+	"c.relkind = 'r' OR (c.relkind = 'v' AND c.relhastriggers AND EXISTS "
+	"(SELECT 1 FROM pg_catalog.pg_trigger t WHERE t.tgrelid = c.oid "
+	"AND t.tgtype | (1 << 2) = t.tgtype))",
+	/* viscondition */
+	"pg_catalog.pg_table_is_visible(c.oid)",
+	/* namespace */
+	"c.relnamespace",
+	/* result */
+	"pg_catalog.quote_ident(c.relname)",
+	/* qualresult */
+	NULL
+};
+
+static const SchemaQuery Query_for_list_of_deletables = {
+	/* catname */
+	"pg_catalog.pg_class c",
+	/* selcondition */
+	"c.relkind = 'r' OR (c.relkind = 'v' AND c.relhastriggers AND EXISTS "
+	"(SELECT 1 FROM pg_catalog.pg_trigger t WHERE t.tgrelid = c.oid "
+	"AND t.tgtype | (1 << 3) = t.tgtype))",
+	/* viscondition */
+	"pg_catalog.pg_table_is_visible(c.oid)",
+	/* namespace */
+	"c.relnamespace",
+	/* result */
+	"pg_catalog.quote_ident(c.relname)",
+	/* qualresult */
+	NULL
+};
+
+static const SchemaQuery Query_for_list_of_updatables = {
+	/* catname */
+	"pg_catalog.pg_class c",
+	/* selcondition */
+	"c.relkind = 'r' OR (c.relkind = 'v' AND c.relhastriggers AND EXISTS "
+	"(SELECT 1 FROM pg_catalog.pg_trigger t WHERE t.tgrelid = c.oid "
+	"AND t.tgtype | (1 << 4) = t.tgtype))",
+	/* viscondition */
+	"pg_catalog.pg_table_is_visible(c.oid)",
+	/* namespace */
+	"c.relnamespace",
+	/* result */
+	"pg_catalog.quote_ident(c.relname)",
+	/* qualresult */
+	NULL
+};
+
 static const SchemaQuery Query_for_list_of_tisv = {
 	/* catname */
 	"pg_catalog.pg_class c",
@@ -333,6 +384,21 @@ static const SchemaQuery
Query_for_list_of_tsv = {
 	NULL
 };
 
+static const SchemaQuery Query_for_list_of_tv = {
+	/* catname */
+	"pg_catalog.pg_class c",
+	/* selcondition */
+	"c.relkind IN ('r', 'v')",
+	/* viscondition */
+	"pg_catalog.pg_table_is_visible(c.oid)",
+	/* namespace */
+	"c.relnamespace",
+	/* result */
+	"pg_catalog.quote_ident(c.relname)",
+	/* qualresult */
+	NULL
+};
+
 static const SchemaQuery Query_for_list_of_views = {
 	/* catname */
 	"pg_catalog.pg_class c",
@@ -630,7 +696,8 @@ psql_completion(char *text, int start, int end)
 			   *prev2_wd,
 			   *prev3_wd,
 			   *prev4_wd,
-			   *prev5_wd;
+			   *prev5_wd,
+			   *prev6_wd;
 
 	static const char *const sql_commands[] = {
 		"ABORT", "ALTER", "ANALYZE", "BEGIN", "CHECKPOINT", "CLOSE",
Re: [HACKERS] pg_execute_from_file review
Itagaki Takahiro itagaki.takah...@gmail.com writes:
> I think there are two topics here:
> 1. Do we need to restrict locations in which sql files are executable?
> 2. Which is better, pg_execute_sql_file() or EXECUTE pg_read_file() ?
>
> There has been no discussion yet for 1, but I think we need some restrictions.

Well, as a first level of restriction, the function is superuser only. I understand and share your concerns, but as the main use for this function is in the extension's patch, which is superuser only too, I feel like this discussion could (should) be taken to after commit. We would issue the next alpha with the current coding, and the one after that with more security thoughts in. Is it possible to attack it this way?

(The fear being that it might be hard to get to a common decision here, and we have other patches to care about depending on this one, which will continue working fine --- that's the goal --- once the security policy lands in. This one is the mechanism patch.)

> For 2, I'd like to monofunctionalize pg_execute_sql_file() into separated functions, something like:
>  - FUNCTION pg_execute_sql(sql text)
>  - FUNCTION replace(str text, from1 text, to1 text, VARIADIC text)
>  - FUNCTION pg_read_binary_file(path text, offset bigint, size bigint)
>    (size == -1 means whole-file)
>
> pg_read_binary_file() is the most important function, probably. pg_execute_sql_file() can be rewritten as follows. We can also use the existing convert_from() to support encodings.
>
> SELECT pg_execute_sql(
>   replace(
>     convert_from(
>       pg_read_binary_file('path', 0, -1),
>       'encoding'),
>     'key1', 'value1',
>     'key2', 'value2')
> );

I think the current pg_execute_sql_file() is a better user interface, but it could be extended to support queries in a text argument too, of course. I proposed it, and Robert said he's not thinking that should happen in this very patch, if at all, if I understood correctly.

Again, I'd like to see pg_read_binary_file(), and it's easy to expose the other derivatives you're proposing here: the support code is already in my patch and is organised this way internally. Now, is there an agreement that all those new SQL functions should be in the pg_execute_from_file patch? If so, I'll prepare v7 with that.

Overall, I'd like it if it were possible to separate the concerns directly relevant to the extension patch from other legitimate ones, so that the main patch gets its share of review in this commitfest. But maybe I'm over-worried here.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] pg_execute_from_file review
Itagaki Takahiro itagaki.takah...@gmail.com writes:
> client_encoding won't work at all because read_sql_queries_from_file() uses pg_verifymbstr(), that is, it verifies the input with *server_encoding*. Even if we replace it with pg_verify_mbstr(client_encoding, ...) and pg_do_encoding_conversion(from client_encoding to server_encoding), it still won't work well when error messages are raised. The client expects the original client encoding, but messages are sent in the file encoding. It would be a security hole.

I'll confess I'm at a loss here wrt how to solve your concerns.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] GiST insert algorithm rewrite
On 27.11.2010 21:31, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
> > There's no on-disk format changes, except for the additional flag in the page headers, so this does not affect pg_upgrade. However, if there's any invalid keys in the old index because of an incomplete insertion, the new code will not understand that. So you should run vacuum to ensure that there's no such invalid keys in the index before upgrading. Vacuum will print a message in the log if it finds any, and you will have to reindex. But that's what it suggests you do anyway.
>
> OK, pg_upgrade has code to report invalid gin and hash indexes because of changes between PG 8.3 and 8.4. Is this something we would do for 9.0 to 9.1?

9.1. The problem that started this whole thing is there in older versions, but given the lack of real-life reports and the scale of the changes required, it doesn't seem wise to backport.

> You are saying it would have to be run before the upgrade. Can it not be run after?

After would work too.

> I can output a script to VACUUM all such indexes, and tell users to manually REINDEX any index that generates a warning message. I don't have any way to automate an optional REINDEX step.

That seems good enough.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] GiST insert algorithm rewrite
On 30.11.2010 11:55, Heikki Linnakangas wrote:
> On 27.11.2010 21:31, Bruce Momjian wrote:
> > OK, pg_upgrade has code to report invalid gin and hash indexes because of changes between PG 8.3 and 8.4. Is this something we would do for 9.0 to 9.1?
>
> 9.1. The problem that started this whole thing is there in older versions, but given the lack of real-life reports and the scale of the changes required, it doesn't seem wise to backport.

Oh sorry, I read your question as 9.0 *or* 9.1.

Only GiST indexes that have any invalid tuples in them, as a result of a crash, need to be reindexed. That's very rare in practice, so we shouldn't invalidate all GiST indexes. I don't think there's any simple way to check whether reindex is required, so I think we have to just document this.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Tab completion for view triggers in psql
On Tue, Nov 30, 2010 at 18:18, David Fetter da...@fetter.org wrote:
> > It expands all tables (and views) in tab-completion after INSERT, UPDATE, and DELETE FROM. Is it an intended change?

I found it was a simple bug; we need ( ) around selcondition.

In addition, I modified your patch a bit:
* I added a separate CREATE TRIGGER ... INSTEAD OF if-branch because it doesn't actually accept tables; it only accepts views. OTOH, BEFORE or AFTER only accepts tables.
* I think "t.tgtype & (1 << N) <> 0" is a more natural bit operation than "t.tgtype | (1 << N) = t.tgtype".

Patch attached. If you think my changes are ok, please change the patch status to "Ready for Committer".

--
Itagaki Takahiro

psql_view_trigger_tab_completion_6.diff
Description: Binary data
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Hi all,

The workaround recommended some time ago by Tom is:

DELETE FROM residents_of_athens
 WHERE ctid = any(array(SELECT ctid
                        FROM residents_of_athens
                        ORDER BY ostracism_votes DESC
                        LIMIT 1));

It is about as efficient as the requested feature would be, just uglier to write down. I use it all the time when batch-deleting something large (to avoid long running transactions and to not crash slony). It also helps to vacuum frequently if you do that on a large amount of data...

Cheers,
Csaba.

On Tue, 2010-11-30 at 00:05 -0500, Robert Haas wrote:
> On Mon, Nov 29, 2010 at 11:25 PM, Andrew Dunstan and...@dunslane.net wrote:
> > On 11/29/2010 10:19 PM, Robert Haas wrote:
> > > For example, suppose we're trying to govern an ancient Greek democracy: http://en.wikipedia.org/wiki/Ostracism
> > >
> > > DELETE FROM residents_of_athens ORDER BY ostracism_votes DESC LIMIT 1;
> >
> > I'm not sure this is a very good example. Assuming there isn't a tie, I'd do it like this:
> >
> > DELETE FROM residents_of_athens
> >  WHERE ostracism_votes >= 6000
> >    AND ostracism_votes = (SELECT max(ostracism_votes) FROM residents_of_athens);
>
> That might be a lot less efficient, though, and sometimes it's not OK to delete more than one record. Imagine, for example, wanting to dequeue the work item with the highest priority. Sure, you can use SELECT ... LIMIT to identify one and then DELETE it by some other key, but DELETE .. ORDER BY .. RETURNING .. LIMIT would be cool, and would let you do it with just one scan.
>
> > I can't say I'd be excited by this feature. In quite a few years of writing SQL I don't recall ever wanting such a gadget.
>
> It's something I've wanted periodically, though not badly enough to do the work to make it happen.
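[Editor's note: a sketch of the batched-delete variant of this idiom; the table and column names here are made up for illustration, and the statement is meant to be re-run until it reports DELETE 0:]

```sql
-- Each pass removes at most 1000 rows, keeping individual transactions
-- short so replication (e.g. Slony) and VACUUM can keep up.
DELETE FROM audit_log
 WHERE ctid = ANY (ARRAY(SELECT ctid
                         FROM audit_log
                         WHERE logged_at < now() - interval '90 days'
                         LIMIT 1000));
```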
Re: [HACKERS] crash-safe visibility map, take three
On Mon, Nov 29, 2010 at 9:57 PM, Robert Haas robertmh...@gmail.com wrote:
> 1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits are set, take the exclusive content lock for long enough to clear them.

I wonder what the performance hit will be to workloads with contention, and if this feature should be optional.

--
Rob Wultsch
wult...@gmail.com
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Mon, Nov 29, 2010 at 10:50 PM, Marti Raudsepp ma...@juffo.org wrote:
> On Tue, Nov 30, 2010 at 05:09, Jaime Casanova ja...@2ndquadrant.com wrote:
> > at least IMHO the only sensible way that LIMIT is useful is with an ORDER BY clause, which makes the results very well defined...
>
> DELETE with LIMIT is also useful for deleting things in batches, so you can do large deletes on a live system without starving other users of I/O. In this case deletion order doesn't matter (it's more efficient to delete rows in physical table order) -- ORDER BY isn't necessary.

++ I have a lot of DELETE with LIMIT statements in my (mysql) environment for this reason.

--
Rob Wultsch
wult...@gmail.com
Re: [HACKERS] [GENERAL] column-level update privs + lock table
On Mon, Nov 29, 2010 at 10:06 PM, Robert Haas robertmh...@gmail.com wrote:
> On Mon, Nov 29, 2010 at 9:37 PM, Josh Kupershmidt schmi...@gmail.com wrote:
> > I actually hadn't thought of that, for some reason. We used to similarly recommend that people handle TRUNCATE privileges with a security definer function. That doesn't mean GRANT TRUNCATE wasn't a sweet addition to 8.4.
>
> Hmm, glad you like it (I wrote that). I'm just asking how far we should go before we decide we're catering to use cases that are too narrow to warrant an extension of the permissions system.

I am slightly opposed to adding GRANTs for LOCK TABLE, ANALYZE, VACUUM, etc. The GRANT help page is long enough already, and I doubt many users would use them, even though I might use GRANT LOCK TABLE myself.

Josh
Re: [HACKERS] Tab completion for view triggers in psql
On Tue, Nov 30, 2010 at 08:13:37PM +0900, Itagaki Takahiro wrote:
> On Tue, Nov 30, 2010 at 18:18, David Fetter da...@fetter.org wrote:
> > > It expands all tables (and views) in tab-completion after INSERT, UPDATE, and DELETE FROM. Is it an intended change?
>
> I found it was a simple bug; we need ( ) around selcondition.

Oh. Heh. Thanks for tracking this down.

> In addition, I modified your patch a bit:
> * I added a separate CREATE TRIGGER ... INSTEAD OF if-branch because it doesn't actually accept tables; it only accepts views. OTOH, BEFORE or AFTER only accepts tables.

OK

> * I think "t.tgtype & (1 << N) <> 0" is a more natural bit operation than "t.tgtype | (1 << N) = t.tgtype".

OK

> Patch attached. If you think my changes are ok, please change the patch status to "Ready for Committer".

Done :)

Cheers,
David.
--
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Tue, Nov 30, 2010 at 4:25 AM, Csaba Nagy ncsli...@googlemail.com wrote:
> The workaround recommended some time ago by Tom is:
>
> DELETE FROM residents_of_athens
>  WHERE ctid = any(array(SELECT ctid
>                         FROM residents_of_athens
>                         ORDER BY ostracism_votes DESC
>                         LIMIT 1));
>
> It is about as efficient as the requested feature would be, just uglier to write down. I use it all the time when batch-deleting something large (to avoid long running transactions and to not crash slony). It also helps to vacuum frequently if you do that on a large amount of data...

That's a very elegant hack, but not exactly obvious to a novice user or, say, me. So I think it'd be nicer to have the obvious syntax work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] GiST insert algorithm rewrite
On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> On 30.11.2010 11:55, Heikki Linnakangas wrote:
> > On 27.11.2010 21:31, Bruce Momjian wrote:
> > > OK, pg_upgrade has code to report invalid gin and hash indexes because of changes between PG 8.3 and 8.4. Is this something we would do for 9.0 to 9.1?
> >
> > 9.1. The problem that started this whole thing is there in older versions, but given the lack of real-life reports and the scale of the changes required, it doesn't seem wise to backport.
>
> Oh sorry, I read your question as 9.0 *or* 9.1.
>
> Only GiST indexes that have any invalid tuples in them, as a result of a crash, need to be reindexed. That's very rare in practice, so we shouldn't invalidate all GiST indexes. I don't think there's any simple way to check whether reindex is required, so I think we have to just document this.

It seems odd to say, "the indexes are corrupted, but they're probably not, so let's not worry about it." I assume there's no way to make the new code cope with any pre-existing corruption? Does the current code cope with the corruption?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Robert Haas robertmh...@gmail.com writes:
> On Mon, Nov 29, 2010 at 11:25 PM, Andrew Dunstan and...@dunslane.net wrote:
> > I can't say I'd be excited by this feature. In quite a few years of writing SQL I don't recall ever wanting such a gadget.
>
> It's something I've wanted periodically, though not badly enough to do the work to make it happen.

It would certainly look like nothing but a crude hack if the feature is only available for DELETE and not UPDATE. Unfortunately, the UPDATE case would be an order of magnitude harder (think inheritance trees where the children aren't all alike).

			regards, tom lane
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On 11/30/2010 09:57 AM, Csaba Nagy wrote:
> So it is really an ideological thing and not lack of demand or implementation attempts... I for myself can't write working C code anyway, so I got my peace with the workaround - I wish you good luck arguing Tom :-)

We need a convincing use case for it. So far the only one that's seemed at all convincing to me is the one about deleting in batches. But that might be enough.

As for it being illogical, I don't think it's any more so than

DELETE FROM foo WHERE random() < 0.1;

and you can do that today.

cheers

andrew
Re: [HACKERS] GiST insert algorithm rewrite
On 30.11.2010 16:23, Robert Haas wrote:
> On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> > Only GiST indexes that have any invalid tuples in them, as a result of a crash, need to be reindexed. That's very rare in practice, so we shouldn't invalidate all GiST indexes. I don't think there's any simple way to check whether reindex is required, so I think we have to just document this.
>
> It seems odd to say, "the indexes are corrupted, but they're probably not, so let's not worry about it." I assume there's no way to make the new code cope with any pre-existing corruption? Does the current code cope with the corruption?

It's not corruption, but intended degradation. Yes, the current code copes with it; that's how GiST survives a crash.

However, even with the current code, VACUUM will nag if it finds any invalid tuples, with this message:

ereport(NOTICE,
        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash recovery",

That's harmless, in the sense that all scans and inserts work fine, but scans might need to do more work than if the invalid tuple wasn't there. I don't think we need to go out of our way to support such degraded indexes in 9.1. If you see such notices in your logs, you should REINDEX anyway, before or after pg_upgrade. Let's just make sure that you get a reasonable error message in 9.1 if a scan or insert encounters such a tuple.

There is a section on this in the docs, BTW: http://www.postgresql.org/docs/9.0/static/gist-recovery.html

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> Some care is needed with checkpoints. Setting visibility map bits in step 2 is safe because crash recovery will replay the intent XLOG record and clear any incorrectly set bits. But if a checkpoint has happened after the intent XLOG record was written, that's not true. This can be avoided by checking RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed since the last intent XLOG record was written.

It seems like you'll need to hold some kind of lock between the time you examine RedoRecPtr and the time you actually examine the bit. WALInsertLock in shared mode, maybe?

> There's a small race condition in the way a visibility map bit is currently cleared. When a heap page is updated, it is locked, the update is WAL-logged, and the lock is released. The visibility map page is updated only after that. If the final vacuum XLOG record is written just after updating the heap page, but before the visibility map bit is cleared, replaying the final XLOG record will set a bit that should not have been set.

Well, if that final XLOG record isn't necessary for correctness anyway, the obvious thing to do seems to be - don't write it. Crashes are not so common that loss of even a full hour's visibility map bits in the event that we have one seems worth killing ourselves over. And not everybody sets checkpoint_timeout to an hour, not all checkpoints are triggered by checkpoint_timeout, and not all crashes happen just before it expires. Seems like we might be better off writing that much less WAL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
> On 30.11.2010 06:57, Robert Haas wrote:
> > I can't say I'm totally in love with any of these designs. Anyone else have any ideas, or any opinions about which one is best?
>
> Well, the design I've been pondering goes like this:

Wouldn't it be easier and more robust to just consider VM bit changes to be part of the WAL-logged actions? That would include updating LSNs on VM pages and flushing VM pages to disk during checkpoint based on their LSN values. All of these other schemes seem too complicated and not provably correct.

Of course, that'd mean doing the bit changes inside the critical sections for the related actions, so it's not a trivial change code-wise, but neither are these other ideas.

			regards, tom lane
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
> > It's something I've wanted periodically, though not badly enough to do the work to make it happen.
>
> It would certainly look like nothing but a crude hack if the feature is only available for DELETE and not UPDATE.

I'm not sure this is true, given Andrew's comment that the bulk deletion argument is the only one he finds compelling, but I'd surely be in favor of supporting both.

> Unfortunately, the UPDATE case would be an order of magnitude harder (think inheritance trees where the children aren't all alike).

I don't understand why there's anything more to this than sticking a Limit node either immediately above or immediately below the ModifyTable node.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
On 30.11.2010 17:32, Robert Haas wrote:
> It seems like you'll need to hold some kind of lock between the time you examine RedoRecPtr and the time you actually examine the bit. WALInsertLock in shared mode, maybe?

It's enough to hold an exclusive lock on the visibility map page. You have to set the bit first, and then check RedoRecPtr, and if it changed, write the XLOG record before releasing the lock. If RedoRecPtr changes any time before we check RedoRecPtr, we'll write the XLOG record, so we're safe. If it changes after that, we're safe because the checkpoint will flush the updated heap page and visibility map page.

> Well, if that final XLOG record isn't necessary for correctness anyway, the obvious thing to do seems to be - don't write it. Crashes are not so common that loss of even a full hour's visibility map bits in the event that we have one seems worth killing ourselves over. And not everybody sets checkpoint_timeout to an hour, not all checkpoints are triggered by checkpoint_timeout, and not all crashes happen just before it expires. Seems like we might be better off writing that much less WAL.

Yeah, possibly. It also means that the set bits will not propagate to standby servers, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
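[Editor's note: the ordering described here can be sketched in C-style pseudocode; the function and variable names are illustrative only, not the actual PostgreSQL implementation:]

```c
/* Pseudocode: set a visibility map bit safely w.r.t. checkpoints. */
LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
set_vm_bit(vmpage, heapBlk);               /* 1. set the bit first */
if (GetRedoRecPtr() != intent_redo_ptr)    /* 2. has a checkpoint started
                                            *    since the last intent
                                            *    record was written? */
{
    write_intent_xlog_record();            /* 3. re-log the intent before
                                            *    releasing the lock */
    intent_redo_ptr = GetRedoRecPtr();
}
LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);  /* 4. only now release */
```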
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 30.11.2010 06:57, Robert Haas wrote: I can't say I'm totally in love with any of these designs. Anyone else have any ideas, or any opinions about which one is best? Well, the design I've been pondering goes like this: Wouldn't it be easier and more robust to just consider VM bit changes to be part of the WAL-logged actions? That would include updating LSNs on VM pages and flushing VM pages to disk during checkpoint based on their LSN values. All of these other schemes seem too complicated and not provably correct. What WAL-logged actions? The problem case is where a page has no tuples or line pointers that need to be removed, and all we need to do is mark it all-visible. We don't currently WAL-log anything in that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
On 30.11.2010 17:38, Tom Lane wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 30.11.2010 06:57, Robert Haas wrote: I can't say I'm totally in love with any of these designs. Anyone else have any ideas, or any opinions about which one is best? Well, the design I've been pondering goes like this: Wouldn't it be easier and more robust to just consider VM bit changes to be part of the WAL-logged actions? That would include updating LSNs on VM pages and flushing VM pages to disk during checkpoint based on their LSN values. All of these other schemes seem too complicated and not provably correct. The vm bit can be set once all the tuples on the page become visible to everyone. There is no WAL-logged action at that point we could piggyback on. Clearing the bit is already handled like that - replay of heap insert/update/delete records clears the visibility map bit. Of course, that'd mean doing the bit changes inside the critical sections for the related actions, so it's not a trivial change code-wise, but neither are these other ideas. Yeah, I'm not terribly excited about any of these schemes. The intent record seems like the simplest one, but even that is so different from the traditional WAL-logging we do that it makes me slightly nervous. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 10:43 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It seems like you'll need to hold some kind of lock between the time you examine RedoRecPtr and the time you actually examine the bit. WALInsertLock in shared mode, maybe? It's enough to hold an exclusive lock on the visibility map page. You have to set the bit first, and then check RedoRecPtr, and if it changed, write the XLOG record before releasing the lock. If RedoRecPtr changes any time before we check RedoRecPtr, we'll write the XLOG record so we're safe. If it changes after that, we're safe because the checkpoint will flush the updated heap page and visibility map page. Brilliant. I assume that we need to call GetRedoRecPtr() after taking the exclusive lock on the page, though? Yeah, possibly. It also means that the set bits will not propagate to standby servers, though. That's definitely sucky, but in some ways it would be more complicated if they did, because I don't think all-visible on the master implies all-visible on the standby. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
Here's one more idea: The trivial solution to this is to WAL-log setting the visibility map bit, like we WAL-log any other operation. Lock the heap page, lock the visibility map page, write WAL-record, and release locks. That works, but the problem is that it creates quite a lot of new WAL traffic. We could reduce the WAL traffic by simply updating multiple pages at a time. Lock N pages, lock the visibility map page, write one WAL record, and release locks. If N=10, for example, we only need to WAL-log a couple of bytes per page, so the WAL volume should be acceptable. The downside is that you need to keep more pages locked at the same time, but maybe that's not too bad. This wouldn't require anything special, which means fewer hard-to-debug visibility recovery bugs. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
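A back-of-envelope sketch of the WAL savings from batching N pages per record. The per-record overhead and per-page payload sizes below are illustrative assumptions for the sake of the arithmetic, not measured PostgreSQL numbers:

```python
# Rough WAL volume for setting the VM bit on every page of a 1 GB table,
# comparing one WAL record per page against one record per batch of pages.
BLCKSZ = 8192
table_bytes = 1 << 30                 # 1 GB table
pages = table_bytes // BLCKSZ         # 131072 heap pages
record_overhead = 32                  # assumed fixed cost per WAL record
per_page_payload = 16                 # assumed relfilenode + block number

def wal_volume(batch):
    records = -(-pages // batch)      # ceil(pages / batch)
    return records * (record_overhead + batch * per_page_payload)

print(wal_volume(1))    # 6291456 bytes with one record per page
print(wal_volume(10))   # 2516736 bytes with one record per 10 pages
```

Under these assumptions batching mostly saves the fixed per-record overhead; the per-page payload dominates either way, which is why the bigger cost in this thread turns out to be the full-page writes, not the record count.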
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Robert Haas robertmh...@gmail.com writes: On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: Unfortunately, the UPDATE case would be an order of magnitude harder (think inheritance trees where the children aren't all alike). I don't understand why there's anything more to this than sticking a Limit node either immediately above or immediately below the ModifyTable node. 1. You need to support ORDER BY too, otherwise I *will* be on the warpath against this as a foot-gun with no redeeming social value. 2. So what you need is Sort underneath Limit underneath ModifyTable. Putting them above it would be quite the wrong semantics. 3. This doesn't work tremendously well for inheritance trees, where ModifyTable acts as sort of an implicit Append node. You can't just funnel all the tuples through one Sort or Limit node because they aren't all the same rowtype. (Limit might perhaps not care, but Sort will.) But you can't have a separate Sort/Limit for each table either, because that would give the wrong behavior. Another problem with funneling all the rows through one Sort/Limit is that ModifyTable would need to know which table each row came from, so it can apply the modify to the right table. I don't offhand see a solution other than integrating the responsibility for limit-counting and sorting into ModifyTable itself, making it into an unholy union of ModifyTable+Limit+MergeAppend (with the individual inputs required to deliver sorted outputs separately). That's sufficiently ugly, and probably bad for performance in the normal case, that I don't think it's going to be acceptable for such a marginal feature. Or I guess you could try to persuade us that DELETE/UPDATE with ORDER BY or LIMIT doesn't need to support inherited target tables. I wouldn't bet on that proposal flying either. regards, tom lane
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Hi Robert, On Tue, 2010-11-30 at 09:19 -0500, Robert Haas wrote: That's a very elegant hack, but not exactly obvious to a novice user or, say, me. So I think it'd be nicer to have the obvious syntax work. I fully agree - but you first have to convince core hackers that this is not just a foot-gun. This was discussed many times in the past, patches were also offered (perhaps not complete ones, but proving that there is an itch getting scratched): http://archives.postgresql.org/pgsql-patches/2002-09/msg00255.php The reaction: http://archives.postgresql.org/pgsql-patches/2002-09/msg00256.php There are other discussions too; if I remember correctly, Tom once admitted that the core of implementing the feature would likely consist in letting it work, as the infrastructure is there to do it but it is actively disabled. I can't find the mail now though. So it is really an ideological thing and not lack of demand or implementation attempts... I for myself can't write working C code anyway, so I got my peace with the workaround - I wish you good luck arguing Tom :-) Cheers, Csaba.
[HACKERS] Another proposal for table synonyms
Hello, Here is the proposal to add synonyms to PostgreSQL. Initial goal is to add synonyms for relations (tables, views, sequences) and an infrastructure to allow synonyms for other database objects in the future. A thread with discussion of an old proposal by Jonah Harris is here: http://archives.postgresql.org/pgsql-hackers/2006-03/msg00519.php

- Synonyms are database objects, which act as an alias for other objects. In this proposal, synonyms for tables, views and sequences will be considered. The new command, CREATE SYNONYM, defines a synonym. The syntax is:

CREATE SYNONYM synonym_name FOR object_type object_name

where
synonym_name: fully-qualified name (FQN) of the synonym
object_type: {TABLE | VIEW | SEQUENCE}. In the future, new object_types, such as functions, can be added.
object_name: FQN of a database table, view or sequence.

Another new command, DROP SYNONYM, is used for deleting an already existing synonym without removing the object the synonym references. The syntax is:

DROP SYNONYM synonym_name

where synonym_name is a FQN of a synonym. Comments will be supported on synonyms with the following command:

COMMENT ON SYNONYM synonym_name IS comment_text

where synonym_name is a FQN of a synonym. To support addition of new database object types that can be referenced by synonyms a new system catalog, pg_synonym, is to be added, with an oid to support comments on synonyms, and the following schema:

synname      name   name of the synonym
synnamespace oid    OID of the namespace that contains the synonym
synclassid   oid    OID of the system catalog that contains the referenced object
synobjid     oid    OID of the referenced object

When resolving the synonym name, the usual search_path lookup rules apply, i.e. first, the object of the appropriate type is looked up in the schema, then the synonym, and afterwards the process iterates with the next schema from the search_path. Note that a table synonym with the same FQN as an existing table will be masked by that table.
To speed up the synonym name resolution a new syscache, SYNNAMENSPCLASS {synname, synnamespace, synclassid}, will be introduced. This cache will be accessed if the query to the RELNAMENSP syscache returns no result, with the DB object's catalog OID set to the pg_class OID. For table and view synonyms, INSERT/UPDATE/DELETE/SELECT will be supported. For sequences, SELECT will be supported. The commands will translate synonyms to the referenced database objects at the parser stage. All types of synonyms will be supported as table arguments/return value types, as well as actual values (i.e. currval/nextval will accept a sequence synonym). The following DDL will work transparently with table synonyms (sequences and views if the corresponding command applies to them): COPY, LOCK, TRUNCATE, EXPLAIN, EXPLAIN ANALYZE. The following DDL commands will cause an error when called for table (sequence, view) synonyms: ALTER {TABLE|VIEW|SEQUENCE}, ANALYZE, CLUSTER, COMMENT ON {TABLE | VIEW | SEQUENCE} .. IS, DROP {TABLE | VIEW | SEQUENCE}, GRANT, REVOKE, VACUUM. For these commands additional checks for synonyms will be introduced on a per-command basis. A dependency of the referenced object on a synonym will be added when adding a new synonym, to forbid removing a referenced object without removing the synonym first (without using CASCADE). On DROP SYNONYM the related dependency will be removed. -- Alexey Klyukin http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc
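The lookup rule in the proposal (real object first, then synonym, then on to the next schema in search_path) can be modeled in a few lines. This is an illustrative sketch with made-up data structures, not the proposed C implementation:

```python
# Toy model of the proposed resolution order: within each schema on the
# search_path, a real relation wins over a same-named synonym, and only
# if neither exists does the search move to the next schema.
def resolve(name, search_path, relations, synonyms):
    for schema in search_path:
        if (schema, name) in relations:
            return relations[(schema, name)]       # real object masks synonym
        if (schema, name) in synonyms:
            target = synonyms[(schema, name)]
            return relations[target]               # follow synonym to its target
    raise LookupError(name)

relations = {("public", "t"): "public.t", ("app", "orders"): "app.orders"}
synonyms = {("public", "orders"): ("app", "orders"),
            ("public", "t"): ("app", "orders")}    # masked by the real table
path = ["public", "app"]
print(resolve("orders", path, relations, synonyms))  # app.orders, via synonym
print(resolve("t", path, relations, synonyms))       # public.t, table wins
```

This also shows why Tom's objection below has teeth: every failed relation lookup now implies a second (synonym) lookup per schema, whether or not the database uses synonyms at all.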
Re: [HACKERS] crash-safe visibility map, take three
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 30.11.2010 17:38, Tom Lane wrote: Wouldn't it be easier and more robust to just consider VM bit changes to be part of the WAL-logged actions? That would include updating LSNs on VM pages and flushing VM pages to disk during checkpoint based on their LSN values. All of these other schemes seem too complicated and not provably correct. The vm bit can be set once all the tuples on the page become visible to everyone. There is no WAL-logged action at that point we could piggyback on. So you start emitting a WAL entry for the act of setting the VM bit (and I guess the page header hint bit would be included in that too). Yeah, I'm not terribly excited about any of these schemes. The intent record seems like the simplest one, but even that is so different from the traditional WAL-logging we do that it makes me slightly nervous. I'm not convinced it works at all. Consider: write intent record, checkpoint, set bit, crash before completing vacuum. There will be no second intent record at which you could clean up if things are inconsistent. regards, tom lane
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Tue, Nov 30, 2010 at 11:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: Unfortunately, the UPDATE case would be an order of magnitude harder (think inheritance trees where the children aren't all alike). I don't understand why there's anything more to this than sticking a Limit node either immediately above or immediately below the ModifyTable node. 1. You need to support ORDER BY too, otherwise I *will* be on the warpath against this as a foot-gun with no redeeming social value. Will you be wielding a Tom-ahawk? 2. So what you need is Sort underneath Limit underneath ModifyTable. Putting them above it would be quite the wrong semantics. OK. 3. This doesn't work tremendously well for inheritance trees, where ModifyTable acts as sort of an implicit Append node. You can't just funnel all the tuples through one Sort or Limit node because they aren't all the same rowtype. (Limit might perhaps not care, but Sort will.) But you can't have a separate Sort/Limit for each table either, because that would give the wrong behavior. Another problem with funneling all the rows through one Sort/Limit is that ModifyTable would need to know which table each row came from, so it can apply the modify to the right table. Could you possibly have ModifyTable -> Limit -> MergeAppend? Or I guess you could try to persuade us that DELETE/UPDATE with ORDER BY or LIMIT doesn't need to support inherited target tables. I wouldn't bet on that proposal flying either. I've spent enough time worrying about the fact that tables with inheritance children don't behave as nicely as those that don't to have any interest in going in the other direction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: The trivial solution to this is to WAL-log setting the visibility map bit, like we WAL-log any other operation. Lock the heap page, lock the visibility map page, write WAL-record, and release locks. That works, but the problem is that it creates quite a lot of new WAL traffic. How much is quite a lot? Do we have any real reason to think that this solution is unacceptable performance-wise? I'd also suggest that if you want to prevent torn-page syndrome on VM pages (and if you want to rely on their LSN values, you do) then you probably don't have any choice anyway. VM pages will have to adhere to the same write-full-page-on-first-mod-after-checkpoint rule as any other page. I'd guess that this will swamp any savings from cutesy schemes for reducing the number of WAL records. We could reduce the WAL traffic by simply updating multiple pages at a time. Lock N pages, lock the visibility map page, write one WAL record, and release locks. I don't think that will work, because you have to hold the lock on a page from the time you check that it's all-visible to the time you apply the update. The loss of concurrency against updates would be pretty bad, and I think you'd be creating significant risk of deadlocks from holding multiple buffer locks at once. regards, tom lane
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote: How much is quite a lot? Do we have any real reason to think that this solution is unacceptable performance-wise? Well, let's imagine a 1GB insert-only table. It has 128K pages. If you XLOG setting the bit on each page, you'll need to write 128K WAL records, each containing a 12-byte relfilenode and a 4-byte block offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL. But you did just dirty a gigabyte of data. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
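Checking the arithmetic in the message above (this verifies the quoted figures only; actual WAL records also carry per-record header overhead, as Heikki notes later in the thread):

```python
# 1 GB of 8 KB pages, with 16 bytes of WAL payload per page as stated.
BLCKSZ = 8192
pages = (1 << 30) // BLCKSZ        # 131072 pages ("128K pages")
payload = 12 + 4                   # relfilenode + block offset, per the message
print(pages)                       # 131072
print(pages * payload)             # 2097152 bytes = exactly 2 MB of WAL payload
```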
Re: [HACKERS] [GENERAL] column-level update privs + lock table
On Tue, Nov 30, 2010 at 9:07 AM, Josh Kupershmidt schmi...@gmail.com wrote: On Mon, Nov 29, 2010 at 10:06 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Nov 29, 2010 at 9:37 PM, Josh Kupershmidt schmi...@gmail.com wrote: I actually hadn't thought of that, for some reason. We used to similarly recommend that people handle TRUNCATE privileges with a security definer function. That doesn't mean GRANT TRUNCATE wasn't a sweet addition to 8.4. Hmm, glad you like it (I wrote that). I'm just asking how far we should go before we decide we're catering to use cases that are too narrow to warrant an extension of the permissions system. I am slightly opposed to adding GRANTs for LOCK TABLE, ANALYZE, VACUUM, etc. The GRANT help page is long enough already, and I doubt many users would use them, even though I might use GRANT LOCK TABLE myself. You'd really probably want GRANT LOCK TABLE (SHARE), GRANT LOCK TABLE (EXCLUSIVE), ... It'd be sort of cool, but it doesn't seem worth the complexity. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 11:22 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote: How much is quite a lot? Do we have any real reason to think that this solution is unacceptable performance-wise? Well, let's imagine a 1GB insert-only table. It has 128K pages. If you XLOG setting the bit on each page, you'll need to write 128K WAL records, each containing a 12-byte relfilenode and a 4-byte block offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL. But you did just dirty a gigabyte of data. Oh, but it's worse than that. When you XLOG a WAL record for each of those pages, you're going to trigger full-page writes for all of them. So now you've turned 1GB of data to write into 2+ GB of data to write. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
On 30.11.2010 18:22, Robert Haas wrote: On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote: How much is quite a lot? Do we have any real reason to think that this solution is unacceptable performance-wise? Well, let's imagine a 1GB insert-only table. It has 128K pages. If you XLOG setting the bit on each page, you'll need to write 128K WAL records, each containing a 12-byte relfilenode and a 4-byte block offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL. Plus WAL headers; I think it's something like 32 or 40 bytes of WAL per page. But you did just dirty a gigabyte of data. Good point. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Another proposal for table synonyms
Alexey Klyukin al...@commandprompt.com writes: To support addition of new database objects types that can be referenced by synonyms a new system catalog, pg_synonym, is to be added, with an oid to support comments on synonym, and the following schema: This is not going to work, at least not without making every type of lookup consult pg_synonym too, which I think can be considered DOA because of its performance impact on people who aren't even using the feature. It's also quite unclear how you prevent duplicate names if the synonyms are in their own catalog. (And no, the part of your proposal that says you're not preventing that isn't acceptable from a usability standpoint.) You could reasonably support synonyms for tables/views by storing them in pg_class with a new relkind. This doesn't cover synonyms for other object types, but since the total world demand for such a feature is approximately zero, that's not really a problem. regards, tom lane
Re: [HACKERS] crash-safe visibility map, take three
On 30.11.2010 18:10, Tom Lane wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Yeah, I'm not terribly excited about any of these schemes. The intent record seems like the simplest one, but even that is so different from the traditional WAL-logging we do that it makes me slightly nervous. I'm not convinced it works at all. Consider: write intent record, checkpoint, set bit, crash before completing vacuum. There will be no second intent record at which you could clean up if things are inconsistent. That's why you need to check the RedoRecPtr when you set the bit. If it has changed, ie. a checkpoint has happened, the set-bit step will write a new intent record. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] crash-safe visibility map, take three
Robert Haas robertmh...@gmail.com writes: Oh, but it's worse than that. When you XLOG a WAL record for each of those pages, you're going to trigger full-page writes for all of them. So now you've turned 1GB of data to write into 2+ GB of data to write. No, because only the first mod of each VM page would trigger a full page write, at least assuming a reasonable ordering of the operations. regards, tom lane
Re: [HACKERS] crash-safe visibility map, take three
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 30.11.2010 18:10, Tom Lane wrote: I'm not convinced it works at all. Consider write intent record, checkpoint, set bit, crash before completing vacuum. There will be no second intent record at which you could clean up if things are inconsistent. That's why you need to check the RedoRecPtr when you set the bit. If it has changed, ie. a checkpoint has happened, the set bit step will write a new intent record. Oh, you explained the proposal poorly then. I thought you meant recheck and write another intent record just once, immediately before sending the final xlog record. It still seems rickety and not clearly correct, especially when you start thinking about all the other constraints we have on xlog behavior (eg, does this work while taking a base backup). regards, tom lane
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: For the individual file sync times emitted under debug1, it would be very handy if the file being synced was identified, for example relation base/16384/16523, rather than being numbered sequentially within a given checkpoint.

I was numbering them sequentially so that it's straightforward to graph the sync times in an external analysis tool, but the relation data is helpful too. New patch reflecting all upthread suggestions is attached. The output looks like this now at DEBUG1:

LOG: checkpoint starting: xlog
DEBUG: checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec
DEBUG: checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=53.53 msec
DEBUG: checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
DEBUG: checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000 msec
DEBUG: checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
DEBUG: checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000 msec
DEBUG: checkpoint sync: number=9 file=base/16424/16442 time=7.92 msec
DEBUG: checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000 msec
DEBUG: checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
DEBUG: checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000 msec
DEBUG: checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
DEBUG: checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
DEBUG: checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000 msec
DEBUG: checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000 msec
LOG: checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s, total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s

I kept the units for the DEBUG level ones in msec because that's a better scale for the common really short syncs during good behavior. But the summary info in seconds now appears at the end of the existing checkpoint complete message, so only one line to parse for those looking to analyze the gross checkpoint data. That looks to work well enough for finding situations like the big ext3 spikes. You can easily see one in this example by the fact that longest=11.589 s is almost the entirety of sync=11.715 s. That's the really key thing there's currently no visibility into, that's made obvious with this patch.

This might be ready for some proper review now. I know there's at least one blatant bug still in here I haven't found yet, related to how the averages are computed. I saw this once:

LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s

After an immediate checkpoint, so at least one path not quite right yet.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede6ceb..6f6eb3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7063,10 +7063,15 @@ LogCheckpointEnd(bool restartpoint)
 {
     long        write_secs,
                 sync_secs,
-                total_secs;
+                total_secs,
+                longest_secs,
+                average_secs;
     int         write_usecs,
                 sync_usecs,
-                total_usecs;
+                total_usecs,
+                longest_usecs,
+                average_usecs;
+    double      average_sync_time;

     CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
@@ -7082,18 +7087,35 @@ LogCheckpointEnd(bool restartpoint)
                         CheckpointStats.ckpt_sync_end_t,
                         sync_secs, sync_usecs);
+
+    /*
+     * Timing values returned from CheckpointStats are in milliseconds.
+     * Convert to the second plus microsecond form that TimestampDifference
+     * returns for homogeneous printing.
+     */
+    longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+    longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - longest_secs * 1000);
+
+    average_sync_time = CheckpointStats.ckpt_longest_sync / CheckpointStats.ckpt_sync_rels;
+    average_secs = (long) (average_sync_time / 1000);
+    average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+
     if (restartpoint)
         elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
-             "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+             "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+             "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
              CheckpointStats.ckpt_bufs_written,
              (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
              write_secs, write_usecs / 1000,
              sync_secs, sync_usecs / 1000,
-             total_secs, total_usecs / 1000);
+             total_secs, total_usecs / 1000,
+
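For what it's worth, the garbled average in that last log line is consistent with two problems visible in the patch above: average_sync_time is computed from ckpt_longest_sync rather than the total sync time, and nothing guards against ckpt_sync_rels being zero (dividing by zero and casting the result to long produces the huge negative number seen). A guarded version of the computation, modeled here in Python with made-up names rather than the actual CheckpointStats fields, might look like:

```python
# Average per-file sync time in msec, guarded against an empty checkpoint
# and computed from the *total* sync time, not the longest single sync.
def average_sync_msec(total_sync_msec, sync_rels):
    if sync_rels == 0:
        return 0.0             # nothing was synced: report 0, not garbage
    return total_sync_msec / sync_rels

# With the numbers from the first log line above (16 files synced):
print(round(11589.549 / 16, 1))                  # 724.3: longest/rels, matching the logged average=0.724 s
print(round(average_sync_msec(11715.0, 16), 1))  # 732.2: average over the total sync time
```

The match between longest/rels and the logged 0.724 s is what points at ckpt_longest_sync being the wrong numerator.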
Re: [HACKERS] crash-safe visibility map, take three
Robert Haas robertmh...@gmail.com writes: That's definitely sucky, but in some ways it would be more complicated if they did, because I don't think all-visible on the master implies all-visible on the standby. Ouch. That seems like it could shoot down all these proposals. There definitely isn't any way to make VM crash-safe if there is no WAL-driven mechanism for setting the bits. I guess what we need is a way to delay the application of such a WAL record on the slave until it's safe, which means the record also has to carry some indication of the youngest XMIN on the page. regards, tom lane
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 11:33 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Oh, but it's worse than that. When you XLOG a WAL record for each of those pages, you're going to trigger full-page writes for all of them. So now you've turned 1GB of data to write into 2+ GB of data to write. No, because only the first mod of each VM page would trigger a full page write, at least assuming a reasonable ordering of the operations. I'm not worried about the full-page writes from updating the visibility map - I'm worried about the full-page writes from updating the heap. It doesn't matter a whit if we fail to set a bit in the visibility map. What matters is if we DO set the bit in the visibility map but FAIL TO set the bit in the heap, because then a subsequent update to the heap page won't check the visibility map and clear the bit. The *heap* updates are the ones that have to be guaranteed to make it to disk. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
On 30.11.2010 18:40, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: That's definitely sucky, but in some ways it would be more complicated if they did, because I don't think all-visible on the master implies all-visible on the standby. Ouch. That seems like it could shoot down all these proposals. There definitely isn't any way to make VM crash-safe if there is no WAL-driven mechanism for setting the bits. Note that this is only a problem for *hot* standby. After failover, all the tuples that were visible to everyone in the master are also visible to all new transactions in the standby. We dealt with this in 9.0 already, with the killed flag in index tuples and the PD_ALL_VISIBLE flag in heap scans. We simply don't believe them in hot standby mode, and check visibility even if the flag is set. I guess what we need is a way to delay the application of such a WAL record on the slave until it's safe, which means the record also has to carry some indication of the youngest XMIN on the page. Something like that would certainly be nice. With index-only scans, it can be a big disappointment if you can't do an index-only scan in hot standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] crash-safe visibility map, take three
On 30.11.2010 18:33, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: Oh, but it's worse than that. When you XLOG a WAL record for each of those pages, you're going to trigger full-page writes for all of them. So now you've turned 1GB of data to write into 2+ GB of data to write. No, because only the first mod of each VM page would trigger a full page write, at least assuming a reasonable ordering of the operations. If you change the LSN on the heap pages, you have to write full page images of those as well. Let's recap what happens when a VM bit is set: You set the PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it usually isn't), and then set the bit in the VM while keeping the heap page locked. Can we get away with not setting the LSN on the heap page, even though we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be flushed to disk before the WAL record, but I think that's fine because it's OK to have the flag set in the heap page even if the VM bit is not set. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
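The recap above boils down to a one-directional invariant: the heap page's PD_ALL_VISIBLE flag may safely reach disk before the VM bit does, but never the reverse. The following is an illustrative sketch only (hypothetical names, not PostgreSQL code) of why a crash between the two steps is harmless in one direction:

```python
# Minimal simulation of the invariant: "VM bit set" must imply "heap flag set".
# All names here are hypothetical; this is not the server's actual logic.

class State:
    def __init__(self, heap_flag=False, vm_bit=False):
        self.heap_flag = heap_flag  # PD_ALL_VISIBLE on the heap page
        self.vm_bit = vm_bit        # corresponding bit in the visibility map

    def consistent(self):
        # The dangerous combination is vm_bit set while heap_flag is clear:
        # a later heap update would then fail to clear the stale VM bit.
        return self.heap_flag or not self.vm_bit

def set_all_visible(state, crash_after_heap_flush=False):
    # Step 1: set the flag on the (locked) heap page.
    state.heap_flag = True
    # If we deliberately do not bump the heap page's LSN, the page may be
    # flushed to disk here, before the WAL record. Simulate a crash:
    if crash_after_heap_flush:
        return State(heap_flag=True, vm_bit=False)  # what survives on disk
    # Step 2: set the bit in the visibility map.
    state.vm_bit = True
    return state

survivor = set_all_visible(State(), crash_after_heap_flush=True)
assert survivor.consistent()                 # flag-without-bit: harmless
assert not State(False, True).consistent()   # bit-without-flag: broken
```

The simulation shows why Heikki's question is plausible: a crash can only ever leave the harmless combination behind.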
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: That's definitely sucky, but in some ways it would be more complicated if they did, because I don't think all-visible on the master implies all-visible on the standby. Ouch. That seems like it could shoot down all these proposals. There definitely isn't any way to make VM crash-safe if there is no WAL-driven mechanism for setting the bits. Heikki's intent method works fine, because the WAL record only clears the visibility map bits on redo; it never sets them. I guess what we need is a way to delay the application of such a WAL record on the slave until it's safe, which means the record also has to carry some indication of the youngest XMIN on the page. I'm unexcited about inventing more ways to delay XLOG application on the standby. We have enough of those already. We could actually allow the slave to set the visibility map bits based on its own xmin horizon. The only problem is that you wouldn't be able to write the intent XLOG records. I suppose you could have a separate file which is just used to store the intent records, and designate a range of very-high numbered LSNs to mean blocks of the intent file rather than a position in the regular WAL stream. VACUUM is so much fun on the master, let's have it on the standby too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Can we get away with not setting the LSN on the heap page, even though we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be flushed to disk before the WAL record, but I think that's fine because it's OK to have the flag set in the heap page even if the VM bit is not set. Why is that fine? It's certainly not fine from the standpoint of someone wondering why his index-only scan performs so badly. I think all this hair-splitting about cases where it's okay to have one bit set and not the other is misguided. To me, crash-safety of the VM means that its copy of the page-header bit is right. Period. Yes, it will cost something to ensure that; so what? If we don't get more than enough compensating performance gain from index-only scans, the whole patch is going to end up reverted. regards, tom lane
Re: [HACKERS] Another proposal for table synonyms
On Nov 30, 2010, at 6:28 PM, Tom Lane wrote: Alexey Klyukin al...@commandprompt.com writes: To support addition of new database object types that can be referenced by synonyms a new system catalog, pg_synonym, is to be added, with an oid to support comments on synonyms, and the following schema: This is not going to work, at least not without making every type of lookup consult pg_synonym too, which I think can be considered DOA because of its performance impact on people who aren't even using the feature. For those not using synonyms it would result in an extra syscache lookup for each schema from the search_path that doesn't contain the table with the specified name. If the table is specified with a FQN or contained in the first schema from the search_path no extra lookup would occur. Is it considered a big impact? The number of such lookups can be reduced if we traverse the search_path for the tables first, and then look for the synonyms, although that would change the lookup rules stated in this proposal. It's also quite unclear how you prevent duplicate names if the synonyms are in their own catalog. (And no, the part of your proposal that says you're not preventing that isn't acceptable from a usability standpoint.) What's wrong with the usability of that feature? The fact that a table with the same FQN as a synonym masks the latter can be clearly stated in the documentation. Are you expecting lots of people to name a synonym exactly the same as one of the database tables and wonder why the table and not the synonym gets accessed? As an alternative, a warning during table/synonym creation/renaming can be emitted if a name clash occurs. You could reasonably support synonyms for tables/views by storing them in pg_class with a new relkind. This doesn't cover synonyms for other object types, but since the total world demand for such a feature is approximately zero, that's not really a problem. 
I think that would almost guarantee that synonyms for other kinds of objects (e.g. databases; synonyms for those were requested in the past) would never be added. -- Alexey Klyukin http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc
Re: [HACKERS] crash-safe visibility map, take three
Robert Haas robertmh...@gmail.com writes: On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Ouch. That seems like it could shoot down all these proposals. There definitely isn't any way to make VM crash-safe if there is no WAL-driven mechanism for setting the bits. Heikki's intent method works fine, because the WAL record only clears the visibility map bits on redo; it never sets them. Uh, no, because he also had that final WAL record that would set the bits. We could actually allow the slave to set the visibility map bits based on its own xmin horizon. Not in a crash-safe way, which is exactly the problem here. regards, tom lane
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 11:49 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 30.11.2010 18:33, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: Oh, but it's worse than that. When you XLOG a WAL record for each of those pages, you're going to trigger full-page writes for all of them. So now you've turned 1GB of data to write into 2+ GB of data to write. No, because only the first mod of each VM page would trigger a full page write, at least assuming a reasonable ordering of the operations. If you change the LSN on the heap pages, you have to write full page images of those as well. Let's recap what happens when a VM bit is set: You set the PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it usually isn't), and then set the bit in the VM while keeping the heap page locked. Can we get away with not setting the LSN on the heap page, even though we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be flushed to disk before the WAL record, but I think that's fine because it's OK to have the flag set in the heap page even if the VM bit is not set. I don't immediately see why that wouldn't work. As long as you bump the LSN on the visibility map page, and also bump the LSN of the visibility map page every time you clear a bit, I think you should be OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 11:55 AM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Can we get away with not setting the LSN on the heap page, even though we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be flushed to disk before the WAL record, but I think that's fine because it's OK to have the flag set in the heap page even if the VM bit is not set. Why is that fine? It's certainly not fine from the standpoint of someone wondering why his index-only scan performs so badly. I think all this hair-splitting about cases where it's okay to have one bit set and not the other is misguided. To me, crash-safety of the VM means that its copy of the page-header bit is right. Period. Yes, it will cost something to ensure that; so what? If we don't get more than enough compensating performance gain from index-only scans, the whole patch is going to end up reverted. We're not going to double the cost of VACUUM to get index-only scans. And that's exactly what will happen if you do full-page writes of every heap page to set a single bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 11:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Ouch. That seems like it could shoot down all these proposals. There definitely isn't any way to make VM crash-safe if there is no WAL-driven mechanism for setting the bits. Heikki's intent method works fine, because the WAL record only clears the visibility map bits on redo; it never sets them. Uh, no, because he also had that final WAL record that would set the bits. Well, as already discussed upthread, that WAL record causes some other problems, so make it Heikki's intent method, without the final WAL record that breaks things. We could actually allow the slave to set the visibility map bits based on its own xmin horizon. Not in a crash-safe way, which is exactly the problem here. Brilliant selective quoting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Another proposal for table synonyms
Alexey Klyukin al...@commandprompt.com writes: On Nov 30, 2010, at 6:28 PM, Tom Lane wrote: This is not going to work, at least not without making every type of lookup consult pg_synonym too, which I think can be considered DOA because of its performance impact on people who aren't even using the feature. For those not using synonyms it would result in an extra syscache lookup for each schema from the search_path that doesn't contain the table with the specified name. If the table is specified with a FQN or contained in the first schema from the search_path no extra lookup would occur. Is it considered a big impact? Yes. It'll be slow and it will render code that's already unreasonably complicated into an unreadable morass. We are not going there. (Just to be clear, it's not the table search case I'm worried about; it's operator/function lookup that I think this would be completely unacceptable for. And if you're only going to support table/view synonyms then you might as well put them in pg_class.) I think that would almost guarantee that synonyms for other kinds of objects (e.g. databases; synonyms for those were requested in the past) would never be added. That's fine with me. regards, tom lane
Re: [HACKERS] crash-safe visibility map, take three
Robert Haas robertmh...@gmail.com writes: We're not going to double the cost of VACUUM to get index-only scans. And that's exactly what will happen if you do full-page writes of every heap page to set a single bit. It's ridiculous to claim that that doubles the cost of VACUUM. In the worst case, it will add 25% to the cost of setting an all-visible bit on a page where there is no other work to do. (You already are writing out the heap page and the VM page, plus a WAL image of the heap page, so a WAL image of the VM page adds 25%. But only if you did not set any other bits on the same VM page, which is probably not a real common case.) Given that VACUUM has a lot of other cleanup besides visibility bit setting, I'm not convinced that this would even be noticeable. I think the burden is on people who are proposing complicated mechanisms to show that there's actually a strong need for them. regards, tom lane
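Read literally, the 25% figure counts the extra VM-page image as one write out of four. A back-of-the-envelope check, under the simplifying assumption that each of the four writes is roughly one page (WAL images are actually page-sized plus record headers):

```python
# Hypothetical arithmetic check of the cost figure above; the variable names
# are illustrative, not from any PostgreSQL source.
writes_already = 3                       # heap page, VM page, heap-page FPI
writes_with_vm_fpi = writes_already + 1  # plus a WAL image of the VM page

share_of_total = 1 / writes_with_vm_fpi       # new write is 1/4 of the total
increase_over_baseline = 1 / writes_already   # or ~33% on top of the old cost

assert share_of_total == 0.25
```

Whether one calls that 25% (share of the new total) or ~33% (increase over the previous total) depends on the baseline chosen; either way it only applies when no other bit on the same VM page was set since the last checkpoint.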
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: We're not going to double the cost of VACUUM to get index-only scans. And that's exactly what will happen if you do full-page writes of every heap page to set a single bit. It's ridiculous to claim that that doubles the cost of VACUUM. In the worst case, it will add 25% to the cost of setting an all-visible bit on a page where there is no other work to do. (You already are writing out the heap page and the VM page, True. plus a WAL image of the heap page, so a False. That is exactly what we are NOT doing now and what we must find a way to avoid doing. WAL image of the VM page adds 25%. But only if you did not set any other bits on the same VM page, which is probably not a real common case.) Given that VACUUM has a lot of other cleanup besides visibility bit setting, I'm not convinced that this would even be noticeable. I think the burden is on people who are proposing complicated mechanisms to show that there's actually a strong need for them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
Robert Haas robertmh...@gmail.com writes: On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote: It's ridiculous to claim that that doubles the cost of VACUUM. In the worst case, it will add 25% to the cost of setting an all-visible bit on a page where there is no other work to do. (You already are writing out the heap page and the VM page, True. plus a WAL image of the heap page, so a False. That is exactly what we are NOT doing now and what we must find a way to avoid doing. I do not accept that argument. You can't make an omelette without breaking eggs, and the cost of index-only scans is going to be that it costs more to get the visibility bits set in the first place. But having said that, I wonder whether we need a full-page image for a WAL-logged action that is known to involve only setting a single bit and updating LSN. Would omitting the FPI be any more risky than what happens now (ie, the page does get written back to disk at some point, without any image from which it can be rewritten if the write fails...) regards, tom lane
Re: [HACKERS] crash-safe visibility map, take three
On 30.11.2010 19:22, Tom Lane wrote: But having said that, I wonder whether we need a full-page image for a WAL-logged action that is known to involve only setting a single bit and updating LSN. Would omitting the FPI be any more risky than what happens now (ie, the page does get written back to disk at some point, without any image from which it can be rewritten if the write fails...) You have to write a full-page image if you update the LSN, because otherwise the next update that comes along will not write a full page image. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
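Heikki's generation-side objection can be sketched in a few lines: an updater decides whether to take a full-page image by comparing the page's LSN against the last checkpoint's redo pointer, so bumping the LSN without writing an FPI silently disables torn-page protection for the next update. This is a hypothetical simulation, not the server's actual code:

```python
# Illustrative sketch of the FPI decision rule referenced above.
# Names and values are hypothetical.

redo_rec_ptr = 100  # LSN of the last checkpoint's redo pointer

def needs_fpi(page_lsn):
    # The first WAL-logged change to a page after a checkpoint must carry
    # a full-page image; the page LSN is how we know it's the first.
    return page_lsn <= redo_rec_ptr

# Normal case: page untouched since the checkpoint, so the update takes an FPI.
assert needs_fpi(page_lsn=90)

# Broken case: a bit-set bumped the LSN to 150 *without* writing an FPI.
# The next real update now skips its FPI, so a torn write of this page
# could not be repaired during crash recovery.
assert not needs_fpi(page_lsn=150)
```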
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: But having said that, I wonder whether we need a full-page image for a WAL-logged action that is known to involve only setting a single bit and updating LSN. Would omitting the FPI be any more risky than what happens now (ie, the page does get written back to disk at some point, without any image from which it can be rewritten if the write fails...) That's pretty much exactly what Heikki proposed 35 minutes ago, and you objected 6 minutes later. I still think it might work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
On Tue, Nov 30, 2010 at 12:25 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: But having said that, I wonder whether we need a full-page image for a WAL-logged action that is known to involve only setting a single bit and updating LSN. Would omitting the FPI be any more risky than what happens now (ie, the page does get written back to disk at some point, without any image from which it can be rewritten if the write fails...) That's pretty much exactly what Heikki proposed 35 minutes ago, and you objected 6 minutes later. I still think it might work. Oh, I see the difference now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take three
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 30.11.2010 19:22, Tom Lane wrote: But having said that, I wonder whether we need a full-page image for a WAL-logged action that is known to involve only setting a single bit and updating LSN. You have to write a full-page image if you update the LSN, because otherwise the next update that comes along will not write a full page image. Um. Drat. I was thinking about the replay side, where I think it would actually work --- but you're right, it would break the logic on the generation side. Unless you want to put in some kind of flag saying this was only a visibility bit update, any bigger update still needs to write an FPI. regards, tom lane
Re: [HACKERS] [GENERAL] column-level update privs + lock table
On Mon, 2010-11-29 at 21:37 -0500, Josh Kupershmidt wrote: I still see little reason to make LOCK TABLE permissions different for column-level vs. table-level UPDATE privileges Agreed. This is the crux of the debate. Why should this inconsistency be allowed to continue? Are there covert channel issues here, KaiGai? -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] DELETE with LIMIT (or my first hack)
3. This doesn't work tremendously well for inheritance trees, where ModifyTable acts as sort of an implicit Append node. You can't just funnel all the tuples through one Sort or Limit node because they aren't all the same rowtype. (Limit might perhaps not care, but Sort will.) But you can't have a separate Sort/Limit for each table either, because that would give the wrong behavior. Another problem with funneling all the rows through one Sort/Limit is that ModifyTable did need to know which table each row came from, so it can apply the modify to the right table. So I guess I chose the wrong hack to start with. Just out of curiosity, why isn't the result of the WHERE filter (in SELECT/DELETE/UPDATE) put in memory, i.e. an array of ctids, like a buffer, and then processed by SELECT/DELETE/UPDATE at once? Greets, -- Daniel Loureiro
Re: [HACKERS] [GENERAL] column-level update privs + lock table
On Tue, Nov 30, 2010 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote: On Mon, 2010-11-29 at 21:37 -0500, Josh Kupershmidt wrote: I still see little reason to make LOCK TABLE permissions different for column-level vs. table-level UPDATE privileges Agreed. This is the crux of the debate. Why should this inconsistency be allowed to continue? Well, a user with full-table UPDATE privileges can trash the whole thing, so, from a security perspective, letting them lock it is only allowing them to deny access to data they could have just as easily trashed. A user with only single-column UPDATE privileges cannot trash the whole table, though, so you are allowing them to deny read access to data they may not themselves have permission either to read or to update. Admittedly, this seems a bit more rickety in light of your point that they can still lock all the rows... but then that only stops writes, not reads. I'm less convinced that I'm right about this than I was 3 days ago. But I'm still not convinced that the above argument is wrong, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] DELETE with LIMIT (or my first hack)
To me the key is security - it's an anti-DBA-with-lack-of-attention feature. If I forget the WHERE clause, I will delete some valid tuples and mess up the db, but that's less bad than wiping out the whole table. A DBA who never forgot a WHERE in a DELETE is not a DBA. Just kidding, but this happens often enough. Is there another option to implement this? Is it possible to do it as a plugin/extension (in a Firefox browser style)? Sds, -- Daniel Loureiro -- 2010/11/30 Andrew Dunstan and...@dunslane.net On 11/30/2010 09:57 AM, Csaba Nagy wrote: So it is really an ideological thing and not lack of demand or implementation attempts... I for myself can't write working C code anyway, so I got my peace with the workaround - I wish you good luck arguing Tom :-) We need a convincing use case for it. So far the only one that's seemed at all convincing to me is the one about deleting in batches. But that might be enough. As for it being illogical, I don't think it's any more so than DELETE FROM foo WHERE random() < 0.1; and you can do that today. cheers andrew
Re: [HACKERS] Instrument checkpoint sync calls
2010/11/30 Greg Smith g...@2ndquadrant.com: Jeff Janes wrote: For the individual file sync times emitted under debug1, it would be very handy if the file being synced was identified, for example relation base/16384/16523. Rather than being numbered sequentially within a given checkpoint. I was numbering them sequentially so that it's straightforward to graph the sync times in an external analysis tool, but the relation data is helpful too. New patch reflecting all upthread suggestions is attached. The output looks like this now at DEBUG1: LOG: checkpoint starting: xlog DEBUG: checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec DEBUG: checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=53.53 msec DEBUG: checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec DEBUG: checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000 msec DEBUG: checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec DEBUG: checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000 msec DEBUG: checkpoint sync: number=9 file=base/16424/16442 time=7.92 msec DEBUG: checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000 msec DEBUG: checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec DEBUG: checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000 msec DEBUG: checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec DEBUG: checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec DEBUG: checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000 msec DEBUG: checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000 msec LOG: checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s, total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s I 
kept the units for the DEBUG level ones in msec because that's a better scale for the common really short syncs during good behavior. But the summary info in seconds now appears at the end of the existing checkpoint complete message, so only one line to parse for those looking to analyze the gross checkpoint data. That looks to work well enough for finding situations like the big ext3 spikes. You can easily see one in this example by the fact that longest=11.589 s is almost the entirety of sync=11.715 s. That's the really key thing there's currently no visibility into, that's made obvious with this patch. Wonderful. This might be ready for some proper review now. I know there's at least one blatant bug still in here I haven't found yet, related to how the averages are computed. I saw this once: LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s After an immediate checkpoint, so at least one path not quite right yet. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
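The bogus "average=-9223372036854775808..." value reported above is the kind of output one would expect from dividing by a zero file count (or formatting uninitialized counters) after a checkpoint that synced no files. A hedged sketch of the obvious guard, in illustrative Python rather than the patch's actual C:

```python
# Hypothetical sketch (not the patch's code) of guarding the per-checkpoint
# average when sync_files == 0, as after an immediate checkpoint.

def format_average(total_sync_usec, sync_files):
    if sync_files == 0:
        return 0.0  # nothing was synced: report zero instead of dividing
    return total_sync_usec / sync_files / 1_000_000.0  # usec -> seconds

assert format_average(0, 0) == 0.0
# Using the first two sync times from the sample log (in usec):
assert format_average(11_589_549 + 16_148, 2) > 5.0
```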
Re: [HACKERS] [GENERAL] column-level update privs + lock table
Robert Haas robertmh...@gmail.com writes: Well, a user with full-table UPDATE privileges can trash the whole thing, so, from a security perspective, letting them lock is only allowing them to deny access to data they could have just as easily trashed. A user with only single-column UPDATE privileges cannot trash the whole table, though, so you are allowing them to deny read access to data they may not themselves have permission either to read or to update. Admittedly, this seems a bit more rickety in light of your point that they can still lock all the rows... but then that only stops writes, not reads. I'm less convinced that I'm right about this than I was 3 days ago. But I'm still not convinced that the above argument is wrong, either. I think what your complaint really boils down to is that LOCK TABLE is granting excessive permissions to someone who only has table-level UPDATE privilege. If we were designing that from scratch today, I am sure we'd have made it tighter, say that UPDATE alone wouldn't give you more than RowExclusive lock. However, given the lack of complaints about this from the field, I can't get very excited about a non-backward-compatible change to tighten LOCK's privilege checking. I find the argument that column-level update should give weaker locking privileges than full-table update to be pretty thin. That isn't the case at the row level; why should it be true at the table level? However, the other side of the lack of complaints argument is that few people seem to care about whether LOCK TABLE responds to column level privileges, either. AFAICS this patch is not in response to any user request but just because Josh thought things were inconsistent. Which they are, but at a deeper level than this. If we apply this patch, then we'll be expanding the set of cases where LOCK is granting privilege too freely, and thus creating more not less backwards-compatibility problem if we are ever to make it saner. 
On the whole I agree with Robert --- let's just adjust the documentation, and not enlarge privileges in a way we might regret later. regards, tom lane
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Daniel Loureiro dan...@termasa.com.br wrote: to me the key its security - its a anti-DBA-with-lack-of-attention feature. Well, it seems pretty weak to me for that purpose. You still trash data, and you don't have any immediate clue as to what. If you wanted protection from that you'd want more of an assert limit that would fail if the affected row count was above what you specified. For me the best solution is to develop good habits. I first type my statement as SELECT * FROM ... and after reviewing the results arrow up and replace SELECT * with DELETE. If there's enough volatility or complexity to make that insufficient insurance, I begin a transaction. That way I can not only review row counts but run queries against the modified data to confirm correct modification before issuing a COMMIT (or ROLLBACK). The batching of updates so that vacuums can make space available for re-use is more compelling to me, but still pretty iffy, since the work-arounds aren't that hard to find. -Kevin
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On 11/30/2010 02:12 PM, Kevin Grittner wrote: Daniel Loureiro dan...@termasa.com.br wrote: to me the key its security - its a anti-DBA-with-lack-of-attention feature. Well, it seems pretty weak to me for that purpose. You still trash data, and you don't have any immediate clue as to what. I agree, that argument is completely misconceived. If the DBA is paying enough attention to use LIMIT, s/he should be paying enough attention not to do damage in the first place. If that were the only argument in its favor I'd be completely against the feature. cheers andrew
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On 11/30/2010 02:12 PM, Kevin Grittner wrote: Daniel Loureiro dan...@termasa.com.br wrote: to me the key its security - its a anti-DBA-with-lack-of-attention feature. Well, it seems pretty weak to me for that purpose. You still trash data, and you don't have any immediate clue as to what. I agree, that argument is completely misconceived. If the DBA is paying enough attention to use LIMIT, s/he should be paying enough attention not to do damage in the first place. If that were the only argument in its favor I'd be completely against the feature. I don't buy the argument either; why would you put a LIMIT there and delete one row by accident when you could put a BEGIN; in front and not do any damage at all? Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] profiling connection overhead
On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote: Rolling in calloc in place of malloc/memset made no particular difference either, which says that Fedora 13's glibc does not have any optimization for that case as I'd hoped. glibc's calloc is either mmap of /dev/zero or malloc followed by memset. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote: 3. This doesn't work tremendously well for inheritance trees, where ModifyTable acts as sort of an implicit Append node. You can't just funnel all the tuples through one Sort or Limit node because they aren't all the same rowtype. (Limit might perhaps not care, but Sort will.) But you can't have a separate Sort/Limit for each table either, because that would give the wrong behavior. Another problem with funneling all the rows through one Sort/Limit is that ModifyTable would need to know which table each row came from, so it can apply the modify to the right table. Could you possibly have ModifyTable - Limit - MergeAppend? Before MergeAppend knows which tuple to produce, it needs to see the tuples (at least the first one from each of its children), meaning that it needs to pull them through ModifyTable; and at that point it's already too late. Also, assuming LIMIT K, MergeAppend will have N children, meaning N limits, meaning an effective limit of K*N rather than K. Can you be a little more specific about what you mean? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
While reading this thread, I thought of two things I think we could do if this feature was implemented: 1. Sort large UPDATE/DELETEs so it is done in heap order This is actually a TODO item. I imagine it would be possible to do something like: DELETE FROM foo USING (...) ORDER BY ctid; with this patch to help this case. 2. Reducing deadlocks in big UPDATE/DELETEs One problem that sometimes occurs when doing multiple multi-row UPDATEs or DELETEs concurrently is that the transactions end up working on the same rows, but in a different order. One could use an ORDER BY clause to make sure the transactions don't deadlock. Thoughts? Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
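The deadlock-avoidance idea in point 2 can already be sketched with today's tools by taking the row locks in a consistent key order before the destructive statement itself. This is only an illustration of the locking-order principle, not the proposed feature; the table and column names are invented:

```sql
-- Two sessions each update an overlapping set of rows. Without a
-- consistent order, session A may lock id=1 then wait on id=2 while
-- session B holds id=2 and waits on id=1: a deadlock.
BEGIN;
-- Acquire the row locks in id order first, in every session:
SELECT 1 FROM accounts
 WHERE id IN (1, 2, 3)
 ORDER BY id
 FOR UPDATE;
-- Now the actual update cannot deadlock against a peer doing the same:
UPDATE accounts SET balance = balance - 10 WHERE id IN (1, 2, 3);
COMMIT;
```

An ORDER BY clause on UPDATE/DELETE itself would let the lock ordering happen in one statement instead of needing the explicit SELECT ... FOR UPDATE step.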
[HACKERS] KNNGIST next step: adjusting indexAM API
In the current KNNGIST patch, the indexable ORDER BY clauses are transmitted to the executor by cramming them in with the index qual conditions (the IndexScan plan node's indexqual list), from whence they become part of the ScanKey array passed to the index AM. Robert complained that this was an ingenious way to minimize the number of lines touched by the patch but utterly ugly from any other standpoint, and I quite agree. An ORDER BY clause is a completely different thing from a WHERE qual, so mixing them together doesn't seem like a good idea. However, if we hold to that principle then we need to modify the indexAM API to pass the ordering operators separately. This is no big deal as far as the built-in AMs are concerned, particularly because 3 of the 4 need only assert that the additional list is empty. The only reason it would be a problem is if there were third-party index AMs that would be affected to a larger degree; but I don't know of any. Does anyone have an objection to that? (Another thing that might be worth changing, as long as we have to touch the beginscan and rescan APIs anyway, is to refactor the handling of the initial set of scan keys. It never made any sense to me for RelationGetIndexScan to call index_rescan: that seems to accomplish little except making it difficult for AM beginscan routines to do things in a sane order. I'm inclined to take that out and let the AM call rescan internally if it wants to.) Lastly, I'm pretty un-thrilled with the way that the KNNGIST patch implements the interface to the opclass-specific hook functions. Seems like it would be cleaner to leave the Consistent function alone and invent a new, separate hook function for processing ORDER BY. Is there a strong reason for having both things done in one call, or was that just done as a byproduct of trying to cram all the data into one ScanKey array? 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix for seg picksplit function
On 2010-11-16 09:57, Alexander Korotkov wrote: On Tue, Nov 16, 2010 at 3:07 AM, Robert Haas robertmh...@gmail.com wrote: The loop that begins here: for (i = 0; i < maxoff; i++) { /* First half of segs goes to the left datum. */ if (i < seed_2) ...looks like it should perhaps be broken into two separate loops. That might also help tweak the logic in a way that eliminates this: seg.c: In function ‘gseg_picksplit’: seg.c:327: warning: ‘datum_r’ may be used uninitialized in this function seg.c:326: warning: ‘datum_l’ may be used uninitialized in this function I restored the original version of that loop. But on a broader note, I'm not very certain the sorting algorithm is sensible. For example, suppose you have 10 segments that are exactly '0' and 20 segments that are exactly '1'. Maybe I'm misunderstanding, but it seems like this will result in a 15/15 split when we almost certainly want a 10/20 split. I think there will be problems in more complex cases as well. The documentation says about the less-than and greater-than operators that These operators do not make a lot of sense for any practical purpose but sorting. I think almost any split algorithm has corner cases where its results don't look very good. I think the way to understand the significance of these corner cases for real life is to perform sufficient testing on datasets which are close to real life. I don't feel able to propose enough test datasets and estimate their significance for real-life cases myself, and I need help in this field. 
I think it is time to mark this patch ready for committer: The unintuitive result thus far is that sorting outperforms the R-tree bounding boxes style index, as Alexander has demonstrated with several different distributions on 20-11 (uniform, natural (is that a bell curve?), many distinct values) My personal opinion is that I like the single loop for walking over the sort array (aka gbt_num_picksplit) more than the two different ones, but I'm in the minority here. Two remarks on this patch also apply to other picksplit functions currently around: 1) the *first = *last = FirstOffsetNumber assignment, as noted by Alvaro, is not necessary anymore, and Oleg confirmed this is true since PostgreSQL 7.x. 2) loops over something other than the entryvector better not use FirstOffsetNumber, OffsetNumberNext, as indicated by Tom. If this patch is committed, it might be an idea to change the other occurrences as well. regards, Yeb Havinga
Re: [HACKERS] Another proposal for table synonyms
Alexey, Here is the proposal to add synonyms to PostgreSQL. Initial goal is to add synonyms for relations (tables, views, sequences) and an infrastructure to allow synonyms for other database objects in the future. Can you explain, for our benefit, the use case for this? Specifically, what can be done with synonyms which can't be done with search_path and VIEWs? I ask partly because I've migrated some Oracle databases to PostgreSQL, and did not find replacing the functionality of synonyms to be at all difficult. Presumably you've run into a case which was difficult? BTW, I have a specific use case for *column* synonyms which isn't currently covered by our existing tools. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Where are we on Standby Promotion?
Fujii, Simon, Greg, etc.: Has anyone submitted or committed a patch to make Standby Promotion* easier, at this point? We discussed it earlier in the dev cycle, but I can't find anything which has actually been submitted. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com * that is, promotion to be a new master of other existing standbys -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Instrument checkpoint sync calls
On Tue, Nov 30, 2010 at 8:38 AM, Greg Smith g...@2ndquadrant.com wrote: Hi Greg, Thanks for the update. This might be ready for some proper review now. I know there's at least one blatant bug still in here I haven't found yet, related to how the averages are computed. Yes, the blatant bug: average_sync_time = CheckpointStats.ckpt_longest_sync / CheckpointStats.ckpt_sync_rels; That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync. I saw this once: LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s After an immediate checkpoint, so at least one path not quite right yet. Not clear what the right thing to do here is. I guess we should special case the div by zero to yield zero for the average? The patch is in unified diff rather than context diff. Doesn't bother me, but the wiki on doing a review says... Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote: On 11/30/2010 02:12 PM, Kevin Grittner wrote: Daniel Loureiro dan...@termasa.com.br wrote: to me the key its security - its a anti-DBA-with-lack-of-attention feature. Well, it seems pretty weak to me for that purpose. You still trash data, and you don't have any immediate clue as to what. I agree, that argument is completely misconceived. If the DBA is paying enough attention to use LIMIT, s/he should be paying enough attention not to do damage in the first place. If that were the only argument in its favor I'd be completely against the feature. I don't buy the argument either; why would you put a LIMIT there and delete one row by accident when you could put a BEGIN; in front and not do any damage at all? Because the delete of the whole table may take awfully long? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
Ron Mayer wrote: Might smoother checkpoints be better solved by talking to the OS vendors virtual-memory-tuning-knob-authors to work with them on exposing the ideal knobs; rather than saying that our only tool is a hammer(fsync) so the problem must be handled as a nail. Maybe, but it's hard to argue that the current implementation--just doing all of the sync calls as fast as possible, one after the other--is going to produce worst-case behavior in a lot of situations. Given that it's not a huge amount of code to do better, I'd rather do some work in that direction, instead of presuming the kernel authors will ever make this go away. Spreading the writes out as part of the checkpoint rework in 8.3 worked better than any kernel changes I've tested since then, and I'm not real optimistic about this getting resolved at the system level. So long as the database changes aren't antagonistic toward kernel improvements, I'd prefer to have some options here that become effective as soon as the database code is done. I've attached an updated version of the initial sync spreading patch here, one that applies cleanly on top of HEAD and over top of the sync instrumentation patch too. The conflict that made that hard before is gone now. Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the time now has made me reconsider how important one potential bit of refactoring here would be. 
I managed to catch one of the situations where really popular relations were being heavily updated in a way that was competing with the checkpoint on my test system (which I can happily share the logs of), with the instrumentation patch applied but not the spread sync one:

LOG: checkpoint starting: xlog
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 7747 of relation base/16424/16442
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 42688 of relation base/16424/16437
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 9723 of relation base/16424/16442
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 58117 of relation base/16424/16437
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]
DEBUG: checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.83 msec
DEBUG: checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG: checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG: checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG: checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG: checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG: checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
DEBUG: checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG: checkpoint sync: number=11 file=base/16424/16425 time=0.00 msec
DEBUG: checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG: checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG: checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s

Note here how the checkpoint was hung on trying to get 16448_fsm written out, but the backends were issuing constant competing fsync calls to these other relations. This is very similar to the production case this patch was written to address, which I hadn't been able to share a good example of yet. That's essentially what it looks like, except with the contention going on for minutes instead of seconds. One of the ideas Simon and I had been considering at one point was adding some better de-duplication logic to the fsync absorb code, which I'm reminded by the pattern here might be helpful independently of other improvements. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB 1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Tue, Nov 30, 2010 at 9:24 PM, Marko Tiikkaja marko.tiikk...@cs.helsinki.fi wrote: On 11/30/2010 02:12 PM, Kevin Grittner wrote: Daniel Loureiro dan...@termasa.com.br wrote: to me the key its security - its a anti-DBA-with-lack-of-attention feature. Well, it seems pretty weak to me for that purpose. You still trash data, and you don't have any immediate clue as to what. I agree, that argument is completely misconceived. If the DBA is paying enough attention to use LIMIT, s/he should be paying enough attention not to do damage in the first place. If that were the only argument in its favor I'd be completely against the feature. I don't buy the argument either; why would you put a LIMIT there and delete one row by accident when you could put a BEGIN; in front and not do any damage at all? It is valuable as a DBA carelessness/typo catcher only if it is imposed by default (in line with Kevin's point), and only if it rolls back rather than reduces the number of affected rows (as per Marko). We have implemented a damage limitation solution similar to this with triggers on an MSSQL database, and it has worked for the specific environment it's in. The safety net is basically that the DBA has to set an environment variable before a very large delete or update operation. If the operation is recognised as being beyond the threshold size the environment variable is checked - if it is set the transaction passes and the variable is reset, if not the transaction is rolled back. It should be possible to implement something along these lines in triggers, all that would be needed is a structure for defining the (optional) limits on potentially destructive operations. More flexible options or options based on the number of rows in a table will rapidly increase the performance impact of the triggers - but may make them more useful. 
I'm not sure if there is a way to persist data (like a row count) between per row triggers so that the operation could be aborted at the limit rather than only once all the rows had been updated (potentially a big performance gain). Alastair Bell Turner Technical Lead ^F5 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
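On persisting a row count between per-row trigger firings: one PostgreSQL-side option is a temporary table shared by all firings in the session, which lets the statement abort as soon as the limit is crossed rather than after every row has been touched. A sketch under that assumption, with all object names invented:

```sql
-- Per-session budget, kept in a temp table so every per-row firing
-- of the trigger can see and update it.
CREATE TEMP TABLE delete_guard (remaining int);
INSERT INTO delete_guard VALUES (1000);  -- allow at most 1000 deleted rows

CREATE FUNCTION guard_delete() RETURNS trigger AS $$
BEGIN
    UPDATE delete_guard SET remaining = remaining - 1;
    IF (SELECT remaining FROM delete_guard) < 0 THEN
        -- Aborts the whole statement (and transaction) at the limit.
        RAISE EXCEPTION 'DELETE touched more rows than allowed';
    END IF;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER big_delete_guard
    BEFORE DELETE ON some_table
    FOR EACH ROW EXECUTE PROCEDURE guard_delete();
```

The budget would need to be topped up before each intentionally large delete, which mirrors the set-the-environment-variable step described above; the per-row UPDATE on the guard table is exactly the performance cost the mail warns about.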
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On 11/30/2010 03:16 PM, Andres Freund wrote: On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote: On 11/30/2010 02:12 PM, Kevin Grittner wrote: Daniel Loureiro dan...@termasa.com.br wrote: to me the key its security - its a anti-DBA-with-lack-of-attention feature. Well, it seems pretty weak to me for that purpose. You still trash data, and you don't have any immediate clue as to what. I agree, that argument is completely misconceived. If the DBA is paying enough attention to use LIMIT, s/he should be paying enough attention not to do damage in the first place. If that were the only argument in its favor I'd be completely against the feature. I don't buy the argument either; why would you put a LIMIT there and delete one row by accident when you could put a BEGIN; in front and not do any damage at all? Because the delete of the whole table may take awfully long? I don't see that that has anything to do with restricting damage. LIMIT might be useful for the reason you give, but not as any sort of protection against DBA carelessness. That's what the discussion above is about. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] profiling connection overhead
Peter Eisentraut pete...@gmx.net writes: On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote: Rolling in calloc in place of malloc/memset made no particular difference either, which says that Fedora 13's glibc does not have any optimization for that case as I'd hoped. glibc's calloc is either mmap of /dev/zero or malloc followed by memset. Hmm. I would have expected to see a difference then. Do you know what conditions are needed to cause the mmap to be used? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Tue, Nov 30, 2010 at 2:45 PM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote: 3. This doesn't work tremendously well for inheritance trees, where ModifyTable acts as sort of an implicit Append node. You can't just funnel all the tuples through one Sort or Limit node because they aren't all the same rowtype. (Limit might perhaps not care, but Sort will.) But you can't have a separate Sort/Limit for each table either, because that would give the wrong behavior. Another problem with funneling all the rows through one Sort/Limit is that ModifyTable would need to know which table each row came from, so it can apply the modify to the right table. Could you possibly have ModifyTable - Limit - MergeAppend? Before MergeAppend knows which tuple to produce, it needs to see the tuples (at least the first one from each of its children), meaning that it needs to pull them through ModifyTable; and at that point it's already too late. Also, assuming LIMIT K, MergeAppend will have N children, meaning N limits, meaning an effective limit of K*N rather than K. Can you be a little more specific about what you mean? You seem to be imagining the MergeAppend node on top, but I had it in the other order in my mind. The ModifyTable node would be the outermost plan node, pulling from the Limit, which would deliver the first n table rows from the MergeAppend, which would be responsible for getting it from the various child tables. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
Maybe, but it's hard to argue that the current implementation--just doing all of the sync calls as fast as possible, one after the other--is going to produce worst-case behavior in a lot of situations. Given that it's not a huge amount of code to do better, I'd rather do some work in that direction, instead of presuming the kernel authors will ever make this go away. Spreading the writes out as part of the checkpoint rework in 8.3 worked better than any kernel changes I've tested since then, and I'm not real optimistic about this getting resolved at the system level. So long as the database changes aren't antagonistic toward kernel improvements, I'd prefer to have some options here that become effective as soon as the database code is done. Besides, even if kernel/FS authors did improve things, the improvements would not be available on production platforms for years. And, for that matter, while Linux and BSD are pretty responsive to our feedback, Apple, Microsoft and Oracle are most definitely not. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] KNNGIST next step: adjusting indexAM API
On Tue, Nov 30, 2010 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: In the current KNNGIST patch, the indexable ORDER BY clauses are transmitted to the executor by cramming them in with the index qual conditions (the IndexScan plan node's indexqual list), from whence they become part of the ScanKey array passed to the index AM. Robert complained that this was an ingenious way to minimize the number of lines touched by the patch but utterly ugly from any other standpoint, and I quite agree. An ORDER BY clause is a completely different thing from a WHERE qual, so mixing them together doesn't seem like a good idea. However, if we hold to that principle then we need to modify the indexAM API to pass the ordering operators separately. This is no big deal as far as the built-in AMs are concerned, particularly because 3 of the 4 need only assert that the additional list is empty. The only reason it would be a problem is if there were third-party index AMs that would be affected to a larger degree; but I don't know of any. Does anyone have an objection to that? None here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Marko Tiikkaja marko.tiikk...@cs.helsinki.fi writes: While reading this thread, I thought of two things I think we could do if this feature was implemented: 1. Sort large UPDATE/DELETEs so it is done in heap order This is actually a TODO item. I imagine it would be possible to do something like: DELETE FROM foo USING (...) ORDER BY ctid; with this patch to help this case. Well, that's strictly an implementation detail; it is not a reason to expose ORDER BY to the user, and even less of a reason to invent LIMIT. It also hasn't got any of the problems we were discussing with inheritance situations, since it'd be perfectly OK (in fact probably desirable) to sort each table's rows separately. 2. Reducing deadlocks in big UPDATE/DELETEs One problem that sometimes occurs when doing multiple multi-row UPDATEs or DELETEs concurrently is that the transactions end up working on the same rows, but in a different order. One could use an ORDER BY clause to make sure the transactions don't deadlock. That, on the other hand, seems like potentially a valid use-case. Note that the user-given order would have to override any internal attempt to order by ctid for this to be usable. I had thought of a slightly different application, which could be summarized with this example: UPDATE sometab SET somecol = nextval('seq') ORDER BY id; with the expectation that somecol's values would then fall in the same order as the id column. Unfortunately, that won't actually *work* reliably, the reason being that ORDER BY is applied after targetlist computation. I think enough people would get burnt this way that we'd have popular demand to make ORDER BY work differently in UPDATE than it does in SELECT, which seems rather ugly not only from the definitional side but the implementation side. (DELETE escapes this issue because it has no user-definable elements in its targetlist, which is another way that DELETE is simpler here.) 
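The targetlist-before-sort hazard Tom describes is the same one that bites SELECT nextval() today, where the usual workaround is to force the ordering in a subquery so nextval() runs over already-sorted rows. The same idiom would presumably let users get the intended numbering even without special UPDATE semantics. A sketch only: sometab/somecol/id are Tom's example names, the rest is assumption, and it relies on the common (but not formally guaranteed) behavior that the outer targetlist is evaluated in the subquery's output order:

```sql
-- Evaluate nextval() over rows already sorted by id,
-- then join the numbering back for the update.
UPDATE sometab
   SET somecol = s.newval
  FROM (SELECT id, nextval('seq') AS newval
          FROM (SELECT id FROM sometab ORDER BY id) ordered) s
 WHERE sometab.id = s.id;
```

If that idiom is considered good enough, it weakens the case for making ORDER BY on UPDATE behave differently than it does in SELECT.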
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Andres Freund and...@anarazel.de writes: On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote: I don't buy the argument either; why would you put a LIMIT there and delete one row by accident when you could put a BEGIN; in front and not do any damage at all? Because the delete of the whole table may take awfully long? Then you just C-c and that's your ROLLBACK. Been there, seen that (a developer came to me sweating over maybe-lost data — his chance was that forgetting the WHERE clause, it did take long enough for him to C-c by reflex, the oops moment). But more to the point, I don't see that we're this much on the policy side of things rather than on the mechanism side. This feature has real appealing usages (cheap work queues, anti-deadlock, huge data purges with no production locking — you do that in little steps in a loop). To summarize, people that are arguing against are saying they will not themselves put time on the feature more than anything else, I think. I don't see us refusing a good implementation on the grounds that misuse is possible. After all, advisory locks are session based, to name another great foot gun. If you don't think it's big enough, think about web environments and pgbouncer in transaction pooling mode. Loads of fun. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Robert Haas robertmh...@gmail.com writes: You seem to be imagining the MergeAppend node on top, but I had it in the other order in my mind. The ModifyTable node would be the outermost plan node, pulling from the Limit, which would deliver the first n table rows from the MergeAppend, which would be reponsible for getting it from the various child tables. That's just a variation of the Sort/Limit/ModifyTable approach. It doesn't fix the problem of how ModifyTable knows which table each row came from, and it doesn't fix the problem of the rows not being all the same rowtype. (In fact it makes the latter worse, since now MergeAppend has to be included in whatever kluge you invent to work around it.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Excerpts from Daniel Loureiro's message of mar nov 30 15:04:17 -0300 2010: So I guess that I have choose the wrong hack to start. So it seems :-D -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DELETE with LIMIT (or my first hack)
On Tue, 2010-11-30 at 15:52 -0500, Robert Haas wrote: On Tue, Nov 30, 2010 at 2:45 PM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote: Could you possibly have ModifyTable - Limit - MergeAppend? Before MergeAppend knows which tuple to produce, it needs to see the tuples (at least the first one from each of its children), meaning that it needs to pull them through ModifyTable; and at that point it's already too late. You seem to be imagining the MergeAppend node on top Yes, I assumed that the tuples flowed in the direction of the arrows ;) Now that I think about it, your representation makes some sense given our EXPLAIN output. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
On Sun, Nov 14, 2010 at 3:48 PM, Greg Smith g...@2ndquadrant.com wrote: ... One change that turned out be necessary rather than optional--to get good performance from the system under tuning--was to make regular background writer activity, including fsync absorb checks, happen during these sync pauses. The existing code ran the checkpoint sync work in a pretty tight loop, which as I alluded to in an earlier patch today can lead to the backends competing with the background writer to get their sync calls executed. This squashes that problem if the background writer is setup properly. Have you tested out this absorb during syncing phase code without the sleep between the syncs? I.e. so that it still a tight loop, but the loop alternates between sync and absorb, with no intentional pause? I wonder if all the improvement you see might not be due entirely to the absorb between syncs, and none or very little from the sleep itself. I ask because I don't have a mental model of how the pause can help. Given that this dirty data has been hanging around for many minutes already, what is a 3 second pause going to heal? The healing power of clearing out the absorb queue seems much more obvious. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take three
Heikki Linnakangas wrote: On 30.11.2010 18:33, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: Oh, but it's worse than that. When you XLOG a WAL record for each of those pages, you're going to trigger full-page writes for all of them. So now you've turned 1GB of data to write into 2+ GB of data to write. No, because only the first mod of each VM page would trigger a full-page write, at least assuming a reasonable ordering of the operations. If you change the LSN on the heap pages, you have to write full-page images of those as well. Let's recap what happens when a VM bit is set: you set the PD_ALL_VISIBLE flag on the heap page (assuming it's not set already; it usually isn't), and then set the bit in the VM while keeping the heap page locked. What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to happen so the heap page is guaranteed to be on disk, and then on the next read, if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set the VM bit? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] DELETE with LIMIT (or my first hack)
Daniel Loureiro wrote: 3. This doesn't work tremendously well for inheritance trees, where ModifyTable acts as sort of an implicit Append node. You can't just funnel all the tuples through one Sort or Limit node because they aren't all the same rowtype. (Limit might perhaps not care, but Sort will.) But you can't have a separate Sort/Limit for each table either, because that would give the wrong behavior. Another problem with funneling all the rows through one Sort/Limit is that ModifyTable would need to know which table each row came from, so it can apply the modification to the right table. So I guess I have chosen the wrong hack to start with. Just out of curiosity, why isn't the result of the WHERE filter (in SELECT/DELETE/UPDATE) kept in memory, i.e. as an array of ctids acting as a buffer, and then processed by the SELECT/DELETE/UPDATE all at once? Informix dbaccess would prompt a user for confirmation if it saw a DELETE with no WHERE. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] Where are we on Standby Promotion?
Josh Berkus wrote: Fujii, Simon, Greg, etc.: Has anyone submitted or committed a patch to make Standby Promotion* easier, at this point? We discussed it earlier in the dev cycle, but I can't find anything which has actually been submitted. * that is, promotion to be a new master of other existing standbys Nope, I haven't seen anything except a request that someone do testing. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] GiST insert algorithm rewrite
Heikki Linnakangas wrote: Does the current code cope with the corruption? It's not corruption, but intended degradation. Yes, the current code copes with it; that's how GiST survives a crash. However, even with the current code, VACUUM will nag if it finds any invalid tuples, with this message: ereport(NOTICE, (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash recovery", ... That's harmless, in the sense that all scans and inserts work fine, but scans might need to do more work than if the invalid tuple weren't there. I don't think we need to go out of our way to support such degraded indexes in 9.1. If you see such notices in your logs, you should REINDEX anyway, before or after pg_upgrade. Let's just make sure that you get a reasonable error message in 9.1 if a scan or insert encounters such a tuple. There is a section on this in the docs, BTW: http://www.postgresql.org/docs/9.0/static/gist-recovery.html OK, administrators will be prompted during normal operation --- seems there is nothing extra for pg_upgrade to do here. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] GiST insert algorithm rewrite
On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Does the current code cope with the corruption? It's not corruption, but intended degradation. Yes, the current code copes with it; that's how GiST survives a crash. However, even with the current code, VACUUM will nag if it finds any invalid tuples, with this message: ereport(NOTICE, (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash recovery", ... That's harmless, in the sense that all scans and inserts work fine, but scans might need to do more work than if the invalid tuple weren't there. I don't think we need to go out of our way to support such degraded indexes in 9.1. If you see such notices in your logs, you should REINDEX anyway, before or after pg_upgrade. Let's just make sure that you get a reasonable error message in 9.1 if a scan or insert encounters such a tuple. I just don't want to take the risk of giving people unexpected wrong answers. It's not clear to me whether that's a risk here or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Hackers, Some of you might already be aware that this combination produces a fatal startup crash in PostgreSQL: 1. Create an ext3 or ext4 partition and mount it with data=journal on a server with Linux kernel 2.6.30 or later. 2. Initdb a PGDATA on that partition. 3. Start PostgreSQL with the default config from that PGDATA. This was reported a ways back: https://bugzilla.redhat.com/show_bug.cgi?format=multipleid=567113 To explain: opening files with O_DIRECT on an ext3 or ext4 partition mounted data=journal causes a crash. However, recent Linux kernels report support for O_DIRECT when we compile PostgreSQL, so we use it by default. This results in a crash-by-default situation on new Linuxes if anyone sets data=journal. We just encountered this again with another user. With RHEL6 out now, this seems likely to become a fairly common crash report. Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas? -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Where are we on Standby Promotion?
On Tue, Nov 30, 2010 at 9:03 PM, Bruce Momjian br...@momjian.us wrote: Josh Berkus wrote: Fujii, Simon, Greg, etc.: Has anyone submitted or committed a patch to make Standby Promotion* easier, at this point? We discussed it earlier in the dev cycle, but I can't find anything which has actually been submitted. * that is, promotion to be a new master of other existing standbys Nope, I haven't seen anything except a request that someone do testing. I thought we agreed Josh was in charge of writing the code for this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Josh Berkus j...@agliodbs.com writes: Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas? We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this. regards, tom lane
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 11/30/10 7:09 PM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas? We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this. Are we considering backporting that change? If so, this would be another argument in favor of changing the default. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com