Re: [HACKERS] Tab completion for view triggers in psql

2010-11-30 Thread Itagaki Takahiro
On Wed, Nov 24, 2010 at 12:21, David Fetter da...@fetter.org wrote:
 Please find attached a patch changing both this and updateable to
 updatable, also per the very handy git grep I just learned about :)

I think the patch has two issues to be fixed.

It expands all tables (and views) in tab completion after INSERT,
UPDATE, and DELETE FROM.  Is this an intended change?  IMHO, I don't
want it to expand every schema, because my console gets filled with
system tables and with tables duplicated both with and without schema
names :-( .

(original)
=# INSERT INTO [tab]
information_schema.  pg_temp_1.        pg_toast_temp_1.  tbl
pg_catalog.          pg_toast.         public.

(patched)
=# INSERT INTO [tab]
Display all 113 possibilities? (y or n)

Also, event names are not completed after INSTEAD OF:

=# CREATE TRIGGER trg BEFORE [tab]
DELETE    INSERT    TRUNCATE  UPDATE
=# CREATE TRIGGER trg INSTEAD OF [tab]
(no candidates)

-- 
Itagaki Takahiro



Re: [HACKERS] Tab completion for view triggers in psql

2010-11-30 Thread David Fetter
On Tue, Nov 30, 2010 at 05:48:04PM +0900, Itagaki Takahiro wrote:
 On Wed, Nov 24, 2010 at 12:21, David Fetter da...@fetter.org wrote:
  Please find attached a patch changing both this and updateable to
  updatable, also per the very handy git grep I just learned about :)
 
 I think the patch has two issues to be fixed.
 
 It expands all tables (and views) in tab completion after INSERT,
 UPDATE, and DELETE FROM.  Is this an intended change?  IMHO, I don't
 want it to expand every schema, because my console gets filled with
 system tables and with tables duplicated both with and without schema
 names :-( .
 
 (original)
 =# INSERT INTO [tab]
 information_schema.  pg_temp_1.        pg_toast_temp_1.  tbl
 pg_catalog.          pg_toast.         public.
 
 (patched)
 =# INSERT INTO [tab]
 Display all 113 possibilities? (y or n)

Yes.  I believe the question of having INSERT INTO [tab] check for
permissions is a separate issue from this patch, although the patch
does bring the question of whether it *should* check such permissions
into sharper focus.  That is probably a matter for a separate thread;
I could make arguments in both directions.

 Also, event names are not completed after INSTEAD OF:
 
 =# CREATE TRIGGER trg BEFORE [tab]
 DELETE    INSERT    TRUNCATE  UPDATE
 =# CREATE TRIGGER trg INSTEAD OF [tab]
 (no candidates)

This one's fixed in the attached patch, although subsequent events, as
in 

=# CREATE TRIGGER trg BEFORE INSERT OR [tab]

are not.

Thanks for your review :)

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 4c468a8..0aba07c 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -303,6 +303,57 @@ static const SchemaQuery Query_for_list_of_tables = {
    NULL
 };
 
+/* The bit masks for the following three functions come from
+ * src/include/catalog/pg_trigger.h.
+ */
+static const SchemaQuery Query_for_list_of_insertables = {
+   /* catname */
+   "pg_catalog.pg_class c",
+   /* selcondition */
+   "c.relkind = 'r' OR (c.relkind = 'v' AND c.relhastriggers AND EXISTS "
+   "(SELECT 1 FROM pg_catalog.pg_trigger t WHERE t.tgrelid = c.oid AND t.tgtype | (1 << 2) = t.tgtype))",
+   /* viscondition */
+   "pg_catalog.pg_table_is_visible(c.oid)",
+   /* namespace */
+   "c.relnamespace",
+   /* result */
+   "pg_catalog.quote_ident(c.relname)",
+   /* qualresult */
+   NULL
+};
+
+static const SchemaQuery Query_for_list_of_deletables = {
+   /* catname */
+   "pg_catalog.pg_class c",
+   /* selcondition */
+   "c.relkind = 'r' OR (c.relkind = 'v' AND c.relhastriggers AND EXISTS "
+   "(SELECT 1 FROM pg_catalog.pg_trigger t WHERE t.tgrelid = c.oid AND t.tgtype | (1 << 3) = t.tgtype))",
+   /* viscondition */
+   "pg_catalog.pg_table_is_visible(c.oid)",
+   /* namespace */
+   "c.relnamespace",
+   /* result */
+   "pg_catalog.quote_ident(c.relname)",
+   /* qualresult */
+   NULL
+};
+
+static const SchemaQuery Query_for_list_of_updatables = {
+   /* catname */
+   "pg_catalog.pg_class c",
+   /* selcondition */
+   "c.relkind = 'r' OR (c.relkind = 'v' AND c.relhastriggers AND EXISTS "
+   "(SELECT 1 FROM pg_catalog.pg_trigger t WHERE t.tgrelid = c.oid AND t.tgtype | (1 << 4) = t.tgtype))",
+   /* viscondition */
+   "pg_catalog.pg_table_is_visible(c.oid)",
+   /* namespace */
+   "c.relnamespace",
+   /* result */
+   "pg_catalog.quote_ident(c.relname)",
+   /* qualresult */
+   NULL
+};
+
 static const SchemaQuery Query_for_list_of_tisv = {
    /* catname */
    "pg_catalog.pg_class c",
@@ -333,6 +384,21 @@ static const SchemaQuery Query_for_list_of_tsv = {
    NULL
 };
 
+static const SchemaQuery Query_for_list_of_tv = {
+   /* catname */
+   "pg_catalog.pg_class c",
+   /* selcondition */
+   "c.relkind IN ('r', 'v')",
+   /* viscondition */
+   "pg_catalog.pg_table_is_visible(c.oid)",
+   /* namespace */
+   "c.relnamespace",
+   /* result */
+   "pg_catalog.quote_ident(c.relname)",
+   /* qualresult */
+   NULL
+};
+
 static const SchemaQuery Query_for_list_of_views = {
    /* catname */
    "pg_catalog.pg_class c",
@@ -630,7 +696,8 @@ psql_completion(char *text, int start, int end)
    *prev2_wd,
    *prev3_wd,
    *prev4_wd,
-   *prev5_wd;
+   *prev5_wd,
+   *prev6_wd;
 
 static const char *const sql_commands[] = {
    "ABORT", "ALTER", "ANALYZE", "BEGIN", "CHECKPOINT", "CLOSE",

Re: [HACKERS] pg_execute_from_file review

2010-11-30 Thread Dimitri Fontaine
Itagaki Takahiro itagaki.takah...@gmail.com writes:
 I think there are two topics here:
   1. Do we need to restrict locations in which sql files are executable?
   2. Which is better, pg_execute_sql_file() or EXECUTE pg_read_file() ?

 There is no discussion yet for 1, but I think we need some restrictions

Well, as a first level of restriction, the function is superuser
only. I understand and share your concerns, but since the main use for
this function is in the extension patch, which is superuser-only too, I
feel this discussion could (should) be deferred until after commit. We
would issue the next alpha with the current coding, and the one after
that with more security thinking in it. Is it possible to attack it this
way?

(The fear is that it might be hard to reach a common decision here,
 and we have other patches depending on this one to care about; those
 will continue working fine --- that's the goal --- once the security
 policy lands. This one is the mechanism patch.)

 For 2, I'd like to split pg_execute_sql_file() into separate
 single-purpose functions, something like:
 - FUNCTION pg_execute_sql(sql text)
 - FUNCTION replace(str text, from1 text, to1 text, VARIADIC text)
 - FUNCTION pg_read_binary_file(path text, offset bigint, size bigint)
 (size == -1 means whole-file)

 pg_read_binary_file() is probably the most important of these functions.
 pg_execute_sql_file() can be rewritten as follows. We can also use the
 existing convert_from() to support encodings.

 SELECT pg_execute_sql(
  replace(
convert_from(
  pg_read_binary_file('path', 0, -1),
  'encoding'),
  'key1', 'value1', 'key2', 'value2')
);

I think the current pg_execute_sql_file() is a better user interface,
but of course it could be extended to support queries in a text argument
too. I proposed that, and if I understood correctly, Robert said he
doesn't think that should happen in this very patch, if at all.

Again, I'd like to see pg_read_binary_file(), and it's easy to expose the
other derivatives you're proposing here: the support code is already in
my patch and is organised this way internally. Now, is there agreement
that all those new SQL functions should be in the pg_execute_from_file
patch? If so, I'll prepare v7 with that.

Overall, I'd like it to be possible to separate the concerns directly
relevant to the extension patch from the other legitimate ones, so that
the main patch gets its share of review in this commitfest. But maybe
I'm over-worrying here.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] pg_execute_from_file review

2010-11-30 Thread Dimitri Fontaine
Itagaki Takahiro itagaki.takah...@gmail.com writes:
 client_encoding won't work at all, because read_sql_queries_from_file()
 uses pg_verifymbstr(), that is, it verifies the input against the
 *server_encoding*.

 Even if we replace it with pg_verify_mbstr(client_encoding, ...) and
 pg_do_encoding_conversion(from client_encoding to server_encoding),
 it still won't work well when error messages are raised. The client
 expects the original client encoding, but messages are sent in the
 file encoding. It would be a security hole.

I'll confess I'm at a loss here as to how to address your concerns.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Heikki Linnakangas

On 27.11.2010 21:31, Bruce Momjian wrote:

Heikki Linnakangas wrote:

There are no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there are
any invalid keys in the old index because of an incomplete insertion,
the new code will not understand them. So you should run vacuum to
ensure that there are no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you do anyway.


OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4.  Is this something we would do for
9.0 to 9.1?


9.1. The problem that started this whole thing is there in older 
versions, but given the lack of real-life reports and the scale of the 
changes required it doesn't seem wise to backport.



You are saying it would have to be run before the upgrade.  Can it not
be run after?


After would work too.


I can output a script to VACUUM all such indexes, and tell users to
manually REINDEX any index that generates a warning message.  I don't
have any way to automate an optional REINDEX step.


That seems good enough.
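
In psql terms, the procedure sketched above amounts to something like
this (the index name is invented for illustration):

VACUUM;                     -- emits a NOTICE for any GiST index that
                            -- still contains invalid tuples from a crash
REINDEX INDEX my_gist_idx;  -- only needed for indexes the NOTICE flagged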

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 11:55, Heikki Linnakangas wrote:

On 27.11.2010 21:31, Bruce Momjian wrote:

Heikki Linnakangas wrote:

There are no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there are
any invalid keys in the old index because of an incomplete insertion,
the new code will not understand them. So you should run vacuum to
ensure that there are no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you do anyway.


OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?


9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.


Oh sorry, I read your question as 9.0 *or* 9.1.

Only GiST indexes that have invalid tuples in them, as a result 
of a crash, need to be reindexed. That's very rare in practice, so we 
shouldn't invalidate all GiST indexes. I don't think there's any simple 
way to check whether a reindex is required, so I think we have to just 
document this.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Tab completion for view triggers in psql

2010-11-30 Thread Itagaki Takahiro
On Tue, Nov 30, 2010 at 18:18, David Fetter da...@fetter.org wrote:
 It expands all tables (and views) in tab completion after INSERT,
 UPDATE, and DELETE FROM.  Is this an intended change?

I found it was a simple bug; we need ( ) around selcondition.

In addition, I modified your patch a bit:

* I added a separate CREATE TRIGGER ... INSTEAD OF if-branch,
  because it doesn't actually accept tables; it only accepts
  views. OTOH, BEFORE and AFTER only accept tables.

* I think t.tgtype & (1 << N) <> 0 is a more natural bit
  operation than t.tgtype | (1 << N) = t.tgtype.
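
As a quick sanity check of the two forms (a sketch only; the table t,
view v, and trigger function v_ins_fn below are invented, and
TRIGGER_TYPE_INSERT is 1 << 2 per pg_trigger.h):

CREATE TABLE t (x int);
CREATE VIEW v AS SELECT x FROM t;
CREATE FUNCTION v_ins_fn() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN INSERT INTO t VALUES (NEW.x); RETURN NEW; END $$;
CREATE TRIGGER v_ins INSTEAD OF INSERT ON v
    FOR EACH ROW EXECUTE PROCEDURE v_ins_fn();

-- both bit tests pick out the INSTEAD OF INSERT trigger on the view
SELECT tgname,
       (tgtype & (1 << 2)) <> 0     AS insert_bit_and,  -- proposed form
       (tgtype | (1 << 2)) = tgtype AS insert_bit_or    -- original form
FROM pg_catalog.pg_trigger
WHERE tgrelid = 'v'::regclass;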

Patch attached. If you think my changes are ok,
please change the patch status to Ready for Committer.

-- 
Itagaki Takahiro


psql_view_trigger_tab_completion_6.diff
Description: Binary data



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Csaba Nagy
Hi all,

The workaround recommended some time ago by Tom is:

DELETE FROM residents_of_athens WHERE ctid = any(array(SELECT ctid FROM
residents_of_athens ORDER BY ostracism_votes DESC LIMIT 1));

It is about as efficient as the requested feature would be, just uglier
to write down. I use it all the time when batch-deleting something large
(to avoid long-running transactions and to avoid crashing Slony). It also
helps to vacuum frequently if you do that on large amounts of data...
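
To spell the batching pattern out (a sketch; the table and predicate
are invented), each round deletes at most 1000 rows and is repeated,
committing in between, until DELETE reports 0 rows:

DELETE FROM big_log
WHERE ctid = any(array(SELECT ctid FROM big_log
                       WHERE created < now() - interval '90 days'
                       LIMIT 1000));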

Cheers,
Csaba.

On Tue, 2010-11-30 at 00:05 -0500, Robert Haas wrote:
 On Mon, Nov 29, 2010 at 11:25 PM, Andrew Dunstan and...@dunslane.net wrote:
 
 
  On 11/29/2010 10:19 PM, Robert Haas wrote:
 
  For example, suppose we're trying to govern an ancient Greek
  democracy:
 
  http://en.wikipedia.org/wiki/Ostracism
 
  DELETE FROM residents_of_athens ORDER BY ostracism_votes DESC LIMIT 1;
 
  I'm not sure this is a very good example. Assuming there isn't a tie, I'd do
  it like this:
 
  DELETE FROM residents_of_athens
  WHERE ostracism_votes >= 6000
 and ostracism_votes =
  (SELECT max(ostracism_votes)
   FROM residents_of_athens);
 
 That might be a lot less efficient, though, and sometimes it's not OK
 to delete more than one record.  Imagine, for example, wanting to
 dequeue the work item with the highest priority.  Sure, you can use
 SELECT ... LIMIT to identify one and then DELETE it by some other key,
 but DELETE .. ORDER BY .. RETURNING .. LIMIT would be cool, and would
 let you do it with just one scan.
 
  I can't say I'd be excited by this feature. In quite a few years of writing
  SQL I don't recall ever wanting such a gadget.
 
 It's something I've wanted periodically, though not badly enough to do
 the work to make it happen.
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 




Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Rob Wultsch
On Mon, Nov 29, 2010 at 9:57 PM, Robert Haas robertmh...@gmail.com wrote:
 1. Pin each visibility map page.  If any VM_BECOMING_ALL_VISIBLE bits
 are set, take the exclusive content lock for long enough to clear
 them.

I wonder what the performance hit will be to workloads with contention
and if this feature should be optional.

-- 
Rob Wultsch
wult...@gmail.com



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Rob Wultsch
On Mon, Nov 29, 2010 at 10:50 PM, Marti Raudsepp ma...@juffo.org wrote:
 On Tue, Nov 30, 2010 at 05:09, Jaime Casanova ja...@2ndquadrant.com wrote:
 at least IMHO the only sensible way that LIMIT is useful is with
 an ORDER BY clause which makes the results very well defined...

 DELETE with LIMIT is also useful for deleting things in batches, so
 you can do large deletes on a live system without starving other users
 of I/O. In this case deletion order doesn't matter (it's more
 efficient to delete rows in physical table order) -- ORDER BY isn't
 necessary.

 Regards,
 Marti


++

I have a lot of DELETE with LIMIT in my (MySQL) environment for this reason.


-- 
Rob Wultsch
wult...@gmail.com



Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Josh Kupershmidt
On Mon, Nov 29, 2010 at 10:06 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 29, 2010 at 9:37 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 I actually hadn't thought of that, for some reason.

 We used to similarly recommend that people handle TRUNCATE privileges
 with a security definer function. That doesn't mean GRANT TRUNCATE
 wasn't a sweet addition to 8.4.

 Hmm, glad you like it (I wrote that).  I'm just asking how far we
 should go before we decide we're catering to use cases that are too
 narrow to warrant an extension of the permissions system.

I am slightly opposed to adding GRANTs for LOCK TABLE, ANALYZE,
VACUUM, etc. The GRANT help page is long enough already, and I doubt
many users would use them, even though I might use GRANT LOCK TABLE
myself.
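
For completeness, the security-definer workaround mentioned above looks
roughly like this for LOCK TABLE (the table and role names are
invented):

CREATE FUNCTION lock_my_table() RETURNS void
LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
    LOCK TABLE my_table IN SHARE MODE;  -- held until transaction end
END $$;
REVOKE ALL ON FUNCTION lock_my_table() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION lock_my_table() TO app_role;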

Josh



Re: [HACKERS] Tab completion for view triggers in psql

2010-11-30 Thread David Fetter
On Tue, Nov 30, 2010 at 08:13:37PM +0900, Itagaki Takahiro wrote:
 On Tue, Nov 30, 2010 at 18:18, David Fetter da...@fetter.org wrote:
  It expands all tables (and views) in tab completion after INSERT,
  UPDATE, and DELETE FROM.  Is this an intended change?
 
 I found it was a simple bug; we need ( ) around selcondition.

Oh.  Heh.  Thanks for tracking this down.

 In addition, I modified your patch a bit:
 
 * I added a separate CREATE TRIGGER ... INSTEAD OF if-branch,
   because it doesn't actually accept tables; it only accepts
   views. OTOH, BEFORE and AFTER only accept tables.

OK

 * I think t.tgtype & (1 << N) <> 0 is a more natural bit
   operation than t.tgtype | (1 << N) = t.tgtype.

OK

 Patch attached. If you think my changes are ok,
 please change the patch status to Ready for Committer.

Done :)

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 4:25 AM, Csaba Nagy ncsli...@googlemail.com wrote:
 The workaround recommended some time ago by Tom is:

 DELETE FROM residents_of_athens WHERE ctid = any(array(SELECT ctid FROM
 residents_of_athens ORDER BY ostracism_votes DESC LIMIT 1));

 It is about as efficient as the requested feature would be, just uglier
 to write down. I use it all the time when batch-deleting something large
 (to avoid long-running transactions and to avoid crashing Slony). It also
 helps to vacuum frequently if you do that on large amounts of data...

That's a very elegant hack, but not exactly obvious to a novice user
or, say, me.  So I think it'd be nicer to have the obvious syntax
work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 30.11.2010 11:55, Heikki Linnakangas wrote:

 On 27.11.2010 21:31, Bruce Momjian wrote:

 Heikki Linnakangas wrote:

 There are no on-disk format changes, except for the additional flag in the
 page headers, so this does not affect pg_upgrade. However, if there are
 any invalid keys in the old index because of an incomplete insertion,
 the new code will not understand them. So you should run vacuum to
 ensure that there are no such invalid keys in the index before upgrading.
 Vacuum will print a message in the log if it finds any, and you will
 have to reindex. But that's what it suggests you do anyway.

 OK, pg_upgrade has code to report invalid gin and hash indexes because
 of changes between PG 8.3 and 8.4. Is this something we would do for
 9.0 to 9.1?

 9.1. The problem that started this whole thing is there in older
 versions, but given the lack of real-life reports and the scale of the
 changes required it doesn't seem wise to backport.

 Oh sorry, I read your question as 9.0 *or* 9.1.

 Only GiST indexes that have invalid tuples in them, as a result of a
 crash, need to be reindexed. That's very rare in practice, so we shouldn't
 invalidate all GiST indexes. I don't think there's any simple way to check
 whether a reindex is required, so I think we have to just document this.

It seems odd to say, the indexes are corrupted, but they're probably
not, so let's not worry about it.

I assume there's no way to make the new code cope with any
pre-existing corruption?

Does the current code cope with the corruption?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Nov 29, 2010 at 11:25 PM, Andrew Dunstan and...@dunslane.net wrote:
 I can't say I'd be excited by this feature. In quite a few years of writing
 SQL I don't recall ever wanting such a gadget.

 It's something I've wanted periodically, though not badly enough to do
 the work to make it happen.

It would certainly look like nothing but a crude hack if the feature is
only available for DELETE and not UPDATE.  Unfortunately, the UPDATE
case would be an order of magnitude harder (think inheritance trees
where the children aren't all alike).

regards, tom lane



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Andrew Dunstan



On 11/30/2010 09:57 AM, Csaba Nagy wrote:


So it is really an ideological thing and not lack of demand or
implementation attempts... I myself can't write working C code
anyway, so I've made my peace with the workaround - I wish you good luck
arguing with Tom :-)




We need a convincing use case for it. So far the only one that's seemed 
at all convincing to me is the one about deleting in batches. But that 
might be enough.


As for it being illogical, I don't think it's any more so than

   DELETE FROM foo WHERE random() < 0.1;

and you can do that today.

cheers

andrew



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 16:23, Robert Haas wrote:

On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 30.11.2010 11:55, Heikki Linnakangas wrote:


On 27.11.2010 21:31, Bruce Momjian wrote:


Heikki Linnakangas wrote:


There are no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there are
any invalid keys in the old index because of an incomplete insertion,
the new code will not understand them. So you should run vacuum to
ensure that there are no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you do anyway.


OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?


9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.


Oh sorry, I read your question as 9.0 *or* 9.1.

Only GiST indexes that have invalid tuples in them, as a result of a
crash, need to be reindexed. That's very rare in practice, so we shouldn't
invalidate all GiST indexes. I don't think there's any simple way to check
whether a reindex is required, so I think we have to just document this.


It seems odd to say, the indexes are corrupted, but they're probably
not, so let's not worry about it.

I assume there's no way to make the new code cope with any
pre-existing corruption?

Does the current code cope with the corruption?


It's not corruption, but intended degradation. Yes, the current code 
copes with it, that's how GiST survives a crash. However, even with the 
current code, VACUUM will nag if it finds any invalid tuples with this 
message:


ereport(NOTICE,
        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash recovery",


That's harmless, in the sense that all scans and inserts work fine, but 
scans might need to do more work than if the invalid tuple wasn't there.


I don't think we need to go out of our way to support such degraded 
indexes in 9.1. If you see such notices in your logs, you should REINDEX 
anyway, before or after pg_upgrade. Let's just make sure that you get a 
reasonable error message in 9.1 if a scan or insert encounters such a tuple.


There is a section on this in the docs, BTW: 
http://www.postgresql.org/docs/9.0/static/gist-recovery.html


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Some care is needed with checkpoints. Setting visibility map bits in step 2
 is safe because crash recovery will replay the intent XLOG record and clear
 any incorrectly set bits. But if a checkpoint has happened after the intent
 XLOG record was written, that's not true. This can be avoided by checking
 RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
 since the last intent XLOG record was written.

It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?

 There's a small race condition in the way a visibility map bit is currently
 cleared. When a heap page is updated, it is locked, the update is
 WAL-logged, and the lock is released. The visibility map page is updated
 only after that. If the final vacuum XLOG record is written just after
 updating the heap page, but before the visibility map bit is cleared,
 replaying the final XLOG record will set a bit that should not have been
 set.

Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it.  Crashes
are not so common that loss of even a full hour's visibility map bits
in the event that we have one seems worth killing ourselves over.  And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires.  Seems like we might be better off
writing that much less WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 06:57, Robert Haas wrote:
 I can't say I'm totally in love with any of these designs.  Anyone
 else have any ideas, or any opinions about which one is best?

 Well, the design I've been pondering goes like this:

Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions?  That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values.  All of these other schemes seem too complicated and not
provably correct.

Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.

regards, tom lane



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Nov 29, 2010 at 11:25 PM, Andrew Dunstan and...@dunslane.net wrote:
 I can't say I'd be excited by this feature. In quite a few years of writing
 SQL I don't recall ever wanting such a gadget.

 It's something I've wanted periodically, though not badly enough to do
 the work to make it happen.

 It would certainly look like nothing but a crude hack if the feature is
 only available for DELETE and not UPDATE.

I'm not sure this is true, given Andrew's comment that the bulk
deletion argument is the only one he finds compelling, but I'd surely
be in favor of supporting both.

 Unfortunately, the UPDATE
 case would be an order of magnitude harder (think inheritance trees
 where the children aren't all alike).

I don't understand why there's anything more to this than sticking a
Limit node either immediately above or immediately below the
ModifyTable node.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 17:32, Robert Haas wrote:

On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Some care is needed with checkpoints. Setting visibility map bits in step 2
is safe because crash recovery will replay the intent XLOG record and clear
any incorrectly set bits. But if a checkpoint has happened after the intent
XLOG record was written, that's not true. This can be avoided by checking
RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
since the last intent XLOG record was written.


It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?


It's enough to hold an exclusive lock on the visibility map page. You 
have to set the bit first, and then check RedoRecPtr, and if it changed, 
write the XLOG record before releasing the lock. If RedoRecPtr changes 
any time before we check RedoRecPtr, we'll write the XLOG record so 
we're safe. If it changes after that, we're safe because the checkpoint 
will flush the updated heap page and visibility map page.



There's a small race condition in the way a visibility map bit is currently
cleared. When a heap page is updated, it is locked, the update is
WAL-logged, and the lock is released. The visibility map page is updated
only after that. If the final vacuum XLOG record is written just after
updating the heap page, but before the visibility map bit is cleared,
replaying the final XLOG record will set a bit that should not have been
set.


Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it.  Crashes
are not so common that loss of even a full hour's visibility map bits
in the event that we have one seems worth killing ourselves over.  And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires.  Seems like we might be better off
writing that much less WAL.


Yeah, possibly. It also means that the set bits will not propagate to 
standby servers, though.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 06:57, Robert Haas wrote:
 I can't say I'm totally in love with any of these designs.  Anyone
 else have any ideas, or any opinions about which one is best?

 Well, the design I've been pondering goes like this:

 Wouldn't it be easier and more robust to just consider VM bit changes to
 be part of the WAL-logged actions?  That would include updating LSNs on
 VM pages and flushing VM pages to disk during checkpoint based on their
 LSN values.  All of these other schemes seem too complicated and not
 provably correct.

What WAL-logged actions?

The problem case is where a page has no tuples or line pointers that
need to be removed, and all we need to do is mark it all-visible.  We
don't currently WAL-log anything in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 17:38, Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

On 30.11.2010 06:57, Robert Haas wrote:

I can't say I'm totally in love with any of these designs.  Anyone
else have any ideas, or any opinions about which one is best?



Well, the design I've been pondering goes like this:


Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions?  That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values.  All of these other schemes seem too complicated and not
provably correct.


The vm bit can be set once all the tuples on the page become visible to 
everyone. There is no WAL-logged action at that point we could piggyback on.


Clearing the bit is already handled like that - replay of heap 
insert/update/delete records clears the visibility map bit.



Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.


Yeah, I'm not terribly excited about any of these schemes. The intent 
record seems like the simplest one, but even that is sufficiently different 
from the traditional WAL-logging we do that it makes me slightly nervous.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:43 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 It seems like you'll need to hold some kind of lock between the time
 you examine RedoRecPtr and the time you actually examine the bit.
 WALInsertLock in shared mode, maybe?

 It's enough to hold an exclusive lock on the visibility map page. You have
 to set the bit first, and then check RedoRecPtr, and if it changed, write
 the XLOG record before releasing the lock. If RedoRecPtr changes any time
 before we check RedoRecPtr, we'll write the XLOG record so we're safe. If it
 changes after that, we're safe because the checkpoint will flush the updated
 heap page and visibility map page.

Brilliant.  I assume that we need to call GetRedoRecPtr() after taking
the exclusive lock on the page, though?

 Yeah, possibly. It also means that the set bits will not propagate to
 standby servers, though.

That's definitely sucky, but in some ways it would be more complicated
if they did, because I don't think all-visible on the master implies
all-visible on the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

Here's one more idea:

The trivial solution to this is to WAL-log setting the visibility map 
bit, like we WAL-log any other operation. Lock the heap page, lock the 
visibility map page, write WAL-record, and release locks. That works, 
but the problem is that it creates quite a lot of new WAL traffic.


We could reduce the WAL traffic by simply updating multiple pages at a 
time. Lock N pages, lock the visibility map page, write one WAL record, 
and release locks. If N=10, for example, we only need to WAL-log a 
couple of bytes per page, so the WAL volume should be acceptable. The 
downside is that you need to keep more pages locked at the same time, 
but maybe that's not too bad.


This wouldn't require anything special, which means fewer hard-to-debug 
visibility & recovery bugs.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Unfortunately, the UPDATE
 case would be an order of magnitude harder (think inheritance trees
 where the children aren't all alike).

 I don't understand why there's anything more to this than sticking a
 Limit node either immediately above or immediately below the
 ModifyTable node.

1. You need to support ORDER BY too, otherwise I *will* be on the
warpath against this as a foot-gun with no redeeming social value.

2. So what you need is Sort underneath Limit underneath ModifyTable.
Putting them above it would be quite the wrong semantics.

3. This doesn't work tremendously well for inheritance trees, where
ModifyTable acts as sort of an implicit Append node.  You can't just
funnel all the tuples through one Sort or Limit node because they aren't
all the same rowtype.  (Limit might perhaps not care, but Sort will.)
But you can't have a separate Sort/Limit for each table either, because
that would give the wrong behavior.  Another problem with funneling all
the rows through one Sort/Limit is that ModifyTable would need to know
which table each row came from, so it can apply the modify to the right
table.

I don't offhand see a solution other than integrating the responsibility
for limit-counting and sorting into ModifyTable itself, making it into
an unholy union of ModifyTable+Limit+MergeAppend (with the individual
inputs required to deliver sorted outputs separately).  That's
sufficiently ugly, and probably bad for performance in the normal case,
that I don't think it's going to be acceptable for such a marginal
feature.

Or I guess you could try to persuade us that DELETE/UPDATE with ORDER BY
or LIMIT doesn't need to support inherited target tables.  I wouldn't
bet on that proposal flying either.
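
For reference, the form under discussion would look something like this
(not accepted syntax at the time of this thread; work_queue and priority
are invented names):

DELETE FROM work_queue
ORDER BY priority DESC
LIMIT 1
RETURNING *;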

regards, tom lane



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Csaba Nagy
Hi Robert,

On Tue, 2010-11-30 at 09:19 -0500, Robert Haas wrote:
 That's a very elegant hack, but not exactly obvious to a novice user
 or, say, me.  So I think it'd be nicer to have the obvious syntax
 work.

I fully agree - but you first have to convince core hackers that this is
not just a foot-gun. This was discussed many times in the past; patches
were also offered (perhaps not complete ones, but proof that there is
an itch to be scratched):

http://archives.postgresql.org/pgsql-patches/2002-09/msg00255.php

The reaction:

http://archives.postgresql.org/pgsql-patches/2002-09/msg00256.php

There are other discussions too; if I remember correctly, Tom once
admitted that the core of implementing the feature would likely consist
of simply letting it work, as the infrastructure to do it is there but
is actively disabled. I can't find the mail now, though.

So it is really an ideological thing and not lack of demand or
implementation attempts... I myself can't write working C code
anyway, so I've made my peace with the workaround - I wish you good luck
arguing with Tom :-)

Cheers,
Csaba.





[HACKERS] Another proposal for table synonyms

2010-11-30 Thread Alexey Klyukin
Hello,

Here is a proposal to add synonyms to PostgreSQL. The initial goal is to
add synonyms for relations (tables, views, sequences) and an
infrastructure to allow synonyms for other database objects in the
future.

A thread with discussion of an old proposal by Jonah Harris is here: 
http://archives.postgresql.org/pgsql-hackers/2006-03/msg00519.php

-
Synonyms are database objects, which act as an alias for other objects. In
this proposal the synonyms for tables, views and sequences will be considered.

The new command, CREATE SYNONYM, defines a synonym. The syntax is:

CREATE SYNONYM synonym_name FOR object_type object_name
where
synonym_name: fully-qualified name (FQN) of the synonym
object_type: {TABLE | VIEW | SEQUENCE}. In the future, new object_types, such as
functions, can be added.
object_name:  FQN of a database table, view or sequence.

Another new command, DROP SYNONYM, is used for deleting an already existing
synonym without removing the object the synonym references. The syntax is:

DROP SYNONYM synonym_name
where synonym_name is a FQN of a synonym.

Comments will be supported on synonyms with the following command:
COMMENT ON SYNONYM synonym_name IS comment_text
where synonym_name is a FQN of a synonym.
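
Under the proposed syntax, a session might look like this (all names
are invented; none of these commands exist yet):

CREATE SYNONYM public.emp FOR TABLE hr.employees;
COMMENT ON SYNONYM public.emp IS 'shorthand for hr.employees';
SELECT count(*) FROM emp;   -- resolved via search_path to hr.employees
DROP SYNONYM public.emp;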

To support the addition of new database object types that can be
referenced by synonyms, a new system catalog, pg_synonym, will be added,
with an OID (to support comments on synonyms) and the following schema:

synname       name  name of the synonym
synnamespace  oid   OID of the namespace that contains the synonym
synclassid    oid   OID of the system catalog that contains the referenced object
synobjid      oid   OID of the referenced object

When resolving a synonym name, the usual search_path lookup rules apply:
first, the object of the appropriate type is looked up in the schema,
then the synonym; the process then iterates with the next schema from the
search_path. Note that a table synonym with the same FQN as an existing
table will be masked by that table.

To speed up synonym name resolution, a new syscache, SYNNAMENSPCLASS
{synname, synnamespace, synclassid}, will be introduced. This cache will
be consulted when a lookup in the RELNAMENSP syscache returns no result,
with the DB object's catalog OID set to the pg_class OID.

For table and view synonyms, INSERT/UPDATE/DELETE/SELECT will be supported.
For sequences, SELECT will be supported. The commands will translate synonyms
to the referenced database objects at the parser stage.

All types of synonyms will be supported as table arguments/return value types,
as well as actual values (i.e. currval/nextval will accept a sequence
synonym).

The following DDL commands will work transparently with table synonyms
(and with sequence and view synonyms, where the command applies to them):
COPY, LOCK, TRUNCATE, EXPLAIN, EXPLAIN ANALYZE.

The following DDL commands will cause an error when called on table
(sequence, view) synonyms:
ALTER {TABLE|VIEW|SEQUENCE}, 
ANALYZE, 
CLUSTER, 
COMMENT ON {TABLE | VIEW | SEQUENCE} .. IS, 
DROP {TABLE | VIEW | SEQUENCE}, 
GRANT,
REVOKE,
VACUUM.
For these commands additional checks for synonyms will be introduced on a
per-command basis.

A dependency between the synonym and the referenced object will be added
when a new synonym is created, to forbid removing the referenced object
without removing the synonym first (without using CASCADE). On DROP
SYNONYM the related dependency will be removed.

--
Alexey Klyukin  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc




Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 17:38, Tom Lane wrote:
 Wouldn't it be easier and more robust to just consider VM bit changes to
 be part of the WAL-logged actions?  That would include updating LSNs on
 VM pages and flushing VM pages to disk during checkpoint based on their
 LSN values.  All of these other schemes seem too complicated and not
 provably correct.

 The vm bit can be set once all the tuples on the page become visible to 
 everyone. There is no WAL-logged action at that point we could piggyback on.

So you start emitting a WAL entry for the act of setting the VM bit
(and I guess the page header hint bit would be included in that too).

 Yeah, I'm not terribly excited about any of these schemes. The intent
 record seems like the simplest one, but even that is sufficiently different
 from the traditional WAL-logging we do that it makes me slightly nervous.

I'm not convinced it works at all.  Consider write intent record,
checkpoint, set bit, crash before completing vacuum.  There will be
no second intent record at which you could clean up if things are
inconsistent.

regards, tom lane



Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:04 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 30, 2010 at 10:04 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Unfortunately, the UPDATE
 case would be an order of magnitude harder (think inheritance trees
 where the children aren't all alike).

 I don't understand why there's anything more to this than sticking a
 Limit node either immediately above or immediately below the
 ModifyTable node.

 1. You need to support ORDER BY too, otherwise I *will* be on the
 warpath against this as a foot-gun with no redeeming social value.

Will you be wielding a Tom-ahawk?

 2. So what you need is Sort underneath Limit underneath ModifyTable.
 Putting them above it would be quite the wrong semantics.

OK.

 3. This doesn't work tremendously well for inheritance trees, where
 ModifyTable acts as sort of an implicit Append node.  You can't just
 funnel all the tuples through one Sort or Limit node because they aren't
 all the same rowtype.  (Limit might perhaps not care, but Sort will.)
 But you can't have a separate Sort/Limit for each table either, because
 that would give the wrong behavior.  Another problem with funneling all
 the rows through one Sort/Limit is that ModifyTable would need to know
 which table each row came from, so it can apply the modify to the right
 table.

Could you possibly have ModifyTable -> Limit -> MergeAppend?

 Or I guess you could try to persuade us that DELETE/UPDATE with ORDER BY
 or LIMIT doesn't need to support inherited target tables.  I wouldn't
 bet on that proposal flying either.

I've spent enough time worrying about the fact that tables with
inheritance children don't behave as nicely as those that don't to
have any interest in going in the other direction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 The trivial solution to this is to WAL-log setting the visibility map 
 bit, like we WAL-log any other operation. Lock the heap page, lock the 
 visibility map page, write WAL-record, and release locks. That works, 
 but the problem is that it creates quite a lot of new WAL traffic.

How much is "quite a lot"?  Do we have any real reason to think that
this solution is unacceptable performance-wise?

I'd also suggest that if you want to prevent torn-page syndrome on VM
pages (and if you want to rely on their LSN values, you do) then you
probably don't have any choice anyway.  VM pages will have to adhere to
the same write-full-page-on-first-mod-after-checkpoint rule as any other
page.  I'd guess that this will swamp any savings from cutesy schemes
for reducing the number of WAL records.

 We could reduce the WAL traffic by simply updating multiple pages at a 
 time. Lock N pages, lock the visibility map page, write one WAL record, 
 and release locks.

I don't think that will work, because you have to hold the lock on a
page from the time you check that it's all-visible to the time you apply
the update.  The loss of concurrency against updates would be pretty
bad, and I think you'd be creating significant risk of deadlocks from
holding multiple buffer locks at once.

regards, tom lane



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 How much is "quite a lot"?  Do we have any real reason to think that
 this solution is unacceptable performance-wise?

Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.

But you did just dirty a gigabyte of data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 9:07 AM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Mon, Nov 29, 2010 at 10:06 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 29, 2010 at 9:37 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 I actually hadn't thought of that, for some reason.

 We used to similarly recommend that people handle TRUNCATE privileges
 with a security definer function. That doesn't mean GRANT TRUNCATE
 wasn't a sweet addition to 8.4.

 Hmm, glad you like it (I wrote that).  I'm just asking how far we
  should go before we decide we're catering to use cases that are too
 narrow to warrant an extension of the permissions system.

 I am slightly opposed to adding GRANTs for LOCK TABLE, ANALYZE,
 VACUUM, etc. The GRANT help page is long enough already, and I doubt
 many users would use them, even though I might use GRANT LOCK TABLE
 myself.

You'd really probably want GRANT LOCK TABLE (SHARE), GRANT LOCK TABLE
(EXCLUSIVE), ...

It'd be sort of cool, but it doesn't seem worth the complexity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:22 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 How much is "quite a lot"?  Do we have any real reason to think that
 this solution is unacceptable performance-wise?

 Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
 you XLOG setting the bit on each page, you'll need to write 128K WAL
 records, each containing a 12-byte relfilenode and a 4-byte block
 offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.

 But you did just dirty a gigabyte of data.

Oh, but it's worse than that.  When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
 So now you've turned 1GB of data to write into 2+ GB of data to
write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:22, Robert Haas wrote:

On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane t...@sss.pgh.pa.us wrote:

How much is "quite a lot"?  Do we have any real reason to think that
this solution is unacceptable performance-wise?


Well, let's imagine a 1GB insert-only table.  It has 128K pages.  If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.


Plus WAL headers, I think it's something like 32 or 40 bytes of WAL per 
page.



But you did just dirty a gigabyte of data.


Good point.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Another proposal for table synonyms

2010-11-30 Thread Tom Lane
Alexey Klyukin al...@commandprompt.com writes:
 To support addition of new database objects types that can be referenced by
 synonyms a new system catalog, pg_synonym, is to be added, with an oid to
 support comments on synonym, and the following schema:

This is not going to work, at least not without making every type of
lookup consult pg_synonym too, which I think can be considered DOA
because of its performance impact on people who aren't even using the
feature.  It's also quite unclear how you prevent duplicate names
if the synonyms are in their own catalog.  (And no, the part of your
proposal that says you're not preventing that isn't acceptable from
a usability standpoint.)

You could reasonably support synonyms for tables/views by storing them
in pg_class with a new relkind.  This doesn't cover synonyms for other
object types, but since the total world demand for such a feature is
approximately zero, that's not really a problem.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:10, Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com  writes:

Yeah, I'm not terribly excited about any of these schemes. The intent
record seems like the simplest one, but even that is so different
from the traditional WAL-logging we do that it makes me slightly nervous.


I'm not convinced it works at all.  Consider write intent record,
checkpoint, set bit, crash before completing vacuum.  There will be
no second intent record at which you could clean up if things are
inconsistent.


That's why you need to check the RedoRecPtr when you set the bit. If it 
has changed, ie. a checkpoint has happened, the set bit step will write 
a new intent record.
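
A hedged sketch of that protocol (pseudocode only; the calls marked
hypothetical are made up for illustration and appear in no patch):

static XLogRecPtr intent_redo = InvalidXLogRecPtr;

void vacuum_set_vm_bit(Relation rel, BlockNumber heapBlk)
{
    /* If a checkpoint has started since our last intent record, that
       record is no longer guaranteed to be replayed; write a new one. */
    if (intent_redo != GetRedoRecPtr())
    {
        intent_redo = GetRedoRecPtr();
        xlog_write_intent_record(rel);        /* hypothetical call */
    }
    set_vm_bit_without_wal(rel, heapBlk);     /* hypothetical call */
}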


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Oh, but it's worse than that.  When you XLOG a WAL record for each of
 those pages, you're going to trigger full-page writes for all of them.
  So now you've turned 1GB of data to write into 2+ GB of data to
 write.

No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 18:10, Tom Lane wrote:
 I'm not convinced it works at all.  Consider write intent record,
 checkpoint, set bit, crash before completing vacuum.  There will be
 no second intent record at which you could clean up if things are
 inconsistent.

 That's why you need to check the RedoRecPtr when you set the bit. If it 
 has changed, ie. a checkpoint has happened, the set bit step will write 
 a new intent record.

Oh, you explained the proposal poorly then.  I thought you meant recheck
and write another intent record just once, immediately before sending
the final xlog record.

It still seems rickety and not clearly correct, especially when you
start thinking about all the other constraints we have on xlog behavior
(eg, does this work while taking a base backup).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Greg Smith

Jeff Janes wrote:

For the individual file sync times emitted under debug1, it would be
very handy if the file being synced was identified, for example
relation base/16384/16523. Rather than being numbered sequentially
within a given checkpoint.
  


I was numbering them sequentially so that it's straightforward to graph 
the sync times in an external analysis tool, but the relation data is 
helpful too.  New patch reflecting all upthread suggestions is 
attached.  The output looks like this now at DEBUG1:


LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/16424/11645 
time=11589.549000 msec

DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.53 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm 
time=2.921000 msec

DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm 
time=1.654000 msec

DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.92 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm 
time=2.613000 msec

DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm 
time=2.638000 msec

DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm 
time=8.686000 msec
DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm 
time=5.984000 msec
LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log 
file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s, 
total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s


I kept the units for the DEBUG level ones in msec because that's a 
better scale for the common really short syncs during good behavior.  
But the summary info in seconds now appears at the end of the existing 
checkpoint complete message, so only one line to parse for those 
looking to analyze the gross checkpoint data.  That looks to work well 
enough for finding situations like the big ext3 spikes.  You can easily 
see one in this example by the fact that longest=11.589 s is almost 
the entirety of sync=11.715 s.  That's the key thing there's 
currently no visibility into, and it's made obvious by this patch.
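
For anyone scripting against that summary line, a small sketch of pulling
out the two numbers and flagging sync-dominated checkpoints (illustrative
only; the 90% threshold is an assumption, not part of the patch):

#include <stdio.h>
#include <string.h>

void check_line(const char *line)
{
    const char *s = strstr(line, "sync=");
    const char *l = strstr(line, "longest=");
    double sync_s, longest_s;

    if (s && l &&
        sscanf(s, "sync=%lf", &sync_s) == 1 &&
        sscanf(l, "longest=%lf", &longest_s) == 1 &&
        longest_s > 0.9 * sync_s)
        printf("possible fsync spike: %s", line);
}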


This might be ready for some proper review now.  I know there's at least 
one blatant bug still in here I haven't found yet, related to how the 
averages are computed.  I saw this once:


LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log 
file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, 
total=0.001 s; sync files=0, longest=0.000 s, 
average=-9223372036854775808.-2147483 s


After an immediate checkpoint, so at least one path not quite right yet.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD


diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede6ceb..6f6eb3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7063,10 +7063,15 @@ LogCheckpointEnd(bool restartpoint)
 {
 	long		write_secs,
 sync_secs,
-total_secs;
+total_secs,
+longest_secs,
+average_secs;
 	int			write_usecs,
 sync_usecs,
-total_usecs;
+total_usecs,
+longest_usecs,
+average_usecs;
+	double		average_sync_time;
 
 	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
 
@@ -7082,18 +7087,35 @@ LogCheckpointEnd(bool restartpoint)
 		CheckpointStats.ckpt_sync_end_t,
 		sync_secs, sync_usecs);
 
+	/*
+	 * Timing values returned from CheckpointStats are in milliseconds.
+	 * Convert to the second plus microsecond form that TimestampDifference
+	 * returns for homogeneous printing.
+	 */
+	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+	longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - longest_secs * 1000);
+
+	average_sync_time = CheckpointStats.ckpt_longest_sync / CheckpointStats.ckpt_sync_rels;
+	average_secs = (long) (average_sync_time / 1000);
+	average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+
 	if (restartpoint)
 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
-			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
 			 CheckpointStats.ckpt_bufs_written,
 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 			 write_secs, write_usecs / 1000,
 			 sync_secs, sync_usecs / 1000,
-			 total_secs, total_usecs / 1000);
+			 total_secs, total_usecs / 1000,
+			 

Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 That's definitely sucky, but in some ways it would be more complicated
 if they did, because I don't think all-visible on the master implies
 all-visible on the standby.

Ouch.  That seems like it could shoot down all these proposals.  There
definitely isn't any way to make VM crash-safe if there is no WAL-driven
mechanism for setting the bits.

I guess what we need is a way to delay the application of such a WAL
record on the slave until it's safe, which means the record also has to
carry some indication of the youngest XMIN on the page.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Oh, but it's worse than that.  When you XLOG a WAL record for each of
 those pages, you're going to trigger full-page writes for all of them.
  So now you've turned 1GB of data to write into 2+ GB of data to
 write.

 No, because only the first mod of each VM page would trigger a full page
 write, at least assuming a reasonable ordering of the operations.

I'm not worried about the full-page writes from updating the
visibility map - I'm worried about the full-page writes from updating
the heap.  It doesn't matter a whit if we fail to set a bit in the
visibility map. What matters is if we DO set the bit in the visibility
map but FAIL TO set the bit in the heap, because then a subsequent
update to the heap page won't check the visibility map and clear the
bit.  The *heap* updates are the ones that have to be guaranteed to
make it to disk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:40, Tom Lane wrote:

Robert Haas robertmh...@gmail.com  writes:

That's definitely sucky, but in some ways it would be more complicated
if they did, because I don't think all-visible on the master implies
all-visible on the standby.


Ouch.  That seems like it could shoot down all these proposals.  There
definitely isn't any way to make VM crash-safe if there is no WAL-driven
mechanism for setting the bits.


Note that this is only a problem for *hot* standby. After failover, all 
the tuples that were visible to everyone in the master are also visible 
to all new transactions in the standby.


We dealt with this in 9.0 already, with the killed flag in index 
tuples and the PD_ALL_VISIBLE flag in heap scans. We simply don't 
believe them in hot standby mode, and check visibility even if the flag 
is set.
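
The 9.0-era treatment being referred to boils down to something like this
(a paraphrase, not a verbatim excerpt):

/* hints are advisory in hot standby: a heap scan only trusts
   PD_ALL_VISIBLE when the snapshot wasn't taken during recovery */
bool all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;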



I guess what we need is a way to delay the application of such a WAL
record on the slave until it's safe, which means the record also has to
carry some indication of the youngest XMIN on the page.


Something like that would certainly be nice. With index-only scans, it 
can be a big disappointment if you can't do an index-only scan in hot 
standby.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 18:33, Tom Lane wrote:

Robert Haas robertmh...@gmail.com  writes:

Oh, but it's worse than that.  When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
  So now you've turned 1GB of data to write into 2+ GB of data to
write.


No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.


If you change the LSN on the heap pages, you have to write full page 
images of those as well.


Let's recap what happens when a VM bit is set: You set the 
PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it 
usually isn't), and then set the bit in the VM while keeping the heap 
page locked.


Can we get away with not setting the LSN on the heap page, even though 
we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page 
can be flushed to disk before the WAL record, but I think that's fine 
because it's OK to have the flag set in the heap page even if the VM bit 
is not set.
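
A minimal sketch of that ordering (the unmarked routines are real
PostgreSQL names, the marked one is hypothetical; this is an illustration,
not a patch):

LockBuffer(heapBuf, BUFFER_LOCK_EXCLUSIVE);
PageSetAllVisible(heapPage);         /* set PD_ALL_VISIBLE */
MarkBufferDirty(heapBuf);            /* note: no PageSetLSN() on the heap page */

recptr = log_vm_bit_set(rel, blk);   /* hypothetical WAL-insert call */
PageSetLSN(vmPage, recptr);          /* only the VM page is LSN-guarded */
/* ... set the bit in the visibility map ... */
LockBuffer(heapBuf, BUFFER_LOCK_UNLOCK);

/* crash window: PD_ALL_VISIBLE set while the VM bit is not -- the
   direction argued above to be harmless */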


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 That's definitely sucky, but in some ways it would be more complicated
 if they did, because I don't think all-visible on the master implies
 all-visible on the standby.

 Ouch.  That seems like it could shoot down all these proposals.  There
 definitely isn't any way to make VM crash-safe if there is no WAL-driven
 mechanism for setting the bits.

Heikki's intent method works fine, because the WAL record only clears
the visibility map bits on redo; it never sets them.

 I guess what we need is a way to delay the application of such a WAL
 record on the slave until it's safe, which means the record also has to
 carry some indication of the youngest XMIN on the page.

I'm unexcited about inventing more ways to delay XLOG application on
the standby.  We have enough of those already.

We could actually allow the slave to set the visibility map bits based
on its own xmin horizon.  The only problem is that you wouldn't be
able to write the intent XLOG records.  I suppose you could have a
separate file which is just used to store the intent records, and
designate a range of very-high numbered LSNs to mean blocks of the
intent file rather than a position in the regular WAL stream.  VACUUM
is so much fun on the master, let's have it on the standby too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Can we get away with not setting the LSN on the heap page, even though 
 we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page 
 can be flushed to disk before the WAL record, but I think that's fine 
 because it's OK to have the flag set in the heap page even if the VM bit 
 is not set.

Why is that fine?  It's certainly not fine from the standpoint of
someone wondering why his index-only scan performs so badly.

I think all this hair-splitting about cases where it's okay to have one
bit set and not the other is misguided.  To me, crash-safety of the VM
means that its copy of the page-header bit is right.  Period.  Yes, it
will cost something to ensure that; so what?  If we don't get more than
enough compensating performance gain from index-only scans, the whole
patch is going to end up reverted.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Another proposal for table synonyms

2010-11-30 Thread Alexey Klyukin

On Nov 30, 2010, at 6:28 PM, Tom Lane wrote:

 Alexey Klyukin al...@commandprompt.com writes:
 To support addition of new database objects types that can be referenced by
 synonyms a new system catalog, pg_synonym, is to be added, with an oid to
 support comments on synonym, and the following schema:
 
 This is not going to work, at least not without making every type of
 lookup consult pg_synonym too, which I think can be considered DOA
 because of its performance impact on people who aren't even using the
 feature.

For those not using synonyms it would result in an extra syscache lookup for
each schema from the search_path that doesn't contain a table with the
specified name. If the table is specified with a FQN or is contained in the
first schema from the search_path, no extra lookup would occur. Is that
considered a big impact? The number of such lookups can be reduced if we
traverse the search_path for the tables first, and then look for the
synonyms, although that would change the lookup rules stated in this proposal.

  It's also quite unclear how you prevent duplicate names
 if the synonyms are in their own catalog.  (And no, the part of your
 proposal that says you're not preventing that isn't acceptable from
 a usability standpoint.)

What's wrong with the usability of that feature? The fact that a table with
the same FQN as a synonym masks the latter can be clearly stated in the
documentation. Are you expecting lots of people to name a synonym exactly
the same as one of the database tables and then wonder why the table, and
not the synonym, gets accessed? As an alternative, a warning could be
emitted on table/synonym creation or renaming if a name clash occurs.

 
 You could reasonably support synonyms for tables/views by storing them
 in pg_class with a new relkind.  This doesn't cover synonyms for other
 object types, but since the total world demand for such a feature is
 approximately zero, that's not really a problem.

I think that would almost guarantee that synonyms for other kinds of objects
(e.g. databases; such synonyms were requested in the past) would never
be added.


--
Alexey Klyukin  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ouch.  That seems like it could shoot down all these proposals.  There
 definitely isn't any way to make VM crash-safe if there is no WAL-driven
 mechanism for setting the bits.

 Heikki's intent method works fine, because the WAL record only clears
 the visibility map bits on redo; it never sets them.

Uh, no, because he also had that final WAL record that would set the
bits.

 We could actually allow the slave to set the visibility map bits based
 on its own xmin horizon.

Not in a crash-safe way, which is exactly the problem here.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:49 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 30.11.2010 18:33, Tom Lane wrote:

 Robert Haas robertmh...@gmail.com  writes:

 Oh, but it's worse than that.  When you XLOG a WAL record for each of
 those pages, you're going to trigger full-page writes for all of them.
  So now you've turned 1GB of data to write into 2+ GB of data to
 write.

 No, because only the first mod of each VM page would trigger a full page
 write, at least assuming a reasonable ordering of the operations.

 If you change the LSN on the heap pages, you have to write full page images
 of those as well.

 Let's recap what happens when a VM bit is set: You set the PD_ALL_VISIBLE
 flag on the heap page (assuming it's not set already, it usually isn't), and
 then set the bit in the VM while keeping the heap page locked.

 Can we get away with not setting the LSN on the heap page, even though we
 set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page can be
 flushed to disk before the WAL record, but I think that's fine because it's
 OK to have the flag set in the heap page even if the VM bit is not set.

I don't immediately see why that wouldn't work.  As long as you bump
the LSN on the visibility map page, and also bump the LSN of the
visibility map page every time you clear a bit, I think you should be
OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:55 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Can we get away with not setting the LSN on the heap page, even though
 we set the PD_ALL_VISIBLE flag? If we don't set the LSN, the heap page
 can be flushed to disk before the WAL record, but I think that's fine
 because it's OK to have the flag set in the heap page even if the VM bit
 is not set.

 Why is that fine?  It's certainly not fine from the standpoint of
 someone wondering why his index-only scan performs so badly.

 I think all this hair-splitting about cases where it's okay to have one
 bit set and not the other is misguided.  To me, crash-safety of the VM
 means that its copy of the page-header bit is right.  Period.  Yes, it
 will cost something to ensure that; so what?  If we don't get more than
 enough compensating performance gain from index-only scans, the whole
 patch is going to end up reverted.

We're not going to double the cost of VACUUM to get index-only scans.
And that's exactly what will happen if you do full-page writes of
every heap page to set a single bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 11:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 30, 2010 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ouch.  That seems like it could shoot down all these proposals.  There
 definitely isn't any way to make VM crash-safe if there is no WAL-driven
 mechanism for setting the bits.

 Heikki's intent method works fine, because the WAL record only clears
 the visibility map bits on redo; it never sets them.

 Uh, no, because he also had that final WAL record that would set the
 bits.

Well, as already discussed upthread, that WAL record causes some other
problems, so make it Heikki's intent method, without the final WAL
record that breaks things.

 We could actually allow the slave to set the visibility map bits based
 on its own xmin horizon.

 Not in a crash-safe way, which is exactly the problem here.

Brilliant selective quoting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Another proposal for table synonyms

2010-11-30 Thread Tom Lane
Alexey Klyukin al...@commandprompt.com writes:
 On Nov 30, 2010, at 6:28 PM, Tom Lane wrote:
 This is not going to work, at least not without making every type of
 lookup consult pg_synonym too, which I think can be considered DOA
 because of its performance impact on people who aren't even using the
 feature.

 For those not using synonyms it would result in an extra syscache lookup for
 each schema from the search_path that doesn't contain a table with the
 specified name. If the table is specified with a FQN or is contained in the
 first schema from the search_path, no extra lookup would occur. Is that
 considered a big impact?

Yes.  It'll be slow and it will render code that's already unreasonably
complicated into an unreadable morass.  We are not going there.

(Just to be clear, it's not the table search case I'm worried about;
it's operator/function lookup that I think this would be completely
unacceptable for.  And if you're only going to support table/view
synonyms then you might as well put them in pg_class.)

 I think that would almost guarantee that synonyms for other kinds of objects
 (i.e. databases, such kind of synonyms were requested in the past) would never
 be added.

That's fine with me.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 We're not going to double the cost of VACUUM to get index-only scans.
 And that's exactly what will happen if you do full-page writes of
 every heap page to set a single bit.

It's ridiculous to claim that that doubles the cost of VACUUM.  In the
worst case, it will add 25% to the cost of setting an all-visible bit on
a page where there is no other work to do.  (You already are writing out
the heap page and the VM page, plus a WAL image of the heap page, so a
WAL image of the VM page adds 25%.  But only if you did not set any
other bits on the same VM page, which is probably not a real common
case.)  Given that VACUUM has a lot of other cleanup besides visibility
bit setting, I'm not convinced that this would even be noticeable.
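
Spelled out, the worst-case accounting above (illustrative, assuming 8 KB
pages and counting page-sized writes):

/* per all-visible bit set, in 8 KB page-write units:
     heap page write            1
     VM page write              1
     WAL FPI of the heap page   1
     WAL FPI of the VM page     1   <- the addition at issue
   the VM-page image is 1 unit of 4, i.e. 25%, and is needed only for
   the first bit set on that VM page after a checkpoint */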

I think the burden is on people who are proposing complicated mechanisms
to show that there's actually a strong need for them.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 We're not going to double the cost of VACUUM to get index-only scans.
 And that's exactly what will happen if you do full-page writes of
 every heap page to set a single bit.

 It's ridiculous to claim that that doubles the cost of VACUUM.  In the
 worst case, it will add 25% to the cost of setting an all-visible bit on
 a page where there is no other work to do.  (You already are writing out
 the heap page and the VM page,

True.

 plus a WAL image of the heap page, so a

False.  That is exactly what we are NOT doing now and what we must
find a way to avoid doing.

 WAL image of the VM page adds 25%.  But only if you did not set any
 other bits on the same VM page, which is probably not a real common
 case.)  Given that VACUUM has a lot of other cleanup besides visibility
 bit setting, I'm not convinced that this would even be noticeable.

 I think the burden is on people who are proposing complicated mechanisms
 to show that there's actually a strong need for them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 30, 2010 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 It's ridiculous to claim that that doubles the cost of VACUUM.  In the
 worst case, it will add 25% to the cost of setting an all-visible bit on
 a page where there is no other work to do.  (You already are writing out
 the heap page and the VM page,

 True.

 plus a WAL image of the heap page, so a

 False.  That is exactly what we are NOT doing now and what we must
 find a way to avoid doing.

I do not accept that argument.  You can't make an omelette without
breaking eggs, and the cost of index-only scans is going to be that
it costs more to get the visibility bits set in the first place.

But having said that, I wonder whether we need a full-page image for
a WAL-logged action that is known to involve only setting a single bit
and updating LSN.  Would omitting the FPI be any more risky than what
happens now (ie, the page does get written back to disk at some point,
without any image from which it can be rewritten if the write fails...)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 19:22, Tom Lane wrote:

But having said that, I wonder whether we need a full-page image for
a WAL-logged action that is known to involve only setting a single bit
and updating LSN.  Would omitting the FPI be any more risky than what
happens now (ie, the page does get written back to disk at some point,
without any image from which it can be rewritten if the write fails...)


You have to write a full-page image if you update the LSN, because 
otherwise the next update that comes along will not write a full page image.
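
The generation-side rule being pointed at, roughly (a paraphrase of the
backup-block decision, not a verbatim excerpt):

/* a full-page image is included only if the page hasn't been
   WAL-logged since the current checkpoint's redo pointer */
if (PageGetLSN(page) <= RedoRecPtr)
    ;   /* attach a full-page image to the WAL record */
else
    ;   /* log only the incremental change */

/* so bumping the LSN past RedoRecPtr without an FPI means the next
   real update skips its FPI too, leaving no image to repair a torn page */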


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 But having said that, I wonder whether we need a full-page image for
 a WAL-logged action that is known to involve only setting a single bit
 and updating LSN.  Would omitting the FPI be any more risky than what
 happens now (ie, the page does get written back to disk at some point,
 without any image from which it can be rewritten if the write fails...)

That's pretty much exactly what Heikki proposed 35 minutes ago, and
you objected 6 minutes later.  I still think it might work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 12:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 30, 2010 at 12:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 But having said that, I wonder whether we need a full-page image for
 a WAL-logged action that is known to involve only setting a single bit
 and updating LSN.  Would omitting the FPI be any more risky than what
 happens now (ie, the page does get written back to disk at some point,
 without any image from which it can be rewritten if the write fails...)

 That's pretty much exactly what Heikki proposed 35 minutes ago, and
 you objected 6 minutes later.  I still think it might work.

Oh, I see the difference now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 30.11.2010 19:22, Tom Lane wrote:
 But having said that, I wonder whether we need a full-page image for
 a WAL-logged action that is known to involve only setting a single bit
 and updating LSN.

 You have to write a full-page image if you update the LSN, because 
 otherwise the next update that comes along will not write a full page image.

Um.  Drat.  I was thinking about the replay side, where I think it would
actually work --- but you're right, it would break the logic on the
generation side.  Unless you want to put in some kind of flag saying
this was only a visibility bit update, any bigger update still needs
to write an FPI.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Simon Riggs
On Mon, 2010-11-29 at 21:37 -0500, Josh Kupershmidt wrote:

 I still see little reason to make LOCK TABLE permissions different for
 column-level vs. table-level UPDATE privileges 

Agreed.

This is the crux of the debate. Why should this inconsistency be allowed
to continue?

Are there covert channel issues here, KaiGai?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Daniel Loureiro
 3. This doesn't work tremendously well for inheritance trees, where
 ModifyTable acts as sort of an implicit Append node.  You can't just
 funnel all the tuples through one Sort or Limit node because they aren't
 all the same rowtype.  (Limit might perhaps not care, but Sort will.)
 But you can't have a separate Sort/Limit for each table either, because
 that would give the wrong behavior.  Another problem with funneling all
 the rows through one Sort/Limit is that ModifyTable did need to know
 which table each row came from, so it can apply the modify to the right
 table.

So I guess that I have choose the wrong hack to start.

Just out of curiosity, why isn't the result of the WHERE filter (in
SELECT/DELETE/UPDATE) put in memory, i.e. an array of ctids, like a
buffer, and then processed by the SELECT/DELETE/UPDATE all at once?

Greets,
--
Daniel Loureiro


Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, 2010-11-29 at 21:37 -0500, Josh Kupershmidt wrote:

 I still see little reason to make LOCK TABLE permissions different for
 column-level vs. table-level UPDATE privileges

 Agreed.

 This is the crux of the debate. Why should this inconsistency be allowed
 to continue?

Well, a user with full-table UPDATE privileges can trash the whole
thing, so, from a security perspective, letting them lock is only
allowing them to deny access to data they could have just as easily
trashed.  A user with only single-column UPDATE privileges cannot
trash the whole table, though, so you are allowing them to deny read
access to data they may not themselves have permission either to read
or to update.

Admittedly, this seems a bit more rickety in light of your point that
they can still lock all the rows... but then that only stops writes,
not reads. I'm less convinced that I'm right about this than I was 3
days ago.  But I'm still not convinced that the above argument is
wrong, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Daniel Loureiro
To me the key is security - it's an anti-DBA-with-lack-of-attention feature.
If I forget the WHERE clause, I will delete some valid tuples and mess up
the db, but that's less bad than wiping out the whole table. A DBA who
never forgot a WHERE in a DELETE is not a DBA. Just kidding, but this
happens often enough.

Is there another option to implement this? Could it be done by a
plugin/extension (in the style of a Firefox browser add-on)?

Sds,
--
Daniel Loureiro
--

2010/11/30 Andrew Dunstan and...@dunslane.net



 On 11/30/2010 09:57 AM, Csaba Nagy wrote:


 So it is really an ideological thing and not lack of demand or
 implementation attempts... I for myself can't write working C code
 anyway, so I've made my peace with the workaround - I wish you good luck
 arguing with Tom :-)




 We need a convincing use case for it. So far the only one that's seemed at
 all convincing to me is the one about deleting in batches. But that might be
 enough.

 As for it being illogical, I don't think it's any more so than

 DELETE FROM foo WHERE random() < 0.1;

 and you can do that today.

 cheers

 andrew




Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Cédric Villemain
2010/11/30 Greg Smith g...@2ndquadrant.com:
 Jeff Janes wrote:

 For the individual file sync times emitted under debug1, it would be
 very handy if the file being synced was identified, for example
 relation base/16384/16523. Rather than being numbered sequentially
 within a given checkpoint.


 I was numbering them sequentially so that it's straightforward to graph the
 sync times in an external analysis tool, but the relation data is helpful
 too.  New patch reflecting all upthread suggestions is attached.  The output
 looks like this now at DEBUG1:

 LOG:  checkpoint starting: xlog
 DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000
 msec
 DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
 DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.53 msec
 DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
 DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
 DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000
 msec
 DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
 DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000
 msec
 DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.92 msec
 DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000
 msec
 DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
 DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000
 msec
 DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
 DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
 DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000
 msec
 DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000
 msec
 LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log
 file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s,
 total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s

 I kept the units for the DEBUG level ones in msec because that's a better
 scale for the common really short syncs during good behavior.  But the
 summary info in seconds now appears at the end of the existing checkpoint
 complete message, so only one line to parse for those looking to analyze
 the gross checkpoint data.  That looks to work well enough for finding
 situations like the big ext3 spikes.  You can easily see one in this example
 by the fact that longest=11.589 s is almost the entirety of sync=11.715
 s.  That's the key thing there's currently no visibility into, and
 it's made obvious by this patch.

wonderfull.


 This might be ready for some proper review now.  I know there's at least one
 blatant bug still in here I haven't found yet, related to how the averages
 are computed.  I saw this once:

 LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
 added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
 sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s

 After an immediate checkpoint, so at least one path not quite right yet.

 --
 Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD




 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers





-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] column-level update privs + lock table

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Well, a user with full-table UPDATE privileges can trash the whole
 thing, so, from a security perspective, letting them lock is only
 allowing them to deny access to data they could have just as easily
 trashed.  A user with only single-column UPDATE privileges cannot
 trash the whole table, though, so you are allowing them to deny read
 access to data they may not themselves have permission either to read
 or to update.

 Admittedly, this seems a bit more rickety in light of your point that
 they can still lock all the rows... but then that only stops writes,
 not reads. I'm less convinced that I'm right about this than I was 3
 days ago.  But I'm still not convinced that the above argument is
 wrong, either.

I think what your complaint really boils down to is that LOCK TABLE
is granting excessive permissions to someone who only has table-level
UPDATE privilege.  If we were designing that from scratch today, I am
sure we'd have made it tighter, say that UPDATE alone wouldn't give you
more than RowExclusive lock.  However, given the lack of complaints
about this from the field, I can't get very excited about a
non-backward-compatible change to tighten LOCK's privilege checking.

I find the argument that column-level update should give weaker locking
privileges than full-table update to be pretty thin.  That isn't the
case at the row level; why should it be true at the table level?

However, the other side of the lack of complaints argument is that
few people seem to care about whether LOCK TABLE responds to column
level privileges, either.  AFAICS this patch is not in response to any
user request but just because Josh thought things were inconsistent.
Which they are, but at a deeper level than this.  If we apply this
patch, then we'll be expanding the set of cases where LOCK is granting
privilege too freely, and thus creating more not less
backwards-compatibility problem if we are ever to make it saner.

On the whole I agree with Robert --- let's just adjust the
documentation, and not enlarge privileges in a way we might regret
later.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Kevin Grittner
Daniel Loureiro dan...@termasa.com.br wrote:
 
 To me the key is security - it's an anti-DBA-with-lack-of-attention
 feature.
 
Well, it seems pretty weak to me for that purpose.  You still trash
data, and you don't have any immediate clue as to what.  If you
wanted protection from that you'd want more of an assert limit
that would fail if the affected row count was above what you
specified.
 
For me the best solution is to develop good habits.  I first type my
statement as SELECT * FROM ... and after reviewing the results
arrow up and replace SELECT * with DELETE.  If there's enough
volatility or complexity to make that insufficient insurance, I
begin a transaction.  That way I can not only review row counts but
run queries against the modified data to confirm correct
modification before issuing a COMMIT (or ROLLBACK).
 
The batching of updates so that vacuums can make space available for
re-use is more compelling to me, but still pretty iffy, since the
work-arounds aren't that hard to find.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Andrew Dunstan



On 11/30/2010 02:12 PM, Kevin Grittner wrote:

Daniel Loureiro dan...@termasa.com.br  wrote:


To me the key is security - it's an anti-DBA-with-lack-of-attention
feature.


Well, it seems pretty weak to me for that purpose.  You still trash
data, and you don't have any immediate clue as to what.


I agree, that argument is completely misconceived. If the DBA is paying 
enough attention to use LIMIT, s/he should be paying enough attention 
not to do damage in the first place. If that were the only argument in 
its favor I'd be completely against the feature.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Marko Tiikkaja

On 11/30/2010 02:12 PM, Kevin Grittner wrote:

Daniel Loureiro dan...@termasa.com.br   wrote:


To me the key is security - it's an anti-DBA-with-lack-of-attention
feature.


Well, it seems pretty weak to me for that purpose.  You still trash
data, and you don't have any immediate clue as to what.


I agree, that argument is completely misconceived. If the DBA is paying
enough attention to use LIMIT, s/he should be paying enough attention
not to do damage in the first place. If that were the only argument in
its favor I'd be completely against the feature.


I don't buy the argument either; why would you put a LIMIT there and 
delete one row by accident when you could put a BEGIN; in front and not 
do any damage at all?



Regards,
Marko Tiikkaja

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] profiling connection overhead

2010-11-30 Thread Peter Eisentraut
On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
 Rolling in calloc in place of
 malloc/memset made no particular difference either, which says that
 Fedora 13's glibc does not have any optimization for that case as I'd
 hoped.

glibc's calloc is either mmap of /dev/zero or malloc followed by memset.
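
That is, the two strategies are roughly (a trivial standalone sketch;
the overflow check calloc also performs is omitted):

#include <stdlib.h>
#include <string.h>

void *zeroed_alloc(size_t nitems, size_t size)
{
    /* strategy 1: pre-zeroed memory straight from the allocator; large
       requests come back as fresh (already zero) mmap'd pages */
    void *a = calloc(nitems, size);
    free(a);

    /* strategy 2: plain malloc followed by an explicit memset */
    void *b = malloc(nitems * size);
    if (b)
        memset(b, 0, nitems * size);
    return b;
}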


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Jeff Davis
On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote:
  3. This doesn't work tremendously well for inheritance trees, where
  ModifyTable acts as sort of an implicit Append node.  You can't just
  funnel all the tuples through one Sort or Limit node because they aren't
  all the same rowtype.  (Limit might perhaps not care, but Sort will.)
  But you can't have a separate Sort/Limit for each table either, because
  that would give the wrong behavior.  Another problem with funneling all
  the rows through one Sort/Limit is that ModifyTable did need to know
  which table each row came from, so it can apply the modify to the right
  table.
 
 Could you possibly have ModifyTable -> Limit -> MergeAppend?

Before MergeAppend knows which tuple to produce, it needs to see the
tuples (at least the first one from each of its children), meaning that
it needs to pull them through ModifyTable; and at that point it's
already too late.

Also, assuming LIMIT K, MergeAppend will have N children, meaning N
limits, meaning an effective limit of K*N rather than K.

Can you be a little more specific about what you mean?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Marko Tiikkaja
While reading this thread, I thought of two things I think we could do 
if this feature was implemented:


 1. Sort large UPDATE/DELETEs so it is done in heap order

This is actually a TODO item.  I imagine it would be possible to do 
something like:


DELETE FROM foo USING (...) ORDER BY ctid;

with this patch to help this case.

 2. Reducing deadlocks in big UPDATE/DELETEs

One problem that sometimes occurs when doing multiple multi-row UPDATEs 
or DELETEs concurrently is that the transactions end up working on the 
same rows, but in a different order.  One could use an ORDER BY clause 
to make sure the transactions don't deadlock.


Thoughts?


Regards,
Marko Tiikkaja

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] KNNGIST next step: adjusting indexAM API

2010-11-30 Thread Tom Lane
In the current KNNGIST patch, the indexable ORDER BY clauses are
transmitted to the executor by cramming them in with the index qual
conditions (the IndexScan plan node's indexqual list), from whence
they become part of the ScanKey array passed to the index AM.
Robert complained that this was an ingenious way to minimize the
number of lines touched by the patch but utterly ugly from any other
standpoint, and I quite agree.  An ORDER BY clause is a completely
different thing from a WHERE qual, so mixing them together doesn't
seem like a good idea.

However, if we hold to that principle then we need to modify the indexAM
API to pass the ordering operators separately.  This is no big deal as
far as the built-in AMs are concerned, particularly because 3 of the 4
need only assert that the additional list is empty.  The only reason it
would be a problem is if there were third-party index AMs that would be
affected to a larger degree; but I don't know of any.  Does anyone have
an objection to that?
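
One possible shape for that change (purely illustrative; nothing below
is a committed interface):

/* hypothetical: pass ordering operators separately from the index
   quals instead of mixing both into a single ScanKey array */
void index_rescan(IndexScanDesc scan,
                  ScanKey keys, int nkeys,            /* WHERE quals */
                  ScanKey orderbys, int norderbys);   /* ORDER BY operators */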

(Another thing that might be worth changing, as long as we have to touch
the beginscan and rescan APIs anyway, is to refactor the handling of
the initial set of scan keys.  It never made any sense to me for
RelationGetIndexScan to call index_rescan: that seems to accomplish
little except making it difficult for AM beginscan routines to do things
in a sane order.  I'm inclined to take that out and let the AM call
rescan internally if it wants to.)

Lastly, I'm pretty un-thrilled with the way that the KNNGIST patch
implements the interface to the opclass-specific hook functions.
Seems like it would be cleaner to leave the Consistent function alone
and invent a new, separate hook function for processing ORDER BY.
Is there a strong reason for having both things done in one call,
or was that just done as a byproduct of trying to cram all the data
into one ScanKey array?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix for seg picksplit function

2010-11-30 Thread Yeb Havinga

On 2010-11-16 09:57, Alexander Korotkov wrote:
On Tue, Nov 16, 2010 at 3:07 AM, Robert Haas robertmh...@gmail.com wrote:


The loop that begins here:

   for (i = 0; i < maxoff; i++)
   {
   /* First half of segs goes to the left datum. */
   if (i < seed_2)

...looks like it should perhaps be broken into two separate loops.
That might also help tweak the logic in a way that eliminates this:

seg.c: In function ‘gseg_picksplit’:
seg.c:327: warning: ‘datum_r’ may be used uninitialized in this
function
seg.c:326: warning: ‘datum_l’ may be used uninitialized in this
function

I restored original version of that loop.

But on a broader note, I'm not very certain the sorting algorithm is
sensible.  For example, suppose you have 10 segments that are exactly
'0' and 20 segments that are exactly '1'.  Maybe I'm misunderstanding,
but it seems like this will result in a 15/15 split when we almost
certainly want a 10/20 split.  I think there will be problems in more
complex cases as well.  The documentation says about the less-than and
greater-than operators that These operators do not make a lot of
sense for any practical purpose but sorting.

I think almost any split algorithm has corner cases where its results 
don't look very good. I think the way to understand the significance of 
these corner cases for real life is to perform sufficient testing 
on datasets which are close to real life. I don't feel able to 
propose enough test datasets and estimate their significance for 
real-life cases, and I need help in this field.

I think it is time to mark this patch ready for committer:

The unintuitive result thus far is that sorting outperforms the R-tree 
bounding-box style index, as Alexander has demonstrated with several 
different distributions on 20-11 (uniform, natural (is that a bell 
curve?), many distinct values).


My personal opinion is that I like the single loop for walking over the 
sort array (aka gbt_num_picksplit) more than the two different ones, but 
I'm in the minority here.


Two remarks on this patch also apply to other picksplit functions 
currently around:
1) the *first = *last = FirstOffsetNumber assignment, as noted by 
Alvaro, is not necessary anymore, and Oleg confirmed this has been true 
since PostgreSQL > 7.x. 2) loops over something other than the 
entryvector had better not use FirstOffsetNumber/OffsetNumberNext, as 
indicated by Tom.


If this patch is committed, it might be an idea to change the other 
occurences as well.


regards,
Yeb Havinga



Re: [HACKERS] Another proposal for table synonyms

2010-11-30 Thread Josh Berkus
Alexey,

 Here is the proposal to add synonyms to PostgreSQL. Initial goal is to add 
 synonyms
 for relations (tables, views, sequences) and an infrastructure to allow 
 synonyms
 for other database objects in the future. 

Can you explain, for our benefit, the use case for this?  Specifically,
what can be done with synonyms which can't be done with search_path and
VIEWs?

I ask partly because I've migrated some Oracle databases to PostgreSQL,
and did not find replacing the functionality of synonyms to be at all
difficult.  Presumably you've run into a case which was difficult?

BTW, I have a specific use case for *column* synonyms which isn't
currently covered by our existing tools.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Where are we on Standby Promotion?

2010-11-30 Thread Josh Berkus
Fujii, Simon, Greg, etc.:

Has anyone submitted or committed a patch to make Standby Promotion*
easier, at this point?  We discussed it earlier in the dev cycle, but I
can't find anything which has actually been submitted.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

* that is, promotion to be a new master of other existing standbys

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-30 Thread Jeff Janes
On Tue, Nov 30, 2010 at 8:38 AM, Greg Smith g...@2ndquadrant.com wrote:


Hi Greg,

Thanks for the update.



 This might be ready for some proper review now.  I know there's at least one
 blatant bug still in here I haven't found yet, related to how the averages
 are computed.

Yes, the blatant bug:

average_sync_time = CheckpointStats.ckpt_longest_sync /
CheckpointStats.ckpt_sync_rels;

That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync.


 I saw this once:

 LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
 added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
 sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s

 After an immediate checkpoint, so at least one path not quite right yet.

Not clear what the right thing to do here is.  I guess we should
special case the div by zero to yield zero for the average?
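
One way that fix might look, as an untested sketch against the patch above
(assuming a ckpt_agg_sync_time field, per the correction):

	/* use the aggregate sync time, and guard the division */
	if (CheckpointStats.ckpt_sync_rels > 0)
		average_sync_time = CheckpointStats.ckpt_agg_sync_time /
							CheckpointStats.ckpt_sync_rels;
	else
		average_sync_time = 0;
	average_secs = (long) (average_sync_time / 1000);
	average_usecs = 1000 * (average_sync_time - average_secs * 1000);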

The patch is in unified diff rather than context diff.  Doesn't bother
me, but the wiki on doing a review says...

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Andres Freund
On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote:
  On 11/30/2010 02:12 PM, Kevin Grittner wrote:
  Daniel Loureiro dan...@termasa.com.br wrote:
  to me the key is security - it's an anti-DBA-with-lack-of-attention
  feature.
  
  Well, it seems pretty weak to me for that purpose.  You still trash
  data, and you don't have any immediate clue as to what.
  
  I agree, that argument is completely misconceived. If the DBA is paying
  enough attention to use LIMIT, s/he should be paying enough attention
  not to do damage in the first place. If that were the only argument in
  its favor I'd be completely against the feature.
 
 I don't buy the argument either; why would you put a LIMIT there and
 delete one row by accident when you could put a BEGIN; in front and not
 do any damage at all?
Because deleting the whole table may take awfully long?

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Greg Smith

Ron Mayer wrote:

Might smoother checkpoints be better solved by talking
to the OS vendors and virtual-memory-tuning-knob authors
to work with them on exposing the ideal knobs, rather than
saying that our only tool is a hammer (fsync) so the problem
must be handled as a nail?


Maybe, but it's hard to argue that the current implementation--just 
doing all of the sync calls as fast as possible, one after the 
other--isn't going to produce worst-case behavior in a lot of situations.  Given that 
it's not a huge amount of code to do better, I'd rather do some work in 
that direction, instead of presuming the kernel authors will ever make 
this go away.  Spreading the writes out as part of the checkpoint rework 
in 8.3 worked better than any kernel changes I've tested since then, and 
I'm not really optimistic about this getting resolved at the system level.  
So long as the database changes aren't antagonistic toward kernel 
improvements, I'd prefer to have some options here that become effective 
as soon as the database code is done.


I've attached an updated version of the initial sync spreading patch 
here, one that applies cleanly on top of HEAD and over top of the sync 
instrumentation patch too.  The conflict that made that hard before is 
gone now.


Having the pg_stat_bgwriter.buffers_backend_fsync patch available all 
the time now has made me reconsider how important one potential bit of 
refactoring here would be.  I managed to catch one of the situations 
where really popular relations were being heavily updated in a way that 
was competing with the checkpoint on my test system (which I can happily 
share the logs of), with the instrumentation patch applied but not the 
spread sync one:


LOG:  checkpoint starting: xlog
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 7747 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 42688 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 9723 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 58117 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]

DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm 
time=10132.83 msec

DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 
msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm 
time=0.001000 msec

DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.00 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm 
time=0.001000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm 
time=0.001000 msec
LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log 
file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, 
total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s


Note here how the checkpoint was hung on trying to get 16448_fsm written 
out, but the backends were issuing constant competing fsync calls to 
these other relations.  This is very similar to the production case this 
patch was written to address, which I hadn't been able to share a good 
example of yet.  That's essentially what it looks like, except with the 
contention going on for minutes instead of seconds.


One of the ideas Simon and I had been considering at one point was 
adding some better de-duplication logic to the fsync absorb code, which 
I'm reminded by the pattern here might be helpful independently of other 
improvements.
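
The kind of thing I have in mind, as a rough sketch (the queue layout
and the rnode/segno key here are assumptions, not the actual bgwriter
structures):

#include <stdbool.h>

/* Assumed shape of a pending fsync request: which file, which segment. */
typedef struct FsyncRequest
{
    unsigned    rnode;          /* stand-in for the relation's file node */
    unsigned    segno;          /* 1GB segment number within the relation */
} FsyncRequest;

/* Compact the queue in place by dropping duplicates: one fsync covers
 * every prior write to the file, so only one entry per (rnode, segno)
 * needs to survive.  O(n^2) for clarity; a hash table would be the
 * obvious refinement for a long queue. */
static int
compact_fsync_queue(FsyncRequest *q, int n)
{
    int         kept = 0;

    for (int i = 0; i < n; i++)
    {
        bool        dup = false;

        for (int j = 0; j < kept; j++)
        {
            if (q[j].rnode == q[i].rnode && q[j].segno == q[i].segno)
            {
                dup = true;
                break;
            }
        }
        if (!dup)
            q[kept++] = q[i];
    }
    return kept;                /* new queue length */
}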


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */

Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Alastair Turner
On Tue, Nov 30, 2010 at 9:24 PM, Marko Tiikkaja
marko.tiikk...@cs.helsinki.fi wrote:
 On 11/30/2010 02:12 PM, Kevin Grittner wrote:

 Daniel Loureiro dan...@termasa.com.br wrote:

 to me the key is security - it's an anti-DBA-with-lack-of-attention
 feature.

 Well, it seems pretty weak to me for that purpose.  You still trash
 data, and you don't have any immediate clue as to what.

 I agree, that argument is completely misconceived. If the DBA is paying
 enough attention to use LIMIT, s/he should be paying enough attention
 not to do damage in the first place. If that were the only argument in
 its favor I'd be completely against the feature.

 I don't buy the argument either; why would you put a LIMIT there and delete
 one row by accident when you could put a BEGIN; in front and not do any
 damage at all?

It is valuable as a DBA carelessness/typo catcher only if it is
imposed by default (in line with Kevin's point), and only if it rolls
the transaction back rather than reducing the number of affected rows
(as per Marko).

We have implemented a damage limitation solution similar to this with
triggers on an MSSQL database, and it has worked for the specific
environment it's in. The safety net is basically that the DBA has to
set an environment variable before a very large delete or update
operation. If the operation is recognised as being beyond the
threshold size, the environment variable is checked: if it is set, the
transaction passes and the variable is reset; if not, the transaction
is rolled back.

It should be possible to implement something along these lines in
triggers; all that would be needed is a structure for defining the
(optional) limits on potentially destructive operations. More flexible
options or options based on the number of rows in a table will rapidly
increase the performance impact of the triggers - but may make them
more useful.

I'm not sure if there is a way to persist data (like a row count)
between per-row triggers so that the operation could be aborted at the
limit rather than only once all the rows had been updated (potentially
a big performance gain).

Alastair Bell Turner

Technical Lead
^F5

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Andrew Dunstan



On 11/30/2010 03:16 PM, Andres Freund wrote:

On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote:

On 11/30/2010 02:12 PM, Kevin Grittner wrote:

Daniel Loureiro dan...@termasa.com.br wrote:

to me the key is security - it's an anti-DBA-with-lack-of-attention
feature.

Well, it seems pretty weak to me for that purpose.  You still trash
data, and you don't have any immediate clue as to what.

I agree, that argument is completely misconceived. If the DBA is paying
enough attention to use LIMIT, s/he should be paying enough attention
not to do damage in the first place. If that were the only argument in
its favor I'd be completely against the feature.

I don't buy the argument either; why would you put a LIMIT there and
delete one row by accident when you could put a BEGIN; in front and not
do any damage at all?

Because deleting the whole table may take awfully long?




I don't see that that has anything to do with restricting damage. LIMIT 
might be useful for the reason you give, but not as any sort of 
protection against DBA carelessness. That's what the discussion above is 
about.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] profiling connection overhead

2010-11-30 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
 Rolling in calloc in place of
 malloc/memset made no particular difference either, which says that
 Fedora 13's glibc does not have any optimization for that case as I'd
 hoped.

 glibc's calloc is either mmap of /dev/zero or malloc followed by memset.

Hmm.  I would have expected to see a difference then.  Do you know what
conditions are needed to cause the mmap to be used?
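
If it's simply malloc's mmap threshold (documented as 128 kB by
default, and adjustable with mallopt), then a quick Linux/glibc test
like this sketch should show both paths -- a fresh anonymous mmap is
already zero-filled by the kernel, so calloc can skip the memset
there.  The threshold values below are assumptions from the docs, not
measured behavior:

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Raise the threshold so a 1 MB request stays on the heap:
     * calloc has to memset it. */
    mallopt(M_MMAP_THRESHOLD, 256 * 1024 * 1024);
    char *heap_block = calloc(1, 1024 * 1024);

    /* Back to the documented default: the same request is now served
     * by a fresh anonymous mmap, pre-zeroed by the kernel. */
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    char *mmap_block = calloc(1, 1024 * 1024);

    printf("heap=%p mmap=%p\n", (void *) heap_block, (void *) mmap_block);
    free(heap_block);
    free(mmap_block);
    return 0;
}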

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 2:45 PM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote:
  3. This doesn't work tremendously well for inheritance trees, where
  ModifyTable acts as sort of an implicit Append node.  You can't just
  funnel all the tuples through one Sort or Limit node because they aren't
  all the same rowtype.  (Limit might perhaps not care, but Sort will.)
  But you can't have a separate Sort/Limit for each table either, because
  that would give the wrong behavior.  Another problem with funneling all
  the rows through one Sort/Limit is that ModifyTable did need to know
  which table each row came from, so it can apply the modify to the right
  table.

  Could you possibly have ModifyTable -> Limit -> MergeAppend?

 Before MergeAppend knows which tuple to produce, it needs to see the
 tuples (at least the first one from each of its children), meaning that
 it needs to pull them through ModifyTable; and at that point it's
 already too late.

 Also, assuming LIMIT K, MergeAppend will have N children, meaning N
 limits, meaning an effective limit of K*N rather than K.

 Can you be a little more specific about what you mean?

You seem to be imagining the MergeAppend node on top, but I had it in
the other order in my mind.  The ModifyTable node would be the
outermost plan node, pulling from the Limit, which would deliver the
first n table rows from the MergeAppend, which would be responsible for
getting them from the various child tables.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Josh Berkus

 Maybe, but it's hard to argue that the current implementation--just
 doing all of the sync calls as fast as possible, one after the other--isn't
 going to produce worst-case behavior in a lot of situations.  Given that
 it's not a huge amount of code to do better, I'd rather do some work in
 that direction, instead of presuming the kernel authors will ever make
 this go away.  Spreading the writes out as part of the checkpoint rework
 in 8.3 worked better than any kernel changes I've tested since then, and
 I'm not really optimistic about this getting resolved at the system level. 
 So long as the database changes aren't antagonistic toward kernel
 improvements, I'd prefer to have some options here that become effective
 as soon as the database code is done.

Besides, even if kernel/FS authors did improve things, the improvements
would not be available on production platforms for years.  And, for that
matter, while Linux and BSD are pretty responsive to our feedback,
Apple, Microsoft and Oracle are most definitely not.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] KNNGIST next step: adjusting indexAM API

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In the current KNNGIST patch, the indexable ORDER BY clauses are
 transmitted to the executor by cramming them in with the index qual
 conditions (the IndexScan plan node's indexqual list), from whence
 they become part of the ScanKey array passed to the index AM.
 Robert complained that this was an ingenious way to minimize the
 number of lines touched by the patch but utterly ugly from any other
 standpoint, and I quite agree.  An ORDER BY clause is a completely
 different thing from a WHERE qual, so mixing them together doesn't
 seem like a good idea.

 However, if we hold to that principle then we need to modify the indexAM
 API to pass the ordering operators separately.  This is no big deal as
 far as the built-in AMs are concerned, particularly because 3 of the 4
 need only assert that the additional list is empty.  The only reason it
 would be a problem is if there were third-party index AMs that would be
 affected to a larger degree; but I don't know of any.  Does anyone have
 an objection to that?

None here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Tom Lane
Marko Tiikkaja marko.tiikk...@cs.helsinki.fi writes:
 While reading this thread, I thought of two things I think we could do 
 if this feature was implemented:

   1. Sort large UPDATE/DELETEs so it is done in heap order

 This is actually a TODO item.  I imagine it would be possible to do 
 something like:
 DELETE FROM foo USING (...) ORDER BY ctid;
 with this patch to help this case.

Well, that's strictly an implementation detail; it is not a reason to
expose ORDER BY to the user, and even less of a reason to invent LIMIT.
It also hasn't got any of the problems we were discussing with
inheritance situations, since it'd be perfectly OK (in fact probably
desirable) to sort each table's rows separately.

   2. Reducing deadlocks in big UPDATE/DELETEs

 One problem that sometimes occurs when doing multiple multi-row UPDATEs 
 or DELETEs concurrently is that the transactions end up working on the 
 same rows, but in a different order.  One could use an ORDER BY clause 
 to make sure the transactions don't deadlock.

That, on the other hand, seems like potentially a valid use-case.  Note
that the user-given order would have to override any internal attempt to
order by ctid for this to be usable.

I had thought of a slightly different application, which could be
summarized with this example:

UPDATE sometab SET somecol = nextval('seq') ORDER BY id;

with the expectation that somecol's values would then fall in the same
order as the id column.  Unfortunately, that won't actually *work*
reliably, the reason being that ORDER BY is applied after targetlist
computation.  I think enough people would get burnt this way that we'd
have popular demand to make ORDER BY work differently in UPDATE than it
does in SELECT, which seems rather ugly not only from the definitional
side but the implementation side.

(DELETE escapes this issue because it has no user-definable elements in
its targetlist, which is another way that DELETE is simpler here.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Dimitri Fontaine
Andres Freund and...@anarazel.de writes:
 On Tuesday 30 November 2010 20:24:52 Marko Tiikkaja wrote:
 I don't buy the argument either; why would you put a LIMIT there and
 delete one row by accident when you could put a BEGIN; in front and not
 do any damage at all?
  Because deleting the whole table may take awfully long?

Then you just C-c and that's your ROLLBACK. Been there, seen that (a
developer came to me sweating over maybe-lost data — his luck was that,
having forgotten the WHERE clause, the delete took long enough for him
to C-c by reflex, the oops moment).

But more to the point, I don't see why we're this much on the policy
side of things rather than on the mechanism side. This feature has real
appealing usages (cheap work queues, anti-deadlock, huge data purges
with no production locking — you do that in little steps in a loop).

To summarize, I think the people arguing against are mostly saying
they will not themselves put time into the feature. I don't see us
refusing a good implementation on the grounds that misuse is
possible.

After all, advisory locks are session based, to name another great foot
gun. If you don't think it's big enough, think about web environments
and pgbouncer in transaction pooling mode. Loads of fun.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 You seem to be imagining the MergeAppend node on top, but I had it in
 the other order in my mind.  The ModifyTable node would be the
 outermost plan node, pulling from the Limit, which would deliver the
 first n table rows from the MergeAppend, which would be responsible for
 getting them from the various child tables.

That's just a variation of the Sort/Limit/ModifyTable approach.  It
doesn't fix the problem of how ModifyTable knows which table each row
came from, and it doesn't fix the problem of the rows not being all the
same rowtype.  (In fact it makes the latter worse, since now MergeAppend
has to be included in whatever kluge you invent to work around it.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Alvaro Herrera
Excerpts from Daniel Loureiro's message of Tue Nov 30 15:04:17 -0300 2010:

 So I guess that I have chosen the wrong hack to start with.

So it seems :-D

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Jeff Davis
On Tue, 2010-11-30 at 15:52 -0500, Robert Haas wrote:
 On Tue, Nov 30, 2010 at 2:45 PM, Jeff Davis pg...@j-davis.com wrote:
  On Tue, 2010-11-30 at 11:12 -0500, Robert Haas wrote:
 
   Could you possibly have ModifyTable -> Limit -> MergeAppend?
 
  Before MergeAppend knows which tuple to produce, it needs to see the
  tuples (at least the first one from each of its children), meaning that
  it needs to pull them through ModifyTable; and at that point it's
  already too late.
 
 
 You seem to be imagining the MergeAppend node on top

Yes, I assumed that the tuples flowed in the direction of the arrows ;)

Now that I think about it, your representation makes some sense given
our EXPLAIN output.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2010-11-30 Thread Jeff Janes
On Sun, Nov 14, 2010 at 3:48 PM, Greg Smith g...@2ndquadrant.com wrote:

...

 One change that turned out to be necessary rather than optional--to get good
 performance from the system under tuning--was to make regular background
 writer activity, including fsync absorb checks, happen during these sync
 pauses.  The existing code ran the checkpoint sync work in a pretty tight
 loop, which as I alluded to in an earlier patch today can lead to the
 backends competing with the background writer to get their sync calls
 executed.  This squashes that problem if the background writer is set up
 properly.

Have you tested out this absorb-during-syncing-phase code without
the sleep between the syncs?
I.e. so that it is still a tight loop, but the loop alternates between
sync and absorb, with no intentional pause?

I wonder if all the improvement you see might not be due entirely to
the absorb between syncs, and none or very little to
the sleep itself.

I ask because I don't have a mental model of how the pause can help.
Given that this dirty data has been hanging around for many minutes
already, what is a 3 second pause going to heal?

The healing power of clearing out the absorb queue seems much more obvious.
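
To put the same question in code form, with stub names standing in for
the real checkpointer operations (this is just the shape of the loop I
mean, not the patch's actual code):

#include <stdbool.h>
#include <unistd.h>

/* Stand-ins for the real operations; illustrative only. */
static void fsync_next_pending_file(void) { /* one fsync() call */ }
static void absorb_fsync_requests(void)   { /* drain the request queue */ }

/* Both variants absorb between syncs; the question is whether the
 * sleep adds anything beyond that. */
static void
sync_pending_files(int nfiles, bool pause_between_syncs)
{
    for (int i = 0; i < nfiles; i++)
    {
        fsync_next_pending_file();
        absorb_fsync_requests();
        if (pause_between_syncs && i < nfiles - 1)
            sleep(3);           /* the pause whose benefit is unclear */
    }
}

int main(void)
{
    sync_pending_files(4, false);   /* tight loop: sync + absorb only */
    sync_pending_files(4, true);    /* spread variant with pauses */
    return 0;
}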

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take three

2010-11-30 Thread Bruce Momjian
Heikki Linnakangas wrote:
 On 30.11.2010 18:33, Tom Lane wrote:
  Robert Haas robertmh...@gmail.com writes:
  Oh, but it's worse than that.  When you XLOG a WAL record for each of
  those pages, you're going to trigger full-page writes for all of them.
So now you've turned 1GB of data to write into 2+ GB of data to
  write.
 
  No, because only the first mod of each VM page would trigger a full page
  write, at least assuming a reasonable ordering of the operations.
 
 If you change the LSN on the heap pages, you have to write full page 
 images of those as well.
 
 Let's recap what happens when a VM bit is set: You set the 
 PD_ALL_VISIBLE flag on the heap page (assuming it's not set already, it 
 usually isn't), and then set the bit in the VM while keeping the heap 
 page locked.

What if we set PD_ALL_VISIBLE on the heap page, wait for a checkpoint to
happen so the heap page is guaranteed to be on disk, and then on the next
read, if PD_ALL_VISIBLE is set and the VM all-visible bit is not set, set
the VM bit?
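
In sketch form (toy types only; the LSN-versus-redo-pointer comparison
is one assumed way to decide that a checkpoint has intervened since the
flag was set):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins, not the real backend types or API. */
typedef struct HeapPageHdr
{
    uint64_t    lsn;            /* LSN of last WAL record touching the page */
    bool        all_visible;    /* the PD_ALL_VISIBLE flag */
} HeapPageHdr;

static bool     vm_bit[1024];           /* toy visibility map */
static uint64_t last_checkpoint_redo;   /* toy redo pointer */

/* Lazy repair on read: once a checkpoint has passed, a heap page whose
 * all-visible flag predates it is known durable, so the VM bit can be
 * set without dirtying or re-WAL-logging the heap page. */
static void
repair_vm_bit_on_read(const HeapPageHdr *page, unsigned blkno)
{
    if (page->all_visible && !vm_bit[blkno] &&
        page->lsn <= last_checkpoint_redo)
        vm_bit[blkno] = true;
}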

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DELETE with LIMIT (or my first hack)

2010-11-30 Thread Bruce Momjian
Daniel Loureiro wrote:
  3. This doesn't work tremendously well for inheritance trees, where
  ModifyTable acts as sort of an implicit Append node.  You can't just
  funnel all the tuples through one Sort or Limit node because they aren't
  all the same rowtype.  (Limit might perhaps not care, but Sort will.)
  But you can't have a separate Sort/Limit for each table either, because
  that would give the wrong behavior.  Another problem with funneling all
  the rows through one Sort/Limit is that ModifyTable did need to know
  which table each row came from, so it can apply the modify to the right
  table.
 
 So I guess that I have chosen the wrong hack to start with.
 
 Just out of curiosity, why is the result of the WHERE filter (in
 SELECT/DELETE/UPDATE) not put in memory, i.e. an array of ctids, like a
 buffer, and then executed by SELECT/DELETE/UPDATE all at once?

Informix dbaccess would prompt a user for confirmation if it saw a
DELETE with no WHERE.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Where are we on Standby Promotion?

2010-11-30 Thread Bruce Momjian
Josh Berkus wrote:
 Fujii, Simon, Greg, etc.:
 
 Has anyone submitted or committed a patch to make Standby Promotion*
 easier, at this point?  We discussed it earlier in the dev cycle, but I
 can't find anything which has actually been submitted.
 
 * that is, promotion to be a new master of other existing standbys

Nope, I haven't seen anything except a request that someone do testing.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Bruce Momjian
Heikki Linnakangas wrote:
  Does the current code cope with the corruption?
 
 It's not corruption, but intended degradation. Yes, the current code 
 copes with it, that's how GiST survives a crash. However, even with the 
 current code, VACUUM will nag if it finds any invalid tuples with this 
 message:
 
 ereport(NOTICE,
   (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash 
 recovery",
 
 That's harmless, in the sense that all scans and inserts work fine, but 
 scans might need to do more work than if the invalid tuple wasn't there.
 
 I don't think we need to go out of our way to support such degraded 
 indexes in 9.1. If you see such notices in your logs, you should REINDEX 
 anyway, before or after pg_upgrade. Let's just make sure that you get a 
 reasonable error message in 9.1 if a scan or insert encounters such a tuple.
 
 There is a section on this in the docs, BTW: 
 http://www.postgresql.org/docs/9.0/static/gist-recovery.html

OK, administrators will be prompted during normal operation --- seems
there is nothing extra for pg_upgrade to do here.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Does the current code cope with the corruption?

 It's not corruption, but intended degradation. Yes, the current code copes
 with it, that's how GiST survives a crash. However, even with the current
 code, VACUUM will nag if it finds any invalid tuples with this message:

 ereport(NOTICE,
        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
 recovery",

 That's harmless, in the sense that all scans and inserts work fine, but
 scans might need to do more work than if the invalid tuple wasn't there.

 I don't think we need to go out of our way to support such degraded indexes
 in 9.1. If you see such notices in your logs, you should REINDEX anyway,
 before or after pg_upgrade. Let's just make sure that you get a reasonable
 error message in 9.1 if a scan or insert encounters such a tuple.

I just don't want to take a risk of giving people unexpected wrong
answers.  It's not clear to me whether that's a risk here or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Josh Berkus
Hackers,

Some of you might already be aware that this combination produces a
fatal startup crash in PostgreSQL:

1. Create an Ext3 or Ext4 partition and mount it with data=journal on a
server with linux kernel 2.6.30 or later.
2. Initdb a PGDATA on that partition
3. Start PostgreSQL with the default config from that PGDATA

This was reported a ways back:
https://bugzilla.redhat.com/show_bug.cgi?format=multipleid=567113

To explain: using O_DIRECT on an ext3 or ext4 partition mounted with
data=journal causes a crash.  However, recent Linux kernels report
support for O_DIRECT when we compile PostgreSQL, so we use it by
default.  This results in a crash-by-default situation on new
Linuxes if anyone sets data=journal.

We just encountered this again with another user.  With RHEL6 out now,
this seems likely to become a fairly common crash report.

Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?
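
One idea, as a sketch: probe at runtime on the actual data directory
before settling on an O_DIRECT-based wal_sync_method, rather than
trusting the compile-time check.  The path handling and 4 kB alignment
below are illustrative assumptions, not a worked-out patch:

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Try an aligned direct write on a scratch file; ext3/ext4 mounted
 * with data=journal refuse direct I/O, so the open or the write
 * fails (typically with EINVAL) and we can fall back gracefully. */
static int
o_direct_works(const char *probe_path)
{
    int     ok = 0;
    int     fd = open(probe_path, O_RDWR | O_CREAT | O_DIRECT, 0600);

    if (fd >= 0)
    {
        void   *buf = NULL;

        /* O_DIRECT requires block-aligned buffer, offset, and length */
        if (posix_memalign(&buf, 4096, 4096) == 0)
        {
            memset(buf, 0, 4096);
            ok = (write(fd, buf, 4096) == 4096);
            free(buf);
        }
        close(fd);
    }
    unlink(probe_path);
    return ok;
}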

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Where are we on Standby Promotion?

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 9:03 PM, Bruce Momjian br...@momjian.us wrote:
 Josh Berkus wrote:
 Fujii, Simon, Greg, etc.:

 Has anyone submitted or committed a patch to make Standby Promotion*
 easier, at this point?  We discussed it earlier in the dev cycle, but I
 can't find anything which has actually been submitted.

 * that is, promotion to be a new master of other existing standbys

 Nope, I haven't seen anything except a request that someone do testing.

I thought we agreed Josh was in charge of writing the code for this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4

2010-11-30 Thread Josh Berkus
On 11/30/10 7:09 PM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?
 
 We should wait for the outcome of the discussion about whether to change
 the default wal_sync_method before worrying about this.

Are we considering backporting that change?

If so, this would be another argument in favor of changing the default.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

