Re: [HACKERS] Much Ado About COUNT(*)
Centuries ago, Nostradamus foresaw when [EMAIL PROTECTED] (Manfred Koizar) would write:

> On Mon, 24 Jan 2005 08:28:09 -0700, Jonah H. Harris [EMAIL PROTECTED] wrote:
>> UPDATE pg_user_table_counts SET rowcount = rowcount + 1
>>  WHERE schemaname = this_schemaname AND tablename = TG_RELNAME;
>
> This might work for small single-user applications. You'll have to keep an
> eye on dead tuples in pg_user_table_counts though. But as soon as there are
> several concurrent transactions doing both INSERTs and DELETEs, your
> solution will in the best case serialise access to test_tbl, or it will
> break down because of deadlocks.

At that point, what you need to do is to break the process into three parts:

1. Instead of the UPDATE above, use:

     INSERT INTO pg_user_table_counts (rowcount, schemaname, tablename)
     VALUES (1, this_schemaname, TG_RELNAME);

   The process for DELETEs inserts the value -1, of course.

2. A process needs to run once in a while that does:

     CREATE TEMP TABLE new_counts AS
       SELECT sum(rowcount) AS rowcount, schemaname, tablename
         FROM pg_user_table_counts
        GROUP BY schemaname, tablename;
     DELETE FROM pg_user_table_counts;
     INSERT INTO pg_user_table_counts SELECT * FROM new_counts;

   This compresses the table so that the aggregate in step 3 stays cheap.

3. Querying values is done differently:

     SELECT sum(rowcount)
       FROM pg_user_table_counts
      WHERE schemaname = 'this' AND tablename = 'that';

-- 
let name=cbbrowne and tld=ntlug.org in String.concat @ [name;tld];;
http://www.ntlug.org/~cbbrowne/nonrdbms.html
Rules of the Evil Overlord #118. If I have equipment which performs an
important function, it will not be activated by a lever that someone could
trigger by accidentally falling on when fatally wounded.
http://www.eviloverlord.com/

---(end of broadcast)---
TIP 8: explain analyze is your friend
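Step 1 above can be sketched as a complete trigger function adapted from Jonah's earlier post. This is untested and illustrative only: the function name pg_user_table_count_delta is my own invention, and note that the collector table's (schemaname, tablename) primary key from the earlier post would have to be dropped, since this scheme deliberately stores many delta rows per table.

```sql
-- Sketch of the insert-only delta scheme (illustrative, untested).
-- Assumes pg_user_table_counts exists WITHOUT its primary key, so that
-- many +1/-1 rows per (schemaname, tablename) can coexist.
CREATE OR REPLACE FUNCTION pg_user_table_count_delta()
RETURNS TRIGGER AS $pg_user_table_count_delta$
DECLARE
    this_schemaname VARCHAR(64);
BEGIN
    SELECT INTO this_schemaname nspname
      FROM pg_namespace
     WHERE oid = (SELECT relnamespace FROM pg_class WHERE oid = TG_RELID);

    -- Append a delta row instead of updating a shared counter row, so
    -- concurrent transactions never contend on (or deadlock over) the
    -- same counter tuple.
    IF TG_OP = 'INSERT' THEN
        INSERT INTO pg_user_table_counts (rowcount, schemaname, tablename)
        VALUES (1, this_schemaname, TG_RELNAME);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO pg_user_table_counts (rowcount, schemaname, tablename)
        VALUES (-1, this_schemaname, TG_RELNAME);
    END IF;
    RETURN NULL;
END;
$pg_user_table_count_delta$ LANGUAGE plpgsql;
```

The compression pass in step 2 then becomes essential: without it the delta rows accumulate and the sum() in step 3 gets slower over time.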
Re: [HACKERS] Much Ado About COUNT(*)
Here's a possible solution... though I'm not sure whether you find the pg_ prefix appropriate for this context.

-- Create a Test Relation
CREATE TABLE test_tbl (
    test_id    BIGINT NOT NULL,
    test_value VARCHAR(128) NOT NULL,
    PRIMARY KEY (test_id));

-- Create COUNT Collector Relation
CREATE TABLE pg_user_table_counts (
    schemaname VARCHAR(64) NOT NULL,
    tablename  VARCHAR(64) NOT NULL,
    rowcount   BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (schemaname, tablename));

-- Populate Collector Relation
INSERT INTO pg_user_table_counts (schemaname, tablename)
    (SELECT schemaname, tablename
       FROM pg_tables
      WHERE schemaname != 'pg_catalog'
        AND schemaname != 'information_schema'
        AND tablename != 'pg_user_table_counts');

-- Create our Increment/Decrement Function
CREATE OR REPLACE FUNCTION pg_user_table_count_func ()
RETURNS TRIGGER AS $pg_user_table_count_func$
DECLARE
    this_schemaname VARCHAR(64);
BEGIN
    SELECT INTO this_schemaname nspname
      FROM pg_namespace
     WHERE oid = (SELECT relnamespace FROM pg_class WHERE oid = TG_RELID);

    -- Decrement Count
    IF (TG_OP = 'DELETE') THEN
        UPDATE pg_user_table_counts
           SET rowcount = rowcount - 1
         WHERE schemaname = this_schemaname AND tablename = TG_RELNAME;
    -- Increment Count
    ELSIF (TG_OP = 'INSERT') THEN
        UPDATE pg_user_table_counts
           SET rowcount = rowcount + 1
         WHERE schemaname = this_schemaname AND tablename = TG_RELNAME;
    END IF;

    RETURN NULL;
END;
$pg_user_table_count_func$ LANGUAGE plpgsql;

-- Create AFTER INSERT/DELETE Trigger on our Test Table
CREATE TRIGGER test_tbl_aidt AFTER INSERT OR DELETE ON test_tbl
    FOR EACH ROW EXECUTE PROCEDURE pg_user_table_count_func();

-- INSERT to Test Relation
INSERT INTO test_tbl VALUES (1, 'Demo INSERT');

-- Query Collector
demodb=# SELECT * FROM pg_user_table_counts;
 schemaname | tablename | rowcount
------------+-----------+----------
 public     | test_tbl  |        1
(1 row)

-- DELETE from Test Relation
DELETE FROM test_tbl;

-- Query Collector
demodb=# SELECT * FROM pg_user_table_counts;
 schemaname | tablename | rowcount
------------+-----------+----------
 public     | test_tbl  |        0
(1 row)
Re: [HACKERS] Much Ado About COUNT(*)
On Mon, 24 Jan 2005 08:28:09 -0700, Jonah H. Harris [EMAIL PROTECTED] wrote:
> UPDATE pg_user_table_counts SET rowcount = rowcount + 1
>  WHERE schemaname = this_schemaname AND tablename = TG_RELNAME;

This might work for small single-user applications. You'll have to keep an
eye on dead tuples in pg_user_table_counts though. But as soon as there are
several concurrent transactions doing both INSERTs and DELETEs, your
solution will in the best case serialise access to test_tbl, or it will
break down because of deadlocks.

Servus
 Manfred
Re: [HACKERS] Much Ado About COUNT(*)
Bruce Momjian wrote:
> Added to TODO based on this discussion:
> * Speed up COUNT(*)

One thing I think would help lots of people is if the documentation near
the COUNT aggregate explained some of the techniques for using triggers to
maintain a count for tables where this is important. For every one person
who reads the mailing list archives, I bet there are dozens who merely read
the product documentation and never find the workaround/solution.

Perhaps a note below the table here:
http://www.postgresql.org/docs/8.0/interactive/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE
would be a good place. If it is already somewhere in the docs, perhaps the
page I linked to should refer to the other page as well.
Re: [HACKERS] Much Ado About COUNT(*)
On Sun, Jan 23, 2005 at 12:36:10AM -0800, Ron Mayer wrote:
> Bruce Momjian wrote:
>> Added to TODO based on this discussion:
>> * Speed up COUNT(*)
>
> One thing I think would help lots of people is if the documentation near
> the COUNT aggregate explained some of the techniques for using triggers
> to maintain a count for tables where this is important. For every one
> person who reads the mailing list archives, I bet there are dozens who
> merely read the product documentation and never find the
> workaround/solution.
>
> Perhaps a note below the table here:
> http://www.postgresql.org/docs/8.0/interactive/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE
> would be a good place. If it is already somewhere in the docs, perhaps
> the page I linked to should refer to the other page as well.

Does anyone have working code they could contribute? It would be best to
give at least an example in the docs. Even better would be something on
pgFoundry that helps build a summary table and the rules/triggers you need
to maintain it.
-- 
Jim C. Nasby, Database Consultant  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
Re: [HACKERS] Much Ado About COUNT(*)
Jim C. Nasby wrote:
> Does anyone have working code they could contribute? It would be best to
> give at least an example in the docs. Even better would be something on
> pgFoundry that helps build a summary table and the rules/triggers you
> need to maintain it.

http://developer.postgresql.org/docs/postgres/plpgsql-trigger.html#PLPGSQL-TRIGGER-SUMMARY-EXAMPLE

regards

Mark
Re: [HACKERS] Much Ado About COUNT(*)
Added to TODO based on this discussion:

---

* Speed up COUNT(*)

  We could use a fixed row count and a +/- count to follow MVCC
  visibility rules, or a single cached value could be used and
  invalidated if anyone modifies the table. Another idea is to get a
  count directly from a unique index, but for this to be faster than a
  sequential scan it must avoid access to the heap to obtain tuple
  visibility information.

* Allow data to be pulled directly from indexes

  Currently indexes do not have enough tuple visibility information to
  allow data to be pulled from the index without also accessing the
  heap. One way to allow this is to set a bit on index tuples to
  indicate if a tuple is currently visible to all transactions when the
  first valid heap lookup happens. This bit would have to be cleared
  when a heap tuple is expired.

---

Tom Lane wrote:

  Bruce Momjian pgman@candle.pha.pa.us writes: Ah, right, I missed the
  connection. Hmm ... that's sort of the inverse of the "killed tuple"
  optimization we put in a release or two back, where an index tuple is
  marked as definitely dead once it's committed dead and the deletion is
  older than all active transactions. Yes, it is sort of the reverse, but
  how do you get around the delete case?

  A would-be deleter of a tuple would have to go and clear the "known
  good" bits on all the tuple's index entries before it could commit.
  This would bring the tuple back into the uncertain status condition
  where backends would have to visit the heap to find out what's up.
  Eventually the state would become certain again (either dead to
  everyone or live to everyone) and one or the other hint bit could be
  set again.

  The ugly part of this is that clearing the bit is not like setting a
  hint bit, i.e. it's not okay if we lose that change. Therefore, each
  bit-clearing would have to be WAL-logged. This is a big part of my
  concern about the cost.

  regards, tom lane

-- 
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
Re: [HACKERS] Much Ado About COUNT(*)
Tom Lane wrote:
> Tom Lane [EMAIL PROTECTED] writes:
>> Manfred Koizar [EMAIL PROTECTED] writes:
>>> Last time we discussed this, didn't we come to the conclusion that
>>> resetting status bits is not a good idea because of possible race
>>> conditions?
>> There's no race condition,
>
> Actually, wait a minute --- you have a point. Consider a tuple whose
> inserting transaction (A) has just dropped below GlobalXmin. Transaction
> B is doing an index scan, so it's going to do something like
>
>   * Visit index entry, observe that it is in uncertain state.
>   * Visit heap tuple, observe that A has committed and is below
>     GlobalXmin, and there is no deleter.
>   * Return to index entry and mark it "visible to all".
>
> Now suppose transaction C decides to delete the tuple. It will
>
>   * Insert itself as the XMAX of the heap tuple.
>   * Visit index entry, set state to uncertain if not already that way.
>
> C could do this between steps 2 and 3 of B, in which case the index entry
> ends up improperly marked "visible to all" while in fact a deletion is
> pending. Ugh. We'd need some kind of interlock to prevent this from
> happening, and it's not clear what. Might be tricky to create such an
> interlock without introducing either deadlock or a big performance
> penalty.

I am thinking we have to somehow lock the row while we set the index status
bit. We could add a new heap bit that says "my xid is going to set the
status bit", put our xid in the expired location, set the bit, then return
to the heap and clear it. Can we keep the heap and index page locked at the
same time? Anyway, it is clearly something that could be an issue.

-- 
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
Re: [HACKERS] Much Ado About COUNT(*)
-----Original Message-----
From: Jeff Davis [mailto:[EMAIL PROTECTED]]
Sent: 19 January 2005 21:33
To: Alvaro Herrera
Cc: Mark Cave-Ayland; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Much Ado About COUNT(*)

> To fill in some details, I think what he's saying is this:
>
> = create table foo(...);
> = create table foo_count(num int);
> = insert into foo_count values(0);
> = create table foo_change(num int);
>
> Then create a trigger "after delete on foo" that does
> "insert into foo_change values(-1)", and a trigger "after insert on foo"
> that inserts a +1 into foo_change. Periodically, do:
>
> = begin;
> = set transaction isolation level serializable;
> = update foo_count
>     set num = num + coalesce((select sum(num) from foo_change), 0);
> = delete from foo_change;
> = commit;
> = VACUUM;
>
> And then any time you need the correct count(*) value, do instead:
>
> = select sum(num) from
>     (select num from foo_count union all select num from foo_change) s;
>
> And that should work. I haven't tested this exact example, so I may have
> overlooked something. Hope that helps. That way, you don't have huge
> waste from the second table, and also the triggers maintain it for you
> and you don't need to think about it.
>
> Regards,
> Jeff Davis

Hi Jeff,

Thanks for the information. I seem to remember something similar to this
being discussed last year in a similar thread. My only real issue I can see
with this approach is that the trigger is fired for every row, and it is
likely that the database I am planning will have large inserts of several
hundred thousand records. Normally the impact of these is minimised by
inserting the entire set in one transaction. Is there any way that your
trigger can be modified to fire once per transaction with the number of
modified rows as a parameter?

Many thanks,

Mark.

WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
Re: [HACKERS] Much Ado About COUNT(*)
On Thu, 20 Jan 2005 10:12:17 -, Mark Cave-Ayland [EMAIL PROTECTED] wrote:
> Thanks for the information. I seem to remember something similar to this
> being discussed last year in a similar thread. My only real issue I can
> see with this approach is that the trigger is fired for every row, and it
> is likely that the database I am planning will have large inserts of
> several hundred thousand records. Normally the impact of these is
> minimised by inserting the entire set in one transaction. Is there any
> way that your trigger can be modified to fire once per transaction with
> the number of modified rows as a parameter?

I don't believe that such a facility exists, but before dismissing it you
should test it out. I think that you will find that disk buffering (the
system's as well as PostgreSQL's) will effectively handle this for you
anyway.

-- 
D'Arcy J.M. Cain darcy@druid.net  | Democracy is three wolves
http://www.druid.net/darcy/       | and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)  | what's for dinner.
Re: [HACKERS] Much Ado About COUNT(*)
D'Arcy J.M. Cain wrote:
> On Thu, 20 Jan 2005 10:12:17 -, Mark Cave-Ayland [EMAIL PROTECTED] wrote:
>> [...] Is there any way that your trigger can be modified to fire once
>> per transaction with the number of modified rows as a parameter?
>
> I don't believe that such a facility exists, but before dismissing it you
> should test it out. I think that you will find that disk buffering (the
> system's as well as PostgreSQL's) will effectively handle this for you
> anyway.

Well, it looks like ROW_COUNT isn't set in a statement-level trigger
function (GET DIAGNOSTICS myvar = ROW_COUNT). Which is a shame, otherwise
it would be easy to handle. It should be possible to expose this
information though, since it gets reported at the command conclusion.

-- 
Richard Huxton
Archonet Ltd

-- stmt_trig_test.sql --
BEGIN;

CREATE TABLE trigtest (
    a int4 NOT NULL,
    b text,
    PRIMARY KEY (a)
);

CREATE FUNCTION tt_test_fn() RETURNS TRIGGER AS '
DECLARE
    nr  integer;
    ro  integer;
    nr2 integer;
BEGIN
    GET DIAGNOSTICS nr = ROW_COUNT;
    GET DIAGNOSTICS ro = RESULT_OID;
    SELECT count(*) INTO nr2 FROM trigtest;
    RAISE NOTICE ''nr = % / ro = % / nr2 = %'', nr, ro, nr2;
    RETURN NULL;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER tt_test AFTER INSERT OR UPDATE ON trigtest
    FOR EACH STATEMENT EXECUTE PROCEDURE tt_test_fn();

INSERT INTO trigtest VALUES (1,'a');
INSERT INTO trigtest VALUES (2,'b');
UPDATE trigtest SET b = 'x';

ROLLBACK;
Re: [HACKERS] Much Ado About COUNT(*)
-----Original Message-----
From: Richard Huxton [mailto:[EMAIL PROTECTED]]
Sent: 20 January 2005 12:45
To: D'Arcy J.M. Cain
Cc: Mark Cave-Ayland; [EMAIL PROTECTED]; [EMAIL PROTECTED];
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Much Ado About COUNT(*)

> Well, it looks like ROW_COUNT isn't set in a statement-level trigger
> function (GET DIAGNOSTICS myvar = ROW_COUNT). Which is a shame, otherwise
> it would be easy to handle. It should be possible to expose this
> information though, since it gets reported at the command conclusion.

Hi Richard,

This is more the sort of approach I would be looking for. However, I think
that even in a transaction with ROW_COUNT defined, the trigger would still
be called once per insert. I think something like this would require new
syntax like the statement below, and some supporting code that would keep
track of the tables touched by a transaction :(

CREATE TRIGGER tt_test AFTER TRANSACTION ON trigtest
    FOR EACH TRANSACTION EXECUTE PROCEDURE tt_test_fn();

I am sure that Jeff's approach will work; however, it just seems like
writing out one table entry per row is going to slow large bulk inserts
right down.

Kind regards,

Mark.

WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
Re: [HACKERS] Much Ado About COUNT(*)
On Thu, Jan 20, 2005 at 01:33:10PM -, Mark Cave-Ayland wrote:
> I am sure that Jeff's approach will work; however, it just seems like
> writing out one table entry per row is going to slow large bulk inserts
> right down.

I don't see how it is any slower than the approach of inserting one entry
per row in the narrow table the OP wanted to use. And it will be faster for
the scans.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
"Officer Krupke, what are we to do? Gee, officer Krupke, Krup you!"
(West Side Story, "Gee, Officer Krupke")
Re: [HACKERS] Much Ado About COUNT(*)
Mark Cave-Ayland wrote:
> This is more the sort of approach I would be looking for. However, I
> think that even in a transaction with ROW_COUNT defined, the trigger
> would still be called once per insert. I think something like this would
> require new syntax, and some supporting code that would keep track of the
> tables touched by a transaction :(

Well, a statement-level trigger would be called once per statement, which
can be much less than per row.

-- 
Richard Huxton
Archonet Ltd
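For what it's worth, the facility Mark is asking about did eventually appear, though long after this thread: statement-level triggers with transition tables (PostgreSQL 10 and later) can see exactly how many rows a statement touched. Not applicable to the 8.0-era servers discussed here, but a minimal sketch of the idea, with illustrative names, might look like:

```sql
-- Sketch only: REFERENCING ... transition tables require PostgreSQL 10+.
-- Table and function names (foo, foo_change, foo_ins_stmt) are invented
-- for illustration.
CREATE TABLE foo (id int PRIMARY KEY);
CREATE TABLE foo_change (num int);

CREATE FUNCTION foo_ins_stmt() RETURNS trigger AS $$
BEGIN
    -- One delta row per statement, however many rows were inserted,
    -- so a bulk INSERT of 100,000 rows writes a single counter row.
    INSERT INTO foo_change SELECT count(*) FROM new_rows;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_count_ins
    AFTER INSERT ON foo
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE foo_ins_stmt();
```

A matching AFTER DELETE trigger with `REFERENCING OLD TABLE` would insert `-count(*)` the same way, addressing the bulk-insert overhead Mark is worried about.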
Re: [HACKERS] Much Ado About COUNT(*)
Martha Stewart called it a Good Thing when [EMAIL PROTECTED] (Jeff Davis) wrote:
> I almost think to not supply an MVCC system would break the I in ACID,
> would it not? I can't think of any other obvious way to isolate the
> transactions, but on the other hand, wouldn't DB2 want to be ACID
> compliant?

Wrong, wrong, wrong... MVCC allows an ACID implementation to avoid a lot of
resource locking. In the absence of MVCC, you have way more locks
outstanding, which makes it easier for there to be conflicts between lock
requests. In effect, with MVCC, you can do more things concurrently without
the system crumbling due to a surfeit of deadlocks.

-- 
cbbrowne,@,gmail.com
http://www3.sympatico.ca/cbbrowne/multiplexor.html
"Why isn't phonetic spelled the way it sounds?"
Re: [HACKERS] Much Ado About COUNT(*)
Date: Wed, 12 Jan 2005 18:45:09 -0800
From: Jeff Davis [EMAIL PROTECTED]
To: Alvaro Herrera [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org
Subject: Re: Much Ado About COUNT(*)
Message-ID: [EMAIL PROTECTED]

(cut)

> Thanks for the link. It looks like it breaks it up into chunks of about
> 2KB. I think the conversation was mostly assuming the tables were
> somewhat closer to the size of an index. If you have more than 2KB per
> tuple, pretty much anything you do with an index would be faster, I
> would think.

Hi Jeff/Alvaro,

I'm considering an application at the moment whereby I would need to do
lots of COUNT(*) on lots of separate tables without a WHERE clause. Would
something like the following help speed up the COUNT(*) by reducing the
tuple size being used for the count?

CREATE SEQUENCE id_seq;

CREATE TABLE person_count (
    id int8
);

CREATE TABLE person (
    id int8 DEFAULT nextval('id_seq'),
    first_name text,
    surname text,
    age int,
    address1 text,
    address2 text,
    address3 text,
    address4 text,
    postcode text,
    tel text
);

For each insert:

BEGIN;
INSERT INTO person (first_name, tel) VALUES ('Fred', '12345');
INSERT INTO person_count (id) VALUES (currval('id_seq'));
COMMIT;

So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
know the current number of person records. How much quicker would a
COUNT(*) be if visibility were included in the indices, as opposed to a
hacked approach like this?

Many thanks,

Mark.

WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
Re: [HACKERS] Much Ado About COUNT(*)
On Wed, Jan 19, 2005 at 14:59:17 -, Mark Cave-Ayland [EMAIL PROTECTED] wrote:
> BEGIN;
> INSERT INTO person (first_name, tel) VALUES ('Fred', '12345');
> INSERT INTO person_count (id) VALUES (currval('id_seq'));
> COMMIT;
>
> So then I would use SELECT COUNT(*) FROM person_count whenever I wanted
> to know the current number of person records. How much quicker would a
> COUNT(*) be if visibility were included in the indices, as opposed to a
> hacked approach like this?

You are only going to get a constant-factor speedup unless the space
savings allows much better use of the cache. You probably want to look at
using triggers to maintain counts in another table.
Re: [HACKERS] Much Ado About COUNT(*)
On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote:
> On Wed, Jan 19, 2005 at 14:59:17 -, Mark Cave-Ayland wrote:
>> So then I would use SELECT COUNT(*) FROM person_count whenever I wanted
>> to know the current number of person records. How much quicker would a
>> COUNT(*) be if visibility were included in the indices, as opposed to a
>> hacked approach like this?
>
> You are only going to get a constant-factor speedup unless the space
> savings allows much better use of the cache. You probably want to look at
> using triggers to maintain counts in another table.

I'd try using a start value and a differences list. The differences list
would be initially empty and the start value would be 0. On insert or
delete, you create a new difference record (with +1 or whatever).
Periodically, the differences would be added to the start value and the
difference records deleted. Thus the time to calculate the total count is
much smaller, and it follows MVCC rules.

Of course there are lots of minor details not mentioned here. I'm not sure
if I'd model this with a single table or two.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
"I would rather have GNU than GNOT." (ccchips, lwn.net/Articles/37595/)
Re: [HACKERS] Much Ado About COUNT(*)
To fill in some details, I think what he's saying is this:

= create table foo(...);
= create table foo_count(num int);
= insert into foo_count values(0);
= create table foo_change(num int);

Then create a trigger "after delete on foo" that does
"insert into foo_change values(-1)", and a trigger "after insert on foo"
that inserts a +1 into foo_change. Periodically, do:

= begin;
= set transaction isolation level serializable;
= update foo_count
    set num = num + coalesce((select sum(num) from foo_change), 0);
= delete from foo_change;
= commit;
= VACUUM;

And then any time you need the correct count(*) value, do instead:

= select sum(num) from
    (select num from foo_count union all select num from foo_change) s;

And that should work. I haven't tested this exact example, so I may have
overlooked something. Hope that helps. That way, you don't have huge waste
from the second table, and also the triggers maintain it for you and you
don't need to think about it.

Regards,
Jeff Davis

On Wed, 2005-01-19 at 17:40 -0300, Alvaro Herrera wrote:
> I'd try using a start value and a differences list. The differences list
> would be initially empty and the start value would be 0. On insert or
> delete, you create a new difference record (with +1 or whatever).
> Periodically, the differences would be added to the start value and the
> difference records deleted. Thus the time to calculate the total count is
> much smaller, and it follows MVCC rules.
>
> Of course there are lots of minor details not mentioned here. I'm not
> sure if I'd model this with a single table or two.
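The recipe above can be put together as one runnable script. This is a sketch only: the trigger function name foo_count_trig is my own invention, the coalesce() guards against an empty foo_change, and UNION ALL (rather than UNION) is needed so that duplicate delta values aren't collapsed.

```sql
-- Sketch of the foo/foo_count/foo_change scheme from the thread
-- (illustrative; test before relying on it).
CREATE TABLE foo (id int PRIMARY KEY);
CREATE TABLE foo_count (num bigint);   -- single row: the rolled-up base count
INSERT INTO foo_count VALUES (0);
CREATE TABLE foo_change (num int);     -- one +1/-1 delta row per insert/delete

CREATE OR REPLACE FUNCTION foo_count_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO foo_change VALUES (1);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO foo_change VALUES (-1);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_count_maint AFTER INSERT OR DELETE ON foo
    FOR EACH ROW EXECUTE PROCEDURE foo_count_trig();

-- Periodic rollup: fold the deltas into the base count, then reclaim space.
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE foo_count
   SET num = num + coalesce((SELECT sum(num) FROM foo_change), 0);
DELETE FROM foo_change;
COMMIT;
VACUUM foo_change;

-- Fast count: base count plus any deltas not yet rolled up.
SELECT sum(num) FROM
    (SELECT num FROM foo_count
     UNION ALL
     SELECT num FROM foo_change) AS s;
```

Because readers only ever sum committed rows, the result follows MVCC visibility rules without serialising writers against each other, which is the whole point of the delta table.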
Re: [HACKERS] Much Ado About COUNT(*)
>>>>> "Jonah" == Jonah H Harris [EMAIL PROTECTED] writes:

  Jonah> Replying to the list as a whole: If this is such a bad idea, why
  Jonah> do other database systems use it? As a businessperson myself, it
  Jonah> doesn't seem logical to me that commercial database companies
  Jonah> would spend money on implementing this feature if it wouldn't be
  Jonah> used. Remember guys, I'm just trying to help.

Systems like DB2 don't implement versioning schemes. As a result there is
no need to worry about maintaining visibility in indexes. Index-only plans
are thus viable, as they require no change in the physical structure of the
index and no overhead on update/delete/insert ops.

I don't know about Oracle, which I gather is the only commercial system to
have something like MVCC.

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
Re: [HACKERS] Much Ado About COUNT(*)
On Tue, 2005-01-18 at 12:45 -0800, Sailesh Krishnamurthy wrote:
> Systems like DB2 don't implement versioning schemes. As a result there is
> no need to worry about maintaining visibility in indexes. Index-only
> plans are thus viable, as they require no change in the physical
> structure of the index and no overhead on update/delete/insert ops.
>
> I don't know about Oracle, which I gather is the only commercial system
> to have something like MVCC.

Perhaps Firebird/InterBase also? Someone mentioned that on these lists; I'm
not sure if it's true or not.

I almost think that not supplying an MVCC system would break the "I" in
ACID, would it not? I can't think of any other obvious way to isolate the
transactions, but on the other hand, wouldn't DB2 want to be ACID
compliant?

Regards,
Jeff Davis
Re: [HACKERS] Much Ado About COUNT(*)
Jeff Davis [EMAIL PROTECTED] writes:
> I almost think that not supplying an MVCC system would break the "I" in
> ACID, would it not?

Certainly not; ACID was a recognized goal long before anyone thought of
MVCC. You do need much more locking to make it work without MVCC, though
--- for instance, a reader that is interested in a just-modified row has to
block until the writer completes or rolls back.

People who hang around Postgres too long tend to think that MVCC is the
obviously correct way to do things, but much of the rest of the world
thinks differently ;-)

regards, tom lane
Re: [HACKERS] Much Ado About COUNT(*)
Tom == Tom Lane [EMAIL PROTECTED] writes: Tom People who hang around Postgres too long tend to think that Tom MVCC is the obviously correct way to do things, but much of Tom the rest of the world thinks differently ;-) It works the other way too ... people who come from the locking world find it difficult to wrap their heads around MVCC. A big part of this is because Gray's original paper on transaction isolation defines the different levels based on what kind of lock acquisitions they involve. A very nice alternative approach to defining transaction isolation is "Generalized Isolation Level Definitions" by Adya, Liskov and O'Neil, which appears in ICDE 2000. -- Pip-pip Sailesh http://www.cs.berkeley.edu/~sailesh ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Much Ado About COUNT(*)
Certainly not; ACID was a recognized goal long before anyone thought of MVCC. You do need much more locking to make it work without MVCC, though --- for instance, a reader that is interested in a just-modified row has to block until the writer completes or rolls back. People who hang around Postgres too long tend to think that MVCC is the obviously correct way to do things, but much of the rest of the world thinks differently ;-) Well, that would explain why everyone is so happy with PostgreSQL's concurrent access performance. Thanks for the information, although I'm not sure I wanted to be reminded about complicated locking issues (I suppose I must have known that at one time, but perhaps I suppressed it ;-) Regards, Jeff Davis ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Much Ado About COUNT(*)
On Thu, 13 Jan 2005 00:39:56 -0500, Tom Lane [EMAIL PROTECTED] wrote: A would-be deleter of a tuple would have to go and clear the known good bits on all the tuple's index entries before it could commit. This would bring the tuple back into the uncertain status condition where backends would have to visit the heap to find out what's up. Eventually the state would become certain again (either dead to everyone or live to everyone) and one or the other hint bit could be set again. Last time we discussed this, didn't we come to the conclusion that resetting status bits is not a good idea because of possible race conditions? In a previous post you wrote: | I think we still have one free bit in index tuple headers... AFAICS we'd need two new bits: visible to all and maybe dead. Writing the three status bits in the order visible to all, maybe dead, known dead, a normal index tuple lifecycle would be 000 -> 100 -> 110 -> 111. In states 000 and 110 the heap tuple has to be read to determine visibility. The transitions 000 -> 100 and 110 -> 111 happen as side effects of index scans. 100 -> 110 has to be done by the deleting transaction. This is the operation where the additional run time cost lies. One weakness of this approach is that once the index tuple state is 110 but the deleting transaction is aborted there is no easy way to reset the maybe dead bit. So we'd have to consult the heap for the rest of the tuple's lifetime. Servus Manfred ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
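Manfred's three-bit lifecycle can be modeled as a tiny state machine. The Python below is purely illustrative (the bit names and transitions come from the post above; nothing here is PostgreSQL code):

```python
# Toy model of the proposed index-tuple status bits, written as a
# string "visible_to_all / maybe_dead / known_dead", e.g. "110".
# States 000 and 110 are "uncertain": the heap must be consulted.

ALLOWED = {
    "000": {"100"},  # index scan proves the tuple visible to all
    "100": {"110"},  # deleting transaction sets "maybe dead"
    "110": {"111"},  # index scan proves the tuple dead to everyone
    "111": set(),    # terminal: known dead
}

NEEDS_HEAP_VISIT = {"000", "110"}

def transition(state, new_state):
    if new_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

# Normal lifecycle: 000 -> 100 -> 110 -> 111
s = "000"
for nxt in ("100", "110", "111"):
    s = transition(s, nxt)
assert s == "111"

# The weakness noted above: after an aborted DELETE the tuple is stuck
# in 110, an uncertain state, with no legal way back to 100.
assert "100" not in ALLOWED["110"] and "110" in NEEDS_HEAP_VISIT
```

The model makes the cost argument concrete: only the 100 -> 110 transition is extra work for writers; the other transitions piggyback on index scans that were happening anyway.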
Re: [HACKERS] Much Ado About COUNT(*)
Manfred Koizar [EMAIL PROTECTED] writes: On Thu, 13 Jan 2005 00:39:56 -0500, Tom Lane [EMAIL PROTECTED] wrote: A would-be deleter of a tuple would have to go and clear the known good bits on all the tuple's index entries before it could commit. This would bring the tuple back into the uncertain status condition where backends would have to visit the heap to find out what's up. Eventually the state would become certain again (either dead to everyone or live to everyone) and one or the other hint bit could be set again. Last time we discussed this, didn't we come to the conclusion that resetting status bits is not a good idea because of possible race conditions? There's no race condition, since resetting the hint only forces other transactions to go back to the original data (the tuple header) to decide what to do. AFAICS the above is safe; I'm just pretty dubious about the cost. AFAICS we'd need two new bits: visible to all and maybe dead. No, you've got this wrong. The three possible states are known visible to all, known dead to all, and uncertain. If you see uncertain this means you have to go to the heap and compare the XIDs in the tuple header to your snapshot to decide if you can see the row or not. The index states are not the same as the known committed good or known committed dead hint bits in the tuple header --- those can be set as soon as the inserting/deleting transaction's outcome is known, but we can't move the index entry into the visible to all or dead to all states until that outcome is beyond the GlobalXmin event horizon. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Much Ado About COUNT(*)
Tom Lane [EMAIL PROTECTED] writes: Manfred Koizar [EMAIL PROTECTED] writes: Last time we discussed this, didn't we come to the conclusion that resetting status bits is not a good idea because of possible race conditions? There's no race condition, Actually, wait a minute --- you have a point. Consider a tuple whose inserting transaction (A) has just dropped below GlobalXmin. Transaction B is doing an index scan, so it's going to do something like * Visit index entry, observe that it is in the uncertain state. * Visit heap tuple, observe that A has committed and is < GlobalXmin, and there is no deleter. * Return to index entry and mark it visible to all. Now suppose transaction C decides to delete the tuple. It will * Insert itself as the XMAX of the heap tuple. * Visit index entry, set state to uncertain if not already that way. C could do this between steps 2 and 3 of B, in which case the index entry ends up improperly marked visible to all while in fact a deletion is pending. Ugh. We'd need some kind of interlock to prevent this from happening, and it's not clear what. Might be tricky to create such an interlock without introducing either deadlock or a big performance penalty. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
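The interleaving Tom describes can be replayed step by step. This toy Python sketch (hypothetical names, no real PostgreSQL structures) only makes the ordering explicit:

```python
# Sequential replay of the race: B is an index scan, C is a deleter.
# C's two actions land between B's heap check and B's index update.

index_state = "uncertain"
heap = {"inserter_below_globalxmin": True, "xmax": None}

# B step 1: visit the index entry, see "uncertain".
b_saw = index_state

# B step 2: visit the heap; inserter committed and below GlobalXmin,
# no deleter yet, so B decides the tuple is visible to all.
b_decision = ("visible_to_all"
              if heap["inserter_below_globalxmin"] and heap["xmax"] is None
              else "uncertain")

# C runs between B's steps 2 and 3:
heap["xmax"] = "C"         # C inserts itself as XMAX of the heap tuple
index_state = "uncertain"  # C sets the entry uncertain (it already was)

# B step 3: act on its now-stale decision.
index_state = b_decision

# Result: the index entry claims "visible to all" while a delete is
# pending in the heap -- exactly the inconsistency described above.
assert index_state == "visible_to_all" and heap["xmax"] == "C"
```

Without an interlock, nothing forces B to re-check the heap between steps 2 and 3, which is why the hint cannot be set safely by readers alone.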
Re: [HACKERS] Much Ado About COUNT(*)
On Sun, 16 Jan 2005 20:49:45 +0100, Manfred Koizar wrote: On Thu, 13 Jan 2005 00:39:56 -0500, Tom Lane wrote: A would-be deleter of a tuple would have to go and clear the known good bits on all the tuple's index entries before it could commit. This would bring the tuple back into the uncertain status condition where backends would have to visit the heap to find out what's up. Eventually the state would become certain again (either dead to everyone or live to everyone) and one or the other hint bit could be set again. Last time we discussed this, didn't we come to the conclusion that resetting status bits is not a good idea because of possible race conditions? http://archives.postgresql.org/pgsql-performance/2004-05/msg4.php In a previous post you wrote: | I think we still have one free bit in index tuple headers... AFAICS we'd need two new bits: visible to all and maybe dead. Writing the three status bits in the order visible to all, maybe dead, known dead, a normal index tuple lifecycle would be 000 -> 100 -> 110 -> 111. In states 000 and 110 the heap tuple has to be read to determine visibility. The transitions 000 -> 100 and 110 -> 111 happen as side effects of index scans. 100 -> 110 has to be done by the deleting transaction. This is the operation where the additional run time cost lies. One weakness of this approach is that once the index tuple state is 110 but the deleting transaction is aborted there is no easy way to reset the maybe dead bit. So we'd have to consult the heap for the rest of the tuple's lifetime. How bad is that really with a typical workload? Jochem ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Much Ado About COUNT(*)
On Sun, Jan 16, 2005 at 03:22:11PM -0500, Tom Lane wrote: Tom Lane [EMAIL PROTECTED] writes: Manfred Koizar [EMAIL PROTECTED] writes: Last time we discussed this, didn't we come to the conclusion that resetting status bits is not a good idea because of possible race conditions? There's no race condition, Actually, wait a minute --- you have a point. Consider a tuple whose inserting transaction (A) has just dropped below GlobalXmin. Transaction B is doing an index scan, so it's going to do something like * Visit index entry, observe that it is in the uncertain state. * Visit heap tuple, observe that A has committed and is < GlobalXmin, and there is no deleter. * Return to index entry and mark it visible to all. Now suppose transaction C decides to delete the tuple. It will * Insert itself as the XMAX of the heap tuple. * Visit index entry, set state to uncertain if not already that way. C could do this between steps 2 and 3 of B, in which case the index entry ends up improperly marked visible to all while in fact a deletion is pending. Ugh. We'd need some kind of interlock to prevent this from happening, and it's not clear what. Might be tricky to create such an interlock without introducing either deadlock or a big performance penalty. Wouldn't the original proposal that had a state machine handle this? IIRC the original idea was: new tuple -> known good -> possibly dead -> known dead In this case, you would have to visit the heap page when an entry is in the 'new tuple' or 'possibly dead' states. When it comes to transitions, you would enforce the transitions as shown, which would eliminate the race condition you thought of. Err, I guess maybe you have to allow going from possibly dead back to known good? But I think that would be the only 'backwards' transition. In the event of a rollback on an insert I think you'd want to go directly from new tuple to known dead, as well. -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer some brain candy! 
www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what? ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Much Ado About COUNT(*)
Jim C. Nasby [EMAIL PROTECTED] writes: Wouldn't the original proposal that had a state machine handle this? IIRC the original idea was: new tuple -> known good -> possibly dead -> known dead Only if you disallow the transition from possibly dead back to known good, which strikes me as a rather large disadvantage. Failed UPDATEs aren't so uncommon that it's okay to have one permanently disable the optimization. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Much Ado About COUNT(*)
On Sun, 16 Jan 2005 20:01:36 -0500, Tom Lane wrote: Jim C. Nasby writes: Wouldn't the original proposal that had a state machine handle this? IIRC the original idea was: new tuple -> known good -> possibly dead -> known dead Only if you disallow the transition from possibly dead back to known good, which strikes me as a rather large disadvantage. Failed UPDATEs aren't so uncommon that it's okay to have one permanently disable the optimization. But how about allowing the transition from possibly dead to new tuple? What if a failed update restores the tuple to the new tuple state, and only after that it can be promoted to known good state? Jochem ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Much Ado About COUNT(*)
On Sun, Jan 16, 2005 at 08:01:36PM -0500, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: Wouldn't the original proposal that had a state machine handle this? IIRC the original idea was: new tuple -> known good -> possibly dead -> known dead Only if you disallow the transition from possibly dead back to known good, which strikes me as a rather large disadvantage. Failed UPDATEs aren't so uncommon that it's okay to have one permanently disable the optimization. Actually, I guess I wasn't understanding the problem to begin with. You'd never go from new tuple to known good while the transaction that created the tuple was in-flight, right? If that's the case, I'm not sure where there's a race condition. You can't delete a tuple that hasn't been committed, right? -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what? ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Much Ado About COUNT(*)
On Sun, Jan 16, 2005 at 07:28:07PM -0600, Jim C. Nasby wrote: On Sun, Jan 16, 2005 at 08:01:36PM -0500, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: Wouldn't the original proposal that had a state machine handle this? IIRC the original idea was: new tuple -> known good -> possibly dead -> known dead Only if you disallow the transition from possibly dead back to known good, which strikes me as a rather large disadvantage. Failed UPDATEs aren't so uncommon that it's okay to have one permanently disable the optimization. Actually, I guess I wasn't understanding the problem to begin with. You'd never go from new tuple to known good while the transaction that created the tuple was in-flight, right? If that's the case, I'm not sure where there's a race condition. You can't delete a tuple that hasn't been committed, right? Er, nevermind, I thought about it and realized the issue. What changes when a delete is done on a tuple? It seems that's the key... if a tuple has been marked as possibly being expired/deleted, don't allow it to go into known_good unless you can verify that the transaction that marked it as potentially deleted was rolled back. -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what? ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Much Ado About COUNT(*)
Jim C. Nasby [EMAIL PROTECTED] writes: Actually, I guess I wasn't understanding the problem to begin with. You'd never go from new tuple to known good while the transaction that created the tuple was in-flight, right? By definition, not. If that's the case, I'm not sure where there's a race condition. You can't delete a tuple that hasn't been committed, right? The originating transaction could itself delete the tuple, but no one else could see it yet to do that. This means that you'd have to allow a transition directly from new tuple to possibly dead. (In the absence of subtransactions this could be optimized into a transition directly to known dead, but now that we have subtransactions I don't think we can do that.) However, the race condition comes in when someone wants to delete the row at about the same time as someone else is trying to mark it known good, ie, sometime *after* the originating transaction committed. This is definitely a possible situation. regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
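Putting Tom's points together, the four-state proposal's transition table might look like the following hypothetical sketch (state names are from the thread; the transition set is one reading of the discussion, not a settled design):

```python
# Hypothetical transition table for the four-state proposal, including
# Tom's two points: "new tuple" may go straight to "possibly dead"
# (the originating transaction deletes its own tuple), and
# "possibly dead" must be allowed back to "known good" (failed
# UPDATE/DELETE) -- which is exactly where the race condition lives.
TRANSITIONS = {
    "new tuple":     {"known good", "possibly dead"},
    "known good":    {"possibly dead"},
    "possibly dead": {"known good", "known dead"},
    "known dead":    set(),
}

def reachable(start, goal):
    # Simple graph search over the transition table.
    seen, frontier = {start}, [start]
    while frontier:
        state = frontier.pop()
        for nxt in TRANSITIONS[state]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return goal in seen

assert reachable("new tuple", "known dead")
assert "possibly dead" in TRANSITIONS["new tuple"]    # self-delete in flight
assert "known good" in TRANSITIONS["possibly dead"]   # failed update recovers
```

The problematic edge is "possibly dead" -> "known good": allowing it reintroduces the marker race, while forbidding it lets a single failed UPDATE permanently disable the optimization for that tuple.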
Re: [HACKERS] Much Ado About COUNT(*)
The fundamental problem is that you can't do it without adding at least 16 bytes, probably 20, to the size of an index tuple header. That would double the physical size of an index on a simple column (eg an integer or timestamp). The extra I/O costs and extra maintenance costs are unattractive to say the least. And it takes away some of the justification for the whole thing, which is that reading an index is much cheaper than reading the main table. That's only true if the index is much smaller than the main table ... Well, the trick would be to have it specified per-index, then it's up to the user whether it's faster or not... ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Much Ado About COUNT(*)
Tom Lane wrote: Bruce Momjian pgman@candle.pha.pa.us writes: Ah, right, I missed the connection. Hmm ... that's sort of the inverse of the killed tuple optimization we put in a release or two back, where an index tuple is marked as definitely dead once it's committed dead and the deletion is older than all active transactions. Yes, it is sort of the reverse, but how do you get around the delete case? A would-be deleter of a tuple would have to go and clear the known good bits on all the tuple's index entries before it could commit. This would bring the tuple back into the uncertain status condition where backends would have to visit the heap to find out what's up. Eventually the state would become certain again (either dead to everyone or live to everyone) and one or the other hint bit could be set again. Right. Do you think the extra overhead of clearing those bits on update/delete would be worth it? The ugly part of this is that clearing the bit is not like setting a hint bit, ie it's not okay if we lose that change. Therefore, each bit-clearing would have to be WAL-logged. This is a big part of my concern about the cost. Yep, that was my concern too. My feeling is that once you mark the tuple for expiration (update/delete), you then clear the index bit. When reading WAL on recovery, you have to clear index bits on rows as you read expire information from WAL. I don't think it would require extra WAL information. The net effect of this idea is that indexes have to look up heap row status information only for rows that have been recently created/expired. If we did have those bits, it would allow us to read data directly from the index, which is something we don't do now. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Much Ado About COUNT(*)
Bruce Momjian pgman@candle.pha.pa.us writes: Tom Lane wrote: The ugly part of this is that clearing the bit is not like setting a hint bit, ie it's not okay if we lose that change. Therefore, each bit-clearing would have to be WAL-logged. This is a big part of my concern about the cost. Yep, that was my concern too. My feeling is that once you mark the tuple for expiration (update/delete), you then clear the index bit. When reading WAL on recovery, you have to clear index bits on rows as you read expire information from WAL. I don't think it would require extra WAL information. Wrong. The WAL recovery environment is not capable of executing arbitrary user-defined functions, therefore it cannot compute index entries on its own. The *only* way we can do this is if the WAL record stream tells exactly what to do and which physical tuple to do it to. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Much Ado About COUNT(*)
On Thu, 13 Jan 2005 10:29:16 -0500 Tom Lane [EMAIL PROTECTED] wrote: Wrong. The WAL recovery environment is not capable of executing arbitrary user-defined functions, therefore it cannot compute index entries on its own. The *only* way we can do this is if the WAL record stream tells exactly what to do and which physical tuple to do it to. I'm not sure why everyone wants to push this into the database anyway. If I need to know the count of something, I am probably in a better position to decide what and how than the database can ever do. For example, I recently had to track balances for certificates in a database with 25M certificates with multiple transactions on each. In this case it is a SUM() instead of a count but the idea is the same. We switched from the deprecated money type to numeric and the calculations started taking too long for our purposes. We created a new table to track balances and created rules to keep it updated. All the complexity and extra work is limited to changes to that one table and does exactly what we need it to do. It even deals with transactions that get cancelled but remain in the table. If you need the count of entire tables, a simple rule on insert and delete can manage that for you. A slightly more complicated set of rules can keep counts based on the value of some field, just like we did for the certificate ID in the transactions. Getting the database to magically track this based on arbitrary business rules is guaranteed to be complex and still not handle everyone's requirements. -- D'Arcy J.M. Cain darcy@druid.net | Democracy is three wolves http://www.druid.net/darcy/| and a sheep voting on +1 416 425 1212 (DoD#0082)(eNTP) | what's for dinner. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
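The insert/delete bookkeeping D'Arcy describes can be demonstrated end to end. The sketch below uses Python's sqlite3 purely as a self-contained, runnable stand-in; in PostgreSQL the same role would be played by rules or plpgsql triggers like the ones earlier in this thread (table and trigger names here are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE t_count (n INTEGER NOT NULL);
    INSERT INTO t_count VALUES (0);

    -- Maintain the count on every insert/delete so reading it is a
    -- single-row lookup instead of a full scan.
    CREATE TRIGGER t_ins AFTER INSERT ON t
    BEGIN UPDATE t_count SET n = n + 1; END;

    CREATE TRIGGER t_del AFTER DELETE ON t
    BEGIN UPDATE t_count SET n = n - 1; END;
""")

cur.executemany("INSERT INTO t (val) VALUES (?)", [("a",), ("b",), ("c",)])
cur.execute("DELETE FROM t WHERE val = 'b'")
conn.commit()

# The maintained counter agrees with a real COUNT(*).
assert cur.execute("SELECT n FROM t_count").fetchone()[0] == 2
assert cur.execute("SELECT COUNT(*) FROM t").fetchone()[0] == 2
```

As the thread's earlier posts note, a single-row counter like this serializes concurrent writers; the insert-a-delta-row variant plus periodic compression trades that contention for a SUM() at read time.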
Re: [HACKERS] Much Ado About COUNT(*)
D'Arcy J.M. Cain wrote: I'm not sure why everyone wants to push this into the database anyway. If I need to know the count of something, I am probably in a better position to decide what and how than the database can ever do. For example, I recently had to track balances for certificates in a database with 25M certificates with multiple transactions on each. In this case it is a SUM() instead of a count but the idea is the same. We switched from the deprecated money type to numeric and the calculations started taking too long for our purposes. We created a new table to track balances and created rules to keep it updated. All the complexity and extra work is limited to changes to that one table and does exactly what we need it to do. It even deals with transactions that get cancelled but remain in the table. If you need the count of entire tables, a simple rule on insert and delete can manage that for you. A slightly more complicated set of rules can keep counts based on the value of some field, just like we did for the certificate ID in the transactions. Getting the database to magically track this based on arbitrary business rules is guaranteed to be complex and still not handle everyone's requirements. This discussion is not solely related to COUNT, but advanced usage of the indexes in general. Did everyone get to read the info on Oracle's fast full index scan? It performs sequential I/O on the indexes, pulling all of the index blocks into memory to reduce random I/O to speed up the index scan. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Much Ado About COUNT(*)
Jonah H. Harris [EMAIL PROTECTED] writes: Tom, Bruce, and others involved in this recurring TODO discussion First, let me start by saying that I understand this has been discussed many times before; however, I'd like to see what the current state of affairs is regarding the possibility of using a unique index scan to speed up the COUNT aggregate. It's not happening, because no one has come up with a workable proposal. In particular, we're not willing to slow down every other operation in order to make COUNT-*-with-no-WHERE-clause faster. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Much Ado About COUNT(*)
Tom, Bruce, and others involved in this recurring TODO discussion... First, let me start by saying that I understand this has been discussed many times before; however, I'd like to see what the current state of affairs is regarding the possibility of using a unique index scan to speed up the COUNT aggregate. To sum up: 1. There are good technical reasons why not to do this. The pg aggregate system is very elegant...not worth compromising it for a specific case. 2. postgresql can do many things faster than oracle. If you prefer the way oracle behaves, use oracle. 3. workaround #1: just run analyze once in a while (you should do that anyways) and query pg_class for the #tuples in a relation. 4. workaround #2: rig up a materialized view and query that. This will be faster than what oracle does, btw, at the price of some coherency. 5. understand that count(*) from t, although frequently used, is of dubious value in the general sense. Sooner or later someone will optimize this, but in light of the currently available workarounds it doesn't seem that important. 6. for large tables, you can get a pretty accurate count by doing: select count(*) * 10 from t where random() > .9; on my setup, this shaved about 15% off of the counting time...YMMV. Merlin ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
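Merlin's workaround #6 trades accuracy for speed; the arithmetic behind it can be sketched in Python (a toy simulation of the sampling, not the SQL itself):

```python
import random

random.seed(42)
N = 100_000  # pretend table size

# Keep roughly 10% of the rows (random() > .9) and scale the count
# back up by 10, mirroring: SELECT count(*) * 10 FROM t WHERE random() > .9
sampled = sum(1 for _ in range(N) if random.random() > 0.9)
estimate = sampled * 10

# The estimate is close but not exact: that's the accuracy trade-off.
assert abs(estimate - N) / N < 0.05
```

Note this still scans every row; the 15% saving Merlin reports comes from counting fewer tuples in the aggregate, not from skipping I/O, so the gain is modest and workload-dependent.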
Re: [HACKERS] Much Ado About COUNT(*)
Tom, Thank you for your prompt response and I understand your statement completely. My thinking is that we may be able to implement index usage for not only unqualified counts, but also on any query that can be satisfied by the index itself. Index usage seems to be a feature that could speed up PostgreSQL for many people. I'm working on a project right now that could actually take advantage of it. Looking at the message boards, there is significant interest in the COUNT(*) aspect. However, rather than solely address the COUNT(*) TODO item, why not fix it and add additional functionality found in commercial databases as well? I believe Oracle has had this feature since 7.3 and I know people take advantage of it. I understand that you guys have a lot more important stuff to do than work on something like this. Unlike other people posting the request and whining about the speed, I'm offering to take it on and fix it. Take this message as my willingness to propose and implement this feature. Any details, pitfalls, or suggestions are appreciated. Thanks again! -Jonah Tom Lane wrote: Jonah H. Harris [EMAIL PROTECTED] writes: Tom, Bruce, and others involved in this recurring TODO discussion First, let me start by saying that I understand this has been discussed many times before; however, I'd like to see what the current state of affairs is regarding the possibility of using a unique index scan to speed up the COUNT aggregate. It's not happening, because no one has come up with a workable proposal. In particular, we're not willing to slow down every other operation in order to make COUNT-*-with-no-WHERE-clause faster. regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Much Ado About COUNT(*)
Jonah H. Harris [EMAIL PROTECTED] writes: Looking at the message boards, there is significant interest in the COUNT(*) aspect. However, rather than solely address the COUNT(*) TODO item, why not fix it and add additional functionality found in commercial databases as well? I believe Oracle has had this feature since 7.3 and I know people take advantage of it. I think part of the problem is that there's a bunch of features related to these types of queries and the lines between them blur. You seem to be talking about putting visibility information inside indexes so that index-only plans can be performed. But you're also talking about queries like select count(*) from foo with no where clauses. Such a query wouldn't be helped by index-only scans. Perhaps you're thinking about caching the total number of records in a global piece of state like a materialized view? That would be a nice feature but I think it should done as a general materialized view implementation, not a special case solution for just this one query. Perhaps you're thinking of the min/max problem of being able to use indexes to pick out just the tuples satisfying the min/max constraint. That seems to me to be one of the more tractable problems in this area but it would still require lots of work. I suggest you post a specific query you find is slow. Then discuss how you think it ought to be executed and why. -- greg ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Much Ado About COUNT(*)
Greg Stark wrote: I think part of the problem is that there's a bunch of features related to these types of queries and the lines between them blur. You seem to be talking about putting visibility information inside indexes so that index-only plans can be performed. But you're also talking about queries like select count(*) from foo with no where clauses. Such a query wouldn't be helped by index-only scans. Perhaps you're thinking about caching the total number of records in a global piece of state like a materialized view? That would be a nice feature but I think it should done as a general materialized view implementation, not a special case solution for just this one query. Perhaps you're thinking of the min/max problem of being able to use indexes to pick out just the tuples satisfying the min/max constraint. That seems to me to be one of the more tractable problems in this area but it would still require lots of work. I suggest you post a specific query you find is slow. Then discuss how you think it ought to be executed and why. You are correct, I am proposing to add visibility to the indexes. As for unqualified counts, I believe that they could take advantage of an index-only scan as it requires much less I/O to perform an index scan than a sequential scan on large tables. Min/Max would also take advantage of index only scans but say, for example, that someone has the following: Relation SOME_USERS user_id BIGINT PK user_nm varchar(32) UNIQUE INDEX some_other_attributes... If an application needs the user names, it would run SELECT user_nm FROM SOME_USERS... in the current implementation this would require a sequential scan. On a relation which contains 1M+ tuples, this requires either a lot of I/O or a lot of cache. An index scan would immensely speed up this query. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Much Ado About COUNT(*)
Jonah H. Harris [EMAIL PROTECTED] writes: My thinking is that we may be able to implement index usage for not only unqualified counts, but also on any query that can be satisfied by the index itself. The fundamental problem is that you can't do it without adding at least 16 bytes, probably 20, to the size of an index tuple header. That would double the physical size of an index on a simple column (eg an integer or timestamp). The extra I/O costs and extra maintenance costs are unattractive to say the least. And it takes away some of the justification for the whole thing, which is that reading an index is much cheaper than reading the main table. That's only true if the index is much smaller than the main table ... regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Much Ado About COUNT(*)
Greg Stark wrote: I think part of the problem is that there's a bunch of features related to these types of queries and the lines between them blur. You seem to be talking about putting visibility information inside indexes so that index-only plans can be performed. But you're also talking about queries like select count(*) from foo with no where clauses. Such a query wouldn't be helped by index-only scans. Perhaps you're thinking about caching the total number of records in a global piece of state like a materialized view? That would be a nice feature but I think it should done as a general materialized view implementation, not a special case solution for just this one query. Perhaps you're thinking of the min/max problem of being able to use indexes to pick out just the tuples satisfying the min/max constraint. That seems to me to be one of the more tractable problems in this area but it would still require lots of work. I suggest you post a specific query you find is slow. Then discuss how you think it ought to be executed and why. You are correct, I am proposing to add visibility to the indexes. As for unqualified counts, I believe that they could take advantage of an index-only scan as it requires much less I/O to perform an index scan than a sequential scan on large tables. I agree with Greg...I think the way to approach this is a general materialized view solution. This would solve a broad class of tricky problems including count() and count(*)...you get to choose between the pay now/pay later tradeoff, etc. Merlin ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Much Ado About COUNT(*)
Tom Lane wrote: The fundamental problem is that you can't do it without adding at least 16 bytes, probably 20, to the size of an index tuple header. That would double the physical size of an index on a simple column (eg an integer or timestamp). The extra I/O costs and extra maintenance costs are unattractive to say the least. And it takes away some of the justification for the whole thing, which is that reading an index is much cheaper than reading the main table. That's only true if the index is much smaller than the main table ... regards, tom lane I recognize the added cost of implementing index-only scans. As storage is relatively cheap these days, everyone I know is more concerned about faster access to data. Similarly, it would still be faster to scan the indexes than to perform a sequential scan over the entire relation for this case. I also acknowledge that it would negatively impact indexes where this type of access isn't required, as you suggested, which is more than likely not the case. I just wonder what more people would be happier with, and whether the added 16-20 bytes would be extremely noticeable considering most 1-3 year old hardware.
Re: [HACKERS] Much Ado About COUNT(*)
A notion for indices that are not unique... (it won't help much on select count(*), but it might be helpful for other types of query optimization): put a count in the index for each distinct key value. In the worst case, the index is actually unique, you have 8 wasted bytes per index entry, and all the entries are in the leaves (perhaps it could be an OPTION for some tables). I don't know enough about the structure of PostgreSQL's indexes to know if my suggestion is pure hogwash, so don't laugh too hard if it is pure stupidity. The most practical value of SELECT COUNT(*) is for updating statistics (and looking good in phony-baloney benchmarks). But the statistics only need to be updated when you vacuum, so it hardly seems a crucial issue to me. -Original Message- From: Tom Lane, Sent: Wednesday, January 12, 2005 11:42 AM, Subject: Re: [HACKERS] Much Ado About COUNT(*) [Tom Lane's reply to Jonah H. Harris, quoted above, trimmed]
Re: [HACKERS] Much Ado About COUNT(*)
On Wed, 2005-01-12 at 12:52 -0700, Jonah H. Harris wrote: [quoting Tom Lane's point about the 16-20 bytes of added index tuple overhead, and his own reply, above] I'm very much against this. After some quick math, my database would grow by about 40GB if this was done. Storage isn't that cheap when you include the hot-backup master, various slaves, RAM for caching of this additional index space, the backup storage unit on the SAN, tape backups, additional spindles required to maintain the same performance due to increased IO because I don't have very many queries which would receive an advantage (a big one for me -- we started buying spindles for performance a long time ago), etc. Make it a new index type if you like, but don't impose any new performance constraints on folks who have little to no advantage from the above proposal.
Re: [HACKERS] Much Ado About COUNT(*)
Jonah H. Harris said: [quoting Tom Lane's point about the 16-20 bytes of added index tuple overhead, and his own reply, above] Monetary cost is not the issue - cost in time is the issue. cheers andrew
Re: [HACKERS] Much Ado About COUNT(*)
Jonah H. Harris [EMAIL PROTECTED] writes: You are correct, I am proposing to add visibility to the indexes. Then I think the only way you'll get any support is if it's an option, since it would incur a performance penalty on updates and deletes. As for unqualified counts, I believe that they could take advantage of an index-only scan as it requires much less I/O to perform an index scan than a sequential scan on large tables. No, sequential scans require slightly more i/o than index scans. More importantly they require random access i/o instead of sequential i/o which is much slower. Though this depends. If the tuple is very wide then the index might be faster to scan, since it would only contain the data from the fields being indexed. This brings to mind another approach. It might be handy to split the heap for a table into multiple heaps. The visibility information would only be in one of the heaps. This would be a big win if many of the fields were rarely used, especially if they're rarely used by sequential scans. Relation SOME_USERS user_id BIGINT PK user_nm varchar(32) UNIQUE INDEX some_other_attributes... What's with the fetish with unique indexes? None of this is any different for unique indexes versus non-unique indexes. -- greg
Re: [HACKERS] Much Ado About COUNT(*)
No, sequential scans require slightly more i/o than index scans. More importantly they require random access i/o instead of sequential i/o which is much slower. Just to clear it up, I think what you meant was that the index requires random i/o, not the table. And the word slightly depends on the size of the table, I suppose. And of course it also depends on how many tuples you actually need to retrieve (in this case we're talking about retrieving all the tuples regardless). Though this depends. If the tuple is very wide then the index might be faster to scan since it would only contain the data from the fields being indexed. That, and it seems strange on the surface to visit every entry in an index, since normally indexes are used to find only a small fraction of the tuples. This brings to mind another approach. It might be handy to split the heap for a table into multiple heaps. The visibility information would only be in one of the heaps. This would be a big win if many of the fields were rarely used, especially if they're rarely used by sequential scans. Except then the two heaps would have to be joined somehow for every operation. It makes sense sometimes to (if you have a very wide table) split off the rarely-accessed attributes into a separate table to be joined one-to-one when those attributes are needed. To have the system do that automatically would create problems if the attributes that are split off are frequently accessed, right? Perhaps you could optionally create a separate copy of the same tuple visibility information linked in a way similar to an index. It still seems like you gain very little, and only in some very rare situation that I've never encountered (I've never had the need to do frequent unqualified count()s at the expense of other operations). Now, it seems like it might make a little more sense to use an index for min()/max(), but that's a different story. Regards, Jeff Davis
Re: [HACKERS] Much Ado About COUNT(*)
Andrew Dunstan wrote: Monetary cost is not the issue - cost in time is the issue. cheers andrew We seem to be in agreement. I'm looking for faster/smarter access to data, not the monetary cost of doing so. Isn't it faster/smarter to satisfy a query with the index rather than sequentially scanning an entire relation if it is possible? Replying to the list as a whole: If this is such a bad idea, why do other database systems use it? As a businessperson myself, it doesn't seem logical to me that commercial database companies would spend money on implementing this feature if it wouldn't be used. Remember guys, I'm just trying to help.
Re: [HACKERS] Much Ado About COUNT(*)
Rod Taylor wrote: [my database would] grow by about 40GB if this was done. Storage isn't that cheap when you include the hot-backup master, various slaves, RAM for caching of this additional index space, backup storage unit on the SAN, tape backups, additional spindles required to maintain the same performance due to increased IO, etc. Thanks for the calculation and example. This would be a hefty amount of overhead if none of your queries would benefit from this change. Make it a new index type if you like, but don't impose any new performance constraints on folks who have little to no advantage from the above proposal. I agree with you that some people may not see any benefit from this and that it may look worse performance/storage-wise. I've considered this route, but it seems like more of a workaround than a solution.
Re: [HACKERS] Much Ado About COUNT(*)
I recognize the added cost of implementing index only scans. As storage is relatively cheap these days, everyone I know is more concerned about faster access to data. [...] I just wonder what more people would be happier with and whether the added 16-20 bytes would be extremely noticeable considering most 1-3 year old hardware. I think perhaps you missed the point: it's not about price. If an index takes up more space, it will also take more time to read that index. 16-20 bytes per index entry is way too much extra overhead for most people, no matter what hardware they have. That overhead also tightens performance at what is already the bottleneck for almost every DB: i/o bandwidth. The cost to count the tuples is the cost to read the visibility information for each tuple in the table. A seqscan is the most efficient way to do that, since it's sequential i/o rather than random i/o. The only reason the word index even comes up is because it is inefficient to retrieve a lot of extra attributes you don't need from a table. You might be able to pack that visibility information a little bit more densely in an index than in a table, assuming that the table has more than a couple of columns. But if you shoehorn the visibility information into an index, you destroy much of the value of an index to most people, who require the index to be compact to be efficient. An index isn't really the place for something when all you really want to do is a sequential scan over a smaller amount of data (so that the visibility information is more dense). Make a narrow table, and seqscan over that. Then, if you need more attributes in the table, just do a one-to-one join with a separate table.
Regards, Jeff Davis
Re: [HACKERS] Much Ado About COUNT(*)
On Wed, 12 Jan 2005, Jonah H. Harris wrote: Andrew Dunstan wrote: Monetary cost is not the issue - cost in time is the issue. We seem to be in agreement. I'm looking for faster/smarter access to data, not the monetary cost of doing so. Isn't it faster/smarter to satisfy a query with the index rather than sequentially scanning an entire relation if it is possible? Replying to the list as a whole: If this is such a bad idea, why do other database systems use it? As a businessperson myself, it doesn't seem logical to me that commercial database companies would spend money on implementing this feature if it wouldn't be used. Remember guys, I'm just trying to help. If you're willing to do the work, and have the motivation, probably the best thing to do is just do it. Then you can use empirical measurements of the effect on disk space, speed of various operations, etc. to discuss the merits/demerits of your particular implementation. Then others don't need to feel so threatened by the potential change. Either it'll be (1) an obvious win, or (2) a mixed bag, where allowing the new way to be specified as an option is a possibility, or (3) you'll have to go back to the drawing board if it's an obvious loss. This problem's been talked about a lot, but seeing some code and metrics from someone with a personal interest in solving it would really be progress IMHO. Jon -- Jon Jensen End Point Corporation http://www.endpoint.com/ Software development with Interchange, Perl, PostgreSQL, Apache, Linux, ...
Re: [HACKERS] Much Ado About COUNT(*)
Jon Jensen wrote: If you're willing to do the work, and have the motivation, probably the best thing to do is just do it. [...] This problem's been talked about a lot, but seeing some code and metrics from someone with a personal interest in solving it would really be progress IMHO. Jon, This is pretty much where I'm coming from. I'm looking at working on this, and I'd rather discuss suggestions for how to implement this than whether we should have it, etc. If it turns out better, great; if not, then I just wasted my time. I really do appreciate everyone's comments and suggestions. Thanks! -Jonah
Re: [HACKERS] Much Ado About COUNT(*)
On Wed, Jan 12, 2005 at 13:42:58 -0700, Jonah H. Harris [EMAIL PROTECTED] wrote: We seem to be in agreement. I'm looking for faster/smarter access to data, not the monetary cost of doing so. Isn't it faster/smarter to satisfy a query with the index rather than sequentially scanning an entire relation if it is possible? Not necessarily. Also note that Postgres will use an index scan for count(*) if there is a relatively selective WHERE clause. Replying to the list as a whole: If this is such a bad idea, why do other database systems use it? As a businessperson myself, it doesn't seem logical to me that commercial database companies would spend money on implementing this feature if it wouldn't be used. Remember guys, I'm just trying to help. Other databases use different ways of handling tuples that are only visible to some concurrent transactions. Postgres is also flexible enough that you can make your own materialized view (using triggers) to handle count(*) if that makes sense for you.
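Bruno's trigger-based materialized view suggestion can be sketched as follows (table, function, and trigger names are illustrative; a real deployment would also need to deal with contention and dead tuples on the counter row, as discussed elsewhere in this thread):

```sql
-- A single-row counter per tracked table, maintained by triggers.
CREATE TABLE row_counts (
    tablename text   PRIMARY KEY,
    rowcount  bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION track_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE row_counts SET rowcount = rowcount + 1
            WHERE tablename = TG_RELNAME;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE row_counts SET rowcount = rowcount - 1
            WHERE tablename = TG_RELNAME;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER some_table_count AFTER INSERT OR DELETE ON some_table
    FOR EACH ROW EXECUTE PROCEDURE track_count();

-- count(*) then becomes a single-row lookup:
SELECT rowcount FROM row_counts WHERE tablename = 'some_table';
```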
Re: [HACKERS] Much Ado About COUNT(*)
We seem to be in agreement. I'm looking for faster/smarter access to data, not the monetary cost of doing so. Isn't it faster/smarter to satisfy a query with the index rather than sequentially scanning an entire relation if it is possible? You have to scan every tuple's visibility information no matter what. Sequential i/o is fastest (per byte), so the most efficient possible method is a seqscan. Unfortunately for count(*), the tables also store columns, which are really just clutter as you're moving through the table looking for visibility information. Indexes are designed for searches, not exhaustive retrieval of all records. If you can store that visibility information in a separate place so that it's not cluttered by unneeded attributes, that could be helpful, but an index is not the place for it. If you store it in the index, you are really imposing a new cost on yourself: the cost of doing random i/o as you're jumping around the index trying to access every entry, plus the index metadata. You could make a narrow table and join with the other attributes. That might be a good place that wouldn't clutter up the visibility information much (it would just need a primary key). A seqscan over that would be quite efficient. If this is such a bad idea, why do other database systems use it? As a businessperson myself, it doesn't seem logical to me that commercial database companies would spend money on implementing this feature if it wouldn't be used. Remember guys, I'm just trying to help. Some databases use an internal counter to count rows as they are added/deleted. This does not give accurate results in a database that supports ACID -- more specifically, a database that implements the isolation part of ACID. Two concurrent transactions, if the database supports proper isolation, could have two different results for count(*), and both would be correct. That makes all the optimized count(*) databases really just give a close number, not the real number.
If you just want a close number, there are other ways of doing that in PostgreSQL that people have already mentioned. If you know of a database that supports proper isolation and also has a faster count(*), I would be interested to know what it is. There may be a way to do it without sacrificing in other areas, but I don't know what it is. Does someone know exactly what Oracle actually does? Regards, Jeff Davis
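Jeff's narrow-table idea might look like this in practice (a hypothetical sketch; the table and column names are illustrative, not from the thread):

```sql
-- Narrow "hot" table: cheap to seqscan for counts and visibility checks.
CREATE TABLE users_narrow (
    user_id bigint PRIMARY KEY,
    user_nm varchar(32) NOT NULL
);

-- Wide, rarely-used attributes live in a one-to-one companion table.
CREATE TABLE users_wide (
    user_id bigint PRIMARY KEY REFERENCES users_narrow,
    bio     text,
    prefs   text
);

-- The unqualified count only has to scan the narrow heap:
SELECT count(*) FROM users_narrow;

-- The wide columns are joined in only when actually needed:
SELECT n.user_nm, w.bio
  FROM users_narrow n
  JOIN users_wide w USING (user_id);
```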
Re: [HACKERS] Much Ado About COUNT(*)
Jeff Davis wrote: Does someone know exactly what oracle actually does? some old info resides here, http://www.orsweb.com/techniques/fastfull.html I'll try and find something more recent.
Re: [HACKERS] Much Ado About COUNT(*)
I agree with the last statement. count(*) is not the most important case. The nicest thing about an index-only scan is when the index contains more than one column. When there is a join among many tables, where from each table only one or a few columns are taken, it can boost the query incredibly. For example, when you have a customer table and an (ID, NAME) index on it, then: select c.name,i.* from customer c, invoice i where c.id=i.customer_id makes a HUGE difference. Without an index-only scan you require index io plus random table access (assuming no full scan). With an index-only scan you need only the index scan and can skip the expensive random table io. It is a very simple but powerful optimization that in many cases reduces the join expense on difficult queries. You get some kind of index-organized table (you use only the index, so in fact it is an ordered table). Selecting only a few columns is quite a common scenario in reporting. -Original Message- From: Jonah H. Harris, Sent: Wednesday, January 12, 2005 8:36 PM, Subject: Re: [HACKERS] Much Ado About COUNT(*) [quoting Greg Stark's questions, above] You are correct, I am proposing to add visibility to the indexes. As for unqualified counts, I believe that they could take advantage of an index-only scan as it requires much less I/O to perform an index scan than a sequential scan on large tables. Min/Max would also take advantage of index only scans, but say, for example, that someone has the following: Relation SOME_USERS user_id BIGINT PK user_nm varchar(32) UNIQUE INDEX some_other_attributes... If an application needs the user names, it would run SELECT user_nm FROM SOME_USERS... in the current implementation this would require a sequential scan. On a relation which contains 1M+ tuples, this requires either a lot of I/O or a lot of cache. An index scan would immensely speed up this query.
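The reporting case described above amounts to what other systems call a covering (covered) index. Assuming visibility information were available in the index, a sketch would be (index name is illustrative):

```sql
-- A two-column index that "covers" the columns the join needs
-- from customer.
CREATE INDEX customer_id_name_idx ON customer (id, name);

-- With visibility stored in the index, this join could be answered
-- from customer_id_name_idx without any random access to the
-- customer heap:
SELECT c.name, i.*
  FROM customer c, invoice i
 WHERE c.id = i.customer_id;
```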
Re: [HACKERS] Much Ado About COUNT(*)
On Wed, Jan 12, 2005 at 12:41:38PM -0800, Jeff Davis wrote: Except then the two heaps would have to be joined somehow for every operation. It makes sense sometimes to (if you have a very wide table) split off the rarely-accessed attributes into a separate table to be joined one-to-one when those attributes are needed. To have the system do that automatically would create problems if the attributes that are split off are frequently accessed, right? That mechanism exists right now, and it's called TOAST, dubbed the best thing since sliced bread. We even have documentation for it, new as of our latest RC: http://developer.postgresql.org/docs/postgres/storage-toast.html -- Alvaro Herrera ([EMAIL PROTECTED]) The day you stop changing is the day you stop living
Re: [HACKERS] Much Ado About COUNT(*)
On Wed, Jan 12, 2005 at 14:09:07 -0700, Jonah H. Harris [EMAIL PROTECTED] wrote: Please keep stuff posted to the list so that other people can contribute and learn from the discussion, unless there is a particular reason to limit who is involved in the discussion. Bruno, Thanks for the information. I was told that PostgreSQL couldn't use index scans for count(*) because of the visibility issue. Has something changed or was I told incorrectly? It isn't that it can't, it is that for cases where you are counting more than a few percent of a table, it will be faster to use a sequential scan. Part of the reason is that for any hits you get in the index, you have to check in the table to make sure the current transaction can see the current tuple. Even if you could get away with using just an index scan, you are only going to see a constant-factor speedup, with probably not too big of a constant. Perhaps you think that the count is somehow saved in the index so that you don't have to scan through the whole index to get the number of rows in a table? That isn't the case, but it is what creating a materialized view would effectively do for you.
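Bruno's point about selective predicates can be illustrated with a hypothetical example (table and index names are made up for illustration):

```sql
CREATE INDEX orders_created_idx ON orders (created);

-- Selective predicate: the planner will typically choose an index
-- scan here, though each index hit still requires a heap visit to
-- check tuple visibility.
SELECT count(*) FROM orders
 WHERE created > now() - interval '1 day';

-- Unqualified count: a sequential scan is cheaper, since nearly the
-- whole table must be visited anyway.
SELECT count(*) FROM orders;
```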
Re: [HACKERS] Much Ado About COUNT(*)
On Wed, 2005-01-12 at 15:09 -0500, Rod Taylor wrote: [quoting the earlier exchange between Tom Lane, Jonah H. Harris, and Rod Taylor on the 16-20 byte index overhead and its storage cost, above] Make it a new index type if you like, but don't impose any new performance constraints on folks who have little to no advantage from the above proposal.
Jonah, People's objections are:
- this shouldn't be the system default, so it would need to be implemented as a non-default option on a b-tree index
- it's a lot of code, and if you want it, you gotta do it
Remember you'll need to:
- agree all changes via the list and accept that redesigns may be required, even at a late stage of coding
- write visibility code into the index
- write an additional node type to handle the new capability
- do microarchitecture performance testing so you know whether it's really worthwhile, covering a range of cases
- add code to the optimiser so it can estimate the cost of using this and know when to do it
- add a column to the catalog to record whether an index has the visibility option
- add code to the parser to invoke the option
- update pg_dump so that it correctly dumps tables with that option
- copy and adapt all of the existing tests for the new mechanism
- document it
If you really want to do all of that, I'm sure you'd get help, but mostly it will be you that has to drive the change through. There are some other benefits of that implementation: You'd be able to vacuum the index (only), allowing index access to remain reasonably constant, even as the table itself grew from dead rows. The index could then make sensible the reasonably common practice of using a covered index - i.e. putting additional columns into the index to satisfy the whole query just from the index. -- Best Regards, Simon Riggs
Re: [HACKERS] Much Ado About COUNT(*)
Bruno Wolff III wrote: On Wed, Jan 12, 2005 at 14:09:07 -0700, Jonah H. Harris [EMAIL PROTECTED] wrote: Please keep stuff posted to the list so that other people can contribute and learn from the discussion unless there is a particular reason to limit who is involved in the discussion. not a problem. Perhaps you think that the count is somehow saved in the index so that you don't have to scan through the whole index to get the number of rows in a table? That isn't the case, but is what creating a materialized view would effectively do for you. I understand that the count is not stored in the index. I am saying that it may be faster to generate the count off the keys in the index. I shouldn't have titled this message COUNT(*), as that isn't the only thing I'm trying to accomplish.
Re: [HACKERS] Much Ado About COUNT(*)
Simon Riggs wrote: Jonah, People's objections are: [the list of objections and required work, quoted above] If you really want to do all of that, I'm sure you'd get help, but mostly it will be you that has to drive the change through. Simon, I am willing to take it on, and I understand that the workload is mine. As long as everyone gives me some suggestions, I'm good with it being optional. -Jonah
Re: [HACKERS] Much Ado About COUNT(*)
The index could then make sensible the reasonably common practice of using a covered index - i.e. putting additional columns into the index to satisfy the whole query just from the index. I am willing to take it on and I understand that the workload is mine. As long as everyone gives me some suggestions, I'm good with it being optional. If nobody is working on it, you may find that the TODO item below might accomplish most of what you're looking for, as well as generally improving performance. A count(*) with a WHERE clause would then result in one index scan and one partial sequential heap scan. Not as fast as the specific examples you've shown, but far better than today, and it covers many other cases as well. Fetch heap pages matching index entries in sequential order: rather than randomly accessing heap pages based on index entries, mark heap pages needing access in a bitmap and do the lookups in sequential order. Another method would be to sort heap ctids matching the index before accessing the heap rows.
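A minimal sketch of the second method in that TODO item - sorting the heap ctids produced by an index scan so the heap is visited in block order rather than index order. This is illustrative Python, not PostgreSQL source; a ctid is modeled as a (block number, line pointer) pair.

```python
def fetch_in_heap_order(index_ctids):
    """Group matching ctids by heap block and visit blocks sequentially."""
    by_block = {}
    for block, line in index_ctids:
        by_block.setdefault(block, []).append(line)
    fetched = []
    for block in sorted(by_block):          # one forward pass over the heap
        for line in sorted(by_block[block]):
            fetched.append((block, line))   # stands in for reading the tuple
    return fetched

# Random index order becomes sequential heap order:
ctids = [(7, 2), (1, 5), (7, 1), (3, 9), (1, 2)]
print(fetch_in_heap_order(ctids))  # [(1, 2), (1, 5), (3, 9), (7, 1), (7, 2)]
```

The win is that each heap block is touched at most once, in physical order, instead of being revisited randomly as index entries arrive.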
Re: [HACKERS] Much Ado About COUNT(*)
That mechanism exists right now, and it's called TOAST, dubbed the best thing since sliced bread. We even have documentation for it, new as of our latest RC: http://developer.postgresql.org/docs/postgres/storage-toast.html Thanks for the link. It looks like it breaks values up into chunks of about 2KB. I think the conversation was mostly assuming the tables were somewhat closer to the size of an index. If you have more than 2KB per tuple, pretty much anything you do with an index would be faster, I would think. My original concern was that if I had a table like (x int) and postgres broke the visibility information away from that, it would cause serious performance problems if postgres had to do a join just to do select ... where x = 5. Right? But of course, we all love TOAST. Everyone needs to make those wide tables once in a while, and TOAST does a great job of taking those worries away in an efficient way. I am just saying that hopefully we don't have to seqscan a table with wide tuples very often :) Regards, Jeff Davis
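The chunking described above can be sketched in a few lines. This is only an illustration of the idea, not the real TOAST code: the 2000-byte figure is an approximation of the roughly-2KB chunk size the storage docs describe, and the function names are invented here.

```python
TOAST_CHUNK_SIZE = 2000  # roughly 2KB per chunk, per the storage-toast docs

def toast_split(value: bytes, chunk_size: int = TOAST_CHUNK_SIZE):
    """Split an oversized value into numbered fixed-size chunks."""
    return [(seq, value[i:i + chunk_size])
            for seq, i in enumerate(range(0, len(value), chunk_size))]

def toast_reassemble(chunks):
    """Stitch the chunks back together in sequence order."""
    return b"".join(data for _, data in sorted(chunks))

wide_value = b"x" * 5000          # a 5KB value becomes three chunks
chunks = toast_split(wide_value)
assert len(chunks) == 3
assert toast_reassemble(chunks) == wide_value
```

The main table row then holds only a small pointer to the chunk set, which is why a seqscan of the main table stays cheap even with wide columns.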
Re: [HACKERS] Much Ado About COUNT(*)
Jeff Davis [EMAIL PROTECTED] writes: But of course, we all love toast. Everyone needs to make those wide tables once in a while, and toast does a great job of taking those worries away in an efficient way. I am just saying that hopefully we don't have to seqscan a table with wide tuples very often :) I thought TOAST only handled individual large columns. So if I have a 2kb text column it'll pull that out of the table for me. But if I have 20 columns each of which have 100 bytes, will it still help me? Will it kick in if I define a single column which stores a record type with 20 columns, each of which has a 100 byte string? -- greg
Re: [HACKERS] Much Ado About COUNT(*)
Greg Stark [EMAIL PROTECTED] writes: I thought toast only handled having individual large columns. So if I have a 2kb text column it'll pull that out of the table for me. But if I have 20 columns each of which have 100 bytes will it still help me? Will it kick in if I define a single column which stores a record type with 20 columns each of which have a 100 byte string? Yes, and yes. regards, tom lane
Re: [HACKERS] Much Ado About COUNT(*)
Jonah H. Harris wrote: 1. Is there any answer to Bruce's last statement in the thread "Re: [PERFORM] COUNT(*) again (was Re: Index/Function organized" (http://archives.postgresql.org/pgsql-hackers/2003-10/msg00245.php)? Let me restate my ideas from the above URL and why they are probably wrong. My basic idea was to keep a status bit on each index entry telling us whether a previous backend looked at the heap and determined it was valid. Here are the details: Each heap tuple stores the creation and expire xid. To determine if a tuple is visible, you have to check the clog to see if the recorded transaction ids were committed, in progress, or aborted. When you do the lookup the first time and the transaction isn't in progress, you can update a bit to say that the tuple is visible or not. In fact we have several tuple bits: #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */ #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */ #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */ #define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */ Once set, they allow a later backend to know the visibility of the row without having to re-check clog. The big overhead in index lookups is having to check the heap row for visibility. My idea is that once you check the visibility the first time in the heap, why can't we set some bit in the index so that later index lookups don't have to look at the heap anymore? Over time most index entries would have bits set and you wouldn't need to recheck the heap. (Oh, and you could only update the bit when all active transactions are newer than the creation transaction, so we know they should all see it as visible.) Now, this would work for telling us that the transaction that created the index entry was committed or aborted. Over time most index entries would have that status. I think the problem with this idea is expiration.
If a row is deleted we never go and update the index pointing to that heap row, so the creation status isn't enough for us to know that the row is valid. I can't think of a way to fix this. I think it is expensive to clear the status bits on a row delete because finding the index rows that go with a particular heap row is quite expensive. So, those are my ideas. If they could be made to work, they would give us most of the advantages of an index scan with _few_ heap lookups and very little overhead. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
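The hint-bit mechanism Bruce describes for heap tuples can be sketched as follows. This is a toy model, not PostgreSQL internals: the bit values are copied from the #defines quoted above, but the function, the one-element infomask list, and the clog-as-dict are inventions of this illustration.

```python
HEAP_XMIN_COMMITTED = 0x0100  # t_xmin committed
HEAP_XMIN_INVALID   = 0x0200  # t_xmin invalid/aborted

def xmin_committed(infomask, xmin, clog):
    """Check whether the inserting transaction committed, caching the verdict.

    infomask is a one-element list standing in for the tuple's mutable flag word.
    """
    if infomask[0] & HEAP_XMIN_COMMITTED:
        return True                     # answered from the hint bit, no clog visit
    if infomask[0] & HEAP_XMIN_INVALID:
        return False
    status = clog[xmin]                 # the expensive lookup, done at most once
    if status == "committed":
        infomask[0] |= HEAP_XMIN_COMMITTED   # hint for every later reader
        return True
    if status == "aborted":
        infomask[0] |= HEAP_XMIN_INVALID
        return False
    return None                         # still in progress: no hint may be set

mask = [0]
assert xmin_committed(mask, 10, {10: "committed"}) is True
assert mask[0] & HEAP_XMIN_COMMITTED   # later lookups skip the clog entirely
```

Bruce's proposal is essentially to push the same caching trick one level up, from heap tuples into index entries, so the heap visit itself can be skipped.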
Re: [HACKERS] Much Ado About COUNT(*)
Bruce Momjian pgman@candle.pha.pa.us writes: My basic idea was to keep a status bit on each index entry telling it if a previous backend looked at the heap and determined it was valid. Even if you could track the tuple's committed-good status reliably, that isn't enough under MVCC. The tuple might be committed good, and seen that way by some other backend that set the bit, and yet it's not supposed to be visible to your older transaction. Or the reverse at tuple deletion. regards, tom lane
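Tom's objection - "committed good" is not the same as "visible to you" - can be made concrete with a toy snapshot check. This is an illustrative simplification of MVCC visibility, not the real HeapTupleSatisfies logic; the snapshot shape and names are invented here.

```python
def visible_to(snapshot, xmin, xmax=None):
    """Toy MVCC check: snapshot = (xid horizon, set of xids in progress)."""
    horizon, in_progress = snapshot
    if xmin >= horizon or xmin in in_progress:
        return False    # inserter not yet committed as of this snapshot
    if xmax is not None and xmax < horizon and xmax not in in_progress:
        return False    # deleter is visible to this snapshot: tuple is gone
    return True

# A tuple inserted by xact 100 is committed good, yet an older snapshot
# (taken before xact 100 finished) must not see it:
old_snapshot = (100, set())
new_snapshot = (101, set())
assert visible_to(new_snapshot, xmin=100) is True
assert visible_to(old_snapshot, xmin=100) is False
```

So a single per-entry "valid" bit can never answer the question alone; it can only short-circuit the cases where every live snapshot agrees, which is exactly the refinement discussed next in the thread.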
Re: [HACKERS] Much Ado About COUNT(*)
Tom Lane wrote: Bruce Momjian pgman@candle.pha.pa.us writes: My basic idea was to keep a status bit on each index entry telling it if a previous backend looked at the heap and determined it was valid. Even if you could track the tuple's committed-good status reliably, that isn't enough under MVCC. The tuple might be committed good, and seen that way by some other backend that set the bit, and yet it's not supposed to be visible to your older transaction. Or the reverse at tuple deletion. I mentioned that: (Oh, and you could only update the bit when all active transactions are newer than the creation transaction so we know they should all see it as visible.) -- Bruce Momjian
Re: [HACKERS] Much Ado About COUNT(*)
Bruce Momjian pgman@candle.pha.pa.us writes: Tom Lane wrote: Even if you could track the tuple's committed-good status reliably, that isn't enough under MVCC. I mentioned that: (Oh, and you could only update the bit when all active transactions are newer than the creation transaction so we know they should all see it as visible.) Ah, right, I missed the connection. Hmm ... that's sort of the inverse of the killed tuple optimization we put in a release or two back, where an index tuple is marked as definitely dead once it's committed dead and the deletion is older than all active transactions. Maybe that would work. You'd still have to visit the heap when a tuple is in the uncertain states, but with luck that'd be only a small fraction of the time. I'm still concerned about the update costs of maintaining these bits, but this would at least escape the index-bloat objection. I think we still have one free bit in index tuple headers... regards, tom lane
Re: [HACKERS] Much Ado About COUNT(*)
Tom Lane wrote: Bruce Momjian pgman@candle.pha.pa.us writes: Tom Lane wrote: Even if you could track the tuple's committed-good status reliably, that isn't enough under MVCC. I mentioned that: (Oh, and you could only update the bit when all active transactions are newer than the creation transaction so we know they should all see it as visible.) Ah, right, I missed the connection. Hmm ... that's sort of the inverse of the killed tuple optimization we put in a release or two back, where an index tuple is marked as definitely dead once it's committed dead and the deletion is older than all active transactions. Maybe that would work. You'd still have to visit the heap when a tuple is in the uncertain states, but with luck that'd be only a small fraction of the time. Yes, it is sort of the reverse, but how do you get around the delete case? Even if the bit is set, how do you know it wasn't deleted since you set the bit? Seems you always have to still check the heap, no? I'm still concerned about the update costs of maintaining these bits, but this would at least escape the index-bloat objection. I think we still have one free bit in index tuple headers... You mean you are considering clearing the index bit when you delete the row? -- Bruce Momjian
Re: [HACKERS] Much Ado About COUNT(*)
Bruce Momjian pgman@candle.pha.pa.us writes: Ah, right, I missed the connection. Hmm ... that's sort of the inverse of the killed tuple optimization we put in a release or two back, where an index tuple is marked as definitely dead once it's committed dead and the deletion is older than all active transactions. Yes, it is sort of the reverse, but how do you get around the delete case? A would-be deleter of a tuple would have to go and clear the known good bits on all the tuple's index entries before it could commit. This would bring the tuple back into the uncertain status condition where backends would have to visit the heap to find out what's up. Eventually the state would become certain again (either dead to everyone or live to everyone) and one or the other hint bit could be set again. The ugly part of this is that clearing the bit is not like setting a hint bit, ie it's not okay if we lose that change. Therefore, each bit-clearing would have to be WAL-logged. This is a big part of my concern about the cost. regards, tom lane
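The bit lifecycle Tom outlines can be summarized in a toy model: readers skip the heap only while the entry is in a certain state, and a deleter must durably clear the bit before committing. The class and function names here are invented for illustration; the WAL list merely stands in for the durability requirement.

```python
class IndexEntry:
    """Toy index entry carrying the proposed 'known good' visibility bit."""
    def __init__(self):
        self.known_good = False          # starts in the uncertain state

def reader_needs_heap_visit(entry):
    return not entry.known_good          # certain state: index-only answer

def set_known_good(entry):
    # Only legal once all active transactions are newer than the inserter,
    # so every live snapshot agrees the tuple is visible.
    entry.known_good = True

def delete_tuple(entry, wal):
    # Clearing is not a losable hint: it must be WAL-logged before commit.
    wal.append("clear known_good")
    entry.known_good = False             # readers fall back to the heap

wal = []
entry = IndexEntry()
assert reader_needs_heap_visit(entry)        # uncertain: check the heap
set_known_good(entry)
assert not reader_needs_heap_visit(entry)    # certain: heap visit skipped
delete_tuple(entry, wal)
assert reader_needs_heap_visit(entry) and wal == ["clear known_good"]
```

The WAL write on every delete of an indexed row is exactly the update cost Tom flags; the payoff is that steady-state reads of long-lived rows never touch the heap.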