Re: [PERFORM] multi-column index
On Thu, 17 Mar 2005 23:48:30 -0800, Ron Mayer [EMAIL PROTECTED] wrote: Would this also help estimates in the case where values in a table are tightly clustered, though not in strictly ascending or descending order? No, I was just expanding the existing notion of correlation from single columns to index tuples. For example, address data has many fields that are related to each other (postal codes, cities, states/provinces). This looks like a case for cross-column statistics, though you might not have meant it as such. I guess what you're talking about can also be described with a single column. In a list like 3 3 ... 3 1 1 ... 1 7 7 ... 7 4 4 ... 4 ... equal items are clustered together but the values are not correlated to their positions. This would require a whole new column characteristic, something like the probability that we find the same value in adjacent heap tuples, or the number of different values we can expect on one heap page. The latter might even be easy to compute during ANALYSE. Servus Manfred ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] multi-column index
On Thu, 17 Mar 2005 13:15:32 -0500, Tom Lane [EMAIL PROTECTED] wrote: I am coming around to the view that we really do need to calculate index-specific correlation numbers, Correlation is a first step. We might also want distribution information like number of distinct index tuples and histograms. Now, as to the actual mechanics of getting the numbers: the above link seems to imply reading the whole index in index order. That turned out to be surprisingly easy (no need to look at data values, no operator lookup, etc.) to implement as a proof of concept. As it's good enough for my use cases I never bothered to change it. Which is a hugely expensive proposition for a big index, Just a thought: Could the gathering of the sample be integrated into the bulk delete phase of VACUUM? (I know, ANALYSE is not always performed as an option to VACUUM, and VACUUM might not even have to delete any index tuples.) We need a way to get the number from a small sample of pages. I had better (or at least different) ideas at that time, like walking down the tree, but somehow lost impetus :-( The idea I was toying with was to recalculate the index keys for the sample rows that ANALYZE already acquires, and then compare/sort those. This seems to be the approach that perfectly fits into what we have now. This is moderately expensive CPU-wise though, and it's also not clear what compare/sort means for non-btree indexes. Nothing. We'd need some notion of clusteredness instead of correlation. C.f. my answer to Ron in this thread. BTW, the more I think about it, the more I come to the conclusion that when the planner starts to account for clusteredness, random page cost has to be raised. Servus Manfred ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] multi-column index
On Wed, 16 Mar 2005 22:19:13 -0500, Tom Lane [EMAIL PROTECTED] wrote: calculate the correlation explicitly for each index May be it's time to revisit an old proposal that has failed to catch anybody's attention during the 7.4 beta period: http://archives.postgresql.org/pgsql-hackers/2003-08/msg00937.php I'm not sure I'd store index correlation in a separate table today. You've invented something better for functional index statistics, AFAIR. Servus Manfred ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] multi-column index
May be it's time to revisit an old proposal that has failed to catch anybody's attention during the 7.4 beta period: http://archives.postgresql.org/pgsql-hackers/2003-08/msg00937.php I'm not sure I'd store index correlation in a separate table today. You've invented something better for functional index statistics, AFAIR. Make it deal with cross-table fk correlations as well :) Chris ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] multi-column index
Manfred Koizar [EMAIL PROTECTED] writes: On Wed, 16 Mar 2005 22:19:13 -0500, Tom Lane [EMAIL PROTECTED] wrote: calculate the correlation explicitly for each index May be it's time to revisit an old proposal that has failed to catch anybody's attention during the 7.4 beta period: http://archives.postgresql.org/pgsql-hackers/2003-08/msg00937.php I'm not sure I'd store index correlation in a separate table today. You've invented something better for functional index statistics, AFAIR. Well, the original motivation for calculating correlations on columns was that historically, you didn't need to re-ANALYZE after creating an index: the stats on the base table were already in place. So the idea was to have the correlations already available whether or not the index existed. This works fine for plain indexes on single columns ;-). We didn't realize (or at least I didn't) how poorly the per-column stats apply to multi-column indexes. I am coming around to the view that we really do need to calculate index-specific correlation numbers, and that probably does need a special table ... or maybe better, add a column to pg_index. The column in pg_statistic is useless and should be removed, because there isn't any need for per-column correlation. Now, as to the actual mechanics of getting the numbers: the above link seems to imply reading the whole index in index order. Which is a hugely expensive proposition for a big index, especially one that's grown rather than been built recently --- the physical and logical orderings of the index will be different. (Hm, maybe we need a stat about the extent of disorder within the index itself?) We need a way to get the number from a small sample of pages. The idea I was toying with was to recalculate the index keys for the sample rows that ANALYZE already acquires, and then compare/sort those. This is moderately expensive CPU-wise though, and it's also not clear what compare/sort means for non-btree indexes. If we could get a correlation estimate by probing only a small fraction of the index pages, that would work, but in a disordered index I'm not sure how you figure out what you're looking at. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[PERFORM] multi-column index
Hello. I have a problem concerning multi-column indexes. I have a table containing some 250k lines. Table public.descriptionprodftdiclnk Column| Type | Modifiers -+-+--- idword | integer | not null idqualifier | integer | not null Indexes: descriptionprodftdiclnk_pkey primary key, btree (idword, idqualifier) ix_descriptionprodftdiclnk_idqualif btree (idqualifier) When analyzing a simple query on the idword column the query planner displays: explain analyze select * from descriptionprodftdiclnk where idword=44; QUERY PLAN --- Seq Scan on descriptionprodftdiclnk (cost=0.00..4788.14 rows=44388 width=8) (actual time=87.582..168.041 rows=43792 loops=1) Filter: (idword = 44) Total runtime: 195.339 ms (3 rows) I don't understand why the query planner would not use the default created multi-column index on the primary key. According to the Postgres online documentation it should. By setting the enable_seqscan parameter to off, i can force the planner to use the index: explain analyze select * from descriptionprodftdiclnk where idword=44; QUERY PLAN --- Index Scan using descriptionprodftdiclnk_pkey on descriptionprodftdiclnk (cost=0.00..36720.39 rows=44388 width=8) (actual time=0.205..73.489 rows=43792 loops=1) Index Cond: (idword = 44) Total runtime: 100.564 ms (3 rows) On the other hand, by defining a new index on the idword column (and enable_seqscan set to on), the query uses the index: create index ix_tempIndex on descriptionprodftdiclnk(idword); CREATE INDEX explain analyze select * from descriptionprodftdiclnk where idword=44; QUERY PLAN - Index Scan using ix_tempindex on descriptionprodftdiclnk (cost=0.00..916.24 rows=44388 width=8) (actual time=0.021..79.879 rows=43792 loops=1) Index Cond: (idword = 44) Total runtime: 107.081 ms (3 rows) Could someone provide an explanation for the planner's behaviour? Thanks for your help, Daniel ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] multi-column index
Daniel, Table public.descriptionprodftdiclnk What is this, German? ;-) explain analyze select * from descriptionprodftdiclnk where idword=44; QUERY PLAN --- Seq Scan on descriptionprodftdiclnk (cost=0.00..4788.14 rows=44388 width=8) (actual time=87.582..168.041 rows=43792 loops=1) Filter: (idword = 44) Total runtime: 195.339 ms (3 rows) explain analyze select * from descriptionprodftdiclnk where idword=44; QUERY PLAN --- Index Scan using descriptionprodftdiclnk_pkey on descriptionprodftdiclnk (cost=0.00..36720.39 rows=44388 width=8) (actual time=0.205..73.489 rows=43792 loops=1) Index Cond: (idword = 44) Total runtime: 100.564 ms (3 rows) create index ix_tempIndex on descriptionprodftdiclnk(idword); CREATE INDEX explain analyze select * from descriptionprodftdiclnk where idword=44; QUERY PLAN --- -- Index Scan using ix_tempindex on descriptionprodftdiclnk (cost=0.00..916.24 rows=44388 width=8) (actual time=0.021..79.879 rows=43792 loops=1) Index Cond: (idword = 44) Total runtime: 107.081 ms (3 rows) Could someone provide an explanation for the planner's behaviour? Pretty simple, really. Look at the cost calculations for the index scan for the multi-column index.PostgreSQL believes that: The cost of a seq scan is 4788.14 The cost of an 2-column index scan is 36720.39 The cost of a 1-column index scan is 916.24 Assuming that you ran each of these queries multiple times to eliminate caching as a factor, the issue is that the cost calculations are wrong. We give you a number of GUC variables to change that: effective_cache_size random_page_cost cpu_tuple_cost etc. See the RUNTIME-CONFIGURATION docs for more details. -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] multi-column index
David Brown [EMAIL PROTECTED] writes: Actually, I'm surprised the planner came up with such a low cost for the single column index, unless ... perhaps correlation statistics aren't used when determining costs for multi-column indexes? The correlation calculation for multi-column indexes is pretty whacked out pre-8.0. I don't think it's that great in 8.0 either --- we really need to make ANALYZE calculate the correlation explicitly for each index, probably, rather than trying to use per-column correlations. regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match