Re: [HACKERS] Index of a table is not used (in any case)
Tom Lane writes:
> Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
> > Imho one of the biggest sources for problems is people creating new
> > indexes on populated tables when the rest of the db/table has badly
> > outdated statistics or even only default statistics in place. In this
> > situation the optimizer is badly misguided, because it now sees
> > completely inconsistent statistics to work on. (e.g. old indexes on
> > that table may seem way too cheap compared to table scan)
>
> I don't think any of this is correct. We don't have per-index
> statistics. The only stats updated by CREATE INDEX are the same ones
> updated by plain VACUUM, viz the number-of-tuples and number-of-pages
> counts in pg_class.

1. Have I said anything about stats other than relpages and reltuples?

2. There is only limited use in the most accurate pg_statistic if
reltuples and relpages are completely off. With the current behavior you
e.g. get:

	rel1:   pages = 10	-- updated from create index
	 index1 pages = 2	-- outdated
	 index2 pages = 2000	-- current
	rel2:   pages = 1	-- outdated

	-- Optimizer will prefer join order: rel2, rel1

> I believe it's reasonable to update those stats more often than the
> pg_statistic stats (in fact, if we could keep them constantly
> up-to-date at a reasonable cost, we'd do so).

There is a whole lot of difference between keeping them constantly up to
date and modifying (part of) them in the CREATE INDEX command, so I do
not contest your sentence above, but imho the conclusion is wrong.

> The pg_statistic stats are designed as much as possible to be
> independent of the absolute number of rows in the table, so that it's
> okay if they are out of sync with the pg_class stats.

Independent of the row count, yes, but then they can only be good for
choosing whether to use an index or a seq scan. They are not sufficient
to choose a good join order.
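The danger of mixing fresh and stale page counts can be seen with a toy nested-loop cost model. This is purely illustrative (invented formula and numbers, not the planner's actual arithmetic): a relation whose relpages is stale at 1 looks like the cheapest relation to start the join with, even if it has long since grown.

```python
# Toy nested-loop cost: read the outer relation once, then scan the
# inner relation once per outer page (a deliberate oversimplification).
def nl_cost(outer_pages, inner_pages):
    return outer_pages + outer_pages * inner_pages

rel1_pages = 10       # refreshed by CREATE INDEX
rel2_pages = 1        # stale: the table has actually grown well past 1 page

# With the stale figure, the join order (rel2, rel1) looks much cheaper...
assert nl_cost(rel2_pages, rel1_pages) < nl_cost(rel1_pages, rel2_pages)

# ...but with rel2's true size, the preference reverses.
rel2_true_pages = 500
assert nl_cost(rel2_true_pages, rel1_pages) > nl_cost(rel1_pages, rel2_true_pages)
```

The point is not the particular cost function but that join-order decisions compare relations against each other, so refreshing the stats of only one relation (or one index) can flip the comparison.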
> The major reason why "you vacuumed but you never analyzed" is such a
> killer is that in the absence of any pg_statistic data, the default
> selectivity estimates are such that you may get either an index or seq
> scan depending on how big the table is. The cost estimates are
> nonlinear (correctly so, IMHO, though I wouldn't necessarily defend
> the exact shape of the curve) and ye olde default 0.01 will give you
> an indexscan for a small table but not for a big one. In 7.2 I have
> reduced the default selectivity estimate to 0.005, for a number of
> reasons but mostly to get it out of the range where the decision will
> flip-flop.

Yes, the new selectivity is better, imho even still too high. Imho the
strategy should be to assume a good selectivity of values in the absence
of pg_statistic evidence. If the index were not selective enough for an
average query, the dba should not have created the index in the first
place.

> test71=# create table foo (f1 int);
> test71=# create index fooi on foo(f1);
> test71=# explain select * from foo where f1 = 42;
> Index Scan using fooi on foo  (cost=0.00..8.14 rows=10 width=4)
> test71=# update pg_class set reltuples = 10, relpages = 1000 where relname = 'foo';
> Index Scan using fooi on foo  (cost=0.00..1399.04 rows=1000 width=4)
> test71=# update pg_class set reltuples = 100, relpages = 1 where relname = 'foo';
> Seq Scan on foo  (cost=0.00..22500.00 rows=1 width=4)
>
> In current sources you keep getting an indexscan as you increase the
> number of tuples...

As you can see, it toppled at 10 Mio rows :-(

Andreas

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org
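The default-selectivity arithmetic behind this exchange is simple: with no pg_statistic data, the estimated row count is just reltuples times a fixed constant, so the estimate grows linearly with the table. A sketch (the 0.01 and 0.005 constants come from the message above; the helper name is mine):

```python
def default_estimated_rows(reltuples, default_selectivity):
    # With no pg_statistic data, the planner falls back on a fixed
    # selectivity, so the row estimate scales linearly with table size.
    return max(1, round(reltuples * default_selectivity))

# Old 7.1-era default vs. the 7.2 default mentioned above:
for reltuples in (1_000, 1_000_000):
    old = default_estimated_rows(reltuples, 0.01)
    new = default_estimated_rows(reltuples, 0.005)
    print(reltuples, old, new)
```

This also explains the "rows=10" in the quoted EXPLAIN output: the freshly created table carries the default reltuples of 1000, and 1000 × 0.01 = 10.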
Re: [HACKERS] Index of a table is not used (in any case)
> Of course the question "did you vacuum" (better, "did you analyze") is
> annoying, just as the requirement to analyze is annoying in the first
> place, but unless someone designs a better query planner it will have
> to do. The reason why we always ask that question first is that people
> invariably have not analyzed.

I think it is also not always useful to ANALYZE. There are applications
that choose optimal plans with only the rudimentary statistics VACUUM
creates, and even some that use optimal plans with only the default
statistics in place.

Imho one of the biggest sources for problems is people creating new
indexes on populated tables when the rest of the db/table has badly
outdated statistics or even only default statistics in place. In this
situation the optimizer is badly misguided, because it now sees
completely inconsistent statistics to work on. (e.g. old indexes on that
table may seem way too cheap compared to a table scan)

I would thus propose a more distinguished approach to writing the
statistics gathered during CREATE INDEX to the system tables. Something
like:

	if (default stats in place)
		write defaults
	else if (this is the only index)
		write gathered statistics
	else
		write only normalized statistics for the index, e.g.:
		index.reltuples = table.reltuples;
		index.relpages = index.gathered.relpages *
		                 table.relpages / table.gathered.relpages;

Andreas
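The normalization step of the proposal is just a scaling of the freshly measured index size by the ratio of the stale table stats to the fresh ones, so the new index stays consistent with whatever (possibly outdated) picture the planner already has of the table. A toy sketch of that arithmetic (illustrative only, not PostgreSQL code; the parameter names mirror the pseudocode above):

```python
def normalized_index_stats(table_reltuples, table_relpages,
                           gathered_table_relpages,
                           gathered_index_relpages):
    """Scale freshly gathered index page counts so they stay consistent
    with the (possibly stale) table stats already in pg_class."""
    index_reltuples = table_reltuples  # keep the existing tuple count
    index_relpages = (gathered_index_relpages *
                      table_relpages / gathered_table_relpages)
    return index_reltuples, index_relpages

# Example: pg_class still says 1000 tuples / 10 pages, but the table was
# measured at 100 pages during CREATE INDEX, and the index at 40 pages.
# The index is recorded at 4 pages, preserving the 40:100 ratio relative
# to the stale 10-page table figure.
print(normalized_index_stats(1000, 10, 100, 40))
```

The design choice is the one argued throughout the thread: it is better for all of a table's statistics to be consistently stale than for some of them to be fresh.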
Re: [HACKERS] Index of a table is not used (in any case)
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
> Imho one of the biggest sources for problems is people creating new
> indexes on populated tables when the rest of the db/table has badly
> outdated statistics or even only default statistics in place. In this
> situation the optimizer is badly misguided, because it now sees
> completely inconsistent statistics to work on. (e.g. old indexes on
> that table may seem way too cheap compared to table scan)

I don't think any of this is correct. We don't have per-index
statistics. The only stats updated by CREATE INDEX are the same ones
updated by plain VACUUM, viz the number-of-tuples and number-of-pages
counts in pg_class. I believe it's reasonable to update those stats more
often than the pg_statistic stats (in fact, if we could keep them
constantly up-to-date at a reasonable cost, we'd do so). The
pg_statistic stats are designed as much as possible to be independent of
the absolute number of rows in the table, so that it's okay if they are
out of sync with the pg_class stats.

The major reason why "you vacuumed but you never analyzed" is such a
killer is that in the absence of any pg_statistic data, the default
selectivity estimates are such that you may get either an index or seq
scan depending on how big the table is. The cost estimates are nonlinear
(correctly so, IMHO, though I wouldn't necessarily defend the exact
shape of the curve) and ye olde default 0.01 will give you an indexscan
for a small table but not for a big one. In 7.2 I have reduced the
default selectivity estimate to 0.005, for a number of reasons but
mostly to get it out of the range where the decision will flip-flop.
Observe:

test71=# create table foo (f1 int);
CREATE
test71=# create index fooi on foo(f1);
CREATE
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Index Scan using fooi on foo  (cost=0.00..8.14 rows=10 width=4)

EXPLAIN
test71=# select reltuples,relpages from pg_class where relname = 'foo';
 reltuples | relpages
-----------+----------
      1000 |       10
(1 row)

test71=# update pg_class set reltuples = 10, relpages = 1000 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Index Scan using fooi on foo  (cost=0.00..1399.04 rows=1000 width=4)

EXPLAIN
test71=# update pg_class set reltuples = 100, relpages = 1 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Seq Scan on foo  (cost=0.00..22500.00 rows=1 width=4)

EXPLAIN
test71=#

In current sources you keep getting an indexscan as you increase the
number of tuples...

			regards, tom lane
Re: [HACKERS] Index of a table is not used (in any case)
> > > *very* slow, due to seq scan on 20 million entries, which is a
> > > test setup up to now)
> >
> > Perennial first question: did you VACUUM ANALYZE?
>
> Can there, or could there, be a notion of rule based optimization of
> queries in PostgreSQL? The "not using index" problem is probably the
> most common and most misunderstood problem.

There is a (sort of) rule based behavior in PostgreSQL. The downside of
the current implementation is that commands other than ANALYZE (e.g.
CREATE INDEX) partly update optimizer statistics. This is bad behavior,
since then only part of the statistics are accurate, and statistics
always have to be seen in context with other tables' and other indexes'
statistics. Thus, currently the rule based optimizer only works if you
create the indexes on empty tables (before loading data), which
obviously has downsides. Otherwise you have no choice but to ANALYZE
frequently.

I have tried hard to fight for this pseudo rule based behavior, but was
only partly successful in convincing core. My opinion is that (unless
runtime statistics are kept) no command other than ANALYZE should be
allowed to touch optimizer-relevant statistics (maybe unless explicitly
told to).

Andreas
Re: [HACKERS] Index of a table is not used (in any case)
Zeugswetter Andreas SB SD wrote:
> > > > *very* slow, due to seq scan on 20 million entries, which is a
> > > > test setup up to now)
> > >
> > > Perennial first question: did you VACUUM ANALYZE?
> >
> > Can there, or could there, be a notion of rule based optimization of
> > queries in PostgreSQL? The "not using index" problem is probably the
> > most common and most misunderstood problem.
>
> There is a (sort of) rule based behavior in PostgreSQL. The downside
> of the current implementation is that commands other than ANALYZE
> (e.g. CREATE INDEX) partly update optimizer statistics. This is bad
> behavior, since then only part of the statistics are accurate, and
> statistics always have to be seen in context with other tables' and
> other indexes' statistics. Thus, currently the rule based optimizer
> only works if you create the indexes on empty tables (before loading
> data), which obviously has downsides. Otherwise you have no choice but
> to ANALYZE frequently.
>
> I have tried hard to fight for this pseudo rule based behavior, but
> was only partly successful in convincing core. My opinion is that
> (unless runtime statistics are kept) no command other than ANALYZE
> should be allowed to touch optimizer-relevant statistics (maybe unless
> explicitly told to).

Perhaps there could be an extension to ANALYZE, i.e. ANALYZE RULEBASED
tablename, that would restore or recalculate the state a table would be
in if all indexes were created from scratch?

The "not using index" issue was very frustrating to understand. The
stock answer, "did you vacuum?", just isn't enough. There has to be some
explanation (in the FAQ or something) about the indexed key distribution
in your data. Postgres' statistics are pretty poor too; a relative few
very populous entries in a table will make it virtually impossible for
the cost based optimizer (CBO) to use an index. At my site we have lots
of tables that have many duplicate items in an index. It is a music
based site and has a huge number of "Various Artists" entries.
No matter what we do, there is NO way to get Postgres to use the index
from the query alone. We have over 20 thousand artists, but 5 "Various
Artists" or "Soundtrack" entries change the statistics so much that they
exclude an index scan. We have to run the system with sequential scan
disabled. Running with seq scan disabled eliminates the usefulness of
the CBO, because even when a table scan is justified, it does an index
scan.

I have approached this windmill before and am a bit regretful at
bringing it up again, but it is important, very important. There needs
to be a way to direct the optimizer about how to optimize the query.

Using "set foo=bar" prior to a query is not acceptable. Web sites use
persistent connections to the databases, and since SET cannot be
restored, you either override global settings for the session or have to
code the proper default setting in the web page. The result is either
that different web processes will behave differently depending on the
order in which they execute queries, or you have to have your DBA write
web pages.

A syntax like:

	select * from table where /* enable_seqscan = false */ key = 'value';

would be great, in that you could tune the optimizer as long as the
settings applied only to the clause directly following the directive,
without affecting the state of the session or transaction. For instance:

	select id from t1, t2
	where /* enable_seqscan = false */ t1.key = 'value'
	and t2.key = 'test' and t1.id = t2.id;

The t1.key = 'value' condition would be prohibited from using a
sequential scan, while t2.key = 'test' would use one if it made sense.
Is this possible?
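Short of planner support, the effect could be approximated client-side: a query wrapper that recognizes the hint comment, emits SET before the statement, and RESET afterwards so a pooled connection is left in a known state. A rough sketch (the hint syntax is the hypothetical one proposed above, and the helper is mine; note also that this applies per statement rather than per clause, and that RESET restores the server default rather than a previously SET value):

```python
import re

# Hypothetical inline hint syntax from the proposal above:
#   /* enable_seqscan = false */
HINT = re.compile(r"/\*\s*(\w+)\s*=\s*(\w+)\s*\*/")

def with_hints(sql):
    """Expand inline hint comments into SET .../RESET ... statements
    wrapped around the query, so session state is not left modified."""
    hints = HINT.findall(sql)
    stripped = HINT.sub("", sql).strip()
    sets = [f"SET {name} = {value};" for name, value in hints]
    resets = [f"RESET {name};" for name, _ in hints]
    return sets + [stripped] + resets

for stmt in with_hints("select * from t where /* enable_seqscan = false */ k = 'v'"):
    print(stmt)
```

Each returned statement would be sent to the server in order; the actual query runs with the setting active, and the RESET runs even before the connection goes back to the pool.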
Re: [HACKERS] Index of a table is not used (in any case)
mlw writes:
> The "not using index" issue was very frustrating to understand. The
> stock answer, "did you vacuum?", just isn't enough. There has to be
> some explanation (in the FAQ or something) about the indexed key
> distribution in your data.

Most "not using index" questions seem to be related to a
misunderstanding by users to the effect that if there is an index it
must be used, no matter what the query, which is of course far from
reality. Add to that the (related) category of inquiries from people who
think the index ought to be used but don't have any actual timings to
show, and you have a lot of people that just need to be educated.

Of course the question "did you vacuum" (better, "did you analyze") is
annoying, just as the requirement to analyze is annoying in the first
place, but unless someone designs a better query planner it will have to
do. The reason why we always ask that question first is that people
invariably have not analyzed. A seasoned developer can often tell from
the EXPLAIN output whether ANALYZE has been done, but users cannot.
Perhaps something can be done in this area, but I'm not exactly sure
what.

-- 
Peter Eisentraut   [EMAIL PROTECTED]   http://funkturm.homeip.net/~peter
Re: [HACKERS] Index of a table is not used (in any case)
mlw [EMAIL PROTECTED] writes:
> ... Postgres' statistics are pretty poor too, a relative few very
> populous entries in a table will make it virtually impossible for the
> cost based optimizer (CBO) to use an index.

Have you looked at development sources lately?

			regards, tom lane
Re: [HACKERS] Index of a table is not used (in any case)
Doug McNaught wrote:
> Reiner Dassing [EMAIL PROTECTED] writes:
> > Hello PostgreSQl Users!
> >
> > PostSQL V 7.1.1: I have defined a table and the necessary indices.
> > But the index is not used in every SELECT. (Therefore, the selects
> > are *very* slow, due to seq scan on 20 million entries, which is a
> > test setup up to now)
>
> Perennial first question: did you VACUUM ANALYZE?

Can there, or could there, be a notion of rule based optimization of
queries in PostgreSQL? The "not using index" problem is probably the
most common and most misunderstood problem.
Re: [HACKERS] Index of a table is not used (in any case)
> Hello PostgreSQl Users!
>
> PostSQL V 7.1.1:

You should upgrade to 7.1.3 at some point...

> I have defined a table and the necessary indices. But the index is not
> used in every SELECT. (Therefore, the selects are *very* slow, due to
> seq scan on 20 million entries, which is a test setup up to now) The
> definitions can be seen in the annex. Does somebody know the reason
> and how to circumvent the seq scan?

Yes. You probably have not run 'VACUUM ANALYZE' on your large table.

> Is the order of index creation relevant? I.e., should I create the
> indices before inserting entries or the other way around?

If you are inserting a great many entries, insert the data first and
then create the indices - it will be much faster this way.

> Should a hashing index be used? (I tried this, but I got the known
> error "Out of overflow pages")

Just do the default CREATE INDEX - btree should be fine... (probably)

> The table entry 'epoche' is used in two different indices. Should that
> be avoided?

It's not a problem, but just check your EXPLAIN output after the VACUUM
to check that you have them right.

Chris
Re: [HACKERS] Index of a table is not used (in any case)
Reiner Dassing [EMAIL PROTECTED] writes:
> Hello PostgreSQl Users!
>
> PostSQL V 7.1.1: I have defined a table and the necessary indices. But
> the index is not used in every SELECT. (Therefore, the selects are
> *very* slow, due to seq scan on 20 million entries, which is a test
> setup up to now)

Perennial first question: did you VACUUM ANALYZE?

-Doug
-- 
Let us cross over the river, and rest under the shade of the trees.
   --T. J. Jackson, 1863