Re: [HACKERS] Index of a table is not used (in any case)

2001-10-25 Thread Zeugswetter Andreas SB SD

Tom Lane writes:
> Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
>> Imho one of the biggest sources for problems is people creating new
>> indexes on populated tables when the rest of the db/table has badly
>> outdated statistics or even only default statistics in place.
>> In this situation the optimizer is badly misguided, because it now
>> sees completely inconsistent statistics to work on.
>> (e.g. old indexes on that table may seem way too cheap compared
>> to table scan)
>
> I don't think any of this is correct.  We don't have per-index
> statistics.  The only stats updated by CREATE INDEX are the same ones
> updated by plain VACUUM, viz the number-of-tuples and number-of-pages
> counts in pg_class.

1. Have I said anything about stats other than relpages and reltuples?

2. Even the most accurate pg_statistic is of limited use if reltuples
and relpages are completely off. With the current behavior you e.g. get:

rel1:   pages = 10   -- updated from create index
index1: pages = 2    -- outdated
index2: pages = 2000 -- current

rel2:   pages = 1    -- outdated

-- Optimizer will prefer join order: rel2, rel1

> I believe it's reasonable to update those stats
> more often than the pg_statistic stats (in fact, if we could keep them
> constantly up-to-date at a reasonable cost, we'd do so).

There is a whole lot of difference between keeping them constantly up to
date and modifying (part of) them in the CREATE INDEX command, so I do
not contest your above sentence, but imho the conclusion is wrong.

> The
> pg_statistic stats are designed as much as possible to be independent
> of the absolute number of rows in the table, so that it's okay if they
> are out of sync with the pg_class stats.

Independent or not, they are only good for choosing whether to use an
index or a seq scan. They are not sufficient to choose a good join order.
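To make the join-order point concrete, a minimal sketch (using the
hypothetical rel1/rel2 names from the example further up): the planner's
join-order costing starts from the absolute size estimates in pg_class,
so if CREATE INDEX refreshed one relation but not the other, comparing
them is meaningless even with perfect pg_statistic data.

```sql
-- Inspect the absolute size estimates the planner feeds into
-- join-order costing. If rel1 was refreshed by CREATE INDEX but
-- rel2 still carries default or stale numbers, the cost comparison
-- between the two join orders is garbage.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('rel1', 'rel2');
```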

> The major reason why you vacuumed but you never analyzed is such a
> killer is that in the absence of any pg_statistic data, the default
> selectivity estimates are such that you may get either an index or seq
> scan depending on how big the table is.  The cost estimates are
> nonlinear (correctly so, IMHO, though I wouldn't necessarily defend the
> exact shape of the curve) and ye olde default 0.01 will give you an
> indexscan for a small table but not for a big one.  In 7.2 I have
> reduced the default selectivity estimate to 0.005, for a number of
> reasons but mostly to get it out of the range where the decision will
> flip-flop.

Yes, the new selectivity is better, though imho still too high.
Imho the strategy should be to assume a good selectivity
of values in the absence of pg_statistic evidence.
If the index were not selective enough for an average query, the
dba should not have created the index in the first place.

> test71=# create table foo (f1 int);
> test71=# create index fooi on foo(f1);
> test71=# explain select * from foo where f1 = 42;
>
> Index Scan using fooi on foo  (cost=0.00..8.14 rows=10 width=4)
>
> test71=# update pg_class set reltuples = 10, relpages = 1000 where relname = 'foo';
> Index Scan using fooi on foo  (cost=0.00..1399.04 rows=1000 width=4)
>
> test71=# update pg_class set reltuples = 100, relpages = 1 where relname = 'foo';
>
> Seq Scan on foo  (cost=0.00..22500.00 rows=1 width=4)
>
> In current sources you keep getting an indexscan as you increase the
> number of tuples...

As you can see, it toppled at 10 million rows :-(

Andreas

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Index of a table is not used (in any case)

2001-10-25 Thread Zeugswetter Andreas SB SD


> Of course the question "did you vacuum" (better, "did you analyze") is
> annoying, just as the requirement to analyze is annoying in the first
> place, but unless someone designs a better query planner it will have
> to do.  The reason why we always ask that question first is that
> people invariably have not analyzed.

I think it is also not always useful to ANALYZE. There are applications
that choose optimal plans with only the rudimentary statistics VACUUM
creates, and even some that use optimal plans with only the default
statistics in place.

Imho one of the biggest sources for problems is people creating new
indexes on populated tables when the rest of the db/table has badly
outdated statistics or even only default statistics in place.
In this situation the optimizer is badly misguided, because it now
sees completely inconsistent statistics to work on.
(e.g. old indexes on that table may seem way too cheap compared 
to table scan) 

I would thus propose a more discriminating approach to writing the
statistics gathered during CREATE INDEX to the system tables.

Something like:

    if (default stats in place)
        write defaults;
    else if (this is the only index on the table)
        write gathered statistics;
    else
        /* write only normalized statistics for the index, e.g. */
        index.reltuples = table.reltuples;
        index.relpages = (index.gathered.relpages *
                          table.relpages / table.gathered.relpages);

Andreas




Re: [HACKERS] Index of a table is not used (in any case)

2001-10-25 Thread Tom Lane

Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
> Imho one of the biggest sources for problems is people creating new
> indexes on populated tables when the rest of the db/table has badly
> outdated statistics or even only default statistics in place.
> In this situation the optimizer is badly misguided, because it now
> sees completely inconsistent statistics to work on.
> (e.g. old indexes on that table may seem way too cheap compared
> to table scan)

I don't think any of this is correct.  We don't have per-index
statistics.  The only stats updated by CREATE INDEX are the same ones
updated by plain VACUUM, viz the number-of-tuples and number-of-pages
counts in pg_class.  I believe it's reasonable to update those stats
more often than the pg_statistic stats (in fact, if we could keep them
constantly up-to-date at a reasonable cost, we'd do so).  The
pg_statistic stats are designed as much as possible to be independent
of the absolute number of rows in the table, so that it's okay if they
are out of sync with the pg_class stats.

The major reason why "you vacuumed but you never analyzed" is such a
killer is that in the absence of any pg_statistic data, the default
selectivity estimates are such that you may get either an index or seq
scan depending on how big the table is.  The cost estimates are
nonlinear (correctly so, IMHO, though I wouldn't necessarily defend the
exact shape of the curve) and ye olde default 0.01 will give you an
indexscan for a small table but not for a big one.  In 7.2 I have
reduced the default selectivity estimate to 0.005, for a number of
reasons but mostly to get it out of the range where the decision will
flip-flop.  Observe:

test71=# create table foo (f1 int);
CREATE
test71=# create index fooi on foo(f1);
CREATE
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Index Scan using fooi on foo  (cost=0.00..8.14 rows=10 width=4)

EXPLAIN
test71=# select reltuples,relpages from pg_class where relname = 'foo';
 reltuples | relpages
---+--
  1000 |   10
(1 row)

test71=# update pg_class set reltuples = 10, relpages = 1000 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Index Scan using fooi on foo  (cost=0.00..1399.04 rows=1000 width=4)

EXPLAIN
test71=# update pg_class set reltuples = 100, relpages = 1 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Seq Scan on foo  (cost=0.00..22500.00 rows=1 width=4)

EXPLAIN
test71=#

In current sources you keep getting an indexscan as you increase the
number of tuples...

regards, tom lane




Re: [HACKERS] Index of a table is not used (in any case)

2001-10-24 Thread Zeugswetter Andreas SB SD


>>> *very* slow, due to seq scan on
>>> 20 million entries, which is a test setup up to now)
>>
>> Perennial first question: did you VACUUM ANALYZE?
>
> Can there, or could there, be a notion of rule based optimization of
> queries in PostgreSQL? The not using index problem is probably the most
> common and most misunderstood problem.

There is a (sort of) rule based behavior in PostgreSQL;
the downside of the current implementation is that certain
commands other than ANALYZE (e.g. CREATE INDEX) partly update
optimizer statistics. This is bad behavior, since then only part
of the statistics is accurate. Statistics always have to be seen
in the context of other tables' and other indexes' statistics.

Thus, currently the rule based optimizer only works if you create 
the indexes on empty tables (before loading data), which obviously 
has downsides. Else you have no choice but to ANALYZE frequently.

I have tried hard to fight for this pseudo rule based behavior,
but was only partly successful in convincing core. My opinion is
that (unless runtime statistics are kept) no command other than
ANALYZE should be allowed to touch optimizer-relevant statistics
(maybe unless explicitly told to).

Andreas




Re: [HACKERS] Index of a table is not used (in any case)

2001-10-24 Thread mlw

Zeugswetter Andreas SB SD wrote:

>>>> *very* slow, due to seq scan on
>>>> 20 million entries, which is a test setup up to now)
>>>
>>> Perennial first question: did you VACUUM ANALYZE?
>>
>> Can there, or could there, be a notion of rule based optimization of
>> queries in PostgreSQL? The not using index problem is probably the most
>> common and most misunderstood problem.
>
> There is a (sort of) rule based behavior in PostgreSQL,
> the down side of the current implementation is, that certain
> other commands than ANALYZE (e.g. create index) partly update
> optimizer statistics. This is bad behavior, since then only part
> of the statistics are accurate. Statistics always have to be seen
> in context to other table's and other index'es statistics.
>
> Thus, currently the rule based optimizer only works if you create
> the indexes on empty tables (before loading data), which obviously
> has downsides. Else you have no choice but to ANALYZE frequently.
>
> I have tried hard to fight for this pseudo rule based behavior,
> but was only partly successful in convincing core. My opinion is,
> that (unless runtime statistics are kept) no other command than
> ANALYZE should be allowed to touch optimizer relevant statistics
> (maybe unless explicitly told to).

Perhaps there could be an extension to ANALYZE, i.e. ANALYZE RULEBASED
tablename, that would restore or recalculate the state a table would be in
if all indexes were created from scratch?

The "not using index" behavior was very frustrating to understand. The stock
answer, "did you vacuum?", just isn't enough. There has to be some explanation
(in the FAQ or something) about the indexed key distribution in your data.
Postgres' statistics are pretty poor too: a relative few very populous entries
in a table will make it virtually impossible for the cost based optimizer
(CBO) to use an index.

At my site we have lots of tables that have many duplicate items in an index.
It is a music based site and has a huge number of "Various Artists" entries. No
matter what we do, there is NO way to get Postgres to use the index from the
query alone. We have over 20 thousand artists, but 5 "Various Artists" or
"Soundtrack" entries change the statistics so much that they exclude an index
scan. We have to run the system with sequential scan disabled. Running with seq
scan disabled defeats the usefulness of the CBO, because even when a table
scan is justified, it does an index scan.
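As an editorial aside: if I read the 7.2 statistics rework right, the
per-column distribution the planner sees can be inspected through the
pg_stats view, which makes this kind of skew visible. A sketch, with the
table and column names invented for the music-site example above:

```sql
-- Hypothetical names (albums/artist): the most-common-value list is
-- what the planner consults. A handful of hugely popular values like
-- "Various Artists" with large frequencies inflates the estimated row
-- count for an equality match and pushes the plan toward a seq scan.
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'albums' AND attname = 'artist';
```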

I have tilted at this windmill before and am a bit regretful at bringing it up
again, but it is important, very important. There needs to be a way to direct
the optimizer about how to optimize the query.

Using "set foo=bar" prior to a query is not acceptable. Web sites use
persistent connections to the databases, and since "set" cannot be restored,
you either override global settings for the whole session or have to code
the proper default setting into the web page. The result is that different
web processes behave differently depending on the order in which they execute
queries, or you have to have your DBA write web pages.

A syntax like:

select * from table where /* enable_seqscan = false */ key = 'value';

This would be great, in that you could tune the optimizer with settings that
apply only to the clause directly following the directive, without affecting
the state of the session or transaction. For instance:

select id from t1, t2 where /* enable_seqscan = false */ t1.key = 'value' and
t2.key = 'test' and t1.id = t2.id;

The "t1.key = 'value'" condition would be prohibited from using a
sequential scan, while "t2.key = 'test'" would use one if it made sense.

Is this possible?
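For comparison, the mechanism that is actually available in the 7.x
releases is the session-level toggle, and sketching it out shows exactly
the state-leakage problem mlw objects to (table names as in his example):

```sql
-- Session-level workaround: the setting persists for the whole
-- connection unless restored by hand afterwards, which is fragile on
-- pooled/persistent web connections.
SET enable_seqscan = off;
SELECT id FROM t1, t2
 WHERE t1.key = 'value' AND t2.key = 'test' AND t1.id = t2.id;
SET enable_seqscan = on;  -- easy to forget; also applied to BOTH clauses
```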




Re: [HACKERS] Index of a table is not used (in any case)

2001-10-24 Thread Peter Eisentraut

mlw writes:

> The not using index was very frustrating to understand. The stock answer,
> did you vacuum? just isn't enough. There has to be some explanation (in the
> FAQ or something) about the indexed key distribution in your data.

Most "not using index" questions seem to stem from a misunderstanding on
the part of users to the effect that if there is an index it must be used,
no matter what the query, which is of course far from reality.  Add to that
the (related) category of inquiries from people who think the index ought
to be used but don't have any actual timings to show, and you have a lot of
people who just need to be educated.

Of course the question "did you vacuum" (better, "did you analyze") is
annoying, just as the requirement to analyze is annoying in the first
place, but unless someone designs a better query planner it will have to
do.  The reason why we always ask that question first is that people
invariably have not analyzed.  A seasoned developer can often tell from
the EXPLAIN output whether ANALYZE has been done, but users cannot.
Perhaps something can be done in this area, but I'm not exactly sure what.

-- 
Peter Eisentraut   [EMAIL PROTECTED]   http://funkturm.homeip.net/~peter





Re: [HACKERS] Index of a table is not used (in any case)

2001-10-24 Thread Tom Lane

mlw [EMAIL PROTECTED] writes:
> ... Postgres' statistics are pretty poor too, a relative few very
> populous entries in a table will make it virtually impossible for the
> cost based optimizer (CBO) to use an index.

Have you looked at development sources lately?

regards, tom lane




Re: [HACKERS] Index of a table is not used (in any case)

2001-10-23 Thread mlw

Doug McNaught wrote:

> Reiner Dassing [EMAIL PROTECTED] writes:
>
>> Hello PostgreSQL Users!
>>
>> PostgreSQL V 7.1.1:
>>
>> I have defined a table and the necessary indices.
>> But the index is not used in every SELECT. (Therefore, the selects are
>> *very* slow, due to seq scan on
>> 20 million entries, which is a test setup up to now)
>
> Perennial first question: did you VACUUM ANALYZE?

Can there, or could there, be a notion of rule based optimization of
queries in PostgreSQL? The "not using index" problem is probably the most
common and most misunderstood problem.






Re: [HACKERS] Index of a table is not used (in any case)

2001-10-22 Thread Christopher Kings-Lynne

> Hello PostgreSQL Users!
>
> PostgreSQL V 7.1.1:

You should upgrade to 7.1.3 at some point...

> I have defined a table and the necessary indices.
> But the index is not used in every SELECT. (Therefore, the selects are
> *very* slow, due to seq scan on
> 20 million entries, which is a test setup up to now)
>
> The definitions can be seen in the annex.
>
> Does somebody know the reason and how to circumvent the seq scan?

Yes. You probably have not run 'VACUUM ANALYZE' on your large table.

> Is the order of index creation relevant? I.e., should I create the
> indices before inserting entries or the other way around?

If you are inserting a great many entries, insert the data first and then
create the indices - it will be much faster this way.
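[Editor's sketch of that load order; the table, column, and file names
are invented for illustration, loosely following the 'epoche' column
mentioned below:]

```sql
-- Bulk-load first, build indexes afterwards, then refresh statistics
-- so the planner sees numbers consistent with the new data and index.
CREATE TABLE measurements (epoche timestamp, value float8);
COPY measurements FROM '/tmp/measurements.copy';
CREATE INDEX measurements_epoche_idx ON measurements (epoche);
VACUUM ANALYZE measurements;
```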

> Should a hashing index be used? (I tried this, but I got the known error
> "Out of overflow pages")

Just do the default CREATE INDEX - btree should be fine... (probably)

> The table entry 'epoche' is used in two different indices. Should that
> be avoided?

It's not a problem, but just check your EXPLAIN output after the VACUUM to
check that you have them right.

Chris





Re: [HACKERS] Index of a table is not used (in any case)

2001-10-22 Thread Doug McNaught

Reiner Dassing [EMAIL PROTECTED] writes:

> Hello PostgreSQL Users!
>
> PostgreSQL V 7.1.1:
>
> I have defined a table and the necessary indices.
> But the index is not used in every SELECT. (Therefore, the selects are
> *very* slow, due to seq scan on
> 20 million entries, which is a test setup up to now)

Perennial first question: did you VACUUM ANALYZE?

-Doug
-- 
Let us cross over the river, and rest under the shade of the trees.
   --T. J. Jackson, 1863
