Re: [PERFORM] multi-column index

2005-03-18 Thread Manfred Koizar
On Thu, 17 Mar 2005 23:48:30 -0800, Ron Mayer
[EMAIL PROTECTED] wrote:
Would this also help estimates in the case where values in a table
are tightly clustered, though not in strictly ascending or descending
order?

No, I was just expanding the existing notion of correlation from single
columns to index tuples.

For example, address data has many fields that are related
to each other (postal codes, cities, states/provinces).

This looks like a case for cross-column statistics, though you might not
have meant it as such.  I guess what you're talking about can also be
described with a single column.  In a list like

  3 3 ... 3 1 1 ... 1 7 7 ... 7 4 4 ... 4 ...

equal items are clustered together but the values are not correlated
to their positions.  This would require a whole new column
characteristic, something like the probability that we find the same
value in adjacent heap tuples, or the number of different values we can
expect on one heap page.  The latter might even be easy to compute
during ANALYSE.

Servus
 Manfred

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] multi-column index

2005-03-18 Thread Manfred Koizar
On Thu, 17 Mar 2005 13:15:32 -0500, Tom Lane [EMAIL PROTECTED] wrote:
I am coming around to the view that we really do need to calculate
index-specific correlation numbers,

Correlation is a first step.  We might also want distribution
information like number of distinct index tuples and histograms.

Now, as to the actual mechanics of getting the numbers: the above link
seems to imply reading the whole index in index order.

That turned out to be surprisingly easy (no need to look at data values,
no operator lookup, etc.) to implement as a proof of concept.  As it's
good enough for my use cases I never bothered to change it.

  Which is a
hugely expensive proposition for a big index,

Just a thought:  Could the gathering of the sample be integrated into
the bulk delete phase of VACUUM?  (I know, ANALYSE is not always
performed as an option to VACUUM, and VACUUM might not even have to
delete any index tuples.)

  We need a way
to get the number from a small sample of pages.

I had better (or at least different) ideas at that time, like walking
down the tree, but somehow lost impetus :-(

The idea I was toying with was to recalculate the index keys for the
sample rows that ANALYZE already acquires, and then compare/sort
those.

This seems to be the approach that perfectly fits into what we have now.

  This is moderately expensive CPU-wise though, and it's also not
clear what compare/sort means for non-btree indexes.

Nothing.  We'd need some notion of clusteredness instead of
correlation.  C.f. my answer to Ron in this thread.

BTW, the more I think about it, the more I come to the conclusion that
when the planner starts to account for clusteredness, random page cost
has to be raised.

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] multi-column index

2005-03-17 Thread Manfred Koizar
On Wed, 16 Mar 2005 22:19:13 -0500, Tom Lane [EMAIL PROTECTED] wrote:
calculate the correlation explicitly for each index

May be it's time to revisit an old proposal that has failed to catch
anybody's attention during the 7.4 beta period:
http://archives.postgresql.org/pgsql-hackers/2003-08/msg00937.php

I'm not sure I'd store index correlation in a separate table today.
You've invented something better for functional index statistics, AFAIR.

Servus
 Manfred

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] multi-column index

2005-03-17 Thread Christopher Kings-Lynne
May be it's time to revisit an old proposal that has failed to catch
anybody's attention during the 7.4 beta period:
http://archives.postgresql.org/pgsql-hackers/2003-08/msg00937.php
I'm not sure I'd store index correlation in a separate table today.
You've invented something better for functional index statistics, AFAIR.
Make it deal with cross-table fk correlations as well :)
Chris
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faq


Re: [PERFORM] multi-column index

2005-03-17 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
 On Wed, 16 Mar 2005 22:19:13 -0500, Tom Lane [EMAIL PROTECTED] wrote:
 calculate the correlation explicitly for each index

 May be it's time to revisit an old proposal that has failed to catch
 anybody's attention during the 7.4 beta period:
 http://archives.postgresql.org/pgsql-hackers/2003-08/msg00937.php

 I'm not sure I'd store index correlation in a separate table today.
 You've invented something better for functional index statistics, AFAIR.

Well, the original motivation for calculating correlations on columns
was that historically, you didn't need to re-ANALYZE after creating an
index: the stats on the base table were already in place.  So the idea
was to have the correlations already available whether or not the index
existed.  This works fine for plain indexes on single columns ;-).  We
didn't realize (or at least I didn't) how poorly the per-column stats
apply to multi-column indexes.

I am coming around to the view that we really do need to calculate
index-specific correlation numbers, and that probably does need a
special table ... or maybe better, add a column to pg_index.  The column
in pg_statistic is useless and should be removed, because there isn't
any need for per-column correlation.

Now, as to the actual mechanics of getting the numbers: the above link
seems to imply reading the whole index in index order.  Which is a
hugely expensive proposition for a big index, especially one that's
grown rather than been built recently --- the physical and logical
orderings of the index will be different.  (Hm, maybe we need a stat
about the extent of disorder within the index itself?)  We need a way
to get the number from a small sample of pages.

The idea I was toying with was to recalculate the index keys for the
sample rows that ANALYZE already acquires, and then compare/sort
those.  This is moderately expensive CPU-wise though, and it's also not
clear what compare/sort means for non-btree indexes.

If we could get a correlation estimate by probing only a small fraction
of the index pages, that would work, but in a disordered index I'm not
sure how you figure out what you're looking at.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[PERFORM] multi-column index

2005-03-16 Thread Daniel Crisan
Hello.
I have a problem concerning multi-column indexes.
I have a table containing some 250k lines.
Table public.descriptionprodftdiclnk
  Column|  Type   | Modifiers
-+-+---
idword  | integer | not null
idqualifier | integer | not null
Indexes:
   descriptionprodftdiclnk_pkey primary key, btree (idword, idqualifier)
   ix_descriptionprodftdiclnk_idqualif btree (idqualifier)

When analyzing a simple query on the idword column the query planner 
displays:

explain analyze select * from descriptionprodftdiclnk where idword=44;
 QUERY PLAN
---
Seq Scan on descriptionprodftdiclnk  (cost=0.00..4788.14 rows=44388 
width=8) (actual time=87.582..168.041 rows=43792 loops=1)
  Filter: (idword = 44)
Total runtime: 195.339 ms
(3 rows)

I don't understand why the query planner would not use the default 
created multi-column index
on the primary key. According to the Postgres online documentation it 
should. By setting the
enable_seqscan parameter to off, i can force the planner to use the index:

explain analyze select * from descriptionprodftdiclnk where idword=44;
   
QUERY PLAN
---
Index Scan using descriptionprodftdiclnk_pkey on 
descriptionprodftdiclnk  (cost=0.00..36720.39 rows=44388 width=8) 
(actual time=0.205..73.489 rows=43792 loops=1)
  Index Cond: (idword = 44)
Total runtime: 100.564 ms
(3 rows)


On the other hand, by defining a new index on the idword column (and 
enable_seqscan set to on),
the query uses the index:

create index ix_tempIndex on descriptionprodftdiclnk(idword);
CREATE INDEX
explain analyze select * from descriptionprodftdiclnk where idword=44;
  QUERY 
PLAN
-
Index Scan using ix_tempindex on descriptionprodftdiclnk  
(cost=0.00..916.24 rows=44388 width=8) (actual time=0.021..79.879 
rows=43792 loops=1)
  Index Cond: (idword = 44)
Total runtime: 107.081 ms
(3 rows)

Could someone provide an explanation for the planner's behaviour?
Thanks for your help,
Daniel
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] multi-column index

2005-03-16 Thread Josh Berkus
Daniel,

 Table public.descriptionprodftdiclnk

What is this, German?  ;-)

 explain analyze select * from descriptionprodftdiclnk where idword=44;
   QUERY PLAN
 ---
 Seq Scan on
 descriptionprodftdiclnk  (cost=0.00..4788.14 rows=44388 width=8) (actual
 time=87.582..168.041 rows=43792 loops=1)
Filter: (idword = 44)
  Total runtime: 195.339 ms
 (3 rows)

 explain analyze select * from descriptionprodftdiclnk where idword=44;

 QUERY PLAN
 ---

 Index Scan using descriptionprodftdiclnk_pkey on
 descriptionprodftdiclnk  (cost=0.00..36720.39 rows=44388 width=8)
 (actual time=0.205..73.489 rows=43792 loops=1)
Index Cond: (idword = 44)
  Total runtime: 100.564 ms
 (3 rows)

 create index ix_tempIndex on descriptionprodftdiclnk(idword);
 CREATE INDEX
 explain analyze select * from descriptionprodftdiclnk where idword=44;
QUERY
 PLAN
 ---
-- Index
 Scan using ix_tempindex on descriptionprodftdiclnk
 (cost=0.00..916.24 rows=44388 width=8) (actual time=0.021..79.879
 rows=43792 loops=1)
Index Cond: (idword = 44)
  Total runtime: 107.081 ms
 (3 rows)

 Could someone provide an explanation for the planner's behaviour?

Pretty simple, really.  Look at the cost calculations for the index scan for 
the multi-column index.PostgreSQL believes that:
The cost of a seq scan is 4788.14
The cost of an 2-column index scan is 36720.39
The cost of a 1-column index scan is 916.24

Assuming that you ran each of these queries multiple times to eliminate 
caching as a factor, the issue is that the cost calculations are wrong.   We 
give you a number of GUC variables to change that:
effective_cache_size
random_page_cost
cpu_tuple_cost
etc.

See the RUNTIME-CONFIGURATION docs for more details.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] multi-column index

2005-03-16 Thread Tom Lane
David Brown [EMAIL PROTECTED] writes:
 Actually, I'm surprised the planner came up with such a low cost for the 
 single column index, unless ... perhaps correlation statistics aren't 
 used when determining costs for multi-column indexes?

The correlation calculation for multi-column indexes is pretty whacked
out pre-8.0.  I don't think it's that great in 8.0 either --- we really
need to make ANALYZE calculate the correlation explicitly for each
index, probably, rather than trying to use per-column correlations.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match