Recent discussion of index cost estimation ("[HACKERS] Correlation in
cost_index()" ca. two weeks ago) has lead to the conclusion that the
column correlation calculated by VACUUM does not always help when we
want to find out how well index access order corresponds to physical
tuple position.  Most problems arise when dealing with multi-column or
functional indices.  Another case is when there are many equal values
in the first index column.

What we really want is not column correlation, but index correlation
which turns out to be surprisingly easy to calculate.  There's no need
to look up comparison functions and handle different datatypes;
there's no need to look at the key values at all:  Just read the index
items in index order, sort them by heap page, and compute the Spearman
Rho function.  This even works for non-btree indices.

Try it yourself:
 .  download http://www.pivot.at/pg/contrib_icorrel.tgz 
 .  unpack
 .  make
 .  make install
 .  psql
    . \i path/to/share/postgresql/contrib/icorrel.sql
    . SELECT icorrel('myindex'::regclass);

This should work with 7.3.x and 7.4Beta.


How could the planner make use of index correlation?  Here
(http://www.pivot.at/pg/22-IndexCorrel_74b1.diff) is an experimental
patch that introduces a new system table pg_indexstat.  If the GUC
variable enable_indexstat is set to on, genericcostestimate tries to
get index correlation from pg_indexstat;  if there is none,
btcostestimate falls back to the old method.

Compatibility:  Although there is a new catalog table, initdb is not
required.  A patched postmaster still works with an old cluster, even
without creating pg_catalog.pg_indexstat.  Before you can make use of
pg_indexstat you have to create it via a standalone backend:

    $ bin/postgres -D data -O template1
    
    backend> CREATE TABLE pg_catalog.pg_indexstat( \
                    istindex oid NOT NULL, \
                    istcorrel float4 NOT NULL) WITHOUT OIDS;
    backend> CREATE UNIQUE INDEX pg_indexstat_index_index \
                    ON pg_indexstat(istindex);

Repeat this for each database.


Usage example (using Sean's data):

psql testdb
testdb=# \d rucc
                 Table "public.rucc"
    Column     |           Type           | Modifiers
---------------+--------------------------+-----------
 user_id       | integer                  | not null
 category_id   | integer                  | not null
 img_bytes     | bigint                   | not null
 img_hits      | integer                  | not null
 html_bytes    | bigint                   | not null
 html_hits     | integer                  | not null
 unknown_bytes | bigint                   | not null
 unknown_hits  | integer                  | not null
 utc_date      | timestamp with time zone | not null
 time_interval | interval                 | not null
Indexes:
    "rucc_htmlbytes_idx" btree (html_bytes),
    "rucc_id_date_idx" btree (user_id, utc_date)

testdb=# SELECT 'rucc_id_date_idx'::regclass::oid;
   oid
---------
 1281422
(1 row)

testdb=# set enable_seqscan = off;
SET

testdb=# set enable_indexstat = on;
SET

testdb=# INSERT INTO pg_indexstat VALUES (1281422, 0.0001);
INSERT 0 1
testdb=# EXPLAIN SELECT * FROM rucc WHERE user_id < 1000;
                                      QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using rucc_id_date_idx on rucc  (cost=0.00..634342.50
rows=139802 width=64)
   Index Cond: (user_id < 1000)
(2 rows)

testdb=# UPDATE pg_indexstat SET istcorrel=0.1 WHERE istindex=1281422;
testdb=# EXPLAIN SELECT ...

istcorrel |      cost
----------+----------
   0.0001 | 634342.50
      0.1 | 514678.48
      0.2 | 407497.85
      0.5 | 161612.89
      0.9 |  10299.07
      1.0 |   3994.32

Actually the table is clustered on rucc_id_date_idx, so index
correlation is 1.0, but there is no way to know that, when we only
have the column correlations for user_id (1.0) and utc_date (0.59).
The current code guesses the index correlation to be 0.5 which gives a
cost estimate that is far too high.

For comparison:
seq scan estimated cost ~ 21000, actual ~ 11500,
index scan                       actual ~ 4000


If you are going to test this patch, please be aware that I created it
on top of another one of my experimental patches
(http://www.pivot.at/pg/16d-correlation_74b1.diff).  If you don't want
to apply this one, one hunk of the IndexCorrel patch will fail in
selfuncs.c.  Should be no problem to apply it manually.

And those who are still experimenting with 7.3.4 performance, can use
http://www.pivot.at/pg/16d-correlation_734.diff and
http://www.pivot.at/pg/22-IndexCorrel_74b1.diff.

ToDo:

.. Move get_index_correlation from contrib into the backend.

.. ANALYZE table  computes index correlation for all indexes.

.. New command  ANALYZE index?

.. System cache invalidation?
   syscache.c: reloidattr = Anum_pg_indexstat_istindex ?

.. Dependency?

.. Remove GUC variable enable_indexstat

.. Remove old method in btcostestimate()

.. Automatic catalog upgrade

Servus
 Manfred

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faqs/FAQ.html

Reply via email to