[HACKERS] Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

2011-02-17 Thread Jesper Krogh

Hi All.

The NULL element always surprises me in unpleasant ways; my brain simply
can't really follow the logic, so please let me know if this is one of
the cases where I should just put more effort into fixing my
understanding instead.


I have a table with a null_frac of 0.5, and I have verified that a where
clause that evaluates to NULL doesn't include rows in the result:

testdb=# select id from testtable where null @@ to_tsquery('testterm80');
 id
----
(0 rows)
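
For clarity, the predicate itself evaluates to NULL rather than false,
and WHERE treats NULL as not matching. A quick sanity check along those
lines (sketched by hand, not copied from the original session):

testdb=# select (null::tsvector @@ to_tsquery('testterm80')) is null;
 ?column?
----------
 t
(1 row)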

Then I'd expect the null_frac to be taken into account when computing the
estimates for the query:

testdb=# explain select id from testtable where fts @@ to_tsquery('testterm80');

                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..1985.03 rows=1966 width=4)
   Filter: (fts @@ to_tsquery('testterm80'::text))
(2 rows)

Whereas it actually does take it into account if I explicitly add an
"fts is not null" clause to the query:


testdb=# explain select id from testtable where fts @@ to_tsquery('testterm80') and fts is not null;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on testtable  (cost=130.34..1735.19 rows=983 width=4)
   Recheck Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS NOT NULL))
   ->  Bitmap Index Scan on testtable_fts_idx  (cost=0.00..130.09 rows=983 width=0)
         Index Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS NOT NULL))
(4 rows)

When something that evaluates to NULL isn't included in the result,
shouldn't the query planner take the null_frac into account when
computing the estimate?
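
To spell out the arithmetic I'd expect (my own back-of-the-envelope
numbers, taken from the two plans above): the first plan estimates
10002 * 0.1966 = 1966 rows, as if every row could match, while scaling
by the non-null fraction gives

    10002 * 0.1966 * (1 - null_frac) = 1966 * 0.5 = 983 rows,

which is exactly what the plan with the explicit "fts is not null"
clause shows.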

Trying to do the same thing using integers and the < operator does seem
to take the null_frac into account.

The snippet below reproduces the dataset:


create table testtable (id serial primary key, document text, fts tsvector);
create index on testtable using gist(fts);
CREATE OR REPLACE FUNCTION public.filltable(rows integer)
 RETURNS boolean
 LANGUAGE plpgsql
AS $function$
DECLARE
    count integer;
BEGIN
    count := 0;
    LOOP
        EXIT WHEN count = rows;
        count := count + 1;
        insert into testtable(document, fts)
            select document, to_tsvector('english', document)
            from (select string_agg(concat, ' ') as document
                  from (select concat('testterm' ||
                                      generate_series(1, floor(random()*100)::integer))) as foo
                 ) as bar;
    END LOOP;
    RETURN TRUE;
END;
$function$;

select filltable(10002);
testdb=# update testtable set fts = null where id % 2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 1835 of 1835 pages, containing 10002 live rows and 5001 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# select null_frac from pg_stats where attname = 'fts';
 null_frac
-----------
       0.5
(1 row)

... trying with integers:

testdb=# ALTER TABLE testtable add column testint integer;
ALTER TABLE
testdb=# update testtable set testint = floor(random()*100);
UPDATE 10002
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 2186 of 2186 pages, containing 10002 live rows and 10002 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# update testtable set testint = null where id % 2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 2282 of 2282 pages, containing 10002 live rows and 13335 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# explain select id from testtable where testint = 50;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=64 width=4)
   Filter: (testint = 50)
(2 rows)

testdb=# explain select id from testtable where testint = 1;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=48 width=4)
   Filter: (testint = 1)
(2 rows)

testdb=# explain select id from testtable where testint < 50;
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=2470 width=4)
   Filter: (testint < 50)
(2 rows)
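
These estimates line up with the null_frac being applied; roughly (my
own arithmetic, assuming testint is close to uniform over 0..99):

    testint = 50:  10002 * (1/100)  * (1 - 0.5) =~   50  (planner: 64)
    testint < 50:  10002 * (50/100) * (1 - 0.5) =~ 2500  (planner: 2470)

The per-value differences (64 for testint = 50 vs. 48 for testint = 1)
come from the MCV statistics, but both sit near the null-adjusted guess.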


(found on 8.4 and reproduced on git-head)

The attached patch tries to align the behaviour.

Thanks.

--
Jesper
From 4b4be27864f683a9b3464b86d6cfa567f8ab6bd2 Mon Sep 17 00:00:00 2001
From: Jesper Krogh <j...@novozymes.com>
Date: Thu, 17 Feb 2011 22:21:52 +0100
Subject: [PATCH] Take null_frac into account in calculating selectivity for the @@ operator. This makes the behaviour align with what the < operator does for integers.

---
 src/backend/tsearch/ts_selfuncs.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c

Re: [HACKERS] Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

2011-02-17 Thread Tom Lane
Jesper Krogh <jes...@krogh.cc> writes:
> When something that evaluates to NULL isn't included in the result,
> shouldn't the query planner take the null_frac into account when
> computing the estimate?

The proposed patch seems wrong to me: if we're estimating on the basis
of most-common-value fractions, the null_frac is already accounted for,
because it's not part of the MCV selectivity fractions.  IOW, aren't you
double-counting the null fraction?
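
For a scalar column you can see that directly in pg_stats: the MCV
fractions are fractions of *all* rows, nulls included, so together with
null_frac they can sum to at most 1. A quick check along those lines
(a sketch, not output from your session):

testdb=# select null_frac,
                (select sum(f) from unnest(most_common_freqs) f) as mcv_total
         from pg_stats
         where tablename = 'testtable' and attname = 'testint';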

regards, tom lane



Re: [HACKERS] Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

2011-02-17 Thread Jesper Krogh

On 2011-02-17 23:20, Tom Lane wrote:
> Jesper Krogh <jes...@krogh.cc> writes:
>> When something that evaluates to NULL isn't included in the result,
>> shouldn't the query planner take the null_frac into account when
>> computing the estimate?
>
> The proposed patch seems wrong to me: if we're estimating on the basis
> of most-common-value fractions, the null_frac is already accounted for,
> because it's not part of the MCV selectivity fractions.  IOW, aren't you
> double-counting the null fraction?

It might be the wrong place to fix, but here it seems like we're only
counting MCE-freqs based on non-null elements:

http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/tsearch/ts_typanalyze.c;h=2654d644579fd1959282d83919474f42540ca703;hb=HEAD#l396

And the test data confirms the behaviour.
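
One way to see it directly (on releases new enough that pg_stats
exposes element statistics; treat this as a sketch):

testdb=# select null_frac, most_common_elem_freqs
         from pg_stats
         where tablename = 'testtable' and attname = 'fts';

The frequency listed for a term like testterm1, which occurs in nearly
every non-null row, comes out close to 1.0 rather than close to 0.5,
i.e. the element frequencies are relative to the non-null rows only.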

--
Jesper




Re: [HACKERS] Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

2011-02-17 Thread Tom Lane
Jesper Krogh <jes...@krogh.cc> writes:
> On 2011-02-17 23:20, Tom Lane wrote:
>> The proposed patch seems wrong to me: if we're estimating on the basis
>> of most-common-value fractions, the null_frac is already accounted for,
>> because it's not part of the MCV selectivity fractions.  IOW, aren't you
>> double-counting the null fraction?

> It might be the wrong place to fix, but here it seems like we're only
> counting MCE-freqs based on non-null elements:
> http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/tsearch/ts_typanalyze.c;h=2654d644579fd1959282d83919474f42540ca703;hb=HEAD#l396

Hmm, you're right, and the specification in pg_statistic.h neglects to
say that.  This does need work.
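
(Roughly, since the MCE frequencies are computed over non-null rows
only, the @@ estimator presumably needs to end up scaling its result as

    selectivity = selec_from_MCE * (1 - null_frac)

so that it matches what the scalar estimators effectively compute.)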

regards, tom lane



Re: [HACKERS] Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

2011-02-17 Thread Tom Lane
Jesper Krogh <jes...@krogh.cc> writes:
> The attached patch tries to align the behaviour.

Applied with a bit of editorialization.

regards, tom lane
