Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 4:17 PM Tom Lane  wrote:

> Jeff Janes  writes:
> > In order for bloom (or any other users of CREATE ACCESS METHOD, if there
> > are any) to have a fighting chance to do better, I think many of
> selfuncs.c
> > currently private functions would have to be declared in some header
> file,
> > perhaps utils/selfuncs.h.  But that then requires a cascade of other
> > inclusions.  Perhaps that is why it was not done.
>
> I'm just in the midst of refactoring that stuff, so if you have
> suggestions, let's hear 'em.
>

The goal would be that I can copy the entire definition of
genericcostestimate into blcost.c, change the function's name, and get it
to compile.  I don't know the correct way accomplish that.  Maybe
utils/selfuncs.h can be expanded to work, or if there should be a new
header file like "utils/index_am_cost.h"

What I've done for now is:

#include "../../src/backend/utils/adt/selfuncs.c"

which I assume is not acceptable as a real solution.


It's possible that a good cost model for bloom is so far outside
> genericcostestimate's ideas that trying to use it is not a good
> idea anyway.
>

I think that might be the case.  I don't know what the right answer would
look like, but I think it will likely end up needing to access everything
that genericcostestimate currently needs to access.  Or if bloom doesn't,
some other extension implementing an ACCESS METHOD will.

Cheers,

Jeff


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Tom Lane
Jeff Janes  writes:
> In order for bloom (or any other users of CREATE ACCESS METHOD, if there
> are any) to have a fighting chance to do better, I think many of selfuncs.c
> currently private functions would have to be declared in some header file,
> perhaps utils/selfuncs.h.  But that then requires a cascade of other
> inclusions.  Perhaps that is why it was not done.

I'm just in the midst of refactoring that stuff, so if you have
suggestions, let's hear 'em.

It's possible that a good cost model for bloom is so far outside
genericcostestimate's ideas that trying to use it is not a good
idea anyway.

regards, tom lane



Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 11:58 AM Jeff Janes  wrote:

>
> On Tue, Feb 12, 2019 at 10:42 AM Tom Lane  wrote:
>
>>
>> Hm.  blcostestimate is using the default cost calculation, except for
>>
>> /* We have to visit all index tuples anyway */
>> costs.numIndexTuples = index->tuples;
>>
>> which essentially tells genericcostestimate to assume that every index
>> tuple will be visited.  This obviously is going to increase the cost
>> estimate; maybe there's something wrong with that?
>>
>
> I assumed (without investigating yet) that genericcostestimate is applying
> a cpu_operator_cost (or a few of them) on each index tuple, while the
> premise of a bloom index is that you do very fast bit-fiddling, not more
> expense SQL operators, for each tuple and then do the recheck only on what
> survives to the table tuple part.
>

In order for bloom (or any other users of CREATE ACCESS METHOD, if there
are any) to have a fighting chance to do better, I think many of selfuncs.c
currently private functions would have to be declared in some header file,
perhaps utils/selfuncs.h.  But that then requires a cascade of other
inclusions.  Perhaps that is why it was not done.

Cheers,

Jeff

>


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 10:42 AM Tom Lane  wrote:

> Thomas Kellerer  writes:
> > The bloom index is only used if either Seq Scan is disabled or if the
> random_page_cost is set to 1 (anything about 1 triggers a Seq Scan on my
> Windows laptop).
>
> Hm.  blcostestimate is using the default cost calculation, except for
>
> /* We have to visit all index tuples anyway */
> costs.numIndexTuples = index->tuples;
>
> which essentially tells genericcostestimate to assume that every index
> tuple will be visited.  This obviously is going to increase the cost
> estimate; maybe there's something wrong with that?
>

I assumed (without investigating yet) that genericcostestimate is applying
a cpu_operator_cost (or a few of them) on each index tuple, while the
premise of a bloom index is that you do very fast bit-fiddling, not more
expense SQL operators, for each tuple and then do the recheck only on what
survives to the table tuple part.

Cheers,

Jeff


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Tom Lane
Thomas Kellerer  writes:
> The bloom index is only used if either Seq Scan is disabled or if the 
> random_page_cost is set to 1 (anything about 1 triggers a Seq Scan on my 
> Windows laptop). 

Hm.  blcostestimate is using the default cost calculation, except for

/* We have to visit all index tuples anyway */
costs.numIndexTuples = index->tuples;

which essentially tells genericcostestimate to assume that every index
tuple will be visited.  This obviously is going to increase the cost
estimate; maybe there's something wrong with that?

I notice that the bloom regression test script is only testing queries
where it forces the choice of plan type, so it really doesn't prove
anything about whether the cost estimates are sane.

regards, tom lane



Bloom index cost model seems to be wrong

2019-02-12 Thread Thomas Kellerer
I stumbled upon this question:

https://dba.stackexchange.com/questions/229413

in a nutshell: the bloom index is not used with the example from the manual. 

The bloom index is only used if either Seq Scan is disabled or if the 
random_page_cost is set to 1 (anything about 1 triggers a Seq Scan on my 
Windows laptop). 

If parallel execution is disabled, then the bloom index is only used if the 
random_page_cost is lower than 4. 

This does not use the index:

  set random_page_cost = 4; 
  set max_parallel_workers_per_gather=0;
  explain (analyze, buffers) 
  select * 
  from tbloom 
  where i2 = 898732 
and i5 = 123451;

This uses the bloom index:

  set random_page_cost = 3.5; 
  set max_parallel_workers_per_gather=0;
  explain (analyze, buffers) 
  select * 
  from tbloom 
  where i2 = 898732 
and i5 = 123451;

And this uses the index also: 

  set random_page_cost = 1; 
  explain (analyze, buffers) 
  select * 
  from tbloom 
  where i2 = 898732 
and i5 = 123451;

This is the plan with when the index is used (either through "enable_seqscan = 
off" or "random_page_cost = 1")

Bitmap Heap Scan on tbloom  (cost=138436.69..138440.70 rows=1 width=24) (actual 
time=42.444..42.444 rows=0 loops=1)  
  Recheck Cond: ((i2 = 898732) AND (i5 = 123451))   
 
  Rows Removed by Index Recheck: 2400   
 
  Heap Blocks: exact=2365   
 
  Buffers: shared hit=21973 
 
  ->  Bitmap Index Scan on bloomidx  (cost=0.00..138436.69 rows=1 width=0) 
(actual time=40.756..40.756 rows=2400 loops=1)
Index Cond: ((i2 = 898732) AND (i5 = 123451))   
 
Buffers: shared hit=19608   
 
Planning Time: 0.075 ms 
 
Execution Time: 42.531 ms   
 

And this is the plan when everything left at default settings:

Seq Scan on tbloom  (cost=0.00..133695.80 rows=1 width=24) (actual 
time=1220.116..1220.116 rows=0 loops=1)
  Filter: ((i2 = 898732) AND (i5 = 123451)) 
  
  Rows Removed by Filter: 1000  
  
  Buffers: shared hit=4697 read=58998   
  
  I/O Timings: read=354.670 
  
Planning Time: 0.075 ms 
  
Execution Time: 1220.144 ms 
  

Can this be considered a bug in the cost model of the bloom index 
implementation? 
Or is it expected that this is only used if random access is really cheap? 

Thomas