Re: PoC: using sampling to estimate joins / complex conditions

2022-03-21 Thread Tomas Vondra



On 3/22/22 00:35, Andres Freund wrote:
> Hi,
> 
> On 2022-01-21 01:06:37 +0100, Tomas Vondra wrote:
>> Yeah, I haven't updated some of the test output because some of those
>> changes are a bit wrong (and I think that's fine for a PoC patch). I
>> should have mentioned that in the message, though. Sorry about that.
> 
> Given that the patch hasn't been updated since January and that it's a PoC in
> the final CF, it seems like it should at least be moved to the next CF? Or
> perhaps returned?
> 
> I've just marked it as waiting-on-author for now - iirc that leads to fewer
> reruns by cfbot once it's failing...
> 

Either option works for me.

> 
>> 2) The correlated samples are currently built using a query, executed
>> through SPI in a loop. So given a "driving" sample of 30k rows, we do
>> 30k lookups - that'll take time, even if we do that just once and cache
>> the results.
> 
> Ugh, yea, that's going to increase overhead by at least a few factors.
> 
> 
> I'm sure there's room for some improvement, though - for example
> we don't need to fetch all the columns included in the statistics object,
> just the ones referenced by the clauses we're estimating. That could
> improve the chances of using index-only scans (IOS), etc.
> 
> Yea. Even just avoiding SPI / planner + executor seems likely to be a
> big win.
> 
> 
> It seems like one more of the cases where we really need logic to recognize
> "cheap" vs. "expensive" plans, so that we only do sampling when it's useful.
> I don't think that's solved just by having a declarative syntax.
> 

Right.

I was thinking about walking the first table, collecting all the values,
and then doing a single IN () query for the second table - a bit like a
custom join (which seems a bit terrifying, TBH).
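
Roughly something like this (just a sketch, reusing the f/d tables from the
original example in this thread; TABLESAMPLE merely stands in for however the
driving sample actually gets built):

  -- one batched lookup instead of one query per sampled row
  select d.d1, d.d2, d.d3, d.d4, d.d5
    from d
   where (d.d1, d.d2) in (select f.d1, f.d2
                            from f tablesample system (1));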

But even if we manage to make this much cheaper, there will still be
simple queries where it's going to be prohibitively expensive.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC: using sampling to estimate joins / complex conditions

2022-03-21 Thread Andres Freund
Hi,

On 2022-01-21 01:06:37 +0100, Tomas Vondra wrote:
> Yeah, I haven't updated some of the test output because some of those
> changes are a bit wrong (and I think that's fine for a PoC patch). I
> should have mentioned that in the message, though. Sorry about that.

Given that the patch hasn't been updated since January and that it's a PoC in
the final CF, it seems like it should at least be moved to the next CF? Or
perhaps returned?

I've just marked it as waiting-on-author for now - iirc that leads to fewer
reruns by cfbot once it's failing...


> 2) The correlated samples are currently built using a query, executed
> through SPI in a loop. So given a "driving" sample of 30k rows, we do
> 30k lookups - that'll take time, even if we do that just once and cache
> the results.

Ugh, yea, that's going to increase overhead by at least a few factors.


> I'm sure there's room for some improvement, though - for example
> we don't need to fetch all the columns included in the statistics object,
> just the ones referenced by the clauses we're estimating. That could
> improve the chances of using index-only scans (IOS), etc.

Yea. Even just avoiding SPI / planner + executor seems likely to be a
big win.


It seems like one more of the cases where we really need logic to recognize
"cheap" vs. "expensive" plans, so that we only do sampling when it's useful.
I don't think that's solved just by having a declarative syntax.


Greetings,

Andres Freund




Re: PoC: using sampling to estimate joins / complex conditions

2022-01-20 Thread Tomas Vondra
On 1/5/22 00:58, Andres Freund wrote:
> Hi,
> 
> On 2021-06-27 19:55:24 +0200, Tomas Vondra wrote:
>> estimating joins is one of the significant gaps related to extended
>> statistics, and I've been regularly asked about what we might do about
>> that. This is an early experimental patch that I think might help us
>> with improving this, possibly even in PG15.
> 
> The tests in this patch fail:
> https://cirrus-ci.com/task/5304621299138560
> https://api.cirrus-ci.com/v1/artifact/task/5304621299138560/regress_diffs/src/test/regress/regression.diffs
> 
> Looks like the regression test output hasn't been updated?
> 

Yeah, I haven't updated some of the test output because some of those
changes are a bit wrong (and I think that's fine for a PoC patch). I
should have mentioned that in the message, though. Sorry about that.

There are three types of failures:


1) Changes to deparsed statistics definition in \d command:

-"public.ctlt_all_a_b_stat" ON a, b FROM ctlt_all
+"public.ctlt_all_a_b_stat" (ndistinct, dependencies, mcv) ON a, b
FROM ctlt_all

This happens because there's a new kind "sample" that is not set by default
when creating new statistics, so the deparsing logic decides it has to list
the kinds explicitly. I've fixed this in the attached patch, but it was
mostly harmless, and I'm not sure this is how "sample" should behave.
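
In other words, the "+" line above deparses the object the same way as if the
existing kinds had been listed explicitly, e.g.:

  create statistics ctlt_all_a_b_stat (ndistinct, dependencies, mcv)
    on a, b from ctlt_all;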


2) Three GUC parameters that allow enabling/disabling sampling for
different parts of a query (scan, join, correlated join sampling). I
still consider those GUCs temporary, for experiments, but I've added
them to the expected output.


3) Changes in estimates for OR conditions - a couple of estimates get less
accurate, because OR clauses are handled as a single clause the first
time we pass them to statext_clauselist_selectivity(), so we combine the
results incorrectly. OR clauses may need some changes, because they're
causing issues in other patches too (e.g. in the "Var op Var" one).


I haven't done anything about (3) yet - it's a valid issue and needs to
be fixed (either by changing how we handle OR clauses, or maybe handling
samples and MCVs at the same time). Or maybe some other way. In any
case, there is more important stuff that needs fixing first.

The main issue is planning overhead - for the example in the first
message, with a simple query joining two tables, you'll see this:

   Planning Time: 603.442 ms
   Execution Time: 1071.735 ms

There are a couple of reasons why it takes this long:


1) We sample the same relation repeatedly - once as a "scan" and then
while estimating joins. And for this query we do that in three different
contexts:

- set_joinrel_size_estimates
- populate_joinrel_with_paths (twice)

I guess we'll get a different number of samples for different queries, but
it'll get worse for queries joining more tables, etc. It seems fairly
simple to cache the samples - for example in StatisticExtInfo (or maybe
somewhere else, to keep just one sample per relation, not per RTE).

Unfortunately, I think that caching only helps for "independent" samples, not
for correlated ones (which are just FK lookups driven by another sample, so
the result depends on what that other sample is). Which is a bummer, because
correlated samples are the more expensive ones :-(


2) The correlated samples are currently built using a query, executed
through SPI in a loop. So given a "driving" sample of 30k rows, we do
30k lookups - that'll take time, even if we do that just once and cache
the results.
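
Each of those lookups is essentially a parameterized probe into the other
table on the join key - in terms of the f/d example from the first message,
roughly this shape (hypothetical, just to illustrate the per-row cost):

  -- hypothetical shape of a single correlated lookup; the PoC issues one of
  -- these (via SPI) for every row of the driving sample, i.e. ~30k times
  prepare corr_lookup (int, int) as
    select d1, d2, d3, d4, d5 from d where d1 = $1 and d2 = $2;

  execute corr_lookup (1, 1);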

I'm sure there's room for some improvement, though - for example
we don't need to fetch all the columns included in the statistics object,
just the ones referenced by the clauses we're estimating. That could
improve the chances of using index-only scans (IOS), etc.

I wonder if there's a more efficient way to do this, in a batched manner
or something ... But even then it'll still be quite expensive.


The only other alternative I can think of is collecting samples during
ANALYZE, and storing them somewhere. That'll be difficult for correlated
samples, particularly for transitive cases (in a snowflake schema). But I
believe it's doable (at least for cases covered by FK constraints).

But I'm not sure where/how to store the samples. An obvious option would
be to serialize them into pg_statistic_ext_data, and I'll try doing that
(it's a bit like a huge MCV without any cutoff). But maybe there's a
better way that would not require constant serialization/deserialization
of many tuples.
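
For reference, the existing extended-statistics payloads already live in that
catalog, so a serialized sample would presumably become one more column there.
Something like this shows what gets built today:

  -- which statistics objects currently have an MCV list built
  select s.stxname, d.stxdmcv is not null as has_mcv
    from pg_statistic_ext s
    join pg_statistic_ext_data d on (d.stxoid = s.oid);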


Any ideas about these options?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index f7419b8f562..b25edbfa6e5 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -83,13 +83,14 @@ CreateStatistics(CreateStatsStmt *stmt)
 	Oid			relid;
 	ObjectAddress parentobject,
 				myself;
-	Datum		types[4];		/* one for each possible type of statistic */
+	Datum		types[5];		/* one for each 

Re: PoC: using sampling to estimate joins / complex conditions

2022-01-04 Thread Andres Freund
Hi,

On 2021-06-27 19:55:24 +0200, Tomas Vondra wrote:
> estimating joins is one of the significant gaps related to extended
> statistics, and I've been regularly asked about what we might do about
> that. This is an early experimental patch that I think might help us
> with improving this, possibly even in PG15.

The tests in this patch fail:
https://cirrus-ci.com/task/5304621299138560
https://api.cirrus-ci.com/v1/artifact/task/5304621299138560/regress_diffs/src/test/regress/regression.diffs

Looks like the regression test output hasn't been updated?

Greetings,

Andres Freund




Re: PoC: using sampling to estimate joins / complex conditions

2021-12-02 Thread Michael Paquier
On Sun, Jun 27, 2021 at 07:55:24PM +0200, Tomas Vondra wrote:
> estimating joins is one of the significant gaps related to extended
> statistics, and I've been regularly asked about what we might do about
> that. This is an early experimental patch that I think might help us
> with improving this, possibly even in PG15.

The patch does not apply, so a rebase would be in order.  I have
switched it to waiting on author for now, moving it to the next CF.
--
Michael




PoC: using sampling to estimate joins / complex conditions

2021-06-27 Thread Tomas Vondra
Hi,

estimating joins is one of the significant gaps related to extended
statistics, and I've been regularly asked about what we might do about
that. This is an early experimental patch that I think might help us
with improving this, possibly even in PG15.

Note: I do not claim this is exactly how it should be implemented, but
it's probably sufficient to demonstrate the pros/cons of various
alternative approaches, etc.

In short, the patch samples the tables and uses those samples to
estimate selectivity for scans and joins. The samples are collected
during planning, which may be quite expensive - random I/O for each
query, etc. It'd be possible to build them during ANALYZE, but that'd
require solving serialization, tweaking CREATE STATISTICS to handle
join queries, etc. I decided to keep the PoC simple.

It still uses CREATE STATISTICS with a new "sample" kind, instructing
the optimizer to use sampling when estimating clauses on the attributes.

A little example demonstrating what the patch does:

  create table t (a int, b int, c int);

  insert into t select mod(i,10), mod(i,20), mod(i,40)
from generate_series(1,1000) s(i);

  analyze t;

  -- estimate without any statistics / sampling
  explain analyze select * from t where a = 0 and b = 0 and c = 0;

 QUERY PLAN
  ---
   Seq Scan on t  (cost=0.00..229055.00 rows=1361 width=12)
  (actual time=0.025..761.571 rows=25 loops=1)
 Filter: ((a = 0) AND (b = 0) AND (c = 0))
 Rows Removed by Filter: 975
   Planning Time: 0.471 ms
   Execution Time: 901.182 ms
  (5 rows)

  -- enable sampling on those columns
  create statistics s (sample) on a, b, c from t;

  explain analyze select * from t where a = 0 and b = 0 and c = 0;

 QUERY PLAN
  ---
   Seq Scan on t  (cost=0.00..229055.00 rows=250390 width=12)
  (actual time=0.307..717.937 rows=25 loops=1)
 Filter: ((a = 0) AND (b = 0) AND (c = 0))
 Rows Removed by Filter: 975
   Planning Time: 194.528 ms
   Execution Time: 851.832 ms
  (5 rows)

Of course, in this case an MCV would work well too, because there are
very few combinations in (a,b,c) - but a sample would work even when that's
not the case, and it has various other benefits (it can estimate almost any
expression, while an MCV supports only a subset, etc.).
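
For instance, a sample can estimate a condition like the following simply by
evaluating it on the sampled rows (a made-up clause on the table above):

  explain select * from t where (a + b) % 3 = 0 and c < 10;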

Now, let's look at a join between a fact and a dimension table:

  create table f (d1 int, d2 int, f1 int, f2 int, f3 int);

  create table d (d1 int, d2 int, d3 int, d4 int, d5 int,
  primary key (d1, d2));

  insert into d select i, i, mod(i,100), mod(i,100), mod(i,100)
from generate_series(0,999) s(i);

  insert into f select mod(i,1000), mod(i,1000), mod(i,100), mod(i,100),
   mod(i,100) from generate_series(1,100) s(i);

  analyze f, d;

  explain analyze select * from f join d using (d1,d2)
where f1 < 50 and f2 < 50 and d3 < 50 and d4 < 50;

QUERY PLAN
  --
   Hash Join  (cost=25.75..22717.01 rows=63 width=32)
  (actual time=3.197..861.899 rows=50 loops=1)
 Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
 ->  Seq Scan on f  (cost=0.00..21370.00 rows=251669 width=20)
(actual time=0.033..315.401 rows=50 loops=1)
   Filter: ((f1 < 50) AND (f2 < 50))
   Rows Removed by Filter: 50
 ->  Hash  (cost=22.00..22.00 rows=250 width=20)
   (actual time=3.139..3.141 rows=500 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 34kB
   ->  Seq Scan on d  (cost=0.00..22.00 rows=250 width=20)
(actual time=0.018..1.706 rows=500 loops=1)
 Filter: ((d3 < 50) AND (d4 < 50))
 Rows Removed by Filter: 500
   Planning Time: 0.806 ms
   Execution Time: 1099.229 ms
  (12 rows)

So, not great - an underestimate like this is likely to lead to
inefficient plans. And now with the samples enabled on both sides:

  create statistics s1 (sample) on d1, d2, f1, f2, f3 from f;
  create statistics s2 (sample) on d1, d2, d3, d4, d5 from d;

QUERY PLAN
  --
   Hash Join  (cost=29.50..24057.25 rows=503170 width=32)
  (actual time=0.630..837.483 rows=50 loops=1)
 Hash Cond: ((f.d1 = d.d1) AND (f.d2 = d.d2))
 ->  Seq Scan on f  (cost=0.00..21370.00 rows=503879 width=20)
(actual time=0.008..301.584 rows=50 loops=1)
   Filter: ((f1 < 50) AND (f2 < 50))
   Rows Removed by Filter: 50
 ->  Hash  (cost=22.00..22.00 rows=500 width=20)
   (actual time=0.616..0.618 rows=500 loops=1)
   Buckets: 1024  Batches: 1  Memory