Re: [PERFORM] osdl-dbt3 run results - puzzled by the execution plans

2003-09-19 Thread Manfred Koizar
On Thu, 18 Sep 2003 15:36:50 -0700, Jenny Zhang <[EMAIL PROTECTED]>
wrote:
>We thought the large effective_cache_size should lead us to better 
>plans. But we found the opposite. 

The common structure of your query plans is:

 Sort
   Sort Key: sum((partsupp.ps_supplycost * partsupp.ps_availqty))
   InitPlan
     ->  Aggregate
           ->  SubPlan
   ->  Aggregate
         Filter: (sum((ps_supplycost * ps_availqty)) > $0)
         ->  Group
               ->  Sort
                     Sort Key: partsupp.ps_partkey
                     ->  SubPlan (same as above)

where the SubPlan is

 ->  Merge Join  (cost=519.60..99880.05 rows=32068 width=65)
                 (actual time=114.78..17435.28 rows=30400 loops=1)
                 ctr=5.73
       Merge Cond: ("outer".ps_suppkey = "inner".s_suppkey)
       ->  Index Scan using i_ps_suppkey on partsupp
                 (cost=0.00..96953.31 rows=801712 width=34)
                 (actual time=0.42..14008.92 rows=799361 loops=1)
                 ctr=6.92
       ->  Sort  (cost=519.60..520.60 rows=400 width=31)
                 (actual time=106.88..143.49 rows=30321 loops=1)
                 ctr=3.63
             Sort Key: supplier.s_suppkey
             ->  SubSubPlan

for large effective_cache_size and

 ->  Nested Loop  (cost=0.00..130168.30 rows=32068 width=65)
                  (actual time=0.56..1374.41 rows=30400 loops=1)
                  ctr=94.71
       ->  SubSubPlan
       ->  Index Scan using i_ps_suppkey on partsupp
                 (cost=0.00..323.16 rows=80 width=34)
                 (actual time=0.16..2.98 rows=80 loops=380)
                 ctr=108.44
             Index Cond: (partsupp.ps_suppkey = "outer".s_suppkey)

for small effective_cache_size.  Both subplans have an almost
identical subsubplan:

 ->  Nested Loop  (cost=0.00..502.31 rows=400 width=31)
                  (actual time=0.23..110.51 rows=380 loops=1)
                  ctr=4.55
       Join Filter: ("inner".s_nationkey = "outer".n_nationkey)
       ->  Seq Scan on nation  (cost=0.00..1.31 rows=1 width=10)
                 (actual time=0.08..0.14 rows=1 loops=1)
                 ctr=9.36
             Filter: (n_name = 'ETHIOPIA'::bpchar)
       ->  Seq Scan on supplier  (cost=0.00..376.00 rows=1 width=21)
                 (actual time=0.10..70.72 rows=1 loops=1)
                 ctr=5.32

I have added the ctr (cost:time ratio, i.e. estimated total cost divided by
actual total time in ms) for each plan node.  These values are mostly
between 5 and 10, with two notable exceptions (worked numbers below):

1)  ->  Sort  (cost=519.60..520.60 rows=400 width=31)
              (actual time=106.88..143.49 rows=30321 loops=1)
              ctr=3.63

It has already been noticed by Matt Clark that this is the only plan
node where the row count estimate looks wrong.  However, I don't
believe this has much influence on the total cost of the plan, because
the ctr is not far from the usual range, and if it were a bit higher it
would only add a few hundred cost units to a branch costing almost
100000 cost units.  BTW I vaguely remember that there is something
strange with the way actual rows are counted inside a merge join.
Look at the branch below this plan node:  It shows an actual row count
of 380.

2)  ->  Index Scan using i_ps_suppkey on partsupp
              (cost=0.00..323.16 rows=80 width=34)
              (actual time=0.16..2.98 rows=80 loops=380)
              ctr=108.44

Here we have the only plan node where loops > 1, and it is the only
one where the ctr is far off.  The planner computes the cost for one
loop and multiplies it by the number of loops (which it estimates
quite accurately to be 400), thus getting a total cost of ca. 130000.
We have no reason to believe that the single loop cost is very far
from reality (for a *single* index scan), but the planner does not
account for additional index scans hitting pages in the cache that
have been brought in by preceding scans.  This is a known problem, Tom
has mentioned it several times, IIRC.
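
Spelling out the arithmetic with numbers taken from the plan nodes above
(ctr = estimated total cost / actual total time in ms):

  Merge Join, large effective_cache_size:
      99880.05 / 17435.28 ms                    =>  ctr =   5.73

  Inner index scan, small effective_cache_size:
      per loop:   323.16 / 2.98 ms              =>  ctr = 108.44
      estimated:  323.16 * 400 loops  ~  129000 cost units (ca. 130000 total)
      actual:       2.98 * 380 loops  ~    1130 ms

So the planner charges each of the ~400 inner index scans as if its pages
had to come from disk, while after the first few loops most of them are in
fact already cached.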

Now I'm very interested in getting a better understanding of this
problem, so could you please report the results of

. \d i_ps_suppkey

. VACUUM VERBOSE ANALYSE partsupp;
  VACUUM VERBOSE ANALYSE supplier;

. SELECT attname, null_frac, avg_width, n_distinct, correlation
    FROM pg_stats
   WHERE tablename = 'partsupp' AND attname IN ('ps_suppkey', ...);

  Please insert other interesting column names for ..., especially
  those contained in i_ps_suppkey, if any.

. SELECT relname, relpages, reltuples
    FROM pg_class
   WHERE relname IN ('partsupp', 'supplier', ...);
                                             ^^^
  Add relevant index names here.

. EXPLAIN ANALYSE
  SELECT ps_partkey, ps_supplycost, ps_availqty
    FROM partsupp, supplier
   WHERE ps_suppkey = s_suppkey AND s_nationkey = '';

  The idea is to eliminate parts of the plan that are always the same.
  Omitting nation is possibly too much of a simplification.  In this case
  pleas

Re: [PERFORM] osdl-dbt3 run results - puzzled by the execution plans

2003-09-19 Thread Greg Stark

Tom Lane <[EMAIL PROTECTED]> writes:

> I think this is a pipe dream.  Variation in where the data gets laid
> down on your disk drive would alone create more than that kind of delta.
> I'm frankly amazed you could get repeatability within 2-3%.

I think the reason he gets good repeatability is that he's talking about the
aggregate results for a whole test run, not individual queries.  In theory
you could just run the whole test multiple times: the more times you run it,
the lower the variation in the total run time will be.
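
(As a rough rule of thumb, the standard error of the mean over n independent
runs scales like sigma/sqrt(n), so with a per-run standard deviation around
3% it would take on the order of nine complete runs to push the uncertainty
of the mean down toward 1%.)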

Actually, the variation in run time is also a useful statistic, both for
postgres and the kernel. It might be useful to do multiple complete runs and
keep track of the average standard deviation of the time required for each
step.
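
A minimal sketch of that bookkeeping, assuming the per-step timings of each
complete run are loaded into a hypothetical table
run_times(run_id, step, elapsed_ms):

  -- one row per (run, step) with the measured elapsed time
  -- CREATE TABLE run_times (run_id integer, step text, elapsed_ms numeric);

  SELECT step,
         count(*)           AS runs,
         avg(elapsed_ms)    AS avg_ms,
         stddev(elapsed_ms) AS stddev_ms
    FROM run_times
   GROUP BY step
   ORDER BY step;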

A higher standard deviation means individual queries cannot be relied on not
to take inordinately long, which can be a problem for some workloads.  For
the kernel it could mean latency issues, or that the swapper or buffer cache
is being overly aggressive.

-- 
greg




Re: [PERFORM] osdl-dbt3 run results - puzzled by the execution plans

2003-09-18 Thread Tom Lane
Jenny Zhang <[EMAIL PROTECTED]> writes:
> ...  It seems to me that small 
> effective_cache_size favors the choice of nested loop joins (NLJ) 
> while the big effective_cache_size is in favor of merge joins (MJ).  

No, I wouldn't think that, because a nestloop plan will involve repeated
fetches of the same tuples whereas a merge join doesn't (at least not
when it sorts its inner input, as this plan does).  Larger cache
improves the odds of a repeated fetch not having to do I/O.  In practice
a larger cache area would also have some effects on access costs for the
sort's temp file, but I don't think the planner's cost model for sorting
takes that into account.

As Matt Clark points out nearby, the real question is whether these
planner estimates have anything to do with reality.  EXPLAIN ANALYZE
results would be far more interesting than plain EXPLAIN.

> However, within the same run set consisting of 6 runs, we see 2-3%
> standard deviation for the run metrics associated with the multiple
> stream part of the test (as opposed to the single stream part).

 Och, laddie, we useta *dream* of 2-3% variation 

> We would like to reduce the variation to be less than 1% so that a 
> 2% change between two different kernels would be significant. 

I think this is a pipe dream.  Variation in where the data gets laid
down on your disk drive would alone create more than that kind of delta.
I'm frankly amazed you could get repeatability within 2-3%.

regards, tom lane



Re: [PERFORM] osdl-dbt3 run results - puzzled by the execution plans

2003-09-18 Thread Matt Clark
> We thought the large effective_cache_size should lead us to better
> plans. But we found the opposite.

Maybe it's inappropriate for little old me to jump in here, but the plan
isn't usually that important compared to the actual runtime.  The links you
give show the output of 'explain' but not 'explain analyze', so it's not
clear which plan is actually _faster_.

If you really do have only 8MB of FS cache, then either plan will run
slowly.  If you really do have 5GB of FS cache then either plan will run a
lot faster.  Why would you deliberately give the planner false information
about this?

PG obviously thinks plan 1 is 'better' when pages have to be fetched from
disk, and plan 2 is 'better' when they don't.  Which is really better
depends on whether those pages do have to be fetched from disk or not, and
PG can only know what you tell it about that, so changing ECS without
actually removing the RAM from the system seems a little pointless to me...
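
For what it's worth, one way to see which plan really wins on the same
hardware is to toggle the setting in a session and compare EXPLAIN ANALYZE
output.  A minimal sketch, using a simplified query against partsupp and
supplier (the nationkey value is a placeholder, and effective_cache_size is
counted in 8 kB pages on a default build):

  -- pretend ~8 MB of FS cache (1024 pages of 8 kB)
  SET effective_cache_size = 1024;
  EXPLAIN ANALYZE
  SELECT ps_partkey, ps_supplycost, ps_availqty
    FROM partsupp, supplier
   WHERE ps_suppkey = s_suppkey
     AND s_nationkey = 0;   -- substitute the nationkey of ETHIOPIA

  -- pretend ~5 GB of FS cache (655360 pages of 8 kB)
  SET effective_cache_size = 655360;
  EXPLAIN ANALYZE
  SELECT ps_partkey, ps_supplycost, ps_availqty
    FROM partsupp, supplier
   WHERE ps_suppkey = s_suppkey
     AND s_nationkey = 0;   -- substitute the nationkey of ETHIOPIA

Whichever run shows the lower actual total time is the plan worth having on
this hardware.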

M




