Re: [HACKERS] Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)

2005-04-07 Thread Jim C. Nasby
On Wed, Apr 06, 2005 at 06:35:10PM -0400, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
  Can anyone suggest a more general rule?  Do we need for example to
  consider whether the relation membership is the same in two clauses
  that might be opposite sides of a range restriction?  It seems like
  
  a.x > b.y AND a.x < b.z
 
  In a case like this, you could actually look at the data in b and see
  what the average range size is.
 
 Not with the current statistics --- you'd need some kind of cross-column
 statistics involving both y and z.  (That is, I doubt it would be
 helpful to estimate the average range width by taking the difference of
 independently-calculated mean values of y and z ...)  But yeah, in
 principle it would be possible to make a non-default estimate.

Actually, it might be possible to take a SWAG at it using the histogram
and correlation stats.

You know... since getting universally useful cross-platform stats seems
to be pretty pie-in-the-sky, would it be possible to generate more
complex stats on the fly from a sampling of a table? If you're looking
at a fairly sizeable table ISTM it would be worth sampling the rows on
10 or 20 random pages to see what you get. In this case, you'd want to
know the average difference between two fields. Other queries might want
something different.
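For the data in this thread, by the way, the quantity in question is cheap
to compute exactly, since postcodes is tiny. As a sketch of what such
sampled stats would be estimating (plain SQL; a full scan, which is fine at
two pages):

    SELECT avg(range_till - range_from) AS avg_range_width
    FROM postcodes;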
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



[PERFORM] Plan for relatively simple query seems to be very inefficient

2005-04-06 Thread Arjen van der Meijden
Hi list,
I noticed on a forum a query taking a surprisingly large amount of time 
in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much 
better. To my surprise PostgreSQL was ten times worse on the same 
machine! And I don't understand why.

I don't really need this query to be fast since I don't use it, but the 
range-thing is not really an uncommon query, I suppose. So I'm wondering 
why it is so slow; it may point to a wrong plan being chosen or 
generated.

Here are table definitions:
     Table "public.postcodes"
   Column    |   Type   | Modifiers
-------------+----------+-----------
 postcode_id | smallint | not null
 range_from  | smallint |
 range_till  | smallint |
Indexes:
    "postcodes_pkey" PRIMARY KEY, btree (postcode_id)
    "range" UNIQUE, btree (range_from, range_till)

    Table "public.data_main"
 Column |   Type   | Modifiers
--------+----------+-----------
 userid | integer  | not null
 range  | smallint |
Indexes:
    "data_main_pkey" PRIMARY KEY, btree (userid)
And here's the query I ran:
SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range BETWEEN p.range_from AND p.range_till
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=332586.85..332586.85 rows=1 width=0) (actual time=22712.038..22712.039 rows=1 loops=1)
   ->  Nested Loop  (cost=3.76..328945.96 rows=1456356 width=0) (actual time=0.054..22600.826 rows=82688 loops=1)
         Join Filter: (("outer".range >= "inner".range_from) AND ("outer".range <= "inner".range_till))
         ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.020..136.930 rows=81920 loops=1)
         ->  Materialize  (cost=3.76..5.36 rows=160 width=4) (actual time=0.001..0.099 rows=160 loops=81920)
               ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=4) (actual time=0.010..0.396 rows=160 loops=1)
 Total runtime: 22712.211 ms

When I do something completely bogus, which results in coupling the 
data per record from data_main to one record from postcodes, it is still 
not very fast, but acceptable:

SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range / 10 = p.postcode_id
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10076.98..10076.98 rows=1 width=0) (actual time=1456.016..1456.017 rows=1 loops=1)
   ->  Merge Join  (cost=8636.81..9913.13 rows=65537 width=0) (actual time=1058.105..1358.571 rows=81920 loops=1)
         Merge Cond: ("outer".postcode_id = "inner"."?column2?")
         ->  Index Scan using postcodes_pkey on postcodes p  (cost=0.00..5.76 rows=160 width=2) (actual time=0.034..0.507 rows=160 loops=1)
         ->  Sort  (cost=8636.81..8841.61 rows=81920 width=2) (actual time=1057.698..1169.879 rows=81920 loops=1)
               Sort Key: (dm.range / 10)
               ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.020..238.886 rows=81920 loops=1)
 Total runtime: 1461.156 ms

Doing something similarly bogus, but with fewer results, is much faster, 
even though it should have basically the same plan:

SELECT COUNT(*) FROM
data_main AS dm,
postcodes AS p
WHERE dm.range = p.postcode_id
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2138.63..2138.63 rows=1 width=0) (actual time=180.667..180.668 rows=1 loops=1)
   ->  Hash Join  (cost=4.00..2087.02 rows=20642 width=0) (actual time=180.645..180.645 rows=0 loops=1)
         Hash Cond: ("outer".range = "inner".postcode_id)
         ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.005..105.548 rows=81920 loops=1)
         ->  Hash  (cost=3.60..3.60 rows=160 width=2) (actual time=0.592..0.592 rows=0 loops=1)
               ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=2) (actual time=0.025..0.349 rows=160 loops=1)
 Total runtime: 180.807 ms
(7 rows)

If you like to toy around with the datasets on your heavily optimized 
postgresql-installs, let me know. The data is just generated for 
testing-purposes and I'd happily send a copy to anyone interested.

Best regards,
Arjen van der Meijden


Re: [PERFORM] Plan for relatively simple query seems to be very inefficient

2005-04-06 Thread Dave Held
 -Original Message-
 From: Arjen van der Meijden 
 [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, April 06, 2005 11:53 AM
 To: performance pgsql
 Subject: [PERFORM] Plan for relatively simple query seems to be very
 inefficient
 
 [...]
 SELECT COUNT(*) FROM
 data_main AS dm,
 postcodes AS p
 WHERE dm.range BETWEEN p.range_from AND p.range_till
 [...]
  Aggregate  (cost=332586.85..332586.85 rows=1 width=0) (actual time=22712.038..22712.039 rows=1 loops=1)
    ->  Nested Loop  (cost=3.76..328945.96 rows=1456356 width=0) (actual time=0.054..22600.826 rows=82688 loops=1)

I'm still a noob at reading EXPLAIN ANALYZE, but it seems to me
that your statistics are throwing off the planner here.  It 
estimates 1.4M and gets 82K, so it's off by a factor of about 20.  
Have you considered doing a VACUUM or upping your statistics?
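(For reference, upping the per-column statistics target goes like this; a
sketch, with the target value picked arbitrarily:

    ALTER TABLE data_main ALTER COLUMN range SET STATISTICS 100;
    ANALYZE data_main;

Though, as discussed further down the thread, the misestimate here comes
from the BETWEEN-against-another-table clause itself, which better
per-column stats alone won't fix.)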

 [...]
 When I do something completely bogus, which results in 
 coupling the data per record from data_main to one record from
 postcodes, it is still not very fast, but acceptable:
 [...]
  Aggregate  (cost=10076.98..10076.98 rows=1 width=0) (actual time=1456.016..1456.017 rows=1 loops=1)
    ->  Merge Join  (cost=8636.81..9913.13 rows=65537 width=0) (actual time=1058.105..1358.571 rows=81920 loops=1)

Looks like Merge Join is faster than the Nested Loop for this
query.  If you notice, the row counts are a lot closer to the
estimates, too.  This is probably a good plan.

 [...]
 Doing something similarly bogus, but with fewer results, is 
 much faster, even though it should have basically the same
 plan:
 
 SELECT COUNT(*) FROM
 data_main AS dm,
 postcodes AS p
 WHERE dm.range = p.postcode_id
 [...]
  Aggregate  (cost=2138.63..2138.63 rows=1 width=0) (actual time=180.667..180.668 rows=1 loops=1)
    ->  Hash Join  (cost=4.00..2087.02 rows=20642 width=0) (actual time=180.645..180.645 rows=0 loops=1)

This one I don't understand at all.  Clearly, the Hash Join is
the way to go, but the estimates are way off (which probably 
explains why this plan isn't chosen in the first place).

          Hash Cond: ("outer".range = "inner".postcode_id)
          ->  Seq Scan on data_main dm  (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.005..105.548 rows=81920 loops=1)
          ->  Hash  (cost=3.60..3.60 rows=160 width=2) (actual time=0.592..0.592 rows=0 loops=1)
                ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=2) (actual time=0.025..0.349 rows=160 loops=1)
  Total runtime: 180.807 ms
 (7 rows)
 [...]

My completely amateur guess is that the planner is able to use
Merge Join and Hash Join on your contrived queries because you
are only trying to join one field to a single value (i.e.:
operator=).  But the BETWEEN clause is what forces the Nested
Loop.  You can see that here:

       ->  Seq Scan on postcodes p  (cost=0.00..3.60 rows=160 width=4) (actual time=0.010..0.396 rows=160 loops=1)
vs. here:

  ->  Index Scan using postcodes_pkey on postcodes p  (cost=0.00..5.76 rows=160 width=2) (actual time=0.034..0.507 rows=160 loops=1)

So the first query forces a SeqScan on postcodes, while the
second can do an IndexScan.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129



Re: [PERFORM] Plan for relatively simple query seems to be very inefficient

2005-04-06 Thread Arjen van der Meijden
On 6-4-2005 19:04, Steve Atkins wrote:
On Wed, Apr 06, 2005 at 06:52:35PM +0200, Arjen van der Meijden wrote:
Hi list,
I noticed on a forum a query taking a surprisingly large amount of time 
in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much 
better. To my surprise PostgreSQL was ten times worse on the same 
machine! And I don't understand why.

I don't really need this query to be fast since I don't use it, but the 
range-thing is not really an uncommon query I suppose. So I'm wondering 
why it is so slow and this may point to a wrong plan being chosen or 
generated.

That's the wrong index type for fast range queries. You really need
something like GiST or rtree for that. I do something similar in
production and queries are down at the millisecond level with the
right index.
That may be, but since that table is only two pages, the index would 
probably not be used even if it were rtree or GiST?
Btw, access method rtree does not support multicolumn indexes, so I'd 
need another way of storing it as well? Plus it doesn't support <= and >=, 
so the query would have to be changed for the way ranges are checked.

I'm not sure if the dataset is really suitable for other range checks. 
It is a linear set of postal codes grouped by their number (range_from 
to range_till) into regions and the query basically joins the region to 
each records of a user table. Of course one could use lines on the 
x-axis and define the postal-code of a specific user as a point on one 
of those lines...
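For what it's worth, a hypothetical (and untested) sketch of that geometric
rework, using the old built-in geometric types and a degenerate box per
postcode range; the zero y-coordinates are only padding to make a
one-dimensional range fit a two-dimensional type:

    CREATE INDEX postcodes_box_idx ON postcodes
        USING rtree (box(point(range_from, 0), point(range_till, 0)));

    SELECT COUNT(*)
    FROM data_main AS dm, postcodes AS p
    WHERE box(point(p.range_from, 0), point(p.range_till, 0))
          ~ box(point(dm.range, 0), point(dm.range, 0));

(~ being the old "contains" operator for boxes.)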

But nonetheless, /this/ query should not be that slow either, right?
Arjen


Re: [PERFORM] Plan for relatively simple query seems to be very inefficient

2005-04-06 Thread Tom Lane
Arjen van der Meijden [EMAIL PROTECTED] writes:
 I noticed on a forum a query taking a surprisingly large amount of time 
 in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much 
 better. To my surprise PostgreSQL was ten times worse on the same 
 machine! And I don't understand why.

Wrong index ... what you probably could use here is an index on
data_main.range, so that the query could run with postcodes as the
outer side.  I get such a plan by default with empty tables:

 Aggregate  (cost=99177.80..99177.80 rows=1 width=0)
   ->  Nested Loop  (cost=0.00..98021.80 rows=462400 width=0)
         ->  Seq Scan on postcodes p  (cost=0.00..30.40 rows=2040 width=4)
         ->  Index Scan using rangei on data_main dm  (cost=0.00..44.63 rows=227 width=2)
               Index Cond: ((dm.range >= "outer".range_from) AND (dm.range <= "outer".range_till))

but I'm not sure if the planner would prefer it with the tables loaded
up.  (It might not be the right thing anyway ... but seems worth
trying.)
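Concretely, that is nothing more exotic than (the name rangei just matches
what the plan above happens to show; any name will do):

    CREATE INDEX rangei ON data_main (range);
    ANALYZE data_main;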

Given the relatively small size of the postcodes table, and the fact
that each data_main row seems to join to about one postcodes row,
it's possible that what the planner did for you was actually the
optimal thing anyhow.  I'm not sure that any range-capable index would
be faster than just scanning through 160 entries in memory ...

regards, tom lane



Re: [PERFORM] Plan for relatively simple query seems to be very inefficient

2005-04-06 Thread Tom Lane
Dave Held [EMAIL PROTECTED] writes:
 My completely amateur guess is that the planner is able to use
 Merge Join and Hash Join on your contrived queries because you
 are only trying to join one field to a single value (i.e.:
 operator=).  But the BETWEEN clause is what forces the Nested
 Loop.  You can see that here:

Yeah --- both merge and hash join are only usable for equality joins.
(Thinking about it, it seems possible that mergejoin could be extended
to work for range joins, but we're certainly far from being able to
do that today.)  So the basic alternatives the planner has are nestloops
with either postcode on the outside, or data_main on the outside.  The
postcode-on-the-outside case would be plausible with an index on
data_main.range, but Arjen didn't have one.  The data_main-on-the-outside
case could only use an index if the index was range-query-capable, which
a 2-column btree index isn't.  Given the small size of the postcodes
table it's not real clear that an index probe would be much of a win
anyway over a simple sequential scan.

Comparing the nestloop case to the hash case does make one think that
there's an awful lot of overhead somewhere, though.  Two int2
comparisons ought not take very long :-(.  Arjen, are you interested
in getting a gprof profile of what the backend is doing in the
nestloop-with-materialize plan?  Or if you don't want to mess with it, please
send me the data off-list and I'll run a profile.
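(In case anyone wants to try: the usual recipe, sketched from memory, is to
rebuild the backend with profiling enabled,

    gmake PROFILE="-pg" clean all

run the query in a fresh session, disconnect cleanly, and then point gprof
at the postgres binary and the gmon.out file the backend leaves in its
working directory under $PGDATA.)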

regards, tom lane



Re: [PERFORM] Plan for relatively simple query seems to be very inefficient

2005-04-06 Thread Arjen van der Meijden
On 6-4-2005 19:42, Tom Lane wrote:
Arjen van der Meijden [EMAIL PROTECTED] writes:
I noticed on a forum a query taking a surprisingly large amount of time 
in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much 
better. To my surprise PostgreSQL was ten times worse on the same 
machine! And I don't understand why.

Wrong index ... what you probably could use here is an index on
data_main.range, so that the query could run with postcodes as the
outer side.  I get such a plan by default with empty tables:
 Aggregate  (cost=99177.80..99177.80 rows=1 width=0)
   ->  Nested Loop  (cost=0.00..98021.80 rows=462400 width=0)
         ->  Seq Scan on postcodes p  (cost=0.00..30.40 rows=2040 width=4)
         ->  Index Scan using rangei on data_main dm  (cost=0.00..44.63 rows=227 width=2)
               Index Cond: ((dm.range >= "outer".range_from) AND (dm.range <= "outer".range_till))
but I'm not sure if the planner would prefer it with the tables loaded
up.  (It might not be the right thing anyway ... but seems worth
trying.)
No, it didn't prefer it.
Given the relatively small size of the postcodes table, and the fact
that each data_main row seems to join to about one postcodes row,
it's possible that what the planner did for you was actually the
optimal thing anyhow.  I'm not sure that any range-capable index would
be faster than just scanning through 160 entries in memory ...
			regards, tom lane
Yep, there is only one, or in corner cases two, postcode ranges per 
postcode. Actually it should be only one, but my generated data is not 
perfect.
But the sequential scan per record is not really what surprises me; 
since the postcode table is only two pages of data, I didn't really 
expect otherwise.
It is the fact that it takes 22 seconds that surprises me, especially 
since the two other examples on the same data, which consider about the 
same number of records per table, take only 1.4 and 0.18 seconds.

Best regards,
Arjen


Re: [PERFORM] Plan for relatively simple query seems to be very inefficient

2005-04-06 Thread Mischa
Quoting Arjen van der Meijden [EMAIL PROTECTED]:

 Hi list,
 
 I noticed on a forum a query taking a surprisingly large amount of time 
 in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much 
 better. To my surprise PostgreSQL was ten times worse on the same 
 machine! And I don't understand why.
 
 I don't really need this query to be fast since I don't use it, but the 
 range-thing is not really an uncommon query I suppose. So I'm wondering 
 why it is so slow and this may point to a wrong plan being chosen or 
 generated.
 
 Here are table definitions:
 
       Table "public.postcodes"
     Column    |   Type   | Modifiers
  -------------+----------+-----------
   postcode_id | smallint | not null
   range_from  | smallint |
   range_till  | smallint |
  Indexes:
      "postcodes_pkey" PRIMARY KEY, btree (postcode_id)
      "range" UNIQUE, btree (range_from, range_till)
 
      Table "public.data_main"
   Column |   Type   | Modifiers
  --------+----------+-----------
   userid | integer  | not null
   range  | smallint |
  Indexes:
      "data_main_pkey" PRIMARY KEY, btree (userid)
 
 And here's the query I ran:
 
 SELECT COUNT(*) FROM
 data_main AS dm,
 postcodes AS p
 WHERE dm.range BETWEEN p.range_from AND p.range_till

I just posted an answer to this (via webcafe webmail; can't recall which
pg-list) that might interest you.

BTree indexes as they stand (multi-column, ...) answer what most people need for
queries. Unfortunately, out-of-the-box, they have no good way of handling range
queries. To compensate, you can use a small amount of kinky SQL. This is in the
same line as the tricks used to implement hierarchic queries in relational SQL.

[1] Create a table widths(wid int) of powers of 2, up to what will just cover
max(range_till-range_from). Since your range column is a smallint, this table
can have no more than 15 rows. You can get as fussy as you want about keeping
this table to a minimum.
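As a sketch, assuming smallint-sized ranges (so 2^0 through 2^14 is enough,
matching the 15-row bound above), the widths table can be built like:

CREATE TABLE widths (wid int);
INSERT INTO widths SELECT (2 ^ g)::int FROM generate_series(0, 14) AS g;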

[2] Change postcodes (this assumes range_till > range_from everywhere,
since LOG(2, 0) would error out):

ALTER TABLE postcodes ADD wid INT;
UPDATE postcodes SET wid = 2 ^ CEIL(LOG(2, range_till - range_from));
ALTER TABLE postcodes ADD start INT;
UPDATE postcodes SET start = range_from - (range_from % wid);
CREATE INDEX postcodes_wid_start_index ON postcodes (wid, start);
ANALYZE postcodes;

[3] Write your query as:
SELECT COUNT(*)
FROM data_main AS dm
CROSS JOIN widths -- yes, CROSS JOIN. For once, it HELPS performance.
JOIN postcodes AS p
  ON p.wid = widths.wid AND p.start = dm.range - dm.range % widths.wid
WHERE dm.range BETWEEN p.range_from AND p.range_till

This uses BTREE exact-match to make a tight restriction on which rows to check.
YMMV, but this has worked even for multi-M table joins.

-- 
Dreams come true, not free.




Re: [PERFORM] Plan for relatively simple query seems to be very inefficient

2005-04-06 Thread Tom Lane
Arjen van der Meijden [EMAIL PROTECTED] writes:
 On 6-4-2005 20:09, Tom Lane wrote:
 Comparing the nestloop case to the hash case does make one think that
 there's an awful lot of overhead somewhere, though.  Two int2
 comparisons ought not take very long :-(.  Arjen, are you interested
 in getting a gprof profile of what the backend is doing in the nestloop
 -with-materialize plan?  Or if you don't want to mess with it, please
 send me the data off-list and I'll run a profile.

 Here you go, both are full pg_dump-dumps with create-data (including the 
 index on data_main.range).

Well, indeed int2ge and int2le are pretty far down the list, but the
stuff that's near the top has already been beat on pretty heavily :-(.
I'm not sure there is a lot we can do about this short of a wholesale
redesign of the way we do expression evaluation.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
 36.14     21.30    21.30                              _mcount
  7.62     25.79     4.49  13412606     0.00     0.00  ExecMakeFunctionResultNoSets
  5.46     29.01     3.22  26825216     0.00     0.00  slot_getattr
  4.19     31.48     2.47  26825216     0.00     0.00  ExecEvalVar
  3.87     33.76     2.28  13189120     0.00     0.00  ExecMaterial
  3.38     35.75     1.99  13494688     0.00     0.00  slot_deform_tuple
  3.38     37.74     1.99                              noshlibs
  3.12     39.58     1.84  13353893     0.00     0.00  ExecProcNode
  2.99     41.34     1.76  13107201     0.00     0.00  ExecQual
  2.90     43.05     1.71  13271974     0.00     0.00  AllocSetReset
  2.72     44.65     1.60                              ExecEvalVar
  2.43     46.08     1.43                              $$dyncall
  2.24     47.40     1.32  13271972     0.00     0.00  MemoryContextReset
  2.24     48.72     1.32  13188960     0.00     0.00  tuplestore_gettuple
  2.12     49.97     1.25  13189441     0.00     0.00  ExecStoreTuple
  1.80     51.03     1.06     82689     0.01     0.06  ExecNestLoop
  1.70     52.03     1.00  13354235     0.00     0.00  ExecClearTuple
  1.63     52.99     0.96  13412761     0.00     0.00  check_stack_depth
  1.58     53.92     0.93                              AllocSetReset
  1.29     54.68     0.76                              int2ge
  1.20     55.39     0.71                              ExecMakeFunctionResultNoSets
  1.14     56.06     0.67  13107200     0.00     0.00  int2ge
  1.05     56.68     0.62                              ExecEvalCoerceToDomain
  1.04     57.29     0.61  13189120     0.00     0.00  tuplestore_ateof
  0.64     57.67     0.38  13271972     0.00     0.00  MemoryContextResetChildren
  0.41     57.91     0.24                              readtup_heap
  0.36     58.12     0.21                              log_disconnections
  0.24     58.26     0.14                              BlessTupleDesc
  0.19     58.37     0.11                              ExecCountSlotsMaterial
  0.14     58.45     0.08                              MemoryContextAllocZeroAligned
  0.12     58.52     0.07                              ExecProcNode
  0.10     58.58     0.06                              int42div
  0.08     58.63     0.05                              AllocSetStats
  0.05     58.66     0.03    166022     0.00     0.00  LockBuffer
  0.05     58.69     0.03     82688     0.00     0.00  advance_transition_function
  0.05     58.72     0.03     82080     0.00     0.00  HeapTupleSatisfiesSnapshot
  0.05     58.75     0.03                              ExecInitNestLoop
  0.03     58.77     0.02                              SeqNext
  0.02     58.78     0.01    305408     0.00     0.00  int2le
  0.02     58.79     0.01     84231     0.00     0.00  LWLockAcquire
  0.02     58.80     0.01     82849     0.00     0.00  ExecProject
  0.02     58.81     0.01     82848     0.00     0.00  ExecVariableList
  0.02     58.82     0.01     82844     0.00     0.00  ResourceOwnerEnlargeBuffers
  0.02     58.83     0.01     82844     0.00     0.00  ResourceOwnerRememberBuffer
  0.02     58.84     0.01     82813     0.00     0.00  ReleaseAndReadBuffer
  0.02     58.85     0.01     82688     0.00     0.00  ExecEvalConst
  0.02     58.86     0.01     82688     0.00     0.00  ExecEvalExprSwitchContext
  0.02     58.87     0.01     82688     0.00     0.00  advance_aggregates
  0.02     58.88     0.01     82084     0.00     0.00  heapgettup
  0.02     58.89     0.01     81920     0.00     0.00  ExecMaterialReScan
  0.02     58.90     0.01     81920     0.00     0.00  ExecReScan
  0.02     58.91     0.01        19     0.53     0.53  downcase_truncate_identifier
  0.02     58.92     0.01        10     1.00     1.00  AllocateFile
  0.02     58.93     0.01         1    10.00    70.59  agg_retrieve_direct
[ nothing else shows as having any sample hits ]

_mcount is profiler overhead, in case you were wondering; ignore it and
mentally scale all the other percentages up accordingly.

Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)

2005-04-06 Thread Tom Lane
I wrote:
 Arjen van der Meijden [EMAIL PROTECTED] writes:
 SELECT COUNT(*) FROM
 data_main AS dm,
 postcodes AS p
 WHERE dm.range BETWEEN p.range_from AND p.range_till

 Planner error ... because it doesn't have any good way to estimate the
 number of matching rows, it thinks that way is a bit more expensive than
 data_main as the outside, but in reality it seems a good deal cheaper:

BTW, it would get the right answer if it had recognized the WHERE clause
as a range restriction --- it still doesn't know exactly what fraction
of rows will match, but its default estimate is a great deal tighter for
WHERE x > something AND x < somethingelse than it is for two unrelated
inequality constraints.  Enough tighter that it would have gone for the
correct plan.
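(For the record: with nothing better to go on, each inequality is assumed
to pass DEFAULT_INEQ_SEL = 1/3 of the rows, so two unrelated inequalities
come out at 1/9 of the cross product; that is exactly where the plan's
1456356-row estimate earlier in the thread came from, 81920 * 160 / 9. A
recognized range constraint instead gets DEFAULT_RANGE_INEQ_SEL = 0.005.)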

The problem is that it doesn't recognize the WHERE as a range constraint
on dm.range.  I thought for a moment that this might be a
recently-introduced bug, but actually the code is operating as designed:
clauselist_selectivity says

 * See if it looks like a restriction clause with a pseudoconstant
 * on one side.  (Anything more complicated than that might not
 * behave in the simple way we are expecting.)

Pseudoconstant in this context means a constant, parameter symbol, or
non-volatile functions of these ... so comparisons against values from
another table don't qualify.  It seems like we're missing a bet though.

Can anyone suggest a more general rule?  Do we need for example to
consider whether the relation membership is the same in two clauses
that might be opposite sides of a range restriction?  It seems like

	a.x > b.y AND a.x < b.z

probably can be treated as a range restriction on a.x for this purpose,
but I'm much less sure that the same is true of

	a.x > b.y AND a.x < c.z

Thoughts?

regards, tom lane



Re: [HACKERS] Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)

2005-04-06 Thread Jim C. Nasby
On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
 Can anyone suggest a more general rule?  Do we need for example to
 consider whether the relation membership is the same in two clauses
 that might be opposite sides of a range restriction?  It seems like
 
   a.x > b.y AND a.x < b.z

In a case like this, you could actually look at the data in b and see
what the average range size is. If you wanted to get really fancy, the
optimizer could decide how best to access a based on each row of b.

 probably can be treated as a range restriction on a.x for this purpose,
 but I'm much less sure that the same is true of
 
   a.x > b.y AND a.x < c.z

Well, this could end up being much trickier, since who knows how b and c
are related. Thinking about it, though: although I threw out the
row-by-row analysis idea to be glib, it would actually work in this
case; you could take a look at what b and c look like each time 'through
the loop'.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [HACKERS] Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)

2005-04-06 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
 Can anyone suggest a more general rule?  Do we need for example to
 consider whether the relation membership is the same in two clauses
 that might be opposite sides of a range restriction?  It seems like
 
  a.x > b.y AND a.x < b.z

 In a case like this, you could actually look at the data in b and see
 what the average range size is.

Not with the current statistics --- you'd need some kind of cross-column
statistics involving both y and z.  (That is, I doubt it would be
helpful to estimate the average range width by taking the difference of
independently-calculated mean values of y and z ...)  But yeah, in
principle it would be possible to make a non-default estimate.

regards, tom lane



Re: [HACKERS] Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)

2005-04-06 Thread Tom Lane
John A Meinel [EMAIL PROTECTED] writes:
 Actually, I think he was saying do a nested loop, and for each item in
 the nested loop, re-evaluate if an index or a sequential scan is more
 efficient.

 I don't think postgres re-plans once it has started, though you could
 test this in a plpgsql function.

It doesn't, and in any case that's a microscopic view of the issue.
The entire shape of the plan might change depending on what we think
the selectivity is --- much more than could be handled by switching
scan types at the bottom level.

Also, I anticipate that bitmap-driven index scans will change things
considerably here.  The range of usefulness of pure seqscans will
drop drastically...

regards, tom lane
