Re: [PERFORM] [HACKERS] EXISTS optimization

2007-03-26 Thread Peter Kovacs

On 3/23/07, Kevin Grittner [EMAIL PROTECTED] wrote:
[...]

That's the good news.  The bad news is that I operate under a management 
portability dictate which doesn't currently allow that syntax, since not all of 
the products they want to


It doesn't really touch the substance, but I am curious: are you not
even allowed to discriminate between products in your code like:
if db is 'postresql' then
...
else
...
?

What would be the rationale for that?

Thanks
Peter

cover support it.  I tried something which seems equivalent, but it is
running for a very long time.  I'll show it with just the explain
while I wait to see how long the explain analyze takes.



[...]

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] EXISTS optimization

2007-03-24 Thread Martijn van Oosterhout
On Fri, Mar 23, 2007 at 05:30:27PM -0500, Kevin Grittner wrote:
 I don't understand -- TRUE OR UNKNOWN evaluates to TRUE, so why would
 the IN need to continue?  I'm not quite following the rest; could you
 elaborate or give an example?  (Sorry if I'm lagging behind the rest
 of the class here.)

You're right, I'm getting confused with the interaction of NULL and NOT
IN.

The multiple evaluation thing still applies, but that's minor.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


[HACKERS] EXISTS optimization

2007-03-23 Thread Kevin Grittner
I'm posting this to performance in case our workaround may be of benefit to 
someone with a similar issue.  I'm posting to hackers because I hope we can 
improve our planner in this area so that a workaround is not necessary.  (It 
might make sense to reply to one group or the other, depending on reply 
content.)
 
We are converting from a commercial database (which shall remain unnamed here, 
due to license restrictions on publishing benchmarks).  Most queries run faster 
on PostgreSQL; a small number choose very poor plans and run much longer.  This 
particular query runs on the commercial product in 6.1s first time, 1.4s 
cached.  In PostgreSQL it runs in about 144s both first time and cached.  I was 
able to use an easy but fairly ugly rewrite (getting duplicate rows and 
eliminating them with DISTINCT) which runs on the commercial product in 
9.2s/3.0s and in PostgreSQL in 2.0s/0.7s.
 
Here are the tables:
 
  Table public.TranHeader
Column |   Type   | Modifiers
---+--+---
 tranNo| TranNoT| not null
 countyNo  | CountyNoT  | not null
 acctPd| DateT  | not null
 date  | DateT  | not null
 isComplete| boolean  | not null
 tranId| TranIdT| not null
 tranType  | TranTypeT  | not null
 userId| UserIdT| not null
 workstationId | WorkstationIdT | not null
 time  | TimeT  |
Indexes:
TranHeader_pkey PRIMARY KEY, btree (tranNo, countyNo)
TranHeader_TranAcctPeriod UNIQUE, btree (acctPd, tranNo, countyNo)
TranHeader_TranDate UNIQUE, btree (date, tranNo, countyNo)

Table public.TranDetail
 Column  |Type| Modifiers
-++---
 tranNo  | TranNoT  | not null
 tranDetailSeqNo | TranDetailSeqNoT | not null
 countyNo| CountyNoT| not null
 acctCode| AcctCodeT| not null
 amt | MoneyT   | not null
 assessNo| TranIdT  |
 caseNo  | CaseNoT  |
 citnNo  | CitnNoT  |
 citnViolDate| DateT|
 issAgencyNo | IssAgencyNoT |
 partyNo | PartyNoT |
 payableNo   | PayableNoT   |
 rcvblNo | RcvblNoT |
Indexes:
TranDetail_pkey PRIMARY KEY, btree (tranNo, tranDetailSeqNo, 
countyNo)
TranDetail_TranDetCaseNo UNIQUE, btree (caseNo, tranNo, 
tranDetailSeqNo, countyNo)
TranDetail_TranDetPay UNIQUE, btree (payableNo, tranNo, 
tranDetailSeqNo, countyNo)
TranDetail_TranDetRcvbl UNIQUE, btree (rcvblNo, tranNo, 
tranDetailSeqNo, countyNo)
TranDetail_TranDetAcct btree (acctCode, citnNo, countyNo)

  Table public.Adjustment
 Column  | Type  | Modifiers
-+---+---
 adjustmentNo| TranIdT | not null
 countyNo| CountyNoT   | not null
 date| DateT   | not null
 isTranVoided| boolean   | not null
 reasonCode  | ReasonCodeT | not null
 tranNo  | TranNoT | not null
 adjustsTranId   | TranIdT |
 adjustsTranNo   | TranNoT |
 adjustsTranType | TranTypeT   |
 explanation | character varying(50) |
Indexes:
Adjustment_pkey PRIMARY KEY, btree (adjustmentNo, countyNo)
Adjustment_AdjustsTranId btree (adjustsTranId, adjustsTranType, 
tranNo, countyNo)
Adjustment_AdjustsTranNo btree (adjustsTranNo, tranNo, countyNo)
Adjustment_Date btree (date, countyNo)
 
Admittedly, the indexes are optimized for our query load under the commercial 
product, which can use the covering index optimization.
 
explain analyze
SELECT A.adjustmentNo, A.tranNo, A.countyNo, H.date, 
H.userId, H.time
  FROM Adjustment A
  JOIN TranHeader H ON (H.tranId = A.adjustmentNo AND 
H.countyNo = A.countyNo AND H.tranNo = A.tranNo)
  WHERE H.tranType = 'A'
AND A.date  DATE '2006-01-01'
AND H.countyNo = 66
AND A.countyNo = 66
AND EXISTS
(
  SELECT 1 FROM TranDetail D
WHERE D.tranNo = H.tranNo
  AND D.countyNo = H.countyNo
  AND D.caseNo LIKE '2006TR%'
)
;

 Nested Loop  (cost=182.56..72736.37 rows=1 width=46) (actual 
time=6398.108..143631.427 rows=2205 loops=1)
   Join Filter: ((H.tranId)::bpchar = (A.adjustmentNo)::bpchar)
   -  Bitmap Heap Scan on Adjustment A  (cost=182.56..1535.69 rows=11542 
width=22) (actual time=38.098..68.324 rows=12958 loops=1)
 Recheck Cond: (((date)::date  '2006-01-01'::date) AND 
((countyNo)::smallint = 66))
 -  Bitmap Index Scan on Adjustment_Date  (cost=0.00..179.67 
rows=11542 width=0) (actual time=32.958..32.958 rows=12958 loops=1)
   Index Cond: (((date)::date  '2006-01-01'::date) AND 
((countyNo)::smallint = 66))
   -  Index Scan using TranHeader_pkey on TranHeader H  

Re: [HACKERS] EXISTS optimization

2007-03-23 Thread Tom Lane
Kevin Grittner [EMAIL PROTECTED] writes:
 explain analyze
 SELECT A.adjustmentNo, A.tranNo, A.countyNo, H.date, 
 H.userId, H.time
   FROM Adjustment A
   JOIN TranHeader H ON (H.tranId = A.adjustmentNo AND 
 H.countyNo = A.countyNo AND H.tranNo = A.tranNo)
   WHERE H.tranType = 'A'
 AND A.date  DATE '2006-01-01'
 AND H.countyNo = 66
 AND A.countyNo = 66
 AND EXISTS
 (
   SELECT 1 FROM TranDetail D
 WHERE D.tranNo = H.tranNo
   AND D.countyNo = H.countyNo
   AND D.caseNo LIKE '2006TR%'
 )
 ;

 The commercial product scans the index on caseNo in TranDetail to build a 
 work table of unique values, then uses indexed access to the TranHeader and 
 then to Adjustment.

If you want that, try rewriting the EXISTS to an IN:

   AND (H.tranNo, H.countyNo) IN
(
  SELECT D.tranNo, D.countyNo FROM TranDetail D
WHERE D.caseNo LIKE '2006TR%'
)

We don't currently try to flatten EXISTS into a unique/join plan as we
do for IN.  I seem to recall not doing so when I rewrote IN planning
because I didn't think it would be exactly semantically equivalent,
but that was awhile ago.  Right at the moment it seems like it ought
to be equivalent as long as the comparison operators are strict.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] EXISTS optimization

2007-03-23 Thread Kevin Grittner


 On Fri, Mar 23, 2007 at  4:49 PM, in message [EMAIL PROTECTED],
Tom Lane [EMAIL PROTECTED] wrote: 
 Kevin Grittner [EMAIL PROTECTED] writes:
 explain analyze
 SELECT A.adjustmentNo, A.tranNo, A.countyNo, H.date, 
 H.userId, H.time
   FROM Adjustment A
   JOIN TranHeader H ON (H.tranId = A.adjustmentNo AND 
 H.countyNo = A.countyNo AND H.tranNo = A.tranNo)
   WHERE H.tranType = 'A'
 AND A.date  DATE '2006- 01- 01'
 AND H.countyNo = 66
 AND A.countyNo = 66
 AND EXISTS
 (
   SELECT 1 FROM TranDetail D
 WHERE D.tranNo = H.tranNo
   AND D.countyNo = H.countyNo
   AND D.caseNo LIKE '2006TR%'
 )
 ;
 
 The commercial product scans the index on caseNo in TranDetail to build a 
 work table of unique values, then uses indexed access to the TranHeader and 
 then to Adjustment.
 
 If you want that, try rewriting the EXISTS to an IN:
 
AND (H.tranNo, H.countyNo) IN
 (
   SELECT D.tranNo, D.countyNo FROM TranDetail D
 WHERE D.caseNo LIKE '2006TR%'
 )

Nice.  I get this:
 
explain analyze
SELECT A.adjustmentNo, A.tranNo, A.countyNo, H.date, 
H.userId, H.time
  FROM Adjustment A
  JOIN TranHeader H ON (H.tranId = A.adjustmentNo AND 
H.countyNo = A.countyNo AND H.tranNo = A.tranNo)
  WHERE H.tranType = 'A'
AND A.date  DATE '2006- 01- 01'
AND H.countyNo = 66
AND A.countyNo = 66
AND (H.tranNo, H.countyNo) IN
(
  SELECT D.tranNo, D.countyNo FROM TranDetail D
WHERE D.caseNo LIKE '2006TR%'
)
;
 
 Nested Loop  (cost=27.76..36.38 rows=1 width=46) (actual time=92.999..200.398 
rows=2209 loops=1)
   Join Filter: ((H.tranNo)::integer = (A.tranNo)::integer)
   -  Nested Loop  (cost=27.76..32.08 rows=1 width=50) (actual 
time=92.970..176.472 rows=2209 loops=1)
 -  HashAggregate  (cost=27.76..27.77 rows=1 width=6) (actual 
time=92.765..100.810 rows=9788 loops=1)
   -  Index Scan using TranDetail_TranDetCaseNo on TranDetail 
D  (cost=0.00..27.66 rows=20 width=6) (actual time=0.059..60.967 rows=46301 
loops=1)
 Index Cond: (((caseNo)::bpchar = '2006TR'::bpchar) AND 
((caseNo)::bpchar  '2006TS'::bpchar) AND ((countyNo)::smallint = 66))
 Filter: ((caseNo)::bpchar ~~ '2006TR%'::text)
 -  Index Scan using TranHeader_pkey on TranHeader H  
(cost=0.00..4.30 rows=1 width=46) (actual time=0.006..0.006 rows=0 loops=9788)
   Index Cond: (((H.tranNo)::integer = (D.tranNo)::integer) 
AND ((H.countyNo)::smallint = 66))
   Filter: ((tranType)::bpchar = 'A'::bpchar)
   -  Index Scan using Adjustment_pkey on Adjustment A  (cost=0.00..4.28 
rows=1 width=22) (actual time=0.008..0.009 rows=1 loops=2209)
 Index Cond: (((H.tranId)::bpchar = (A.adjustmentNo)::bpchar) 
AND ((A.countyNo)::smallint = 66))
 Filter: ((date)::date  '2006-01-01'::date)
 Total runtime: 201.306 ms
 
That's the good news.  The bad news is that I operate under a management 
portability dictate which doesn't currently allow that syntax, since not all of 
the products they want to cover support it.  I tried something which seems 
equivalent, but it is running for a very long time.  I'll show it with just the 
explain while I wait to see how long the explain analyze takes.
 
explain
SELECT A.adjustmentNo, A.tranNo, A.countyNo, H.date, 
H.userId, H.time
  FROM Adjustment A
  JOIN TranHeader H ON (H.tranId = A.adjustmentNo AND 
H.countyNo = A.countyNo AND H.tranNo = A.tranNo)
  WHERE H.tranType = 'A'
AND A.date  DATE '2006- 01- 01'
AND H.countyNo = 66
AND A.countyNo = 66
AND H.tranNo IN
(
  SELECT D.tranNo FROM TranDetail D
WHERE D.caseNo LIKE '2006TR%'
  AND D.countyNo = H.countyNo
)
;

 Nested Loop  (cost=0.00..181673.08 rows=1 width=46)
   Join Filter: ((H.tranId)::bpchar = (A.adjustmentNo)::bpchar)
   -  Seq Scan on Adjustment A  (cost=0.00..2384.27 rows=11733 width=22)
 Filter: (((date)::date  '2006-01-01'::date) AND 
((countyNo)::smallint = 66))
   -  Index Scan using TranHeader_pkey on TranHeader H  
(cost=0.00..15.27 rows=1 width=46)
 Index Cond: (((H.tranNo)::integer = (A.tranNo)::integer) AND 
((H.countyNo)::smallint = 66))
 Filter: (((tranType)::bpchar = 'A'::bpchar) AND (subplan))
 SubPlan
   -  Index Scan using TranDetail_TranDetCaseNo on TranDetail D  
(cost=0.00..27.66 rows=20 width=4)
 Index Cond: (((caseNo)::bpchar = '2006TR'::bpchar) AND 
((caseNo)::bpchar  '2006TS'::bpchar) AND ((countyNo)::smallint = 
($0)::smallint))
 Filter: ((caseNo)::bpchar ~~ '2006TR%'::text)

 We don't currently try to flatten EXISTS into a unique/join plan as we
 do for IN.  I seem to recall not doing so when I rewrote IN planning
 because I didn't think it would be exactly semantically equivalent,
 but that was awhile ago.  Right at the moment it seems like it ought
 to be equivalent as 

Re: [HACKERS] EXISTS optimization

2007-03-23 Thread Martijn van Oosterhout
On Fri, Mar 23, 2007 at 05:49:42PM -0400, Tom Lane wrote:
 We don't currently try to flatten EXISTS into a unique/join plan as we
 do for IN.  I seem to recall not doing so when I rewrote IN planning
 because I didn't think it would be exactly semantically equivalent,
 but that was awhile ago.  Right at the moment it seems like it ought
 to be equivalent as long as the comparison operators are strict.

Wasn't it due to the fact that IN needs to scan through all
possibilites anyway because of its interaction with NULL, whereas
EXISTS can stop at the first row?

That would mean the subquery to be materialised would not be equivalent
if it called any non-immutable functions. It's also much less clear to
be a win in the EXISTs case. But then, that's a costs issue the planner
can deal with...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] EXISTS optimization

2007-03-23 Thread Kevin Grittner
I don't understand -- TRUE OR UNKNOWN evaluates to TRUE, so why would the IN 
need to continue?  I'm not quite following the rest; could you elaborate or 
give an example?  (Sorry if I'm lagging behind the rest of the class here.)
 
-Kevin
 
 
 Martijn van Oosterhout kleptog@svana.org 03/23/07 5:26 PM  
On Fri, Mar 23, 2007 at 05:49:42PM -0400, Tom Lane wrote:
 We don't currently try to flatten EXISTS into a unique/join plan as we
 do for IN.  I seem to recall not doing so when I rewrote IN planning
 because I didn't think it would be exactly semantically equivalent,
 but that was awhile ago.  Right at the moment it seems like it ought
 to be equivalent as long as the comparison operators are strict.

Wasn't it due to the fact that IN needs to scan through all
possibilites anyway because of its interaction with NULL, whereas
EXISTS can stop at the first row?

That would mean the subquery to be materialised would not be equivalent
if it called any non-immutable functions. It's also much less clear to
be a win in the EXISTs case. But then, that's a costs issue the planner
can deal with...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] [HACKERS] EXISTS optimization

2007-03-23 Thread Kevin Grittner


 On Fri, Mar 23, 2007 at  5:26 PM, in message
[EMAIL PROTECTED], Kevin Grittner
[EMAIL PROTECTED] wrote: 

 I tried something which seems 
 equivalent, but it is running for a very long time.  I'll show it with just 
 the explain while I wait to see how long the explain analyze takes.
  
 explain
 SELECT A.adjustmentNo, A.tranNo, A.countyNo, H.date, 
 H.userId, H.time
   FROM Adjustment A
   JOIN TranHeader H ON (H.tranId = A.adjustmentNo AND 
 H.countyNo = A.countyNo AND H.tranNo = A.tranNo)
   WHERE H.tranType = 'A'
 AND A.date  DATE '2006-  01-  01'
 AND H.countyNo = 66
 AND A.countyNo = 66
 AND H.tranNo IN
 (
   SELECT D.tranNo FROM TranDetail D
 WHERE D.caseNo LIKE '2006TR%'
   AND D.countyNo = H.countyNo
 )
 ;

explain analyze results:
 
 Nested Loop  (cost=0.00..181673.08 rows=1 width=46) (actual 
time=42224.077..964266.969 rows=2209 loops=1)
   Join Filter: ((H.tranId)::bpchar = (A.adjustmentNo)::bpchar)
   -  Seq Scan on Adjustment A  (cost=0.00..2384.27 rows=11733 width=22) 
(actual time=15.355..146.620 rows=13003 loops=1)
 Filter: (((date)::date  '2006-01-01'::date) AND 
((countyNo)::smallint = 66))
   -  Index Scan using TranHeader_pkey on TranHeader H  
(cost=0.00..15.27 rows=1 width=46) (actual time=74.141..74.141 rows=0 
loops=13003)
 Index Cond: (((H.tranNo)::integer = (A.tranNo)::integer) AND 
((H.countyNo)::smallint = 66))
 Filter: (((tranType)::bpchar = 'A'::bpchar) AND (subplan))
 SubPlan
   -  Index Scan using TranDetail_TranDetCaseNo on TranDetail D  
(cost=0.00..27.66 rows=20 width=4) (actual time=0.039..58.234 rows=42342 
loops=13003)
 Index Cond: (((caseNo)::bpchar = '2006TR'::bpchar) AND 
((caseNo)::bpchar  '2006TS'::bpchar) AND ((countyNo)::smallint = 
($0)::smallint))
 Filter: ((caseNo)::bpchar ~~ '2006TR%'::text)
 Total runtime: 964269.555 ms


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] EXISTS optimization

2007-03-23 Thread Tom Lane
Kevin Grittner [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] wrote: 
 If you want that, try rewriting the EXISTS to an IN:
 
 AND (H.tranNo, H.countyNo) IN
 (
 SELECT D.tranNo, D.countyNo FROM TranDetail D
 WHERE D.caseNo LIKE '2006TR%'
 )

 That's the good news.  The bad news is that I operate under a
 management portability dictate which doesn't currently allow that
 syntax, since not all of the products they want to cover support it.

Which part of it don't they like --- the multiple IN-comparisons?

 I tried something which seems equivalent, but it is running for a very
 long time.
 AND H.tranNo IN
 (
   SELECT D.tranNo FROM TranDetail D
 WHERE D.caseNo LIKE '2006TR%'
   AND D.countyNo = H.countyNo
 )

No, that's not gonna accomplish a darn thing, because you've still got
a correlated subquery (ie, a reference to outer H) and so turning the
IN into a join doesn't work.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] [HACKERS] EXISTS optimization

2007-03-23 Thread Kevin Grittner
 On Fri, Mar 23, 2007 at  6:04 PM, in message
[EMAIL PROTECTED], Peter Kovacs
[EMAIL PROTECTED] wrote: 
 On 3/23/07, Kevin Grittner [EMAIL PROTECTED] wrote:
 [...]
 That's the good news.  The bad news is that I operate under a management 
 portability dictate which doesn't currently allow that syntax, since not all 
 of the products they want to
 
 It doesn't really touch the substance, but I am curious: are you not
 even allowed to discriminate between products in your code like:
 if db is 'postresql' then
 ...
 else
 ...
 ?
 
 What would be the rationale for that?
 
Anybody who's not curious about that should skip the rest of this email.
 
Management has simply given a mandate that the software be independent of OS 
and database vendor, and to use Java to help with the OS independence.  I have 
to admit that I am the architect of the database independence solution that was 
devised.  (The choice of Java for the OS independence has been very successful. 
 We have run our bytecode on HP-UX, Windows, Sun Solaris, and various flavors 
of Linux without having to compile different versions of the bytecode.  Other 
than when people get careless with case sensitivity on file names or with path 
separators, it just drops right in and runs.
 
For the data side, we write all of our queries in ANSI SQL in our own query 
tool, parse it, and generate Java classes to run it.  The ANSI source is broken 
down to lowest common denominator queries, with all procedural code covered 
in the Java query classes.  So we have stored procedures which can be called, 
triggers that fire, etc. in Java, issuing SELECT, INSERT, UPDATE, DELETE 
statements to the database.  This allows us to funnel all DML through a few 
primitive routines which capture before and after images and save them in our 
own transaction image tables.  We use this to replicate from our 72 county 
databases, which are the official court record, to multiple central databases, 
and a transaction repository, used for auditing case activity and assisting 
with failure recovery.
 
The problem with burying 'if db is MaxDB', 'if db is SQLServer', 'if db is 
PostgreSQL' everywhere is that you have no idea what to do when you then want 
to drop in some different product.   We have a plugin layer to manage known 
areas of differences which aren't handled cleanly by JDBC, where the default 
behavior is ANSI-compliant, and a few dozen to a few hundred  lines need to be 
written to modify that default support a new database product.  (Of course, 
each one so far has brought in a few surprises, making the plugin layer just a 
little bit thicker.)
 
So, to support some new syntax, we have to update our parser, and have a way to 
generate code which runs on all the candidate database products, either 
directly or through a plugin layer.  If any of the products don't support 
multi-value row value constructors, I have a hard time seeing a good way to 
cover that with the plugin.  On the subject issue, I'm pretty sure it would 
actually be less work for me to modify the PostgreSQL optimizer to efficiently 
handle the syntax we do support than to try to bend row value constructors to a 
syntax that is supported on other database products.
 
And, by the way, I did take a shot on getting them to commit to PostgreSQL as 
the long-term solution, and relax the portability rules.  No sale.  Perhaps 
when everything is converted to PostgreSQL and working for a while they may 
reconsider.
 
-Kevin
 


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster