[PERFORM] Efficiency of EXISTS?

2012-07-23 Thread Kenneth Tilton
My mental model of the EXISTS clause must be off. This snippet appears at
the end of a series of WITH clauses I suspect are irrelevant:

with etc etc ... , cids as
   (select distinct c.id from ddr2 c
 join claim_entries ce on ce.claim_id = c.id
 where (c.assigned_ddr = 879
 or exists (select 1 from ddr_cdt dc
 where
 dc.sys_user_id = 879
 and dc.document_type = c.document_type
 -- makes it faster: and (dc.cdt_code is null or dc.cdt_code = ce.cpt_code)
 )))

 select count(*) from cids


If I uncomment the bit where it says make it faster I get decent response
and the graphical analyze display shows the expected user+doctype+cdtcode
index is being used (and nice thin lines suggesting efficient lookup).

As it is, the analyze display shows the expected user+doctype index* being
used but the lines are fat, and performance is an exponential disaster.

* I created the (to me ) redundant user+doctype index trying to get
Postgres to Do the Right Thing(tm), but I can see that was not the issue.

I presume the reason performance drops off a cliff is because there can be
9000 cdt_codes for one user+doctype, but I was hoping EXISTS would just
look to see if there was at least one row matching user+doctype and return
its decision. I have tried select *, select 1, and limit 1 on the nested
select to no avail.

Am I just doing something wrong? I am a relative noob. Is there some other
hint I can give the planner?

Thx, ken


Re: [PERFORM] Efficiency of EXISTS?

2012-07-23 Thread Merlin Moncure
On Mon, Jul 23, 2012 at 4:12 PM, Kenneth Tilton ktil...@mcna.net wrote:
 My mental model of the EXISTS clause must be off. This snippet appears at
 the end of a series of WITH clauses I suspect are irrelevant:

 with etc etc ... , cids as
   (select distinct c.id from ddr2 c
 join claim_entries ce on ce.claim_id = c.id
 where (c.assigned_ddr = 879
 or exists (select 1 from ddr_cdt dc
 where
 dc.sys_user_id = 879
 and dc.document_type = c.document_type
 -- makes it faster: and (dc.cdt_code is null or dc.cdt_code = ce.cpt_code)
 )))

 select count(*) from cids


 If I uncomment the bit where it says make it faster I get decent response
 and the graphical analyze display shows the expected user+doctype+cdtcode
 index is being used (and nice thin lines suggesting efficient lookup).

 As it is, the analyze display shows the expected user+doctype index* being
 used but the lines are fat, and performance is an exponential disaster.

 * I created the (to me ) redundant user+doctype index trying to get Postgres
 to Do the Right Thing(tm), but I can see that was not the issue.

 I presume the reason performance drops off a cliff is because there can be
 9000 cdt_codes for one user+doctype, but I was hoping EXISTS would just look
 to see if there was at least one row matching user+doctype and return its
 decision. I have tried select *, select 1, and limit 1 on the nested select
 to no avail.

 Am I just doing something wrong? I am a relative noob. Is there some other
 hint I can give the planner?

hard to say without having the explain analyze output.  also it's not
clear why you need to use WITH, at least for the terminating query.
I'd just do:

select count(*) from
(
  inner_query
)

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Efficiency of EXISTS?

2012-07-23 Thread Kenneth Tilton
On Mon, Jul 23, 2012 at 2:52 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Mon, Jul 23, 2012 at 4:12 PM, Kenneth Tilton ktil...@mcna.net wrote:
  My mental model of the EXISTS clause must be off. This snippet appears at
  the end of a series of WITH clauses I suspect are irrelevant:
 
  with etc etc ... , cids as
(select distinct c.id from ddr2 c
  join claim_entries ce on ce.claim_id = c.id
  where (c.assigned_ddr = 879
  or exists (select 1 from ddr_cdt dc
  where
  dc.sys_user_id = 879
  and dc.document_type = c.document_type
  -- makes it faster: and (dc.cdt_code is null or dc.cdt_code =
 ce.cpt_code)
  )))
 
  select count(*) from cids
 
 
  If I uncomment the bit where it says make it faster I get decent
 response
  and the graphical analyze display shows the expected user+doctype+cdtcode
  index is being used (and nice thin lines suggesting efficient lookup).
 
  As it is, the analyze display shows the expected user+doctype index*
 being
  used but the lines are fat, and performance is an exponential disaster.
 
  * I created the (to me ) redundant user+doctype index trying to get
 Postgres
  to Do the Right Thing(tm), but I can see that was not the issue.
 
  I presume the reason performance drops off a cliff is because there can
 be
  9000 cdt_codes for one user+doctype, but I was hoping EXISTS would just
 look
  to see if there was at least one row matching user+doctype and return its
  decision. I have tried select *, select 1, and limit 1 on the nested
 select
  to no avail.
 
  Am I just doing something wrong? I am a relative noob. Is there some
 other
  hint I can give the planner?

 hard to say without having the explain analyze output.  also it's not
 clear why you need to use WITH, at least for the terminating query.
 I'd just do:

 select count(*) from
 (
   inner_query
 )


OK. Here is the full query:

with ddr as (
select c.id
 ,case
   when c.portal_user_id is null then u.provider_facility_id
   else pu.provider_facility_id
  end provider_facility_id
 from claims c
 left join sys_users u on u.id = c.created_by
 left join portal_users pu on pu.id = c.portal_user_id
 WHERE c.deleted = 0
 AND c.status = 0
 AND (c.created_by is not null or c.portal_user_id is not null)
 AND true not in ( select ineligible_code_id in (46,65)
from claim_carcs cc
where c.id = cc.claim_id
and cc.deleted = 0 )
 AND (false OR c.document_type = 0)
 AND (false OR c.group_plan_id = 44)

 limit 1500
)

,ddr2 as (
select c.id
 , c.document_type
 , c.radiographs
 , c.nea_number
 , c.assigned_ddr
 , d.provider_facility_id as submitting_facility_id
 , count(ca.id) as claim_attachments_count
 , cast(exists (select 1 from triples where s = c.id and sda='claims' and p
= 'ddr-review-passed-on-by') as boolean) as passedon
 from ddr d
inner join
claims c on d.id = c.id
join claim_attachments ca on c.id = ca.claim_id
group by
 c.id
 , submitting_facility_id
having ((nullif(trim(c.nea_number, ' '),'') is not null)
or case transmission_method
when 'P' then count(distinct ca.id)  1
else count(distinct ca.id)  0
   end
or c.radiographs  0))

, cids as
  (select distinct c.id from ddr2 c
join claim_entries ce on ce.claim_id = c.id
where (c.assigned_ddr = 879
or exists (select 1 from ddr_cdt dc
where
dc.sys_user_id = 879
and dc.document_type = c.document_type
--and (dc.cdt_code is null or dc.cdt_code = ce.cpt_code)
)))
select count(*) from cids

And the explain output:

Aggregate  (cost=56060.60..56060.61 rows=1 width=0)
  CTE ddr
-  Limit  (cost=306.29..16203.83 rows=1500 width=16)
  -  Nested Loop Left Join  (cost=306.29..7442626.75 rows=702214
width=16)
-  Hash Left Join  (cost=306.29..7244556.97 rows=702214
width=12)
  Hash Cond: (c.created_by = u.id)
  -  Index Scan using claims_lca1 on claims c
 (cost=0.00..7230212.96 rows=702214 width=12)
Index Cond: ((deleted = 0) AND (status = 0)
AND (group_plan_id = 44) AND (document_type = 0))
Filter: (((created_by IS NOT NULL) OR
(portal_user_id IS NOT NULL)) AND (NOT (SubPlan 1)))
SubPlan 1
  -  Index Scan using claim_carcs_claim_id on
claim_carcs cc  (cost=0.00..9.23 rows=1 width=4)
Index Cond: (c.id = claim_id)
Filter: (deleted = 0)
  -  Hash  (cost=224.46..224.46 rows=6546 width=8)
-  Seq Scan on sys_users u
 (cost=0.00..224.46 rows=6546 width=8)
-  Index Scan using portal_users_pkey on portal_users pu
 (cost=0.00..0.27 rows=1 width=8)
  Index Cond: (id = c.portal_user_id)
  CTE ddr2
-  GroupAggregate  (cost=25714.40..28093.98 rows=286 width=27)
  Filter: ((NULLIF(btrim((c.nea_number)::text, ' '::text),
''::text) IS NOT NULL) OR CASE c.transmission_method WHEN 'P'::bpchar THEN
(count(DISTINCT ca.id)  1) ELSE (count(DISTINCT ca.id)  

Re: [PERFORM] IN or EXISTS

2011-09-22 Thread Jeff Davis
On Wed, 2011-08-31 at 09:33 +0800, Craig Ringer wrote:
 On the other hand, the `IN' subquery is uncorrelated needs only run 
 once, where the `EXISTS' subquery is correlated and has to run once for 
 every outer record.

If the EXISTS looks semantically similar to an IN (aside from NULL
semantics), then it can be made into a semijoin. It doesn't require
re-executing any part of the plan.

I don't think there are any cases where [NOT] IN is an improvement, am I
mistaken?

 Another complication is the possible presence of NULL in an IN list. 
 Getting NULLs in `IN' lists is a common source of questions on this 
 list, because people are quite surprised by how it works. EXISTS avoids 
 the NULL handling issue (and in the process demonstrates how woefully 
 inconsistent SQL's handling of NULL really is).

Absolutely. The NULL behavior of IN is what makes it hard to optimize,
and therefore you should use EXISTS instead if the semantics are
suitable.

 Theoretically the query planner could transform:
 
 SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE 
 z.y_id IS NOT NULL);
 
 into:
 
 SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id)
 
 ... or vice versa depending on which it thought would be faster.

Although those two queries are semantically the same (I think), a lot of
very similar pairs of queries are not equivalent. For instance, if it
was a NOT IN you couldn't change that to a NOT EXISTS.

Regards,
Jeff Davis


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] IN or EXISTS

2011-08-31 Thread Craig Ringer

On 31/08/2011 4:30 AM, Andy Colson wrote:

Hi all,

I have read things someplace saying not exists was better than not 
in... or something like that.  Not sure if that was for in/exists and 
not in/not exists, and for a lot of records or not.


`EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary 
to build a list of values then iterate over them to check for a match. 
By contrast, `EXISTS' may use a simple index lookup or the like to test 
for the presence of a value.


On the other hand, the `IN' subquery is uncorrelated needs only run 
once, where the `EXISTS' subquery is correlated and has to run once for 
every outer record. That means that the `IN' list approach can be a lot 
faster where the subquery in question is relatively time consuming for 
the number of values it returns. For example, if the `IN' query returns 
only 5 values and takes 100ms, you're scanning 1 million records in the 
outer query, and the subquery `EXISTS' version would take 50ms, using 
`IN' is a no-brainer since 1 million times 50ms will be a lot slower 
than 1 times 100ms plus the time required to scan 5 elements 1 million 
times.


Another complication is the possible presence of NULL in an IN list. 
Getting NULLs in `IN' lists is a common source of questions on this 
list, because people are quite surprised by how it works. EXISTS avoids 
the NULL handling issue (and in the process demonstrates how woefully 
inconsistent SQL's handling of NULL really is).


Theoretically the query planner could transform:

SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE 
z.y_id IS NOT NULL);


into:

SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id)

... or vice versa depending on which it thought would be faster. AFAIK 
it doesn't currently do this. To be able to do it the planner would need 
to know how to estimate the cost of scanning an `IN' result list. It'd 
also need to be able to use constraints on the target table to prove 
that the result of the `IN' may not contain nulls. To transform the 
EXISTS version into the IN version where it'd be more efficient, it'd 
also have to be able to use constraints on the target table to prove 
that results of a SELECT would be unique without explicit deduplication.


All this makes me wonder ... does Pg currently support sorting IN lists 
and using a binary search? It'd be pretty nice to be able to prove that:


SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z);

is equvalent to:

SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id 
IS NOT NULL)


... and either transform it to an EXISTS test or add an ORDER BY z_id 
and flag the resultset as sorted so a binary search could be done on it 
whenever a row hits the IN test.


--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] IN or EXISTS

2011-08-31 Thread Andy Colson

On 8/30/2011 8:33 PM, Craig Ringer wrote:

On 31/08/2011 4:30 AM, Andy Colson wrote:

Hi all,

I have read things someplace saying not exists was better than not
in... or something like that. Not sure if that was for in/exists and
not in/not exists, and for a lot of records or not.


`EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary
to build a list of values then iterate over them to check for a match.
By contrast, `EXISTS' may use a simple index lookup or the like to test
for the presence of a value.

On the other hand, the `IN' subquery is uncorrelated needs only run
once, where the `EXISTS' subquery is correlated and has to run once for
every outer record. That means that the `IN' list approach can be a lot
faster where the subquery in question is relatively time consuming for
the number of values it returns. For example, if the `IN' query returns
only 5 values and takes 100ms, you're scanning 1 million records in the
outer query, and the subquery `EXISTS' version would take 50ms, using
`IN' is a no-brainer since 1 million times 50ms will be a lot slower
than 1 times 100ms plus the time required to scan 5 elements 1 million
times.

Another complication is the possible presence of NULL in an IN list.
Getting NULLs in `IN' lists is a common source of questions on this
list, because people are quite surprised by how it works. EXISTS avoids
the NULL handling issue (and in the process demonstrates how woefully
inconsistent SQL's handling of NULL really is).

Theoretically the query planner could transform:

SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE
z.y_id IS NOT NULL);

into:

SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id)

... or vice versa depending on which it thought would be faster. AFAIK
it doesn't currently do this. To be able to do it the planner would need
to know how to estimate the cost of scanning an `IN' result list. It'd
also need to be able to use constraints on the target table to prove
that the result of the `IN' may not contain nulls. To transform the
EXISTS version into the IN version where it'd be more efficient, it'd
also have to be able to use constraints on the target table to prove
that results of a SELECT would be unique without explicit deduplication.

All this makes me wonder ... does Pg currently support sorting IN lists
and using a binary search? It'd be pretty nice to be able to prove that:

SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z);

is equvalent to:

SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id
IS NOT NULL)

... and either transform it to an EXISTS test or add an ORDER BY z_id
and flag the resultset as sorted so a binary search could be done on it
whenever a row hits the IN test.

--
Craig Ringer



Yeah... my current code uses IN.  Most of my updates are small, so my 
inner list is 500 integers.  It runs fine.  What I'm worried about is 
when I update the entire table, so my inner list is 60k integers.  Maybe 
I'm just worrying for naught.  I tested a table with 100k rows, ran both 
with explain analyzes, and they look the same:


 Delete  (cost=11186.26..20817.60 rows=25911 width=12) (actual 
time=408.138..408.138 rows=0 loops=1)
   -  Hash Semi Join  (cost=11186.26..20817.60 rows=25911 width=12) 
(actual time=61.997..182.573 rows=105434 loops=1)

 Hash Cond: (public.general.gid = upd.general.gid)
 -  Seq Scan on general  (cost=0.00..9113.11 rows=25911 
width=10) (actual time=0.004..42.364 rows=105434 loops=1)
 -  Hash  (cost=9868.34..9868.34 rows=105434 width=10) (actual 
time=61.958..61.958 rows=105434 loops=1)

   Buckets: 16384  Batches: 1  Memory Usage: 4531kB
   -  Seq Scan on general  (cost=0.00..9868.34 rows=105434 
width=10) (actual time=0.003..34.372 rows=105434 loops=1)


With or without an index, (even if I ANALYZE it) it still does a table 
scan and builds a hash.  Both IN and EXISTS act the same way.


I assume:
Buckets: 16384  Batches: 1  Memory Usage: 4531kB

That means a total of 4.5 meg of ram was used for the hash, so if my 
work_mem was lower than that it would swap?  (or choose a different plan?)


I'll only ever be running one update at a time, so I'm not worried about 
multiple connections running at once.


Anyway, I'll just leave it alone (and stop optimizing things that dont 
need it)


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] IN or EXISTS

2011-08-31 Thread Tomas Vondra
On 31 Srpen 2011, 15:59, Andy Colson wrote:
 I assume:
 Buckets: 16384  Batches: 1  Memory Usage: 4531kB

 That means a total of 4.5 meg of ram was used for the hash, so if my
 work_mem was lower than that it would swap?  (or choose a different plan?)

Why don't you try that? Just set the work_mem to 1MB or so and run the query.

I think it'll use the same plan but multiple batches - read just part of
the inner table so that the hash table fits into work_mem, scan the outer
table etc. The downside is it'd rescan the outer table several times.

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] IN or EXISTS

2011-08-30 Thread Andy Colson

Hi all,

I have read things someplace saying not exists was better than not in... 
or something like that.  Not sure if that was for in/exists and not 
in/not exists, and for a lot of records or not.


Here is my setup:

My website has a general table, let say 60k rows.  Its mostly read-only. 
 Every once and a while we get updated data, so I:

create schema upd;
create table upd.general(like public.general);

Then I dump the new data into upd.general.  (This has many table's and 
steps, I'm simplifying it here).


For the last step, I want to:

begin;
delete from public.general where gid in (select gid from upd.general);
insert into public.general select * from upd.general;
... 7 other tables same way ...
commit;


Most of the time upd.general will be  500 rows.  Every once and a while 
things get messed up and we just update the entire database, so count(*) 
upd.general == count(*) public.general.


My question is:
fast is nice, but safe and less resource intensive is better, so which 
would I probably like better:


delete from public.general where gid in (select gid from upd.general);

or

-- currently dont have and index, so
create index general_pk on upd.general(gid);
delete from public.general a where exists(select 1 from upd.general b 
where a.gid=b.gid);



Thanks for any suggestions,

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] join vs exists

2010-06-20 Thread AI Rumman
Which one is good - join between table or using exists in where condition?

Query 1;

Select a.*
from a
where exists
(
select 1 from b inner join c on b.id1 = c.id where a.id = b.id)

Query 2:
select a.*
from a
inner join
(select b.id from b inner join c on b.id1 = c.id) as q
on a.id = q.id

Any suggestion please.


Re: [PERFORM] join vs exists

2010-06-20 Thread Kevin Grittner
AI Rumman  wrote:
 
 Which one is good - join between table or using exists in where
 condition?
 
Your example wouldn't return the same results unless there was at
most one matching row in b and one matching row in c, at least
without resorting to DISTINCT (which you don't show).  So, be careful
of not getting the wrong results in an attempt to optimize.
 
You don't say which version of PostgreSQL you're using, but if its a
fairly recent major version, I would expect nearly identical
performance if the queries returned the same results without
DISTINCT, and would usually expect better results for the EXISTS than
the JOIN with DISTINCT.
 
-Kevin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance