[SQL] which is better: using OR clauses or UNION?

2011-08-16 Thread adam_pgsql

Hi,

I have a query hitting a table of 25 million rows. The table has a text field 
('identifier') which i need to query for matching rows. The question is if i 
have multiple strings to match against this field I can use multiple OR 
sub-statements or multiple statements in a UNION. The UNION seems to run 
quicker is this to be expected? or is there anything else I can do improve 
the speed of this query? Some query details:


table "dba_data_base", index:
"in_dba_data_base_identifier" btree (lower(identifier) varchar_pattern_ops)


Query 1
---
datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, 
p_value
 FROM
 dba_data_base a
 WHERE
 ( lower(identifier) LIKE lower('BUGS001884677') OR
   lower(identifier) LIKE lower('BUGS001884678') OR
   lower(identifier) LIKE lower('BUGS001884679') OR
   lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )
 ORDER BY a.identifier;

  QUERY PLAN
   



Sort  (cost=14930.13..14939.77 rows=3857 width=62) (actual 
time=3208.466..3208.652 rows=318 loops=1)
  Sort Key: identifier
  ->  Bitmap Heap Scan on dba_data_base a  (cost=134.43..14700.38 rows=3857 
width=62) (actual time=81.106..3207.721 rows=318 loops=1)
Recheck Cond: ((lower(identifier) ~~ 'bugs001884677'::text) OR 
(lower(identifier) ~~ 'bugs001884678'::text) OR (lower(identifier) ~~ 
'bugs001884679'::text) OR (lower(identifier) ~
~ 'sptigr4-2210 (6f24)'::text))
Filter: ((lower(identifier) ~~ 'bugs001884677'::text) OR 
(lower(identifier) ~~ 'bugs001884678'::text) OR (lower(identifier) ~~ 
'bugs001884679'::text) OR (lower(identifier) ~~ 'spt
igr4-2210 (6f24)'::text))
->  BitmapOr  (cost=134.43..134.43 rows=3857 width=0) (actual 
time=71.397..71.397 rows=0 loops=1)
  ->  Bitmap Index Scan on in_dba_data_base_identifier  
(cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029 rows=0 loops=1)
Index Cond: (lower(identifier) ~=~ 
'bugs001884677'::character varying)
  ->  Bitmap Index Scan on in_dba_data_base_identifier  
(cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: (lower(identifier) ~=~ 
'bugs001884678'::character varying)
  ->  Bitmap Index Scan on in_dba_data_base_identifier  
(cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: (lower(identifier) ~=~ 
'bugs001884679'::character varying)
  ->  Bitmap Index Scan on in_dba_data_base_identifier  
(cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 
loops=1)
Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 
(6f24)'::character varying)
Total runtime: 3208.904 ms


Query 2
---
datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, 
p_value
 FROM
 dba_data_base a
 WHERE
 lower(identifier) LIKE lower('BUGS001884677')
UNION
SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
 FROM
 dba_data_base a
 WHERE
   lower(identifier) LIKE lower('BUGS001884678')
UNION
SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
 FROM
 dba_data_base a
 WHERE
   lower(identifier) LIKE lower('BUGS001884679')
UNION
SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
 FROM
 dba_data_base a
 WHERE
   lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)')
 ORDER BY identifier;

Sort  (cost=15702.26..15711.90 rows=3856 width=62) (actual time=3.688..3.886 
rows=317 loops=1)
  Sort Key: identifier
  ->  Unique  (cost=15414.74..15472.58 rows=3856 width=62) (actual 
time=2.663..3.387 rows=317 loops=1)
->  Sort  (cost=15414.74..15424.38 rows=3856 width=62) (actual 
time=2.660..2.834 rows=318 loops=1)
  Sort Key: bioassay_id, identifier, ratio, log_ratio, p_value
  ->  Append  (cost=32.88..15185.06 rows=3856 width=62) (actual 
time=0.320..2.131 rows=318 loops=1)
->  Bitmap Heap Scan on dba_data_base a  
(cost=32.88..3786.62 rows=964 width=62) (actual time=0.041..0.041 rows=0 
loops=1)
  Filter: (lower(identifier) ~~ 
'bugs001884677'::text)
  ->  Bitmap Index Scan on in_dba_data_base_identifier  
(cost=0.00..32.64 rows=964 width=0) (actual time=0.036..0.036 rows=0 loops=1)
Index Cond: (lower(identifier) ~=~ 
'bugs001884677'::character varying)
->  Bitmap Heap Scan on dba_data_base a  
(cost=3

Re: [SQL] which is better: using OR clauses or UNION?

2011-08-16 Thread adam_pgsql

Hi Viktor,

thanks for your email, gave that a try, but the 

lower(identifier) LIKE lower('BUGS001884677') OR

still comes in quicker than the IN approach

thanks

adam



On 16 Aug 2011, at 12:56, Viktor Bojović wrote:

> hi Adam,
> im not sure which is faster/slower but, possibly you can speed it up by using 
> "in" operator
> 
> ...where lower(identifier) in (lower('BUGS001884677') , 
> lower('BUGS001884678'),);
> 
> if you create function based index:
> CREATE INDEX idx_table_lower_text ON table(lower(text_field));
> (taken from: http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php)
> 
> 
> 
> 
> On Tue, Aug 16, 2011 at 1:39 PM, adam_pgsql  wrote:
> 
> Hi,
> 
> I have a query hitting a table of 25 million rows. The table has a text field 
> ('identifier') which i need to query for matching rows. The question is if i 
> have multiple strings to match against this field I can use multiple OR 
> sub-statements or multiple statements in a UNION. The UNION seems to run 
> quicker is this to be expected? or is there anything else I can do 
> improve the speed of this query? Some query details:
> 
> 
> table "dba_data_base", index:
> "in_dba_data_base_identifier" btree (lower(identifier) varchar_pattern_ops)
> 
> 
> Query 1
> ---
> datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, 
> p_value
>  FROM
> dba_data_base a
>  WHERE
> ( lower(identifier) LIKE lower('BUGS001884677') OR
>   lower(identifier) LIKE lower('BUGS001884678') OR
>   lower(identifier) LIKE lower('BUGS001884679') OR
>   lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )
>  ORDER BY a.identifier;
>   
>QUERY PLAN
> 
> 
> 
> Sort  (cost=14930.13..14939.77 rows=3857 width=62) (actual 
> time=3208.466..3208.652 rows=318 loops=1)
>  Sort Key: identifier
>  ->  Bitmap Heap Scan on dba_data_base a  (cost=134.43..14700.38 rows=3857 
> width=62) (actual time=81.106..3207.721 rows=318 loops=1)
>Recheck Cond: ((lower(identifier) ~~ 'bugs001884677'::text) OR 
> (lower(identifier) ~~ 'bugs001884678'::text) OR (lower(identifier) ~~ 
> 'bugs001884679'::text) OR (lower(identifier) ~
> ~ 'sptigr4-2210 (6f24)'::text))
>Filter: ((lower(identifier) ~~ 'bugs001884677'::text) OR 
> (lower(identifier) ~~ 'bugs001884678'::text) OR (lower(identifier) ~~ 
> 'bugs001884679'::text) OR (lower(identifier) ~~ 'spt
> igr4-2210 (6f24)'::text))
>->  BitmapOr  (cost=134.43..134.43 rows=3857 width=0) (actual 
> time=71.397..71.397 rows=0 loops=1)
>  ->  Bitmap Index Scan on in_dba_data_base_identifier  
> (cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029 rows=0 loops=1)
>Index Cond: (lower(identifier) ~=~ 
> 'bugs001884677'::character varying)
>  ->  Bitmap Index Scan on in_dba_data_base_identifier  
> (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)
>Index Cond: (lower(identifier) ~=~ 
> 'bugs001884678'::character varying)
>  ->  Bitmap Index Scan on in_dba_data_base_identifier  
> (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)
>Index Cond: (lower(identifier) ~=~ 
> 'bugs001884679'::character varying)
>  ->  Bitmap Index Scan on in_dba_data_base_identifier  
> (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 
> loops=1)
>Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 
> (6f24)'::character varying)
> Total runtime: 3208.904 ms
> 
> 
> Query 2
> ---
> datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, 
> p_value
>  FROM
> dba_data_base a
>  WHERE
> lower(identifier) LIKE lower('BUGS001884677')
> UNION
> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
> dba_data_base a
>  WHERE
>   lower(identifier) LIKE lower('BUGS001884678')
> UNION
> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
> dba_data_base a
>  W

Re: [SQL] which is better: using OR clauses or UNION?

2011-08-16 Thread adam_pgsql

On 16 Aug 2011, at 15:09, Tom Lane wrote:

> adam_pgsql  writes:
>> I have a query hitting a table of 25 million rows. The table has a
>> text field ('identifier') which i need to query for matching rows. The
>> question is if i have multiple strings to match against this field I
>> can use multiple OR sub-statements or multiple statements in a
>> UNION. The UNION seems to run quicker is this to be expected?
> 
> Your test cases don't seem exactly comparable; in particular I think the
> second one is benefiting from the first one having already read and
> cached the relevant disk blocks.  Notice how you've got, eg,
> 
>>  ->  Bitmap Index Scan on in_dba_data_base_identifier  
>> (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 
>> loops=1)
>>Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 
>> (6f24)'::character varying)
> 
> versus
> 
>>  ->  Bitmap Index Scan on 
>> in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual 
>> time=0.178..0.178 rows=318 loops=1)
>>Index Cond: (lower(identifier) ~=~ 
>> 'sptigr4-2210 (6f24)'::character varying)
> 
> Those are the exact same subplan, so any honest comparison should be
> finding them to take the same amount of time.  When the actual readings
> are different by a factor of several hundred, there's something wrong
> with your measurement process.
> 
> In the end this comes down to whether duplicates will be eliminated more
> efficiently by a BitmapOr step or by sort/uniq on the resulting rows.
> I'd have to bet on the BitmapOr myself, but it's likely that this is
> down in the noise compared to the actual disk accesses in any
> not-fully-cached scenario.  Also, if you don't expect the sub-statements
> to yield any duplicates, or don't care about seeing the same row twice
> in the output, you should consider UNION ALL instead of UNION.


Thanks guys, I'll give some of those options a try and see which ones improve 
performance

(Tom, yes i ran those queries after each other so there was caching going on. 
However, I had noticed a difference in performance when spacing the queries 
before and after a few other big queries to help clear the cache).

adam




-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql