[SQL] which is better: using OR clauses or UNION?
Hi, I have a query hitting a table of 25 million rows. The table has a text field ('identifier') which i need to query for matching rows. The question is if i have multiple strings to match against this field I can use multiple OR sub-statements or multiple statements in a UNION. The UNION seems to run quicker is this to be expected? or is there anything else I can do improve the speed of this query? Some query details: table "dba_data_base", index: "in_dba_data_base_identifier" btree (lower(identifier) varchar_pattern_ops) Query 1 --- datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value FROM dba_data_base a WHERE ( lower(identifier) LIKE lower('BUGS001884677') OR lower(identifier) LIKE lower('BUGS001884678') OR lower(identifier) LIKE lower('BUGS001884679') OR lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') ) ORDER BY a.identifier; QUERY PLAN Sort (cost=14930.13..14939.77 rows=3857 width=62) (actual time=3208.466..3208.652 rows=318 loops=1) Sort Key: identifier -> Bitmap Heap Scan on dba_data_base a (cost=134.43..14700.38 rows=3857 width=62) (actual time=81.106..3207.721 rows=318 loops=1) Recheck Cond: ((lower(identifier) ~~ 'bugs001884677'::text) OR (lower(identifier) ~~ 'bugs001884678'::text) OR (lower(identifier) ~~ 'bugs001884679'::text) OR (lower(identifier) ~ ~ 'sptigr4-2210 (6f24)'::text)) Filter: ((lower(identifier) ~~ 'bugs001884677'::text) OR (lower(identifier) ~~ 'bugs001884678'::text) OR (lower(identifier) ~~ 'bugs001884679'::text) OR (lower(identifier) ~~ 'spt igr4-2210 (6f24)'::text)) -> BitmapOr (cost=134.43..134.43 rows=3857 width=0) (actual time=71.397..71.397 rows=0 loops=1) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs001884677'::character varying) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs001884678'::character varying) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs001884679'::character varying) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 loops=1) Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) Total runtime: 3208.904 ms Query 2 --- datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value FROM dba_data_base a WHERE lower(identifier) LIKE lower('BUGS001884677') UNION SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value FROM dba_data_base a WHERE lower(identifier) LIKE lower('BUGS001884678') UNION SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value FROM dba_data_base a WHERE lower(identifier) LIKE lower('BUGS001884679') UNION SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value FROM dba_data_base a WHERE lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') ORDER BY identifier; Sort (cost=15702.26..15711.90 rows=3856 width=62) (actual time=3.688..3.886 rows=317 loops=1) Sort Key: identifier -> Unique (cost=15414.74..15472.58 rows=3856 width=62) (actual time=2.663..3.387 rows=317 loops=1) -> Sort (cost=15414.74..15424.38 rows=3856 width=62) (actual time=2.660..2.834 rows=318 loops=1) Sort Key: bioassay_id, identifier, ratio, log_ratio, p_value -> Append (cost=32.88..15185.06 rows=3856 width=62) (actual time=0.320..2.131 rows=318 loops=1) -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.041..0.041 rows=0 loops=1) Filter: (lower(identifier) ~~ 'bugs001884677'::text) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.036..0.036 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs001884677'::character varying) -> Bitmap Heap Scan on dba_data_base a (cost=3
Re: [SQL] which is better: using OR clauses or UNION?
Hi Viktor, thanks for your email, gave that a try, but the lower(identifier) LIKE lower('BUGS001884677') OR still comes in quicker than the IN approach thanks adam On 16 Aug 2011, at 12:56, Viktor Bojović wrote: > hi Adam, > im not sure which is faster/slower but, possibly you can speed it up by using > "in" operator > > ...where lower(identifier) in (lower('BUGS001884677') , > lower('BUGS001884678'),); > > if you create function based index: > CREATE INDEX idx_table_lower_text ON table(lower(text_field)); > (taken from: http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php) > > > > > On Tue, Aug 16, 2011 at 1:39 PM, adam_pgsql wrote: > > Hi, > > I have a query hitting a table of 25 million rows. The table has a text field > ('identifier') which i need to query for matching rows. The question is if i > have multiple strings to match against this field I can use multiple OR > sub-statements or multiple statements in a UNION. The UNION seems to run > quicker is this to be expected? or is there anything else I can do > improve the speed of this query? Some query details: > > > table "dba_data_base", index: > "in_dba_data_base_identifier" btree (lower(identifier) varchar_pattern_ops) > > > Query 1 > --- > datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, > p_value > FROM > dba_data_base a > WHERE > ( lower(identifier) LIKE lower('BUGS001884677') OR > lower(identifier) LIKE lower('BUGS001884678') OR > lower(identifier) LIKE lower('BUGS001884679') OR > lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') ) > ORDER BY a.identifier; > >QUERY PLAN > > > > Sort (cost=14930.13..14939.77 rows=3857 width=62) (actual > time=3208.466..3208.652 rows=318 loops=1) > Sort Key: identifier > -> Bitmap Heap Scan on dba_data_base a (cost=134.43..14700.38 rows=3857 > width=62) (actual time=81.106..3207.721 rows=318 loops=1) >Recheck Cond: ((lower(identifier) ~~ 'bugs001884677'::text) OR > (lower(identifier) ~~ 'bugs001884678'::text) OR (lower(identifier) ~~ > 'bugs001884679'::text) OR (lower(identifier) ~ > ~ 'sptigr4-2210 (6f24)'::text)) >Filter: ((lower(identifier) ~~ 'bugs001884677'::text) OR > (lower(identifier) ~~ 'bugs001884678'::text) OR (lower(identifier) ~~ > 'bugs001884679'::text) OR (lower(identifier) ~~ 'spt > igr4-2210 (6f24)'::text)) >-> BitmapOr (cost=134.43..134.43 rows=3857 width=0) (actual > time=71.397..71.397 rows=0 loops=1) > -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029 rows=0 loops=1) >Index Cond: (lower(identifier) ~=~ > 'bugs001884677'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1) >Index Cond: (lower(identifier) ~=~ > 'bugs001884678'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1) >Index Cond: (lower(identifier) ~=~ > 'bugs001884679'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 > loops=1) >Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 > (6f24)'::character varying) > Total runtime: 3208.904 ms > > > Query 2 > --- > datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, > p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('BUGS001884677') > UNION > SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('BUGS001884678') > UNION > SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > W
Re: [SQL] which is better: using OR clauses or UNION?
On 16 Aug 2011, at 15:09, Tom Lane wrote: > adam_pgsql writes: >> I have a query hitting a table of 25 million rows. The table has a >> text field ('identifier') which i need to query for matching rows. The >> question is if i have multiple strings to match against this field I >> can use multiple OR sub-statements or multiple statements in a >> UNION. The UNION seems to run quicker is this to be expected? > > Your test cases don't seem exactly comparable; in particular I think the > second one is benefiting from the first one having already read and > cached the relevant disk blocks. Notice how you've got, eg, > >> -> Bitmap Index Scan on in_dba_data_base_identifier >> (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 >> loops=1) >>Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 >> (6f24)'::character varying) > > versus > >> -> Bitmap Index Scan on >> in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual >> time=0.178..0.178 rows=318 loops=1) >>Index Cond: (lower(identifier) ~=~ >> 'sptigr4-2210 (6f24)'::character varying) > > Those are the exact same subplan, so any honest comparison should be > finding them to take the same amount of time. When the actual readings > are different by a factor of several hundred, there's something wrong > with your measurement process. > > In the end this comes down to whether duplicates will be eliminated more > efficiently by a BitmapOr step or by sort/uniq on the resulting rows. > I'd have to bet on the BitmapOr myself, but it's likely that this is > down in the noise compared to the actual disk accesses in any > not-fully-cached scenario. Also, if you don't expect the sub-statements > to yield any duplicates, or don't care about seeing the same row twice > in the output, you should consider UNION ALL instead of UNION. Thanks guys, I'll give some of those options a try and see which ones improve performance (Tom, yes i ran those queries after each other so there was caching going on. However, I had noticed a difference in performance when spacing the queries before and after a few other big queries to help clear the cache). adam -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql