[SQL] in() VS exists()

2008-03-13 Thread Julien
Hello,

Does anyone has an idea why sometimes:
- select ... where ... in (select ...)
is faster than :
- select ... where ... exists(select ...)
and sometimes it's the opposite ?

I had such a situation, I've pasted the queries on:
http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html

It's running PostgreSQL 8.1 with an effective_cache_size of 3.

specimens.id is the primary key and there are indexes on
sequences(specimen_id) and specimen_measurements(specimen_id)

Is there a general "rule" to know when to use the in() version and when
to use the exists() version ? Is it true to say that the exists()
version is more scalable (with many rows) than the in() version (from
the little tests I made it seems the case) ?

Thanks,
Julien

-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: [EMAIL PROTECTED]
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] in() VS exists()

2008-03-13 Thread Julien
If I understood well the query plan, the planner optimize the
IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :

->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
time=93.971..201.908 rows=1431 loops=1)
 Hash Cond: ("outer".id = "inner".specimen_id)

so I guess that :

select count(sp.id) from specimens sp where sp.id in (select specimen_id
from sequences);

is almost the same as :

select count(sp.id) from specimens sp INNER JOIN (select specimen_id
from sequences GROUP BY specimen_id) as foo on foo.specimen_id = sp.id;

?

Thanks,
Julien

On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> The chapter on indexes in the manual should give you a pretty good
> idea on the why.
> IN and EXISTS are not the only possibilities, you can also use inner
> or outer joins.
> Which solution performs best depends on the data, the database
> version, the available indexes, ...
> 
> >>> Julien <[EMAIL PROTECTED]> 2008-03-13 15:47 >>>
> Hello,
> 
> Does anyone has an idea why sometimes:
> - select ... where ... in (select ...)
> is faster than :
> - select ... where ... exists(select ...)
> and sometimes it's the opposite ?
> 
> I had such a situation, I've pasted the queries on:
> http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html
> 
> It's running PostgreSQL 8.1 with an effective_cache_size of 3.
> 
> specimens.id is the primary key and there are indexes on
> sequences(specimen_id) and specimen_measurements(specimen_id)
> 
> Is there a general "rule" to know when to use the in() version and
> when
> to use the exists() version ? Is it true to say that the exists()
> version is more scalable (with many rows) than the in() version (from
> the little tests I made it seems the case) ?
> 
> Thanks,
> Julien
> 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: [EMAIL PROTECTED]
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
> 
-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: [EMAIL PROTECTED]
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] in() VS exists()

2008-03-13 Thread Bart Degryse
I think that just
select count(sp.id) from specimens sp INNER JOIN sequences s on
s.specimen_id = sp.id;
should be enough

>>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:10 >>>
If I understood well the query plan, the planner optimize the
IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :

->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
time=93.971..201.908 rows=1431 loops=1)
 Hash Cond: ("outer".id = "inner".specimen_id)

so I guess that :

select count(sp.id) from specimens sp where sp.id in (select
specimen_id
from sequences);

is almost the same as :

select count(sp.id) from specimens sp INNER JOIN (select specimen_id
from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
sp.id;

?

Thanks,
Julien

On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> The chapter on indexes in the manual should give you a pretty good
> idea on the why.
> IN and EXISTS are not the only possibilities, you can also use inner
> or outer joins.
> Which solution performs best depends on the data, the database
> version, the available indexes, ...
> 
> >>> Julien <[EMAIL PROTECTED]> 2008-03-13 15:47 >>>
> Hello,
> 
> Does anyone has an idea why sometimes:
> - select ... where ... in (select ...)
> is faster than :
> - select ... where ... exists(select ...)
> and sometimes it's the opposite ?
> 
> I had such a situation, I've pasted the queries on:
> http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html 
> 
> It's running PostgreSQL 8.1 with an effective_cache_size of 3.
> 
> specimens.id is the primary key and there are indexes on
> sequences(specimen_id) and specimen_measurements(specimen_id)
> 
> Is there a general "rule" to know when to use the in() version and
> when
> to use the exists() version ? Is it true to say that the exists()
> version is more scalable (with many rows) than the in() version
(from
> the little tests I made it seems the case) ?
> 
> Thanks,
> Julien
> 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be ( http://www.biodiversity.be/ )
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: [EMAIL PROTECTED] 
> @biobel: http://biobel.biodiversity.be/person/show/471 
> Tel : 02 650 57 52
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql 
> 
-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be ( http://www.biodiversity.be/ )
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: [EMAIL PROTECTED] 
@biobel: http://biobel.biodiversity.be/person/show/471 
Tel : 02 650 57 52


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] in() VS exists()

2008-03-13 Thread Julien
mmh no because it's a one to many relation (a specimen can have more
than one sequence) :

muridae=> select count(sp.id) from specimens sp INNER JOIN sequences s
on s.specimen_id = sp.id;
 count 
---
  1536
(1 row)

Time: 81.242 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
 count 
---
  1431
(1 row)

Time: 81.736 ms
muridae=> 

(of course this is a bad example, because I could just do: select
count(specimen_id) from sequences group by specimen_id;, but in my
application I have more fields coming from specimens of course)

Julien

On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> I think that just
> select count(sp.id) from specimens sp INNER JOIN sequences s on
> s.specimen_id = sp.id;
> should be enough
> 
> >>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:10 >>>
> If I understood well the query plan, the planner optimize the
> IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :
> 
> ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> time=93.971..201.908 rows=1431 loops=1)
>  Hash Cond: ("outer".id = "inner".specimen_id)
> 
> so I guess that :
> 
> select count(sp.id) from specimens sp where sp.id in (select
> specimen_id
> from sequences);
> 
> is almost the same as :
> 
> select count(sp.id) from specimens sp INNER JOIN (select specimen_id
> from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> sp.id;
> 
> ?
> 
> Thanks,
> Julien
> 
> On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > The chapter on indexes in the manual should give you a pretty good
> > idea on the why.
> > IN and EXISTS are not the only possibilities, you can also use inner
> > or outer joins.
> > Which solution performs best depends on the data, the database
> > version, the available indexes, ...
> > 
> > >>> Julien <[EMAIL PROTECTED]> 2008-03-13 15:47 >>>
> > Hello,
> > 
> > Does anyone has an idea why sometimes:
> > - select ... where ... in (select ...)
> > is faster than :
> > - select ... where ... exists(select ...)
> > and sometimes it's the opposite ?
> > 
> > I had such a situation, I've pasted the queries on:
> > http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html
> > 
> > It's running PostgreSQL 8.1 with an effective_cache_size of 3.
> > 
> > specimens.id is the primary key and there are indexes on
> > sequences(specimen_id) and specimen_measurements(specimen_id)
> > 
> > Is there a general "rule" to know when to use the in() version and
> > when
> > to use the exists() version ? Is it true to say that the exists()
> > version is more scalable (with many rows) than the in() version
> (from
> > the little tests I made it seems the case) ?
> > 
> > Thanks,
> > Julien
> > 
> > -- 
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: [EMAIL PROTECTED]
> > @biobel: http://biobel.biodiversity.be/person/show/471
> > Tel : 02 650 57 52
> > 
> > 
> > -- 
> > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql
> > 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: [EMAIL PROTECTED]
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
> 
-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: [EMAIL PROTECTED]
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] in() VS exists()

2008-03-13 Thread Bart Degryse
how about
select count(distinct s.specimen_id) from specimens sp INNER JOIN
sequences s 
on s.specimen_id = sp.id;


>>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:27 >>>
mmh no because it's a one to many relation (a specimen can have more
than one sequence) :

muridae=> select count(sp.id) from specimens sp INNER JOIN sequences s
on s.specimen_id = sp.id;
count 
---
  1536
(1 row)

Time: 81.242 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
count 
---
  1431
(1 row)

Time: 81.736 ms
muridae=> 

(of course this is a bad example, because I could just do: select
count(specimen_id) from sequences group by specimen_id;, but in my
application I have more fields coming from specimens of course)

Julien

On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> I think that just
> select count(sp.id) from specimens sp INNER JOIN sequences s on
> s.specimen_id = sp.id;
> should be enough
> 
> >>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:10 >>>
> If I understood well the query plan, the planner optimize the
> IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :
> 
> ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> time=93.971..201.908 rows=1431 loops=1)
>  Hash Cond: ("outer".id = "inner".specimen_id)
> 
> so I guess that :
> 
> select count(sp.id) from specimens sp where sp.id in (select
> specimen_id
> from sequences);
> 
> is almost the same as :
> 
> select count(sp.id) from specimens sp INNER JOIN (select specimen_id
> from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> sp.id;
> 
> ?
> 
> Thanks,
> Julien
> 
> On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > The chapter on indexes in the manual should give you a pretty good
> > idea on the why.
> > IN and EXISTS are not the only possibilities, you can also use
inner
> > or outer joins.
> > Which solution performs best depends on the data, the database
> > version, the available indexes, ...
> > 
> > >>> Julien <[EMAIL PROTECTED]> 2008-03-13 15:47 >>>
> > Hello,
> > 
> > Does anyone has an idea why sometimes:
> > - select ... where ... in (select ...)
> > is faster than :
> > - select ... where ... exists(select ...)
> > and sometimes it's the opposite ?
> > 
> > I had such a situation, I've pasted the queries on:
> > http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html

> > 
> > It's running PostgreSQL 8.1 with an effective_cache_size of 3.
> > 
> > specimens.id is the primary key and there are indexes on
> > sequences(specimen_id) and specimen_measurements(specimen_id)
> > 
> > Is there a general "rule" to know when to use the in() version and
> > when
> > to use the exists() version ? Is it true to say that the exists()
> > version is more scalable (with many rows) than the in() version
> (from
> > the little tests I made it seems the case) ?
> > 
> > Thanks,
> > Julien
> > 
> > -- 
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be ( http://www.biodiversity.be/ )
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: [EMAIL PROTECTED] 
> > @biobel: http://biobel.biodiversity.be/person/show/471 
> > Tel : 02 650 57 52
> > 
> > 
> > -- 
> > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql 
> > 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be ( http://www.biodiversity.be/ )
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: [EMAIL PROTECTED] 
> @biobel: http://biobel.biodiversity.be/person/show/471 
> Tel : 02 650 57 52
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql 
> 
-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be ( http://www.biodiversity.be/ )
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: [EMAIL PROTECTED] 
@biobel: http://biobel.biodiversity.be/person/show/471 
Tel : 02 650 57 52


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] in() VS exists()

2008-03-13 Thread Bart Degryse
The chapter on indexes in the manual  (
http://www.postgresql.org/docs/8.2/static/indexes.html )should give
you a pretty good idea on the why.
IN and EXISTS are not the only possibilities, you can also use inner or
outer joins.
Which solution performs best depends on the data, the database version,
the available indexes, ...

>>> Julien <[EMAIL PROTECTED]> 2008-03-13 15:47 >>>
Hello,

Does anyone has an idea why sometimes:
- select ... where ... in (select ...)
is faster than :
- select ... where ... exists(select ...)
and sometimes it's the opposite ?

I had such a situation, I've pasted the queries on:
http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html 

It's running PostgreSQL 8.1 with an effective_cache_size of 3.

specimens.id is the primary key and there are indexes on
sequences(specimen_id) and specimen_measurements(specimen_id)

Is there a general "rule" to know when to use the in() version and
when
to use the exists() version ? Is it true to say that the exists()
version is more scalable (with many rows) than the in() version (from
the little tests I made it seems the case) ?

Thanks,
Julien

-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be ( http://www.biodiversity.be/ )
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: [EMAIL PROTECTED] 
@biobel: http://biobel.biodiversity.be/person/show/471 
Tel : 02 650 57 52


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] in() VS exists()

2008-03-13 Thread Julien
>From my experience I tend to avoid SELECT DISTINCT queries because it's
usually very slow with many rows ...

For my specific case the result is the same:

muridae=> select count(distinct s.specimen_id) from specimens sp INNER
JOIN sequences s on s.specimen_id = sp.id;
 count 
---
  1431
(1 row)

Time: 65.351 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
 count 
---
  1431
(1 row)

Time: 66.371 ms

But to give an example, I have a table with ~1 000 000 rows where the
DISTINCT solution is more than 10 times slower :

muridae=> select count(distinct sp.id) from specimens sp INNER JOIN
specimen_measurements m ON m.specimen_id = sp.id;
 count 
---
 75241
(1 row)

Time: 15970.668 ms

muridae=> select count(sp.id) from specimens sp INNER JOIN (select
specimen_id from specimen_measurements GROUP BY specimen_id) as foo on
foo.specimen_id = sp.id;
 count 
---
 75241
(1 row)

Time: 1165.487 ms

Regards,
Julien

On Thu, 2008-03-13 at 15:28 +0100, Bart Degryse wrote:
> how about
> select count(distinct s.specimen_id) from specimens sp INNER JOIN
> sequences s 
> on s.specimen_id = sp.id;
> 
> 
> >>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:27 >>>
> mmh no because it's a one to many relation (a specimen can have more
> than one sequence) :
> 
> muridae=> select count(sp.id) from specimens sp INNER JOIN sequences s
> on s.specimen_id = sp.id;
> count 
> ---
>   1536
> (1 row)
> 
> Time: 81.242 ms
> muridae=> select count(sp.id) from specimens sp where sp.id in (select
> specimen_id from sequences group by specimen_id);
> count 
> ---
>   1431
> (1 row)
> 
> Time: 81.736 ms
> muridae=> 
> 
> (of course this is a bad example, because I could just do: select
> count(specimen_id) from sequences group by specimen_id;, but in my
> application I have more fields coming from specimens of course)
> 
> Julien
> 
> On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> > I think that just
> > select count(sp.id) from specimens sp INNER JOIN sequences s on
> > s.specimen_id = sp.id;
> > should be enough
> > 
> > >>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:10 >>>
> > If I understood well the query plan, the planner optimize the
> > IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :
> > 
> > ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> > time=93.971..201.908 rows=1431 loops=1)
> >  Hash Cond: ("outer".id = "inner".specimen_id)
> > 
> > so I guess that :
> > 
> > select count(sp.id) from specimens sp where sp.id in (select
> > specimen_id
> > from sequences);
> > 
> > is almost the same as :
> > 
> > select count(sp.id) from specimens sp INNER JOIN (select specimen_id
> > from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> > sp.id;
> > 
> > ?
> > 
> > Thanks,
> > Julien
> > 
> > On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > > The chapter on indexes in the manual should give you a pretty good
> > > idea on the why.
> > > IN and EXISTS are not the only possibilities, you can also use
> inner
> > > or outer joins.
> > > Which solution performs best depends on the data, the database
> > > version, the available indexes, ...
> > > 
> > > >>> Julien <[EMAIL PROTECTED]> 2008-03-13 15:47 >>>
> > > Hello,
> > > 
> > > Does anyone has an idea why sometimes:
> > > - select ... where ... in (select ...)
> > > is faster than :
> > > - select ... where ... exists(select ...)
> > > and sometimes it's the opposite ?
> > > 
> > > I had such a situation, I've pasted the queries on:
> > > http://rafb.net/p/KXNZ6892.html and
> http://rafb.net/p/jvo5DO38.html
> > > 
> > > It's running PostgreSQL 8.1 with an effective_cache_size of 3.
> > > 
> > > specimens.id is the primary key and there are indexes on
> > > sequences(specimen_id) and specimen_measurements(specimen_id)
> > > 
> > > Is there a general "rule" to know when to use the in() version and
> > > when
> > > to use the exists() version ? Is it true to say that the exists()
> > > version is more scalable (with many rows) than the in() version
> > (from
> > > the little tests I made it seems the case) ?
> > > 
> > > Thanks,
> > > Julien
> > > 
> > > -- 
> > > Julien Cigar
> > > Belgian Biodiversity Platform
> > > http://www.biodiversity.be
> > > Université Libre de Bruxelles (ULB)
> > > Campus de la Plaine CP 257
> > > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > > Boulevard du Triomphe, entrée ULB 2
> > > B-1050 Bruxelles
> > > Mail: [EMAIL PROTECTED]
> > > @biobel: http://biobel.biodiversity.be/person/show/471
> > > Tel : 02 650 57 52
> > > 
> > > 
> > > -- 
> > > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-sql
> > > 
> > -- 
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (

Re: [SQL] in() VS exists()

2008-03-13 Thread Bart Degryse
So it all comes back to what I wrote in the beginning:
"Which solution performs best depends on the data, the database
version, the available indexes, ..."
Tips:
  - be aware that statements can be written in different ways
  - test them on realistic data
  - use explain to tune your statements


>>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:50 >>>
>From my experience I tend to avoid SELECT DISTINCT queries because
it's
usually very slow with many rows ...

For my specific case the result is the same:

muridae=> select count(distinct s.specimen_id) from specimens sp INNER
JOIN sequences s on s.specimen_id = sp.id;
count 
---
  1431
(1 row)

Time: 65.351 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
count 
---
  1431
(1 row)

Time: 66.371 ms

But to give an example, I have a table with ~1 000 000 rows where the
DISTINCT solution is more than 10 times slower :

muridae=> select count(distinct sp.id) from specimens sp INNER JOIN
specimen_measurements m ON m.specimen_id = sp.id;
count 
---
75241
(1 row)

Time: 15970.668 ms

muridae=> select count(sp.id) from specimens sp INNER JOIN (select
specimen_id from specimen_measurements GROUP BY specimen_id) as foo on
foo.specimen_id = sp.id;
count 
---
75241
(1 row)

Time: 1165.487 ms

Regards,
Julien

On Thu, 2008-03-13 at 15:28 +0100, Bart Degryse wrote:
> how about
> select count(distinct s.specimen_id) from specimens sp INNER JOIN
> sequences s 
> on s.specimen_id = sp.id;
> 
> 
> >>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:27 >>>
> mmh no because it's a one to many relation (a specimen can have more
> than one sequence) :
> 
> muridae=> select count(sp.id) from specimens sp INNER JOIN sequences
s
> on s.specimen_id = sp.id;
> count 
> ---
>   1536
> (1 row)
> 
> Time: 81.242 ms
> muridae=> select count(sp.id) from specimens sp where sp.id in
(select
> specimen_id from sequences group by specimen_id);
> count 
> ---
>   1431
> (1 row)
> 
> Time: 81.736 ms
> muridae=> 
> 
> (of course this is a bad example, because I could just do: select
> count(specimen_id) from sequences group by specimen_id;, but in my
> application I have more fields coming from specimens of course)
> 
> Julien
> 
> On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> > I think that just
> > select count(sp.id) from specimens sp INNER JOIN sequences s on
> > s.specimen_id = sp.id;
> > should be enough
> > 
> > >>> Julien <[EMAIL PROTECTED]> 2008-03-13 17:10 >>>
> > If I understood well the query plan, the planner optimize the
> > IN(SELECT ...) version with a JOIN (line 19-20 of the first paste)
:
> > 
> > ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> > time=93.971..201.908 rows=1431 loops=1)
> >  Hash Cond: ("outer".id = "inner".specimen_id)
> > 
> > so I guess that :
> > 
> > select count(sp.id) from specimens sp where sp.id in (select
> > specimen_id
> > from sequences);
> > 
> > is almost the same as :
> > 
> > select count(sp.id) from specimens sp INNER JOIN (select
specimen_id
> > from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> > sp.id;
> > 
> > ?
> > 
> > Thanks,
> > Julien
> > 
> > On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > > The chapter on indexes in the manual should give you a pretty
good
> > > idea on the why.
> > > IN and EXISTS are not the only possibilities, you can also use
> inner
> > > or outer joins.
> > > Which solution performs best depends on the data, the database
> > > version, the available indexes, ...
> > > 
> > > >>> Julien <[EMAIL PROTECTED]> 2008-03-13 15:47 >>>
> > > Hello,
> > > 
> > > Does anyone has an idea why sometimes:
> > > - select ... where ... in (select ...)
> > > is faster than :
> > > - select ... where ... exists(select ...)
> > > and sometimes it's the opposite ?
> > > 
> > > I had such a situation, I've pasted the queries on:
> > > http://rafb.net/p/KXNZ6892.html and
> http://rafb.net/p/jvo5DO38.html 
> > > 
> > > It's running PostgreSQL 8.1 with an effective_cache_size of
3.
> > > 
> > > specimens.id is the primary key and there are indexes on
> > > sequences(specimen_id) and specimen_measurements(specimen_id)
> > > 
> > > Is there a general "rule" to know when to use the in() version
and
> > > when
> > > to use the exists() version ? Is it true to say that the
exists()
> > > version is more scalable (with many rows) than the in() version
> > (from
> > > the little tests I made it seems the case) ?
> > > 
> > > Thanks,
> > > Julien
> > > 
> > > -- 
> > > Julien Cigar
> > > Belgian Biodiversity Platform
> > > http://www.biodiversity.be ( http://www.biodiversity.be/ )
> > > Université Libre de Bruxelles (ULB)
> > > Campus de la Plaine CP 257
> > > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > > Boulevard du Triomphe, entrée ULB 2
> > > B-1050 Bruxelles
> > > Mail: [EMAIL PROTECTED] 
> > > @biobel: http://biobel.biodiversity.be/person/show/471 
> > > Tel : 02 650 57 52
> > > 
>

[SQL] Counting days ...

2008-03-13 Thread Aarni Ruuhimäki
Hi all,

A bit stuck here with something I know I can do with output / loops / 
filtering in the (web)application but want to do in SQL or within PostgreSQL.

Simply said, count days of accommodation for a given time period.

E.g.

res_id 1, start_day 2008-01-25, end_day 2008-02-15, number of persons 6
res_id 2, start_day 2008-02-10, end_day 2008-02-15, number of persons 4

for the period from 2008-02-01 to 2008-02-29 these two rows would give a total 
of

15 days x 6 persons + 4 days x 5 persons = 110 days

SELECT SUM(
CASE
WHEN res_start_day >= '2008-01-01' THEN
(res_end_day - res_start_day)
ELSE (res_end_day - (DATE '2008-01-01' - INTEGER '1'))
END
* group_size) AS days_in_period
FROM product_res pr
WHERE res_end_day >= '2008-01-01' AND res_end_day <= '2008-12-31';

 days_in_period

   68383
(1 row)

Ok, but a reservation can be of any nationality / country:

SELECT count(country_id) FROM countries;
 count
---
   243
(1 row)

Country_id is also stored in the product_res table.

I would like to, or need to, get the total split into different nationalities, 
like:

FI 12345
RU 9876
DE 4321
...

Anyone ?

With very best regards,

-- 
Aarni Ruuhimäki
---
Burglars usually come in through your windows.
---

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Counting days ...

2008-03-13 Thread Steve Crawford

Aarni Ruuhimäki wrote:

Hi all,

A bit stuck here with something I know I can do with output / loops / 
filtering in the (web)application but want to do in SQL or within PostgreSQL.


Simply said, count days of accommodation for a given time period.

E.g.

res_id 1, start_day 2008-01-25, end_day 2008-02-15, number of persons 6
res_id 2, start_day 2008-02-10, end_day 2008-02-15, number of persons 4

for the period from 2008-02-01 to 2008-02-29 these two rows would give a total 
of


15 days x 6 persons + 4 days x 5 persons = 110 days

SELECT SUM(
CASE
WHEN res_start_day >= '2008-01-01' THEN
(res_end_day - res_start_day)
ELSE (res_end_day - (DATE '2008-01-01' - INTEGER '1'))
END
* group_size) AS days_in_period
FROM product_res pr
WHERE res_end_day >= '2008-01-01' AND res_end_day <= '2008-12-31';
  

This appears fraught with off-by-one and other errors.

For res_id 1 limited to the month of February you do indeed have 6 
persons and 15 days = 90 person-days as you are including day 1 and day 15.


If you use the same inclusive counting of days for res_id 2, you have 4 
persons (don't know where 5 came from) and 6 days for 24 person-days.


I'm making an assumption that you have reservations with arbitrary start 
and end dates (assumed to be inclusive of both start and end) along with 
group size and you want to see the person-days utilized within a 
specified period.


First, to simply establish upper and lower bounds, 
date_larger/date_smaller seems a lot easier - ie. for February inclusive 
dates:


select
sum (
((date_smaller(res_end_day, '2008-02-29'::date)
 - date_larger(res_start_day, '2008-02-01'::date))::int + 1) * group_size
) as person_days;


Country_id is also stored in the product_res table.

I would like to, or need to, get the total split into different nationalities, 
like:


FI 12345
RU 9876
DE 4321
...
  

OK.

select
country_id,
sum (
((date_smaller(res_end_day, '2008-02-29'::date)
 - date_larger(res_start_day, '2008-02-01'::date))::int + 1) * group_size
) as person_days
group by country_id;

Add where-clauses to either for efficiency.

Cheers,
Steve

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Counting days ...

2008-03-13 Thread Aarni Ruuhimäki
This was superfast, thank you !

On Thursday 13 March 2008 20:58, Steve Crawford wrote:
> Aarni Ruuhimäki wrote:

> > res_id 2, start_day 2008-02-10, end_day 2008-02-15, number of persons 4
> >
>
> If you use the same inclusive counting of days for res_id 2, you have 4
> persons (don't know where 5 came from) and 6 days for 24 person-days.
>

Sorry, my bad, I meant 5 days x 4 persons = 20 person-days. From 10 to 15 it 
is 5 'days' ? Arrival 10th and departure 15th makes 5 'nights' or 'days', 
whatever one might call it, statistical accommodation units.

Likewise, res_id x, start_day some time before Feb 08, end_day 2008-02-16 for 
a statistical period Feb 08 makes 16 units ?

>
> First, to simply establish upper and lower bounds,
> date_larger/date_smaller seems a lot easier - ie. for February inclusive
> dates:
>
> select
> sum (
> ((date_smaller(res_end_day, '2008-02-29'::date)
>   - date_larger(res_start_day, '2008-02-01'::date))::int + 1) * group_size
> ) as person_days;
>
> > Country_id is also stored in the product_res table.
> >
> > I would like to, or need to, get the total split into different
> > nationalities, like:
> >
> > FI 12345
> > RU 9876
> > DE 4321
> > ...
>
> OK.
>
> select
> country_id,
> sum (
> ((date_smaller(res_end_day, '2008-02-29'::date)
>   - date_larger(res_start_day, '2008-02-01'::date))::int + 1) * group_size
> ) as person_days
> group by country_id;

Ok, thanks, I'll dig into this.

>
> Add where-clauses to either for efficiency.

AND region_id = xIsDefined(company_id), AND company_id = x 
IsDefined(product_id), AND product_id = x
>
> Cheers,
> Steve

 days_in_period |  country
+
   5519 | Unknown
 16 | Germany
 18 | Estonia
   3061 | Russian Federation
   1491 | Suomi
 20 | Ukraine
(6 rows)

Getting there.

Cheers to all you amazingly helpful folks out there,

-- 
Aarni Ruuhimäki
---
Burglars usually come in through your windows.
---

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Counting days ...

2008-03-13 Thread Steve Crawford


Sorry, my bad, I meant 5 days x 4 persons = 20 person-days. From 10 to 15 it 
is 5 'days' ? Arrival 10th and departure 15th makes 5 'nights' or 'days', 
whatever one might call it, statistical accommodation units.


Likewise, res_id x, start_day some time before Feb 08, end_day 2008-02-16 for 
a statistical period Feb 08 makes 16 units ?


  

First, to simply establish upper and lower bounds,
date_larger/date_smaller seems a lot easier - ie. for February inclusive
dates:

select
sum (
((date_smaller(res_end_day, '2008-02-29'::date)
  - date_larger(res_start_day, '2008-02-01'::date))::int + 1) * group_size
) as person_days;



Check my work, but I think the sum part of the query simply becomes:

sum (
 (
 date_smaller(res_end_day, '2008-02-29'::date) -
 date_larger(res_start_day, '2008-01-31'::date)
 ) * group_size
)

Basically remove the "+1" so we don't include both start and end dates 
but move the start base back one day so anyone starting prior to Feb 1 
gets the extra day added.


Cheers,
Steve


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] cursors and sessions

2008-03-13 Thread chester c young
is there any way to share a cursor between sessions?

I have a costly query whose records need to be visited by in order by n
number of concurrent sessions, and am unable to find the best way of
doing this.

I'd almost like to write a daemon that hands out the next record, but
that's a royal pain to set up and maintain.


  

Looking for last minute shopping deals?  
Find them fast with Yahoo! Search.  
http://tools.search.yahoo.com/newsearch/category.php?category=shopping

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] cursors and sessions

2008-03-13 Thread Aaron Bono
On Thu, Mar 13, 2008 at 5:54 PM, chester c young <[EMAIL PROTECTED]>
wrote:

> is there any way to share a cursor between sessions?
>
> I have a costly query whose records need to be visited by in order by n
> number of concurrent sessions, and am unable to find the best way of
> doing this.
>
> I'd almost like to write a daemon that hands out the next record, but
> that's a royal pain to set up and maintain.
>  
>

Why not pump your data into a table that mirrors the cursor and then create
a function that pops items off the top of this queuing table as needed by
the different sessions?

-- 
==
Aaron Bono
Aranya Software Technologies, Inc.
http://www.aranya.com
http://codeelixir.com
==