[PERFORM] select distinct on varchar -- wild performance differences!
Hey, all. I've bounced this around in #postgres for an hour or so, and it was suggested that I post it here as well. Hopefully someone can help me out. I have three machines. All have 512MB of ram. Machine A is a 2.0ghz celeron, running debian, pg verison 7.4.6. Machine B is a 1.8ghz celeron, running centos 3.4, pg verison 8.0.3. (7.3.9 also exhibited the behaviour below, by the way) Machine C is a 1.0ghz athlon, running centos 4.0, pg verison 7.4.7. The SAME data and schema is loaded (from a pg_dump, default parameters) onto all three machines. With the same query: select distinct model from exif_common, machines A and C return results quickly (1/4 second). Machine B chews on it for 30ish seconds! Note, this column is a VARCHAR(40). Here's an explain analyze for it. Machine A (fast): photos=# explain analyze select distinct model from exif_common; QUERY PLAN -- Unique (cost=2629.74..2732.11 rows=5 width=15) (actual time=211.358..265.049 rows=6 loops=1) - Sort (cost=2629.74..2680.93 rows=20473 width=15) (actual time=211.351..242.296 rows=20473 loops=1) Sort Key: model - Seq Scan on exif_common (cost=0.00..1163.73 rows=20473 width=15) (actual time=0.022..58.635 rows=20473 loops=1) Total runtime: 265.928 ms (5 rows) Machine B (slow): photos=# explain analyze select distinct model from exif_common; QUERY PLAN -- Unique (cost=2640.74..2743.11 rows=6 width=15) (actual time=27939.231..32914.134 rows=6 loops=1) - Sort (cost=2640.74..2691.93 rows=20473 width=15) (actual time=27939.222..27983.784 rows=20473 loops=1) Sort Key: model - Seq Scan on exif_common (cost=0.00..1174.73 rows=20473 width=15) (actual time=0.071..97.772 rows=20473 loops=1) Total runtime: 32915.031 ms (5 rows) ( yes, i know, six distinct rows out of 20,000 But holy moly! 1/4 sec vs 32.9 sec?!?! ) Now, if I do a similar query against an INT column, the speeds are more in line with each other: Machine A: photos=# explain analyze select distinct imagewidth from exif_common; QUERY PLAN - Unique (cost=2629.74..2732.11 rows=36 width=4) (actual time=179.899..225.934 rows=107 loops=1) - Sort (cost=2629.74..2680.93 rows=20473 width=4) (actual time=179.891..207.632 rows=20473 loops=1) Sort Key: imagewidth - Seq Scan on exif_common (cost=0.00..1163.73 rows=20473 width=4) (actual time=0.024..62.946 rows=20473 loops=1) Total runtime: 226.707 ms (5 rows) Machine B: photos=# explain analyze select distinct imagewidth from exif_common; QUERY PLAN - Unique (cost=2640.74..2743.11 rows=24 width=4) (actual time=209.394..287.131 rows=107 loops=1) - Sort (cost=2640.74..2691.93 rows=20473 width=4) (actual time=209.384..251.693 rows=20473 loops=1) Sort Key: imagewidth - Seq Scan on exif_common (cost=0.00..1174.73 rows=20473 width=4) (actual time=0.074..94.574 rows=20473 loops=1) Total runtime: 288.411 ms (5 rows) Machine C exhibits the same behaviour as A for all queries. This weird slow behaviour on machine B also appeared in 7.3.9. Upgrading didn't seem to help. neilc from irc thought it may be a qsort(2) quirk, but a sample C program I whipped up testing different sized data sets with a similar distribution gave very similar sort timings between the three machines.. Therefore, I don't think it's qsort(2) to blame... Anyone have any ideas as to what may be up with machine B? Thanks, -Elliott ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[PERFORM] select distinct on varchar - wild performance differences!
Hey, all. I've bounced this around in #postgres for an hour or so, and it was suggested that I post it here as well. Hopefully someone can help me out. I have three machines. All have 512MB of ram. Machine A is a 2.0ghz celeron, running debian, pg verison 7.4.6. Machine B is a 1.8ghz celeron, running centos 3.4, pg verison 8.0.3. (7.3.9 also exhibited the behaviour below, by the way) Machine C is a 1.0ghz athlon, running centos 4.0, pg verison 7.4.7. The SAME data and schema is loaded (from a pg_dump, default parameters) onto all three machines. With the same query: select distinct model from exif_common, machines A and C return results quickly (1/4 second). Machine B chews on it for 30ish seconds! Note, this column is a VARCHAR(40). Here's an explain analyze for it. Machine A (fast): photos=# explain analyze select distinct model from exif_common; QUERY PLAN -- +-- Unique (cost=2629.74..2732.11 rows=5 width=15) (actual time=211.358..265.049 rows=6 loops=1) - Sort (cost=2629.74..2680.93 rows=20473 width=15) (actual time=211.351..242.296 rows=20473 loops=1) Sort Key: model - Seq Scan on exif_common (cost=0.00..1163.73 rows=20473 width=15) (actual time=0.022..58.635 rows=20473 loops=1) Total runtime: 265.928 ms (5 rows) Machine B (slow): photos=# explain analyze select distinct model from exif_common; QUERY PLAN +-- Unique (cost=2640.74..2743.11 rows=6 width=15) (actual time=27939.231..32914.134 rows=6 loops=1) - Sort (cost=2640.74..2691.93 rows=20473 width=15) (actual time=27939.222..27983.784 rows=20473 loops=1) Sort Key: model - Seq Scan on exif_common (cost=0.00..1174.73 rows=20473 width=15) (actual time=0.071..97.772 rows=20473 loops=1) Total runtime: 32915.031 ms (5 rows) ( yes, i know, six distinct rows out of 20,000 But holy moly! 1/4 sec vs 32.9 sec?!?! ) Now, if I do a similar query against an INT column, the speeds are more in line with each other: Machine A: photos=# explain analyze select distinct imagewidth from exif_common; QUERY PLAN +- Unique (cost=2629.74..2732.11 rows=36 width=4) (actual time=179.899..225.934 rows=107 loops=1) - Sort (cost=2629.74..2680.93 rows=20473 width=4) (actual time=179.891..207.632 rows=20473 loops=1) Sort Key: imagewidth - Seq Scan on exif_common (cost=0.00..1163.73 rows=20473 width=4) (actual time=0.024..62.946 rows=20473 loops=1) Total runtime: 226.707 ms (5 rows) Machine B: photos=# explain analyze select distinct imagewidth from exif_common; QUERY PLAN +- Unique (cost=2640.74..2743.11 rows=24 width=4) (actual time=209.394..287.131 rows=107 loops=1) - Sort (cost=2640.74..2691.93 rows=20473 width=4) (actual time=209.384..251.693 rows=20473 loops=1) Sort Key: imagewidth - Seq Scan on exif_common (cost=0.00..1174.73 rows=20473 width=4) (actual time=0.074..94.574 rows=20473 loops=1) Total runtime: 288.411 ms (5 rows) Machine C exhibits the same behaviour as A for all queries. This weird slow behaviour on machine B also appeared in 7.3.9. Upgrading didn't seem to help. neilc from irc thought it may be a qsort(2) quirk, but a sample C program I whipped up testing different sized data sets with a similar distribution gave very similar sort timings between the three machines.. Therefore, I don't think it's qsort(2) to blame... Anyone have any ideas as to what may be up with machine B? Thanks, -Elliott ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] select distinct on varchar - wild performance differences!
Elliott Bennett [EMAIL PROTECTED] writes: Anyone have any ideas as to what may be up with machine B? Different locale setting? strcoll() can be horribly slow in some locales ... regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] select distinct on varchar - wild performance differences!
hah! That did it. Setting to 'C' makes it just as fast as the other machines. I think it defaulted to en_US... Thanks! -Elliott On Thu, Jun 23, 2005 at 11:34:55AM -0400, Tom Lane wrote: Elliott Bennett [EMAIL PROTECTED] writes: Anyone have any ideas as to what may be up with machine B? Different locale setting? strcoll() can be horribly slow in some locales ... regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org