Re: [PERFORM] Slow SELECT by primary key? Postgres 9.1.2
On 28.05.2013, at 2:17, John Mudd <johnbm...@gmail.com> wrote:

> Thanks again. Well, I have two problems with using the CLUSTER option. It's only temporary, since any updates, depending on how much free space is reserved per page, require re-running the CLUSTER. And my primary concern is that it arbitrarily gives an unfair advantage to the primary key SELECT. Still, it's easy to test, so here are the results. The primary key still loses even with the CLUSTER. Granted, it is close, but considering this is now an unfair comparison, it still doesn't make sense to me. How can a search for a specific row, which should be fairly straightforward, take longer than a search that includes an ORDER BY clause?

Well, you get just a regular index scan because of the LIMIT 1, and then it is just a matter of index size and table organization. I also don't understand why you consider CLUSTER unfair - the way you populated the table was a natural cluster over my_key.

But it bothers me why my_key is always better. Can you please test it on different values but the same rows? Because right now these are two different tuples, and you are counting every I/O.
test=# CLUSTER test_select USING test_select_pkey;
CLUSTER
test=# VACUUM ANALYZE test_select;
VACUUM

(stopped postgres; reset O/S cache; started postgres)

test=# explain analyze SELECT * FROM test_select WHERE key1 = 50 ORDER BY key1, key2, key3, id LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.08 rows=1 width=21) (actual time=19.430..19.431 rows=1 loops=1)
   ->  Index Scan using my_key on test_select  (cost=0.00..41938.15 rows=42 width=21) (actual time=19.428..19.428 rows=1 loops=1)
         Index Cond: (key1 = 50)
 Total runtime: 19.526 ms

(stopped postgres; reset O/S cache; started postgres)

test=# explain analyze SELECT * FROM test_select WHERE id = 50;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_select_pkey on test_select  (cost=0.00..8.36 rows=1 width=21) (actual time=21.070..21.072 rows=1 loops=1)
   Index Cond: (id = 50)
 Total runtime: 21.178 ms

On Mon, May 27, 2013 at 10:59 AM, Evgeny Shishkin <itparan...@gmail.com> wrote:

> On May 27, 2013, at 6:35 PM, John Mudd <johnbm...@gmail.com> wrote:
>
>> Thanks, that's easy enough to test. Didn't seem to help though.
>
> Ok. And if you CLUSTER the table USING the PK?
test=# REINDEX INDEX test_select_pkey;
REINDEX
test=# VACUUM ANALYZE test_select;
VACUUM

(stopped postgres; reset O/S cache; started postgres)

test=# explain analyze SELECT * FROM test_select WHERE key1 = 50 ORDER BY key1, key2, key3, id LIMIT 1;
                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.08 rows=1 width=21) (actual time=16.368..16.369 rows=1 loops=1)
   ->  Index Scan using my_key on test_select  (cost=0.00..41981.16 rows=501333 width=21) (actual time=16.366..16.366 rows=1 loops=1)
         Index Cond: (key1 = 50)
 Total runtime: 16.444 ms

(stopped postgres; reset O/S cache; started postgres)

test=# explain analyze SELECT * FROM test_select WHERE id = 50;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_select_pkey on test_select  (cost=0.00..8.36 rows=1 width=21) (actual time=23.072..23.074 rows=1 loops=1)
   Index Cond: (id = 50)
 Total runtime: 23.192 ms

On Mon, May 27, 2013 at 10:21 AM, Evgeny Shishkin <itparan...@gmail.com> wrote:

> On May 27, 2013, at 6:02 PM, John Mudd <johnbm...@gmail.com> wrote:
>
>> Postgres 9.1.2 on Ubuntu 12.04
>>
>> Any reason why a select by primary key would be slower than a select that includes an ORDER BY? I was really hoping using the primary key would give me a boost.
>
> You created my_key after data loading, and the PK was there all the time. If you REINDEX the PK, I bet it will be as fast.

>> I stopped the server and cleared the O/S cache using "sync; echo 3 > /proc/sys/vm/drop_caches" between the runs.
>>
>> test=# VACUUM ANALYZE test_select;
>> VACUUM
>>
>> (stopped postgres; reset O/S cache; started postgres)
>>
>> test=# explain analyze SELECT * FROM test_select WHERE key1 = 50 ORDER BY key1, key2, key3, id LIMIT 1;
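Evgeniy's suggestion (compare lookups that land on the same physical row through both indexes, so each test pays for the same tuple's I/O) can be sketched outside Postgres. The following is a minimal illustration using SQLite via Python; the table and index names mirror the thread, but the data and the engine are stand-ins, so it only shows the shape of the experiment, not Postgres timings.

```python
import sqlite3

# Build a small table shaped like test_select: id is the PK,
# my_key is the composite index from the thread.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE test_select (
    id INTEGER PRIMARY KEY, key1 INTEGER, key2 INTEGER, key3 INTEGER, data TEXT)""")
cur.executemany(
    "INSERT INTO test_select VALUES (?,?,?,?,?)",
    [(i, i * 7 % 1000, i * 13 % 1000, i * 17 % 1000, "x") for i in range(1, 1001)])
cur.execute("CREATE INDEX my_key ON test_select (key1, key2, key3, id)")

# Pick one concrete row, then fetch that SAME row through each access path.
target = cur.execute("SELECT * FROM test_select WHERE id = 50").fetchone()
_, k1, k2, k3, _ = target
via_pk = cur.execute("SELECT * FROM test_select WHERE id = ?", (50,)).fetchone()
via_my_key = cur.execute(
    "SELECT * FROM test_select WHERE key1 = ? AND key2 = ? AND key3 = ? "
    "ORDER BY key1, key2, key3, id LIMIT 1", (k1, k2, k3)).fetchone()
print(via_pk == via_my_key)  # both paths return the identical tuple
```

With both queries touching the same tuple, any remaining difference in timing comes down to index size and traversal cost rather than which heap page happens to be read.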
Re: [PERFORM] Slow SELECT by primary key? Postgres 9.1.2
John,

Can you please also include BUFFERS in the EXPLAIN ANALYZE?

Regards,
Roman Konoval

On Tue, May 28, 2013 at 9:48 AM, Evgeniy Shishkin <itparan...@gmail.com> wrote:

> But it bothers me why my_key is always better. Can you please test it on different values but the same rows? Because right now these are two different tuples, and you are counting every I/O.
Re: [PERFORM] Slow SELECT by primary key? Postgres 9.1.2
On Mon, May 27, 2013 at 11:02 AM, John Mudd <johnbm...@gmail.com> wrote:

> Postgres 9.1.2 on Ubuntu 12.04
>
> Any reason why a select by primary key would be slower than a select that includes an ORDER BY? I was really hoping using the primary key would give me a boost.
>
> I stopped the server and cleared the O/S cache using "sync; echo 3 > /proc/sys/vm/drop_caches" between the runs.
>
> test=# VACUUM ANALYZE test_select;
> VACUUM
>
> (stopped postgres; reset O/S cache; started postgres)
>
> test=# explain analyze SELECT * FROM test_select WHERE key1 = 50 ORDER BY key1, key2, key3, id LIMIT 1;
>                                                              QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.08 rows=1 width=21) (actual time=12.599..12.600 rows=1 loops=1)
>    ->  Index Scan using my_key on test_select  (cost=0.00..41895.49 rows=498724 width=21) (actual time=12.597..12.597 rows=1 loops=1)
>          Index Cond: (key1 = 50)
>  Total runtime: 12.678 ms
>
> (stopped postgres; reset O/S cache; started postgres)
>
> test=# explain analyze SELECT * FROM test_select WHERE id = 50;
>                                                           QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------
>  Index Scan using test_select_pkey on test_select  (cost=0.00..8.36 rows=1 width=21) (actual time=31.396..31.398 rows=1 loops=1)
>    Index Cond: (id = 50)
>  Total runtime: 31.504 ms
>
> Schema:
>
> test=# \d test_select
>                            Table "public.test_select"
>  Column |     Type     |                        Modifiers
> --------+--------------+----------------------------------------------------------
>  id     | integer      | not null default nextval('test_select_id_seq'::regclass)
>  key1   | integer      |
>  key2   | integer      |
>  key3   | integer      |
>  data   | character(4) |
> Indexes:
>     "test_select_pkey" PRIMARY KEY, btree (id)
>     "my_key" btree (key1, key2, key3, id)
>
> Sample data:
>
> test=# SELECT * FROM test_select LIMIT 10;
>  id |  key1  |  key2  |  key3  | data
> ----+--------+--------+--------+------
>   1 | 984966 | 283954 | 772063 | x
>   2 | 817668 | 393533 | 924888 | x
>   3 | 751039 | 798753 | 454309 | x
>   4 | 128505 | 329643 | 280553 | x
>   5 | 105600 | 257225 | 710015 | x
>   6 | 323891 | 615614 |  83206 | x
>   7 | 194054 |  63506 | 353171 | x
>   8 | 212068 | 881225 | 271804 | x
>   9 | 644180 |  26693 | 200738 | x
>  10 | 136586 | 498699 | 554417 | x
> (10 rows)

To me it looks like the my_key index should be better than the PK in this case.
For a few reasons:

1. You are using an ORDER BY that has the same columns (in the same order) as your my_key index, so PG only needs to navigate the index.

2. You are using LIMIT 1, which means PG only needs to fetch the first entry with key1 = 50 (and can stop the search right after it).

In the case of your PK, PG will also navigate through the index and return only one value, but the number of entries it needs to look at is bigger, because id has more distinct values than key1.

Regards,
--
Matheus de Oliveira
Database Analyst
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres
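Matheus's point 1 (an ORDER BY that matches a composite index column-for-column lets the database walk the index in order and stop at LIMIT 1, with no separate sort) can be demonstrated with EXPLAIN QUERY PLAN. A minimal sketch, using SQLite via Python as a stand-in for Postgres (the plan text differs between engines, but the principle is the same):

```python
import sqlite3

# Same shape as the thread's table: integer PK plus the composite my_key index.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test_select "
            "(id INTEGER PRIMARY KEY, key1 INT, key2 INT, key3 INT, data TEXT)")
cur.executemany("INSERT INTO test_select VALUES (?,?,?,?,?)",
                [(i, i % 100, i % 50, i % 25, "x") for i in range(1, 2001)])
cur.execute("CREATE INDEX my_key ON test_select (key1, key2, key3, id)")

# The WHERE equality plus the index-order ORDER BY are both satisfied by
# my_key, so the plan is a single index search with no sort step.
plan = " ".join(row[-1] for row in cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM test_select "
    "WHERE key1 = 50 ORDER BY key1, key2, key3, id LIMIT 1"))
print(plan)  # mentions my_key as the access path
```

By contrast, the primary-key lookup has to descend a (typically larger, and in the thread's case more bloated) index on id alone, which is why REINDEX and caching effects dominate that comparison.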
Re: [PERFORM] Performance of complicated query
I have been working on this query, and I was able to modify it and get its run time cut in half. Here's where it is right now:

SELECT first_name, last_name, serial_number, latest_read, value, lifetime_value, lifetime.patient_id
FROM (
    SELECT DISTINCT patient_id, first_name, last_name,
           MAX(read_datetime) OVER (PARTITION BY patient_id) AS latest_read,
           SUM(value) OVER (PARTITION BY patient_id) AS value,
           first_value(serial_number) OVER (PARTITION BY patient_id ORDER BY read_datetime DESC) AS serial_number
    FROM (
        SELECT patient_devices.device_id, patient_id,
               MAX(value - issued_value) AS value,
               MAX(read_datetime) AS read_datetime
        FROM read_reads
        INNER JOIN patient_devices ON patient_devices.device_id = read_reads.device_id
            AND read_datetime >= issuance_datetime
            AND read_datetime <= COALESCE(unassignment_datetime, 'infinity'::timestamp)
        WHERE read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
        GROUP BY patient_devices.id
    ) AS first
    INNER JOIN devices ON devices.id = device_id
    INNER JOIN patients ON patient_id = patients.id
) AS filtered
INNER JOIN (
    SELECT patient_id, SUM(value) AS lifetime_value
    FROM (
        SELECT patient_id, MAX(value - issued_value) AS value
        FROM read_reads
        INNER JOIN patient_devices ON patient_devices.device_id = read_reads.device_id
            AND read_datetime >= issuance_datetime
            AND read_datetime <= COALESCE(unassignment_datetime, 'infinity'::timestamp)
        GROUP BY patient_devices.id
    ) AS first
    GROUP BY patient_id
) AS lifetime ON filtered.patient_id = lifetime.patient_id

I think the key to cutting it down was moving some of the joins up a level. Even though this is faster, I'd still like to cut it down a bunch more (as this will be run a lot in my application). Any more insight would be greatly appreciated. A summary of explain (analyze, buffers) can be found at http://explain.depesz.com/s/qx7f.

Thanks

On Thu, May 23, 2013 at 5:21 PM, Jonathan Morra <jonmo...@gmail.com> wrote:

> Sorry for the messy query, I'm very new to writing these complex queries.
> I'll try and make it easier to read by using WITH clauses. However, just to clarify: the WITH clauses only increase readability, not performance, right?

On Thu, May 23, 2013 at 4:22 PM, james <ja...@mansionfamily.plus.com> wrote:

> On 23/05/2013 22:57, Jonathan Morra wrote:
>> I'm not sure I understand your proposed solution. There is also the case to consider where the same patient can be assigned the same device multiple times. In this case, the value may be reset at each assignment (hence the line "value - issued_value AS value" in the original query).
>
> Perhaps you could use triggers to help, at least for the lifetime part. For a given assignment of a device to a patient, only the last value is useful, so you can maintain that easily enough (a bit like a materialised view, but before 9.3, I guess).
>
> But that might fix 'lifetime' and not some arbitrary windowed view. I can see why an 'as at' end time is useful, but not why a start time is: if a device has readings before the window but not in the window, is that 'no reading', or should the last reading prior to the window apply?
>
> It also seems to me that the solution you have is hard to reason about. It's like a Haskell program done in one big inline fold rather than a bunch of 'where' clauses, and I find these cause significant brain overload.
>
> Perhaps you could break it out into identifiable chunks that work out (both for lifetime, if not using triggers, and for your date range otherwise) the readings that are not superseded (i.e. the last within the date bounds for a device assignment), and then work with those. Consider CTEs ('WITH' queries) for doing this. It seems to me that if you can do this, the problem might be easier to express.
>
> Failing that, I'd be looking at using temporary tables, and forcing a series of reduce steps using them, but then I'm a nasty old Sybase hacker at heart. ;-)
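The "identifiable chunks" idea above can be sketched concretely: name the "net reading per device assignment" step as a CTE, then aggregate it per patient. A minimal sketch, using SQLite via Python with a heavily simplified schema (the assignment-window conditions on issuance_datetime/unassignment_datetime are omitted for brevity, and the data is invented):

```python
import sqlite3

# Simplified stand-ins for the thread's patient_devices and read_reads tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE patient_devices (
    id INTEGER PRIMARY KEY, patient_id INT, device_id INT, issued_value INT);
CREATE TABLE read_reads (device_id INT, read_datetime TEXT, value INT);
INSERT INTO patient_devices VALUES (1, 100, 7, 10), (2, 100, 8, 0);
INSERT INTO read_reads VALUES
    (7, '2013-01-01', 15), (7, '2013-02-01', 25), (8, '2013-01-15', 5);
""")
rows = cur.execute("""
WITH per_assignment AS (      -- one named chunk: net value per assignment
    SELECT pd.id AS assignment_id, pd.patient_id,
           MAX(r.value - pd.issued_value) AS net_value
    FROM read_reads r
    JOIN patient_devices pd ON pd.device_id = r.device_id
    GROUP BY pd.id
)
SELECT patient_id, SUM(net_value) AS lifetime_value
FROM per_assignment
GROUP BY patient_id
""").fetchall()
print(rows)  # [(100, 20)]: (25-10) from assignment 1 plus (5-0) from assignment 2
```

Each CTE can then be read, tested, and optimized on its own, which is exactly the readability win being discussed; in PostgreSQL before 12, note that CTEs are also optimization fences, so the rewrite can change the plan as well.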
Re: [PERFORM] Performance of complicated query
On 05/23/2013 05:21 PM, Jonathan Morra wrote:
> Sorry for the messy query, I'm very new to writing these complex queries. I'll try and make it easier to read by using WITH clauses. However, just to clarify, the WITH clauses only increase readability and not performance in any way, right?

It depends. The planner is a tricky beast, and sometimes rewriting a seemingly identical query will result in a much more (or less) efficient plan. A classic case was the difference between

    where foo in (select bar from ...)

vs.

    where exists (select 1 from bar where ...)

In an ideal world the planner would figure out that both are the same and optimize accordingly, but there was a point where one was typically more efficient; then it switched, and the other became better for the planner. I don't recall the current state.

Casting can also be important - sometimes the planner needs a nudge to use an index on, say, a varchar column being compared to, perhaps, a text value or column, in which case casting to the exact data type being indexed can be a big win.

Cheers,
Steve

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
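The IN-vs-EXISTS pair above can be made concrete: the two forms return the same rows even though a planner may treat them differently. A minimal sketch, using SQLite via Python with hypothetical foo/bar tables (SQLite stands in for Postgres here, so only the semantic equivalence is shown, not the plan difference):

```python
import sqlite3

# Two semantically equivalent filters: IN with a subquery vs. a correlated EXISTS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE foo (id INTEGER PRIMARY KEY, bar_id INT);
CREATE TABLE bar (id INTEGER PRIMARY KEY);
INSERT INTO foo VALUES (1, 10), (2, 20), (3, 99);
INSERT INTO bar VALUES (10), (20);
""")
with_in = cur.execute(
    "SELECT id FROM foo WHERE bar_id IN (SELECT id FROM bar) ORDER BY id"
).fetchall()
with_exists = cur.execute(
    "SELECT id FROM foo WHERE EXISTS (SELECT 1 FROM bar WHERE bar.id = foo.bar_id) "
    "ORDER BY id"
).fetchall()
print(with_in == with_exists)  # True: same rows, potentially different plans
```

When a query is slow, trying the other form (and comparing EXPLAIN ANALYZE output for both) is a cheap experiment, which is the practical upshot of Steve's point.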