Re: [PERFORM] Performance of complicated query

2013-05-28 Thread Steve Crawford

On 05/23/2013 05:21 PM, Jonathan Morra wrote:
Sorry for the messy query, I'm very new to writing these complex 
queries.  I'll try and make it easier to read by using WITH clauses. 
 However, just to clarify, the WITH clauses only increase readability 
and not performance in any way, right?


It depends. The planner is a tricky beast and sometimes rewriting a 
seemingly identical query will result in a much more (or less) efficient 
plan. A classic case was the difference between "where foo in (select 
bar from ...)" vs. "where exists (select 1 from bar where ...)". In an 
ideal world the planner would figure out that both are the same and 
optimize accordingly, but there was a point where one was typically more 
efficient, then it switched to the other being better for the planner. I 
don't recall the current state.
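
As a concrete illustration of the two spellings Steve means (table and
column names here are invented for the example; the two forms return the
same rows but have historically been planned differently):

-- Semi-join spelled with IN: foo is a column of t, bar a column of u.
SELECT t.*
FROM t
WHERE t.foo IN (SELECT u.bar FROM u WHERE u.active);

-- The same semi-join spelled with EXISTS.
SELECT t.*
FROM t
WHERE EXISTS (SELECT 1 FROM u WHERE u.bar = t.foo AND u.active);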


Casting can be important - sometimes the planner needs a "nudge" to use 
an index on, say, a varchar column being compared to, perhaps, a text 
value or column, in which case casting to the exact data type being 
indexed can be a big win.
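
A minimal, hypothetical sketch of that kind of nudge (table, column, and
statement names are invented; whether the cast is actually needed depends
on the exact types and PostgreSQL version involved):

-- sku is assumed to be a varchar column with a btree index; casting the
-- text parameter to the indexed column's exact type removes any
-- type-mismatch reason for the planner to pass over the index.
PREPARE find_product(text) AS
    SELECT * FROM products WHERE sku = $1::varchar;

EXECUTE find_product('ABC-123');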


Cheers,
Steve





Re: [PERFORM] Performance of complicated query

2013-05-28 Thread Jonathan Morra
I have been working on this query, and I was able to modify it and get its
run time cut in half.  Here's where it is right now:

SELECT first_name, last_name, serial_number, latest_read, value,
       lifetime_value, lifetime.patient_id
FROM (
    SELECT DISTINCT patient_id, first_name, last_name,
           MAX(read_datetime) OVER (PARTITION BY patient_id) AS latest_read,
           SUM(value) OVER (PARTITION BY patient_id) AS value,
           first_value(serial_number) OVER (PARTITION BY patient_id
                                            ORDER BY read_datetime DESC) AS serial_number
    FROM (
        SELECT patient_devices.device_id, patient_id,
               MAX(value - issued_value) AS value,
               MAX(read_datetime) AS read_datetime
        FROM read_reads
        INNER JOIN patient_devices ON patient_devices.device_id = read_reads.device_id
            AND read_datetime >= issuance_datetime
            AND read_datetime < COALESCE(unassignment_datetime, 'infinity'::timestamp)
        WHERE read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
        GROUP BY patient_devices.id
    ) AS first
    INNER JOIN devices ON devices.id = device_id
    INNER JOIN patients ON patient_id = patients.id
) AS filtered
INNER JOIN (
    SELECT patient_id, SUM(value) AS lifetime_value
    FROM (
        SELECT patient_id, MAX(value - issued_value) AS value
        FROM read_reads
        INNER JOIN patient_devices ON patient_devices.device_id = read_reads.device_id
            AND read_datetime >= issuance_datetime
            AND read_datetime < COALESCE(unassignment_datetime, 'infinity'::timestamp)
        GROUP BY patient_devices.id
    ) AS first
    GROUP BY patient_id
) AS lifetime ON filtered.patient_id = lifetime.patient_id

I think the key to cutting it down was moving some of the joins up a level.
 Even though this is faster, I'd still like to cut it down a bunch more (as
this will be run a lot in my application).  Any more insight would be
greatly appreciated.  A summary of explain (analyze, buffers) can be found
at http://explain.depesz.com/s/qx7f.
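
For comparison, here is a sketch of the same query restructured with WITH
clauses purely for readability (untested against the real schema; note that
in the PostgreSQL releases of this era a WITH clause acts as an optimization
fence, so this form can plan differently, and not necessarily better):

WITH windowed AS (
    -- Latest cumulative value and read time per device assignment,
    -- restricted to the reporting window.
    SELECT patient_devices.device_id, patient_id,
           MAX(value - issued_value) AS value,
           MAX(read_datetime) AS read_datetime
    FROM read_reads
    INNER JOIN patient_devices ON patient_devices.device_id = read_reads.device_id
        AND read_datetime >= issuance_datetime
        AND read_datetime < COALESCE(unassignment_datetime, 'infinity'::timestamp)
    WHERE read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
    GROUP BY patient_devices.id
),
lifetime AS (
    -- Same aggregation without the date filter, summed per patient.
    SELECT patient_id, SUM(value) AS lifetime_value
    FROM (
        SELECT patient_id, MAX(value - issued_value) AS value
        FROM read_reads
        INNER JOIN patient_devices ON patient_devices.device_id = read_reads.device_id
            AND read_datetime >= issuance_datetime
            AND read_datetime < COALESCE(unassignment_datetime, 'infinity'::timestamp)
        GROUP BY patient_devices.id
    ) AS per_assignment
    GROUP BY patient_id
),
filtered AS (
    SELECT DISTINCT patient_id, first_name, last_name,
           MAX(read_datetime) OVER (PARTITION BY patient_id) AS latest_read,
           SUM(value) OVER (PARTITION BY patient_id) AS value,
           first_value(serial_number) OVER (PARTITION BY patient_id
                                            ORDER BY read_datetime DESC) AS serial_number
    FROM windowed
    INNER JOIN devices ON devices.id = device_id
    INNER JOIN patients ON patient_id = patients.id
)
SELECT first_name, last_name, serial_number, latest_read, value,
       lifetime_value, lifetime.patient_id
FROM filtered
INNER JOIN lifetime ON filtered.patient_id = lifetime.patient_id;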

Thanks


On Thu, May 23, 2013 at 5:21 PM, Jonathan Morra  wrote:

> Sorry for the messy query, I'm very new to writing these complex queries.
>  I'll try and make it easier to read by using WITH clauses.  However, just
> to clarify, the WITH clauses only increase readability and not performance
> in any way, right?
>
>
> On Thu, May 23, 2013 at 4:22 PM, james wrote:
>
>>  On 23/05/2013 22:57, Jonathan Morra wrote:
>>
>>  I'm not sure I understand your proposed solution.  There is also the
>> case to consider where the same patient can be assigned the same device
>> multiple times.  In this case, the value may be reset at each assignment
>> (hence the line value - issued_value AS value from the original query).
>>
>>
>> Perhaps you could use triggers to help somewhat?  At least for the
>> lifetime part.
>>
>> For a given assignment of a device to a patient, only the last value is
>> useful, so you can maintain that easily enough (a bit like a materialised
>> view but before 9.3 I guess).
>>
>> But, that might fix 'lifetime' but not some arbitrary windowed view.  I
>> can see why an 'as at' end time is useful, but not why a start time is so
>> useful: if a device has readings before the window but not in the window,
>> is that 'no reading' or should the last reading prior to the window apply?
>>
>> It also seems to me that the solution you have is hard to reason about.
>> It's like a Haskell program done in one big inline fold rather than a bunch
>> of 'where' clauses, and I find these cause significant brain overload.
>>
>> Perhaps you could break it out into identifiable chunks that work out
>> (both for lifetime if not using triggers, and for your date range
>> otherwise) the readings that are not superseded (i.e. the last in the date
>> bounds for a device assignment), and then work with those.  Consider the
>> CTE 'WITH queries' for doing this?
>>
>> It seems to me that if you can do this, then the problem might be easier
>> to express.
>>
>> Failing that, I'd be looking at using temporary tables, and forcing a
>> series of reduce steps using them, but then I'm a nasty old Sybase hacker
>> at heart. ;-)
>>
>>
>
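
A rough sketch of the trigger idea james floats above, covering only the
'lifetime' part. Only the read_reads and patient_devices names come from the
thread; the summary table, the function, and the exact column types are
invented for illustration (and this predates 9.5, so there is no
INSERT ... ON CONFLICT):

-- Hypothetical summary table: one row per device assignment.
CREATE TABLE latest_assignment_read (
    patient_device_id integer PRIMARY KEY,   -- references patient_devices.id
    latest_value      integer,
    latest_read       timestamp
);

CREATE OR REPLACE FUNCTION maintain_latest_read() RETURNS trigger AS $$
DECLARE
    pd_id integer;
BEGIN
    -- Find the assignment this read belongs to (column names assumed).
    SELECT pd.id INTO pd_id
    FROM patient_devices pd
    WHERE pd.device_id = NEW.device_id
      AND NEW.read_datetime >= pd.issuance_datetime
      AND NEW.read_datetime < COALESCE(pd.unassignment_datetime, 'infinity'::timestamp);

    IF pd_id IS NULL THEN
        RETURN NEW;  -- read matches no assignment; nothing to maintain
    END IF;

    UPDATE latest_assignment_read
       SET latest_value = NEW.value,
           latest_read  = NEW.read_datetime
     WHERE patient_device_id = pd_id
       AND latest_read <= NEW.read_datetime;

    IF NOT FOUND THEN
        -- Either no summary row exists yet or the incoming read is older;
        -- insert only when there is no row for this assignment. (Concurrent
        -- inserts for the same assignment could still race here.)
        INSERT INTO latest_assignment_read (patient_device_id, latest_value, latest_read)
        SELECT pd_id, NEW.value, NEW.read_datetime
        WHERE NOT EXISTS (
            SELECT 1 FROM latest_assignment_read WHERE patient_device_id = pd_id
        );
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_maintain_latest_read
AFTER INSERT ON read_reads
FOR EACH ROW EXECUTE PROCEDURE maintain_latest_read();

Lifetime totals per patient could then be read by summing latest_value over
that small table instead of re-aggregating read_reads.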


Re: [PERFORM] Slow SELECT by primary key? Postgres 9.1.2

2013-05-28 Thread Matheus de Oliveira
On Mon, May 27, 2013 at 11:02 AM, John Mudd  wrote:

> Postgres 9.1.2 on Ubuntu 12.04
>
> Any reason why a select by primary key would be slower than a select that
> includes an ORDER BY? I was really hoping using the primary key would give
> me a boost.
>
> I stopped the server and cleared the O/S cache using "sync; echo 3 >
> /proc/sys/vm/drop_caches" between the runs.
>
>
>
>  test=# VACUUM ANALYZE test_select;
> VACUUM
>
> (stopped postgres; reset O/S cache; started postgres)
>
> test=# explain analyze SELECT * FROM test_select WHERE key1 >= 50
> ORDER BY key1, key2, key3, id LIMIT 1;
>   QUERY PLAN
>
> --
>  Limit  (cost=0.00..0.08 rows=1 width=21) (actual time=12.599..12.600
> rows=1 loops=1)
>->  Index Scan using my_key on test_select  (cost=0.00..41895.49
> rows=498724 width=21) (actual time=12.597..12.597 rows=1 loops=1)
>  Index Cond: (key1 >= 50)
>  Total runtime: 12.678 ms
>
> (stopped postgres; reset O/S cache; started postgres)
>
> test=# explain analyze SELECT * FROM test_select WHERE id = 50;
>QUERY PLAN
>
>
> -
>  Index Scan using test_select_pkey on test_select  (cost=0.00..8.36 rows=1
> width=21) (actual time=31.396..31.398 rows=1 loops=1)
>Index Cond: (id = 50)
>  Total runtime: 31.504 ms
>
>
>
> Schema:
>
> test=# \d test_select
>                              Table "public.test_select"
>  Column |     Type     |                        Modifiers
> --------+--------------+-----------------------------------------------------------
>  id     | integer      | not null default nextval('test_select_id_seq'::regclass)
>  key1   | integer      |
>  key2   | integer      |
>  key3   | integer      |
>  data   | character(4) |
> Indexes:
>     "test_select_pkey" PRIMARY KEY, btree (id)
>     "my_key" btree (key1, key2, key3, id)
>
> test=#
>
>
>
> Sample data:
>
> test=# SELECT * FROM test_select LIMIT 10;
>  id |  key1  |  key2  |  key3  | data
> ++++--
>   1 | 984966 | 283954 | 772063 | x
>   2 | 817668 | 393533 | 924888 | x
>   3 | 751039 | 798753 | 454309 | x
>   4 | 128505 | 329643 | 280553 | x
>   5 | 105600 | 257225 | 710015 | x
>   6 | 323891 | 615614 |  83206 | x
>   7 | 194054 |  63506 | 353171 | x
>   8 | 212068 | 881225 | 271804 | x
>   9 | 644180 |  26693 | 200738 | x
>  10 | 136586 | 498699 | 554417 | x
> (10 rows)
>
>
>
To me it looks like the "my_key" index should be better than the PK in this
case, for a few reasons:

1. You are using an ORDER BY with the same fields (and in the same order) as
your index, so PG only needs to walk the index.
2. You are using LIMIT 1, which means PG only needs to fetch the first
element that satisfies the key1 condition (and can stop the search right
after it).

In the case of your PK, PG will also need to navigate the index and return
only one value, but in this case the number of entries it needs to look at
is bigger, because "id" has more distinct values than "key1".

Regards,
-- 
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres


Re: [PERFORM] Slow SELECT by primary key? Postgres 9.1.2

2013-05-28 Thread Roman Konoval
John,

And can you please include BUFFERS with ANALYZE?

Regards,
Roman Konoval
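
For reference, the buffer counts Roman asks for are requested like this
(shown here with the primary-key query from the thread):

test=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_select WHERE id = 50;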


On Tue, May 28, 2013 at 9:48 AM, Evgeniy Shishkin wrote:

>
>
>
>
> On 28.05.2013, at 2:17, John Mudd  wrote:
>
> Thanks again.
>
> Well, I have two problems with using the CLUSTER option. It's only
> temporary since any updates, depending on how much free space is reserved per
> page, require re-running the CLUSTER. And my primary concern is that it
> arbitrarily gives an unfair advantage to the primary key SELECT. Still,
> it's easy to test so here are the results. The primary key still loses
> even with the CLUSTER. Granted it is close, but considering this is now an
> unfair comparison it still doesn't make sense to me. How can a search for a
> specific row that should be fairly straightforward take longer than a
> search that includes an ORDER BY clause?
>
>
> Well, you get just a regular index scan because of LIMIT 1.
>
> And now it is just a matter of index size and table organization.
>
> I also don't understand why you consider CLUSTER unfair - the way you
> populated the table was a natural cluster over my_key.
>
> But it bothers me that my_key is always better. Can you please test it on
> different values but the same rows? Because right now it is two different
> tuples and you are counting every I/O.
>
>
> test=# CLUSTER test_select USING test_select_pkey ;
> CLUSTER
> test=# VACUUM ANALYZE test_select ;
> VACUUM
>
> (stopped postgres; reset O/S cache; started postgres)
>
>
> test=# explain analyze SELECT * FROM test_select WHERE key1 >= 50
> ORDER BY key1, key2, key3, id LIMIT 1;
>   QUERY PLAN
>
> --
>  Limit  (cost=0.00..0.08 rows=1 width=21) (actual time=19.430..19.431
> rows=1 loops=1)
>->  Index Scan using my_key on test_select  (cost=0.00..41938.15
> rows=42 width=21) (actual time=19.428..19.428 rows=1 loops=1)
>  Index Cond: (key1 >= 50)
>  Total runtime: 19.526 ms
>
>
> (stopped postgres; reset O/S cache; started postgres)
>
>
> test=# explain analyze SELECT * FROM test_select WHERE id = 50;
>QUERY PLAN
>
>
> -
>  Index Scan using test_select_pkey on test_select  (cost=0.00..8.36 rows=1
> width=21) (actual time=21.070..21.072 rows=1 loops=1)
>Index Cond: (id = 50)
>  Total runtime: 21.178 ms
>
>
>
>
> On Mon, May 27, 2013 at 10:59 AM, Evgeny Shishkin wrote:
>
>>
>> On May 27, 2013, at 6:35 PM, John Mudd  wrote:
>>
>> Thanks, that's easy enough to test. Didn't seem to help though.
>>
>>
>> Ok. And if you CLUSTER tables USING PK?
>>
>>
>> test=# REINDEX index test_select_pkey;
>> REINDEX
>> test=# VACUUM ANALYZE test_select ;
>> VACUUM
>>
>>
>> (stopped postgres; reset O/S cache; started postgres)
>>
>> test=# explain analyze SELECT * FROM test_select WHERE key1 >= 50
>> ORDER BY key1, key2, key3, id LIMIT 1;
>>   QUERY PLAN
>>
>>
>> --
>>  Limit  (cost=0.00..0.08 rows=1 width=21) (actual time=16.368..16.369
>> rows=1 loops=1)
>>->  Index Scan using my_key on test_select  (cost=0.00..41981.16
>> rows=501333 width=21) (actual time=16.366..16.366 rows=1 loops=1)
>>  Index Cond: (key1 >= 50)
>>  Total runtime: 16.444 ms
>>
>>
>> (stopped postgres; reset O/S cache; started postgres)
>>
>> test=# explain analyze SELECT * FROM test_select WHERE id = 50;
>>QUERY PLAN
>>
>> -
>>   Index Scan using test_select_pkey on test_select  (cost=0.00..8.36
>> rows=1 width=21) (actual time=23.072..23.074 rows=1 loops=1)
>>Index Cond: (id = 50)
>>  Total runtime: 23.192 ms
>>
>>
>>
>>
>> On Mon, May 27, 2013 at 10:21 AM, Evgeny Shishkin 
>> wrote:
>>
>>>
>>> On May 27, 2013, at 6:02 PM, John Mudd  wrote:
>>>
>>> > Postgres 9.1.2 on Ubuntu 12.04
>>> >
>>> > Any reason why a select by primary key would be slower than a select
>>> that includes an ORDER BY? I was really hoping using the primary key would
>>> give me a boost.
>>> >
>>>
>>> You created my_key after data loading, and the PK was there all the time.
>>> If you REINDEX the PK, I bet it will be as fast.
>>>
>>> > I stopped the server and cleared the O/S cache using "sync; echo 3 >
>>> /proc/sys/vm/drop_caches" between the runs.
>>> >
>>> >
>>> >
>>> > test=# VACUUM ANALYZE test_select;
>>> > VACUUM
>>> >
>>> > (stopped postgres; reset O/S cache; started postgres)
>>> >
>>> > test=# explai