Re: [PERFORM] A very long running query....

2012-07-21 Thread Craig Ringer

On 07/21/2012 06:19 AM, Ioannis Anagnostopoulos wrote:

On this Ubuntu installation the default_statistics_target = 1000 and 
not 100. Do you think that this might be an issue?


Nope. You should generally avoid setting default_statistics_target too 
high anyway; leave it where it is and use ALTER TABLE ... ALTER COLUMN 
... SET STATISTICS to raise the targets on columns where you're seeing 
bad statistics estimates.


http://www.postgresql.org/docs/9.1/static/sql-altertable.html
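
For example, a sketch using a table from this thread (pick whichever
columns show bad estimates, and re-run ANALYZE afterwards):

ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ALTER COLUMN src_id SET STATISTICS 500;
ANALYZE feed_all_y2012m07.message_copies_wk2;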

Also make sure autovacuum is running frequently so it keeps the stats up 
to date.


--
Craig Ringer




Re: [PERFORM] query overhead

2012-07-21 Thread Craig Ringer

On 07/17/2012 11:33 PM, Andy Halsall wrote:



 If you're working with ISAM-like access though, cursors may well be 
very helpful for you. It's a pity for your app that Pg doesn't support 
cursors that see changes committed after cursor creation, since these 
are ideal when emulating ISAM "next record" / "previous record" access 
models. They're still suitable for tasks where you know the app 
doesn't need to see concurrently modified data, though.


> That's right, that would've been ideal behaviour for us. We're going 
to manage our own shared cache in the application layer to give 
similar functionality. We have lots of reads but fewer writes.


How have you gone with this? I'm curious.

By the way, when replying it's the convention to indent the text written 
by the person you're replying to, not your own text. It's kind of hard 
to read otherwise.



> In the context of what we've been talking about, we're reading a set 
of information which is ordered in a reasonably complex way. Set is 
about 1 records and requires a table join. This sort takes a while 
as it heap scans - couldn't persuade it to use indexes.


> Having read the set, the application "gets next" until the end. To 
start with we were re-establishing the set (minus the previous record) 
and choosing the first (LIMIT 1) on each "get next" - obviously a 
non-starter. We moved to caching the record keys for the set and only 
visiting the database for the specific records on each "get next" - 
hence the questions about round trip overhead for small queries.

Given that pattern, why aren't you using a cursor? Do you need to see 
concurrent changes? Is the cursor just held open too long, affecting 
autovacuum?
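
For reference, the ISAM-style pattern with a scrollable cursor looks
roughly like this (a sketch; the table and ordering are placeholders):

BEGIN;
DECLARE rec_cur SCROLL CURSOR FOR
    SELECT * FROM some_table ORDER BY key_col;
FETCH NEXT FROM rec_cur;   -- "get next record"
FETCH PRIOR FROM rec_cur;  -- "get previous record"
CLOSE rec_cur;
COMMIT;

The cursor sees a snapshot taken when it is created, which is exactly
the limitation discussed above.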


--
Craig Ringer





Re: [PERFORM] A very long running query....

2012-07-21 Thread Marc Mamin


Hello,
isn't the first test superfluous here?

>   where extract('day' from message_copies.msg_date_rec) = 17
>   and date_trunc('day',message_copies.msg_date_rec) = '2012-07-17'


> Here is the index:
> 
> CREATE INDEX idx_message_copies_wk2_date_src_pos_partial
>   ON feed_all_y2012m07.message_copies_wk2
>   USING btree
>   (date_trunc('day'::text, msg_date_rec),
>   src_id,
>   (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
> pos_georef4::text))
> TABLESPACE archive
>   WHERE (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
> pos_georef4::text) IS NOT NULL 
>   OR NOT (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || 
> pos_georef4::text) = ''::text;


the georef test can be simplified using coalesce:

>  and (message_copies.pos_georef1 || message_copies.pos_georef2 || 
> message_copies.pos_georef3 || message_copies.pos_georef4) <> ''
>  and not (message_copies.pos_georef1 || message_copies.pos_georef2 || 
> message_copies.pos_georef3 || message_copies.pos_georef4) is null
  =>
  and coalesce(
      (message_copies.pos_georef1 || message_copies.pos_georef2 ||
       message_copies.pos_georef3 || message_copies.pos_georef4),
      '') <> ''
In order to avoid this test at query time you might add a boolean column 
message_copies.has_georef, and keep it up to date with a BEFORE INSERT OR 
UPDATE trigger. This will let you shorten your index definition and 
simplify the planner's task a little bit.
Moreover it will speed up your query in cases where the index doesn't get used.
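
A minimal sketch of that approach (function and trigger names are
illustrative):

ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ADD COLUMN has_georef boolean;

CREATE OR REPLACE FUNCTION set_has_georef() RETURNS trigger AS $$
BEGIN
    -- NULL if any part is NULL, so coalesce covers both tests at once
    NEW.has_georef := coalesce(NEW.pos_georef1 || NEW.pos_georef2
                            || NEW.pos_georef3 || NEW.pos_georef4, '') <> '';
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_set_has_georef
    BEFORE INSERT OR UPDATE ON feed_all_y2012m07.message_copies_wk2
    FOR EACH ROW EXECUTE PROCEDURE set_has_georef();

The partial index predicate then reduces to "WHERE has_georef".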

As Tom already mentioned, it may make sense not to concatenate the georefs 
within the index, but to keep them separated, or even to keep them in 
different indexes.
Which is best depends on the other queries running against this table.
  
HTH,

Marc Mamin
  


-Original Message-
From: pgsql-performance-ow...@postgresql.org on behalf of Ioannis 
Anagnostopoulos
Sent: Sat 7/21/2012 1:56 AM
To: Tom Lane
Cc: Claudio Freire; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] A very long running query
 
On 21/07/2012 00:10, Tom Lane wrote:
> Claudio Freire  writes:
>> Looking at this:
>> "->  Index Scan using
>> idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
>> message_copies  (cost=0.00..19057.93 rows=52 width=32) (actual
>> time=62.124..5486270.845 rows=387524 loops=1)"
>> "  Index Cond: ((date_trunc('day'::text,
>> msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
>> AND (src_id = 1))"
>> "  Filter: ((date_part('day'::text,
>> msg_date_rec) = 17::double precision) AND (NOT (pos_georef1)::text
>> || (pos_georef2)::text) || (pos_georef3)::text) ||
>> (pos_georef4)::text) IS NULL)) AND (pos_georef1)::text ||
>> (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
>> <> ''::text))"
> I think the real problem is that the planner has no hope of doing
> anything very accurate with such an unwieldy filter condition.  I'd look
> at ways of making the filter conditions simpler, perhaps by recasting
> the data representation.  In particular, that's a horridly bad way of
> asking whether some columns are empty, which I gather is the intent.
> If you really want to do it just like that, creating an index on the
> concatenation expression would guide ANALYZE to collect some stats about
> it, but it would probably be a lot more efficient to put together an AND
> or OR of tests on the individual columns.
>
>   regards, tom lane
So what you suggest is to forget altogether the concatenation of the 
georef1/2/3/4 and instead alter my query with something like:

georef1 is not null and not georef1 = '' ... etc. for georef2, 3 and 4

That would require altering my index to have the four georef columns 
separately in it rather than as a concatenation, and likewise for the 
partial index part. And a final thing: you seem to imply that indexes 
are used by ANALYZE to collect statistics even if they are not used by 
queries. So an index serves not only as a way to speed up targeted 
queries but also to provide better statistics to the planner?

Kind Regards
Yiannis





Re: [PERFORM] A very long running query....

2012-07-21 Thread Ioannis Anagnostopoulos

On 21/07/2012 10:16, Marc Mamin wrote:

Hello,
isn't the first test superfluous here?

>   where extract('day' from message_copies.msg_date_rec) = 17
>   and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'

No, because it is used to select a partition. Otherwise it would go 
through the whole hierarchy...


Re: [PERFORM] A very long running query....

2012-07-21 Thread Tom Lane
[ Please try to trim quotes when replying.  People don't want to re-read
  the entire thread in every message. ]

Ioannis Anagnostopoulos  writes:
> On 21/07/2012 10:16, Marc Mamin wrote:
>> isn't the first test superfluous here ?
>> 
>>> where extract('day' from message_copies.msg_date_rec) = 17
>>> and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'

> No because it is used to select a partition. Otherwise it will go 
> through the whole hierarchy...

You're using extract(day...) to define partitions?  You might want to
rethink that.  The planner has got absolutely no intelligence about
the behavior of extract, and in particular doesn't realize that the
date_trunc condition implies the extract condition; so that's another
part of the cause of the estimation error here.

What's usually recommended for partitioning is simple equality or
range constraints, such as "msg_date_rec >= 'date1' AND
msg_date_rec < 'date2'", which the planner does have a fair amount
of intelligence about.

Now, you can generalize that to equality or range constraints using
an expression; for instance there'd be no problem to partition on
date_trunc('day', msg_date_rec) rather than msg_date_rec directly,
so long as your queries always use that same expression.  But you
should not expect that the planner can deduce very much about the
correlations between results of different functions.
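
For instance, a sketch of one such expression-based constraint on a
weekly partition (the dates are illustrative):

CHECK (date_trunc('day', msg_date_rec) >= '2012-07-15'
   AND date_trunc('day', msg_date_rec) <  '2012-07-22')

This only helps if the queries filter on exactly the same
date_trunc('day', msg_date_rec) expression.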

regards, tom lane



Re: [PERFORM] A very long running query....

2012-07-21 Thread Ioannis Anagnostopoulos

On 21/07/2012 17:58, Tom Lane wrote:

[ Please try to trim quotes when replying.  People don't want to re-read
   the entire thread in every message. ]

You're using extract(day...) to define partitions?  You might want to
rethink that.  The planner has got absolutely no intelligence about
the behavior of extract, and in particular doesn't realize that the
date_trunc condition implies the extract condition; so that's another
part of the cause of the estimation error here.

What's usually recommended for partitioning is simple equality or
range constraints, such as "msg_date_rec >= 'date1' AND
msg_date_rec < 'date2'", which the planner does have a fair amount
of intelligence about.

I think you got this wrong here. If you look at the query again you will 
see that I do use equality. The problem is that my "equality" occurs by 
extracting the day from the msg_date_rec column. To put it another way, 
to avoid using "extract" I would need an additional column holding only 
the day number to perform the equality on. Doesn't that feel wrong, given 
that I already have the actual date? The constraint within the table that 
defines the partition is as follows:

CONSTRAINT message_copies_wk0_date CHECK (date_part('day'::text, 
msg_date_rec) >= 1::double precision AND date_part('day'::text, 
msg_date_rec) <= 7::double precision)

I see no problem with this: the planner gets it right and "hits" the 
correct table every time. So unless there is a technique here that I am 
completely missing, where is the problem?


Regards
Yiannis


Re: [PERFORM] A very long running query....

2012-07-21 Thread Ioannis Anagnostopoulos

On 21/07/2012 00:10, Tom Lane wrote:

I think the real problem is that the planner has no hope of doing
anything very accurate with such an unwieldy filter condition.  I'd look
at ways of making the filter conditions simpler, perhaps by recasting
the data representation.  In particular, that's a horridly bad way of
asking whether some columns are empty, which I gather is the intent.
If you really want to do it just like that, creating an index on the
concatenation expression would guide ANALYZE to collect some stats about
it, but it would probably be a lot more efficient to put together an AND
or OR of tests on the individual columns.

regards, tom lane
OK, regarding the index I use... I followed your second suggestion about 
efficiency with individual columns and changed it to:

CREATE INDEX idx_message_copies_wk2_date_src_pos_partial
  ON feed_all_y2012m07.message_copies_wk2
  USING btree
  (date_trunc('day'::text, msg_date_rec), src_id, pos_georef1, 
pos_georef2, pos_georef3, pos_georef4)

TABLESPACE "index"
  WHERE
pos_georef1 IS NOT NULL
AND NOT pos_georef1::text = ''::text
AND pos_georef2 IS NOT NULL
AND NOT pos_georef2::text = ''::text
AND pos_georef3 IS NOT NULL
AND NOT pos_georef3::text = ''::text
AND pos_georef4 IS NOT NULL
AND NOT pos_georef4::text = ''::text;

The query has been changed as well as follows now:

SELECT
src_id,
date_trunc('day', message_copies.msg_date_rec) as date_count,
message_copies.pos_georef1,
message_copies.pos_georef2,
message_copies.pos_georef3,
message_copies.pos_georef4,
ais_server.array_accum(CASE WHEN msg_type BETWEEN  1 and  3 
THEN message_copies.msg_id END) as msgA_array,
ais_server.array_accum(CASE WHEN msg_type = 18 THEN 
message_copies.msg_id END) as msgB_std_array,
ais_server.array_accum(CASE WHEN msg_type = 19 THEN 
message_copies.msg_id END) as msgB_ext_array,

uniq
(
ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_A' 
THEN obj_mmsi END)

) as mmsi_type_A_array,
uniq
(
ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_B' 
THEN obj_mmsi END)

) as mmsi_type_B_array,
avg(ship_speed) / 10.0 as avg_speed,
avg(ship_heading) as avg_heading,
avg(ship_course) / 10.0 as avg_course,
ST_Multi(ST_Collect(ship_pos_messages.pos_point)) as geom
from
feed_all_y2012m07.message_copies join
(feed_all_y2012m07.ship_pos_messages join 
ais_server.ship_objects on (ship_pos_messages.obj_id = 
ship_objects.obj_id))

on (message_copies.msg_id = ship_pos_messages.msg_id)
where
extract('day' from message_copies.msg_date_rec) = 17
and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'
and message_copies.src_id = 5
and not message_copies.pos_georef1 = '' and not 
message_copies.pos_georef2 = '' and not message_copies.pos_georef3 = '' 
and not message_copies.pos_georef4 = ''
and message_copies.pos_georef1 is not null and 
message_copies.pos_georef2 is not null and message_copies.pos_georef3 is 
not null and message_copies.pos_georef4 is not null

and extract('day' from ship_pos_messages.msg_date_rec) = 17
group by src_id, date_count, message_copies.pos_georef1, 
message_copies.pos_georef2, message_copies.pos_georef3, 
message_copies.pos_georef4;


I am not sure that I can see an improvement; at least for src_id values 
that have lots of msg_id per day, the query never returned even 5 hours 
after running "explain analyze". For smaller src_id (message-wise) there 
might be some improvement, or it was just the ANALYZE that I ran. As I 
said, the stats quickly go out of date because of the big number of 
updates. So it looks like it is not the "funny" WHERE concatenation or 
some kind of index construction problem. Which brings us back to the 
issue of the per-column "statistics_target". My problem is that, given 
the query plan I provided you yesterday, I am not sure which columns' 
statistics_target to touch and what sort of number to introduce. Is 
there any rule of thumb?

Re: [PERFORM] A very long running query....

2012-07-21 Thread Claudio Freire
On Sat, Jul 21, 2012 at 4:16 PM, Ioannis Anagnostopoulos
 wrote:
> [...] Which brings us back to the issue of the per-column
> "statistics_target". My problem is that, given the query plan I
> provided you yesterday, I am not sure which columns' statistics_target
> to touch and what sort of number to introduce. Is there any rule of
> thumb?

What's the size of your index, tables, and such?
In GB I mean, not tuples.
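
For example (relation names taken from earlier in the thread):

SELECT pg_size_pretty(pg_total_relation_size(
    'feed_all_y2012m07.message_copies_wk2'));   -- table plus its indexes
SELECT pg_size_pretty(pg_relation_size(
    'feed_all_y2012m07.idx_message_copies_wk2_date_src_pos_partial'));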



Re: [PERFORM] A very long running query....

2012-07-21 Thread Ioannis Anagnostopoulos

On 21/07/2012 20:19, Claudio Freire wrote:

What's the size of your index, tables, and such?
In GB I mean, not tuples.

The message_copies_wk2 that I currently hit is 13GB, with 11GB of 
indexes; ship_a_pos_messages_wk2 is 17GB with a 2.5MB index; and 
ship_objects is about 150MB, table and index combined.

Yiannis



Re: [PERFORM] A very long running query....

2012-07-21 Thread Claudio Freire
On Fri, Jul 20, 2012 at 6:19 PM, Ioannis Anagnostopoulos
 wrote:
> (feed_all_y2012m07.ship_pos_messages join
> ais_server.ship_objects on (ship_pos_messages.obj_id = ship_objects.obj_id))
> on (message_copies.msg_id = ship_pos_messages.msg_id)

It's this part of the query that's taking 3.2 hours.

Move the filtered message_copies to a CTE, and the filtered
ship_pos_messages join to another CTE. That should (in my experience)
get you better performance.
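
Something along these lines (a sketch; the filters are copied from your
query, the aggregate list is elided):

WITH mc AS (
    SELECT msg_id, src_id, msg_date_rec, msg_type,
           pos_georef1, pos_georef2, pos_georef3, pos_georef4
    FROM feed_all_y2012m07.message_copies
    WHERE extract('day' from msg_date_rec) = 17
      AND date_trunc('day', msg_date_rec) = '2012-07-17'
      AND src_id = 5
      AND coalesce(pos_georef1 || pos_georef2
                || pos_georef3 || pos_georef4, '') <> ''
),
spm AS (
    SELECT p.msg_id, p.pos_point, p.ship_speed, p.ship_heading,
           p.ship_course, o.obj_mmsi, o.obj_type
    FROM feed_all_y2012m07.ship_pos_messages p
    JOIN ais_server.ship_objects o ON p.obj_id = o.obj_id
    WHERE extract('day' from p.msg_date_rec) = 17
)
SELECT mc.src_id,
       date_trunc('day', mc.msg_date_rec) AS date_count,
       mc.pos_georef1, mc.pos_georef2, mc.pos_georef3, mc.pos_georef4
       -- plus the same aggregates as in the original query
FROM mc
JOIN spm ON mc.msg_id = spm.msg_id
GROUP BY mc.src_id, date_count,
         mc.pos_georef1, mc.pos_georef2, mc.pos_georef3, mc.pos_georef4;

A CTE acts as an optimization fence, so each filtered set is evaluated
on its own before the join, instead of the planner mis-estimating the
combined condition.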



Re: [PERFORM] A very long running query....

2012-07-21 Thread Claudio Freire
On Sat, Jul 21, 2012 at 5:10 PM, Claudio Freire  wrote:
>  wrote:
>> (feed_all_y2012m07.ship_pos_messages join
>> ais_server.ship_objects on (ship_pos_messages.obj_id = ship_objects.obj_id))
>> on (message_copies.msg_id = ship_pos_messages.msg_id)
>
> It's this part of the query that's taking 3.2 hours.
>
> Move the filtered message_copies to a CTE, and the filtered
> ship_pos_messages join to another CTE. That should (in my experience)
> get you better performance.

Btw... did you try the hash thing?



Re: [PERFORM] A very long running query....

2012-07-21 Thread Ioannis Anagnostopoulos

On 21/07/2012 21:11, Claudio Freire wrote:

It's this part of the query that's taking 3.2 hours.

Move the filtered message_copies to a CTE, and the filtered
ship_pos_messages join to another CTE. That should (in my experience)
get you better performance.

Btw... did you try the hash thing?
Not yet, as at present I am trying to simplify the index by getting the 
georefs out of it. I don't know if this is a good idea, but I thought 
that since I am not testing (yet) any equality other than making sure 
the georefs are not null or empty, I could avoid having them in the 
index, thus reducing its size a lot... At least for now.
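
I.e. something like this (a sketch; the index name is illustrative):

CREATE INDEX idx_message_copies_wk2_date_src_partial
  ON feed_all_y2012m07.message_copies_wk2
  USING btree (date_trunc('day'::text, msg_date_rec), src_id)
  TABLESPACE "index"
  WHERE pos_georef1 IS NOT NULL AND NOT pos_georef1::text = ''::text
    AND pos_georef2 IS NOT NULL AND NOT pos_georef2::text = ''::text
    AND pos_georef3 IS NOT NULL AND NOT pos_georef3::text = ''::text
    AND pos_georef4 IS NOT NULL AND NOT pos_georef4::text = ''::text;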

