Re: [PERFORM] A very long running query....
On 07/21/2012 06:19 AM, Ioannis Anagnostopoulos wrote: On this Ubuntu installation the default_statistics_target = 1000 and not 100. Do you think that this might be an issue? Nope. You should generally avoid setting default_statistics_target that high anyway; leave it where it is and use ALTER TABLE ... ALTER COLUMN ... SET STATISTICS to raise the targets on the individual columns where you're seeing bad statistics estimates. http://www.postgresql.org/docs/9.1/static/sql-altertable.html Also make sure autovacuum is running frequently so it keeps the stats up to date. -- Craig Ringer
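As an illustration of Craig's suggestion (the column choice and target value below are only examples, borrowed from the query under discussion, not a recommendation from the thread):

```sql
-- Raise the statistics target for one problematic column only,
-- leaving default_statistics_target alone.
ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ALTER COLUMN src_id SET STATISTICS 500;

-- Re-analyze (optionally just that column) so the larger sample
-- is actually collected and the planner sees it.
ANALYZE feed_all_y2012m07.message_copies_wk2 (src_id);
```

The per-column target overrides the global default only for that column, so the ANALYZE cost stays contained.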
Re: [PERFORM] A very long running query....
Hello, isn't the first test superfluous here? where extract('day' from message_copies.msg_date_rec) = 17 and date_trunc('day',message_copies.msg_date_rec) = '2012-07-17' Here is the index: CREATE INDEX idx_message_copies_wk2_date_src_pos_partial ON feed_all_y2012m07.message_copies_wk2 USING btree (date_trunc('day'::text, msg_date_rec), src_id, (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text)) TABLESPACE archive WHERE (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text) IS NOT NULL OR NOT (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text) = ''::text; the georef test can be simplified using coalesce: and (message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4) <> '' and not (message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4) is null => and coalesce((message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4), '') <> '' In order to avoid this test at query time you might add a boolean column message_copies.has_georef, and keep it up to date with a before insert or update trigger. This will allow you to shorten your index definition and simplify the planner's task a little bit. Moreover it will speed up your query in cases where the index doesn't get used. As Tom already mentioned, it may make sense not to concatenate the georefs within the index, but to keep them separated, or even to keep them in different indexes. 
Which is best depends on the other queries running against this table. HTH, Marc Mamin -----Original Message----- From: pgsql-performance-ow...@postgresql.org on behalf of Ioannis Anagnostopoulos Sent: Sat 7/21/2012 1:56 AM To: Tom Lane Cc: Claudio Freire; pgsql-performance@postgresql.org Subject: Re: [PERFORM] A very long running query On 21/07/2012 00:10, Tom Lane wrote: Claudio Freire klaussfre...@gmail.com writes: Looking at this: - Index Scan using idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2 message_copies (cost=0.00..19057.93 rows=52 width=32) (actual time=62.124..5486270.845 rows=387524 loops=1) Index Cond: ((date_trunc('day'::text, msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone) AND (src_id = 1)) Filter: ((date_part('day'::text, msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) <> ''::text)) I think the real problem is that the planner has no hope of doing anything very accurate with such an unwieldy filter condition. I'd look at ways of making the filter conditions simpler, perhaps by recasting the data representation. In particular, that's a horridly bad way of asking whether some columns are empty, which I gather is the intent. If you really want to do it just like that, creating an index on the concatenation expression would guide ANALYZE to collect some stats about it, but it would probably be a lot more efficient to put together an AND or OR of tests on the individual columns. 
regards, tom lane So what you suggest is to forget altogether the concatenation of georef1/2/3/4 and instead alter my query with something like: georef1 is not null and not georef1 = '' etc. for georef2, 3 and 4. That would require altering my index to have the four georef columns in it separately rather than as a concatenation, and likewise for the partial index part. And a final thing: you seem to imply that indexes are used by the analyser to collect statistics even when queries don't use them. So an index serves not only as a way to speed up targeted queries but also to provide better statistics to the planner? Kind Regards Yiannis -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
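Marc's has_georef idea above could be sketched roughly as follows. This is only an illustration: the column, function and trigger names are made up, PL/pgSQL is assumed, and the coalesce mirrors his simplified test (the concatenation is NULL as soon as any georef is NULL):

```sql
-- Hypothetical flag column maintained by a trigger, so queries and the
-- partial index can test one boolean instead of a four-way concatenation.
ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ADD COLUMN has_georef boolean;

CREATE OR REPLACE FUNCTION set_has_georef() RETURNS trigger AS $$
BEGIN
    -- true only when all four georefs are non-null and the result is non-empty
    NEW.has_georef :=
        coalesce(NEW.pos_georef1 || NEW.pos_georef2 ||
                 NEW.pos_georef3 || NEW.pos_georef4, '') <> '';
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_set_has_georef
    BEFORE INSERT OR UPDATE ON feed_all_y2012m07.message_copies_wk2
    FOR EACH ROW EXECUTE PROCEDURE set_has_georef();
```

The partial-index predicate and the query's WHERE clause then both collapse to `has_georef`, which is far easier for the planner to estimate.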
Re: [PERFORM] A very long running query....
On 21/07/2012 10:16, Marc Mamin wrote: isn't the first test superfluous here? where extract('day' from message_copies.msg_date_rec) = 17 and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17' No, because it is used to select a partition. Otherwise it will go through the whole hierarchy...
Re: [PERFORM] A very long running query....
[ Please try to trim quotes when replying. People don't want to re-read the entire thread in every message. ] Ioannis Anagnostopoulos ioan...@anatec.com writes: On 21/07/2012 10:16, Marc Mamin wrote: isn't the first test superfluous here? where extract('day' from message_copies.msg_date_rec) = 17 and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17' No because it is used to select a partition. Otherwise it will go through the whole hierarchy... You're using extract(day...) to define partitions? You might want to rethink that. The planner has got absolutely no intelligence about the behavior of extract, and in particular doesn't realize that the date_trunc condition implies the extract condition; so that's another part of the cause of the estimation error here. What's usually recommended for partitioning is simple equality or range constraints, such as msg_date_rec >= 'date1' AND msg_date_rec < 'date2', which the planner does have a fair amount of intelligence about. Now, you can generalize that to equality or range constraints using an expression; for instance there'd be no problem partitioning on date_trunc('day', msg_date_rec) rather than msg_date_rec directly, so long as your queries always use that same expression. But you should not expect that the planner can deduce very much about the correlations between results of different functions. regards, tom lane
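For what it's worth, Tom's recommendation might look like this for a single day's partition. The child-table and constraint names here are invented for illustration; only the parent table comes from the thread:

```sql
-- Hypothetical daily partition constrained by a simple range on the raw
-- timestamp column, which constraint exclusion understands directly.
CREATE TABLE feed_all_y2012m07.message_copies_d17 (
    CHECK (msg_date_rec >= DATE '2012-07-17'
       AND msg_date_rec <  DATE '2012-07-18')
) INHERITS (feed_all_y2012m07.message_copies);

-- Queries must then repeat the same range predicate so the planner
-- can prove the other partitions irrelevant:
SELECT count(*)
FROM feed_all_y2012m07.message_copies
WHERE msg_date_rec >= DATE '2012-07-17'
  AND msg_date_rec <  DATE '2012-07-18';
```

The point is that the predicate in the query and the CHECK constraint use the same expression, so exclusion works without the planner having to reason about extract() or date_part().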
Re: [PERFORM] A very long running query....
On 21/07/2012 17:58, Tom Lane wrote: You're using extract(day...) to define partitions? You might want to rethink that. The planner has got absolutely no intelligence about the behavior of extract, and in particular doesn't realize that the date_trunc condition implies the extract condition. I think you got this wrong here. If you look at the query again you will see that I do use equality. The problem is that my equality is on the day extracted from the msg_date_rec column. To put it another way, to avoid the extract I would need an additional column holding just the day number to perform the equality on. Doesn't that seem wrong, given that I already have the actual date? 
The constraint within the table that defines the partition is as follows: CONSTRAINT message_copies_wk0_date CHECK (date_part('day'::text, msg_date_rec) >= 1::double precision AND date_part('day'::text, msg_date_rec) <= 7::double precision) I see no problem with this. The planner gets it right and hits the correct table every time. So unless there is a technique here that I have completely missed, where is the problem? Regards Yiannis
Re: [PERFORM] A very long running query....
On 21/07/2012 00:10, Tom Lane wrote: I think the real problem is that the planner has no hope of doing anything very accurate with such an unwieldy filter condition. I'd look at ways of making the filter conditions simpler, perhaps by recasting the data representation. In particular, that's a horridly bad way of asking whether some columns are empty, which I gather is the intent. If you really want to do it just like that, creating an index on the concatenation expression would guide ANALYZE to collect some stats about it, but it would probably be a lot more efficient to put together an AND or OR of tests on the individual columns. regards, tom lane OK regarding the index I use... 
I follow your second advice about efficiency with individual columns and changed it to: CREATE INDEX idx_message_copies_wk2_date_src_pos_partial ON feed_all_y2012m07.message_copies_wk2 USING btree (date_trunc('day'::text, msg_date_rec), src_id, pos_georef1, pos_georef2, pos_georef3, pos_georef4) TABLESPACE index WHERE pos_georef1 IS NOT NULL AND NOT pos_georef1::text = ''::text AND pos_georef2 IS NOT NULL AND NOT pos_georef2::text = ''::text AND pos_georef3 IS NOT NULL AND NOT pos_georef3::text = ''::text AND pos_georef4 IS NOT NULL AND NOT pos_georef4::text = ''::text; The query has been changed as well as follows now: SELECT src_id, date_trunc('day', message_copies.msg_date_rec) as date_count, message_copies.pos_georef1, message_copies.pos_georef2, message_copies.pos_georef3, message_copies.pos_georef4, ais_server.array_accum(CASE WHEN msg_type BETWEEN 1 and 3 THEN message_copies.msg_id END) as msgA_array, ais_server.array_accum(CASE WHEN msg_type = 18 THEN message_copies.msg_id END) as msgB_std_array, ais_server.array_accum(CASE WHEN msg_type = 19 THEN message_copies.msg_id END) as msgB_ext_array, uniq ( ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_A' THEN obj_mmsi END) ) as mmsi_type_A_array, uniq ( ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_B' THEN obj_mmsi END) ) as mmsi_type_B_array, avg(ship_speed) / 10.0 as avg_speed, avg(ship_heading) as avg_heading, avg(ship_course) / 10.0 as avg_course, ST_Multi(ST_Collect(ship_pos_messages.pos_point)) as geom from feed_all_y2012m07.message_copies join (feed_all_y2012m07.ship_pos_messages join ais_server.ship_objects on (ship_pos_messages.obj_id = ship_objects.obj_id)) on (message_copies.msg_id = ship_pos_messages.msg_id) where extract('day' from message_copies.msg_date_rec) = 17 and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17' and message_copies.src_id = 5 and not message_copies.pos_georef1 = '' and not message_copies.pos_georef2 = '' and not message_copies.pos_georef3 = '' and 
not message_copies.pos_georef4 = '' and message_copies.pos_georef1 is not null and message_copies.pos_georef2 is not null and message_copies.pos_georef3 is not null and message_copies.pos_georef4 is not null and extract('day' from ship_pos_messages.msg_date_rec) = 17 group by src_id, date_count, message_copies.pos_georef1, message_copies.pos_georef2, message_copies.pos_georef3, message_copies.pos_georef4; I am not sure that I can see an improvement; at least for src_id values that have lots of msg_id per day, the query never returned even 5 hours after running explain analyze. For smaller src_id (message-wise) there might be some improvement, or it was just the analyse that I ran. As I said, the stats quickly go out of date because of the big number of updates. So it looks like it is not the funny where-clause concatenation or some kind of index construction problem. Which brings us back to the issue of the per-column statistics_target. My problem is that, given the query plan I provided you yesterday, I am not sure which columns' statistics_target to touch and what sort of number to introduce. Is there any rule of thumb?
Re: [PERFORM] A very long running query....
On Sat, Jul 21, 2012 at 4:16 PM, Ioannis Anagnostopoulos ioan...@anatec.com wrote: ... I am not sure which columns' statistics_target to touch and what sort of number to introduce. Is there any rule of thumb? What's the size of your index, tables, and such? In GB I mean, not tuples.
Re: [PERFORM] A very long running query....
On 21/07/2012 20:19, Claudio Freire wrote: What's the size of your index, tables, and such? In GB I mean, not tuples. The message_copies_wk2 that I currently hit is 13GB with 11GB of indexes, ship_a_pos_messages_wk2 is 17GB with a 2.5MB index, and ship_objects is approx. 150MB including table and index. Yiannis
Re: [PERFORM] A very long running query....
On Fri, Jul 20, 2012 at 6:19 PM, Ioannis Anagnostopoulos ioan...@anatec.com wrote: (feed_all_y2012m07.ship_pos_messages join ais_server.ship_objects on (ship_pos_messages.obj_id = ship_objects.obj_id)) on (message_copies.msg_id = ship_pos_messages.msg_id) It's this part of the query that's taking 3.2 hours. Move the filtered message_copies to a CTE, and the filtered ship_pos_messages join to another CTE. That should (in my experience) get you better performance.
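A rough sketch of what Claudio is proposing, with the column lists abbreviated and a placeholder aggregate standing in for the original select list (the CTE names are invented):

```sql
-- Filter each large input in its own CTE, then join the reduced sets.
-- On 9.0 a CTE acts as an optimization fence, so each one is materialized
-- once, with its filters already applied, before the join runs.
WITH mc AS (
    SELECT msg_id, src_id, msg_date_rec,
           pos_georef1, pos_georef2, pos_georef3, pos_georef4
    FROM feed_all_y2012m07.message_copies
    WHERE extract('day' from msg_date_rec) = 17      -- partition selection
      AND date_trunc('day', msg_date_rec) = '2012-07-17'
      AND src_id = 1
      AND coalesce(pos_georef1 || pos_georef2 ||
                   pos_georef3 || pos_georef4, '') <> ''
),
spm AS (
    SELECT msg_id, obj_id, ship_speed, ship_heading, ship_course, pos_point
    FROM feed_all_y2012m07.ship_pos_messages
    WHERE extract('day' from msg_date_rec) = 17
)
SELECT mc.src_id, count(*)   -- placeholder for the original aggregates
FROM mc
JOIN spm ON mc.msg_id = spm.msg_id
JOIN ais_server.ship_objects so ON spm.obj_id = so.obj_id
GROUP BY mc.src_id;
```

The idea is that the join then operates on two pre-filtered row sets instead of pushing the unwieldy filter into a nested-loop join.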
Re: [PERFORM] A very long running query....
On Sat, Jul 21, 2012 at 5:10 PM, Claudio Freire klaussfre...@gmail.com wrote: Move the filtered message_copies to a CTE, and the filtered ship_pos_messages join to another CTE. That should (in my experience) get you better performance. Btw... did you try the hash thing?
Re: [PERFORM] A very long running query....
On 21/07/2012 21:11, Claudio Freire wrote: Btw... did you try the hash thing? Not yet, as at present I am trying to simplify the index by getting the georefs out of it. I don't know if this is a good idea, but I thought that since I am not (yet) testing any equality other than making sure the georefs are not null or empty, I could avoid having them in the index, thus reducing its size a lot... at least for now.
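A minimal sketch of the slimmed-down index Yiannis describes (the exact form is assumed, not quoted from the thread): the georefs move entirely into the partial-index predicate, so they no longer contribute to the key width:

```sql
-- Keep only the selective key columns; test the georefs solely in the
-- WHERE clause of the partial index instead of storing their concatenation.
CREATE INDEX idx_message_copies_wk2_date_src_partial
    ON feed_all_y2012m07.message_copies_wk2
    USING btree (date_trunc('day'::text, msg_date_rec), src_id)
    WHERE pos_georef1 IS NOT NULL AND pos_georef1 <> ''
      AND pos_georef2 IS NOT NULL AND pos_georef2 <> ''
      AND pos_georef3 IS NOT NULL AND pos_georef3 <> ''
      AND pos_georef4 IS NOT NULL AND pos_georef4 <> '';
```

The query's WHERE clause must repeat the same per-column tests for the planner to consider this partial index.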
[PERFORM] A very long running query....
Hello, the following query seems to take ages to execute. However I am more than sure (as you can see from the explain analyse) that it uses all the correct indexes. In general I have serious issues with joins in my database. This is Postgres ver. 9.0 running postgis with the _int.sql contrib enabled. Furthermore, I think that the execution of this query seriously degrades the performance of the database. I had to devise this query and run it as an overnight batch to populate a table, as I couldn't afford users executing it over and over on demand. Unfortunately it is still slow, and some times it either brings down the whole database (my insertions are buffered on the app server) or it never completes before morning. SELECT src_id, date_trunc('day', message_copies.msg_date_rec) as date_count, message_copies.pos_georef1, message_copies.pos_georef2, message_copies.pos_georef3, message_copies.pos_georef4, ais_server.array_accum(CASE WHEN msg_type BETWEEN 1 and 3 THEN message_copies.msg_id END) as msgA_array, ais_server.array_accum(CASE WHEN msg_type = 18 THEN message_copies.msg_id END) as msgB_std_array, ais_server.array_accum(CASE WHEN msg_type = 19 THEN message_copies.msg_id END) as msgB_ext_array, uniq ( ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_A' THEN obj_mmsi END) ) as mmsi_type_A_array, uniq ( ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_B' THEN obj_mmsi END) ) as mmsi_type_B_array, avg(ship_speed) / 10.0 as avg_speed, avg(ship_heading) as avg_heading, avg(ship_course) / 10.0 as avg_course, ST_Multi(ST_Collect(ship_pos_messages.pos_point)) as geom from feed_all_y2012m07.message_copies join (feed_all_y2012m07.ship_pos_messages join ais_server.ship_objects on (ship_pos_messages.obj_id = ship_objects.obj_id)) on (message_copies.msg_id = ship_pos_messages.msg_id) where extract('day' from message_copies.msg_date_rec) = 17 and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17' and message_copies.src_id = 1 and 
(message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4) <> '' and not (message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4) is null and extract('day' from ship_pos_messages.msg_date_rec) = 17 group by src_id, date_count, message_copies.pos_georef1, message_copies.pos_georef2, message_copies.pos_georef3, message_copies.pos_georef4; What follows is the Explain Analyze: HashAggregate (cost=21295.20..21298.51 rows=53 width=148) (actual time=17832235.321..17832318.546 rows=2340 loops=1) - Nested Loop (cost=0.00..21293.21 rows=53 width=148) (actual time=62.188..17801780.764 rows=387105 loops=1) - Nested Loop (cost=0.00..20942.93 rows=53 width=144) (actual time=62.174..17783236.718 rows=387105 loops=1) Join Filter: (feed_all_y2012m07.message_copies.msg_id = feed_all_y2012m07.ship_pos_messages.msg_id) - Append (cost=0.00..19057.93 rows=53 width=33) (actual time=62.124..5486473.545 rows=387524 loops=1) - Seq Scan on message_copies (cost=0.00..0.00 rows=1 width=68) (actual time=0.000..0.000 rows=0 loops=1) Filter: ((src_id = 1) AND (date_trunc('day'::text, msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone) AND (date_part('day'::text, msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) <> ''::text)) - Index Scan using idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2 message_copies (cost=0.00..19057.93 rows=52 width=32) (actual time=62.124..5486270.845 rows=387524 loops=1) Index Cond: ((date_trunc('day'::text, msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone) AND (src_id = 1)) Filter: ((date_part('day'::text, msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) <> ''::text)) - Append (cost=0.00..35.50 rows=5 width=93) (actual time=31.684..31.724 rows=1 loops=387524) - Seq Scan on ship_pos_messages (cost=0.00..0.00 rows=1 width=52) (actual time=0.001..0.001 rows=0 loops=387524) Filter: (date_part('day'::text,
Re: [PERFORM] A very long running query....
On Fri, Jul 20, 2012 at 6:19 PM, Ioannis Anagnostopoulos ioan...@anatec.com wrote: - Nested Loop (cost=0.00..20942.93 rows=53 width=144) (actual time=62.174..17783236.718 rows=387105 loops=1) Join Filter: (feed_all_y2012m07.message_copies.msg_id = feed_all_y2012m07.ship_pos_messages.msg_id) - Append (cost=0.00..19057.93 rows=53 width=33) (actual time=62.124..5486473.545 rows=387524 loops=1) Misestimated row counts... did you try running an analyze, or upping statistic targets?
Re: [PERFORM] A very long running query....
On 20/07/2012 22:23, Claudio Freire wrote: Misestimated row counts... did you try running an analyze, or upping statistic targets? I have run analyse every so often. I think the problem is that as I get 16K new rows every minute, the stats are always out... Possible?
Re: [PERFORM] A very long running query....
On Fri, Jul 20, 2012 at 6:27 PM, Ioannis Anagnostopoulos ioan...@anatec.com wrote: I have run analyse every so often. I think the problem is that as I get 16K new rows every minute, the stats are always out... Possible? Looking at this: - Index Scan using idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2 message_copies (cost=0.00..19057.93 rows=52 width=32) (actual time=62.124..5486270.845 rows=387524 loops=1) Index Cond: ((date_trunc('day'::text, msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone) AND (src_id = 1)) Filter: ((date_part('day'::text, msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) <> ''::text)) It's very possible. I think pg 9.1 had a fix for that, but I'm not sure it will help in your case; I'd have to know what that index looks like.
Re: [PERFORM] A very long running query....
On 20/07/2012 22:33, Claudio Freire wrote: It's very possible. I think pg 9.1 had a fix for that, but I'm not sure it will help in your case, I'd have to know what that index looks like. 
Here is the index: CREATE INDEX idx_message_copies_wk2_date_src_pos_partial ON feed_all_y2012m07.message_copies_wk2 USING btree (date_trunc('day'::text, msg_date_rec), src_id, (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text)) TABLESPACE archive WHERE (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text) IS NOT NULL OR NOT (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text) = ''::text;
Re: [PERFORM] A very long running query....
On 20/07/2012 22:33, Rosser Schwarz wrote:
On Fri, Jul 20, 2012 at 2:27 PM, Ioannis Anagnostopoulos ioan...@anatec.com wrote:
On 20/07/2012 22:23, Claudio Freire wrote:

Misestimated row counts... did you try running an analyze, or upping statistics targets?

I have run analyze every so often. I think the problem is that as I get 16K new rows every minute, the stats are always out of date... Possible?

It may not help much with any skew in your data that results from divergent data appearing, but you can update the statistics targets for those columns and analyze again, and the planner should have much better information about the distributions of their data. The max stats target is 10000, but the default is 100. Increasing it even just to 500 or 1000 should help the planner significantly.

rls

I suppose that this is some kind of postgres.conf tweak?

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
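To answer the postgres.conf question: the per-column override does not require a config change at all. A sketch, using a column from the index discussed in this thread (the target value 1000 is illustrative):

```sql
-- Raise the statistics target for a column with bad estimates,
-- then re-analyze so the planner picks up the larger sample.
ALTER TABLE feed_all_y2012m07.message_copies_wk2
  ALTER COLUMN src_id SET STATISTICS 1000;
ANALYZE feed_all_y2012m07.message_copies_wk2;
```

This keeps default_statistics_target at its global value and pays the extra ANALYZE cost only on the columns that need it.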
Re: [PERFORM] A very long running query....
On 20/07/2012 22:53, Ioannis Anagnostopoulos wrote:
On 20/07/2012 22:33, Rosser Schwarz wrote:
On Fri, Jul 20, 2012 at 2:27 PM, Ioannis Anagnostopoulos ioan...@anatec.com wrote:
On 20/07/2012 22:23, Claudio Freire wrote:

Misestimated row counts... did you try running an analyze, or upping statistics targets?

I have run analyze every so often. I think the problem is that as I get 16K new rows every minute, the stats are always out of date... Possible?

It may not help much with any skew in your data that results from divergent data appearing, but you can update the statistics targets for those columns and analyze again, and the planner should have much better information about the distributions of their data. The max stats target is 10000, but the default is 100. Increasing it even just to 500 or 1000 should help the planner significantly.

rls

I suppose that this is some kind of postgres.conf tweak?

On this Ubuntu installation the default_statistics_target = 1000 and not 100. Do you think that this might be an issue?
Re: [PERFORM] A very long running query....
Claudio Freire klaussfre...@gmail.com writes:

Looking at this:

-> Index Scan using idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2 message_copies (cost=0.00..19057.93 rows=52 width=32) (actual time=62.124..5486270.845 rows=387524 loops=1)
   Index Cond: ((date_trunc('day'::text, msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone) AND (src_id = 1))
   Filter: ((date_part('day'::text, msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) <> ''::text))

I think the real problem is that the planner has no hope of doing anything very accurate with such an unwieldy filter condition. I'd look at ways of making the filter conditions simpler, perhaps by recasting the data representation. In particular, that's a horridly bad way of asking whether some columns are empty, which I gather is the intent. If you really want to do it just like that, creating an index on the concatenation expression would guide ANALYZE to collect some stats about it, but it would probably be a lot more efficient to put together an AND or OR of tests on the individual columns.

regards, tom lane
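A sketch of what "an AND or OR of tests on the individual columns" might look like for the filter in the plan above (note this is not exactly equivalent to the concatenation test: the concatenation is NULL if any single column is NULL, so the precise rewrite depends on the intended semantics):

```sql
-- Hypothetical rewrite: test each georef column directly, so the
-- planner can use its per-column statistics for each simple clause.
SELECT ...
FROM feed_all_y2012m07.message_copies_wk2
WHERE date_trunc('day', msg_date_rec) = '2012-07-17'
  AND src_id = 1
  AND pos_georef1 IS NOT NULL AND pos_georef1 <> ''
  AND pos_georef2 IS NOT NULL AND pos_georef2 <> ''
  AND pos_georef3 IS NOT NULL AND pos_georef3 <> ''
  AND pos_georef4 IS NOT NULL AND pos_georef4 <> '';
```

Each clause here is something ANALYZE already gathers statistics for, unlike the concatenation expression, which only gets statistics if an expression index exists on it.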
Re: [PERFORM] A very long running query....
On 21/07/2012 00:10, Tom Lane wrote:
Claudio Freire klaussfre...@gmail.com writes:

Looking at this:

-> Index Scan using idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2 message_copies (cost=0.00..19057.93 rows=52 width=32) (actual time=62.124..5486270.845 rows=387524 loops=1)
   Index Cond: ((date_trunc('day'::text, msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone) AND (src_id = 1))
   Filter: ((date_part('day'::text, msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text || (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text) <> ''::text))

I think the real problem is that the planner has no hope of doing anything very accurate with such an unwieldy filter condition. I'd look at ways of making the filter conditions simpler, perhaps by recasting the data representation. In particular, that's a horridly bad way of asking whether some columns are empty, which I gather is the intent. If you really want to do it just like that, creating an index on the concatenation expression would guide ANALYZE to collect some stats about it, but it would probably be a lot more efficient to put together an AND or OR of tests on the individual columns.

regards, tom lane

So what you suggest is to forget altogether the concatenation of georef1/2/3/4 and instead alter my query with something like:

georef1 is not null and not georef1 = ''

etc. for georef2, 3 and 4. That would require me to alter my index to have the four georef columns separately in it, rather than as a concatenation, and likewise for the partial index part. And a final thing: you seem to imply that indexes are used by the analyser to collect statistics even when they are not used by a query. So an index serves not only to speed up targeted queries but also to provide better statistics to the analyzer?
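The reworked index described above might look like the following. This is a hypothetical sketch based on the original index definition in the thread, not something proposed verbatim by anyone here; the choice of predicate columns would need to match the query's actual semantics:

```sql
-- Same leading columns as the original index, but with the four
-- georef columns kept separate instead of concatenated, and a
-- partial predicate on individual columns.
CREATE INDEX idx_message_copies_wk2_date_src_georef
  ON feed_all_y2012m07.message_copies_wk2
  USING btree (
    date_trunc('day'::text, msg_date_rec),
    src_id,
    pos_georef1, pos_georef2, pos_georef3, pos_georef4
  )
  TABLESPACE archive
  WHERE pos_georef1 IS NOT NULL AND pos_georef1 <> '';
```

On the statistics question: ANALYZE collects statistics for expression indexes (indexes on computed expressions), which is the only way the planner gets distribution data for an expression like the concatenation; plain column statistics are gathered whether or not any index exists.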
Kind Regards, Yiannis