Re: [PERFORM] Query tuning: partitioning, DISTINCT ON, and indexing
On Thu, Jun 20, 2013 at 10:14 PM, Maciek Sakrejda m.sakre...@gmail.com wrote:

On Thu, Jun 20, 2013 at 9:13 PM, bricklen brick...@gmail.com wrote:

On Thu, Jun 20, 2013 at 6:24 PM, Maciek Sakrejda m.sakre...@gmail.com wrote:

SELECT DISTINCT ON (type) ts, type, details
FROM observations
WHERE subject = '...'
ORDER BY type, ts DESC;

First thing: what is your work_mem set to, and how much RAM is in the machine? If you look at the plan, you'll immediately notice the "external merge Disk" line where the sort spills to disk. Try setting your work_mem to 120MB or so (depending on how much RAM you have, # of concurrent sessions, complexity of queries, etc.).

Good call, thanks, although the in-memory quicksort is not much faster:

QUERY PLAN
---------------------------------------------------------------------------
Unique (cost=471248.30..489392.67 rows=3 width=47) (actual time=32002.133..32817.474 rows=3 loops=1)
  Buffers: shared read=30264
  ->  Sort (cost=471248.30..480320.48 rows=3628873 width=47) (actual time=32002.128..32455.950 rows=3628803 loops=1)
        Sort Key: public.observations.type, public.observations.ts
        Sort Method: quicksort  Memory: 381805kB
        Buffers: shared read=30264
        ->  Result (cost=0.00..75862.81 rows=3628873 width=47) (actual time=0.026..1323.317 rows=3628803 loops=1)
              Buffers: shared read=30264
              ->  Append (cost=0.00..75862.81 rows=3628873 width=47) (actual time=0.026..978.477 rows=3628803 loops=1)
                    Buffers: shared read=30264
...

The machine is not nailed down, but I think I'd need to find a way to drastically improve the plan to keep this in Postgres. The alternative is probably caching the results somewhere else: for any given subject, I only need the latest observation of each type 99.9+% of the time.

Here are some pages that might help for what details to provide:
https://wiki.postgresql.org/wiki/Server_Configuration
https://wiki.postgresql.org/wiki/Slow_Query_Questions

Did you try an index on (type, ts desc)?
I don't have much else to add at this point, but maybe after posting some more server and table (parent and child) details someone will have an answer for you.
Re: [PERFORM] Query tuning: partitioning, DISTINCT ON, and indexing
On Fri, Jun 21, 2013 at 9:08 AM, bricklen brick...@gmail.com wrote:

Did you try an index on (type, ts desc)?

I don't have much else to add at this point, but maybe after posting some more server and table (parent and child) details someone will have an answer for you.

No, this is exactly what I was missing. I had forgotten that the default index order is useless for a descending lookup like this. I made the change and the performance is 3000x better (the plan is using the index now). Thanks for all your help.
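For reference, the fix described above would look something like the following. The exact index definition was never posted in the thread, so this is a sketch using the table and column names from the query quoted earlier:

```sql
-- Matches ORDER BY type, ts DESC, so the DISTINCT ON (type) query can
-- read each type's newest observation directly off the index:
CREATE INDEX observations_type_ts_desc_idx
    ON observations (type, ts DESC);
```

If lookups are always filtered by subject, a three-column variant such as (subject, type, ts DESC) might serve the WHERE clause as well; that is speculation beyond what the thread confirms.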
Re: [PERFORM] Query tuning: partitioning, DISTINCT ON, and indexing
On Thu, Jun 20, 2013 at 6:24 PM, Maciek Sakrejda m.sakre...@gmail.com wrote:

SELECT DISTINCT ON (type) ts, type, details
FROM observations
WHERE subject = '...'
ORDER BY type, ts DESC;

First thing: what is your work_mem set to, and how much RAM is in the machine? If you look at the plan, you'll immediately notice the "external merge Disk" line where the sort spills to disk. Try setting your work_mem to 120MB or so (depending on how much RAM you have, # of concurrent sessions, complexity of queries, etc.).
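The work_mem suggestion above can be tried per session without touching postgresql.conf. A minimal sketch (the 120MB figure is the one suggested in the thread, not a universal recommendation):

```sql
-- Raise the sort memory for this session only, run the query, then revert:
SET work_mem = '120MB';

SELECT DISTINCT ON (type) ts, type, details
FROM observations
WHERE subject = '...'
ORDER BY type, ts DESC;

RESET work_mem;
```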
Re: [PERFORM] Query tuning: partitioning, DISTINCT ON, and indexing
On Thu, Jun 20, 2013 at 9:13 PM, bricklen brick...@gmail.com wrote:

On Thu, Jun 20, 2013 at 6:24 PM, Maciek Sakrejda m.sakre...@gmail.com wrote:

SELECT DISTINCT ON (type) ts, type, details
FROM observations
WHERE subject = '...'
ORDER BY type, ts DESC;

First thing: what is your work_mem set to, and how much RAM is in the machine? If you look at the plan, you'll immediately notice the "external merge Disk" line where the sort spills to disk. Try setting your work_mem to 120MB or so (depending on how much RAM you have, # of concurrent sessions, complexity of queries, etc.).

Good call, thanks, although the in-memory quicksort is not much faster:

QUERY PLAN
---------------------------------------------------------------------------
Unique (cost=471248.30..489392.67 rows=3 width=47) (actual time=32002.133..32817.474 rows=3 loops=1)
  Buffers: shared read=30264
  ->  Sort (cost=471248.30..480320.48 rows=3628873 width=47) (actual time=32002.128..32455.950 rows=3628803 loops=1)
        Sort Key: public.observations.type, public.observations.ts
        Sort Method: quicksort  Memory: 381805kB
        Buffers: shared read=30264
        ->  Result (cost=0.00..75862.81 rows=3628873 width=47) (actual time=0.026..1323.317 rows=3628803 loops=1)
              Buffers: shared read=30264
              ->  Append (cost=0.00..75862.81 rows=3628873 width=47) (actual time=0.026..978.477 rows=3628803 loops=1)
                    Buffers: shared read=30264
...

The machine is not nailed down, but I think I'd need to find a way to drastically improve the plan to keep this in Postgres. The alternative is probably caching the results somewhere else: for any given subject, I only need the latest observation of each type 99.9+% of the time.
Re: [PERFORM] Query tuning help
Hello,

please send EXPLAIN ANALYZE output instead.

Regards
Pavel Stehule

2011/10/11 CS DBA cs_...@consistentstate.com:

Hi all;

I'm trying to tune a difficult query. I have 2 tables:

cust_acct (9 million rows)
cust_orders (200,000 rows)

Here's the query:

SELECT a.account_id, a.customer_id, a.order_id, a.primary_contact_id,
       a.status, a.customer_location_id, a.added_date, o.agent_id,
       o.order_location_id,
       COALESCE(a.customer_location_id, o.order_location_id) AS order_location_id
FROM cust_acct a
JOIN cust_orders o ON a.order_id = o.order_id;

I can't get it to run much faster than about 13 seconds; in most cases it's more like 30 seconds. We have a box with 8 2.5GHz cores and 28GB of RAM; shared_buffers is at 8GB.

I've tried separating the queries as filtering queries and joining the results, disabling seq scans, upping work_mem, and half a dozen other approaches. Here's the explain plan:

Hash Join (cost=151.05..684860.30 rows=9783130 width=100)
  Hash Cond: (a.order_id = o.order_id)
  ->  Seq Scan on cust_acct a (cost=0.00..537962.30 rows=9783130 width=92)
  ->  Hash (cost=122.69..122.69 rows=2269 width=12)
        ->  Seq Scan on cust_orders o (cost=0.00..122.69 rows=2269 width=12)

Thanks in advance for any help, tips, etc...

--
- Kevin Kempter - Constent State
A PostgreSQL Professional Services Company
www.consistentstate.com -

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query tuning help
On 11 October 2011 19:52, CS DBA cs_...@consistentstate.com wrote:

Hi all;

I'm trying to tune a difficult query. I have 2 tables:

cust_acct (9 million rows)
cust_orders (200,000 rows)

Here's the query:

SELECT a.account_id, a.customer_id, a.order_id, a.primary_contact_id,
       a.status, a.customer_location_id, a.added_date, o.agent_id,
       o.order_location_id,
       COALESCE(a.customer_location_id, o.order_location_id) AS order_location_id
FROM cust_acct a
JOIN cust_orders o ON a.order_id = o.order_id;

I can't get it to run much faster than about 13 seconds; in most cases it's more like 30 seconds. We have a box with 8 2.5GHz cores and 28GB of RAM; shared_buffers is at 8GB.

I've tried separating the queries as filtering queries and joining the results, disabling seq scans, upping work_mem, and half a dozen other approaches. Here's the explain plan:

Hash Join (cost=151.05..684860.30 rows=9783130 width=100)
  Hash Cond: (a.order_id = o.order_id)
  ->  Seq Scan on cust_acct a (cost=0.00..537962.30 rows=9783130 width=92)
  ->  Hash (cost=122.69..122.69 rows=2269 width=12)
        ->  Seq Scan on cust_orders o (cost=0.00..122.69 rows=2269 width=12)

Thanks in advance for any help, tips, etc...

Hi,
two simple questions:

- do you really need getting all 9M rows?
- show us the table structure, together with index definitions

regards
Szymon
Re: [PERFORM] Query tuning help
On 10/11/2011 12:02 PM, Pavel Stehule wrote:

Hello,

please send EXPLAIN ANALYZE output instead.

Regards
Pavel Stehule

2011/10/11 CS DBA cs_...@consistentstate.com:

Hi all;

I'm trying to tune a difficult query. I have 2 tables:

cust_acct (9 million rows)
cust_orders (200,000 rows)

Here's the query:

SELECT a.account_id, a.customer_id, a.order_id, a.primary_contact_id,
       a.status, a.customer_location_id, a.added_date, o.agent_id,
       o.order_location_id,
       COALESCE(a.customer_location_id, o.order_location_id) AS order_location_id
FROM cust_acct a
JOIN cust_orders o ON a.order_id = o.order_id;

I can't get it to run much faster than about 13 seconds; in most cases it's more like 30 seconds. We have a box with 8 2.5GHz cores and 28GB of RAM; shared_buffers is at 8GB.

I've tried separating the queries as filtering queries and joining the results, disabling seq scans, upping work_mem, and half a dozen other approaches. Here's the explain plan:

Hash Join (cost=151.05..684860.30 rows=9783130 width=100)
  Hash Cond: (a.order_id = o.order_id)
  ->  Seq Scan on cust_acct a (cost=0.00..537962.30 rows=9783130 width=92)
  ->  Hash (cost=122.69..122.69 rows=2269 width=12)
        ->  Seq Scan on cust_orders o (cost=0.00..122.69 rows=2269 width=12)

Thanks in advance for any help, tips, etc...

Explain Analyze:

Hash Join (cost=154.46..691776.11 rows=10059626 width=100) (actual time=5.191..37551.360 rows=10063432 loops=1)
  Hash Cond: (a.order_id = o.order_id)
  ->  Seq Scan on cust_acct a (cost=0.00..540727.26 rows=10059626 width=92) (actual time=0.022..18987.095 rows=10063432 loops=1)
  ->  Hash (cost=124.76..124.76 rows=2376 width=12) (actual time=5.135..5.135 rows=2534 loops=1)
        ->  Seq Scan on cust_orders o (cost=0.00..124.76 rows=2376 width=12) (actual time=0.011..2.843 rows=2534 loops=1)
Total runtime: 43639.105 ms
(6 rows)

--
- Kevin Kempter - Constent State
A PostgreSQL Professional Services Company
www.consistentstate.com -
Re: [PERFORM] Query tuning help
On 10/11/2011 12:03 PM, Szymon Guz wrote:

On 11 October 2011 19:52, CS DBA cs_...@consistentstate.com wrote:

Hi all;

I'm trying to tune a difficult query. I have 2 tables:

cust_acct (9 million rows)
cust_orders (200,000 rows)

Here's the query:

SELECT a.account_id, a.customer_id, a.order_id, a.primary_contact_id,
       a.status, a.customer_location_id, a.added_date, o.agent_id,
       o.order_location_id,
       COALESCE(a.customer_location_id, o.order_location_id) AS order_location_id
FROM cust_acct a
JOIN cust_orders o ON a.order_id = o.order_id;

I can't get it to run much faster than about 13 seconds; in most cases it's more like 30 seconds. We have a box with 8 2.5GHz cores and 28GB of RAM; shared_buffers is at 8GB.

I've tried separating the queries as filtering queries and joining the results, disabling seq scans, upping work_mem, and half a dozen other approaches. Here's the explain plan:

Hash Join (cost=151.05..684860.30 rows=9783130 width=100)
  Hash Cond: (a.order_id = o.order_id)
  ->  Seq Scan on cust_acct a (cost=0.00..537962.30 rows=9783130 width=92)
  ->  Hash (cost=122.69..122.69 rows=2269 width=12)
        ->  Seq Scan on cust_orders o (cost=0.00..122.69 rows=2269 width=12)

Thanks in advance for any help, tips, etc...

Hi,
two simple questions:

- do you really need getting all 9M rows?

unfortunately yes

- show us the table structure, together with index definitions

cust_acct table:

        Column        |           Type           |                         Modifiers
----------------------+--------------------------+-----------------------------------------------------------
 account_id           | bigint                   | not null default nextval('cust_account_id_seq'::regclass)
 customer_id          | character varying(10)    |
 order_id             | integer                  | not null
 primary_contact_id   | bigint                   |
 status               | accounts_status_type     | not null
 customer_location_id | integer                  |
 added_date           | timestamp with time zone | not null
Indexes:
    cust_acct_pkey PRIMARY KEY, btree (account_id)
    cust_acct_cust_id_indx btree (customer_id)
    cust_acct_order_id_id_indx btree (order_id)
    cust_acct_pri_contact_id_indx btree (primary_contact_id)

cust_orders table:

           Column            |            Type             |                      Modifiers
-----------------------------+-----------------------------+----------------------------------------------------
 order_id                    | integer                     | not null default nextval('order_id_seq'::regclass)
 backorder_tag_id            | character varying(18)       |
 order_location_id           | integer                     | not null
 work_order_name             | character varying(75)       | not null
 status                      | programs_status_type        | not null
 additional_info_tag_shipper | character(16)               | not null
 additional_info_tag_cust    | character(16)               | not null
 additional_info_tag_manuf   | character(16)               | not null
 additional_info_tag_supply  | character(16)               | not null
 acct_active_dt              | timestamp without time zone |
 acct_last_activity_date     | timestamp without time zone |
 acct_backorder_items        | boolean                     | not null default false
 custom_info1                | text                        |
 custom_info2                | text                        |
 custom_info3                | text                        |
 custom_info4                | text                        |
 custom_info5                | text                        |
 custom_info6                | text                        |
 custom_info7                | text                        |
Indexes:
    cust_orders_pkey PRIMARY KEY, btree (order_id)
    cust_orders_order_id_loc_id_key UNIQUE, btree (order_id, order_location_id)
    cust_orders_loc_id_idx btree (order_location_id)

regards
Szymon

--
- Kevin Kempter - Constent State
A PostgreSQL Professional Services Company
www.consistentstate.com -
Re: [PERFORM] Query tuning help
Hash Join (cost=154.46..691776.11 rows=10059626 width=100) (actual time=5.191..37551.360 rows=10063432 loops=1)
  Hash Cond: (a.order_id = o.order_id)
  ->  Seq Scan on cust_acct a (cost=0.00..540727.26 rows=10059626 width=92) (actual time=0.022..18987.095 rows=10063432 loops=1)
  ->  Hash (cost=124.76..124.76 rows=2376 width=12) (actual time=5.135..5.135 rows=2534 loops=1)
        ->  Seq Scan on cust_orders o (cost=0.00..124.76 rows=2376 width=12) (actual time=0.011..2.843 rows=2534 loops=1)
Total runtime: 43639.105 ms
(6 rows)

I am thinking this time is adequate - processing a 10 million row result must be slow.

Some tips:

* Recheck the sequential read speed - whether it is around the expected value.
* Play with work_mem - it is probably not enough for one bucket. You can decrease the time by about 10-20 sec, but watch out for going into swap. EXPLAIN ANALYZE VERBOSE shows the number of buckets - ideal is one.
* Use some filter if it's possible.
* Use a limit if it's possible.

If you really have to process all rows and you need better reaction time, try to use a cursor. It is optimized for a fast first row.

Regards
Pavel Stehule
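The cursor suggestion above might be sketched like this (cursor name and fetch size are illustrative, not from the thread; the column list is abridged from the query under discussion):

```sql
BEGIN;

-- A cursor lets the client start consuming rows immediately instead of
-- waiting for the full ~10M-row result set to be transferred:
DECLARE acct_cur CURSOR FOR
    SELECT a.account_id, a.customer_id, a.order_id, o.agent_id,
           COALESCE(a.customer_location_id, o.order_location_id) AS order_location_id
    FROM cust_acct a
    JOIN cust_orders o ON a.order_id = o.order_id;

-- Fetch in batches; repeat until no rows come back:
FETCH 1000 FROM acct_cur;

CLOSE acct_cur;
COMMIT;
```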
Re: [PERFORM] query tuning help
On 06/14/2010 05:41 AM, AI Rumman wrote:

Can anyone please help me in tuning the query?

explain select *
from (select * from crmentity
      where deleted = 0
        and createdtime between (now() - interval '6 month') and now()) as crmentity
inner join (select * from activity
            where activitytype = 'Emails'
              and date_start between (now() - interval '6 month') and now()) as activity
        on crmentity.crmid = activity.activityid
inner join emaildetails on emaildetails.emailid = crmentity.crmid
inner join vantage_email_track on vantage_email_track.mailid = emaildetails.emailid
left join seactivityrel on seactivityrel.activityid = emaildetails.emailid

Can you send us 'explain analyze' too?

->  Seq Scan on emaildetails (cost=0.00..1686.95 rows=44595 width=186)
->  Seq Scan on vantage_email_track (cost=0.00..1324.52 rows=88852 width=12)

do you have indexes on emaildetails(emailid) and vantage_email_track(mailid)?

-Andy
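If those indexes are indeed missing, creating them would look something like this (index names are illustrative; the thread only names the tables and columns):

```sql
-- Indexes on the join columns the seq scans above are hitting:
CREATE INDEX emaildetails_emailid_idx
    ON emaildetails (emailid);

CREATE INDEX vantage_email_track_mailid_idx
    ON vantage_email_track (mailid);
```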
Re: [PERFORM] Query tuning
2009/8/19 Kevin Kempter kev...@consistentstate.com

We do have an index on url_hits.time; not sure why timestamps were not used, I was not here for the design phase.

What's the type of the time column? I don't like that it casts it to double in the explain. If it is integer, maybe you need to change

and time >= extract('epoch' from timestamp '2009-08-12')
and time <  extract('epoch' from timestamp '2009-08-13')

to

and time >= extract('epoch' from timestamp '2009-08-12')::int4
and time <  extract('epoch' from timestamp '2009-08-13')::int4

for the index to be used?
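Assuming url_hits.time really is an integer column, the suggested rewrite would look like this in the context of the query discussed elsewhere in the thread (abridged to a count for illustration):

```sql
-- Casting the extract() results to int4 keeps the comparison
-- integer-typed, so the btree index on url_hits.time can be used
-- instead of casting every row's time to double precision:
SELECT count(*)
FROM pwreport.url_hits
WHERE time >= extract('epoch' from timestamp '2009-08-12')::int4
  AND time <  extract('epoch' from timestamp '2009-08-13')::int4;
```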
Re: [PERFORM] Query tuning
that seems to be the killer:

and time >= extract('epoch' from timestamp '2009-08-12')
and time <  extract('epoch' from timestamp '2009-08-13')

You probably need an index on time/epoch:

CREATE INDEX foo ON table (extract('epoch' from time));

or something like that; vacuum analyze and retry.
Re: [PERFORM] Query tuning
2009/8/19 Grzegorz Jaśkiewicz gryz...@gmail.com

that seems to be the killer:

and time >= extract('epoch' from timestamp '2009-08-12')
and time <  extract('epoch' from timestamp '2009-08-13')

You probably need an index on time/epoch:

CREATE INDEX foo ON table (extract('epoch' from time));

or something like that, vacuum analyze and retry.

It looks like those extracts just make constant integer times. You could probably just create an index on the time column. Also, why not store times as timestamps?
Re: [PERFORM] Query tuning
On Wednesday 19 August 2009 11:17:26 Scott Carey wrote:

On 8/19/09 9:28 AM, Kevin Kempter kev...@consistentstate.com wrote:

Hi all;

we've been fighting this query for a few days now. We bumped up the statistics target for the a.id, c.url_hits_id and b.id columns below to 250 and ran an analyze on the relevant tables. We killed it after 8hrs. Note the url_hits table has 1.4 billion rows.

Any suggestions?

Have you tried setting work_mem higher for just this query?

Yes, we upped it to 500Meg

The big estimated cost is the sequential scan on url_hits. But in reality, if the estimates are off, the sort and index scan at the end might be your bottleneck. Larger work_mem might make it choose another plan there. But if the true cost is the sequential scan on url_hits, then only an index there will help.

$ psql -ef expl.sql pwreport
explain select a.id, ident_id, time, customer_name,
       extract('day' from timezone(e.name, to_timestamp(a.time))) as day,
       category_id
from pwreport.url_hits a
     left outer join pwreport.url_hits_category_jt c on (a.id = c.url_hits_id),
     pwreport.ident b,
     pwreport.timezone e
where a.ident_id = b.id
  and b.timezone_id = e.id
  and time >= extract('epoch' from timestamp '2009-08-12')
  and time <  extract('epoch' from timestamp '2009-08-13')
  and direction = 'REQUEST';

QUERY PLAN
--------------------------------------------------------------------------------
Merge Right Join (cost=47528508.61..180424544.59 rows=10409251 width=53)
  Merge Cond: (c.url_hits_id = a.id)
  ->  Index Scan using mt_url_hits_category_jt_url_hits_id_index on url_hits_category_jt c (cost=0.00..122162596.63 rows=4189283233 width=8)
  ->  Sort (cost=47528508.61..47536931.63 rows=3369210 width=49)
        Sort Key: a.id
        ->  Hash Join (cost=2565.00..47163219.21 rows=3369210 width=49)
              Hash Cond: (b.timezone_id = e.id)
              ->  Hash Join (cost=2553.49..47116881.07 rows=3369210 width=37)
                    Hash Cond: (a.ident_id = b.id)
                    ->  Seq Scan on url_hits a (cost=0.00..47051154.89 rows=3369210 width=12)
                          Filter: ((direction = 'REQUEST'::proxy_direction_enum) AND ((time)::double precision >= 1250035200::double precision) AND ((time)::double precision < 1250121600::double precision))
                    ->  Hash (cost=2020.44..2020.44 rows=42644 width=29)
                          ->  Seq Scan on ident b (cost=0.00..2020.44 rows=42644 width=29)
              ->  Hash (cost=6.78..6.78 rows=378 width=20)
                    ->  Seq Scan on timezone e (cost=0.00..6.78 rows=378 width=20)
(15 rows)
Re: [PERFORM] Query tuning
On Wednesday 19 August 2009 11:31:30 Nikolas Everett wrote:

2009/8/19 Grzegorz Jaśkiewicz gryz...@gmail.com

that seems to be the killer:

and time >= extract('epoch' from timestamp '2009-08-12')
and time <  extract('epoch' from timestamp '2009-08-13')

You probably need an index on time/epoch:

CREATE INDEX foo ON table (extract('epoch' from time));

or something like that, vacuum analyze and retry.

It looks like those extracts just make constant integer times. You could probably just create an index on the time column. Also, why not store times as timestamps?

We do have an index on url_hits.time; not sure why timestamps were not used, I was not here for the design phase.

Thx
Re: [PERFORM] Query tuning
I get the same plan after running vacuum analyze. Nope, I don't have an index on objdomainid, objid and userdomainid. Only eventtime has it.

-----Original Message-----
From: Chris [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 22, 2006 8:06 PM
To: Subbiah, Stalin
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query tuning

Subbiah, Stalin wrote:

Actually these servers will be upgraded to 8.1.4 in a couple of months.

even so, you could get some bad data in there. http://www.postgresql.org/docs/8.0/static/release.html . Go through the old release notes and you'll find various race conditions, crashes etc.

Here you go with explain analyze.

# explain analyze SELECT * FROM EVENTLOG
  WHERE EVENTTIME > '07/23/06 16:00:00'
    AND EVENTTIME < '08/22/06 16:00:00'
    AND (OBJDOMAINID = 'tzRh39d0d91luNGT1weIUjLvFIcA'
         OR OBJID = 'tzRh39d0d91luNGT1weIUjLvFIcA'
         OR USERDOMAINID = 'tzRh39d0d91luNGT1weIUjLvFIcA')
  ORDER BY EVENTTIME DESC, SEQUENCENUM DESC LIMIT 500 OFFSET 500;

QUERY PLAN
--------------------------------------------------------------------------------
Limit (cost=15583110.14..15583111.39 rows=500 width=327) (actual time=427771.568..427772.904 rows=500 loops=1)
  ->  Sort (cost=15583108.89..15618188.88 rows=14031998 width=327) (actual time=427770.504..427771.894 rows=1000 loops=1)
        Sort Key: eventtime, sequencenum
        ->  Seq Scan on eventlog (cost=0.00..2334535.17 rows=14031998 width=327) (actual time=10.370..190038.764 rows=7699388 loops=1)
              Filter: ((eventtime > '2006-07-23 16:00:00'::timestamp without time zone) AND (eventtime < '2006-08-22 16:00:00'::timestamp without time zone) AND (((objdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((objid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((userdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text)))
Total runtime: 437884.134 ms
(6 rows)

If you analyze the table then run this again what plan does it come back with?

I can't read explain output properly but I suspect (and I'm sure I'll be corrected if need be) that the sort step is way out of whack and so is the seq scan because the stats aren't up to date enough.

Do you have an index on objdomainid, objid and userdomainid (one index per field)? I wonder if that will help much.

--
Postgresql php tutorials
http://www.designmagick.com/

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faq
Re: [PERFORM] Query tuning
It seems to me that what would work best is an index scan backward on the eventtime index. I don't see why that wouldn't work for you; maybe the planner is just estimating that the seq scan and sort is faster for some reason. What does EXPLAIN say if you use a small limit and offset like 10? Or what does EXPLAIN say if you first run set enable_seqscan=false; (If you get the same plan, then I wouldn't bother running EXPLAIN ANALYZE, but if you get a different plan I would run EXPLAIN ANALYZE to see if the new plan is any faster.)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Subbiah, Stalin
Sent: Wednesday, August 23, 2006 1:03 PM
To: Chris
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query tuning

I get the same plan after running vacuum analyze. Nope, I don't have an index on objdomainid, objid and userdomainid. Only eventtime has it.

-----Original Message-----
From: Chris [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 22, 2006 8:06 PM
To: Subbiah, Stalin
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query tuning

Subbiah, Stalin wrote:

Actually these servers will be upgraded to 8.1.4 in a couple of months.

even so, you could get some bad data in there. http://www.postgresql.org/docs/8.0/static/release.html . Go through the old release notes and you'll find various race conditions, crashes etc.

Here you go with explain analyze.

# explain analyze SELECT * FROM EVENTLOG
  WHERE EVENTTIME > '07/23/06 16:00:00'
    AND EVENTTIME < '08/22/06 16:00:00'
    AND (OBJDOMAINID = 'tzRh39d0d91luNGT1weIUjLvFIcA'
         OR OBJID = 'tzRh39d0d91luNGT1weIUjLvFIcA'
         OR USERDOMAINID = 'tzRh39d0d91luNGT1weIUjLvFIcA')
  ORDER BY EVENTTIME DESC, SEQUENCENUM DESC LIMIT 500 OFFSET 500;

QUERY PLAN
--------------------------------------------------------------------------------
Limit (cost=15583110.14..15583111.39 rows=500 width=327) (actual time=427771.568..427772.904 rows=500 loops=1)
  ->  Sort (cost=15583108.89..15618188.88 rows=14031998 width=327) (actual time=427770.504..427771.894 rows=1000 loops=1)
        Sort Key: eventtime, sequencenum
        ->  Seq Scan on eventlog (cost=0.00..2334535.17 rows=14031998 width=327) (actual time=10.370..190038.764 rows=7699388 loops=1)
              Filter: ((eventtime > '2006-07-23 16:00:00'::timestamp without time zone) AND (eventtime < '2006-08-22 16:00:00'::timestamp without time zone) AND (((objdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((objid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((userdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text)))
Total runtime: 437884.134 ms
(6 rows)

If you analyze the table then run this again what plan does it come back with?

I can't read explain output properly but I suspect (and I'm sure I'll be corrected if need be) that the sort step is way out of whack and so is the seq scan because the stats aren't up to date enough.

Do you have an index on objdomainid, objid and userdomainid (one index per field)? I wonder if that will help much.
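The backward index scan described above would be directly applicable with an index matching the full ORDER BY. Since the original schema only has an index on eventtime, the following composite index is a hypothetical addition, not something posted in the thread:

```sql
-- An index whose order matches ORDER BY eventtime DESC, sequencenum DESC
-- lets the planner satisfy the LIMIT/OFFSET by walking the index instead
-- of sorting 14M rows:
CREATE INDEX eventlog_time_seq_desc_idx
    ON eventlog (eventtime DESC, sequencenum DESC);
```

Note, though, that the OR conditions on objdomainid/objid/userdomainid would still be applied as a filter while walking the index.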
Re: [PERFORM] Query tuning
Changing limit or offset to a small number doesn't produce any change in plans. Likewise enable_seqscan set to false. They still take 8-10 mins to run.

-----Original Message-----
From: Dave Dutcher [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 23, 2006 4:20 PM
To: Subbiah, Stalin
Cc: pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Query tuning

It seems to me that what would work best is an index scan backward on the eventtime index. I don't see why that wouldn't work for you; maybe the planner is just estimating that the seq scan and sort is faster for some reason. What does EXPLAIN say if you use a small limit and offset like 10? Or what does EXPLAIN say if you first run set enable_seqscan=false; (If you get the same plan, then I wouldn't bother running EXPLAIN ANALYZE, but if you get a different plan I would run EXPLAIN ANALYZE to see if the new plan is any faster.)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Subbiah, Stalin
Sent: Wednesday, August 23, 2006 1:03 PM
To: Chris
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query tuning

I get the same plan after running vacuum analyze. Nope, I don't have an index on objdomainid, objid and userdomainid. Only eventtime has it.

-----Original Message-----
From: Chris [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 22, 2006 8:06 PM
To: Subbiah, Stalin
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query tuning

Subbiah, Stalin wrote:

Actually these servers will be upgraded to 8.1.4 in a couple of months.

even so, you could get some bad data in there. http://www.postgresql.org/docs/8.0/static/release.html . Go through the old release notes and you'll find various race conditions, crashes etc.

Here you go with explain analyze.

# explain analyze SELECT * FROM EVENTLOG
  WHERE EVENTTIME > '07/23/06 16:00:00'
    AND EVENTTIME < '08/22/06 16:00:00'
    AND (OBJDOMAINID = 'tzRh39d0d91luNGT1weIUjLvFIcA'
         OR OBJID = 'tzRh39d0d91luNGT1weIUjLvFIcA'
         OR USERDOMAINID = 'tzRh39d0d91luNGT1weIUjLvFIcA')
  ORDER BY EVENTTIME DESC, SEQUENCENUM DESC LIMIT 500 OFFSET 500;

QUERY PLAN
--------------------------------------------------------------------------------
Limit (cost=15583110.14..15583111.39 rows=500 width=327) (actual time=427771.568..427772.904 rows=500 loops=1)
  ->  Sort (cost=15583108.89..15618188.88 rows=14031998 width=327) (actual time=427770.504..427771.894 rows=1000 loops=1)
        Sort Key: eventtime, sequencenum
        ->  Seq Scan on eventlog (cost=0.00..2334535.17 rows=14031998 width=327) (actual time=10.370..190038.764 rows=7699388 loops=1)
              Filter: ((eventtime > '2006-07-23 16:00:00'::timestamp without time zone) AND (eventtime < '2006-08-22 16:00:00'::timestamp without time zone) AND (((objdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((objid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((userdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text)))
Total runtime: 437884.134 ms
(6 rows)

If you analyze the table then run this again what plan does it come back with?

I can't read explain output properly but I suspect (and I'm sure I'll be corrected if need be) that the sort step is way out of whack and so is the seq scan because the stats aren't up to date enough.

Do you have an index on objdomainid, objid and userdomainid (one index per field)? I wonder if that will help much.
Re: [PERFORM] Query tuning
Subbiah, Stalin wrote: Hello All, This query runs forever and ever. Nature of this table being lots of inserts/deletes/query, I vacuum it every half hour to keep the holes reusable and nightly once vacuum analyze to update the optimizer. We've got index on eventtime only. Running it for current day uses index range scan and it runs within acceptable time. Below is the explain of the query. Is the order by sequencenum desc prevents from applying limit optimization? explain SELECT * FROM EVENTLOG WHERE EVENTTIME'07/23/06 16:00:00' AND EVENTTIME'08/22/06 16:00:00' AND (OBJDOMAINID='tzRh39d0d91luNGT1weIUjLvFIcA' OR OBJID='tzRh39d0d91luNGT1weIUjLvFIcA' OR USERDOMAINID='tzRh39d0d91luNGT1weIUjLvFIcA') ORDER BY EVENTTIME DESC, SEQUENCENUM DESC LIMIT 500 OFFSET 0; QUERY PLAN - Limit (cost=15546930.29..15546931.54 rows=500 width=327) - Sort (cost=15546930.29..15581924.84 rows=13997819 width=327) Sort Key: eventtime, sequencenum - Seq Scan on eventlog (cost=0.00..2332700.25 rows=13997819 width=327) Filter: ((eventtime '2006-07-23 16:00:00'::timestamp without time zone) AND (eventtime '2006-08-22 16:00:00'::timestamp without time zone) AND (((objdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((objid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((userdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text))) (5 rows) Thanks, Stalin Pg version 8.0.1, suse 64bit. Firstly you should update to 8.0.8 - because it's in the same stream you won't need to do a dump/initdb/reload like a major version change, it should be a simple upgrade. Can you send explain analyze instead of just explain? It sounds like you're not analyz'ing enough - if you're doing lots of updates/deletes/inserts then the statistics postgresql uses to choose whether to do an index scan or something else will quickly be outdated and so it'll have to go back to a full table scan every time.. Can you set up autovacuum to handle that for you more regularly? 
-- Postgresql php tutorials http://www.designmagick.com/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
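One option not yet raised in this thread: an index whose column order matches the ORDER BY. A sketch, with a hypothetical index name; PostgreSQL can scan a (eventtime, sequencenum) btree backward to produce eventtime DESC, sequencenum DESC directly, so the sort over ~14 million rows disappears and the scan can stop after OFFSET + LIMIT rows:

```sql
-- Hypothetical index; its column order matches the ORDER BY, so a
-- backward index scan can satisfy the ordering with no Sort node.
CREATE INDEX eventlog_time_seq_idx ON eventlog (eventtime, sequencenum);

SELECT * FROM eventlog
WHERE eventtime > '2006-07-23 16:00:00'
  AND eventtime < '2006-08-22 16:00:00'
ORDER BY eventtime DESC, SEQUENCENUM DESC
LIMIT 500 OFFSET 0;
```

Note the three-way OR on objdomainid/objid/userdomainid would still be applied as a filter on each row read from the index, so rows it rejects still cost index reads.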
Re: [PERFORM] Query tuning
Actually these servers will be upgraded to 8.1.4 in a couple of months. Here you go with explain analyze:

# explain analyze SELECT * FROM EVENTLOG WHERE EVENTTIME>'07/23/06 16:00:00' AND EVENTTIME<'08/22/06 16:00:00' AND (OBJDOMAINID='tzRh39d0d91luNGT1weIUjLvFIcA' OR OBJID='tzRh39d0d91luNGT1weIUjLvFIcA' OR USERDOMAINID='tzRh39d0d91luNGT1weIUjLvFIcA') ORDER BY EVENTTIME DESC, SEQUENCENUM DESC LIMIT 500 OFFSET 500;

QUERY PLAN
------------------------------------------------------------------------
Limit (cost=15583110.14..15583111.39 rows=500 width=327) (actual time=427771.568..427772.904 rows=500 loops=1)
  -> Sort (cost=15583108.89..15618188.88 rows=14031998 width=327) (actual time=427770.504..427771.894 rows=1000 loops=1)
        Sort Key: eventtime, sequencenum
        -> Seq Scan on eventlog (cost=0.00..2334535.17 rows=14031998 width=327) (actual time=10.370..190038.764 rows=7699388 loops=1)
              Filter: ((eventtime > '2006-07-23 16:00:00'::timestamp without time zone) AND (eventtime < '2006-08-22 16:00:00'::timestamp without time zone) AND (((objdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((objid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((userdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text)))
Total runtime: 437884.134 ms
(6 rows)

-----Original Message-----
From: Chris [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 22, 2006 6:37 PM
To: Subbiah, Stalin
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query tuning

Subbiah, Stalin wrote: Hello All, This query runs forever and ever. Given the nature of this table (lots of inserts/deletes/queries), I vacuum it every half hour to keep the holes reusable, and run a nightly vacuum analyze to update the optimizer statistics. We've got an index on eventtime only. Running it for the current day uses an index range scan and it runs within acceptable time. Below is the explain of the query. Does the ORDER BY sequencenum DESC prevent the limit optimization from being applied?
explain SELECT * FROM EVENTLOG WHERE EVENTTIME>'07/23/06 16:00:00' AND EVENTTIME<'08/22/06 16:00:00' AND (OBJDOMAINID='tzRh39d0d91luNGT1weIUjLvFIcA' OR OBJID='tzRh39d0d91luNGT1weIUjLvFIcA' OR USERDOMAINID='tzRh39d0d91luNGT1weIUjLvFIcA') ORDER BY EVENTTIME DESC, SEQUENCENUM DESC LIMIT 500 OFFSET 0;

QUERY PLAN
------------------------------------------------------------------------
Limit (cost=15546930.29..15546931.54 rows=500 width=327)
  -> Sort (cost=15546930.29..15581924.84 rows=13997819 width=327)
        Sort Key: eventtime, sequencenum
        -> Seq Scan on eventlog (cost=0.00..2332700.25 rows=13997819 width=327)
              Filter: ((eventtime > '2006-07-23 16:00:00'::timestamp without time zone) AND (eventtime < '2006-08-22 16:00:00'::timestamp without time zone) AND (((objdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((objid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((userdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text)))
(5 rows)

Thanks, Stalin

Pg version 8.0.1, suse 64bit.

Firstly you should update to 8.0.8 - because it's in the same stream you won't need to do a dump/initdb/reload like a major version change; it should be a simple upgrade. Can you send explain analyze instead of just explain? It sounds like you're not analyzing enough - if you're doing lots of updates/deletes/inserts then the statistics PostgreSQL uses to choose whether to do an index scan or something else will quickly become outdated, and it'll have to go back to a full table scan every time. Can you set up autovacuum to handle that for you more regularly?

--
Postgresql php tutorials
http://www.designmagick.com/
Re: [PERFORM] Query tuning
Subbiah, Stalin wrote: Actually these servers will be upgraded to 8.1.4 in a couple of months.

Even so, you could get some bad data in there before then. See http://www.postgresql.org/docs/8.0/static/release.html - go through the old release notes and you'll find various race conditions, crashes etc.

Here you go with explain analyze:

# explain analyze SELECT * FROM EVENTLOG WHERE EVENTTIME>'07/23/06 16:00:00' AND EVENTTIME<'08/22/06 16:00:00' AND (OBJDOMAINID='tzRh39d0d91luNGT1weIUjLvFIcA' OR OBJID='tzRh39d0d91luNGT1weIUjLvFIcA' OR USERDOMAINID='tzRh39d0d91luNGT1weIUjLvFIcA') ORDER BY EVENTTIME DESC, SEQUENCENUM DESC LIMIT 500 OFFSET 500;

QUERY PLAN
------------------------------------------------------------------------
Limit (cost=15583110.14..15583111.39 rows=500 width=327) (actual time=427771.568..427772.904 rows=500 loops=1)
  -> Sort (cost=15583108.89..15618188.88 rows=14031998 width=327) (actual time=427770.504..427771.894 rows=1000 loops=1)
        Sort Key: eventtime, sequencenum
        -> Seq Scan on eventlog (cost=0.00..2334535.17 rows=14031998 width=327) (actual time=10.370..190038.764 rows=7699388 loops=1)
              Filter: ((eventtime > '2006-07-23 16:00:00'::timestamp without time zone) AND (eventtime < '2006-08-22 16:00:00'::timestamp without time zone) AND (((objdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((objid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text) OR ((userdomainid)::text = 'tzRh39d0d91luNGT1weIUjLvFIcA'::text)))
Total runtime: 437884.134 ms
(6 rows)

If you analyze the table then run this again, what plan does it come back with? I can't read explain output properly but I suspect (and I'm sure I'll be corrected if need be) that the sort step is way out of whack, and so is the seq scan, because the stats aren't up to date enough. Do you have an index on objdomainid, objid and userdomainid (one index per field)? I wonder if that will help much.

--
Postgresql php tutorials
http://www.designmagick.com/
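A sketch of the one-index-per-field suggestion (index names hypothetical). On 8.1 and later, the planner can combine these with a bitmap OR to handle the three-way OR condition:

```sql
CREATE INDEX eventlog_objdomainid_idx  ON eventlog (objdomainid);
CREATE INDEX eventlog_objid_idx        ON eventlog (objid);
CREATE INDEX eventlog_userdomainid_idx ON eventlog (userdomainid);

-- Refresh statistics so the planner weighs the new indexes against
-- up-to-date row estimates rather than stale ones.
ANALYZE eventlog;
```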
Re: [PERFORM] Query tuning help
Hi Dan, I tried to understand your query, but I couldn't get my understanding of the query and your description in sync. Why do you use sub selects? Wouldn't a simple recordtext LIKE '%RED%' do the trick too? You combine all your where conditions with AND, so to me it looks like you only get rows containing both RED and CORVETTE. From your description I would rewrite the query as

explain analyze
select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
from ea join em using (incidentid) join eg using (incidentid)
where em.entrydate >= '2005-1-1 00:00'::date
  and em.entrydate <= '2005-5-9 00:00'::date
  and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )
order by em.entrydate;

That should give you all rows containing one of the words. Does it work? Is it faster? Is it fast enough?

Ulrich
Re: [PERFORM] Query tuning help
In article [EMAIL PROTECTED], Dan Harris [EMAIL PROTECTED] writes: On May 8, 2005, at 8:06 PM, Josh Berkus wrote: If I were to use tsearch2 for full-text indexing, would I need to create another table that merges all of my recordtext rows into a single 'text' field type? No. Read the OpenFTS docs, they are fairly clear on how to set up a simple FTS index. (TSearch2 ~~ OpenFTS) If so, this is where I run into problems, as my logic also needs to match multiple words in their original order. I have been reading the Tsearch2 docs and either I don't understand something or I'm not communicating my situation clearly enough. It seems that Tsearch2 has a concept of document. And, in everything I am reading, they expect your document to be all contained in a single row. Since my words can be spread across multiple rows, I don't see that Tsearch2 will combine all 'recordtext' row values with the same incidentid into a single vector. Am I overlooking something in the docs? AFAICS no, but you could create a separate table containing just the distinct incidentids and the tsearch2 vectors of all recordtexts matching that incidentid. This table would get updated solely by triggers on the original table and would provide a fast way to get all incidentids for RED and CORVETTE. The question is: would this reduce the number of rows to check more than filtering on date?
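A sketch of the trigger-maintained side table described above. All names, the example incidentid '1', and the trigger strategy are hypothetical; this uses tsearch2-era two-argument to_tsvector/to_tsquery:

```sql
-- One row per incident, holding the combined vector of all its
-- recordtext lines.
CREATE TABLE ea_fti (
    incidentid varchar PRIMARY KEY,
    vector     tsvector
);
CREATE INDEX ea_fti_vector_idx ON ea_fti USING gist (vector);

-- Rebuild one incident's combined vector; a trigger on ea would run
-- the equivalent for NEW.incidentid on every insert/update/delete,
-- which also handles replaced lines cleanly.
DELETE FROM ea_fti WHERE incidentid = '1';
INSERT INTO ea_fti
SELECT '1',
       to_tsvector('default',
           array_to_string(ARRAY(SELECT recordtext FROM ea
                                 WHERE incidentid = '1'), ' '));

-- All incidents whose accumulated text mentions both words,
-- regardless of which row each word landed in:
SELECT incidentid FROM ea_fti
WHERE vector @@ to_tsquery('default', 'RED & CORVETTE');
```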
Re: [PERFORM] Query tuning help
Quoting Russell Smith [EMAIL PROTECTED]: On Mon, 9 May 2005 11:49 am, Dan Harris wrote: On May 8, 2005, at 6:51 PM, Russell Smith wrote: [snip]

select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
FROM em JOIN ea ON em.incidentid = ea.incidentid   --- slight paraphrase /Mischa.
AND em.entrydate between '2005-1-1' and '2005-5-9'
AND ea.recordtext like '%RED%' AND ea.recordtext like '%CORVETTE%'

Here's the situation: Due to the format of the systems with which I integrate (I have no control over these formats), we will get these 'recordtext' values one line at a time, accumulating over time. The only way I can find to make this work is to insert a new record for each line. The problem is that when someone wants to search multiple keywords, they expect these words to be matched across multiple records with a given incident number. For a very simple example:

IncidentID  Date              Recordtext
----------  ----------------  ----------------------------
1           2005-05-01 14:21  blah blah blah RED blah blah
2           2005-05-01 14:23  not what we are looking for
1           2005-05-02 02:05  blah CORVETTE blah blah

select em.incidentid, ea.recordtext as retdata
from em
join ( -- equivalent to where incidentid in (...), sometimes faster.
    select incidentid
    from em join ea using (incidentid)
    where em.entrydate between '2005-1-1' and '2005-5-9'
    group by incidentid
    having 1 = min(case when recordtext like '%RED%' then 1 end)
       and 1 = min(case when recordtext like '%CORVETTE%' then 1 end)
) as X using (incidentid);
Re: [PERFORM] Query tuning help
On Mon, 9 May 2005 09:20 am, Dan Harris wrote: Sorry to bother everyone with yet another my query isn't using an index problem, but I am over my head on this one. I am open to ways of restructuring this query to perform better. I have a table, 'ea', with 22 million rows in it. VACUUM ANALYZE has just been run on the table. This is the result of:

explain analyze select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
from ea, em, eg
where em.incidentid = ea.incidentid
  and em.incidentid = eg.incidentid
  and em.entrydate >= '2005-1-1 00:00'
  and em.entrydate <= '2005-5-9 00:00'
  and ea.incidentid in ( select incidentid from ea where recordtext like '%RED%' )
  and ea.incidentid in ( select incidentid from ea where recordtext like '%CORVETTE%' )
  and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )
order by em.entrydate

You cannot use an index for %CORVETTE% or %RED%. There is no way for the index to know whether a row has that in the middle without scanning the whole index, so it's much cheaper to do a sequential scan. One possible way to make the query faster is to limit based on date, as you will only get about 700 rows. And then don't use subselects, as they are doing full sequential scans. I think this query does what you do above, and I think it will be faster, but I don't know.
select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
FROM em
JOIN ea ON (em.incidentid = ea.incidentid
    AND em.entrydate >= '2005-1-1 00:00'
    AND em.entrydate <= '2005-5-9 00:00'
    AND ea.recordtext like '%RED%' AND ea.recordtext like '%CORVETTE%')
JOIN eg ON em.incidentid = eg.incidentid
WHERE (recordtext like '%RED%' or recordtext like '%CORVETTE%' );

- ANALYZE RESULTS -

Unique (cost=774693.72..774693.76 rows=1 width=159) (actual time=446787.056..446787.342 rows=72 loops=1)
  -> Sort (cost=774693.72..774693.72 rows=1 width=159) (actual time=446787.053..446787.075 rows=72 loops=1)
        Sort Key: em.incidentid, public.ea.recordtext, eg.long, eg.lat
        -> Nested Loop (cost=771835.10..774693.71 rows=1 width=159) (actual time=444378.655..446786.746 rows=72 loops=1)
              -> Nested Loop (cost=771835.10..774688.81 rows=1 width=148) (actual time=444378.532..446768.381 rows=72 loops=1)
                    -> Nested Loop IN Join (cost=771835.10..774678.88 rows=2 width=81) (actual time=444367.080..446191.864 rows=701 loops=1)
                          -> Nested Loop (cost=771835.10..774572.05 rows=42 width=64) (actual time=444366.859..445463.232 rows=1011 loops=1)
                                -> HashAggregate (cost=771835.10..771835.10 rows=1 width=17) (actual time=444366.702..444368.583 rows=473 loops=1)
                                      -> Seq Scan on ea (cost=0.00..771834.26 rows=335 width=17) (actual time=259.746..444358.837 rows=592 loops=1)
                                            Filter: ((recordtext)::text ~~ '%CORVETTE%'::text)
                                -> Index Scan using ea1 on ea (cost=0.00..2736.43 rows=42 width=47) (actual time=2.085..2.309 rows=2 loops=473)
                                      Index Cond: ((ea.incidentid)::text = (outer.incidentid)::text)
                                      Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text))
                          -> Index Scan using ea1 on ea (cost=0.00..2733.81 rows=42 width=17) (actual time=0.703..0.703 rows=1 loops=1011)
                                Index Cond: ((outer.incidentid)::text = (ea.incidentid)::text)
                                Filter: ((recordtext)::text ~~ '%RED%'::text)
                    -> Index Scan using em_incidentid_idx on em (cost=0.00..4.95 rows=1 width=67) (actual time=0.820..0.821 rows=0 loops=701)
                          Index Cond: ((outer.incidentid)::text = (em.incidentid)::text)
                          Filter: ((entrydate >= '2005-01-01 00:00:00'::timestamp without time zone) AND (entrydate <= '2005-05-09 00:00:00'::timestamp without time zone))
              -> Index Scan using eg_incidentid_idx on eg (cost=0.00..4.89 rows=1 width=79) (actual time=0.245..0.246 rows=1 loops=72)
                    Index Cond: ((outer.incidentid)::text = (eg.incidentid)::text)
Total runtime: 446871.880 ms
(22 rows)

- EXPLANATION -

The reason for the redundant LIKE clause is that first, I only want those incidentids that contain the words 'RED' and 'CORVETTE'. BUT, those two words may exist across multiple records with the same incidentid. Then, I only want to actually work with the rows that contain one of the words. This query will repeat the same logic for however many keywords are entered.
Re: [PERFORM] Query tuning help
Dan,

and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )

It is simply not possible to use B-tree indexes on these kinds of text queries. B-trees require you to start at the left side of the field, because B-trees locate records via ordered comparisons. An anywhere-in-the-field text search requires a full-text index.

The reason for the redundant LIKE clause is that first, I only want those incidentids that contain the words 'RED' and 'CORVETTE'. BUT, those two words may exist across multiple records with the same incidentid. Then, I only want to actually work with the rows that contain one of the words. This query will repeat the same logic for however many keywords are entered by the user. I have investigated text searching options and have not found them to be congruous with my application.

Sounds like you either need to restructure your application, restructure your database (so that you're not doing anywhere-in-field searches), or buy 32GB of RAM so that you can cache the whole table.

--
Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] Query tuning help
Russell Smith [EMAIL PROTECTED] writes: On Mon, 9 May 2005 09:20 am, Dan Harris wrote: and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' ) You cannot use an index for %CORVETTE%, or %RED%. Not a btree index anyway. Dan might have some success here with a full-text-indexing package (eg, contrib/tsearch2). regards, tom lane
Re: [PERFORM] Query tuning help
On May 8, 2005, at 6:51 PM, Russell Smith wrote: On Mon, 9 May 2005 09:20 am, Dan Harris wrote: You cannot use an index for %CORVETTE%, or %RED%. There is no way for the index to know if a row had that in the middle without scanning the whole index. So it's much cheaper to do a sequence scan.

While I believe you, I'm confused by this line in my original EXPLAIN ANALYZE:

-> Index Scan using ea1 on ea (cost=0.00..2736.43 rows=42 width=47) (actual time=2.085..2.309 rows=2 loops=473)
      Index Cond: ((ea.incidentid)::text = (outer.incidentid)::text)
      Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text))

Doesn't that mean it was using an index to filter? Along those lines, before I created index 'ea1', the query was much much slower. So, it seemed like creating this index made a difference.

One possible way to make the query faster is to limit based on date, as you will only get about 700 rows. And then don't use subselects, as they are doing full sequence scans. I think this query does what you do above, and I think it will be faster, but I don't know.

I REALLY like this idea! If I could just filter by date first and then sequentially scan through those, it should be very manageable. Hopefully I can keep this goal while still accommodating the requirement listed in my next paragraph.

select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
FROM em
JOIN ea ON (em.incidentid = ea.incidentid
    AND em.entrydate >= '2005-1-1 00:00'
    AND em.entrydate <= '2005-5-9 00:00'
    AND ea.recordtext like '%RED%' AND ea.recordtext like '%CORVETTE%')
JOIN eg ON em.incidentid = eg.incidentid
WHERE (recordtext like '%RED%' or recordtext like '%CORVETTE%' );

I have run this, and while it is very fast, I'm concerned it's not doing what I need. Here's the situation: Due to the format of the systems with which I integrate (I have no control over these formats), we will get these 'recordtext' values one line at a time, accumulating over time.
The only way I can find to make this work is to insert a new record for each line. The problem is that when someone wants to search multiple keywords, they expect these words to be matched across multiple records with a given incident number. For a very simple example:

IncidentID  Date              Recordtext
----------  ----------------  ----------------------------
1           2005-05-01 14:21  blah blah blah RED blah blah
2           2005-05-01 14:23  not what we are looking for
1           2005-05-02 02:05  blah CORVETTE blah blah

So, doing a search with an 'and' condition, e.g. WHERE RECORDTEXT LIKE '%RED%' AND RECORDTEXT LIKE '%CORVETTE%', will not match, because the condition will only be applied to a single row of recordtext at a time, not a whole group with the same incident number. If I were to use tsearch2 for full-text indexing, would I need to create another table that merges all of my recordtext rows into a single 'text' field type? If so, this is where I run into problems, as my logic also needs to match multiple words in their original order. I may also receive additional updates to the previous data. In that case, I need to replace the original record with the latest version of it. If I have already concatenated these rows into a single field, the logic to in-line replace only the old text that has changed is very difficult at best. So, that's the reason I had to do two subqueries in my example. Please tell me if I misunderstood your logic and it really will match given my condition above, but it didn't seem like it would. Thanks again for the quick responses! This list has been a great resource for me.

-Dan
Re: [PERFORM] Query tuning help
Dan,

While I believe you, I'm confused by this line in my original EXPLAIN ANALYZE: -> Index Scan using ea1 on ea (cost=0.00..2736.43 rows=42 width=47) (actual time=2.085..2.309 rows=2 loops=473) Index Cond: ((ea.incidentid)::text = (outer.incidentid)::text) Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text))

The index named is matching based on incidentid -- the join condition. The filter is applied against the table rows, i.e. a scan.

If I were to use tsearch2 for full-text indexing, would I need to create another table that merges all of my recordtext rows into a single 'text' field type?

No. Read the OpenFTS docs, they are fairly clear on how to set up a simple FTS index. (TSearch2 ~~ OpenFTS)

If so, this is where I run into problems, as my logic also needs to match multiple words in their original order.

You do that by doubling up ... that is, use the FTS index to pick all rows that contain RED and CORVETTE, and then check the order. I'll also note that your current query is not checking word order. Example:

WHERE recordtext_fti @@ to_tsquery('default', 'RED CORVETTE')
AND recordtext LIKE '%RED%CORVETTE%'

I'm doing something fairly similar on one of my projects and it works very well. The limitations on TSearch2 indexes are: 1) they are expensive to update, so your data loads would be noticeably slower. 2) they are only fast when cached in RAM (and when cached, are *very* fast). So if you have a variety of other processes that tend to fill up RAM between searches, you may find them less useful. 3) You have to create a materialized index column next to recordtext, which will increase the size of the table.

--
Josh Berkus
Aglio Database Solutions
San Francisco
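A fuller sketch of this doubling-up approach (column and index names hypothetical, tsearch2-era syntax): the materialized tsvector column finds candidate rows cheaply via the index, and the LIKE re-checks word order on the few survivors:

```sql
-- Materialized tsvector column next to recordtext, plus a GiST index.
ALTER TABLE ea ADD COLUMN recordtext_fti tsvector;
UPDATE ea SET recordtext_fti = to_tsvector('default', recordtext);
CREATE INDEX ea_recordtext_fti_idx ON ea USING gist (recordtext_fti);

-- The @@ test is satisfied from the index; the LIKE enforces
-- RED-before-CORVETTE order on the handful of matching rows.
SELECT * FROM ea
WHERE recordtext_fti @@ to_tsquery('default', 'RED & CORVETTE')
  AND recordtext LIKE '%RED%CORVETTE%';
```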
Re: [PERFORM] Query tuning help
Dan Harris [EMAIL PROTECTED] writes: -> Index Scan using ea1 on ea (cost=0.00..2736.43 rows=42 width=47) (actual time=2.085..2.309 rows=2 loops=473) Index Cond: ((ea.incidentid)::text = (outer.incidentid)::text) Filter: (((recordtext)::text ~~ '%RED%'::text) OR ((recordtext)::text ~~ '%CORVETTE%'::text)) Doesn't that mean it was using an index to filter? No. The Index Cond shows it is using the index only for the join condition. A Filter is an additional filter condition that happens to get applied at this plan node --- but it'll be applied to every row the index finds for the index condition. regards, tom lane
Re: [PERFORM] Query tuning help
On May 8, 2005, at 8:06 PM, Josh Berkus wrote: If I were to use tsearch2 for full-text indexing, would I need to create another table that merges all of my recordtext rows into a single 'text' field type? No. Read the OpenFTS docs, they are fairly clear on how to set up a simple FTS index. (TSearch2 ~~ OpenFTS) If so, this is where I run into problems, as my logic also needs to match multiple words in their original order. I have been reading the Tsearch2 docs and either I don't understand something or I'm not communicating my situation clearly enough. It seems that Tsearch2 has a concept of document. And, in everything I am reading, they expect your document to be all contained in a single row. Since my words can be spread across multiple rows, I don't see that Tsearch2 will combine all 'recordtext' row values with the same incidentid into a single vector. Am I overlooking something in the docs? I'm doing something fairly similar on one of my projects and it works very well. I'd be curious what similarities they have? Is it the searching across multiple rows or the order of words? The limitations on TSearch2 indexes are: 1) they are expensive to update, so your data loads would be noticably slower. 2) they are only fast when cached in RAM (and when cached, are *very* fast). So if you have a variety of other processes that tend to fill up RAM between searches, you may find them less useful. 3) You have to create a materialized index column next to recordtext, which will increase the size of the table. Duly noted. If this method can search across rows, I'm willing to accept this overhead for the speed it would add. In the meantime, is there any way I can reach my goal without Tsearch2 by just restructuring my query to narrow down the results by date first, then seq scan for the 'likes'? -Dan
Re: [PERFORM] Query tuning help
On Mon, 9 May 2005 11:49 am, Dan Harris wrote: On May 8, 2005, at 6:51 PM, Russell Smith wrote: [snip]

select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
FROM em
JOIN ea ON (em.incidentid = ea.incidentid
    AND em.entrydate >= '2005-1-1 00:00'
    AND em.entrydate <= '2005-5-9 00:00'
    AND ea.recordtext like '%RED%' AND ea.recordtext like '%CORVETTE%')
JOIN eg ON em.incidentid = eg.incidentid
WHERE (recordtext like '%RED%' or recordtext like '%CORVETTE%' );

I have run this, and while it is very fast, I'm concerned it's not doing what I need.

How fast is very fast?

Here's the situation: Due to the format of the systems with which I integrate (I have no control over these formats), we will get these 'recordtext' values one line at a time, accumulating over time. The only way I can find to make this work is to insert a new record for each line. The problem is that when someone wants to search multiple keywords, they expect these words to be matched across multiple records with a given incident number. For a very simple example:

IncidentID  Date              Recordtext
----------  ----------------  ----------------------------
1           2005-05-01 14:21  blah blah blah RED blah blah
2           2005-05-01 14:23  not what we are looking for
1           2005-05-02 02:05  blah CORVETTE blah blah

So, doing a search with an 'and' condition, e.g. WHERE RECORDTEXT LIKE '%RED%' AND RECORDTEXT LIKE '%CORVETTE%', will not match because the condition will only be applied to a single row of recordtext at a time, not a whole group with the same incident number. If I were to use tsearch2 for full-text indexing, would I need to create another table that merges all of my recordtext rows into a single 'text' field type? If so, this is where I run into problems, as my logic also needs to match multiple words in their original order. I may also receive additional updates to the previous data. In that case, I need to replace the original record with the latest version of it.
If I have already concatenated these rows into a single field, the logic to in-line replace only the old text that has changed is very difficult at best. So, that's the reason I had to do two subqueries in my example. Please tell me if I misunderstood your logic and it really will match given my condition above, but it didn't seem like it would. Thanks again for the quick responses! This list has been a great resource for me.

select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
FROM em
JOIN ea ON (em.incidentid = ea.incidentid
    AND em.entrydate >= '2005-1-1 00:00'
    AND em.entrydate <= '2005-5-9 00:00'
    AND (ea.recordtext like '%RED%' OR ea.recordtext like '%CORVETTE%'))
JOIN eg ON em.incidentid = eg.incidentid
WHERE em.incidentid IN
    (select em.incidentid FROM em
     JOIN ea ON (em.incidentid = ea.incidentid
         AND em.entrydate >= '2005-1-1 00:00'
         AND em.entrydate <= '2005-5-9 00:00'
         AND ea.recordtext like '%CORVETTE%'))
AND em.incidentid IN
    (select em.incidentid FROM em
     JOIN ea ON (em.incidentid = ea.incidentid
         AND em.entrydate >= '2005-1-1 00:00'
         AND em.entrydate <= '2005-5-9 00:00'
         AND ea.recordtext like '%RED%'));

This may be more accurate. However I would call it VERY NASTY. Josh's solutions may be better. However, much of the data should be in memory once the subplans are done, so it may be quite fast.

-Dan
Re: [PERFORM] Query tuning help
On May 8, 2005, at 8:32 PM, Russell Smith wrote:

I have run this, and while it is very fast, I'm concerned it's not doing what I need.

How fast is very fast?

It took 35 seconds to complete versus ~450 my old way.

select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
FROM em
JOIN ea ON (em.incidentid = ea.incidentid
    AND em.entrydate >= '2005-1-1 00:00'
    AND em.entrydate <= '2005-5-9 00:00'
    AND (ea.recordtext like '%RED%' OR ea.recordtext like '%CORVETTE%'))
JOIN eg ON em.incidentid = eg.incidentid
WHERE em.incidentid IN
    (select em.incidentid FROM em
     JOIN ea ON (em.incidentid = ea.incidentid
         AND em.entrydate >= '2005-1-1 00:00'
         AND em.entrydate <= '2005-5-9 00:00'
         AND ea.recordtext like '%CORVETTE%'))
AND em.incidentid IN
    (select em.incidentid FROM em
     JOIN ea ON (em.incidentid = ea.incidentid
         AND em.entrydate >= '2005-1-1 00:00'
         AND em.entrydate <= '2005-5-9 00:00'
         AND ea.recordtext like '%RED%'));

Yes, it is nasty, but so was my previous query :) So long as this is faster, I'm OK with that. I'll see if I can make this work. Thank you very much.

-Dan
Re: [PERFORM] Query tuning help
On Sun, 8 May 2005 20:31:38 -0600, Dan Harris [EMAIL PROTECTED] wrote: Duly noted. If this method can search across rows, I'm willing to accept this overhead for the speed it would add.

You could use intersect to search across rows. Using tsearch2 will look up the RED and CORVETTE using the index, and intersect will pull out the common rows.

In the meantime, is there any way I can reach my goal without Tsearch2 by just restructuring my query to narrow down the results by date first, then seq scan for the 'likes'?

select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
from ea, em, eg,
  ( select ea.incidentid
    from ea, em
    where em.incidentid = ea.incidentid
      and em.entrydate >= '2005-1-1 00:00'
      and em.entrydate <= '2005-5-9 00:00'
      and recordtext like '%RED%'
    intersect
    select ea.incidentid
    from ea, em
    where em.incidentid = ea.incidentid
      and em.entrydate >= '2005-1-1 00:00'
      and em.entrydate <= '2005-5-9 00:00'
      and recordtext like '%CORVETTE%'
  ) as iid
where em.incidentid = ea.incidentid
  and em.incidentid = eg.incidentid
  and em.entrydate >= '2005-1-1 00:00'
  and em.entrydate <= '2005-5-9 00:00'
  and ea.incidentid = iid.incidentid
  and ( recordtext like '%RED%' or recordtext like '%CORVETTE%' )
order by em.entrydate

klint.

--
Klint Gore                          EMail: [EMAIL PROTECTED]
A.B.R.I., University of New England
Armidale NSW 2351 Australia         Fax: +61 2 6772 5376
"Non rhyming slang - the possibilities are useless" - L.J.J.
Re: [PERFORM] Query Tuning
Tom Lane wrote: Pallav Kalva [EMAIL PROTECTED] writes: I have a problem with the below query: when I do explain on it on my live database it doesn't use any of the indexes specified on the tables and does a seq scan on the table, which has 400k records. But if I copy the same table onto a different database on a different machine, it uses all the indexes specified and the query runs much quicker. It looks to me like you've never vacuumed/analyzed the copy, and so you get a different plan there. The fact that that plan is better than the one made with statistics is unhappy-making :-( ... but when you only show us EXPLAIN output rather than EXPLAIN ANALYZE, it's impossible to speculate about why. Also, what PG version is this? regards, tom lane

Thanks! for the quick reply. I can't run EXPLAIN ANALYZE on the live database because it takes a lot of time and holds up a lot of other queries on the table. The postgres version I am using is 7.4. When you say I never vacuumed/analyzed the copy, do you mean the live database or the copy of the live database? I run a vacuum on the database daily on my live system as part of daily maintenance.