Re: [PERFORM] Index Problem?
Ron St-Pierre wrote: I am using postgres 7.4.1 and have a problem with a plpgsql function. When I run the function on the production server it takes approx 33 minutes to run. I dumped the DB and copied it to a similarly configured box and ran the function and it ran in about 10 minutes. Can anyone offer advice on tuning the function or my database? Here are the lengthy, gory details. F u n c t i o n It updates seven columns of a table 1 to 4 times daily. Current data = 42,000 rows, new data = 30,000 rows. CREATE TYPE employeeType AS (empID INTEGER, updateDate DATE, bDate INTEGER, val1 NUMERIC, val2 NUMERIC, val3 NUMERIC, val4 NUMERIC, favNum NUMERIC); CREATE OR REPLACE FUNCTION updateEmployeeData() RETURNS SETOF employeeType AS ' DECLARE rec RECORD; BEGIN FOR rec IN SELECT empID, updateDate, bDate, val1, val2, val3, val4, favNum FROM newData LOOP RETURN NEXT rec; UPDATE currentData SET val1=rec.val1, val2=rec.val2, val3=rec.val2, val4=rec.val4, favNum=rec.favNum, updateDate=rec.updateDate WHERE empID=rec.empID; END LOOP; RETURN; END; ' LANGUAGE 'plpgsql'; Can't you handle this with a simple update query? UPDATE currentData SET val1 = newData.val1, val2 = newData.val2, val3 = newData.val3, val4 = newData.val4, favNum = newData.favNum, updateDate = newData.updateDate FROM newData WHERE newDate.empID = currentData.empID Jochem -- I don't get it immigrants don't work and steal our jobs - Loesje ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
[PERFORM] Index Problem?
I am using postgres 7.4.1 and have a problem with a plpgsql function. When I run the function on the production server it takes approx 33 minutes to run. I dumped the DB and copied it to a similarly configured box and ran the function and it ran in about 10 minutes. Can anyone offer advice on tuning the function or my database? Here are the lengthy, gory details. F u n c t i o n It updates seven columns of a table 1 to 4 times daily. Current data = 42,000 rows, new data = 30,000 rows. CREATE TYPE employeeType AS (empID INTEGER, updateDate DATE, bDate INTEGER, val1 NUMERIC, val2 NUMERIC, val3 NUMERIC, val4 NUMERIC, favNum NUMERIC); CREATE OR REPLACE FUNCTION updateEmployeeData() RETURNS SETOF employeeType AS ' DECLARE rec RECORD; BEGIN FOR rec IN SELECT empID, updateDate, bDate, val1, val2 , val3, val4, favNum FROM newData LOOP RETURN NEXT rec; UPDATE currentData SET val1=rec.val1, val2=rec.val2, val3=rec.val2, val4=rec.val4, favNum=rec.favNum, updateDate=rec.updateDate WHERE empID=rec.empID; END LOOP; RETURN; END; ' LANGUAGE 'plpgsql'; The emp table has 60 columns, all indexed, about two-thirds are numeric, but they are not affected by this update. The other 50+ columns are updated in the middle of the night and the amount of time that update takes isn't a concern. Late last night I dumped the table, dropped it and re-created it from the dump (on the production server - when no one was looking). When I re-ran the function it took almost 11 minutes, which was pretty much in line with my results from the dev server. D e t a i l s v 7.4.1 Debian stable 1 GB ram shared_buffers = 2048 sort_mem = 1024 SHMMAX 36000 (360,000,000) VACUUM FULL ANALYZE is run every night, and I ran it yesterday between running the function and it made no difference in running time. top shows the postmaster using minimal cpu (0-40%) and miniscule memory. vmstat shows a fair amount of IO (bo=1000-4000). Yesterday on the dev server we upgraded to the 2.6 kernel and unfortunately only noticed a small increase in update time (about one minute). So does anyone have any suggestions for me on speeding this up? Is it the index? The function is run daily during the mid afternoon to early evening and really drags the performance of the server down (it also hosts a web site). Thanks Ron ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Index Problem?
Ron, The emp table has 60 columns, all indexed, about two-thirds are numeric, but they are not affected by this update. The other 50+ columns are updated in the middle of the night and the amount of time that update takes isn't a concern. Well, I'd say that you have an application design problem, but that's not what you asked for help with ;-) Late last night I dumped the table, dropped it and re-created it from the dump (on the production server - when no one was looking). When I re-ran the function it took almost 11 minutes, which was pretty much in line with my results from the dev server. Sounds like you need to run a REINDEX on the table -- and after that, dramatically increase your max_fsm_pages, and run lazy VACUUM immediately after the batch update to clean up. -- -Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Index Problem?
Josh Berkus [EMAIL PROTECTED] writes: A better way to set this would be to run VACUUM VERBOSE ANALYZE right after doing one of your update batches, and see how many dead pages are being reclaimed, and then set max_fsm_pages to that # + 50% (or more). Actually, since he's running 7.4, there's an even better way. Do a VACUUM VERBOSE (full-database vacuum --- doesn't matter whether you ANALYZE or not). At the end of the very voluminous output, you'll see something like INFO: free space map: 240 relations, 490 pages stored; 4080 total pages needed DETAIL: Allocated FSM size: 1000 relations + 2 pages = 178 kB shared memory. Here, I would need max_fsm_relations = 240 and max_fsm_pages = 4080 to exactly cover the present freespace needs of my system. I concur with the suggestion to bump that up a good deal, of course, but that gives you a real number to start from. The DETAIL part of the message shows my current settings (which are the defaults) and what the FSM is costing me in shared memory space. If you have multiple active databases, the best approach to getting these numbers is to VACUUM in each one, adding VERBOSE when you do the last one. The FSM report is cluster-wide but you want to be sure the underlying info is up to date for all databases. Also the docs on Varlena show the max_fsm_pages default to be 10,000 but my default was 20,000, looks like that needs updating. I don't think the default has been changed. Anyone? Yes, I kicked it up for 7.4 because FSM covers indexes too now. Both the 7.3 and 7.4 defaults are pretty arbitrary of course... regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Index Problem?
Tom Lane wrote: Josh Berkus [EMAIL PROTECTED] writes: A better way to set this would be to run VACUUM VERBOSE ANALYZE right after doing one of your update batches, and see how many dead pages are being reclaimed, and then set max_fsm_pages to that # + 50% (or more). Actually, since he's running 7.4, there's an even better way. Do a VACUUM VERBOSE (full-database vacuum --- doesn't matter whether you ANALYZE or not). At the end of the very voluminous output, you'll see something like INFO: free space map: 240 relations, 490 pages stored; 4080 total pages needed DETAIL: Allocated FSM size: 1000 relations + 2 pages = 178 kB shared memory. Here, I would need max_fsm_relations = 240 and max_fsm_pages = 4080 to exactly cover the present freespace needs of my system. I concur with the suggestion to bump that up a good deal, of course, but that gives you a real number to start from. The DETAIL part of the message shows my current settings (which are the defaults) and what the FSM is costing me in shared memory space. Okay, after running the function VACUUM VERBOSE is telling me: INFO: free space map: 136 relations, 25014 pages stored; 22608 total pages needed DETAIL: Allocated FSM size: 1000 relations + 2 pages = 178 kB shared memory. My max_fsm_pages was set to 20,000 and I reset it to 40,000 on the dev server and the function ran about 20-30% faster, so I'll try the same on the production server. Thanks for the analysis of the VACUUM info. Ron ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Index problem or function problem?
LIANHE SHAO [EMAIL PROTECTED] writes: PGA= explain select ei.expid, er.geneid, er.sampleid, ei.annotation, si.samplename, ei.title as exp_name, aaa.chip, aaa.sequence_derived_from as accession_number, aaa.gene_symbol, aaa.title as gene_function, er.exprs, er.mas5exprs from expressiondata er, experimentinfo ei, sampleinfo si, affy_array_annotation aaa where exists (select distinct ei.expid from experimentinfo) and ei.annotation = aaa.chip and (lower (aaa.title) like '%mif%' or lower(aaa.sequence_description) like '%mif%') and exists (select distinct ei.annotation from experimentinfo) and ei.expid = er.expid and er.expid = si.expid and er.sampleid = si.sampleid and er.geneid = aaa.probeset_id order by si.sampleid limit 20; What is the purpose of the EXISTS() clauses? They are almost surely not doing what you intended, because AFAICS they are just an extremely expensive means of producing a constant-TRUE result. In exists (select distinct ei.expid from experimentinfo) ei.expid is an outer reference, which will necessarily be the same value over all rows of the sub-select. After computing this same value for every row of experimentinfo, the system performs a DISTINCT operation (sort + unique, not cheap) ... and then all it checks for is whether at least one row was produced, which means the DISTINCT operation was completely unnecessary. The only way the EXISTS could return false is if experimentinfo were empty, but if it were so then the outer FROM would've produced no rows and we'd not have got to WHERE anyway. I'm not sure why you get a worse plan for the simpler variant of the query; it would help to see EXPLAIN ANALYZE rather than EXPLAIN output. But it's not worth trying to improve the performance until you are calculating correct answers, and I suspect the above is not doing what you are after at all. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Index problem
What causes this behaviour? is there any workaround? Suggestions? How many rows are there in the table, and can you post the 'explain analyze' for both queries after doing a 'vacuum verbose analyze [tablename]'? There are about 2500 rows in that table. 1st query explain analyze: Seq Scan on PRIORITY_STATISTICS (cost=0.00..491.44 rows=127 width=12) (actual time=98.58..98.58 rows=0 loops=1) Total runtime: 98.74 msec 2nd query explain analyze: NOTICE: QUERY PLAN: Index Scan using PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id on PRIORITY_STATISTICS (cost=0.00..394.06 rows=102 width=12) (actual time=20.93..20.93 rows=0 loops=1) Total runtime: 21.59 msec Any help? Rigmor Cheers Matt --- Incoming mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.515 / Virus Database: 313 - Release Date: 01.09.2003 --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.515 / Virus Database: 313 - Release Date: 01.09.2003 ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Index problem
There are about 2500 rows in that table. 1st query explain analyze: Seq Scan on PRIORITY_STATISTICS (cost=0.00..491.44 rows=127 width=12) (actual time=98.58..98.58 rows=0 loops=1) Total runtime: 98.74 msec 2nd query explain analyze: NOTICE: QUERY PLAN: Index Scan using PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, [snip] PRIORITY_STATISTICS_reo_id on PRIORITY_STATISTICS (cost=0.00..394.06 rows=102 width=12) (actual time=20.93..20.93 rows=0 loops=1) Total runtime: 21.59 msec With only 2500 rows the planner could be deciding that it's going to have to read every disk block to do an index scan anyway, so it might as well do a sequential scan. If the pages are in fact in the kernel cache then the compute time will dominate, not the IO time, so it ends up looking like a bad plan, but it's probably not really such a bad plan... Is your effective_cache_size set to something sensibly large? You could also try decreasing cpu_index_tuple_cost and cpu_tuple_cost. These will affect all your queries though, so what you gain on one might be lost on another. Matt ---(end of broadcast)--- TIP 8: explain analyze is your friend
[PERFORM] Index problem
Hi, I have a table containing columns: END_DATE timestamptz NOT NULL REO_ID int4 NOT NULL and i am indexed REO_ID coulumn. I have a query: select REO_ID, END_DATE from PRIORITY_STATISTICS where REO_ID IN ('112851' ,'112859' ,'112871' ,'112883' ,'112891' ,'112904' ,'112915' ,'112924' ,'112939' ,'112947' ,'112960' ,'112984' ,'112999' ,'113013' ,'113032' ,'113059' ,'113067' ,'113084' ,'113096' ,'113103' ,'113110' ,'113117' ,'113125' ,'113132' ,'113139' ,'113146' ,'113153' ,'113160' ,'113167' ,'113174' ,'113181' ,'113188' ,'113195' ,'113204' ,'113268' ,'113279' ,'113294' ,'113302' ,'113317' ,'113340' ,'113358' ,'113385' ,'113404' ,'113412' ,'113419' ,'113429' ,'113436' ,'113443' ,'113571' ,'113636' ,'113649' ,'113689' ,'113705' ,'113744' ,'113755' ,'113724' ,'113737' ,'113812' ,'113828' ,'113762' ,'113842' ,'113869' ,'113925' ,'113976' ,'114035' ,'114044' ,'114057' ,'114070' ,'114084' ,'114094' ,'114119' ) and it is _not_ using that index But following query (notice there are less id-s in WHERE clause, but rest is same) select REO_ID, END_DATE from PRIORITY_STATISTICS where REO_ID IN ('112851' ,'112859' ,'112871' ,'112883' ,'112891' ,'112904' ,'112915' ,'112924' ,'112939' ,'112947' ,'112960' ,'112984' ,'112999' ,'113013' ,'113032' ,'113059' ,'113067' ,'113084' ,'113096' ,'113103' ,'113110' ,'113117' ,'113125' ,'113132' ,'113139' ,'113146' ,'113153' ,'113160' ,'113167' ,'113174' ,'113181' ,'113188' ,'113195' ,'113204' ,'113268' ,'113279' ,'113294' ,'113302' ,'113317' ,'113340' ,'113358' ,'113385' ,'113404' ,'113412' ,'113419' ,'113429' ,'113436' ,'113443' ,'113571' ,'113636' ,'113649' ,'113689' ,'113705' ,'113744' ,'113755' ,'113724' ,'113737' ) will _is_ using index: Index Scan using PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id, PRIORITY_STATISTICS_reo_id on PRIORITY_STATISTICS (cost=0.00..394.06 rows=102 width=12) What causes this behaviour? is there any workaround? Suggestions? best, Rigmor Ukuhe --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.515 / Virus Database: 313 - Release Date: 01.09.2003 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] Index problem
Hi, I have a table containing columns: END_DATE timestamptz NOT NULL REO_ID int4 NOT NULL and i am indexed REO_ID coulumn. I have a query: select REO_ID, END_DATE from PRIORITY_STATISTICS where REO_ID IN ('112851' ,'112859' ,'112871' ,'112883' ,'112891' ,'112904' ,'112915' ,'112924' ,'112939' ,'112947' ,'112960' ,'112984' ,'112999' ,'113013' ,'113032' ,'113059' ,'113067' ,'113084' ,'113096' ,'113103' ,'113110' ,'113117' ,'113125' ,'113132' ,'113139' ,'113146' ,'113153' ,'113160' ,'113167' ,'113174' ,'113181' ,'113188' ,'113195' ,'113204' ,'113268' ,'113279' ,'113294' ,'113302' ,'113317' ,'113340' ,'113358' ,'113385' ,'113404' ,'113412' ,'113419' ,'113429' ,'113436' ,'113443' ,'113571' ,'113636' ,'113649' ,'113689' ,'113705' ,'113744' ,'113755' ,'113724' ,'113737' ,'113812' ,'113828' ,'113762' ,'113842' ,'113869' ,'113925' ,'113976' ,'114035' ,'114044' ,'114057' ,'114070' ,'114084' ,'114094' ,'114119' ) and it is _not_ using that index But following query (notice there are less id-s in WHERE clause, but rest is same) select REO_ID, END_DATE from PRIORITY_STATISTICS where REO_ID IN ('112851' ,'112859' ,'112871' ,'112883' ,'112891' ,'112904' ,'112915' ,'112924' ,'112939' ,'112947' ,'112960' ,'112984' ,'112999' ,'113013' ,'113032' ,'113059' ,'113067' ,'113084' ,'113096' ,'113103' ,'113110' ,'113117' ,'113125' ,'113132' ,'113139' ,'113146' ,'113153' ,'113160' ,'113167' ,'113174' ,'113181' ,'113188' ,'113195' ,'113204' ,'113268' ,'113279' ,'113294' ,'113302' ,'113317' ,'113340' ,'113358' ,'113385' ,'113404' ,'113412' ,'113419' ,'113429' ,'113436' ,'113443' ,'113571' ,'113636' ,'113649' ,'113689' ,'113705' ,'113744' ,'113755' ,'113724' ,'113737' ) will _is_ using index: Why not. It's just because the second query is more selective. Probably you don't have too many rows in your table and Postgres thinks it's better (faster) to use sequential scan than index one. Regards, Tomasz Myrta ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]