Re: [PERFORM] Long Running Update - My Solution
Hello Kevin

> If you use EXPLAIN with both statements...

Yes, the plans are indeed very different. Here is the statement, set to
update up to 100,000 records, which took about 5 seconds to complete:

UPDATE
  table_A
SET
  field_1 = table_B.field_1
, field_2 = table_B.field_2
FROM
  table_B
WHERE
  table_B.tb_id >= 0
AND
  table_B.tb_id <= 100000
AND
  table_B.tb_id = table_A.ta_id
;

The query plan for the above is:

Nested Loop  (cost=0.00..2127044.47 rows=73620 width=63)
  ->  Index Scan using table_B_pkey on table_B  (cost=0.00..151830.75 rows=73620 width=20)
        Index Cond: ((tb_id >= 0) AND (tb_id <= 100000))
  ->  Index Scan using table_A_pkey on table_A  (cost=0.00..26.82 rows=1 width=47)
        Index Cond: (table_A.ta_id = table_B.tb_id)

Now, if I change the first AND clause to update 1M records, as follows:

  table_B.tb_id <= 1000000

I get the following - quite different - query plan:

Hash Join  (cost=537057.49..8041177.88 rows=852150 width=63)
  Hash Cond: (table_A.ta_id = table_B.tb_id)
  ->  Seq Scan on table_A  (cost=0.00..3294347.71 rows=145561171 width=47)
  ->  Hash  (cost=521411.62..521411.62 rows=852150 width=20)
        ->  Bitmap Heap Scan on table_B  (cost=22454.78..521411.62 rows=852150 width=20)
              Recheck Cond: ((tb_id >= 0) AND (tb_id <= 1000000))
              ->  Bitmap Index Scan on table_B_pkey  (cost=0.00..22241.74 rows=852150 width=0)
                    Index Cond: ((tb_id >= 0) AND (tb_id <= 1000000))

Note: when I tried updating 1M records, the command was still running
after 25 minutes before I killed it.

The sequential scan in the later plan looks expensive, and (I think)
supports what others have since mentioned, namely that when the optimizer
moves to using sequential scans (working off the disk) things get a lot
slower.

For me, the penny has finally dropped on why I should use EXPLAIN for
bulk operations.

Thanks too, to Greg Smith, Robert Klemme and Thomas for all the feedback.

Kind regards

Harry Mantheakis
London, UK
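For anyone wanting to repeat the comparison, a minimal sketch of probing
for the break-even point with EXPLAIN alone, so no rows are touched; the
table and column names are the ones above, and the 500000 bound is just an
illustrative value to vary:

-- Sketch only: re-check the plan at a few different range sizes to see
-- where the planner abandons the nested loop in favour of the hash join.
EXPLAIN
UPDATE
  table_A
SET
  field_1 = table_B.field_1
, field_2 = table_B.field_2
FROM
  table_B
WHERE
  table_B.tb_id >= 0
AND
  table_B.tb_id <= 500000   -- try 100000, 500000, 1000000 ...
AND
  table_B.tb_id = table_A.ta_id
;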
Re: [PERFORM] Long Running Update - My Solution
I am glad to report that the 'salami-slice' approach worked nicely - all
done in about 2.5 hours.

Instead of using an all-in-one-go statement, we executed 800 statements,
each updating 100,000 records. On average it took about 10 seconds for
each statement to return. This is a think-outside-the-box solution, which
others might not be able to emulate.

The mystery remains, for me: why updating 100,000 records could complete
in as quickly as 5 seconds, whereas an attempt to update a million records
was still running after 25 minutes before we killed it?

One thing remains crystal clear: I love Postgresql :-)

Kind regards

Harry Mantheakis
London, UK

On 23/06/2011 16:05, Harry Mantheakis wrote:
> Hello
>
> I am attempting to run an update statement that copies two fields from
> one table to another:
>
> UPDATE
>   table_A
> SET
>   ( field_1, field_2 ) = ( table_B.field_1, table_B.field_2 )
> FROM
>   table_B
> WHERE
>   table_B.id = table_A.id
> ;
>
> Table table_B contains almost 75 million records, with IDs that match
> those in table_A.
>
> Both field_1 and field_2 are DOUBLE PRECISION. The ID fields are SERIAL
> primary-key integers in both tables.
>
> I tested (the logic of) this statement with a very small sample, and it
> worked correctly.
>
> The database runs on a dedicated Debian server in our office. I called
> both VACUUM and ANALYZE on the database before invoking this statement.
>
> The statement has been running for 18+ hours so far. TOP, FREE and
> VMSTAT utilities indicate that only about half of the 6GB of memory is
> being used, so I have no reason to believe that the server is struggling.
>
> My question is: can I reasonably expect a statement like this to
> complete with such a large data-set, even if it takes several days?
> We do not mind waiting, but obviously we do not want to wait
> unnecessarily.
>
> Many thanks.
>
> Harry Mantheakis
> London, UK
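A rough sketch of how the 800 slice statements might be generated; the ID
range of 0 to roughly 80 million and the table/column names are assumptions
based on the figures above. The output would be saved to a file and fed
back through psql in autocommit mode, so each slice runs and commits on
its own:

-- Sketch only: emit one UPDATE per 100,000-wide ID slice.
SELECT 'UPDATE table_A SET field_1 = table_B.field_1, field_2 = table_B.field_2'
    || ' FROM table_B'
    || ' WHERE table_B.id >= ' || n
    || ' AND table_B.id < '    || (n + 100000)
    || ' AND table_B.id = table_A.id;'
FROM generate_series(0, 79900000, 100000) AS s(n);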
Re: [PERFORM] Long Running Update - My Solution
Harry Mantheakis <harry.manthea...@riskcontrollimited.com> wrote:

> I am glad to report that the 'salami-slice' approach worked nicely
> - all done in about 2.5 hours.

Glad to hear it!

> The mystery remains, for me: why updating 100,000 records could
> complete in as quickly as 5 seconds, whereas an attempt to update a
> million records was still running after 25 minutes before we killed
> it?

If you use EXPLAIN with both statements (without ANALYZE, since you
probably don't want to trigger an actual *run* of the statement), is
there a completely different plan for the range covering each? If so, a
severe jump like that might mean that your costing parameters could use
some adjustment, so that it switches from one plan to the other closer
to the actual break-even point.

> One thing remains crystal clear: I love Postgresql :-)

:-)

-Kevin
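For what it's worth, a hedged sketch of the kind of experiment described
here; the parameter values are purely illustrative, not recommendations,
and the table and column names are the ones from the earlier statement:

-- Sketch only: see whether cheaper random I/O and a larger assumed cache
-- move the nested-loop-to-hash-join switch closer to the real break-even
-- point.  Set per session, then re-check the plan.
SET random_page_cost = 2.0;
SET effective_cache_size = '4GB';

EXPLAIN
UPDATE table_A
SET    field_1 = table_B.field_1, field_2 = table_B.field_2
FROM   table_B
WHERE  table_B.tb_id >= 0
AND    table_B.tb_id <= 1000000
AND    table_B.tb_id = table_A.ta_id;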
Re: [PERFORM] Long Running Update - My Solution
> The mystery remains, for me: why updating 100,000 records could
> complete in as quickly as 5 seconds, whereas an attempt to update a
> million records was still running after 25 minutes before we killed
> it?

Hi,

there's a lot of possible causes. Usually this is caused by a plan
change - imagine for example that you need to sort a table and the
amount of data just fits into work_mem, so that it can be sorted in
memory. If you need to perform the same query with 10x the data, you'll
have to sort the data on disk. Which is way slower, of course. And there
are other such problems ...

> One thing remains crystal clear: I love Postgresql :-)

regards
Tomas
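To make the work_mem point concrete, a small sketch - on a SELECT, so
nothing is modified; the table name and ORDER BY column are assumptions -
showing how the same sort flips from memory to disk as the data outgrows
work_mem:

-- Sketch only: EXPLAIN ANALYZE reports "Sort Method: quicksort  Memory: ..."
-- when the sort fits in work_mem, and "external merge  Disk: ..." when it
-- spills - the same kind of cliff can hit a large hash join or update.
SET work_mem = '4MB';
EXPLAIN ANALYZE SELECT * FROM table_B WHERE tb_id <= 1000000 ORDER BY field_1;

SET work_mem = '256MB';
EXPLAIN ANALYZE SELECT * FROM table_B WHERE tb_id <= 1000000 ORDER BY field_1;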
Re: [PERFORM] Long Running Update - My Solution
On Mon, Jun 27, 2011 at 5:37 PM, <t...@fuzzy.cz> wrote:
>> The mystery remains, for me: why updating 100,000 records could
>> complete in as quickly as 5 seconds, whereas an attempt to update a
>> million records was still running after 25 minutes before we killed
>> it?
>
> Hi,
>
> there's a lot of possible causes. Usually this is caused by a plan
> change - imagine for example that you need to sort a table and the
> amount of data just fits into work_mem, so that it can be sorted in
> memory. If you need to perform the same query with 10x the data,
> you'll have to sort the data on disk. Which is way slower, of course.
> And there are other such problems ...

I would rather assume it is one of the other problems, typically related
to handling the TX (e.g. checkpoints, WAL, creating copies of modified
records and adjusting indexes...).

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
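A hedged sketch of one way to check whether checkpoint/WAL pressure is part
of it, using pg_stat_bgwriter and checkpoint_segments as they existed in the
8.3-9.1 era; the queries only inspect, they change nothing:

-- Sketch only: a jump in checkpoints_req while the bulk UPDATE runs means
-- the WAL volume is forcing extra checkpoints; raising checkpoint_segments
-- and checkpoint_completion_target spreads that work out.
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_backend
FROM   pg_stat_bgwriter;

SHOW checkpoint_segments;
SHOW checkpoint_completion_target;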
Re: [PERFORM] Long Running Update - My Solution
Harry Mantheakis wrote:

> The mystery remains, for me: why updating 100,000 records could
> complete in as quickly as 5 seconds, whereas an attempt to update a
> million records was still running after 25 minutes before we killed
> it?

The way you were doing this originally, it was joining every record in
table A against every record in table B, finding the matches (note the
sequential scans on each in the query plan you showed). Having A * B
possible matches there was using up a bunch of resources to line those
two up for an efficient join, and it's possible that parts of that
required spilling working data over to disk and other expensive
operations. And you were guaranteeing that every row in each table was
being processed in some way.

Now, when you only took a small slice of A instead, and a small slice of
B to match, this was likely using an index and working with a lot less
rows in total--only ones in B that mattered were considered, not every
one in B. And each processing slice was working on less rows, making it
more likely to fit in memory, and thus avoiding both slow spill-to-disk
operation and work that was less likely to fit into the system cache. I
don't know exactly how much of each of these two components went into
your large run-time difference, but I suspect both were involved.

The way the optimizer switches to using a sequential scan when doing
bulk operations is often the right move. But if it happens in a way that
causes the set of data to be processed to become much larger than RAM,
it can be a bad decision. The performance drop when things stop fitting
in memory is not a slow one, it's like a giant cliff you fall off.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
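As a rough way to gauge where that memory cliff sits for these tables, one
could compare their on-disk sizes with the 6GB of RAM mentioned earlier;
the lower-case table names below are assumptions about how the identifiers
are stored:

-- Sketch only: heap and heap-plus-index sizes; once the working set of an
-- update exceeds what the 6GB box can cache, the drop off the cliff begins.
SELECT relname,
       pg_size_pretty(pg_relation_size(oid))       AS heap_size,
       pg_size_pretty(pg_total_relation_size(oid)) AS with_indexes
FROM   pg_class
WHERE  relname IN ('table_a', 'table_b');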