Re: [PERFORM] Long Running Update - My Solution

2011-06-28 Thread Harry Mantheakis

Hello Kevin

> If you use EXPLAIN with both statements...

Yes, the plans are indeed very different.

Here is the statement, set to update up to 100,000 records, which took 
about 5 seconds to complete:



UPDATE
  table_A
SET
  field_1 = table_B.field_1
, field_2 = table_B.field_2
FROM
  table_B
WHERE
  table_B.tb_id >= 0
AND
  table_B.tb_id <= 100000
AND
  table_B.tb_id = table_A.ta_id
;


The query plan for the above is:


Nested Loop  (cost=0.00..2127044.47 rows=73620 width=63)
  ->  Index Scan using table_B_pkey on table_B  (cost=0.00..151830.75 rows=73620 width=20)
        Index Cond: ((tb_id >= 0) AND (tb_id <= 100000))
  ->  Index Scan using table_A_pkey on table_A  (cost=0.00..26.82 rows=1 width=47)
        Index Cond: (table_A.ta_id = table_B.tb_id)


Now, if I change the first AND clause to update 1M records, as follows:


table_B.tb_id <= 1000000


I get the following - quite different - query plan:


Hash Join  (cost=537057.49..8041177.88 rows=852150 width=63)
  Hash Cond: (table_A.ta_id = table_B.tb_id)
  ->  Seq Scan on table_A  (cost=0.00..3294347.71 rows=145561171 width=47)
  ->  Hash  (cost=521411.62..521411.62 rows=852150 width=20)
        ->  Bitmap Heap Scan on table_B  (cost=22454.78..521411.62 rows=852150 width=20)
              Recheck Cond: ((tb_id >= 0) AND (tb_id <= 1000000))
              ->  Bitmap Index Scan on table_B_pkey  (cost=0.00..22241.74 rows=852150 width=0)
                    Index Cond: ((tb_id >= 0) AND (tb_id <= 1000000))


Note: When I tried updating 1M records, the command was still running 
after 25 minutes before I killed it.


The sequential scan in the latter plan looks expensive, and (I think) 
supports what others have since mentioned, namely that when the 
optimizer moves to using sequential scans (working off the disk) things 
get a lot slower.


For me, the penny has finally dropped on why I should use EXPLAIN for 
bulk operations.


Thanks too to Greg Smith, Robert Klemme and Tomas for all the feedback.

Kind regards

Harry Mantheakis
London, UK






Re: [PERFORM] Long Running Update - My Solution

2011-06-27 Thread Harry Mantheakis
I am glad to report that the 'salami-slice' approach worked nicely - all 
done in about 2.5 hours.


Instead of using an all-in-one-go statement, we executed 800 statements, 
each updating 100,000 records. On average it took about 10 seconds for 
each statement to return.
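
For anyone who wants to reproduce the approach, here is a rough sketch of 
how the batches could be scripted (illustrative only, using the same 
made-up table and column names as before; we actually issued 800 separate 
statements, whereas a DO block - PostgreSQL 9.0 or later - runs all the 
batches inside a single transaction):

DO $$
DECLARE
    batch_size integer := 100000;   -- rows per 'salami slice'
    max_id     integer;
    lower_id   integer := 0;
BEGIN
    SELECT max(tb_id) INTO max_id FROM table_B;
    WHILE lower_id <= max_id LOOP
        UPDATE table_A
        SET    field_1 = table_B.field_1
             , field_2 = table_B.field_2
        FROM   table_B
        WHERE  table_B.tb_id >= lower_id
        AND    table_B.tb_id <  lower_id + batch_size
        AND    table_B.tb_id =  table_A.ta_id;
        lower_id := lower_id + batch_size;
    END LOOP;
END $$;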


This is a "thinking out of the box" solution, which others might not be 
able to emulate.


The mystery remains, for me: why updating 100,000 records could complete 
in as little as 5 seconds, whereas an attempt to update a million 
records was still running after 25 minutes before we killed it?


One thing remains crystal clear: I love PostgreSQL :-)

Kind regards

Harry Mantheakis
London, UK


On 23/06/2011 16:05, Harry Mantheakis wrote:

Hello

I am attempting to run an update statement that copies two fields from 
one table to another:



UPDATE
  table_A
SET
(
  field_1
, field_2
) = (
table_B.field_1
, table_B.field_2
)
FROM
table_B
WHERE
table_B.id = table_A.id
;


Table table_B contains almost 75 million records, with IDs that 
match those in table_A.


Both field_1 and field_2 are DOUBLE PRECISION. The ID fields are 
SERIAL primary-key integers in both tables.


I tested (the logic of) this statement with a very small sample, and 
it worked correctly.


The database runs on a dedicated Debian server in our office.

I called both VACUUM and ANALYZE on the database before invoking this 
statement.


The statement has been running for 18+ hours so far.

TOP, FREE and VMSTAT utilities indicate that only about half of the 
6GB of memory is being used, so I have no reason to believe that the 
server is struggling.


My question is: can I reasonably expect a statement like this to 
complete with such a large data-set, even if it takes several days?


We do not mind waiting, but obviously we do not want to wait 
unnecessarily.


Many thanks.

Harry Mantheakis
London, UK







Re: [PERFORM] Long Running Update - My Solution

2011-06-27 Thread Kevin Grittner
Harry Mantheakis <harry.manthea...@riskcontrollimited.com> wrote:
 
> I am glad to report that the 'salami-slice' approach worked nicely
> - all done in about 2.5 hours.
 
Glad to hear it!
 
> The mystery remains, for me: why updating 100,000 records could
> complete in as little as 5 seconds, whereas an attempt to update
> a million records was still running after 25 minutes before we
> killed it?
 
If you use EXPLAIN with both statements (without ANALYZE, since you
probably don't want to trigger an actual *run* of the statement), is
there a completely different plan for the range covering each?  If
so, a severe jump like that might mean that your costing parameters
could use some adjustment, so that it switches from one plan to the
other closer to the actual break-even point.
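
For example - the settings and values below are only placeholders to be
tested against your own hardware, not recommendations:

-- Session-level experiment only; values are placeholders:
SET effective_cache_size = '4GB';  -- roughly what the OS is really caching
SET random_page_cost = 2.0;        -- lower it if most reads come from cache

-- EXPLAIN (without ANALYZE) shows the chosen plan without running the update:
EXPLAIN
UPDATE table_A
SET    field_1 = table_B.field_1
     , field_2 = table_B.field_2
FROM   table_B
WHERE  table_B.tb_id >= 0
AND    table_B.tb_id <= 1000000
AND    table_B.tb_id = table_A.ta_id;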
 
> One thing remains crystal clear: I love PostgreSQL :-)
 
:-)
 
-Kevin



Re: [PERFORM] Long Running Update - My Solution

2011-06-27 Thread tv
> The mystery remains, for me: why updating 100,000 records could complete
> in as little as 5 seconds, whereas an attempt to update a million
> records was still running after 25 minutes before we killed it?

Hi, there are a lot of possible causes. Usually this is caused by a plan
change - imagine for example that you need to sort a table and the amount
of data just fits into work_mem, so that it can be sorted in memory. If
you need to perform the same query with 10x the data, you'll have to sort
the data on disk. Which is way slower, of course.
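
You can see that effect for yourself with a throw-away table (the numbers
here are made up, just for illustration):

CREATE TEMP TABLE sort_test AS
    SELECT g AS id, random() AS val
    FROM generate_series(1, 1000000) g;

SET work_mem = '1MB';
EXPLAIN ANALYZE SELECT * FROM sort_test ORDER BY val;
-- reports something like "Sort Method: external merge  Disk: ..."

SET work_mem = '256MB';
EXPLAIN ANALYZE SELECT * FROM sort_test ORDER BY val;
-- now it should report "Sort Method: quicksort  Memory: ..."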

And there are other such problems ...

> One thing remains crystal clear: I love PostgreSQL :-)

regards
Tomas




Re: [PERFORM] Long Running Update - My Solution

2011-06-27 Thread Robert Klemme
On Mon, Jun 27, 2011 at 5:37 PM, <t...@fuzzy.cz> wrote:
>> The mystery remains, for me: why updating 100,000 records could complete
>> in as little as 5 seconds, whereas an attempt to update a million
>> records was still running after 25 minutes before we killed it?
>
> Hi, there are a lot of possible causes. Usually this is caused by a plan
> change - imagine for example that you need to sort a table and the amount
> of data just fits into work_mem, so that it can be sorted in memory. If
> you need to perform the same query with 10x the data, you'll have to sort
> the data on disk. Which is way slower, of course.
>
> And there are other such problems ...

I would rather assume it is one of the other problems, typically
related to handling the TX (e.g. checkpoints, WAL, creating copies of
modified records and adjusting indexes...).
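
A rough way to check whether checkpoint/WAL activity is part of the
picture (a standard statistics view; interpreting the numbers is of
course a separate exercise):

SELECT checkpoints_timed,
       checkpoints_req,        -- many 'requested' checkpoints => heavy WAL traffic
       buffers_checkpoint,
       buffers_backend
FROM   pg_stat_bgwriter;
-- Enabling log_checkpoints in postgresql.conf also shows how long each
-- checkpoint takes while the big update is running.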

Kind regards

robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



Re: [PERFORM] Long Running Update - My Solution

2011-06-27 Thread Greg Smith

Harry Mantheakis wrote:
> The mystery remains, for me: why updating 100,000 records could
> complete in as little as 5 seconds, whereas an attempt to update a
> million records was still running after 25 minutes before we killed it?


The way you were doing this originally, it was joining every record in 
table A against every record in table B, finding the matches (note the 
sequential scans on each in the query plan you showed).  Having A * B 
possible matches there was using up a bunch of resources to line those 
two up for an efficient join, and it's possible that parts of that 
required spilling working data over to disk and other expensive 
operations.  And you were guaranteeing that every row in each table was 
being processed in some way.


Now, when you only took a small slice of A instead, and a small slice of 
B to match, this was likely using an index and working with a lot fewer 
rows in total--only the ones in B that mattered were considered, not every 
one in B.  And each processing slice was working on fewer rows, making it 
more likely to fit in memory, and thus avoiding both slow spill-to-disk 
operations and work that was unlikely to fit into the system cache.


I don't know exactly how much of each of these two components went into 
your large run-time difference, but I suspect both were involved.  The 
way the optimizer switches to using a sequential scan when doing bulk 
operations is often the right move.  But if it happens in a way that 
causes the set of data to be processed to become much larger than RAM, 
it can be a bad decision.  The performance drop when things stop fitting 
in memory is not a gradual one; it's like a giant cliff you fall off.
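
If you want to sanity-check the planner's choice, one diagnostic (not 
something to leave enabled) is to compare the estimated cost of the 
alternative plan; the statement below is just a sketch using the table 
names from earlier in the thread:

BEGIN;
SET LOCAL enable_seqscan = off;   -- diagnostic only, scoped to this transaction
EXPLAIN                           -- no ANALYZE, so nothing is actually updated
UPDATE table_A
SET    field_1 = table_B.field_1
     , field_2 = table_B.field_2
FROM   table_B
WHERE  table_B.tb_id >= 0
AND    table_B.tb_id <= 1000000
AND    table_B.tb_id = table_A.ta_id;
ROLLBACK;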


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

