Re: [PERFORM] Bad performance of SELECT ... where id IN (...)
I have a big performance problem in my SQL select query: select * from event where user_id in (500,499,498, ... ,1,0); What happens if you change the query to select * from event where user_id >= 0 and user_id <= 500; ? :-) -- regards Claus When lenity and cruelty play for a kingdom, the gentler gamester is the soonest winner. Shakespeare -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Bad performance of SELECT ... where id IN (...)
Hi, I have a big performance problem in my SQL select query: select * from event where user_id in (500,499,498,497,496,495,494,493,492,491,490,489,488,487,486,485,484,483,482,481,480,479,478,477,476,475,474,473,472,471,470,469,468,467,466,465,464,463,462,461,460,459,458,457,456,455,454,453,452,451,450,449,448,447,446,445,444,443,442,441,440,439,438,437,436,435,434,433,432,431,430,429,428,427,426,425,424,423,422,421,420,419,418,417,416,415,414,413,412,411,410,409,408,407,406,405,404,403,402,401, 400,399,398,397,396,395,394,393,392,391,390,389,388,387,386,385,384,383,382,381,380,379,378,377,376,375,374,373,372,371,370,369,368,367,366,365,364,363,362,361,360,359,358,357,356,355,354,353,352,351,350,349,348,347,346,345,344,343,342,341,340,339,338,337,336,335,334,333,332,331,330,329,328,327,326,325,324,323,322,321,320,319,318,317,316,315,314,313,312,311,310,309,308,307,306,305,304,303,302,301, 300,299,298,297,296,295,294,293,292,291,290,289,288,287,286,285,284,283,282,281,280,279,278,277,276,275,274,273,272,271,270,269,268,267,266,265,264,263,262,261,260,259,258,257,256,255,254,253,252,251,250,249,248,247,246,245,244,243,242,241,240,239,238,237,236,235,234,233,232,231,230,229,228,227,226,225,224,223,222,221,220,219,218,217,216,215,214,213,212,211,210,209,208,207,206,205,204,203,202,201, 200,199,198,197,196,195,194,193,192,191,190,189,188,187,186,185,184,183,182,181,180,179,178,177,176,175,174,173,172,171,170,169,168,167,166,165,164,163,162,161,160,159,158,157,156,155,154,153,152,151,150,149,148,147,146,145,144,143,142,141,140,139,138,137,136,135,134,133,132,131,130,129,128,127,126,125,124,123,122,121,120,119,118,117,116,115,114,113,112,111,110,109,108,107,106,105,104,103,102,101, 100,99,98,97,96,95,94,93,92,91,90,89,88,87,86,85,84,83,82,81,80,79,78,77,76,75,74,73,72,71,70,69,68,67,66,65,64,63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,47,46,45,44,43,42,41,40,39,38,37,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0); 
The above SELECT always spends 1200ms. The EXPLAIN ANALYZE result of it is:

QUERY PLAN
Bitmap Heap Scan on event (cost=73685.08..5983063.49 rows=662018 width=36) (actual time=24.857..242.826 rows=134289 loops=1)
  Recheck Cond: (user_id = ANY ('{499,498,497, ... ,1,0}'::integer[]))
  ->  Bitmap Index Scan on event_user_id_idx (cost=0.00..71699.03 rows=662018 width=0) (actual time=24.610..24.610 rows=134289 loops=1)
        Index Cond: (user_id = ANY ('{499,498,497, ... ,1,0}'::integer[]))
Re: [PERFORM] PG 8.3 and large shared buffer settings
On Thu, 24 Sep 2009, Dan Sugalski wrote: Is there any practical limit to the number of shared buffers PG 8.3.7 can handle before more becomes counter-productive?

There are actually two distinct questions here you should consider, because the popular wisdom here and what makes sense for your case might be different.

The biggest shared_buffers tests I've seen come from Sun, where Jignesh there saw around 10GB was the largest amount of RAM you could give to the database before it stopped improving performance. As you guessed, there is a certain amount of overhead to managing the buffers involved, and as the size grows the chance you'll run into locking issues or similar resource contention grows too.

Another problem spot is checkpoints. If you dirty a very large buffer cache, that whole thing will have to get dumped to disk eventually, and on some workloads people have found they have to reduce shared_buffers specifically to keep this from being too painful.

That's not answering your question though; what it answers is how large can shared_buffers get before it's counterproductive compared with giving the memory to the OS to manage? The basic design of PostgreSQL presumes that the OS buffer cache exists as a second-chance source for cached buffers. The OS cache tends to be optimized to handle large numbers of buffers well, but without very much memory of what's been used recently to optimize allocations and evictions. The asymmetry there is one reason why shared_buffers shouldn't be most of the RAM on your system; splitting things up so that PG has a cut and the OS has at least as large of its own space lets the two cache management schemes complement each other.

The box runs other things as well as the database, so the OS buffer cache tends to get effectively flushed -- permanently pinning more of the database in memory would be an overall win for DB performance, assuming bad things don't happen because of buffer management.
This means that the question you want an answer to is: if the OS cache isn't really available, where does giving memory to shared_buffers become less efficient than not caching things at all? My guess is that this number is much larger than 10GB, but I don't think anyone has done any tests to try to quantify exactly where it is. Typically when people are talking about systems as large as yours, they're dedicated database servers at that point, so the OS cache gets considered at the same time. If it's effectively out of the picture, the spot where caching still helps even when it's somewhat inefficient due to buffer contention isn't well explored.

It would depend on the app too. If you're heavily balanced toward reads that don't need locks, you can certainly support a larger shared_buffers than someone who is writing a lot (just the checkpoint impact alone makes this true, and there are other sources of locking contention).

-- * Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD
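The sizing advice in this reply might translate into a starting configuration along these lines. This is only an illustrative sketch for a hypothetical large dedicated server, not values recommended anywhere in the thread:

```
# postgresql.conf -- illustrative starting point, assuming a 64GB dedicated box
shared_buffers = 8GB            # below the ~10GB point where gains reportedly flattened
effective_cache_size = 48GB     # planner hint for how much the OS cache can hold
```

effective_cache_size doesn't allocate anything; it only tells the planner how much caching to assume when costing index scans.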
Re: [PERFORM] Bad performance of SELECT ... where id IN (...)
Xia Qingran qingran@gmail.com writes: I have a big performance problem in my SQL select query: select * from event where user_id in (500,499,498, ... ,1,0); The above SELECT always spends 1200ms. Your EXPLAIN ANALYZE shows that the actual runtime is only about 240ms. So either the planning time is about 1000ms, or transmitting and displaying the 134K rows produced by the query takes that long, or some combination of the two. I wouldn't be too surprised if it's the data display that's slow; but if it's the planning time that you're unhappy about, updating to a more recent PG release might possibly help. What version is this anyway? regards, tom lane
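One way to separate the components Tom mentions is to compare psql's \timing output against what EXPLAIN ANALYZE reports; the gap between them is planning plus transmitting/rendering the result set. A sketch, abbreviating the original IN list:

```sql
\timing on
-- Total client-side time: parse + plan + execute + transfer of all 134K rows
SELECT * FROM event WHERE user_id IN (500, 499, 498 /* ... */, 1, 0);

-- Server-side execution only; the difference versus \timing is
-- planning time plus data transfer and display
EXPLAIN ANALYZE
SELECT * FROM event WHERE user_id IN (500, 499, 498 /* ... */, 1, 0);
```

On modern PostgreSQL releases, EXPLAIN ANALYZE additionally prints Planning Time and Execution Time separately, which answers this question directly.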
Re: [PERFORM] PG 8.3 and large shared buffer settings
Is there any practical limit to the number of shared buffers PG 8.3.7 can handle before more becomes counter-productive?

It is more efficient to have the page in shared buffers, rather than doing a context switch to the OS, copying the entire page from the OS's cache into shared buffers, and coming back to postgres. Shared buffers use less CPU. However, this is totally negligible versus the disk wait time of an uncached IO.

The same page may be cached once in shared_buffers, and once in the OS cache, so if your shared buffers is half your RAM, and the other half is disk cache, perhaps it won't be optimal: if stuff is cached twice, you can cache half as much stuff.

If your entire database can fit in shared buffers, good for you though. But then a checkpoint comes, and postgres will write all dirty buffers to disk in the order it finds them in shared buffers, which might be totally different from the on-disk order. If you have enough OS cache left to absorb these writes, the OS will reorder them. If not, lots of random writes are going to occur. On a RAID5 this can be a lot of fun. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Bad performance of SELECT ... where id IN (...)
If you reuse that set a lot, how about storing it in a table, and doing the join on the db side? If it is large, it sometimes makes sense to create a temp table just for a single query (I use that sort of stuff for comparing with a few M records). But temp tables in that case have to be short-lived, as they can't reuse space (no FSM in the temporary table world I'm afraid; I hope it will be fixed at some stage tho).
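A sketch of the temp-table approach described above. The table and column names are hypothetical, and generate_series stands in for however the real id set is produced:

```sql
-- Materialize the id set once, then join against it
CREATE TEMP TABLE wanted_users (user_id integer PRIMARY KEY);
INSERT INTO wanted_users SELECT generate_series(0, 500);
ANALYZE wanted_users;            -- give the planner accurate row estimates

SELECT e.*
FROM event e
JOIN wanted_users w USING (user_id);

DROP TABLE wanted_users;         -- keep temp tables short-lived, per the FSM caveat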
Re: [PERFORM] PG 8.3 and large shared buffer settings
On Fri, 25 Sep 2009, Jeff Janes wrote: Does it do this even if the block was already in shared_buffers?

Usually not. The buffer ring algorithm is used to manage pages that are read in specifically to satisfy a sequential scan (there's a slightly different ring method used for VACUUM too). If the buffer you need is already available and not pinned (locked by someone else), it's not read from disk again. Instead, its usage count is incremented only if it's at zero (this doesn't count as a use unless it's about to be evicted as unused), and it's returned without being added to the ring.

There's a section about this (Buffer Ring Replacement Strategy) in the source code: http://git.postgresql.org/gitweb?p=postgresql.git;a=blob_plain;f=src/backend/storage/buffer/README;hb=HEAD The commit that added the feature is at http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=ebf3d5b66360823edbdf5ac4f9a119506fccd4c0

The basic flow of this code is that backends ask for buffers using BufferAlloc, which then calls StrategyGetBuffer (where the ring list is managed) only if it doesn't first find the page in the buffer cache. You get what you'd hope for here: a sequential scan will use blocks when they're already available in the cache, while reading in less popular blocks that weren't cached into the temporary ring area.

There's always the OS cache backing the PostgreSQL one to handle cases where the working set you're using is just a bit larger than shared_buffers. The ring read requests may very well be satisfied by that too if there was a recent sequential scan the OS is still caching.

You can read a high-level summary of the algorithm used for ring management (with an intro to buffer management in general) in my Inside the PostgreSQL Buffer Cache presentation at http://www.westnet.com/~gsmith/content/postgresql/ on P10, Optimizations for problem areas. That doesn't specifically cover the what if it's in the cache already? case though.
-- * Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD
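If you want to observe usage counts like the ones discussed above, the contrib module pg_buffercache exposes the buffer cache contents; a quick sketch, assuming the module is installed in the database:

```sql
-- Requires the pg_buffercache contrib module to be installed
-- Distribution of usage counts across shared_buffers
SELECT usagecount, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;
```

Buffers pulled in through a ring tend to sit at low usage counts, while frequently hit pages accumulate higher ones.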
Re: [PERFORM] Bad performance of SELECT ... where id IN (...)
Xia Qingran wrote: Hi, I have a big performance problem in my SQL select query: select * from event where user_id in (500,499,498, ... ,1,0); The above SELECT always spends 1200ms.

If your user_id is always in a narrow range like this, or even in any range that is a small fraction of the total, then add a range condition, like this: select * from event where user_id <= 500 and user_id >= 0 and user_id in (...) I did this exact same thing in my application and it worked well. Craig
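Spelled out, Craig's rewrite looks like this (abbreviating the IN list). The idea, as I understand it, is that the range predicates let the planner bracket the index scan to a single contiguous key range, with the IN list applied as an additional filter:

```sql
SELECT *
FROM event
WHERE user_id <= 500
  AND user_id >= 0
  AND user_id IN (500, 499, 498 /* ... */, 1, 0);
```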
Re: [PERFORM] Bad performance of SELECT ... where id IN (...)
On 26-Sep-2009, at 10:16 PM, Claus Guttesen wrote: I have a big performance problem in my SQL select query: select * from event where user_id in (500,499,498, ... ,1,0); What happens if you change the query to select * from event where user_id >= 0 and user_id <= 500; ? :-) -- regards Claus

or select * from event where user_id <= 500; :) Besides, your index seems quite huge (2 GB), and it usually takes some time to process the result, even though it's already indexed with a btree.
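The index-size point above can be checked directly with PostgreSQL's size functions; a sketch using the index name from the earlier EXPLAIN output:

```sql
SELECT pg_size_pretty(pg_relation_size('event_user_id_idx')) AS index_size,
       pg_size_pretty(pg_relation_size('event'))             AS table_size;
```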
Re: [PERFORM] PG 8.3 and large shared buffer settings
On Sat, Sep 26, 2009 at 9:57 AM, Gerhard Wiesinger li...@wiesinger.com wrote: On Sat, 26 Sep 2009, Greg Smith wrote: On Fri, 25 Sep 2009, Jeff Janes wrote: Does it do this even if the block was already in shared_buffers? Usually not. The buffer ring algorithm is used to manage pages that are read in specifically to satisfy a sequential scan (there's a slightly different ring method used for VACUUM too). If the buffer you need is already available and not pinned (locked by someone else), it's not read from disk again. Instead, its usage count is incremented only if it's at zero (this doesn't count as a use unless it's about to be evicted as unused), and it's returned without being added to the ring.

Hello Greg, What happens when a postmaster dies (e.g. core dump, kill -9, etc.)? How is reference counting cleaned up and the lock removed?

If a backend dies in disgrace, the master detects this and the whole cluster is taken down and brought back up. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command.

(The DETAIL is technically accurate, but somewhat misleading. If the crash of another backend happens while your backend is waiting on the commit record WAL fsync to return, then while the postmaster may have commanded your session to roll back, it is too late to actually do so, and when the server comes back up and finishes recovery, you will probably find that your transaction has indeed committed, assuming you have some way to accurately deduce this.)

Jeff
Re: [PERFORM] PG 8.3 and large shared buffer settings
On Sat, Sep 26, 2009 at 8:19 AM, Greg Smith gsm...@gregsmith.com wrote: Another problem spot is checkpoints. If you dirty a very large buffer cache, that whole thing will have to get dumped to disk eventually, and on some workloads people have found they have to reduce shared_buffers specifically to keep this from being too painful.

Hi Greg, Is this the case even if checkpoint_completion_target is set close to 1.0? If you dirty buffers fast enough to dirty most of a huge shared_buffers area between checkpoints, then it seems like lowering the shared_buffers wouldn't reduce the amount of I/O needed, it would just shift the I/O from checkpoints to the backends themselves. It looks like checkpoint_completion_target was introduced in 8.3.0. Cheers, Jeff
Re: [PERFORM] PG 8.3 and large shared buffer settings
On Sat, 26 Sep 2009, Jeff Janes wrote: On Sat, Sep 26, 2009 at 8:19 AM, Greg Smith gsm...@gregsmith.com wrote: Another problem spot is checkpoints. If you dirty a very large buffer cache, that whole thing will have to get dumped to disk eventually, and on some workloads people have found they have to reduce shared_buffers specifically to keep this from being too painful. Is this the case even if checkpoint_completion_target is set close to 1.0?

Sure. checkpoint_completion_target aims to utilize more of the space between each checkpoint by spreading the writes out over more of that space, but it alone doesn't change the fact that checkpoints are only so long. By default, you're going to get one every five minutes, and on active systems they can come every few seconds if you're not aggressive with increasing checkpoint_segments.

Some quick math gives an idea of the scale of the problem. A single cheap disk can write random I/O (which checkpoint writes often are) at 1-2MB/s; let's call it 100MB/minute. That means that in 5 minutes, a single disk system might be hard pressed to write even 500MB of data out. But you can easily dirty 500MB in seconds nowadays. Now imagine shared_buffers is 40GB and you've dirtied a bunch of it; how long will that take to clear even on a fast RAID system? It won't be quick, and the whole system will grind to a halt at the end of the checkpoint as all the buffered writes queued up are forced out.

If you dirty buffers fast enough to dirty most of a huge shared_buffers area between checkpoints, then it seems like lowering the shared_buffers wouldn't reduce the amount of I/O needed, it would just shift the I/O from checkpoints to the backends themselves.

What's even worse is that backends can be writing data and filling the OS buffer cache in between checkpoints too, but all of that is forced to complete before the checkpoint can finish too.
You can easily start the checkpoint process with the whole OS cache filled with backend writes that will slow checkpoint ones if you're not careful. Because disks are slow, you need to get things headed to disk as soon as feasible, so the OS has more time to work on them, reorder for efficient writing, etc. Ultimately, the sooner you get I/O to the OS cache to write, the better, *unless* you're going to write that same block over again before it must go to disk. Normally you want buffers that aren't accessed often to get written out to disk early rather than linger until checkpoint time; there's nothing wrong with a backend doing a write if that block wasn't going to be used again soon. The ideal setup from a latency perspective is that you size shared_buffers just large enough to hold the things you write to regularly, but not so big that it caches every write.

It looks like checkpoint_completion_target was introduced in 8.3.0

Correct. Before then, you had no hope for reducing checkpoint overhead but to use very small settings for shared_buffers, particularly if you cranked the old background writer up so that it wrote lots of redundant information too (that was the main result of tuning it on versions before 8.3 as well).

-- * Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD
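For the 8.3-era settings discussed in this subthread, spreading checkpoints out looks roughly like the sketch below. The values are illustrative assumptions only, not tuned for any particular workload:

```
# postgresql.conf -- checkpoint spreading on 8.3, illustrative values
checkpoint_segments = 64              # fewer, larger checkpoints on busy systems
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9    # spread writes over ~90% of the interval
```

The trade-off is recovery time: longer intervals between checkpoints mean more WAL to replay after a crash.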