Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-04-02 Thread Bruce Momjian
test version, but I am putting in the queue so we can track it there. Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it.

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-26 Thread Bruce Momjian
Simon, is this patch ready to be added to the patch queue? I assume not. --- Simon Riggs wrote: On Mon, 2007-03-12 at 09:14 +, Simon Riggs wrote: On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote: With

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-13 Thread Simon Riggs
On Mon, 2007-03-12 at 22:16 -0700, Luke Lonergan wrote: You may know we've built something similar and have seen similar gains. Cool We're planning a modification that I think you should consider: when there is a sequential scan of a table larger than the size of shared_buffers, we are

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-13 Thread Simon Riggs
On Tue, 2007-03-13 at 13:40 +0900, ITAGAKI Takahiro wrote: Simon Riggs [EMAIL PROTECTED] wrote: With the default value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool, just like existing sequential scans. Is this intended? New test version

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-13 Thread Luke Lonergan
Simon, On 3/13/07 2:37 AM, Simon Riggs [EMAIL PROTECTED] wrote: We're planning a modification that I think you should consider: when there is a sequential scan of a table larger than the size of shared_buffers, we are allowing the scan to write through the shared_buffers cache. Write? For

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread ITAGAKI Takahiro
Simon Riggs [EMAIL PROTECTED] wrote: I've implemented buffer recycling, as previously described, patch being posted now to -patches as scan_recycle_buffers. - for VACUUMs of any size, with the objective of reducing WAL thrashing whilst keeping VACUUM's behaviour of not spoiling the buffer

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Simon Riggs
On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote: Simon Riggs [EMAIL PROTECTED] wrote: I've implemented buffer recycling, as previously described, patch being posted now to -patches as scan_recycle_buffers. - for VACUUMs of any size, with the objective of reducing WAL

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Simon Riggs
On Mon, 2007-03-12 at 09:14 +, Simon Riggs wrote: On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote: With the default value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool, just like existing sequential scans. Is this intended? Yes, but it's not

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes: I tested your patch with VACUUM FREEZE. The performance was improved when I set scan_recycle_buffers 32. I used VACUUM FREEZE to increase WAL traffic, but this patch should be useful for normal VACUUMs with background jobs! Proving that you can see a

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Simon Riggs
On Mon, 2007-03-12 at 10:30 -0400, Tom Lane wrote: ITAGAKI Takahiro [EMAIL PROTECTED] writes: I tested your patch with VACUUM FREEZE. The performance was improved when I set scan_recycle_buffers 32. I used VACUUM FREEZE to increase WAL traffic, but this patch should be useful for normal

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread ITAGAKI Takahiro
Simon Riggs [EMAIL PROTECTED] wrote: With the default value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool, just like existing sequential scans. Is this intended? New test version enclosed, where scan_recycle_buffers = 0 doesn't change existing VACUUM

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-12 Thread Luke Lonergan
Simon, You may know we've built something similar and have seen similar gains. We're planning a modification that I think you should consider: when there is a sequential scan of a table larger than the size of shared_buffers, we are allowing the scan to write through the shared_buffers cache.

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-09 Thread Luke Lonergan
On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote: Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns and did not see the substantial gain

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-08 Thread Sherry Moore
Hi Simon, and what you haven't said - all of this is orthogonal to the issue of buffer cache spoiling in PostgreSQL itself. That issue does still exist as a non-OS issue, but we've been discussing in detail the specific case of L2 cache effects with specific kernel calls. All of the test

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-07 Thread Marko Kreen
On 3/7/07, Hannu Krosing [EMAIL PROTECTED] wrote: Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix systems ? Supposedly you could mmap() a file and then do mincore() on the area to see which pages are cached. But you were talking about postgres cache before, there it

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Simon Riggs
On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote: Simon Riggs wrote: But it would break the idea of letting a second seqscan follow in the first's hot cache trail, no? No, but it would make it somewhat harder to achieve without direct synchronization between scans. It could still

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Sherry Moore
Hi Tom, Sorry about the delay. I have been away from computers all day. In the current Solaris release in development (Code name Nevada, available for download at http://opensolaris.org), I have implemented non-temporal access (NTA) which bypasses L2 for most writes, and reads larger than

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes: If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just at the handwaving stage at this point, and I'm not even

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote: Jeff Davis [EMAIL PROTECTED] writes: If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Heikki Linnakangas
Jeff Davis wrote: On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Heikki Linnakangas
Tom Lane wrote: Jeff Davis [EMAIL PROTECTED] writes: If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just at the handwaving stage at this point,

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Simon Riggs
On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote: - Based on a lot of the benchmarks and workloads I traced, the target buffer of read operations are typically accessed again shortly after the read, while writes are usually not. Therefore, the default operation

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 18:47 +, Heikki Linnakangas wrote: Tom Lane wrote: Jeff Davis [EMAIL PROTECTED] writes: If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. I think that's a good way to guarantee that you'll not finish in time

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jim Nasby
On Mar 6, 2007, at 12:17 AM, Tom Lane wrote: Jim Nasby [EMAIL PROTECTED] writes: An idea I've been thinking about would be to have the bgwriter or some other background process actually try and keep the free list populated, The bgwriter already tries to keep pages just in front of the clock

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jim Nasby
On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote: We also don't need an exact count, either. Perhaps there's some way we could keep a counter or something... Exact count of what? The pages already in cache? Yes. The idea being if you see there's 10k pages in cache, you can likely start 9k

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote: On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote: We also don't need an exact count, either. Perhaps there's some way we could keep a counter or something... Exact count of what? The pages already in cache? Yes. The idea being if you see

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 18:29 +, Heikki Linnakangas wrote: Jeff Davis wrote: On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Luke Lonergan
Hi Simon, and what you haven't said - all of this is orthogonal to the issue of buffer cache spoiling in PostgreSQL itself. That issue does still exist as a non-OS

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Hannu Krosing
On Tue, 2007-03-06 at 18:28, Jeff Davis wrote: On Tue, 2007-03-06 at 18:29 +, Heikki Linnakangas wrote: Jeff Davis wrote: On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Grzegorz Jaskiewicz [EMAIL PROTECTED] writes: On Mar 5, 2007, at 2:36 AM, Tom Lane wrote: I'm also less than convinced that it'd be helpful for a big seqscan: won't reading a new disk page into memory via DMA cause that memory to get flushed from the processor cache anyway? Nope. DMA is

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
So either way, it isn't in processor cache after the read. So how can there be any performance benefit? It's the copy from kernel IO cache to the buffer cache that is L2 sensitive. When the shared buffer cache is polluted, it thrashes the L2 cache. When the number of pages being written to

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom, Now this may only prove that the disk subsystem on this machine is too cheap to let the system show any CPU-related issues. Try it with a warm IO cache. As I posted before, we see double the performance of a VACUUM from a table in IO cache when the shared buffer cache isn't being

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Grzegorz Jaskiewicz
On Mar 5, 2007, at 2:36 AM, Tom Lane wrote: I'm also less than convinced that it'd be helpful for a big seqscan: won't reading a new disk page into memory via DMA cause that memory to get flushed from the processor cache anyway? Nope. DMA is writing directly into main memory.

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes: So either way, it isn't in processor cache after the read. So how can there be any performance benefit? It's the copy from kernel IO cache to the buffer cache that is L2 sensitive. When the shared buffer cache is polluted, it thrashes the L2 cache.

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom, Even granting that your conclusions are accurate, we are not in the business of optimizing Postgres for a single CPU architecture. I think you're missing my/our point: The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Heikki Linnakangas
Luke Lonergan wrote: The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed. My proposal for a fix: ensure that when relations larger (much larger?) than buffer cache are

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Hannu Krosing
On Mon, 2007-03-05 at 03:51, Luke Lonergan wrote: Hi Tom, Even granting that your conclusions are accurate, we are not in the business of optimizing Postgres for a single CPU architecture. I think you're missing my/our point: The Postgres shared buffer cache

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes: I think you're missing my/our point: The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed. No, this is not a bug; it is operating

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Florian Weimer
* Tom Lane: That makes absolutely zero sense. The data coming from the disk was certainly not in processor cache to start with, and I hope you're not suggesting that it matters whether the *target* page of a memcpy was already in processor cache. If the latter, it is not our bug to fix.

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Hannu Krosing
On Mon, 2007-03-05 at 04:15, Tom Lane wrote: Luke Lonergan [EMAIL PROTECTED] writes: I think you're missing my/our point: The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed. No, this is not a bug; it is operating as designed. The point of the current bufmgr algorithm is to

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Mark Kirkwood
Gavin Sherry wrote: On Mon, 5 Mar 2007, Mark Kirkwood wrote: To add a little to this - forgetting the scan resistant point for the moment... cranking down shared_buffers to be smaller than the L2 cache seems to help *any* sequential scan immensely, even on quite modest HW: (snipped) When

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Mark, lineitem has 1535724 pages (11997 MB)
Shared Buffers  Elapsed  IO rate (from vmstat)
--------------  -------  ---------------------
400MB           101 s    122 MB/s
2MB             100 s
1MB              97 s
768KB            93 s
512KB            86 s
256KB            77 s

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Gregory Stark
Luke Lonergan [EMAIL PROTECTED] writes: The evidence seems to clearly indicate reduced memory writing due to an L2 related effect. You might try using valgrind's cachegrind tool which I understand can actually emulate various processors' cache to show how efficiently code uses it. I haven't

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
Shared Buffers  Elapsed  IO rate (from vmstat)
--------------  -------  ---------------------
400MB           101 s    122 MB/s
2MB             100 s
1MB              97 s
768KB            93 s
512KB            86 s
256KB            77 s
128KB

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Pavan Deolasee
Tom Lane wrote: Mark Kirkwood [EMAIL PROTECTED] writes:
Shared Buffers  Elapsed  IO rate (from vmstat)
--------------  -------  ---------------------
400MB           101 s    122 MB/s
2MB             100 s
1MB              97 s
768KB            93 s
512KB            86 s
256KB            77

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Hi Tom, On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote: Hm, that seems to blow the "it's an L2 cache effect" theory out of the water. If it were a cache effect then there should be a performance cliff at the point where the cache size is exceeded. I see no such cliff, in fact the middle

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes: Isn't the size of the shared buffer pool itself acting as a performance penalty in this case? Maybe StrategyGetBuffer() needs to make multiple passes over the buffers before the usage_count of any buffer is reduced to zero and the buffer is chosen as

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Josh Berkus
Tom, Yes, autovacuum is off, and bgwriter shouldn't have anything useful to do either, so I'm a bit at a loss what's going on --- but in any case, it doesn't look like we are cycling through the entire buffer space for each fetch. I'd be happy to DTrace it, but I'm a little lost as to where

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Tom, On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote: Hm, that seems to blow the "it's an L2 cache effect" theory out of the water. If it were a cache effect then there should be a performance cliff at the point where the cache size is exceeded. I see no such cliff, in fact the middle

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
I wrote: Pavan Deolasee [EMAIL PROTECTED] writes: Isn't the size of the shared buffer pool itself acting as a performance penalty in this case? Maybe StrategyGetBuffer() needs to make multiple passes over the buffers before the usage_count of any buffer is reduced to zero and the buffer is

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Here's four more points on the curve - I'd use a dirac delta function for your curve fit ;-)
Shared_buffers (KB)  Select Count (s)  Vacuum (s)
===================  ================  ==========
248                  5.52              2.46
368                  4.77              2.40
552

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Josh Berkus
Tom, I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected it as being a bad idea.  But at least with large

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Pavan Deolasee
Tom Lane wrote: Nope, Pavan's nailed it: the problem is that after using a buffer, the seqscan leaves it with usage_count = 1, which means it has to be passed over once by the clock sweep before it can be re-used. I was misled in the 32-buffer case because catalog accesses during startup had

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes: I am wondering whether seqscan would set the usage_count to 1 or to a higher value. usage_count is incremented while unpinning the buffer. Even if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned every time seqscan

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes: I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected it as being a bad

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Simon Riggs
On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote: Tom, I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes: Itagaki-san and I were discussing in January the idea of cache-looping, whereby a process begins to reuse its own buffers in a ring of ~32 buffers. When we cycle back round, if usage_count==1 then we assume that we can reuse that buffer. This avoids cache
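The cache-looping idea quoted above can be sketched as a small private ring. This is a minimal illustration, not the scan_recycle_buffers patch itself: the class name ScanRing, its fields, and the round-robin stand-in for the normal buffer allocator are all hypothetical. The snippet only states that a buffer is reused when usage_count == 1; what happens otherwise is an assumption here (the slot is refilled from the shared pool).

```python
class ScanRing:
    """Toy model of a scan that recycles its own small ring of buffers."""

    def __init__(self, pool_size, ring_size=32):
        self.usage_count = [0] * pool_size  # one counter per shared buffer
        self.ring = [None] * ring_size      # buffer ids this scan owns
        self.slot = 0                       # next ring slot to use
        self.next_free = 0                  # hypothetical stand-in allocator

    def _allocate_from_pool(self):
        # Assumption: a trivial round-robin allocator replaces the real
        # clock sweep, which is not modelled here.
        buf = self.next_free
        self.next_free = (self.next_free + 1) % len(self.usage_count)
        return buf

    def get_buffer(self):
        buf = self.ring[self.slot]
        # Reuse our own buffer only if nobody else has touched it since
        # (usage_count == 1); otherwise take a buffer from the pool.
        if buf is None or self.usage_count[buf] != 1:
            buf = self._allocate_from_pool()
            self.ring[self.slot] = buf
        self.usage_count[buf] = 1
        self.slot = (self.slot + 1) % len(self.ring)
        return buf
```

In this model a scan of any length touches only ring_size distinct buffers, unless another backend raises the usage_count of one of them, in which case that slot is replaced and the popular buffer is left in the shared pool.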

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote: Tom, I seem to recall that we've previously discussed the idea of letting the clock sweep

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Simon Riggs
On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Itagaki-san and I were discussing in January the idea of cache-looping, whereby a process begins to reuse its own buffers in a ring of ~32 buffers. When we cycle back round, if usage_count==1 then we

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 03:51 -0500, Luke Lonergan wrote: The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed. My proposal for a fix: ensure that when relations larger

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 11:10 +0200, Hannu Krosing wrote: My proposal for a fix: ensure that when relations larger (much larger?) than buffer cache are scanned, they are mapped to a single page in the shared buffer cache. How will this approach play together with synchronized scan patches ?

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes: Best way is to prove it though. Seems like not too much work to have a private ring data structure when the hint is enabled. The extra bookkeeping is easily going to be outweighed by the reduction in mem-L2 cache fetches. I'll do it tomorrow, if no other

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 09:09 +, Heikki Linnakangas wrote: In fact, the pages that are left in the cache after the seqscan finishes would be useful for the next seqscan of the same table if we were smart enough to read those pages first. That'd make a big difference for seqscanning a
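Heikki's point about reading the leftover cached pages first can be sketched as a page-ordering function. This is only an illustration under stated assumptions: the function name cached_first_order is hypothetical, the set of cached page numbers is assumed to be known up front, and nothing here reflects how PostgreSQL would actually discover which pages are in shared buffers.

```python
def cached_first_order(n_pages, cached_pages):
    """Return a visiting order that reads already-cached pages first."""
    # Keep only valid page numbers, in ascending order.
    cached = sorted(p for p in set(cached_pages) if 0 <= p < n_pages)
    cached_set = set(cached)
    # The remaining pages are read from disk afterwards, in file order.
    rest = [p for p in range(n_pages) if p not in cached_set]
    return cached + rest
```

For a 6-page table with pages 1 and 4 cached, the order is 1, 4, 0, 2, 3, 5, so the cached pages are consumed before the scan itself can evict them.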

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes: Absolutely. I've got a parameter in my patch sync_scan_offset that starts a seq scan N pages before the position of the last seq scan running on that table (or a current seq scan if there's still a scan going). Strikes me that expressing that parameter as
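The sync_scan_offset behaviour Jeff describes, starting a scan N pages behind the last known scan position, can be sketched as follows. This is a guess at the shape of the idea, not code from the patch: the function name scan_order is hypothetical, and wrapping around when the offset reaches past the start of the table is an assumption the snippet does not confirm.

```python
def scan_order(n_pages, last_scan_pos, sync_scan_offset):
    """Page order for a scan starting sync_scan_offset pages behind last_scan_pos."""
    # Assumption: positions wrap around the table rather than clamping at 0.
    start = (last_scan_pos - sync_scan_offset) % n_pages
    # Visit every page exactly once, wrapping at the end of the table.
    return [(start + i) % n_pages for i in range(n_pages)]
```

With a 10-page table, a previous scan at page 7, and an offset of 3, the new scan starts at page 4 and wraps: 4, 5, 6, 7, 8, 9, 0, 1, 2, 3.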

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote: Jeff Davis [EMAIL PROTECTED] writes: Absolutely. I've got a parameter in my patch sync_scan_offset that starts a seq scan N pages before the position of the last seq scan running on that table (or a current seq scan if there's still a scan

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes: On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote: Strikes me that expressing that parameter as a percentage of shared_buffers might make it less in need of manual tuning ... The original patch was a percentage of effective_cache_size, because in theory

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Heikki Linnakangas
Jeff Davis wrote: On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote: Jeff Davis [EMAIL PROTECTED] writes: Absolutely. I've got a parameter in my patch sync_scan_offset that starts a seq scan N pages before the position of the last seq scan running on that table (or a current seq scan if

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Mark Kirkwood
Tom Lane wrote: So the problem is not so much the clock sweep overhead as that it's paid in a very nonuniform fashion: with N buffers you pay O(N) once every N reads and O(1) the rest of the time. This is no doubt slowing things down enough to delay that one read, instead of leaving it nicely
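The nonuniform cost Tom describes can be reproduced with a toy simulation. This is a minimal sketch, not PostgreSQL's bufmgr code: the function name clock_sweep_costs and the list-of-counters representation are hypothetical, and it assumes a single sequential scan over an initially idle pool where each buffer the scan uses is left with usage_count = 1.

```python
def clock_sweep_costs(n_buffers, n_reads):
    """Return, for each page read, how many buffers the sweep inspected."""
    usage = [0] * n_buffers  # usage_count per buffer
    hand = 0                 # clock sweep pointer
    costs = []
    for _ in range(n_reads):
        inspected = 0
        while True:
            inspected += 1
            victim = hand
            hand = (hand + 1) % n_buffers
            if usage[victim] == 0:
                # Buffer is reusable; the scan then leaves it at 1.
                usage[victim] = 1
                break
            # Otherwise decrement and keep sweeping.
            usage[victim] -= 1
        costs.append(inspected)
    return costs
```

With 4 buffers and 12 reads this yields 1, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1: once the pool is full, one read in every N pays for a full pass over the pool and the rest find a free buffer immediately.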

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes: Tom Lane wrote: Mark, can you detect hiccups in the read rate using your setup? I think so, here's the vmstat output for 400MB of shared_buffers during the scan: Hm, not really a smoking gun there. But just for grins, would you try this patch and see

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Mark Kirkwood
Tom Lane wrote: Hm, not really a smoking gun there. But just for grins, would you try this patch and see if the numbers change? Applied to 8.2.3 (don't have lineitem loaded in HEAD yet) - no change that I can see: procs ---memory-- ---swap-- -io --system--

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes: Elapsed time is exactly the same (101 s). Is it expected that HEAD would behave differently? Offhand I don't think so. But what I wanted to see was the curve of elapsed time vs shared_buffers? regards, tom lane

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jeff Davis
On Mon, 2007-03-05 at 21:03 +, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Florian G. Pflug
Simon Riggs wrote: On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Itagaki-san and I were discussing in January the idea of cache-looping, whereby a process begins to reuse its own buffers in a ring of ~32 buffers. When we cycle back round, if

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Mark Kirkwood
Tom Lane wrote: But what I wanted to see was the curve of elapsed time vs shared_buffers? Of course! (let's just write that off to me being pre coffee...). With the patch applied:
Shared Buffers  Elapsed  vmstat IO rate
--------------  -------  --------------
400MB           101 s    122

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes: Tom Lane wrote: But what I wanted to see was the curve of elapsed time vs shared_buffers? ... Looks *very* similar. Yup, thanks for checking. I've been poking into this myself. I find that I can reproduce the behavior to some extent even with a slow

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes: I don't see any good reason why overwriting a whole cache line oughtn't be the same speed either way. I can think of a couple theories, but I don't know if they're reasonable. The one that comes to mind is the inter-processor cache coherency protocol. When

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Mark Kirkwood [EMAIL PROTECTED] writes: Tom Lane wrote: But what I wanted to see was the curve of elapsed time vs shared_buffers? ... Looks *very

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes: What happens if VACUUM comes across buffers that *are* already in the buffer cache. Does it throw those on the freelist too? Not unless they have usage_count 0, in which case they'd be subject to recycling by the next clock sweep anyway.

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes: Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it). Cool. Maybe Sherry can comment on the question whether it's possible for a large-scale-memcpy to not take a hit on filling a cache line that wasn't previously in cache? I

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Luke Lonergan
Tom, On 3/5/07 7:58 PM, Tom Lane [EMAIL PROTECTED] wrote: I looked a bit at the Linux code that's being used here, but it's all x86_64 assembler which is something I've never studied :-(. Here's the C wrapper routine in Solaris:

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes: Here's the x86 assembler routine for Solaris: http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/copy.s The actual uiomove routine is a simple wrapper that calls the assembler kcopy or xcopyout routines. There are two

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jim Nasby
On Mar 5, 2007, at 11:46 AM, Josh Berkus wrote: Tom, I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Jim Nasby
On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly CPU and

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-05 Thread Tom Lane
Jim Nasby [EMAIL PROTECTED] writes: An idea I've been thinking about would be to have the bgwriter or some other background process actually try and keep the free list populated, The bgwriter already tries to keep pages just in front of the clock sweep pointer clean.

[HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
I'm putting this out there before we publish a fix so that we can discuss how best to fix it. Doug and Sherry recently found the source of an important performance issue with the Postgres shared buffer cache. The issue is summarized like this: the buffer cache in PGSQL is not scan resistant as

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes: The issue is summarized like this: the buffer cache in PGSQL is not scan resistant as advertised. Sure it is. As near as I can tell, your real complaint is that the bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache; which is hardly

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
on ma treo -Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Sunday, March 04, 2007 08:36 PM Eastern Standard Time To: Luke Lonergan Cc: PGSQL Hackers; Doug Rady; Sherry Moore Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant Luke Lonergan

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
[mailto:[EMAIL PROTECTED] Sent: Sunday, March 04, 2007 08:36 PM Eastern Standard Time To: Luke Lonergan Cc: PGSQL Hackers; Doug Rady; Sherry Moore Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant Luke Lonergan [EMAIL PROTECTED] writes: The issue is summarized like

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Mark Kirkwood
Tom Lane wrote: Luke Lonergan [EMAIL PROTECTED] writes: The issue is summarized like this: the buffer cache in PGSQL is not scan resistant as advertised. Sure it is. As near as I can tell, your real complaint is that the bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Gavin Sherry
On Mon, 5 Mar 2007, Mark Kirkwood wrote: To add a little to this - forgetting the scan resistant point for the moment... cranking down shared_buffers to be smaller than the L2 cache seems to help *any* sequential scan immensely, even on quite modest HW: e.g: PIII 1.26Ghz 512Kb L2 cache, 2G

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Luke Lonergan
Gavin, Mark, Could you demonstrate that point by showing us timings for shared_buffers sizes from 512K up to, say, 2 MB? The two numbers you give there might just have to do with managing a large buffer. I suggest two experiments that we've already done: 1) increase shared buffers to

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-04 Thread Tom Lane
Gavin Sherry [EMAIL PROTECTED] writes: Could you demonstrate that point by showing us timings for shared_buffers sizes from 512K up to, say, 2 MB? The two numbers you give there might just have to do with managing a large buffer. Using PG CVS HEAD on 64-bit Intel Xeon (1MB L2 cache), Fedora