Re: [PATCHES] Seq scans status update
Hi All,

On 5/31/07 12:40 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote:
> BTW, we've been talking about the "L2 cache effect" but we don't really
> know for sure if the effect has anything to do with the L2 cache. But
> whatever it is, it's real.

The mailing list archives contain ample evidence that:

- it's definitely an L2 cache effect
- on fast I/O hardware, tests show large benefits from keeping the ring in L2

I see no reason to re-open the discussion about these; can we accept them as fact and continue?

- Luke
Re: [PATCHES] Seq scans status update
Luke Lonergan [EMAIL PROTECTED] writes:
> The mailing list archives contain ample evidence that:
> - it's definitely an L2 cache effect
> - on fast I/O hardware, tests show large benefits from keeping the ring in L2

Please provide some specific pointers, because I don't remember that.

			regards, tom lane
Re: [PATCHES] Seq scans status update
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> I just ran a quick test with 4 concurrent scans on a dual-core system,
>> and it looks like we do leak buffers from the rings because they're
>> pinned at the time they would be recycled.
>
> Yeah, I noticed the same in some tests here. I think there's not a lot
> we can do about that; we don't have enough visibility into why someone
> else has the buffer pinned.

We could stash pinned buffers to some other list etc. and try them again later. But that gets a lot more complex.

> Using a larger ring would help, by making it less probable that any
> other sync-scanning backend is so far behind as to still have the
> oldest element of our ring pinned. But if we do that we have the
> L2-cache-size effect to worry about. Is there any actual data backing
> up that it's useful to keep the ring fitting in L2, or is that just
> guesswork? In the sync-scan case the idea seems pretty bogus anyway,
> because the actual working set will be N backends' rings, not just one.

Yes, I tested different ring sizes here:
http://archives.postgresql.org/pgsql-hackers/2007-05/msg00469.php

The tests above showed the effect when reading a table from OS cache. I haven't seen direct evidence supporting Luke's claim that the ring makes scans of tables bigger than RAM go faster with bigger I/O hardware, because I don't have such hardware at hand. We did repeat the tests on different hardware, however, and monitored the CPU usage with vmstat at the same time. The CPU usage was significantly lower with the patch, so I believe that with better I/O hardware the test would become limited by CPU, and the patch would therefore make it go faster.

BTW, we've been talking about the "L2 cache effect", but we don't really know for sure if the effect has anything to do with the L2 cache. But whatever it is, it's real.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Seq scans status update
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>>> A heapscan would pin the buffer only once and hence bump its count at
>>> most once, so I don't see a big problem here. Also, I'd argue that
>>> buffers that had a positive usage_count shouldn't get sucked into the
>>> ring to begin with.
>>
>> True, except that with the synchronized scans patch two synchronized
>> scans will pin the buffer twice.
>
> Hmm. But we probably don't want the same buffer in two different
> backends' rings, either. You *sure* the sync-scan patch has no
> interaction with this one?

A buffer is only put into the ring when it's requested with StrategyGetBuffer. If a page is requested with ReadBuffer and it's in cache already, it won't be put in the ring. When multiple scanners are scanning synchronously, they will all use their own, distinct rings when reading buffers into the cache, and peek into the buffers in other backends' rings for pages that others have read into the buffer cache. I'm going to test the interactions in more detail...

> One other question: I see the patch sets the threshold for switching
> from normal to ring-buffer heapscans at table size = NBuffers. Why so
> high? I'd have expected maybe at most NBuffers/4 or NBuffers/10. If
> you don't want a seqscan blowing out your buffer cache, you surely
> don't want it blowing out 90% of the cache either.

NBuffers is the maximum value that makes sense; if you're scanning more than NBuffers, the scan is definitely not going to fit in shared_buffers. Anything less than that and we might be causing harm to some use cases, so I chose that for the time being.

Simon argued for a GUC variable, and Jeff's patch as it stands introduces one. I'm not sure we want one, but if we do, we should use the same variable to control both the sync scan and the cache replacement policy. It's essentially "how large a scan do you expect to fit in shared_buffers?".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
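For illustration, a compilable sketch of the membership rule described above: only buffers allocated through StrategyGetBuffer join a scan's ring; a cache hit never does. All names here are stand-ins for the sketch, not identifiers from the patch:

    typedef struct Ring Ring;            /* per-scan ring; details elided */

    extern int  lookup_in_cache(int pageno);    /* returns -1 on cache miss */
    extern int  ring_get_victim(Ring *ring);    /* recycles/extends the ring */
    extern void read_page_into_buffer(int buf, int pageno);

    /* sketch of a ReadBuffer call while a bulk-read hint is in effect */
    int
    read_buffer_bulk(Ring *ring, int pageno)
    {
        int     buf = lookup_in_cache(pageno);

        if (buf >= 0)
            return buf;          /* cache hit: buffer does NOT join the ring */

        /* cache miss: the victim comes from, and stays in, this scan's ring */
        buf = ring_get_victim(ring);
        read_page_into_buffer(buf, pageno);
        return buf;
    }

Under this rule, a synchronized follower mostly takes the cache-hit path into pages the leader read, so the rings of concurrent scans stay distinct.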
Re: [PATCHES] Seq scans status update
On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote:
>> Hmm. But we probably don't want the same buffer in two different
>> backends' rings, either. You *sure* the sync-scan patch has no
>> interaction with this one?
>
> I will run some tests again tonight, I think the interaction needs
> more testing than I did originally. Also, I'm not sure that the
> hardware I have is sufficient to test those cases.

I ran some sanity tests last night with cvs head, plus my syncscan20-cvshead.patch, plus scan_recycle_buffers.v3.patch. It passed the sanity tests at least.

I did see more interference with sync_seqscan_threshold=0 (always on) and scan_recycle_buffers=0 (off) than I had previously seen with 8.2.4, so I will test again against 8.2.4 to see why that might be. The interference I saw was still quite small: a scan moving concurrently with 9 other scans was about 10% slower than a scan running alone -- which is still very good compared with plain cvs head and no sync scan -- it's just not ideal.

However, varying scan_recycle_buffers among 0 (off), 16, 32, and 128 didn't have much effect. At 32 it appeared to be about 1% worse during 10 scans, but that may have been noise. The other values I tried didn't make any difference that I could see.

This was really just a quick sanity test; I think more hard data would be useful.

Regards,
	Jeff Davis
Re: [PATCHES] Seq scans status update
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> One other question: I see the patch sets the threshold for switching
>> from normal to ring-buffer heapscans at table size = NBuffers. Why so
>> high? I'd have expected maybe at most NBuffers/4 or NBuffers/10. If
>> you don't want a seqscan blowing out your buffer cache, you surely
>> don't want it blowing out 90% of the cache either.
>
> NBuffers is the maximum value that makes sense; if you're scanning
> more than NBuffers, the scan is definitely not going to fit in
> shared_buffers. Anything less than that and we might be causing harm
> to some use cases, so I chose that for the time being.

But the flip side of that is you're failing to provide the benefit of the patch in quite a lot of use-cases where it's clearly beneficial. I just don't believe that there are very many cases where people will want a heapscan to eat 90% of their cache.

> Simon argued for a GUC variable, and Jeff's patch as it stands
> introduces one. I'm not sure we want one, but if we do, we should use
> the same variable to control both the sync scan and the cache
> replacement policy. It's essentially "how large a scan do you expect
> to fit in shared_buffers?".

Well, let's do some experiments and see if there's really any point in varying the cutover.

			regards, tom lane
Re: [PATCHES] Seq scans status update
Jeff Davis wrote:
> I ran some sanity tests last night with cvs head, plus my
> syncscan20-cvshead.patch, plus scan_recycle_buffers.v3.patch. It
> passed the sanity tests at least.
> [...]
> This was really just a quick sanity test; I think more hard data
> would be useful.

The interesting question is whether the small buffer ring is big enough to let all the synchronized scans process a page before it's recycled. Keep an eye on pg_buffercache to see if it gets filled with pages from the table you're querying.

I just ran a quick test with 4 concurrent scans on a dual-core system, and it looks like we do leak buffers from the rings because they're pinned at the time they would be recycled. A full scan of a 30GB table took just under 7 minutes, and starting after a postmaster restart it took ~4-5 minutes until all of the 320MB of shared_buffers were used. That means we're leaking a buffer from the ring very roughly on every 20th-40th ReadBuffer call, but I'll put in some proper instrumentation and test with different configurations to get a better picture.

Synchronized scans give such a big benefit when they're applicable that I think some cache-spoiling is acceptable, and in fact unavoidable in some scenarios. It's much better than the 8.2 behavior anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
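A sketch of the instrumentation idea mentioned above: count how often the ring's oldest slot has to be abandoned because the buffer is still pinned elsewhere. Stand-in names; not code from the patch:

    #include <stdio.h>
    #include <stdbool.h>

    extern bool pinned_by_other_backend(int buf);   /* stand-in pin check */
    extern int  clock_sweep_victim(void);           /* normal allocation path */

    static long ring_reads;     /* recycle attempts */
    static long ring_leaks;     /* buffers left behind, still pinned */

    /* recycle the ring's oldest buffer, or leak it if it's still pinned */
    int
    ring_recycle(int oldest_buf)
    {
        ring_reads++;

        if (oldest_buf >= 0 && !pinned_by_other_backend(oldest_buf))
            return oldest_buf;                  /* common case: reuse our own */

        ring_leaks++;                           /* it stays in shared_buffers */
        if (ring_leaks % 1000 == 0)
            fprintf(stderr, "ring leaks: 1 per %ld reads\n",
                    ring_reads / ring_leaks);

        return clock_sweep_victim();
    }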
Re: [PATCHES] Seq scans status update
Heikki Linnakangas [EMAIL PROTECTED] writes:
> I just ran a quick test with 4 concurrent scans on a dual-core system,
> and it looks like we do leak buffers from the rings because they're
> pinned at the time they would be recycled.

Yeah, I noticed the same in some tests here. I think there's not a lot we can do about that; we don't have enough visibility into why someone else has the buffer pinned.

Using a larger ring would help, by making it less probable that any other sync-scanning backend is so far behind as to still have the oldest element of our ring pinned. But if we do that we have the L2-cache-size effect to worry about. Is there any actual data backing up that it's useful to keep the ring fitting in L2, or is that just guesswork? In the sync-scan case the idea seems pretty bogus anyway, because the actual working set will be N backends' rings, not just one.

			regards, tom lane
Re: [PATCHES] Seq scans status update
On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
> In the sync-scan case the idea seems pretty bogus anyway, because the
> actual working set will be N backends' rings, not just one.

I don't follow. Ideally, in the sync-scan case, the sets of buffers in the rings of different scans on the same relation will overlap completely, right?

We might not be at the ideal, but the sets of buffers in the rings shouldn't be disjoint, should they?

Regards,
	Jeff Davis
Re: [PATCHES] Seq scans status update
Jeff Davis [EMAIL PROTECTED] writes:
> On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
>> In the sync-scan case the idea seems pretty bogus anyway, because the
>> actual working set will be N backends' rings, not just one.
>
> I don't follow. Ideally, in the sync-scan case, the sets of buffers in
> the rings of different scans on the same relation will overlap
> completely, right? We might not be at the ideal, but the sets of
> buffers in the rings shouldn't be disjoint, should they?

According to Heikki's explanation here
http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php
each backend doing a heapscan would collect its own ring of buffers. You might have a few backends that are always followers, never leaders, and so never actually fetch any pages --- but for each backend that actually did any I/O there would be a separate ring. In practice I'd expect the lead to change hands pretty often, and so you'd see all the backends accumulating their own rings.

			regards, tom lane
Re: [PATCHES] Seq scans status update
Tom Lane wrote:
> Gregory Stark [EMAIL PROTECTED] writes:
>> Is there a reason UnpinBuffer has to be the one to increment the usage
>> count anyways? Why can't ReadBuffer handle incrementing the count and
>> just trust that it won't be decremented until the buffer is unpinned
>> anyways?
>
> That's a good question. I think the idea was that if we hold a buffer
> pinned for awhile (long enough that the bgwriter's clock sweep passes
> over it one or more times), we want the usage count decrementing to
> start when we release the pin, not when we acquire it. But maybe that
> could be fixed if the clock sweep doesn't touch the usage_count of a
> pinned buffer. Which in fact it may not do already --- didn't look.

It does -- in BgBufferSync, the "all" scan calls SyncOneBuffer with skip_pinned = false. The LRU scan does skip pinned buffers.

--
Alvaro Herrera                        Developer, http://www.PostgreSQL.org/
"World domination is proceeding according to plan"        (Andrew Morton)
Re: [PATCHES] Seq scans status update
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Gregory Stark [EMAIL PROTECTED] writes:
>>> Is there a reason UnpinBuffer has to be the one to increment the
>>> usage count anyways? Why can't ReadBuffer handle incrementing the
>>> count and just trust that it won't be decremented until the buffer
>>> is unpinned anyways?
>>
>> That's a good question. I think the idea was that if we hold a buffer
>> pinned for awhile (long enough that the bgwriter's clock sweep passes
>> over it one or more times), we want the usage count decrementing to
>> start when we release the pin, not when we acquire it. But maybe that
>> could be fixed if the clock sweep doesn't touch the usage_count of a
>> pinned buffer. Which in fact it may not do already --- didn't look.
>
> It does -- in BgBufferSync, the "all" scan calls SyncOneBuffer with
> skip_pinned = false. The LRU scan does skip pinned buffers.

You're looking in the wrong place. StrategyGetBuffer drives the clock sweep, and it always decreases the usage_count; IOW it doesn't skip pinned buffers. SyncOneBuffer and BgBufferSync don't decrease the usage_count in any case.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
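To make the distinction concrete, here is a simplified model of the sweep being described: StrategyGetBuffer's clock hand decrements usage_count on every buffer it passes, pinned or not, but can only hand back an unpinned one. Stand-in arrays, not the real freelist.c:

    extern int n_buffers;
    extern int refcount[];       /* > 0 means the buffer is pinned */
    extern int usage_count[];

    int
    clock_sweep_victim(void)
    {
        static int hand = 0;

        for (;;)
        {
            int     buf = hand;

            hand = (hand + 1) % n_buffers;

            /* usable only if unpinned and its usage count has hit zero */
            if (refcount[buf] == 0 && usage_count[buf] == 0)
                return buf;

            /* the count is decremented even when the buffer is pinned */
            if (usage_count[buf] > 0)
                usage_count[buf]--;
        }
    }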
Re: [PATCHES] Seq scans status update
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Here's a new version, all known issues are now fixed. I'm now happy
> with this patch.

I'm looking this over and finding it fairly ugly from a system-structural point of view. In particular, this has pushed the single-global-variable StrategyHintVacuum() concept well past the breaking point. That API was sort of tolerable for vacuum, since vacuum isn't re-entrant and can't coexist with any other behavior within the same backend; though I never liked the fact that it affected vacuum's system-catalog accesses along with the intended table fetches. But now you've got bits of code hacking the access pattern in contexts that are far less predictable than vacuum ... and not guaranteeing to reset the access pattern on failure, either.

I think we've got to get rid of the global variable and make the access pattern be a parameter to StrategyGetBuffer, instead. Which in turn suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the next level up. This would serve directly for the heapscan case, and for the main heap accesses in vacuum/analyze, but it might be a bit of a pain to make index vacuuming pay attention to the strategy. The other case that the patch addresses is COPY TO REL, which we could handle if we were willing to pass a strategy parameter down through heap_insert and RelationGetBufferForTuple ... is it worth the trouble? I don't recall anyone having presented measurements to show COPY TO REL being helped by this patch.

I don't especially care for the ring data structure being global inside freelist.c, either. What I'm inclined to do is have the access strategy parameter actually be a pointer to some struct type, with NULL representing the default strategy. At the beginning of a bulk operation we'd create an access strategy object, which would be internally the current Ring data structure, but it'd be opaque to code outside freelist.c. This'd mean that a query using two large seqscans would use two rings, not one, but that's not necessarily bad; it would in any case make it a lot easier to generalize the strategy concept in future to support more than one kind of non-default policy being active in different parts of a query.

A point I have not figured out how to deal with is that in the patch as given, UnpinBuffer needs to know the strategy; and getting it that info would make the changes far more invasive. But the patch's behavior here is pretty risky anyway, since the strategy global at the time of unpin might have nothing to do with how it was set when the buffer was acquired. What I'm tempted to do is remove the special case there and adjust buffer acquisition instead (maybe make it decrement the usage_count when re-using a buffer from the ring).

Comments? I'm still playing with the ideas at this point...

			regards, tom lane
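A sketch of the API shape proposed above, with an opaque strategy object threaded down instead of a global hint. ReadBufferWithStrategy is the name suggested in the message; the other declarations and the ring_size parameter are assumptions for the sketch:

    typedef int Buffer;                       /* stand-ins for the real types */
    typedef unsigned int BlockNumber;
    typedef struct RelationData *Relation;

    /* opaque outside freelist.c; internally it wraps the Ring structure */
    typedef struct BufferAccessStrategyData *BufferAccessStrategy;

    /* created at the start of a bulk operation; NULL means default policy */
    extern BufferAccessStrategy GetAccessStrategy(int ring_size);
    extern void FreeAccessStrategy(BufferAccessStrategy strategy);

    /* the variant ReadBuffer, passing the strategy down to StrategyGetBuffer */
    extern Buffer ReadBufferWithStrategy(Relation reln,
                                         BlockNumber blockNum,
                                         BufferAccessStrategy strategy);

With the strategy as an explicit object, two large seqscans in one query simply create two strategies, and there is no global state to reset on error.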
Re: [PATCHES] Seq scans status update
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> A point I have not figured out how to deal with is that in the patch
>> as given, UnpinBuffer needs to know the strategy; and getting it that
>> info would make the changes far more invasive. But the patch's
>> behavior here is pretty risky anyway, since the strategy global at
>> the time of unpin might have nothing to do with how it was set when
>> the buffer was acquired.
>
> It's assumed that the same strategy is used when unpinning, which is
> true for the current usage (and apparently needs to be documented).

I don't believe that for a moment. Even in the trivial heapscan case, the last pin is typically dropped during a tupleslot clear operation sometime after the scan itself has moved on to the next page. In a more general context (such as parallel heapscan and indexscan on a rel) it's certainly too risky to assume it.

> Normally buffers that are in the ring are recycled quickly, in which
> case the usage_count will always be increased from 0->1 in
> UnpinBuffer, regardless of the check. The check is there to avoid
> inflating the usage_count on buffers that happened to be already in
> shared_buffers.

A heapscan would pin the buffer only once and hence bump its count at most once, so I don't see a big problem here. Also, I'd argue that buffers that had a positive usage_count shouldn't get sucked into the ring to begin with.

> If we figure out another way to deal with the COPY usage_count, maybe
> we could remove the check altogether. I've been thinking of changing
> COPY anyway so that it always kept the last page it inserted into
> pinned, to avoid the traffic to the buffer manager on each tuple.

This is getting fairly invasive for a part of the patch you've not justified at all yet...

> Let me know if you want me to make changes and submit a new version.

I can work on this stuff; please do the tests to show whether it's worth handling COPY TO REL, and keep on with Jeff's patch.

			regards, tom lane
Re: [PATCHES] Seq scans status update
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> It's assumed that the same strategy is used when unpinning, which is
>> true for the current usage (and apparently needs to be documented).
>
> I don't believe that for a moment. Even in the trivial heapscan case,
> the last pin is typically dropped during a tupleslot clear operation
> sometime after the scan itself has moved on to the next page. In a
> more general context (such as parallel heapscan and indexscan on a
> rel) it's certainly too risky to assume it.

Hmm, I guess you're right. :(

One idea is to keep track of which pins are taken using the bulk strategy. It's a bit tricky when a buffer is pinned multiple times, since we don't know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we could get away with just a flag per pinned buffer: set the flag when a buffer is pinned with the bulk strategy and it wasn't pinned by us before, and clear it when it's pinned with another strategy. I'm thinking we steal one bit from PrivateRefCount for this.

>> Normally buffers that are in the ring are recycled quickly, in which
>> case the usage_count will always be increased from 0->1 in
>> UnpinBuffer, regardless of the check. The check is there to avoid
>> inflating the usage_count on buffers that happened to be already in
>> shared_buffers.
>
> A heapscan would pin the buffer only once and hence bump its count at
> most once, so I don't see a big problem here. Also, I'd argue that
> buffers that had a positive usage_count shouldn't get sucked into the
> ring to begin with.

True, except that with the synchronized scans patch two synchronized scans will pin the buffer twice.

>> If we figure out another way to deal with the COPY usage_count,
>> maybe we could remove the check altogether. I've been thinking of
>> changing COPY anyway so that it always kept the last page it
>> inserted into pinned, to avoid the traffic to the buffer manager on
>> each tuple.
>
> This is getting fairly invasive for a part of the patch you've not
> justified at all yet...
>
>> Let me know if you want me to make changes and submit a new version.
>
> I can work on this stuff; please do the tests to show whether it's
> worth handling COPY TO REL, and keep on with Jeff's patch.

Ok.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
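A compilable sketch of the bit-stealing idea: the top bit of a backend's PrivateRefCount entry marks "first pin taken via the bulk strategy", and the remaining bits hold the pin count. The helper names and the exact layout are assumptions, not a worked-out patch:

    #include <stdint.h>
    #include <stdbool.h>

    #define BULK_PIN_FLAG   ((uint32_t) 1 << 31)
    #define PIN_COUNT_MASK  (~BULK_PIN_FLAG)

    static uint32_t PrivateRefCount[4096];  /* per-backend; one slot per buffer */

    static void
    pin_buffer(int buf, bool bulk)
    {
        uint32_t    count = PrivateRefCount[buf] & PIN_COUNT_MASK;

        if (count == 0 && bulk)
            PrivateRefCount[buf] = BULK_PIN_FLAG | 1;  /* first pin sets flag */
        else if (!bulk)
            PrivateRefCount[buf] = count + 1;          /* other strategy clears it */
        else
            PrivateRefCount[buf] += 1;                 /* repeat bulk pin keeps it */
    }

    /* returns true if bulk-ring unpin behavior should apply to this release */
    static bool
    unpin_buffer(int buf)
    {
        bool        bulk  = (PrivateRefCount[buf] & BULK_PIN_FLAG) != 0;
        uint32_t    count = (PrivateRefCount[buf] & PIN_COUNT_MASK) - 1;

        PrivateRefCount[buf] = (count == 0) ? 0
            : ((PrivateRefCount[buf] & BULK_PIN_FLAG) | count);

        return bulk && count == 0;
    }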
Re: [PATCHES] Seq scans status update
Heikki Linnakangas [EMAIL PROTECTED] writes:
> One idea is to keep track of which pins are taken using the bulk
> strategy. It's a bit tricky when a buffer is pinned multiple times,
> since we don't know which ReleaseBuffer corresponds to which
> ReadBuffer, but perhaps we could get away with just a flag per pinned
> buffer: set the flag when a buffer is pinned with the bulk strategy
> and it wasn't pinned by us before, and clear it when it's pinned with
> another strategy. I'm thinking we steal one bit from PrivateRefCount
> for this.

Seems like a mess. Why don't we just fix it so there's no need for different behavior at Unpin time? The facts on the ground are that the current patch's change in UnpinBuffer is a no-op anyway, because of the tupletableslot interference. The behavior I'm imagining is just that when we try to take a buffer from the ring, if its usage count exceeds 1 then drop it from the ring and get another buffer. 1 would be the expected case if no one had touched it since we last used it.

>> A heapscan would pin the buffer only once and hence bump its count at
>> most once, so I don't see a big problem here. Also, I'd argue that
>> buffers that had a positive usage_count shouldn't get sucked into the
>> ring to begin with.
>
> True, except that with the synchronized scans patch two synchronized
> scans will pin the buffer twice.

Hmm. But we probably don't want the same buffer in two different backends' rings, either. You *sure* the sync-scan patch has no interaction with this one?

One other question: I see the patch sets the threshold for switching from normal to ring-buffer heapscans at table size = NBuffers. Why so high? I'd have expected maybe at most NBuffers/4 or NBuffers/10. If you don't want a seqscan blowing out your buffer cache, you surely don't want it blowing out 90% of the cache either.

			regards, tom lane
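A sketch of the acquisition-side rule proposed here, with no special Unpin behavior at all. Stand-in helpers again, not patch code:

    #include <stdbool.h>

    extern bool buffer_is_pinned(int buf);      /* stand-in accessors */
    extern int  buffer_usage_count(int buf);
    extern int  clock_sweep_victim(void);

    int
    ring_take_buffer(int ringbuf)
    {
        /*
         * usage_count == 1 is the expected case: only our own previous
         * access since we put the buffer in the ring. A higher count
         * means someone else touched it, so drop it from the ring and
         * let the normal clock sweep pick a replacement.
         */
        if (ringbuf >= 0 && !buffer_is_pinned(ringbuf) &&
            buffer_usage_count(ringbuf) <= 1)
            return ringbuf;

        return clock_sweep_victim();
    }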
Re: [PATCHES] Seq scans status update
Tom Lane [EMAIL PROTECTED] writes:
> A point I have not figured out how to deal with is that in the patch
> as given, UnpinBuffer needs to know the strategy; and getting it that
> info would make the changes far more invasive. But the patch's
> behavior here is pretty risky anyway, since the strategy global at
> the time of unpin might have nothing to do with how it was set when
> the buffer was acquired. What I'm tempted to do is remove the special
> case there and adjust buffer acquisition instead (maybe make it
> decrement the usage_count when re-using a buffer from the ring).

Is there a reason UnpinBuffer has to be the one to increment the usage count anyways? Why can't ReadBuffer handle incrementing the count and just trust that it won't be decremented until the buffer is unpinned anyways?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
Re: [PATCHES] Seq scans status update
Gregory Stark [EMAIL PROTECTED] writes:
> Is there a reason UnpinBuffer has to be the one to increment the usage
> count anyways? Why can't ReadBuffer handle incrementing the count and
> just trust that it won't be decremented until the buffer is unpinned
> anyways?

That's a good question. I think the idea was that if we hold a buffer pinned for awhile (long enough that the bgwriter's clock sweep passes over it one or more times), we want the usage count decrementing to start when we release the pin, not when we acquire it. But maybe that could be fixed if the clock sweep doesn't touch the usage_count of a pinned buffer. Which in fact it may not do already --- didn't look.

			regards, tom lane
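A sketch contrasting the two schemes under discussion -- bumping usage_count when the pin is taken versus when it is released. MAX_USAGE is a stand-in for the real cap; this is a simplified model, not bufmgr.c:

    extern int refcount[];
    extern int usage_count[];

    #define MAX_USAGE 5

    /* Greg's suggestion: bump at pin time */
    void
    pin_buffer_bump_early(int buf)
    {
        refcount[buf]++;
        if (usage_count[buf] < MAX_USAGE)
            usage_count[buf]++;
        /*
         * Only safe if the clock sweep leaves pinned buffers' counts
         * alone; otherwise a long-held pin could be swept back to zero
         * before the buffer is even released.
         */
    }

    /* current behavior: bump at unpin time */
    void
    unpin_buffer_bump_late(int buf)
    {
        if (usage_count[buf] < MAX_USAGE)
            usage_count[buf]++;
        refcount[buf]--;
    }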
Re: [PATCHES] Seq scans status update
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> Here's a new version, all known issues are now fixed. I'm now happy
>> with this patch. Next, I'll start looking at the latest version of
>> Jeff's synchronized scans patch.
>
> I'm a bit confused --- weren't you intending to review these in
> parallel because of the possible interactions? Do you think this
> should be applied now, or does it need to wait to see what happens
> with Jeff's patch?

I think it should be applied now. I've briefly looked at Jeff's patch and I don't see any problems looming there. Jeff performed tests with Simon's original patch and his patch, and I think the results from those tests are still valid, since the basic behavior hasn't been changed. I'll repeat those tests myself, and run some more to see if the added CPU overhead shows up in tests, but I'm pretty confident the patches will work together as advertised.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Seq scans status update
Here's a new version, all known issues are now fixed. I'm now happy with this patch. Next, I'll start looking at the latest version of Jeff's synchronized scans patch.

Bruce Momjian wrote:
> Great. Based on this, do you have a patch that is ready to apply?
>
> Heikki Linnakangas wrote:
>> Here are the results of the DBT-2 tests:
>> http://community.enterprisedb.com/seqscan/imola/
>> [...]

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.232
diff -c -r1.232 heapam.c
*** src/backend/access/heap/heapam.c	8 Apr 2007 01:26:27 -	1.232
--- src/backend/access/heap/heapam.c	25 May 2007 19:22:33 -
***************
*** 83,88 ****
--- 83,96 ----
  	 */
  	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

+ 	/* A scan on a table smaller than shared_buffers is treated like random
+ 	 * access, but bigger scans use the bulk read page replacement policy.
+ 	 */
+ 	if (scan->rs_nblocks > NBuffers)
+ 		scan->rs_accesspattern = AP_BULKREAD;
+ 	else
+ 		scan->rs_accesspattern = AP_NORMAL;
+
  	scan->rs_inited = false;
  	scan->rs_ctup.t_data = NULL;
  	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 123,133 ****
--- 131,146 ----

  	Assert(page < scan->rs_nblocks);

+ 	/* Read the page with the right strategy */
+ 	SetAccessPattern(scan->rs_accesspattern);
+
  	scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
  										 scan->rs_rd,
  										 page);
  	scan->rs_cblock = page;

+ 	SetAccessPattern(AP_NORMAL);
+
  	if (!scan->rs_pageatatime)
  		return;

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.268
diff -c -r1.268 xlog.c
*** src/backend/access/transam/xlog.c	30 Apr 2007 21:01:52 -	1.268
--- src/backend/access/transam/xlog.c	15 May 2007 16:23:30 -
***************
*** 1668,1673 ****
--- 1668,1700 ----
  }

  /*
+  * Returns true if 'record' hasn't been flushed to disk yet.
+  */
+ bool
+ XLogNeedsFlush(XLogRecPtr record)
+ {
+ 	/* Quick exit if already known flushed */
+ 	if (XLByteLE(record, LogwrtResult.Flush))
+ 		return false;
+
+ 	/* read LogwrtResult and update local state */
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		LogwrtResult = xlogctl->LogwrtResult;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 	}
+
+ 	/* check again */
+ 	if (XLByteLE(record, LogwrtResult.Flush))
+ 		return false;
+
+ 	return true;
+ }
+
+ /*
   * Ensure that all XLOG data through the given position is flushed to disk.
   *
   * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not

Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.283
diff -c -r1.283 copy.c
*** src/backend/commands/copy.c	27 Apr 2007 22:05:46 -	1.283
--- src/backend/commands/copy.c	15 May 2007 17:05:29 -
***************
*** 1876,1881 ****
--- 1876,1888 ----
  	nfields = file_has_oids ? (attr_count + 1) : attr_count;
  	field_strings = (char **) palloc(nfields * sizeof(char *));

+ 	/* Use the special COPY buffer replacement
Re: [PATCHES] Seq scans status update
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Here's a new version, all known issues are now fixed. I'm now happy
> with this patch. Next, I'll start looking at the latest version of
> Jeff's synchronized scans patch.

I'm a bit confused --- weren't you intending to review these in parallel because of the possible interactions? Do you think this should be applied now, or does it need to wait to see what happens with Jeff's patch?

			regards, tom lane
Re: [PATCHES] Seq scans status update
Great. Based on this, do you have a patch that is ready to apply?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> In any case, I'd like to see more test results before we make a
>> decision. I'm running tests with DBT-2 and a seq scan running in the
>> background to see if the cache-spoiling effect shows up. I'm also
>> trying to get hold of some bigger hardware to run on. Running these
>> tests takes some calendar time, but the hard work has already been
>> done. I'm going to start reviewing Jeff's synchronized scans patch
>> now.
>
> Here are the results of the DBT-2 tests:
> http://community.enterprisedb.com/seqscan/imola/
> [...]

--
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB                            http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] Seq scans status update
Not yet, there's still one issue that needs fixing.

Bruce Momjian wrote:
> Great. Based on this, do you have a patch that is ready to apply?
>
> Heikki Linnakangas wrote:
>> Here are the results of the DBT-2 tests:
>> http://community.enterprisedb.com/seqscan/imola/
>> [...]

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Seq scans status update
I forgot to attach the program used to generate test data. Here it is.

Heikki Linnakangas wrote:
> Attached is a new version of Simon's scan-resistant buffer manager
> patch. It's not ready for committing yet because of a small issue I
> found this morning (* see bottom), but here's a status update. To
> recap, the basic idea is to use a small ring of buffers for large
> scans like VACUUM, COPY and seq-scans.
> [...]
Re: [PATCHES] Seq scans status update
Heikki Linnakangas wrote:
> In any case, I'd like to see more test results before we make a
> decision. I'm running tests with DBT-2 and a seq scan running in the
> background to see if the cache-spoiling effect shows up. I'm also
> trying to get hold of some bigger hardware to run on. Running these
> tests takes some calendar time, but the hard work has already been
> done. I'm going to start reviewing Jeff's synchronized scans patch
> now.

Here are the results of the DBT-2 tests:
http://community.enterprisedb.com/seqscan/imola/

In each of these tests, at the end of rampup a script is started that issues a "SELECT COUNT(*) FROM stock" in a loop, with a 2 minute delay between the end of the previous query and the start of the next one.

The patch makes the seq scans go significantly faster. In the 1 hour test period, the patched tests perform roughly 30-100% more selects than the unpatched tests.

With 100 and 105 warehouses, the patch also significantly reduces the impact of the seq scan on other queries; response times are lower with the patch. With 120 warehouses the reduction of impact is not as clear, but when you plot the response times it's still there (the plots on the "response times charts" page are useless because they're overwhelmed by the checkpoint spike).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Seq scans status update
CK Tan wrote:
> If it is convenient for you, could you run my patch against the same
> hardware and data to get some numbers on select for comparison?
> Although we don't address updates, copy, or inserts, we are definitely
> getting at least 20% improvement in scans here without poisoning the
> bufpool for large tables.

Sure, here you are:

 copy-cktan-big             | 00:10:08.843126
 copy-cktan-big             | 00:10:17.767606
 copy-cktan-big             | 00:09:37.797059
 copy_nowal-cktan-big       | 00:05:12.305025
 copy_nowal-cktan-big       | 00:05:10.518417
 copy_nowal-cktan-big       | 00:05:03.472939
 select-cktan-big           | 00:03:27.655982
 select-cktan-big           | 00:01:55.496408
 select-cktan-big           | 00:01:31.693856
 select-cktan-big           | 00:01:12.705268
 select-cktan-big           | 00:01:12.478247
 select-cktan-big           | 00:01:10.866484
 vacuum_clean-cktan-big     | 00:03:05.340875
 vacuum_clean-cktan-big     | 00:01:12.428317
 vacuum_clean-cktan-big     | 00:01:13.179957
 vacuum_clean-cktan-big     | 00:01:10.43
 vacuum_hintbits-cktan-big  | 00:03:58.78208
 vacuum_hintbits-cktan-big  | 00:04:02.515778
 vacuum_hintbits-cktan-big  | 00:04:19.083402
 vacuum_hintbits-cktan-big  | 00:04:11.170831
 copy-cktan-med             | 00:02:19.413484
 copy-cktan-med             | 00:02:22.270497
 copy-cktan-med             | 00:02:22.297946
 copy_nowal-cktan-med       | 00:01:31.192601
 copy_nowal-cktan-med       | 00:01:17.736356
 copy_nowal-cktan-med       | 00:01:32.272778
 select-cktan-med           | 00:00:03.774974
 select-cktan-med           | 00:00:01.279276
 select-cktan-med           | 00:00:01.297703
 select-cktan-med           | 00:00:01.304129
 select-cktan-med           | 00:00:01.297048
 select-cktan-med           | 00:00:01.306073
 vacuum_clean-cktan-med     | 00:00:01.820733
 vacuum_clean-cktan-med     | 00:00:01.755684
 vacuum_clean-cktan-med     | 00:00:01.755659
 vacuum_clean-cktan-med     | 00:00:01.750972
 vacuum_hintbits-cktan-med  | 00:00:01.58744
 vacuum_hintbits-cktan-med  | 00:00:06.216038
 vacuum_hintbits-cktan-med  | 00:00:02.201789
 vacuum_hintbits-cktan-med  | 00:00:36.494427

Select performance looks the same as with Simon's/my patch. That 20% gain must be hardware-dependent. My interpretation is that the CPU savings from more efficient cache usage only matter when the I/O bandwidth is high enough that the CPU becomes the bottleneck.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Seq scans status update
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> In any case, I do want this for VACUUMs to fix the "WAL flush for
>> every dirty page" problem. Maybe we should indeed drop the other
>> aspects of the patch and move on, I'm getting tired of this as well.
>
> Can we devise a small patch that fixes that issue without creating
> uncertain impacts on other behavior?

Yeah, certainly; that's my backup plan.

> The thing that has made me uncomfortable with this set of patches
> right along is the presumption that it's a good idea to tweak cache
> behavior to improve a small set of cases. We are using cache behavior
> (now clock-sweep, formerly LRU) that is well tested and proven across
> a large body of experience; my gut feeling is that deviating from
> that behavior is likely to be a net loss when the big picture is
> considered. I certainly have not seen any test results that try to
> prove that there's no such generalized disadvantage to these patches.

That was my initial reaction as well. However, people claimed that by using a small number of buffers you can achieve higher raw throughput. Avoiding the cache spoiling sounds plausible as well: if you're going to do a seq scan of a table larger than shared_buffers, you know those pages are not going to be needed in the near future; you're going to replace them yourself with pages from the same scan.

> One could argue that the real bug here is that VACUUM deviates from
> the standard behavior by forcing immediate recycling of pages it
> releases, and that getting rid of that wart is the correct answer
> rather than piling more warts atop it --- especially warts that will
> change the behavior for anything besides VACUUM.

Maybe a better approach would be switching to an algorithm that's more resistant to sequential scans by nature. That's definitely not going to happen for 8.3, though. It's also possible that whatever algorithm we choose doesn't make much difference in practice, because in typical configurations the OS cache is bigger and more significant anyway. It's also possible that there are some counter-productive interactions between our cache management and the OS's.

In any case, I'd like to see more test results before we make a decision. I'm running tests with DBT-2 and a seq scan running in the background to see if the cache-spoiling effect shows up. I'm also trying to get hold of some bigger hardware to run on. Running these tests takes some calendar time, but the hard work has already been done. I'm going to start reviewing Jeff's synchronized scans patch now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
[PATCHES] Seq scans status update
Attached is a new version of Simon's scan-resistant buffer manager patch. It's not ready for committing yet because of a small issue I found this morning (* see bottom), but here's a status update.

To recap, the basic idea is to use a small ring of buffers for large scans like VACUUM, COPY and seq-scans. Changes to the original patch:

- a different sized ring is used for VACUUMs and seq-scans than for COPY: VACUUM and seq-scans use a ring of 32 buffers, and COPY uses a ring of 4096 buffers in the default configuration. See README changes in the patch for the rationale.

- for queries with large seqscans, the buffer ring is only used for reads issued by the seq scan, not for any other reads in the query. The typical scenario where this matters is doing a large seq scan with a nested loop join to a smaller table. You don't want to use the buffer ring for index lookups inside the nested loop.

- for seqscans, drop buffers from the ring that would need a WAL flush to reuse. That makes bulk updates behave roughly like they do without the patch, instead of having to do a WAL flush every 32 pages.

I've spent a lot of time thinking of solutions to the last point. The obvious solution would be to not use the buffer ring for updating scans. The difficulty with that is that we don't know if a scan is read-only in heapam.c, where the hint to use the buffer ring is set.

I've completed a set of performance tests on a test server. The server has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Results for a 10 GB table:

 head-copy-bigtable               | 00:10:09.07016
 head-copy-bigtable               | 00:10:20.507357
 head-copy-bigtable               | 00:10:21.857677
 head-copy_nowal-bigtable         | 00:05:18.232956
 head-copy_nowal-bigtable         | 00:03:24.109047
 head-copy_nowal-bigtable         | 00:05:31.019643
 head-select-bigtable             | 00:03:47.102731
 head-select-bigtable             | 00:01:08.314719
 head-select-bigtable             | 00:01:08.238509
 head-select-bigtable             | 00:01:08.208563
 head-select-bigtable             | 00:01:08.28347
 head-select-bigtable             | 00:01:08.308671
 head-vacuum_clean-bigtable       | 00:01:04.227832
 head-vacuum_clean-bigtable       | 00:01:04.232258
 head-vacuum_clean-bigtable       | 00:01:04.294621
 head-vacuum_clean-bigtable       | 00:01:04.280677
 head-vacuum_hintbits-bigtable    | 00:04:01.123924
 head-vacuum_hintbits-bigtable    | 00:03:58.253175
 head-vacuum_hintbits-bigtable    | 00:04:26.318159
 head-vacuum_hintbits-bigtable    | 00:04:37.512965
 patched-copy-bigtable            | 00:09:52.776754
 patched-copy-bigtable            | 00:10:18.185826
 patched-copy-bigtable            | 00:10:16.975482
 patched-copy_nowal-bigtable      | 00:03:14.882366
 patched-copy_nowal-bigtable      | 00:04:01.04648
 patched-copy_nowal-bigtable      | 00:03:56.062272
 patched-select-bigtable          | 00:03:47.704154
 patched-select-bigtable          | 00:01:08.460326
 patched-select-bigtable          | 00:01:10.441544
 patched-select-bigtable          | 00:01:11.916221
 patched-select-bigtable          | 00:01:13.848038
 patched-select-bigtable          | 00:01:10.956133
 patched-vacuum_clean-bigtable    | 00:01:10.315439
 patched-vacuum_clean-bigtable    | 00:01:12.210537
 patched-vacuum_clean-bigtable    | 00:01:15.202114
 patched-vacuum_clean-bigtable    | 00:01:10.712235
 patched-vacuum_hintbits-bigtable | 00:03:42.279201
 patched-vacuum_hintbits-bigtable | 00:04:02.057778
 patched-vacuum_hintbits-bigtable | 00:04:26.805822
 patched-vacuum_hintbits-bigtable | 00:04:28.911184

In other words, the patch has no significant effect, as expected. The select times did go up by a couple of seconds, which I didn't expect, though. One theory is that unused shared_buffers are swapped out during the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and try again at some point.

Results for a 2 GB table:

 copy-medsize-unpatched            | 00:02:18.23246
 copy-medsize-unpatched            | 00:02:22.347194
 copy-medsize-unpatched            | 00:02:23.875874
 copy_nowal-medsize-unpatched      | 00:01:27.606334
 copy_nowal-medsize-unpatched      | 00:01:17.491243
 copy_nowal-medsize-unpatched      | 00:01:31.902719
 select-medsize-unpatched          | 00:00:03.786031
 select-medsize-unpatched          | 00:00:02.678069
 select-medsize-unpatched          | 00:00:02.666103
 select-medsize-unpatched          | 00:00:02.673494
 select-medsize-unpatched          | 00:00:02.669645
 select-medsize-unpatched          | 00:00:02.666278
 vacuum_clean-medsize-unpatched    | 00:00:01.091356
 vacuum_clean-medsize-unpatched    | 00:00:01.923138
 vacuum_clean-medsize-unpatched    | 00:00:01.917213
 vacuum_clean-medsize-unpatched    | 00:00:01.917333
 vacuum_hintbits-medsize-unpatched | 00:00:01.683718
 vacuum_hintbits-medsize-unpatched | 00:00:01.864003
 vacuum_hintbits-medsize-unpatched | 00:00:03.186596
Re: [PATCHES] Seq scans status update
Heikki Linnakangas [EMAIL PROTECTED] writes:
> I've completed a set of performance tests on a test server. The server
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Perhaps I'm misreading it, but these tests seem to show no improvement worth spending any effort on --- some of the tests improve a bit but many get worse, and that's for tests allegedly designed to highlight the improvement; there's been no attempt to measure the distributed cost of the additional logic in scenarios where it's not helpful. To say nothing of the likelihood that it will be flat-out counterproductive in yet other scenarios.

regression=# select id, sum(rt), count(*) from times group by id;
     id     |       sum       | count
------------+-----------------+-------
 10G        | 01:15:53.497114 |    20
 10Gpatched | 01:12:51.749906 |    20
 2G         | 00:11:54.343741 |    20
 2Gpatched  | 00:11:32.482733 |    20
(4 rows)

Should we not just reject the patch and move on to something more useful?

			regards, tom lane
Re: [PATCHES] Seq scans status update
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> I've completed a set of performance tests on a test server. The
>> server has 4 GB of RAM, of which 1 GB is used for shared_buffers.
>
> Perhaps I'm misreading it, but these tests seem to show no improvement
> worth spending any effort on --- some of the tests improve a bit but
> many get worse, and that's for tests allegedly designed to highlight
> the improvement; there's been no attempt to measure the distributed
> cost of the additional logic in scenarios where it's not helpful. To
> say nothing of the likelihood that it will be flat-out
> counterproductive in yet other scenarios.

Yeah, possibly. I've been thinking hard about scenarios where this would be counterproductive; bulk delete is one that the original patch hurt, but I think I have all the interesting scenarios covered now.

> Should we not just reject the patch and move on to something more
> useful?

If Luke is right, the L2 cache effect that's visible in these in-memory tests:

 select-medsize-patched   | 00:00:01.332975
 select-medsize-patched   | 00:00:01.33014
 select-medsize-patched   | 00:00:01.332392
 select-medsize-patched   | 00:00:01.333498
 select-medsize-patched   | 00:00:01.332692
 select-medsize-unpatched | 00:00:02.678069
 select-medsize-unpatched | 00:00:02.666103
 select-medsize-unpatched | 00:00:02.673494
 select-medsize-unpatched | 00:00:02.669645
 select-medsize-unpatched | 00:00:02.666278

is also visible on larger scans that don't fit in cache with bigger I/O hardware, and this patch would increase the max I/O throughput that we can handle on such hardware. I don't have such hardware available; I hope someone else will try that.

In addition to that, this should reduce the cache-spoiling effect of big scans and COPY. That's harder to quantify; I've been planning to run a TPC-C test with a large SELECT COUNT(*) running in the background to measure that, but me and the server have been busy with those other tests.

In any case, I do want this for VACUUMs to fix the "WAL flush for every dirty page" problem. Maybe we should indeed drop the other aspects of the patch and move on, I'm getting tired of this as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Seq scans status update
Heikki Linnakangas [EMAIL PROTECTED] writes:
> In any case, I do want this for VACUUMs to fix the "WAL flush for
> every dirty page" problem. Maybe we should indeed drop the other
> aspects of the patch and move on, I'm getting tired of this as well.

Can we devise a small patch that fixes that issue without creating uncertain impacts on other behavior?

The thing that has made me uncomfortable with this set of patches right along is the presumption that it's a good idea to tweak cache behavior to improve a small set of cases. We are using cache behavior (now clock-sweep, formerly LRU) that is well tested and proven across a large body of experience; my gut feeling is that deviating from that behavior is likely to be a net loss when the big picture is considered. I certainly have not seen any test results that try to prove that there's no such generalized disadvantage to these patches.

One could argue that the real bug here is that VACUUM deviates from the standard behavior by forcing immediate recycling of pages it releases, and that getting rid of that wart is the correct answer rather than piling more warts atop it --- especially warts that will change the behavior for anything besides VACUUM.

			regards, tom lane
Re: [PATCHES] Seq scans status update
Hi Heikki,

On 5/17/07 10:28 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote:
> is also visible on larger scans that don't fit in cache with bigger
> I/O hardware, and this patch would increase the max I/O throughput
> that we can handle on such hardware. I don't have such hardware
> available; I hope someone else will try that.

Yes, this is absolutely the case, in addition to the benefit of not polluting the bufcache with seq scans (as discussed in detail previously). We've adopted this (see CK's patch) with excellent results.

We can try your version on a machine with fast I/O and get back to you with a comparison of this and CK's version.

- Luke