Re: [HACKERS] posix_fadvise v22
Tom Lane <t...@sss.pgh.pa.us> writes:
> What I intend to do over the next day or so is commit the prefetch
> infrastructure and the bitmap scan prefetch logic, but I'm bouncing the
> indexscan part back for rework.  I think that it should be implemented
> in or near index_getnext() and pay attention to effective_io_concurrency.

The biggest question I have here is about doing it at the index_*
abstraction level.  I've looked pretty hard at how to do this and run into
some pretty major problems.  I don't see any way to do it without changing
the access method API.  The index_* methods cannot start fiddling with the
current scan position without messing up things like CURRENT OF and
mark/restore irrecoverably.  But if we're going to change the index AM API
then we lose all of the advantages of putting the logic in indexam.c in
the first place: it won't help any other index AM without special code in
each one.

The best plan I came up with at this level is to add an AM method

    amgetprefetchbitmap(IndexScanDesc scan, ScanDirection direction,
                        TIDBitmap *bitmap, int nblocks)

which returns up to nblocks worth of bitmap starting from the current scan
position in the specified direction, based on whatever's convenient for
the internal representation.  I think it would be best if it stopped at
the end of the page, at least if the next index page isn't in shared
buffers.

Then nodeIndexscan.c would keep track of the value of target_prefetch,
just like nodeBitmapHeapScan does, incrementing it whenever the caller
continues the scan and resetting it to zero if the direction changes.

However, the amgetprefetchbitmap() call would only remove duplicates from
the upcoming blocks; it wouldn't know which blocks have already been
prefetched by previous calls.  So nodeIndexscan would also have to remove
duplicates itself.  Splitting the work between three layers of abstraction
like this is pretty messy, creates a lot of duplicated work, and doesn't
seem to buy us anything.
It *still* wouldn't help any non-btree index types until they add the new
method -- and if they add the new method they might as well just add the
USE_PREFETCH code anyway.  I don't see how the new method is useful for
anything else.

I have another plan which would be a lot simpler, but it's a much more
brute-force approach.  We add a new scan pointer, in addition to the
current position and the mark, and a slew of new methods like
getnext_prefetch() and reset_prefetch() to reset it to the current
position.  Also, add a hash table internal to PrefetchBuffer and have it
return a boolean indicating whether it actually did a prefetch.  Then
index_prefetch would reset the prefetch pointer and scan forward using it,
calling PrefetchBuffer on *every* pointer and counting how many true
returns it sees from PrefetchBuffer.

[Hm, not quite: we want to count recent prefetches that are probably still
queued separately from old ones that are probably in cache.  And we also
have to think about how long to treat prefetches as probably still being
in cache.  But with some additional thought I think this could be made to
work.]

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] posix_fadvise v22
Gregory Stark <st...@enterprisedb.com> writes:
> Tom Lane <t...@sss.pgh.pa.us> writes:
>> 2. I fixed it so that setting effective_io_concurrency to zero disables
>> prefetching altogether; there was no way to do that in the patch as
>> submitted.
>
> Hm.  The original intent was that effective_io_concurrency = 1 meant no
> prefetching, because there was only one drive.

Well, no prefetch is an entirely different behavior from prefetch one
block ahead.  Given the way you've defined the GUC, a setting of one has
to mean the latter.  My complaint was basically that with the patch
applied, the code was physically incapable of providing the former --
which you'd surely want, if only for testing/comparison purposes.

			regards, tom lane
Re: [HACKERS] posix_fadvise v22
Robert Haas <robertmh...@gmail.com> writes:
> OK, here's an update of Greg's patch with the runtime configure test
> ripped out, some minor documentation tweaks, and a few unnecessary
> whitespace diff hunks quashed.  I think this is about ready for
> committer review.

I've started to look through this, and the only part I seriously don't
like is the nbtsearch.c changes.  I've got three complaints about that:

* Doing it inside the index AMs is wrong, or at the very least forces us
to do it over for each index AM (which the patch fails to do).

* As coded, it generates prefetch bursts that are much too large and too
widely spaced to be effective, not to mention that they entirely ignore
the effective_io_concurrency control knob as well as the order in which
the pages will actually be needed.  I wonder now whether Robert's
inability to see any benefit came because he was testing indexscans and
not bitmap scans.

* It's only accidental that it's not kicking in during a bitmap indexscan
and bollixing up the much-more-carefully-written nodeBitmapHeapscan
prefetch logic.

What I intend to do over the next day or so is commit the prefetch
infrastructure and the bitmap scan prefetch logic, but I'm bouncing the
indexscan part back for rework.  I think that it should be implemented in
or near index_getnext() and pay attention to effective_io_concurrency.

			regards, tom lane
Re: [HACKERS] posix_fadvise v22
> * As coded, it generates prefetch bursts that are much too large and too
> widely spaced to be effective, not to mention that they entirely ignore
> the effective_io_concurrency control knob as well as the order in which
> the pages will actually be needed.  I wonder now whether Robert's
> inability to see any benefit came because he was testing indexscans and
> not bitmap scans.

Nope, bitmap index scans.  For some reason I was having trouble generating
a plan that included an index scan, so I took the easy way out and tested
what was in front of me.  :-)

> What I intend to do over the next day or so is commit the prefetch
> infrastructure and the bitmap scan prefetch logic, but I'm bouncing the
> indexscan part back for rework.  I think that it should be implemented in
> or near index_getnext() and pay attention to effective_io_concurrency.

FWIW, following your first set of commits, I extracted, cleaned up, and
re-posted the uncommitted portion of the patch last night.  Based on this
it doesn't sound like there's much point in continuing to do that, but I'm
happy to do so if anyone thinks otherwise.

...Robert
Re: [HACKERS] posix_fadvise v22
Tom Lane <t...@sss.pgh.pa.us> writes:
> Robert Haas <robertmh...@gmail.com> writes:
>> OK, here's an update of Greg's patch with the runtime configure test
>> ripped out, some minor documentation tweaks, and a few unnecessary
>> whitespace diff hunks quashed.  I think this is about ready for
>> committer review.
>
> I've started to look through this, and the only part I seriously don't
> like is the nbtsearch.c changes.  I've got three complaints about that:
>
> * Doing it inside the index AMs is wrong, or at the very least forces us
> to do it over for each index AM (which the patch fails to do).

OK.

> * As coded, it generates prefetch bursts that are much too large and too
> widely spaced to be effective, not to mention that they entirely ignore
> the effective_io_concurrency control knob as well as the order in which
> the pages will actually be needed.  I wonder now whether Robert's
> inability to see any benefit came because he was testing indexscans and
> not bitmap scans.

Well, experiments showed that it was very effective, even more so than for
bitmap scans -- so much so that it nearly obliterated bitmap scans'
advantage over index scans.  I originally thought like you that all that
logic was integral to the thing, but eventually came around to think the
opposite.  That logic is there to overcome a fundamental problem with
bitmap scans: there's no logical group to prefetch, just a potentially
unbounded stream of pages.  Index scans just don't have that problem, so
they don't need that extra logic.

> * It's only accidental that it's not kicking in during a bitmap indexscan
> and bollixing up the much-more-carefully-written nodeBitmapHeapscan
> prefetch logic.

OK.

> What I intend to do over the next day or so is commit the prefetch
> infrastructure and the bitmap scan prefetch logic, but I'm bouncing the
> indexscan part back for rework.  I think that it should be implemented in
> or near index_getnext() and pay attention to effective_io_concurrency.

To do that I would have a few questions.
a) It seems necessary to keep a dynahash of previously prefetched pointers
around to avoid repeatedly prefetching the same pages.  That could get
quite large, though.  Or do you think it would be fine to visit the buffer
cache, essentially using that as the hash, for every index pointer?

b) How would index_getnext keep two read pointers, like bitmap heap scans
do?  I think this would require adding a new index AM API, similar to your
tuplestore API, where the caller can maintain multiple read pointers in
the scan.  And I'm not sure how that would work with mark/restore.

c) I would be afraid that pushing the index scan to reach for the next
index leaf page prematurely (and not just an async prefetch, but a
blocking read) would cause extra random I/O, which would slow down the
critical path of reading the current index tuples.  So I think we would
still want to pause when we hit the end of the current leaf page.  That
would require some form of feedback in the index AM API as well.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
Re: [HACKERS] posix_fadvise v22
Robert Haas <robertmh...@gmail.com> writes:
> FWIW, following your first set of commits, I extracted, cleaned up, and
> re-posted the uncommitted portion of the patch last night.  Based on this
> it doesn't sound like there's much point in continuing to do that, but
> I'm happy to do so if anyone thinks otherwise.

Probably not; I'm busy hacking on the code anyway, fixing minor issues
like its failure to do anything sane with temp tables.

A question about that part of the code: I see that PrefetchBuffer simply
does nothing if it finds the target page already in a buffer.  I wonder
whether that's sufficient, or whether we ought to do something to try to
ensure that the buffer doesn't get evicted in the near future (ie, before
we actually get to use it).  Simply advancing the usage_count seems too
strong, because when you add in the upcoming ReadBuffer, we'd be advancing
usage_count twice for a single use of the page whenever it happened to get
prefetched.  We could alleviate that by only advancing usage_count if it's
0, but that seems like a band-aid, not a real fix.  I also thought about
setting a buffer flag bit showing that prefetch has occurred, but then you
have to worry about when/how to clear it if the actual read never occurs.
Thoughts?

			regards, tom lane
Re: [HACKERS] posix_fadvise v22
Robert Haas <robertmh...@gmail.com> writes:
> OK, here's an update of Greg's patch with the runtime configure test
> ripped out, some minor documentation tweaks, and a few unnecessary
> whitespace diff hunks quashed.  I think this is about ready for
> committer review.

I've applied most of this, with a couple of notable revisions:

1. The runtime check on whether posix_fadvise works is really gone.  We'll
see whether we need to put anything back, but I think our first try should
be under the assumption that it works.  (BTW, I was wrong in my earlier
claim that the buildfarm wouldn't exercise posix_fadvise by default -- the
patch defaults to io_concurrency = 1 if fadvise is available, so it will
try to prefetch one page ahead.)

2. I fixed it so that setting effective_io_concurrency to zero disables
prefetching altogether; there was no way to do that in the patch as
submitted.

3. I did not apply the changes in nbtsearch.c, since as I mentioned
earlier I wasn't convinced it's a reasonable approach.  We can argue about
that some more, but even if 8.4 ships with concurrency only for bitmap
scans, it's still useful.

The two main loose ends that could stand further work are the
plain-index-scan case and the question of whether PrefetchBuffer should
try to protect an already-existing buffer from being evicted.

			regards, tom lane
Re: [HACKERS] posix_fadvise v22
Tom Lane <t...@sss.pgh.pa.us> writes:
> Robert Haas <robertmh...@gmail.com> writes:
>> OK, here's an update of Greg's patch with the runtime configure test
>> ripped out, some minor documentation tweaks, and a few unnecessary
>> whitespace diff hunks quashed.  I think this is about ready for
>> committer review.
>
> I've applied most of this, with a couple of notable revisions:
>
> 1. The runtime check on whether posix_fadvise works is really gone.
> We'll see whether we need to put anything back, but I think our first try
> should be under the assumption that it works.  (BTW, I was wrong in my
> earlier claim that the buildfarm wouldn't exercise posix_fadvise by
> default -- the patch defaults to io_concurrency = 1 if fadvise is
> available, so it will try to prefetch one page ahead.)
>
> 2. I fixed it so that setting effective_io_concurrency to zero disables
> prefetching altogether; there was no way to do that in the patch as
> submitted.

Hm.  The original intent was that effective_io_concurrency = 1 meant no
prefetching, because there was only one drive.  I wonder if that worked
earlier and got lost along the way, or if I always had this wrong.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
Re: [SPAM] Re: [HACKERS] posix_fadvise v22
Peter Eisentraut <pete...@gmx.net> writes:
> The way I read this is that this was a temporary kernel/libc mismatch in
> a development version of Debian 3 years ago that was fixed within 2
> months of being reported and was never released to the general public.
> So it would be on the same level as any of a million temporary breakages
> in Linux distributions under development.

This is incorrect, as the problem was in fact present on Red Hat and
presumably all other distros as well.

> Unless there are other reports of this problem, I wouldn't bother testing
> or working around this at all.  If people are running PostgreSQL 8.4+ on
> Debian unstable June 2005 with kernel 2.4, they cannot be helped.

While I would like to agree with that position, I can't help noticing
lines 2438-2461 of xlog.c, which represent the still-smoking wreckage of
our last attempt to do something with posix_fadvise.  It's not that old
either --- we gave up on it less than three years ago:
http://archives.postgresql.org/pgsql-hackers/2006-06/msg01481.php

I think at a minimum there should be a manual configuration switch (ie,
something in pg_config_manual.h) to allow the builder to disable use of
posix_fadvise, even if configure thinks it's there.  Depending on
buildfarm results we may have to do more than that.

BTW, I intend to un-disable the xlog change when committing the fadvise
patch.  In for a penny, in for a pound ...

			regards, tom lane
Re: [SPAM] Re: [HACKERS] posix_fadvise v22
Tom Lane wrote:
> Peter Eisentraut <pete...@gmx.net> writes:
>> The way I read this is that this was a temporary kernel/libc mismatch in
>> a development version of Debian 3 years ago that was fixed within 2
>> months of being reported and was never released to the general public.
>> So it would be on the same level as any of a million temporary breakages
>> in Linux distributions under development.
>
> This is incorrect, as the problem was in fact present on Red Hat and
> presumably all other distros as well.
>
>> Unless there are other reports of this problem, I wouldn't bother
>> testing or working around this at all.  If people are running PostgreSQL
>> 8.4+ on Debian unstable June 2005 with kernel 2.4, they cannot be
>> helped.
>
> While I would like to agree with that position, I can't help noticing
> lines 2438-2461 of xlog.c, which represent the still-smoking wreckage of
> our last attempt to do something with posix_fadvise.  It's not that old
> either --- we gave up on it less than three years ago:
> http://archives.postgresql.org/pgsql-hackers/2006-06/msg01481.php
>
> I think at a minimum there should be a manual configuration switch (ie
> something in pg_config_manual.h) to allow the builder to disable use of
> posix_fadvise, even if configure thinks it's there.  Depending on
> buildfarm results we may have to do more than that.
>
> BTW, I intend to un-disable the xlog change when committing the fadvise
> patch.  In for a penny, in for a pound ...

I assumed that if effective_io_concurrency < 2, no posix_fadvise() calls
would be made.

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [SPAM] Re: [HACKERS] posix_fadvise v22
> I think at a minimum there should be a manual configuration switch (ie
> something in pg_config_manual.h) to allow the builder to disable use of
> posix_fadvise, even if configure thinks it's there.  Depending on
> buildfarm results we may have to do more than that.

This seems pretty pointless, since there is already a GUC to control the
behavior, and the default is off.  The only value here would be if you
could demonstrate that even with effective_io_concurrency set to 1 there
is a significant performance dropoff.  But that would probably be an
argument for rejecting the patch outright, not adding a compile-time
switch.

...Robert
Re: [SPAM] Re: [HACKERS] posix_fadvise v22
Gregory Stark wrote:
> Peter Eisentraut <pete...@gmx.net> writes:
>> On Friday 02 January 2009 06:49:57 Greg Stark wrote:
>>> The guc run-time check is checking for known-buggy versions of glibc,
>>> using sysconf to check what version of glibc you have.
>>
>> Could you please mention the bug number in the relevant source code
>> comments?
>
> It's Debian bug #312406, which was fixed in Debian release 2.3.5-3.  So
> it's probably one of these, but searching for posix_fadvise doesn't find
> anything in their bug tracker:

The way I read this is that this was a temporary kernel/libc mismatch in a
development version of Debian 3 years ago that was fixed within 2 months
of being reported and was never released to the general public.  So it
would be on the same level as any of a million temporary breakages in
Linux distributions under development.

Unless there are other reports of this problem, I wouldn't bother testing
or working around this at all.  If people are running PostgreSQL 8.4+ on
Debian unstable June 2005 with kernel 2.4, they cannot be helped.
Re: [HACKERS] posix_fadvise v22
On Friday 02 January 2009 06:49:57 Greg Stark wrote:
> The guc run-time check is checking for known-buggy versions of glibc,
> using sysconf to check what version of glibc you have.

Could you please mention the bug number in the relevant source code
comments?
Re: [SPAM] Re: [HACKERS] posix_fadvise v22
Peter Eisentraut <pete...@gmx.net> writes:
> On Friday 02 January 2009 06:49:57 Greg Stark wrote:
>> The guc run-time check is checking for known-buggy versions of glibc,
>> using sysconf to check what version of glibc you have.
>
> Could you please mention the bug number in the relevant source code
> comments?

It's Debian bug #312406, which was fixed in Debian release 2.3.5-3.  So
it's probably one of these, but searching for posix_fadvise doesn't find
anything in their bug tracker:

    Version 2.3.6

    * The following bugs are resolved with this release:

      38, 253, 549, 622, 653, 721, 758, 851, 877, 915, 934, 955, 961,
      1016, 1037, 1076, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086,
      1087, 1088, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098,
      1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109,
      1110, 1111, 1112, 1113, 1125, 1137, 1138, 1249, 1250, 1251, 1252,
      1253, 1254, 1350, 1358, 1394, 1438, 1498, 1534

    Visit http://sources.redhat.com/bugzilla/ for the details of each bug.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
Re: [HACKERS] posix_fadvise v22
On Thu, 1 Jan 2009, Robert Haas wrote:
> The only thing I haven't been able to do is demonstrate that this change
> actually produces a performance improvement.  Either I'm testing the
> wrong thing, or it just doesn't provide any benefit on a single-spindle
> system.

When I did a round of testing on the earlier prefetch test program Greg
Stark put together, one of my single-spindle Linux systems didn't show any
real benefit.  So as long as you didn't see performance degrade, your not
seeing any improvement isn't bad news.

I've got a stack of hardware I can do performance testing of this patch
on; what I haven't been able to find time for is setting up any sort of
test harness right now.  If you or Greg have any benchmark or test program
you could suggest that should show off the improvements here, I'd be glad
to run it on a bunch of systems and report back -- I've already got a
stack of candidate ones I ran the earlier tests on to compare results
against.

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD
Re: [HACKERS] posix_fadvise v22
Greg Smith <gsm...@gregsmith.com> writes:
> On Thu, 1 Jan 2009, Robert Haas wrote:
>> The only thing I haven't been able to do is demonstrate that this change
>> actually produces a performance improvement.  Either I'm testing the
>> wrong thing, or it just doesn't provide any benefit on a single-spindle
>> system.
>
> When I did a round of testing on the earlier prefetch test program Greg
> Stark put together, one of my single-spindle Linux systems didn't show
> any real benefit.  So as long as you didn't see performance degrade, your
> not seeing any improvement isn't bad news.

ISTM that you *should* be able to see an improvement on even
single-spindle systems, due to better overlapping of CPU and I/O effort.
If the test case is either 100% CPU-bound or 100% I/O-bound then no, but
for anything in between there ought to be improvement.

			regards, tom lane
Re: [HACKERS] posix_fadvise v22
On Fri, 2 Jan 2009, Tom Lane wrote:
> ISTM that you *should* be able to see an improvement on even
> single-spindle systems, due to better overlapping of CPU and I/O effort.

The earlier synthetic tests I did:
http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php
showed a substantial speedup even in the single-spindle case on a couple
of systems, but one didn't really seem to benefit.  So we could theorize
that Robert's test system is more like that one.  If someone can help out
with making a more formal test case showing this in action, I'll dig into
the details of what's different between that system and the others.
Re: [HACKERS] posix_fadvise v22
Greg Smith <gsm...@gregsmith.com> writes:
> On Fri, 2 Jan 2009, Tom Lane wrote:
>> ISTM that you *should* be able to see an improvement on even
>> single-spindle systems, due to better overlapping of CPU and I/O effort.
>
> The earlier synthetic tests I did:
> http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php
> showed a substantial speedup even in the single-spindle case on a couple
> of systems, but one didn't really seem to benefit.  So we could theorize
> that Robert's test system is more like that one.  If someone can help out
> with making a more formal test case showing this in action, I'll dig into
> the details of what's different between that system and the others.

Well, I claim that if you start with a query that's about 50% CPU and 50%
I/O effort, you ought to be able to get something approaching a 2X speedup
if this patch really works.  Consider something like

    create function waste_time(int) returns int as $$
    begin
      for i in 1 .. $1 loop
        null;
      end loop;
      return 1;
    end
    $$ language plpgsql;

    select count(waste_time(42)) from very_large_table;

In principle you should be able to adjust the constant so that vmstat
shows about 50% CPU busy, and then enabling fadvise should improve matters
significantly.

Now the above proposed test case is too simple, because it will generate a
seqscan, and if the kernel is not completely brain-dead it will not need
any fadvise hinting to do read-ahead.  But you should be able to adapt the
idea for whatever indexscan-based test case you are really using.

Note: on a multi-CPU system you need to take vmstat or top numbers with a
grain of salt, since they might report one CPU that is 50% busy as the
system being only 50/N % busy.

			regards, tom lane
Re: [HACKERS] posix_fadvise v22
Tom Lane <t...@sss.pgh.pa.us> writes:
> In principle you should be able to adjust the constant so that vmstat
> shows about 50% CPU busy, and then enabling fadvise should improve
> matters significantly.

I think in practice individual queries don't interleave much CPU with I/O
work.  A single random page fetch is 5ms, which is an awful lot of CPU
cycles to be sinking somewhere.  In practice I think this is going to be
single-digit percentages.

Aside from big sorts and such, which tend not to be interleaved with I/O,
the main time I've seen queries have a significant CPU load is when the
data is mostly in cache -- in which case prefetching would be hard pressed
to help at all.

We could construct a synthetic case, but the main point of this feature is
to make use of RAID arrays that are currently going idle, not to pick up a
few percentage points for single-spindle systems.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
Re: [HACKERS] posix_fadvise v22
Greg Smith wrote:
> On Fri, 2 Jan 2009, Tom Lane wrote:
>> ISTM that you *should* be able to see an improvement on even
>> single-spindle systems, due to better overlapping of CPU and I/O effort.
>
> The earlier synthetic tests I did:
> http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php
> showed a substantial speedup even in the single-spindle case on a couple
> of systems, but one didn't really seem to benefit.  So we could theorize
> that Robert's test system is more like that one.  If someone can help out
> with making a more formal test case showing this in action, I'll dig into
> the details of what's different between that system and the others.

I think for an I/O-bound workload on a single-drive system you would need
a drive that does some kind of tagged queuing (reordering/grouping of
requests) to see a benefit from the patch.
Re: [HACKERS] posix_fadvise v22
> I've got a stack of hardware I can do performance testing of this patch
> on; what I haven't been able to find time for is setting up any sort of
> test harness right now.  If you or Greg have any benchmark or test
> program you could suggest that should show off the improvements here, I'd
> be glad to run it on a bunch of systems and report back -- I've already
> got a stack of candidate ones I ran the earlier tests on to compare
> results against.

Unfortunately I don't have anything useful.  I took the skewed TPC-H data
that Ramon Lawrence posted and created a table based on lineitem using
something along the lines of:

    SELECT *, g FROM lineitem, generate_series(1,8) AS g;

Then I made an index on one of the columns and ran some queries that ended
up being planned as bitmap index scans.  The disk seemed to be doing its
thing, but it didn't do its thing any faster when I changed
effective_io_concurrency to 4.  (In fact there might have been a small
slowdown, but I didn't take the time to measure carefully, so I can't
refute the null hypothesis.)

...Robert
Re: [HACKERS] posix_fadvise v22
On Fri, Jan 2, 2009 at 5:36 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> I've got a stack of hardware I can do performance testing of this patch
>> on; what I haven't been able to find time for is setting up any sort of
>> test harness right now.  If you or Greg have any benchmark or test
>> program you could suggest that should show off the improvements here,
>> I'd be glad to run it on a bunch of systems and report back -- I've
>> already got a stack of candidate ones I ran the earlier tests on to
>> compare results against.
>
> Then I made an index on one of the columns and ran some queries that
> ended up being planned as bitmap index scans.

Hm, what were those plans?  You might want to put the old code back in
explain.c to print the prefetching target to see how well it's doing.

The queries I ran to test it tended to look like this kind of thing, where
r was a column filled with random values.  If it's clustered there will be
hardly any benefit, as even random I/O would be prefetched by the OS.

    select * from h
     where r = any (array(select (1+random()*200)::integer
                            from generate_series(1,1000)));

-- 
greg
Re: [HACKERS] posix_fadvise v22
Gregory Stark <st...@enterprisedb.com> writes:
> Tom Lane <t...@sss.pgh.pa.us> writes:
>> In principle you should be able to adjust the constant so that vmstat
>> shows about 50% CPU busy, and then enabling fadvise should improve
>> matters significantly.
>
> I think in practice individual queries don't interleave much CPU with I/O
> work.  A single random page fetch is 5ms, which is an awful lot of CPU
> cycles to be sinking somewhere.  In practice I think this is going to be
> single-digit percentages.

The point of the suggestion is to prove that the patch works as
advertised.  How wide the sweet spot is for this test isn't nearly as
interesting as proving that there *is* a sweet spot.  If you can't find
one, it suggests that either the patch or the local posix_fadvise doesn't
work.

			regards, tom lane
Re: [HACKERS] posix_fadvise v22
Hm, what were those plans? You might want to put the old code back in explain.c to print the prefetching target to see how well it's doing.

Well, bad news. Here's one where prefetching seems to make it WORSE.

rhaas=# explain select sum(1) from enormous where l_shipdate in ('1992-01-01', '1993-01-01', '1994-01-01', '1995-01-01', '1996-01-01', '1997-01-01', '1998-01-01', '1999-01-01', '2000-01-01', '2001-01-01');
                                 QUERY PLAN
--------------------------------------------------------------------------
 Aggregate  (cost=455072.75..455072.76 rows=1 width=0)
   ->  Bitmap Heap Scan on enormous  (cost=3327.59..454634.09 rows=175464 width=0)
         Recheck Cond: (l_shipdate = ANY ('{1992-01-01,1993-01-01,1994-01-01,1995-01-01,1996-01-01,1997-01-01,1998-01-01,1999-01-01,2000-01-01,2001-01-01}'::date[]))
         ->  Bitmap Index Scan on enormous_l_shipdate  (cost=0.00..3283.72 rows=175464 width=0)
               Index Cond: (l_shipdate = ANY ('{1992-01-01,1993-01-01,1994-01-01,1995-01-01,1996-01-01,1997-01-01,1998-01-01,1999-01-01,2000-01-01,2001-01-01}'::date[]))
(5 rows)

With effective_io_concurrency set to 1, this took 32 s. With effective_io_concurrency set to 4, it took 50 s. The table was created like this:

create table enormous as select l.*, l_instance from lineitem l, generate_series(1, 8) l_instance;
create index enormous_l_shipdate on enormous (l_shipdate);
vacuum analyze enormous;

...where lineitem is from the skewed TPC-H data for the histojoin patch.

...Robert
Re: [HACKERS] posix_fadvise v22
On Fri, Jan 2, 2009 at 8:42 PM, Robert Haas robertmh...@gmail.com wrote: Hm, what were those plans? You might want to put the old code back in explain.c to print the prefetching target to see how well it's doing. Well, bad news. Here's one where prefetching seems to make it WORSE. rhaas=# explain select sum(1) from enormous where l_shipdate in ('1992-01-01', '1993-01-01', '1994-01-01', '1995-01-01', '1996-01-01', '1997-01-01', '1998-01-01', '1999-01-01', '2000-01-01', '2001-01-01'); QUERY PLAN -- Aggregate (cost=455072.75..455072.76 rows=1 width=0) -> Bitmap Heap Scan on enormous (cost=3327.59..454634.09 rows=175464 width=0) Any chance you could put back the code in explain.c which showed whether posix_fadvise is actually getting used? Another thing I did when testing was attaching with strace to see if posix_fadvise (the syscall on linux was actually fadvise64 iirc) is actually getting called. And is this on a system with multiple spindles? How many? And how much of the data is in shared buffers or in filesystem cache? Is this consistent for repeated queries? Is it only when you're repeating a query for dates that you've already selected? And how fast is it with effective_io_concurrency set to 1,2,3,5,6,7,8,...? Do you see the same effect if you use a self-contained test case instead of the TPC-H data so I can try it? -- greg
Re: [HACKERS] posix_fadvise v22
Any chance you could put back the code in explain.c which showed whether posix_fadvise is actually getting used? Another thing I did when testing was attaching with strace to see if posix_fadvise (the syscall on linux was actually fadvise64 iirc) is actually getting called.

I tried changing this:

    returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
                               POSIX_FADV_WILLNEED);

to this:

    returnCode = 0;

When I did that, it went back from 50 s to 33 s, which I think means that posix_fadvise is getting called and that that is what is making it slower.

And is this on a system with multiple spindles? How many?

Latitude D830 laptop. Single disk. Fedora 9. kernel-2.6.27.9-73.fc9.x86_64.

And how much of the data is in shared buffers or in filesystem cache? Is this consistent for repeated queries? Is it only when you're repeating a query for dates that you've already selected?

I stopped the cluster, dropped the page cache, and restarted the cluster just before testing. Repeated tests are fast due to caching effects. shared_buffers is 240MB. System has 2GB RAM, steady state is about 1GB of page cache.

And how fast is it with effective_io_concurrency set to 1,2,3,5,6,7,8,...?

I do not currently have this information. :-) I will try to run some more tests over the weekend, but I'm too tired now and am starting to make mistakes.

Do you see the same effect if you use a self-contained test case instead of the TPC-H data so I can try it?

Not sure exactly what you have in mind here. If you send a script or something to reproduce I will try it.

...Robert
Re: [HACKERS] posix_fadvise v22
Tom Lane t...@sss.pgh.pa.us writes: The point of the suggestion is to prove that the patch works as advertised. How wide the sweet spot is for this test isn't nearly as interesting as proving that there *is* a sweet spot. If you can't find one it suggests that either the patch or the local posix_fadvise doesn't work. I posted tons of reproducible test cases with graphs of results for various raid stripe widths a while back. There was a very slight benefit on a single spindle at some prefetch depths but it wasn't very consistent and it varied heavily depending on the prefetch depth. I don't know what to make of this test. I don't know how to reproduce the same data distribution, I have no idea what raid configuration it's been run on, what version of what OS it's on, etc. It's quite possible posix_fadvise isn't working on it, I don't know. It's also possible the overhead of the extra buffer lookups and syscalls outweighs any benefit of overlapping i/o and cpu on a single spindle. Trying to contrive a situation where a single spindle sees a significant benefit is going to be very tricky. Avoiding caching effects and the confounding effect of any overhead will make it hard to see a consistent benefit without a raid array. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Re: [HACKERS] posix_fadvise v22
On Fri, Jan 2, 2009 at 11:13 PM, Robert Haas robertmh...@gmail.com wrote: When I did that, it went back from 50 s to 33 s, which I think means that posix_fadvise is getting called and that that is what is making it slower. And is this on a system with multiple spindles? How many? Latitude D830 laptop. Single disk. Fedora 9. kernel-2.6.27.9-73.fc9.x86_64. And how much of the data is in shared buffers or in filesystem cache? Is this consistent for repeated queries? Is it only when you're repeating a query for dates that you've already selected? I stopped the cluster, dropped the page cache, and restarted the cluster just before testing. Repeated tests are fast due to caching effects. shared_buffers is 240MB. System has 2GB RAM, steady state is about 1GB of page cache. Ahhh. So this is a test of how much impact the extra syscalls and buffer lookups have on a system where they're not really helping. I'm still surprised, a 50% performance penalty is a lot worse than I would have expected, especially when the buffers aren't in cache. I did one quick test and saw about 10% performance penalty in that test. -- greg
Re: [HACKERS] posix_fadvise v22
I tried this on my laptop running FC9, and because I forgot to run autoconf, I got this error message when I tried to turn on posix_fadvise.

rhaas=# set effective_io_concurrency to 3;
ERROR: could not determine if this system has a working posix_fadvise
DETAIL: Check configure.log produced by configure for more information

Am I correct in thinking that the only thing we're really checking for here is whether a trivial posix_fadvise() call returns success? If so, is this test really worth doing? It seems to me that since users can always switch off the feature by leaving effective_io_concurrency set to the default value of 1, there is not much value in having a configure test that forcibly disables it. If the user has a broken posix_fadvise() and later fixes it, they shouldn't have to recompile PostgreSQL to use this feature, especially in this day when the build system and the run system are often different. A user who somehow ends up with RPMs that generate this error message will be utterly at a loss as to what to do about it. One minor nit: If we're going to keep this test, we should change the detail string to say config.log rather than configure.log, as that is the actual file name. ...Robert

On Thu, Dec 11, 2008 at 4:35 PM, Gregory Stark st...@enterprisedb.com wrote: Here's the update I also skimmed through and cleaned a couple other things. There's *still* a function prototype which I don't see what header file to put it in, that's the one in port/posix_fadvise.c which contains one function with one caller, guc.c. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Re: [HACKERS] posix_fadvise v22
Robert Haas robertmh...@gmail.com writes: Am I correct in thinking that the only thing we're really checking for here is whether a trivial posix_fadvise() call returns success? If so, is this test really worth doing? Runtime tests performed during configure are generally a bad idea to start with --- it's impossible to do any such thing in a cross-compilation scenario, for example. regards, tom lane
Re: [HACKERS] posix_fadvise v22
On Thu, Jan 1, 2009 at 3:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Am I correct in thinking that the only thing we're really checking for here is whether a trivial posix_fadvise() call returns success? If so, is this test really worth doing? Runtime tests performed during configure are generally a bad idea to start with --- it's impossible to do any such thing in a cross-compilation scenario, for example. OK, here's an update of Greg's patch with the runtime configure test ripped out, some minor documentation tweaks, and a few unnecessary whitespace diff hunks quashed. I think this is about ready for committer review. The only thing I haven't been able to do is demonstrate that this change actually produces a performance improvement. Either I'm testing the wrong thing, or it just doesn't provide any benefit on a single-spindle system. However, I believe that Greg has previously posted some fairly impressive performance results, so I'm not sure that my shortcomings in this area should be a bar to having a committer pick this one up. If more testing is needed, it would at least be helpful to have a committer specify what areas they are concerned about. ...Robert

posix_fadvise_v23_rh1.diff.gz
Re: [HACKERS] posix_fadvise v22
In theory there should be no benefit on a single spindle system. There could be a slight benefit due to reordering of I/O, but only on a raid array would you see a significant speedup -- which should be about equal to the number of spindles. What would be interesting is whether you see a noticeable speed *decrease* from having prefetching enabled when it isn't helping, either due to having everything fit in shared buffers or everything fit in the filesystem cache (the latter should be more of a hit). Even if there is, it doesn't really worry me. By default the feature is disabled and you should only really turn it on if you do have a raid array and want an individual query to make use of it. Now that there's an actual run-time sysconf check for the buggy glibc called by the guc function we arguably don't need the autoconf check_run check anymore anyways. -- Greg On 1 Jan 2009, at 21:43, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 1, 2009 at 3:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Am I correct in thinking that the only thing we're really checking for here is whether a trivial posix_fadvise() call returns success? If so, is this test really worth doing? Runtime tests performed during configure are generally a bad idea to start with --- it's impossible to do any such thing in a cross-compilation scenario, for example. OK, here's an update of Greg's patch with the runtime configure test ripped out, some minor documentation tweaks, and a few unnecessary whitespace diff hunks quashed. I think this is about ready for committer review. The only thing I haven't been able to do is demonstrate that this change actually produces a performance improvement. Either I'm testing the wrong thing, or it just doesn't provide any benefit on a single-spindle system.
However, I believe that Greg has previously posted some fairly impressive performance results, so I'm not sure that my shortcomings in this area should be a bar to having a committer pick this one up. If more testing is needed, it would at least be helpful to have a committer specify what areas they are concerned about. ...Robert posix_fadvise_v23_rh1.diff.gz
Re: [HACKERS] posix_fadvise v22
Now that there's an actual run-time sysconf check for the buggy glibc called by the guc function we arguably don't need the autoconf check_run check anymore anyways. Isn't that the check I just removed for you, or are you talking about some other check that can also be removed? ...Robert
Re: [HACKERS] posix_fadvise v22
Sorry for top-posting -- phone mail client sucks. I thought the autoconf ac_run_check was the test that people were questioning. That calls posix_fadvise to see if it crashes at configure time. The guc run-time check is checking for known-buggy versions of glibc using sysconf to check what version of glibc you have. -- Greg On 1 Jan 2009, at 23:11, Robert Haas robertmh...@gmail.com wrote: Now that there's an actual run-time sysconf check for the buggy glibc called by the guc function we arguably don't need the autoconf check_run check anymore anyways. Isn't that the check I just removed for you, or are you talking about some other check that can also be removed? ...Robert
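For what it's worth, the two checks have very different shapes. A run-time glibc-version gate presumably looks something like the sketch below; the confstr() call and the cutoff handling are my guesses at the mechanism, not code from the patch, and the cutoff numbers are placeholders rather than the patch's actual known-buggy boundary:

```c
#include <stdio.h>
#include <unistd.h>

/* Parse a version string of the form "glibc 2.9" (the format returned by
 * confstr(_CS_GNU_LIBC_VERSION) on GNU systems) and report whether it is
 * at least the given cutoff. */
int glibc_at_least(const char *vers, int want_major, int want_minor)
{
    int major = 0, minor = 0;
    if (sscanf(vers, "glibc %d.%d", &major, &minor) != 2)
        return 0;               /* unparseable: assume not usable */
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

/* At run time the live C library's version would be queried, e.g.: */
int running_glibc_at_least(int want_major, int want_minor)
{
#ifdef _CS_GNU_LIBC_VERSION
    char buf[64];
    if (confstr(_CS_GNU_LIBC_VERSION, buf, sizeof(buf)) > 0)
        return glibc_at_least(buf, want_major, want_minor);
#endif
    return 0;
}
```

The configure-time ac_run_check, by contrast, actually executes posix_fadvise() on the build machine, which says nothing about the machine the binaries eventually run on.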
Re: [HACKERS] posix_fadvise v22
On Thu, Jan 1, 2009 at 11:49 PM, Greg Stark greg.st...@enterprisedb.com wrote: Sorry for top-posting -- phone mail client sucks. I thought the autoconf ac_run_check was the test that people were questioning. That calls posix_fadvise to see if it crashes at configure time. Yes, that's what I removed. The guc run-time check is checking for known-buggy versions of glibc using sysconf to check what version of glibc you have. Right - that check is still in my updated patch. I think the confusion may stem from the fact that Tom and I used the word runtime to refer to the ac_run_check thing, because it is checking something about the runtime environment (namely, whether posix_fadvise works or not) at configure-time. In any event, it seems as though we are all on the same page. ...Robert
Re: [HACKERS] posix_fadvise v22
I'll send another patch with at least 1 and 3 fixed and hunt around again for a header file to put this guc into. On 10 Dec 2008, at 04:22, ITAGAKI Takahiro [EMAIL PROTECTED] wrote: Hello, Gregory Stark [EMAIL PROTECTED] wrote: Here's an update to eliminate two small bitrot conflicts. I read your patch with interest, but found some trivial bad manners. * LET_OS_MANAGE_FILESIZE is already obsoleted. You don't have to cope with the option. Huh, I didn't realize that. I guess the idea is that users just configure a very large segment size to get the old behaviour. * Type mismatch in prefetch_pages A variable prefetch_pages is defined as unsigned or int in some places. Why don't you define it only once in a header and include the header in source files? Just... Which header? * Assignment to prefetch_pages What does +0.99 mean here? [assign_io_concurrency()] +prefetch_pages = new_prefetch_pages+0.99; You want to do as follows, right? +prefetch_pages = (int) ceil(new_prefetch_pages); Sure
Re: [HACKERS] posix_fadvise v22
Greg Stark [EMAIL PROTECTED] writes: A variable prefetch_pages is defined as unsigned or int in some places. Why don't you define it only once in a header and include the header in source files? Just... Which header? MHO: the header that goes with the source file that is most concerned with implementing the variable's behavior (which is also the file that should have the actual variable definition). regards, tom lane
Re: [HACKERS] posix_fadvise v22
On Thu, Dec 11, 2008 at 4:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Stark greg.st...@enterprisedb.com writes: A variable prefetch_pages is defined as unsigned or int in some places. Why don't you define it only once in a header and include the header in source files? Just... Which header? MHO: the header that goes with the source file that is most concerned with implementing the variable's behavior (which is also the file that should have the actual variable definition). Well the trick here is that the variable actually affects how many PrefetchBuffer() calls *callers* should make. The callers are various places which are doing lots of ReadBuffer calls and know what buffers they'll need in the future. The main places are in nodeBitmapHeapScan.c and nbtsearch.c. Neither of those are remotely relevant. I think I'm settling on the idea that it should be in the same place as the PrefetchBuffer() prototype, since anyone who needs prefetch_pages will need that as well (except for guc.c). So I'll put it in bufmgr.h for now. -- greg
Re: [HACKERS] posix_fadvise v22
Here's the update. I also skimmed through and cleaned up a couple of other things. There's *still* one function prototype for which I don't see what header file to put it in: the one in port/posix_fadvise.c, which contains one function with one caller, guc.c. posix_fadvise_v23.diff.gz -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Re: [HACKERS] posix_fadvise v22
Hello,

Gregory Stark [EMAIL PROTECTED] wrote: Here's an update to eliminate two small bitrot conflicts.

I read your patch with interest, but found some trivial bad manners.

* LET_OS_MANAGE_FILESIZE is already obsoleted. You don't have to cope with the option.

* Type mismatch in prefetch_pages
  A variable prefetch_pages is defined as unsigned or int in some places. Why don't you define it only once in a header and include the header in source files?

* Assignment to prefetch_pages
  What does +0.99 mean here? [assign_io_concurrency()]
      + prefetch_pages = new_prefetch_pages+0.99;
  You want to do as follows, right?
      + prefetch_pages = (int) ceil(new_prefetch_pages);

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
[HACKERS] posix_fadvise v22
Here's an update to eliminate two small bitrot conflicts. posix_fadvise_v22.diff.gz -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!