Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Curt Sampson wrote: > On Mon, 24 Jun 2002, Tom Lane wrote: > > > There are a lot of other things we desperately need to spend time > > on that would not amount to re-engineering large quantities of OS-level > > code. Given that most Unixen have perfectly respectable disk management > > subsystems, we prefer to tune our code to make use of that stuff, rather > > than follow the "conventional wisdom" that databases need to bypass it. > > ... > > Oracle can afford to do that sort of thing because they have umpteen > > thousand developers available. Postgres does not. > > Well, Oracle also started out, a long long time ago, on systems without > unified buffer cache and so on, and so they *had* to write this stuff > because otherwise data would not be cached. So Oracle can also afford to > maintain it now because the code already exists. Well, actually, it isn't unified buffer cache that is the issue, but rather the older SysV file system had pretty poor performance so bypassing it was a bigger win than it is today. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Mon, 24 Jun 2002, Tom Lane wrote: > There are a lot of other things we desperately need to spend time > on that would not amount to re-engineering large quantities of OS-level > code. Given that most Unixen have perfectly respectable disk management > subsystems, we prefer to tune our code to make use of that stuff, rather > than follow the "conventional wisdom" that databases need to bypass it. > ... > Oracle can afford to do that sort of thing because they have umpteen > thousand developers available. Postgres does not. Well, Oracle also started out, a long long time ago, on systems without unified buffer cache and so on, and so they *had* to write this stuff because otherwise data would not be cached. So Oracle can also afford to maintain it now because the code already exists. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On 24 Jun 2002, J. R. Nield wrote: > All I'm saying is that the entire postgresql block write must be > converted into exactly one SCSI write command in all cases, and I don't > know a portable way to ensure this. No, there's no portable way. All you can do is give the admin who is able to set things up safely the ability to turn off the now-unneeded (and expensive) safety-related stuff that postgres does. > I agree with this. My point was only that you need to know what > guarantees your operating system/hardware combination provides on a > case-by-case basis, and there is no standard way for a program to > discover this. Most system administrators are not going to know this > either, unless databases are their main responsibility. Certainly this is true of pretty much every database system out there. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Mon, 2002-06-24 at 17:16, Tom Lane wrote: > I think you have been missing the point... Yes, this appears to be the case. Thanks especially to Curt for clearing things up for me. -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Tom Lane wrote: > Bruce Momjian <[EMAIL PROTECTED]> writes: > >> Does anyone know what the major barriers to infinite log replay are in > >> PostgreSQL? I'm trying to look for everything that might need to be > >> changed outside xlog.c, but surely this has come up before. Searching > >> the archives hasn't revealed much. > > > This has been brought up. Could we just save WAL files and get replay? > > I believe some things have to be added to WAL to allow this, but it > > seems possible. > > The Red Hat group has been looking at this somewhat; so far there seem > to be some minor tweaks that would be needed, but no showstoppers. Good. > > Somehow you would need a tar-type > > backup of the database, and with a running db, it is hard to get a valid > > snapshot of that. > > But you don't *need* a "valid snapshot", only a correct copy of > every block older than the first checkpoint in your WAL log series. > Any inconsistencies in your tar dump will look like repairable damage; > replaying the WAL log will fix 'em. Yes, my point was that you need physical file backups, not pg_dump, and you have to be tricky about the files changing during the backup. You _can_ work around changes to the files during backup. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 23:40, Curt Sampson wrote: > On 23 Jun 2002, J. R. Nield wrote: > > > It is impossible to do what you want. You cannot protect against > > partial writes without writing pages twice and calling fdatasync > > between them while going through a generic filesystem. > > I agree with this. > > > The best disk array will not protect you if the operating system does > > not align block writes to the structure of the underlying device. > > This I don't quite understand. Assuming you're using a SCSI drive > (and this mostly applies to ATAPI/IDE, too), you can do naught but > align block writes to the structure of the underlying device. When you > initiate a SCSI WRITE command, you start by telling the device at which > block to start writing and how many blocks you intend to write. Then you > start passing the data. > All I'm saying is that the entire postgresql block write must be converted into exactly one SCSI write command in all cases, and I don't know a portable way to ensure this. > > Even with raw devices, you need special support or knowledge of the > > operating system and/or the disk device to ensure that each write > > request will be atomic to the underlying hardware. > > Well, so here I guess you're talking about two things: > > 1. When you request, say, an 8K block write, will the OS really > write it to disk in a single 8K or multiple of 8K SCSI write > command? > > 2. Does the SCSI device you're writing to consider these writes to > be transactional. That is, if the write is interrupted before being > completed, does the SCSI device guarantee that the partially-sent > data is not written, and the old data is maintained? And of course, > does it guarantee that, when it acknowledges a write, that write is > now in stable storage and will never go away? > > Both of these are not hard to guarantee, actually. 
For a BSD-based OS, > for example, just make sure that your filesystem block size is the > same as or a multiple of the database block size. BSD will never write > anything other than a block or a sequence of blocks to a disk in a > single SCSI transaction (unless you've got a really odd SCSI driver). > And for your disk, buy a Baydel or Clarion disk array, or something > similar. > > Given that it's not hard to set up a system that meets these criteria, > and this is in fact commonly done for database servers, it would seem a > good idea for postgres to have the option to take advantage of the time > and money spent and adjust its performance upward appropriately. I agree with this. My point was only that you need to know what guarantees your operating system/hardware combination provides on a case-by-case basis, and there is no standard way for a program to discover this. Most system administrators are not going to know this either, unless databases are their main responsibility. > > > All other systems rely on the fact that you can recover a damaged file > > using the log archive. > > Not exactly. For MS SQL Server, at any rate, if it detects a page tear > you cannot restore based on the log file alone. You need a full or > partial backup that includes that entire torn block. > I should have been more specific: you need a backup of the file from some time ago, plus all the archived logs from then until the current log sequence number. > > This means downtime in the rare case, but no data loss. Until > > PostgreSQL can do this, then it will not be acceptable for real > > critical production use. > > It seems to me that it is doing this right now. In fact, it's more > reliable than some commercial systems (such as SQL Server) because it can > recover from a torn block with just the logfile. Again, what I meant to say is that the commercial systems can recover with an old file backup + logs. 
How old the backup can be depends only on how much time you are willing to spend playing the logs forward. So if you do a full backup once a week, and multiplex and back up the logs, then even if a backup tape gets destroyed you can still survive. It just takes longer. Also, PostgreSQL can't recover from any other type of block corruption, while the commercial systems can. That's what I meant by the "critical production use" comment, which was sort-of unfair. So I would say they are equally reliable for torn pages (but not bad blocks), and the commercial systems let you trade potential recovery time for not having to write the blocks twice. You do need to back up the log archives though. > > But at the end of the day, unless you have complete understanding of > the I/O system from write(2) through to the disk system, the only sure > ways to protect against partial writes are by "careful writes" (in > the WAL log or elsewhere, writing pages twice), or by requiring (and > allowing) users to do log-replay recovery when a file is corrupted by > a partial write. > > I don't understand how, without a copy of the old data
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
J. R. Nield wrote: > > This I don't quite understand. Assuming you're using a SCSI drive > > (and this mostly applies to ATAPI/IDE, too), you can do naught but > > align block writes to the structure of the underlying device. When you > > initiate a SCSI WRITE command, you start by telling the device at which > > block to start writing and how many blocks you intend to write. Then you > > start passing the data. > > > > All I'm saying is that the entire postgresql block write must be > converted into exactly one SCSI write command in all cases, and I don't > know a portable way to ensure this. ... > I agree with this. My point was only that you need to know what > guarantees your operating system/hardware combination provides on a > case-by-case basis, and there is no standard way for a program to > discover this. Most system administrators are not going to know this > either, unless databases are their main responsibility. Yes, agreed. <1% are going to know the answer to this question so we have to assume worst case. > > It seems to me that it is doing this right now. In fact, it's more > > reliable than some commercial systems (such as SQL Server) because it can > > recover from a torn block with just the logfile. > > Again, what I meant to say is that the commercial systems can recover > with an old file backup + logs. How old the backup can be depends only > on how much time you are willing to spend playing the logs forward. So > if you do a full backup once a week, and multiplex and backup the logs, > then even if a backup tape gets destroyed you can still survive. It just > takes longer. > > Also, PostgreSQL can't recover from any other type of block corruption, > while the commercial systems can. That's what I meant by the "critical > production use" comment, which was sort-of unfair. 
> > So I would say they are equally reliable for torn pages (but not bad > blocks), and the commercial systems let you trade potential recovery > time for not having to write the blocks twice. You do need to back-up > the log archives though. Yes, good tradeoff analysis. We recover from partial writes quicker, and don't require saving of log files, _but_ we don't recover from bad disk blocks. Good summary. > I'll back off on that. I don't know if we want to use the OS buffer > manager, but shouldn't we try to have our buffer manager group writes > together by files, and pro-actively get them out to disk? Right now, it > looks like all our write requests are delayed as long as possible and > the order in which they are written is pretty-much random, as is the > backend that writes the block, so there is no locality of reference even > when the blocks are adjacent on disk, and the write calls are spread-out > over all the backends. > > Would it not be the case that things like read-ahead, grouping writes, > and caching written data are probably best done by PostgreSQL, because > only our buffer manager can understand when they will be useful or when > they will thrash the cache? The OS should handle all of this. We are doing main table writes but no sync until checkpoint, so the OS can keep those blocks around and write them at its convenience. It knows the size of the buffer cache and when stuff is forced to disk. We can't second-guess that. > I may likely be wrong on this, and I haven't done any performance > testing. I shouldn't have brought this up alongside the logging issues, > but there seemed to be some question about whether the OS was actually > doing all these things behind the scene. It had better. Looking at the kernel source is the way to know. > Does anyone know what the major barriers to infinite log replay are in > PostgreSQL? I'm trying to look for everything that might need to be > changed outside xlog.c, but surely this has come up before. 
Searching > the archives hasn't revealed much. This has been brought up. Could we just save WAL files and get replay? I believe some things have to be added to WAL to allow this, but it seems possible. However, the pg_dump is just a data dump and does not have the file offsets and things. Somehow you would need a tar-type backup of the database, and with a running db, it is hard to get a valid snapshot of that. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Bruce Momjian <[EMAIL PROTECTED]> writes: >> Does anyone know what the major barriers to infinite log replay are in >> PostgreSQL? I'm trying to look for everything that might need to be >> changed outside xlog.c, but surely this has come up before. Searching >> the archives hasn't revealed much. > This has been brought up. Could we just save WAL files and get replay? > I believe some things have to be added to WAL to allow this, but it > seems possible. The Red Hat group has been looking at this somewhat; so far there seem to be some minor tweaks that would be needed, but no showstoppers. > Somehow you would need a tar-type > backup of the database, and with a running db, it is hard to get a valid > snapshot of that. But you don't *need* a "valid snapshot", only a correct copy of every block older than the first checkpoint in your WAL log series. Any inconsistencies in your tar dump will look like repairable damage; replaying the WAL log will fix 'em. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
"J. R. Nield" <[EMAIL PROTECTED]> writes: > Also, postgreSQL can't recover from any other type of block corruption, > while the commercial systems can. Say again? > Would it not be the case that things like read-ahead, grouping writes, > and caching written data are probably best done by PostgreSQL, because > only our buffer manager can understand when they will be useful or when > they will thrash the cache? I think you have been missing the point. No one denies that there will be some incremental gain if we do all that. However, the conclusion of everyone who has thought much about it (and I see Curt has joined that group) is that the effort would be far out of proportion to the probable gain. There are a lot of other things we desperately need to spend time on that would not amount to re-engineering large quantities of OS-level code. Given that most Unixen have perfectly respectable disk management subsystems, we prefer to tune our code to make use of that stuff, rather than follow the "conventional wisdom" that databases need to bypass it. Oracle can afford to do that sort of thing because they have umpteen thousand developers available. Postgres does not. regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Tom Lane wrote: > > On Sun, 23 Jun 2002, Bruce Momjian wrote: > >> Yes, I don't see writing to two files vs. one to be any win, especially > >> when we need to fsync both of them. What I would really like is to > >> avoid the double I/O of writing to WAL and to the data file; improving > >> that would be a huge win. > > I don't believe it's possible to eliminate the double I/O. Keep in mind > though that in the ideal case (plenty of shared buffers) you are only > paying two writes per modified block per checkpoint interval --- one to > the WAL during the first write of the interval, and then a write to the > real datafile issued by the checkpoint process. Anything that requires > transaction commits to write data blocks will likely result in more I/O > not less, at least for blocks that are modified by several successive > transactions. > > The only thing I've been able to think of that seems like it might > improve matters is to make the WAL writing logic aware of the layout > of buffer pages --- specifically, to know that our pages generally > contain an uninteresting "hole" in the middle, and not write the hole. > Optimistically this might reduce the WAL data volume by something > approaching 50%; though pessimistically (if most pages are near full) > it wouldn't help much. Good idea. How about putting the page through our TOAST compression routine before writing it to WAL? Should be pretty easy and fast and doesn't require any knowledge of the page format. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
> On Sun, 23 Jun 2002, Bruce Momjian wrote: >> Yes, I don't see writing to two files vs. one to be any win, especially >> when we need to fsync both of them. What I would really like is to >> avoid the double I/O of writing to WAL and to the data file; improving >> that would be a huge win. I don't believe it's possible to eliminate the double I/O. Keep in mind though that in the ideal case (plenty of shared buffers) you are only paying two writes per modified block per checkpoint interval --- one to the WAL during the first write of the interval, and then a write to the real datafile issued by the checkpoint process. Anything that requires transaction commits to write data blocks will likely result in more I/O not less, at least for blocks that are modified by several successive transactions. The only thing I've been able to think of that seems like it might improve matters is to make the WAL writing logic aware of the layout of buffer pages --- specifically, to know that our pages generally contain an uninteresting "hole" in the middle, and not write the hole. Optimistically this might reduce the WAL data volume by something approaching 50%; though pessimistically (if most pages are near full) it wouldn't help much. This was not very feasible when the WAL code was designed because the buffer manager needed to cope with both normal pages and pg_log pages, but as of 7.2 I think it'd be safe to assume that all pages have the standard layout. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Bruce Momjian <[EMAIL PROTECTED]> writes: >> The only thing I've been able to think of that seems like it might >> improve matters is to make the WAL writing logic aware of the layout >> of buffer pages --- specifically, to know that our pages generally >> contain an uninteresting "hole" in the middle, and not write the hole. >> Optimistically this might reduce the WAL data volume by something >> approaching 50%; though pessimistically (if most pages are near full) >> it wouldn't help much. > Good idea. How about putting the page through or TOAST compression > routine before writing it to WAL? Should be pretty easy and fast and > doesn't require any knowledge of the page format. Easy, maybe, but fast definitely NOT. The compressor is not speedy. Given that we have to be holding various locks while we build WAL records, I do not think it's a good idea to add CPU time there. Also, compressing already-compressed data is not a win ... regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
J. R. Nield wrote: > So since we have all this buffering designed especially to meet our > needs, and since the OS buffering is in the way, can someone explain to > me why postgresql would ever open a file without the O_DSYNC flag if the > platform supports it? We sync only WAL, not the other pages, except for the sync() call we do during checkpoint when we discard old WAL files. > > I concur with Bruce: the reason we keep page images in WAL is to > > minimize the number of places we have to fsync, and thus the amount of > > head movement required for a commit. Putting the page images elsewhere > > cannot be a win AFAICS. > > > Why not put all the page images in a single pre-allocated file and treat > it as a ring? How could this be any worse than flushing them in the WAL > log? > > Maybe fsync would be slower with two files, but I don't see how > fdatasync would be, and most platforms support that. We have fdatasync option for WAL in postgresql.conf. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On 23 Jun 2002, J. R. Nield wrote: > It is impossible to do what you want. You cannot protect against > partial writes without writing pages twice and calling fdatasync > between them while going through a generic filesystem. I agree with this. > The best disk array will not protect you if the operating system does > not align block writes to the structure of the underlying device. This I don't quite understand. Assuming you're using a SCSI drive (and this mostly applies to ATAPI/IDE, too), you can do naught but align block writes to the structure of the underlying device. When you initiate a SCSI WRITE command, you start by telling the device at which block to start writing and how many blocks you intend to write. Then you start passing the data. (See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter details for the SCSI WRITE(10) command. You may find the SCSI 2 specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful reference here.) > Even with raw devices, you need special support or knowledge of the > operating system and/or the disk device to ensure that each write > request will be atomic to the underlying hardware. Well, so here I guess you're talking about two things: 1. When you request, say, an 8K block write, will the OS really write it to disk in a single 8K or multiple of 8K SCSI write command? 2. Does the SCSI device you're writing to consider these writes to be transactional. That is, if the write is interrupted before being completed, does the SCSI device guarantee that the partially-sent data is not written, and the old data is maintained? And of course, does it guarantee that, when it acknowledges a write, that write is now in stable storage and will never go away? Both of these are not hard to guarantee, actually. For a BSD-based OS, for example, just make sure that your filesystem block size is the same as or a multiple of the database block size. 
BSD will never write anything other than a block or a sequence of blocks to a disk in a single SCSI transaction (unless you've got a really odd SCSI driver). And for your disk, buy a Baydel or Clarion disk array, or something similar. Given that it's not hard to set up a system that meets these criteria, and this is in fact commonly done for database servers, it would seem a good idea for postgres to have the option to take advantage of the time and money spent and adjust its performance upward appropriately. > All other systems rely on the fact that you can recover a damaged file > using the log archive. Not exactly. For MS SQL Server, at any rate, if it detects a page tear you cannot restore based on the log file alone. You need a full or partial backup that includes that entire torn block. > This means downtime in the rare case, but no data loss. Until > PostgreSQL can do this, then it will not be acceptable for real > critical production use. It seems to me that it is doing this right now. In fact, it's more reliable than some commercial systems (such as SQL Server) because it can recover from a torn block with just the logfile. > But at the end of the day, unless you have complete understanding of > the I/O system from write(2) through to the disk system, the only sure > ways to protect against partial writes are by "careful writes" (in > the WAL log or elsewhere, writing pages twice), or by requiring (and > allowing) users to do log-replay recovery when a file is corrupted by > a partial write. I don't understand how, without a copy of the old data that was in the torn block, you can restore that block from just log file entries. Can you explain this to me? Take, as an example, a block with ten tuples, only one of which has been changed "recently." (I.e., only that change is in the log files.) > If we log pages to WAL, they are useless when archived (after a > checkpoint). 
So either we have a separate "log" for them (the > ping-pong file), or we should at least remove them when archived, > which makes log archiving more complex but is perfectly doable. Right. That seems to me a better option, since we've now got only one write point on the disk rather than two. > Finally, I would love to hear why we are using the operating system > buffer manager at all. The OS is acting as a secondary buffer manager > for us. Why is that? What flaw in our I/O system does this reveal? It's acting as a "second-level" buffer manager, yes, but to say it's "secondary" may be a bit misleading. On most of the systems I've set up, the OS buffer cache is doing the vast majority of the work, and the postgres buffering is fairly minimal. There are some good (and some perhaps not-so-good) reasons to do it this way. I'll list them more or less in the order of best to worst: 1. The OS knows where the blocks physically reside on disk, and postgres does not. Therefore it's in the interest of postgresql to dispatch write responsibility back to the OS as quickly as possible.
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Curt Sampson wrote: > On 23 Jun 2002, J. R. Nield wrote: > > > So since we have all this buffering designed especially to meet our > > needs, and since the OS buffering is in the way, can someone explain to > > me why postgresql would ever open a file without the O_DSYNC flag if the > > platform supports it? > > It's more code, if there are platforms out there that don't support > O_DSYNC. (We still have to keep the old fsync code.) On the other hand, > O_DSYNC could save us a disk arm movement over fsync() because it > appears to me that fsync is also going to force a metadata update, which > means that the inode blocks have to be written as well. Again, see postgresql.conf: #wal_sync_method = fsync # the default varies across platforms: # fsync, fdatasync, open_sync, or open_datasync > > > Maybe fsync would be slower with two files, but I don't see how > > fdatasync would be, and most platforms support that. > > Because, if both files are on the same disk, you still have to move > the disk arm from the cylinder at the current log file write point > to the cylinder at the current ping-pong file write point. And then back > again to the log file write point cylinder. > > In the end, having a ping-pong file as well seems to me unnecessary > complexity, especially when anyone interested in really good > performance is going to buy a disk subsystem that guarantees no > torn pages and thus will want to turn off the ping-pong file writes > entirely, anyway. Yes, I don't see writing to two files vs. one to be any win, especially when we need to fsync both of them. What I would really like is to avoid the double I/O of writing to WAL and to the data file; improving that would be a huge win. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
J. R. Nield wrote: > On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: > > Yes, I don't see writing to two files vs. one to be any win, especially > > when we need to fsync both of them. What I would really like is to > > avoid the double I/O of writing to WAL and to the data file; improving > > that would be a huge win. > > > > It is impossible to do what you want. You cannot protect against > partial writes without writing pages twice and calling fdatasync between > them while going through a generic filesystem. The best disk array will > not protect you if the operating system does not align block writes to > the structure of the underlying device. Even with raw devices, you need > special support or knowledge of the operating system and/or the disk > device to ensure that each write request will be atomic to the > underlying hardware. Yes, I suspected it was impossible, but that doesn't mean I want it any less. ;-) > All other systems rely on the fact that you can recover a damaged file > using the log archive. This means downtime in the rare case, but no data > loss. Until PostgreSQL can do this, then it will not be acceptable for > real critical production use. This is not to knock PostgreSQL, because > it is a very good database system, and clearly the best open-source one. > It even has feature advantages over the commercial systems. But at the > end of the day, unless you have complete understanding of the I/O system > from write(2) through to the disk system, the only sure ways to protect > against partial writes are by "careful writes" (in the WAL log or > elsewhere, writing pages twice), or by requiring (and allowing) users to > do log-replay recovery when a file is corrupted by a partial write. As > long as there is a UPS, and the operating system doesn't crash, then > there still should be no partial writes. 
You are talking point-in-time recovery, a major missing feature right next to replication, and I agree it makes PostgreSQL unacceptable for some applications. Point taken. And the interesting thing you are saying is that with point-in-time recovery, we don't need to write pre-write images of pages because if we detect a partial page write, we then abort the database and tell the user to do a point-in-time recovery, basically meaning we are using the previous full backup as our pre-write page image and roll forward using the logical logs. This is clearly a nice thing to be able to do because it lets you take a pre-write image of the page once during full backup, keep it offline, and bring it back in the rare case of a full page write failure. I now can see how the MSSQL torn bits would be used, not for recovery, but to detect a partial write and force a point-in-time recovery by the administrator. > If we log pages to WAL, they are useless when archived (after a > checkpoint). So either we have a separate "log" for them (the ping-pong > file), or we should at least remove them when archived, which makes log > archiving more complex but is perfectly doable. Yes, that is how we will do point-in-time recovery; remove the pre-write page images and archive the rest. It is more complex, but having the fsync all in one file is too big a win. > Finally, I would love to hear why we are using the operating system > buffer manager at all. The OS is acting as a secondary buffer manager > for us. Why is that? What flaw in our I/O system does this reveal? I > know that: > > >We sync only WAL, not the other pages, except for the sync() call we do > > during checkpoint when we discard old WAL files. > > But this is probably not a good thing. We should only be writing blocks > when they need to be on disk. We should not be expecting the OS to write > them "sometime later" and avoid blocking (as long) for the write. 
If we > need that, then our buffer management is wrong and we need to fix it. > The reason we are doing this is because we expect the OS buffer manager > to do asynchronous I/O for us, but then we don't control the order. That > is the reason why we have to call fdatasync(), to create "sequence > points". Yes. I think I understand. It is true we have to fsync WAL because we can't control the individual writes by the OS. > The reason we have performance problems with either O_DSYNC or fdatasync > on the normal relations is because we have no dbflush process. This > causes an unacceptable amount of I/O blocking by other transactions. Uh, that would force writes all over the disk. Why do we really care how the OS writes them? If we are going to fsync, let's just do the one file and be done with it. What would a separate flusher process really buy us if it has to use fsync too? The main backend doesn't have to wait for the fsync, but then again, we can't say the transaction is committed until it hits the disk, so how does a flusher help? > The ORACLE people were not kidding when they said that they could not > certify Linux for production use until it supported O_DSYNC.
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: > Yes, I don't see writing to two files vs. one to be any win, especially > when we need to fsync both of them. What I would really like is to > avoid the double I/O of writing to WAL and to the data file; improving > that would be a huge win. > If is impossible to do what you want. You can not protect against partial writes without writing pages twice and calling fdatasync between them while going through a generic filesystem. The best disk array will not protect you if the operating system does not align block writes to the structure of the underlying device. Even with raw devices, you need special support or knowledge of the operating system and/or the disk device to ensure that each write request will be atomic to the underlying hardware. All other systems rely on the fact that you can recover a damaged file using the log archive. This means downtime in the rare case, but no data loss. Until PostgreSQL can do this, then it will not be acceptable for real critical production use. This is not to knock PostgreSQL, because it is a very good database system, and clearly the best open-source one. It even has feature advantages over the commercial systems. But at the end of the day, unless you have complete understanding of the I/O system from write(2) through to the disk system, the only sure ways to protect against partial writes are by "careful writes" (in the WAL log or elsewhere, writing pages twice), or by requiring (and allowing) users to do log-replay recovery when a file is corrupted by a partial write. As long as there is a UPS, and the operating system doesn't crash, then there still should be no partial writes. If we log pages to WAL, they are useless when archived (after a checkpoint). So either we have a separate "log" for them (the ping-pong file), or we should at least remove them when archived, which makes log archiving more complex but is perfectly doable. 
Finally, I would love to hear why we are using the operating system buffer manager at all. The OS is acting as a secondary buffer manager for us. Why is that? What flaw in our I/O system does this reveal? I know that: >We sync only WAL, not the other pages, except for the sync() call we do > during checkpoint when we discard old WAL files. But this is probably not a good thing. We should only be writing blocks when they need to be on disk. We should not be expecting the OS to write them "sometime later" and avoid blocking (as long) for the write. If we need that, then our buffer management is wrong and we need to fix it. The reason we are doing this is because we expect the OS buffer manager to do asynchronous I/O for us, but then we don't control the order. That is the reason why we have to call fdatasync(), to create "sequence points". The reason we have performance problems with either O_DSYNC or fdatasync on the normal relations is because we have no dbflush process. This causes an unacceptable amount of I/O blocking by other transactions. The ORACLE people were not kidding when they said that they could not certify Linux for production use until it supported O_DSYNC. Can you explain why that was the case? Finally, let me apologize if the above comes across as somewhat belligerent. I know very well that I can't compete with you guys for knowledge of the PostgreSQL system. I am still at a loss when I look at the optimizer and executor modules, and it will take some time before I can follow discussion of that area. Even then, I doubt my ability to compare with people like Mr. Lane and Mr. Momjian in experience and general intelligence, or in the field of database programming and software development in particular. However, this discussion and a search of the pgsql-hackers archives reveals this problem to be the KEY area of PostgreSQL's failing, and general misunderstanding, when compared to its commercial competitors. Sincerely, J. R. Nield -- J. R. 
Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
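[Editor's note: a short C sketch of the "sequence point" idea from the message above — WAL must be durable before the data page it protects is overwritten in place, and fdatasync() is what enforces that ordering against OS write reordering. The function name and layout are invented for illustration; this is not PostgreSQL source.]

```c
/* Editorial sketch: enforce WAL-before-data ordering with fdatasync()
 * as the sequence point.  Invented names, not PostgreSQL source. */
#include <fcntl.h>      /* for open(2) in callers */
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Write one dirty page at block 'blkno', WAL first.
 * Returns 0 on success, -1 on error. */
int write_page_wal_first(int wal_fd, int data_fd,
                         const char page[BLCKSZ], off_t blkno)
{
    /* 1. The caller already appended WAL records covering this page;
     *    force them to disk now (the sequence point). */
    if (fdatasync(wal_fd) != 0)
        return -1;

    /* 2. Only after the log is durable is it safe to overwrite the
     *    old page image in place. */
    if (pwrite(data_fd, page, BLCKSZ, blkno * (off_t) BLCKSZ) != BLCKSZ)
        return -1;
    return 0;
}
```

Without the fdatasync() in step 1, the OS could push the in-place page write to disk before the log records, which is exactly the loss of ordering control the message complains about.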
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 21:29, J. R. Nield wrote: > If is impossible to do what you want. You can not protect against... Wow. The number of typo's in that last one was just amazing. I even started with one. Have an nice weekend everybody :-) ;jrnield -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 23 Jun 2002, Bruce Momjian wrote: > Yes, I don't see writing to two files vs. one to be any win, especially > when we need to fsync both of them. What I would really like is to > avoid the double I/O of writing to WAL and to the data file; improving > that would be a huge win. You mean, the double I/O of writing the block to the WAL and data file? (We'd still have to write the changed columns or whatever to the WAL, right?) I'd just add an option to turn it off. If you need it, you need it; there's no way around that except to buy hardware that is really going to guarantee your writes (which then means you don't need it). cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On 23 Jun 2002, J. R. Nield wrote: > So since we have all this buffering designed especially to meet our > needs, and since the OS buffering is in the way, can someone explain to > me why postgresql would ever open a file without the O_DSYNC flag if the > platform supports it? It's more code, if there are platforms out there that don't support O_DSYNC. (We still have to keep the old fsync code.) On the other hand, O_DSYNC could save us a disk arm movement over fsync() because it appears to me that fsync is also going to force a metadata update, which means that the inode blocks have to be written as well. > Maybe fsync would be slower with two files, but I don't see how > fdatasync would be, and most platforms support that. Because, if both files are on the same disk, you still have to move the disk arm from the cylinder at the current log file write point to the cylinder at the current ping-pong file write point. And then back again to the log file write point cylinder. In the end, having a ping-pong file as well seems to me unnecessary complexity, especially when anyone interested in really good performance is going to buy a disk subsystem that guarantees no torn pages and thus will want to turn off the ping-pong file writes entirely, anyway. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 12:10, Curt Sampson wrote: > > So what we would really need to implement, if we wanted to be more > efficient with trusted disk subsystems, would be the option of writing > to the log only the changed row or changed part of the row, or writing > the entire changed page. I don't know how hard this would be > We already log that stuff. The page images are in addition to the "Logical Changes", so we could just stop logging the page images. -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 11:19, Tom Lane wrote: > Curt Sampson <[EMAIL PROTECTED]> writes: > > This should also allow us to disable completely the ping-pong writes > > if we have a disk subsystem that we trust. > > If we have a disk subsystem we trust, we just disable fsync on the > WAL and the performance issue largely goes away. It wouldn't work because the OS buffering interferes, and we need those WAL records on disk up to the greatest LSN of the Buffer we will be writing. We already buffer WAL ourselves. We also already buffer regular pages. Whenever we write a Buffer out of the buffer cache, it is because we really want that page on disk and wanted to start an IO. If that's not the case, then we should have more block buffers! So since we have all this buffering designed especially to meet our needs, and since the OS buffering is in the way, can someone explain to me why postgresql would ever open a file without the O_DSYNC flag if the platform supports it? > > I concur with Bruce: the reason we keep page images in WAL is to > minimize the number of places we have to fsync, and thus the amount of > head movement required for a commit. Putting the page images elsewhere > cannot be a win AFAICS. Why not put all the page images in a single pre-allocated file and treat it as a ring? How could this be any worse than flushing them in the WAL log? Maybe fsync would be slower with two files, but I don't see how fdatasync would be, and most platforms support that. What would improve performance would be to have a dbflush process that would work in the background flushing buffers in groups and trying to stay ahead of ReadBuffer requests. That would let you do the temporary side of the ping-pong as a huge O_DSYNC writev(2) request (or fdatasync() once) and then write out the other buffers. It would also tend to prevent the other backends from blocking on write requests. A dbflush could also support aio_read/aio_write on platforms like Solaris and WindowsNT that support it. 
Am I correct that right now, buffers only get written when they get removed from the free list for reuse? So a released dirty buffer will sit in the buffer free list until it becomes the Least Recently Used buffer, and will then cause a backend to block for IO in a call to BufferAlloc? This would explain why we like using the OS buffer cache, and why our performance is troublesome when we have to do synchronous IO writes, and why fsync() takes so long to complete. All of the backends block for each call to BufferAlloc() after a large table update by a single backend, and then the OS buffers are always full of our "written" data. Am I reading the bufmgr code correctly? I already found an imaginary race condition there once :-) ;jnield > > > Well, whether or not there's a cheap way depends on whether you consider > > fsync to be cheap. :-) > > It's never cheap :-( > -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
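[Editor's note: a C sketch of the proposal above — keep all the page images in a single pre-allocated file treated as a ring, writing each image there and syncing it before the in-place write. A crash can then tear at most one of the two copies. All names and the slot scheme are invented for illustration; this is not PostgreSQL source.]

```c
/* Editorial sketch of the pre-allocated ping-pong ring file.
 * Invented names, not PostgreSQL source. */
#include <fcntl.h>      /* for open(2) in callers */
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ     8192
#define RING_SLOTS 64   /* fixed size: the file never grows */

static unsigned int ring_next;  /* next slot to reuse in the ring */

int pingpong_write(int ring_fd, int data_fd,
                   const char page[BLCKSZ], off_t blkno)
{
    off_t slot = (off_t) (ring_next++ % RING_SLOTS) * BLCKSZ;

    /* First copy: into the ring, forced to disk before the real write
     * so the two writes cannot complete in parallel. */
    if (pwrite(ring_fd, page, BLCKSZ, slot) != BLCKSZ)
        return -1;
    if (fdatasync(ring_fd) != 0)
        return -1;

    /* Second copy: the page in place.  If this one tears, recovery
     * takes the intact image from the ring instead. */
    if (pwrite(data_fd, page, BLCKSZ, blkno * (off_t) BLCKSZ) != BLCKSZ)
        return -1;
    return 0;
}
```

The fdatasync() between the two pwrite() calls is the cost Tom and Bruce object to elsewhere in the thread: on one spindle it is an extra arm movement per page compared with logging the image in WAL.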
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 23 Jun 2002, Tom Lane wrote: > Curt Sampson <[EMAIL PROTECTED]> writes: > > This should also allow us to disable completely the ping-pong writes > > if we have a disk subsystem that we trust. > > If we have a disk subsystem we trust, we just disable fsync on the > WAL and the performance issue largely goes away. No, you can't do this. If you don't fsync(), there's no guarantee that the write ever got out of the computer's buffer cache and to the disk subsystem in the first place. > I concur with Bruce: the reason we keep page images in WAL is to > minimize the number of places we have to fsync, and thus the amount of > head movement required for a commit. An fsync() does not necessarily cause head movement, or any real disk writes at all. If you're writing to many external disk arrays, for example, the fsync() ensures that the data are in the disk array's non-volatile or UPS-backed RAM, no more. The array might hold the data for quite some time before it actually writes it to disk. But you're right that it's faster, if you're going to write out changed pages and have the ping-pong file and the transaction log on the same disk, just to write out the entire page to the transaction log. So what we would really need to implement, if we wanted to be more efficient with trusted disk subsystems, would be the option of writing to the log only the changed row or changed part of the row, or writing the entire changed page. I don't know how hard this would be. > > Well, whether or not there's a cheap way depends on whether you consider > > fsync to be cheap. :-) > > It's never cheap :-( Actually, with a good external RAID system with non-volatile RAM, it's a good two to four orders of magnitude cheaper than writing to a directly connected disk that doesn't claim the write is complete until it's physically on disk. I'd say that it qualifies as at least "not expensive." 
Not that you want to do it more often than you have to, anyway. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Curt Sampson <[EMAIL PROTECTED]> writes: > This should also allow us to disable completely the ping-pong writes > if we have a disk subsystem that we trust. If we have a disk subsystem we trust, we just disable fsync on the WAL and the performance issue largely goes away. I concur with Bruce: the reason we keep page images in WAL is to minimize the number of places we have to fsync, and thus the amount of head movement required for a commit. Putting the page images elsewhere cannot be a win AFAICS. > Well, whether or not there's a cheap way depends on whether you consider > fsync to be cheap. :-) It's never cheap :-( regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On 23 Jun 2002, J. R. Nield wrote: > On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: > > J. R. Nield wrote: > > > One other point: > > > > > > Page pre-image logging is fundamentally the same as what Jim Grey's > > > book[1] would call "careful writes". I don't believe they should be in > > > the XLOG, because we never need to keep the pre-images after we're sure > > > the buffer has made it to the disk. Instead, we should have the buffer > > > IO routines implement ping-pong writes of some kind if we want > > > protection from partial writes. > > > > Ping-pong writes to where? We have to fsync, and rather than fsync that > > area and WAL, we just do WAL. Not sure about a win there. Presumably the win is that, "we never need to keep the pre-images after we're sure the buffer has made it to the disk." So the pre-image log can be completely ditched when we shut down the server, do a full system sync, or whatever. This keeps the log file size down, which means faster recovery, less to back up (when we start getting transaction logs that can be backed up), etc. This should also allow us to disable completely the ping-pong writes if we have a disk subsystem that we trust. (E.g., a disk array with battery-backed memory.) That would, in theory, produce a nice little performance increase when lots of inserts and/or updates are being committed, as we have much, much less to write to the log file. Are there stats that track, e.g., the bandwidth of writes to the log file? I'd be interested in knowing just what kind of savings one might see by doing this. > The key question is: do we have some method to ensure that the OS > doesn't do the writes in parallel?... > It would seem to me that if the same process does both ping-pong writes, > then there should be a cheap way to enforce a serial order. I could be > wrong though. Well, whether or not there's a cheap way depends on whether you consider fsync to be cheap. 
:-) cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: > J. R. Nield wrote: > > One other point: > > > > Page pre-image logging is fundamentally the same as what Jim Grey's > > book[1] would call "careful writes". I don't believe they should be in > > the XLOG, because we never need to keep the pre-images after we're sure > > the buffer has made it to the disk. Instead, we should have the buffer > > IO routines implement ping-pong writes of some kind if we want > > protection from partial writes. > > Ping-pong writes to where? We have to fsync, and rather than fsync that > area and WAL, we just do WAL. Not sure about a win there. > The key question is: do we have some method to ensure that the OS doesn't do the writes in parallel? If the OS will ensure that one of the two block writes of a ping-pong completes before the other starts, then we don't need to fsync() at all. The only thing we are protecting against is the possibility of both writes being partial. If neither is done, that's fine because WAL will protect us. If the first write is partial, we will detect that and use the old data from the other, then recover from WAL. If the first is complete but the second is partial, then we detect that and use the newer block from the first write. If the second is complete but the first is partial, we detect that and use the newer block from the second write. So does anyone know a way to prevent parallel writes in one of the common unix standards? Do they say anything about this? It would seem to me that if the same process does both ping-pong writes, then there should be a cheap way to enforce a serial order. I could be wrong though. As to where the first block of the ping-pong should go, maybe we could reserve a file with nBlocks space for them, and write the information about which block was being written to the XLOG for use in recovery. There are many other ways to do it. ;jrnield -- J. R. 
Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
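[Editor's note: the case analysis in the message above — use the old copy if the first write tore, the newer copy otherwise — reduces to a small decision function. A C sketch, with invented LSN/checksum fields standing in for however the two copies would really be validated; not PostgreSQL source.]

```c
/* Editorial sketch of the ping-pong recovery rule: keep the newest
 * copy that is not torn.  Invented names, not PostgreSQL source. */
typedef struct {
    long lsn;       /* log sequence number stamped on this copy */
    int  intact;    /* 1 if its checksum verified, 0 if torn */
} PageCopy;

/* Returns 0 to use copy a, 1 to use copy b, -1 if neither is usable
 * (impossible if the two writes were properly serialized, which is
 * the whole point of the scheme). */
int choose_copy(PageCopy a, PageCopy b)
{
    if (a.intact && b.intact)
        return (a.lsn >= b.lsn) ? 0 : 1;    /* both fine: the newer wins */
    if (a.intact)
        return 0;                           /* b tore: fall back to a */
    if (b.intact)
        return 1;                           /* a tore: fall back to b */
    return -1;                              /* both torn */
}
```

The -1 branch is exactly the case the serial-write requirement rules out: if the second write can never begin before the first completes, at most one copy can be partial.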
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
J. R. Nield wrote: > One other point: > > Page pre-image logging is fundamentally the same as what Jim Grey's > book[1] would call "careful writes". I don't believe they should be in > the XLOG, because we never need to keep the pre-images after we're sure > the buffer has made it to the disk. Instead, we should have the buffer > IO routines implement ping-pong writes of some kind if we want > protection from partial writes. Ping-pong writes to where? We have to fsync, and rather than fsync that area and WAL, we just do WAL. Not sure about a win there. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote: > I was wondering, how does knowing the block is corrupt help MS SQL? > Right now, we write changed pages to WAL, then later write them to disk. > I have always been looking for a way to prevent these WAL writes. The > 512-byte bit seems interesting, but how does it help? > > And how does the bit help them with partial block writes? Is the bit at > the end of the block? Is that reliable? > My understanding of this is as follows: 1) On most commercial systems, if you get a corrupted block (from partial write or whatever) you need to restore the file(s) from the most recent backup, and replay the log from the log archive (usually only the damaged files will be written to during replay). 2) If you can't deal with the downtime to recover the file, then EMC, Sun, or IBM will sell you an expensive disk array with an NVRAM cache that will do atomic writes. Some plain-vanilla SCSI disks are also capable of atomic writes, though usually they don't use NVRAM to do it. The database must then make sure that each page-write gets translated into exactly one SCSI-level write. This is one reason why ORACLE and Sybase recommend that you use raw disk partitions for high availability. Some operating systems support this through the filesystem, but it is OS dependent. I think Solaris 7 & 8 have support for this, but I'm not sure. PostgreSQL has trouble because it can neither archive logs for replay, nor use raw disk partitions. One other point: Page pre-image logging is fundamentally the same as what Jim Gray's book[1] would call "careful writes". I don't believe they should be in the XLOG, because we never need to keep the pre-images after we're sure the buffer has made it to the disk. Instead, we should have the buffer IO routines implement ping-pong writes of some kind if we want protection from partial writes. Does any of this make sense? ;jrnield [1] Gray, J. and Reuter, A. (1993). "Transaction Processing: Concepts and Techniques". 
Morgan Kaufmann. -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Fri, 21 Jun 2002, Tom Lane wrote: > Curt Sampson <[EMAIL PROTECTED]> writes: > > And now you know which parts of your page got written, and which > > parts didn't. > > Yes ... and what do you *do* about it? Ok. Here's the extract from _Inside Microsoft SQL Server 7.0_, page 207: torn page detection When TRUE, this option causes a bit to be flipped for each 512-byte sector in a database page (8 KB) whenever the page is written to disk. This option allows SQL Server to detect incomplete I/O operations caused by power failures or other system outages. If a bit is in the wrong state when the page is later read by SQL Server, this means the page was written incorrectly; a torn page has been detected. Although SQL Server database pages are 8 KB, disks perform I/O operations using 512-byte sectors. Therefore, 16 sectors are written per database page. A torn page can occur if the system crashes (for example, because of power failure) between the time the operating system writes the first 512-byte sector to disk and the completion of the 8-KB I/O operation. If the first sector of a database page is successfully written before the crash, it will appear that the database page on disk was updated, although it might not have succeeded. Using battery-backed disk caches can ensure that data is [sic] successfully written to disk or not written at all. In this case, don't set torn page detection to TRUE, as it isn't needed. If a torn page is detected, the database will need to be restored from backup because it will be physically inconsistent. As I understand it, this is not a problem for postgres because the entire page is written to the log. So postgres is safe, but quite inefficient. (It would be much more efficient to write just the changed tuple, or even just the changed values within the tuple, to the log.) Adding these torn bits would allow postgres at least to write to the log just the 512-byte sectors that have changed, rather than the entire 8 KB page. 
cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
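[Editor's note: the torn-bit scheme quoted above — one bit per 512-byte sector, all flipped together before each write, mismatch on read means a torn page — is easy to model. A C sketch; the layout is a simplification (a real page hides each bit inside its sector rather than in a separate array), and none of this is SQL Server or PostgreSQL source.]

```c
/* Editorial sketch of SQL Server-style torn-page bits for an 8 KB page
 * made of sixteen 512-byte sectors.  Simplified layout, invented names. */
#include <string.h>

#define PAGE_SECTORS 16     /* 8192 / 512 */

typedef struct {
    unsigned char torn[PAGE_SECTORS];   /* one torn bit per sector */
    /* ... sector payloads would live here ... */
} Page;

/* Before the page goes to disk: flip every sector's bit to the same
 * new value.  Each sector write carries its bit along. */
void page_prepare_write(Page *p)
{
    unsigned char next = p->torn[0] ^ 1;

    memset(p->torn, next, PAGE_SECTORS);
}

/* On read: 1 if all bits agree (the whole 8 KB write completed),
 * 0 if the page is torn. */
int page_intact(const Page *p)
{
    for (int i = 1; i < PAGE_SECTORS; i++)
        if (p->torn[i] != p->torn[0])
            return 0;
    return 1;
}
```

Note that, as the thread observes, this only detects the tear; unlike a full page image in WAL, it gives you nothing to repair the page with.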
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Curt Sampson <[EMAIL PROTECTED]> writes: > And now you know which parts of your page got written, and which > parts didn't. Yes ... and what do you *do* about it? regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Thu, 20 Jun 2002, Bruce Momjian wrote: > > MS SQL Server has an interesting way of dealing with this. They have a > > "torn" bit in each 512-byte chunk of a page, and this bit is set the > > same for each chunk. When they are about to write out a page, they first > > flip all of the torn bits and then do the write. If the write does not > > complete due to a system crash or whatever, this can be detected later > > because the torn bits won't match across the entire page. > > I was wondering, how does knowing the block is corrupt help MS SQL? I'm trying to recall, but I can't off hand. I'll have to look it up in my Inside SQL Server book, which is at home right now, unfortunately. I'll bring the book into work and let you know the details later. > Right now, we write changed pages to WAL, then later write them to disk. Ah. You write the entire page? MS writes only the changed tuple. And DB2, in fact, goes one better and writes only the part of the tuple up to the change, IIRC. Thus, if you put smaller and/or more frequently changed columns first, you'll have smaller logs. > I have always been looking for a way to prevent these WAL writes. The > 512-byte bit seems interesting, but how does it help? Well, this would at least let you reduce the write to the 512-byte chunk that changed, rather than writing the entire 8K page. > And how does the bit help them with partial block writes? Is the bit at > the end of the block? Is that reliable? The bit is somewhere within every 512 byte "disk page" within the 8192 byte "filesystem/database page." 
So an 8KB page is divided up like this:

| <--------------- 8 Kb ---------------> |
| 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b |

Thus, the tear bits start out like this:

|  0  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |

After a successful write of the entire page, you have this:

|  1  |  1  |  1  |  1  |  1  |  1  |  1  |  1  |

If the write is unsuccessful, you end up with something like this:

|  1  |  1  |  1  |  1  |  1  |  0  |  0  |  0  |

And now you know which parts of your page got written, and which parts didn't. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Curt Sampson wrote: > On Thu, 25 Apr 2002, Bruce Momjian wrote: > > > Actually, this brings up a different point. We use 8k blocks now > > because at the time PostgreSQL was developed, it used BSD file systems, > > and those prefer 8k blocks, and there was some concept that an 8k write > > was atomic, though with 512 byte disk blocks, that was incorrect. (We > > knew that at the time too, but we didn't have any options, so we just > > hoped.) > > MS SQL Server has an interesting way of dealing with this. They have a > "torn" bit in each 512-byte chunk of a page, and this bit is set the > same for each chunk. When they are about to write out a page, they first > flip all of the torn bits and then do the write. If the write does not > complete due to a system crash or whatever, this can be detected later > because the torn bits won't match across the entire page. I was wondering, how does knowing the block is corrupt help MS SQL? Right now, we write changed pages to WAL, then later write them to disk. I have always been looking for a way to prevent these WAL writes. The 512-byte bit seems interesting, but how does it help? And how does the bit help them with partial block writes? Is the bit at the end of the block? Is that reliable? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Thu, 25 Apr 2002, Bruce Momjian wrote: > Actually, this brings up a different point. We use 8k blocks now > because at the time PostgreSQL was developed, it used BSD file systems, > and those prefer 8k blocks, and there was some concept that an 8k write > was atomic, though with 512 byte disk blocks, that was incorrect. (We > knew that at the time too, but we didn't have any options, so we just > hoped.) MS SQL Server has an interesting way of dealing with this. They have a "torn" bit in each 512-byte chunk of a page, and this bit is set the same for each chunk. When they are about to write out a page, they first flip all of the torn bits and then do the write. If the write does not complete due to a system crash or whatever, this can be detected later because the torn bits won't match across the entire page. > Now, with larger RAM and disk sizes, it may be time to consider larger > page sizes, like 32k pages. That reduces the granularity of the cache, > but it may have other performance advantages that would be worth it. It really depends on the block size your underlying layer is using. Reading less than that is never useful as you pay for that entire block anyway. (E.g., on an FFS filesystem with 8K blocks, the OS always reads 8K even if you ask for only 4K.) On the other hand, reading more does have a tangible cost, as you saw from the benchmark I posted; reading 16K on my system cost 20% more than reading 8K, and used twice the buffer space. If I'm doing lots of really random reads, this would result in a performance loss (due to doing more I/O, and having less chance that the next item I want is in the buffer cache). For some reason I thought we had the ability to change the block size that postgres uses on a table-by-table basis, but I can't find anything in the docs about that. Maybe it's just because I saw some support in the code for it. But this feature would be a nice addition for those cases where a larger block size would help. 
But I think that 8K is a pretty good default, and I think that 32K blocks would result in a quite noticeable performance reduction for apps that did a lot of random I/O. > What people are actually suggesting with the read-ahead for sequential > scans is basically a larger block size for sequential scans than for > index scans. While this makes sense, it may be better to just increase > the block size overall. I don't think so, because the smaller block size is definitely better for random I/O. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
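[Editor's note: the torn-bit scheme described above is easy to see in miniature. The sketch below is an illustrative toy, not MS SQL Server's actual on-page layout: it steals the low bit of each 512-byte sector's last byte as the torn bit, whereas a real implementation relocates the displaced data bits elsewhere on the page.]

```python
# Toy model of torn-page detection: every 512-byte sector of a page
# carries a "torn bit" that is flipped to the same new value on each
# complete write.  After a crash, a half-written page is detected
# because its sectors disagree about the bit.
SECTOR = 512
PAGE = 8192

def write_page(old_page: bytes) -> bytearray:
    """Return a new page image with every sector's torn bit flipped.
    (Simplification: we overwrite the low bit of each sector's last
    byte; a real system stashes the displaced data bits elsewhere.)"""
    flipped = (old_page[SECTOR - 1] & 1) ^ 1
    new = bytearray(old_page)
    for off in range(0, PAGE, SECTOR):
        new[off + SECTOR - 1] = (new[off + SECTOR - 1] & 0xFE) | flipped
    return new

def is_torn(page: bytes) -> bool:
    """A page is torn if its sectors carry differing torn-bit values."""
    bits = {page[off + SECTOR - 1] & 1 for off in range(0, PAGE, SECTOR)}
    return len(bits) > 1

old = bytes(PAGE)        # previously written page, torn bits all 0
new = write_page(old)    # complete write: torn bits all 1
assert not is_torn(new)
# Simulate a crash where only the first half of the page hit the disk:
crashed = bytes(new[:PAGE // 2]) + old[PAGE // 2:]
assert is_torn(crashed)
```

The appeal of the scheme is that detecting a torn write costs only a one-bit comparison per sector, with no need to checksum the whole page.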
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Thu, 25 Apr 2002, Bruce Momjian wrote: > Also keep in mind most disks have 512 byte blocks, so even if the file > system is 8k, the disk block sizes are different. A given 8k or 1k file > system block may not even be all in the same cylinder. Right. Though much of the time they will be in the same cylinder, sometimes they will be in adjacent cylinders if the drive manufacturer has made cylinder sizes that are not multiples of 8K. I don't think this is terribly frequent, but there's no way to substantiate that assumption without knowing the real geometries of the drive, which generally are not given out. (What is reported to the OS has not been the real geometry for years now, because drive manufacturers long ago started putting more blocks on the outer cylinders than the inner ones.) However, even the assumption that they will be in adjacent cylinders doesn't always hold: depending on how the disk subsystems are partitioned, you might be crossing a boundary where two partitions are joined together, necessitating a seek. But this case would be quite rare. You can always find conditions, in modern drive subsystems, where the "read close together" idea doesn't hold, but in the vast, vast majority of circumstances it does. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Bruce Momjian wrote: > Now, with larger RAM and disk sizes, it may be time to consider larger > page sizes, like 32k pages. That reduces the granularity of the cache, > but it may have other performance advantages that would be worth it. > > What people are actually suggesting with the read-ahead for sequential > scans is basically a larger block size for sequential scans than for > index scans. While this makes sense, it may be better to just increase > the block size overall. I have seen performance improvements by using 16K blocks over 8K blocks in sequential scans of large tables. I am investigating the performance difference between 16K and 8K block sizes on one of my systems. I'll let you know what I see. I am using pgbench for generic performance levels. If you would like to see any extra data, just let me know. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Actually, this brings up a different point. We use 8k blocks now because at the time PostgreSQL was developed, it used BSD file systems, and those prefer 8k blocks, and there was some concept that an 8k write was atomic, though with 512 byte disk blocks, that was incorrect. (We knew that at the time too, but we didn't have any options, so we just hoped.) In fact, we now write pre-modified pages to WAL specifically because we can't be sure an 8k page write to disk will be atomic. Part of the page may make it to disk, and part may not. Now, with larger RAM and disk sizes, it may be time to consider larger page sizes, like 32k pages. That reduces the granularity of the cache, but it may have other performance advantages that would be worth it. What people are actually suggesting with the read-ahead for sequential scans is basically a larger block size for sequential scans than for index scans. While this makes sense, it may be better to just increase the block size overall. --- Curt Sampson wrote: > On Wed, 24 Apr 2002, Michael Loftis wrote: > > > A Block-sized read will not be broken up. But if you're reading in a > > size bigger than the underlying systems block sizes then it can get > > broken up. > > In which operating systems, and under what circumstances? > > I'll agree that some OSs may not coalesce adjacent reads into a > single read, but even so, submitting a bunch of single reads for > consecutive blocks is going to be much, much faster than if other, > random I/O occurred between those reads. > > > If the underlying > > block size is 8KB and you dump 4MB down on it, the OS may (and in many > > cases does) decide to write part of it, do a read on a nearby sector, > > then write the rest. This happens when doing long writes that end up > > spanning block groups because the inodes must be allocated. > > Um...we're talking about 64K vs 8K reads here, not 4 MB reads. I am > certainly not suggesting Postgres ever submit 4 MB read requests to the OS. 
> > I agree that any single-chunk reads or writes that cause non-adjacent > disk blocks to be accessed may be broken up. But in a sense, > they're "broken up" anyway, in that you have no choice but to take > a performance hit. > > > Further large read requests can of course be re-ordered by hardware. > > ...The OS also tags ICP, which can be re-ordered on block-sized chunks. > > Right. All the more reason to read in larger chunks when we know what we > need in advance, because that will give the OS, controllers, etc. more > advance information, and let them do the reads more efficiently. > > cjs > -- > Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org > Don't you know, in this new Dark Age, we're all light. --XTC > > > ---(end of broadcast)--- > TIP 4: Don't 'kill -9' the postmaster > -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Michael Loftis wrote: > A Block-sized read will not be broken up. But if you're reading in a > size bigger than the underlying systems block sizes then it can get > broken up. > > So yes a sequential read will get broken up. A single read request for > a block may or may not get broken up. If you're freading with set block > sizes you'll see the set sizes of blocks come through, but what the > underlying OS does is undefined, same for writing. If the underlying > block size is 8KB and you dump 4MB down on it, the OS may (and in many > cases does) decide to write part of it, do a read on a nearby sector, > then write the rest. This happens when doing long writes that end up > spanning block groups because the inodes must be allocated. Also keep in mind most disks have 512 byte blocks, so even if the file system is 8k, the disk block sizes are different. A given 8k or 1k file system block may not even be all in the same cylinder. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Wed, 24 Apr 2002, Michael Loftis wrote: > A Block-sized read will not be broken up. But if you're reading in a > size bigger than the underlying systems block sizes then it can get > broken up. In which operating systems, and under what circumstances? I'll agree that some OSs may not coalesce adjacent reads into a single read, but even so, submitting a bunch of single reads for consecutive blocks is going to be much, much faster than if other, random I/O occurred between those reads. > If the underlying > block size is 8KB and you dump 4MB down on it, the OS may (and in many > cases does) decide to write part of it, do a read on a nearby sector, > then write the rest. This happens when doing long writes that end up > spanning block groups because the inodes must be allocated. Um...we're talking about 64K vs 8K reads here, not 4 MB reads. I am certainly not suggesting Postgres ever submit 4 MB read requests to the OS. I agree that any single-chunk reads or writes that cause non-adjacent disk blocks to be accessed may be broken up. But in a sense, they're "broken up" anyway, in that you have no choice but to take a performance hit. > Further large read requests can of course be re-ordered by hardware. > ...The OS also tags ICP, which can be re-ordered on block-sized chunks. Right. All the more reason to read in larger chunks when we know what we need in advance, because that will give the OS, controllers, etc. more advance information, and let them do the reads more efficiently. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
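[Editor's note: the point above — that handing the OS one larger request up front gives the lower layers more advance information — can be illustrated with a small sketch. This assumes a POSIX system with `os.pread`; the block count and sizes are arbitrary.]

```python
import os
import tempfile

BLOCK = 8192  # 8K, the PostgreSQL default page size discussed above

def read_many(fd: int, start: int, nblocks: int) -> bytes:
    """Separate 8K requests: the kernel learns about each block only
    after the previous one has been issued."""
    return b"".join(
        os.pread(fd, BLOCK, start + i * BLOCK) for i in range(nblocks)
    )

def read_one(fd: int, start: int, nblocks: int) -> bytes:
    """One large request: the kernel (and controller) see the whole
    extent up front and can fetch it in a single sweep."""
    return os.pread(fd, nblocks * BLOCK, start)

# Both return identical data; the difference is purely in how much
# advance information the lower layers get, and in syscall count.
with tempfile.TemporaryFile() as f:
    f.write(os.urandom(16 * BLOCK))
    f.flush()
    assert read_one(f.fileno(), 0, 8) == read_many(f.fileno(), 0, 8)
```

Whether the eight small reads get coalesced into one physical I/O is exactly the OS-dependent question being argued in this thread; the single 64K request sidesteps it.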
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
A Block-sized read will not be broken up. But if you're reading in a size bigger than the underlying systems block sizes then it can get broken up. So yes a sequential read will get broken up. A single read request for a block may or may not get broken up. If you're freading with set block sizes you'll see the set sizes of blocks come through, but what the underlying OS does is undefined, same for writing. If the underlying block size is 8KB and you dump 4MB down on it, the OS may (and in many cases does) decide to write part of it, do a read on a nearby sector, then write the rest. This happens when doing long writes that end up spanning block groups because the inodes must be allocated. So the write WILL get broken up. Reads are under the same gun. It all depends on how big. To the application you may or may not see this (probably not, unless you're set non-blocking, because the kernel will just sleep you until your data is ready). Further large read requests can of course be re-ordered by hardware. Tagged Command Queueing on SCSI drives and RAIDs. The ICP Vortex cards I use in a number of systems have 64MB on-board cache. They quite happily, and often, re-order reads and writes when queueing them to keep things moving as fast as possible (Intel didn't buy them for their cards, they use the i960 as it is, Intel swiped them for their IP rights). The OS also tags commands it fires to the ICP, which can be re-ordered on block-sized chunks. Curt Sampson wrote: >On Wed, 24 Apr 2002, mlw wrote: > >>I am not arguing about whether or not they do it, I am saying it is >>not always possible. I/O requests do not remain in queue waiting for >>reordering indefinitely. >> > >It doesn't matter. When they go out to the disk they go out in >order. On every Unix-based OS I know of, and Novell Netware, if >you submit a single read request for consecutive blocks, those >blocks *will* be read sequentially, no matter what the system load. 
> >So to get back to the original argument: > The supposed advantage of a sequential read over a random read, in an active multitasking system, is a myth. If you are executing one query and the system is doing only that query, you may be right. > >No, it's very real, because your sequential read will not be broken up. > >If you think it will, let me know which operating systems this >happens on, and how exactly it happens. > >cjs > ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Wed, 24 Apr 2002, mlw wrote: > I am not arguing about whether or not they do it, I am saying it is > not always possible. I/O requests do not remain in queue waiting for > reordering indefinitely. It doesn't matter. When they go out to the disk they go out in order. On every Unix-based OS I know of, and Novell Netware, if you submit a single read request for consecutive blocks, those blocks *will* be read sequentially, no matter what the system load. So to get back to the original argument: > > >The supposed advantage of a sequential read over a random read, in > > >an active multitasking system, is a myth. If you are executing one > > >query and the system is doing only that query, you may be right. No, it's very real, because your sequential read will not be broken up. If you think it will, let me know which operating systems this happens on, and how exactly it happens. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
I was thinking of something independent from the executor, simply a variable that recommends for or against the use of a particular index; it could be obtained from the user, and so it could be improved (a factor lower than 1) by the planner. How about something like this? - Original Message - From: "Bruce Momjian" <[EMAIL PROTECTED]> To: "Luis Alberto Amigo Navarro" <[EMAIL PROTECTED]> Cc: "Lincoln Yeoh" <[EMAIL PROTECTED]>; "Tom Lane" <[EMAIL PROTECTED]>; "mlw" <[EMAIL PROTECTED]>; "Andrew Sullivan" <[EMAIL PROTECTED]>; "PostgreSQL-development" <[EMAIL PROTECTED]> Sent: Tuesday, April 23, 2002 6:42 PM Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE > Luis Alberto Amigo Navarro wrote: > > Hi All. > > I've been reading all the thread and I want to add a few points: > > > > You can set enable_seqscan=off in small or easy queries, but in large > > queries index can speed parts of the query and slow other, so I think it is > > necessary if you want Postgres to become a Wide-used DBMS that the planner > > could be able to decide accurately, in the thread there is a point that > > might be useful, it will be very interesting that the planner could learn > > with previous executions, even there could be a warm-up policy to let > > planner learn about how the DB is working, this info could be stored with DB > > data, and could statistically show how use of index or seqscan works on > > every column of the DB. > > Yes, I have always felt it would be good to feed back information from > the executor to the optimizer to help with later estimates. Of course, > I never figured out how to do it. :-) > > -- > Bruce Momjian| http://candle.pha.pa.us > [EMAIL PROTECTED] | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 > ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Curt Sampson wrote: > > On Tue, 23 Apr 2002, mlw wrote: > > > > On a system that has neither read-ahead nor sorting of I/O requests, > > > yes. Which systems are you using that provide neither of these > > > facilities? > > > > This only happens if the OS can organize the I/O requests in such a manner. It > > is a non-trivial function. > > Well, if you call less than 200 lines of code (including lots of > comments), "non-trivial," yes. Have a look at NetBSD's > src/sys/kern/subr_disk.c for one example implementation. > > But trivial or not, if all operating systems on which Postgres runs > are doing this, your point is, well, pointless. So, once again, which > systems are you using that do *not* do this? I am not arguing about whether or not they do it, I am saying it is not always possible. I/O requests do not remain in queue waiting for reordering indefinitely. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
At 12:41 PM 4/23/02 -0400, Bruce Momjian wrote: >This is an interesting point, that an index scan may fit in the cache >while a sequential scan may not. I can see cases where even an index >scan of a large percentage of the table may win over a sequential scan. >Interesting. Yes, and if it fits in the cache the random access costs drop by orders of magnitude, as shown by a recent benchmark someone posted where a Solaris box cached gigs of data[1]. That's why it might be useful to know what the crossover points between index scans and sequential scans are for various random_page_cost values. e.g. setting random_page_cost to 1 means the optimizer will use a sequential scan if it thinks an index scan will return 50% or more of the rows; set it to 0.5 for 75% or more, and so on. That's probably very simplistic, but basically some idea of what the optimizer will do given a random page cost could be helpful. Thanks, Link. [1] Mark Pritchard's benchmark, where you can see that from the 3rd try onwards random is actually faster than sequential after caching (TWICE as fast, too!). Machine = jedi (Sun E420, 3gb ram, dual 400s, test on single scsi disk)

Sequential
Bytes Read      Time      Bytes / Sec
2097152000      65.19     32167675.28
2097152000      65.22     32154114.65
2097152000      65.16     32182561.99
2097152000      65.12     32206105.12
2097152000      64.67     32429463.26
                          32227984.06 (avg)

Random
Bytes Read      Time      Bytes / Sec
4194304000      1522.22   2755394.79
4194304000      278.18    15077622.05
4194304000      91.43     45874730.07
4194304000      61.43     68273795.19
4194304000      54.55     76890231.51
                          41774354.72 (avg)

---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
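[Editor's note: the crossover question above can be made concrete with a toy cost model. This is an illustrative sketch, not the planner's actual formulas: it charges a sequential scan one unit per page, and charges an index scan one random heap-page fetch per matching tuple, ignoring index-page reads, CPU cost, and caching.]

```python
def seq_scan_cost(pages: float) -> float:
    """Every heap page read once, at sequential cost 1.0 per page."""
    return pages

def index_scan_cost(pages: float, tuples_per_page: float,
                    selectivity: float, random_page_cost: float) -> float:
    """Pessimistic: one random heap-page fetch per matching tuple."""
    return selectivity * pages * tuples_per_page * random_page_cost

def crossover_selectivity(tuples_per_page: float,
                          random_page_cost: float) -> float:
    """Selectivity at which the two estimates are equal ('pages'
    cancels out of the equation)."""
    return 1.0 / (tuples_per_page * random_page_cost)

# With 20 tuples per page and random_page_cost = 4, this crude model
# already favors the sequential scan above 1.25% selectivity...
assert abs(crossover_selectivity(20, 4.0) - 0.0125) < 1e-12
# ...while dropping random_page_cost to 1 moves the crossover to 5%.
assert abs(crossover_selectivity(20, 1.0) - 0.05) < 1e-12
```

A real planner is far less pessimistic, because consecutive matching tuples often share a heap page and because pages may already be cached — which is exactly the effect the message above is asking about.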
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Tue, 23 Apr 2002, mlw wrote: > > On a system that has neither read-ahead nor sorting of I/O requests, > > yes. Which systems are you using that provide neither of these > > facilities? > > This only happens if the OS can organize the I/O requests in such a manner. It > is a non-trivial function. Well, if you call less than 200 lines of code (including lots of comments), "non-trivial," yes. Have a look at NetBSD's src/sys/kern/subr_disk.c for one example implementation. But trivial or not, if all operating systems on which Postgres runs are doing this, your point is, well, pointless. So, once again, which systems are you using that do *not* do this? > > Invariably a process or thread will lose its quantum when it submits > > an I/O request. (There's nothing left for it to do, since it's waiting > > for its data to be read, so there's nothing for it to execute.) > > This statement is verifiably false. What a program does after it > submits an I/O request is VERY OS and state specific. If an I/O > request is made for a disk block, which is in read-ahead cache, a > number of operating systems may return right away. Sorry, we were working at different levels. You are thinking of generating an I/O request on the logical level, via a system call. I was referring to generating a physical I/O request, which a logical I/O request may or may not do. So if you would please go back and tackle my argument again, based on my clarifications above. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
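[Editor's note: the I/O-request sorting referred to above can be sketched as a one-way elevator. The class below is a drastically simplified illustration of the idea, not a transcription of NetBSD's disksort() (which also handles priorities and per-buffer bookkeeping): requests at or beyond the head's position join the current sweep in sorted order, and requests behind it wait for the next sweep.]

```python
import bisect

class ElevatorQueue:
    """Toy one-way elevator scheduler over block numbers."""

    def __init__(self):
        self.head = 0          # block the head last serviced
        self.current = []      # sorted blocks at/after the head
        self.next_sweep = []   # sorted blocks behind the head

    def add(self, block):
        """Insert a request into the proper sweep, keeping sort order."""
        if block >= self.head:
            bisect.insort(self.current, block)
        else:
            bisect.insort(self.next_sweep, block)

    def pop(self):
        """Service the next request in ascending-block order."""
        if not self.current:   # sweep done: wrap around, start the next
            self.current, self.next_sweep = self.next_sweep, []
        if not self.current:
            return None        # queue empty
        self.head = self.current.pop(0)
        return self.head

# Requests arriving in arbitrary order are serviced in one disk sweep:
q = ElevatorQueue()
for block in (7, 2, 9, 4):
    q.add(block)
assert [q.pop() for _ in range(4)] == [2, 4, 7, 9]
```

This is the whole trick: the queue, not the submitter, decides physical order, so random requests from many processes still reach the platter in mostly ascending runs.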
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Curt Sampson wrote: > > On Thu, 18 Apr 2002, Michael Loftis wrote: > > > mlw wrote: > > > > >The supposed advantage of a sequential read over a random read, in > > >an active multitasking system, is a myth. If you are executing one > > >query and the system is doing only that query, you may be right. > > > > > >Execute a number of queries at the same time, the expected benefit > > >of a sequential scan goes out the window. The OS will be fetching > > >blocks, more or less, at random. > > On a system that has neither read-ahead nor sorting of I/O requests, > yes. Which systems are you using that provide neither of these > facilities? This only happens if the OS can organize the I/O requests in such a manner. It is a non-trivial function. > > > In a multi-tasking system it's always cheaper to fetch less blocks, no > > matter where they are. Because, as you said, it will end up more or > > less random on a system experiencing a larger number of queries. > > Invariably a process or thread will lose its quantum when it submits > an I/O request. (There's nothing left for it to do, since it's waiting > for its data to be read, so there's nothing for it to execute.) This statement is verifiably false. What a program does after it submits an I/O request is VERY OS and state specific. If an I/O request is made for a disk block, which is in read-ahead cache, a number of operating systems may return right away. ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Lincoln Yeoh wrote: > > At 10:48 AM 4/18/02 -0400, mlw wrote: > >Bruce Momjian wrote: > > > > > > Have you tried reducing 'random_page_cost' in postgresql.conf. That > > > should solve most of your problems if you would like more index scans. > > > >My random page cost is 1 :-) > > What happens when you set random page cost to 1? Between an index scan of > 50% of a table and a full table scan which would the optimizer pick? With > it at 1, what percentage would be the switchover point? I am no longer working on the project. Alas, the company is no more. Anyone want to buy it? :-) > I'm just wondering why not just use enable_seqscan=false for those > problematic queries as a "hint"? Unless your query does need some seq scans > as well? I am the architect, thus only one of the developers. It was easier, and safer, to make sure sequential scans did not get executed on a global basis. It would be disastrous if the development version of the database did not do a sequential scan, but the live version did. (This did happen to us once. Another point of PostgreSQL vs Index frustration.) The risk was minimal if a live query erroneously used an index, but the consequences, at least in our application, would be a 1~2 minute PostgreSQL query. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Luis Alberto Amigo Navarro wrote: > Hi All. > I've been reading all the thread and I want to add a few points: > > You can set enable_seqscan=off in small or easy queries, but in large > queries index can speed parts of the query and slow other, so I think it is > necessary if you want Postgres to become a Wide-used DBMS that the planner > could be able to decide accurately, in the thread there is a point that > might be useful, it will be very interesting that the planner could learn > with previous executions, even there could be a warm-up policy to let > planner learn about how the DB is working, this info could be stored with DB > data, and could statistically show how use of index or seqscan works on > every column of the DB. Yes, I have always felt it would be good to feed back information from the executor to the optimizer to help with later estimates. Of course, I never figured out how to do it. :-) -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Lincoln Yeoh wrote: > At 10:48 AM 4/18/02 -0400, mlw wrote: > >Bruce Momjian wrote: > > > > > > Have you tried reducing 'random_page_cost' in postgresql.conf. That > > > should solve most of your problems if you would like more index scans. > > > >My random page cost is 1 :-) > > What happens when you set random page cost to 1? Between an index scan of > 50% of a table and a full table scan which would the optimizer pick? With > it at 1, what percentage would be the switchover point? > > Because I'm thinking that for _repeated_ queries when there is caching the > random page cost for "small" selections may be very low after the first > very costly select (may not be that costly for smart SCSI drives). So > selecting 10% of a table randomly may not be that costly after the first > select. Whereas for sequential scans 100% of the table must fit in the > cache. If the cache is big enough then whichever results in selecting less > should be faster ( noting that typically sequential RAM reads are faster > than random RAM reads ). If the cache is not big enough then selecting less > may be better up till the point where the total amount repeatedly selected > cannot be cached, in which case sequential scans should be better. This is > of course for queries in serial, not queries in parallel. How would one > take these issues into account in an optimizer? This is an interesting point, that an index scan may fit in the cache while a sequential scan may not. I can see cases where even an index scan of a large percentage of the table may win over a sequential scan. Interesting. Determining that, especially in a multi-user environment, is quite difficult. We do have 'effective_cache_size', which does try to determine how much of the I/O will have to go to disk and how much may fit in the cache, but it is quite a fuzzy number. 
-- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
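[Editor's note: the cache observation above can be given the same toy-model treatment. This is a deliberately crude sketch — an assumed all-or-nothing cache, not PostgreSQL's effective_cache_size math: cached page fetches are treated as free, and misses pay the full random_page_cost.]

```python
def cached_index_scan_cost(pages: float, tuples_per_page: float,
                           selectivity: float, random_page_cost: float,
                           cache_hit_fraction: float) -> float:
    """Toy estimate: one heap-page fetch per matching tuple, where a
    cached fetch costs ~0 and a miss pays the full random_page_cost."""
    fetches = selectivity * pages * tuples_per_page
    return fetches * (1.0 - cache_hit_fraction) * random_page_cost

# A 1000-page table, 20 tuples/page, random_page_cost = 4: with a cold
# cache, a 50%-selective index scan is hopeless next to 1000 sequential
# page reads...
assert cached_index_scan_cost(1000, 20, 0.5, 4.0, 0.0) == 40000.0
# ...but with 99% of fetches hitting cache it beats the sequential
# scan, which is the case described above.
assert cached_index_scan_cost(1000, 20, 0.5, 4.0, 0.99) < 1000
```

The hard part, as noted in the message, is that cache_hit_fraction is exactly the "fuzzy number" a planner cannot observe directly, especially on a multi-user system.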
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Hi All. I've been reading all the thread and I want to add a few points: You can set enable_seqscan=off in small or easy queries, but in large queries index can speed parts of the query and slow other, so I think it is necessary if you want Postgres to become a Wide-used DBMS that the planner could be able to decide accurately, in the thread there is a point that might be useful, it will be very interesting that the planner could learn with previous executions, even there could be a warm-up policy to let planner learn about how the DB is working, this info could be stored with DB data, and could statistically show how use of index or seqscan works on every column of the DB. I think it will be useful to hear from all users and not be guided only by our own experience; the main objective is to make a versatile DBMS. It's very easy to judge the need for improving indexes by single selects, but a lot of us are not doing single selects, so I think that point needs to be heard. Regards ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Lincoln Yeoh <[EMAIL PROTECTED]> writes: > ...By the way, are updates treated the same as selects by the optimizer? Yeah. The writes must occur in any case, so I see no reason why the optimizer should worry about them. All it needs to consider are the cycles used by the various alternatives for fetching the data. So the problem is isomorphic to a SELECT. This assumption is really wired quite fundamentally into the optimizer, but I'm not sure if it's made clear anywhere in the documentation. Can anyone suggest the right place to describe it? regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
At 10:48 AM 4/18/02 -0400, mlw wrote: >Bruce Momjian wrote: > > > > Have you tried reducing 'random_page_cost' in postgresql.conf. That > > should solve most of your problems if you would like more index scans. > >My random page cost is 1 :-) What happens when you set random page cost to 1? Between an index scan of 50% of a table and a full table scan which would the optimizer pick? With it at 1, what percentage would be the switchover point? Because I'm thinking that for _repeated_ queries when there is caching the random page cost for "small" selections may be very low after the first very costly select (may not be that costly for smart SCSI drives). So selecting 10% of a table randomly may not be that costly after the first select. Whereas for sequential scans 100% of the table must fit in the cache. If the cache is big enough then whichever results in selecting less should be faster ( noting that typically sequential RAM reads are faster than random RAM reads ). If the cache is not big enough then selecting less may be better up till the point where the total amount repeatedly selected cannot be cached, in which case sequential scans should be better. This is of course for queries in serial, not queries in parallel. How would one take these issues into account in an optimizer? Mark's problems with the optimizer seem to be something else tho: statistics off. >I had a database where I had to have "enable_seqscan=false" in the config >file. >The nature of the data always makes the statistics bogus, and it always >refused >to use the index. >My one most important experience (I've had more than one) with this whole >topic >is DMN's music database, when PostgreSQL uses the index, the query executes in >a fraction of a second. When "enable_seqscan=true" PostgreSQL refuses to use >the index, and the query takes a about a minute. No matter how much I analyze, >I have to disable sequential scan for the system to work correctly. 
I'm just wondering why not just use enable_seqscan=false for those problematic queries as a "hint"? Unless your query does need some seq scans as well? By the way, are updates treated the same as selects by the optimizer? Regards, Link. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On 04/18/2002 12:41:15 PM tycho wrote: > > Don't know if the optimizer takes this into consideration, but a query that > uses a primary and/or unique key in the where-clause, should always choose to > use > > the related indices (assuming the table size is above a certain threshold). > Since a primary key/unique index always restricts the resultset to a single > row. > > I don't think so. > > eg. table with primary key "pk", taking values from 1 to 100 (so > 100 records) > > select * from table where pk > 5 > > should probably not use the index ... Oops, you're right of course. Rephrase the above as 'a query that uses a primary key to uniquely qualify a single row' (which pretty much restricts it to the = operator with a constant). Still, this is probably a fairly common case. Maarten Maarten Boekhold, [EMAIL PROTECTED] Reuters Consulting Dubai Media City Building 1, 5th Floor PO Box 1426 Dubai, United Arab Emirates tel:+971(0)4 3918300 ext 249 fax:+971(0)4 3918333 mob:+971(0)505526539 -- -- Visit our Internet site at http://www.reuters.com Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Thu, 18 Apr 2002, Michael Loftis wrote:

> mlw wrote:
> > The supposed advantage of a sequential read over a random read, in
> > an active multitasking system, is a myth. If you are executing one
> > query and the system is doing only that query, you may be right.
> >
> > Execute a number of queries at the same time, the expected benefit
> > of a sequential scan goes out the window. The OS will be fetching
> > blocks, more or less, at random.

On a system that has neither read-ahead nor sorting of I/O requests, yes. Which systems are you using that provide neither of these facilities?

> In a multi-tasking system it's always cheaper to fetch fewer blocks, no
> matter where they are. Because, as you said, it will end up more or
> less random on a system experiencing a larger number of queries.

Invariably a process or thread will lose its quantum when it submits an I/O request. (There's nothing left for it to do, since it's waiting for its data to be read, so there's nothing for it to execute.) It receives its next quantum when the data are available, and then it may begin processing the data. There are two possibilities at this point:

a) The process will complete its processing of the current blocks of data and submit an I/O request. In this case, you would certainly have seen better performance (assuming you're not CPU-bound--see below) had you read more, because you would have processed more in that quantum instead of stopping and waiting for more I/O.

b) In that quantum you cannot complete processing the blocks read because you don't have any more CPU time left. In this case there are two possibilities:

   i) You're CPU bound, in which case better disk performance makes no difference anyway, or

   ii) You are likely to find the blocks still in memory when you get your next quantum. (Unless you don't have enough memory in the system, in which case you should fix that before you spend any more time or money on tuning disk performance.)
So basically, it's only cheaper to fetch fewer blocks all the time if you're doing large amounts of I/O and have relatively little memory. The latter case is getting more and more rare as time goes on. I'd say at this point that anybody interested in performance is likely to have at least 256 MB of memory, which means you're going to need a fairly large database and a lot of workload before that becomes the problem. cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Thu, 18 Apr 2002, mlw wrote: > The days when "head movement" is relevant are long over. Not a single drive > sold today, or in the last 5 years, is a simple spindle/head system. > The assumption that sequentially reading a file from a modern disk drive means > that the head will move less often is largely bogus. Well, oddly enough, even with the head moving just as often, sequential I/O has always been much faster than random I/O on every drive I've owned in the past five years. So I guess I/O speed doesn't have a lot to do with head movement or something. Some of my drives have started to "chatter" quite noisily during random I/O, too. I thought that this was due to the head movement, but I guess not, since they're quite silent during sequential I/O. BTW, what sort of benchmarking did you do to determine that the head movement is similar during random and sequential I/O on drives in the last five years or so? cjs -- Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Got some numbers now... You'll notice the random reads are *really* slow. The reason for this is that the particular read sizes being used are the absolute worst case for my particular configuration. (With a 32kb or 64kb block size I generally achieve much higher performance, even on random I/O.) Sequential I/O is most likely being limited, at least in part, by the CPU power available...

Sequential tests:
2147483648 bytes read in 39.716158 seconds    54070780.16 bytes/sec
2147483648 bytes read in 37.836187 seconds    56757401.27 bytes/sec
2147483648 bytes read in 38.081452 seconds    56391853.13 bytes/sec
2147483648 bytes read in 38.122105 seconds    56331717.46 bytes/sec
2147483648 bytes read in 38.303999 seconds    56064215.33 bytes/sec
Total: 192.059901 seconds    279615967.4 (mumble)
Ave:    38.4119802 seconds    55923193.47 bytes/sec

Random tests:
2147483648 bytes read in 1744.002332 seconds    1231353.656 bytes/sec
2147483648 bytes read in 1744.797705 seconds    1230792.339 bytes/sec
2147483648 bytes read in 1741.577362 seconds    1233068.191 bytes/sec
2147483648 bytes read in 1741.497690 seconds    1233124.603 bytes/sec
2147483648 bytes read in 1739.773354 seconds    1234346.786 bytes/sec
Total: 8711.648443 seconds    6162685.575
Ave:   1742.329689 seconds    1232537.115 bytes/sec

So on this machine, at that block I/O level (8kb blocks, I believe), I have a ~55MB/sec sequential read rate and a ~1.2MB/sec random read rate. Like I said, though, I'm fairly certain the random read rates were worst case because of the particular block size this system's configuration uses. But I feel that the results are respectable and valid nonetheless.

Note how the random reads kept getting better... The ICP and drive caching firmware were starting to 'catch on' that this 2gb file was a hot spot, so they were preferring to cache things a little longer and pre-fetch in a different order than normal. I estimate the time would have dropped as low as 1700 seconds if allowed to keep going.

RAW output from my script...
mloftis@free:/mnt/rz01/ml01/rndtst$ sh PAGECOST2GB.sh
CREATING FILE
Thu Apr 18 09:11:55 PDT 2002
Creating test file 2gb.test of 2048 mb
      176.23 real        22.75 user        34.72 sys
BEGINNING SEQUENTIAL TESTS
Thu Apr 18 09:14:51 PDT 2002
Sequential read test of 2gb.test
2147483648 bytes read in 39.716158 seconds
       39.73 real         1.52 user        23.87 sys
Sequential read test of 2gb.test
2147483648 bytes read in 37.836187 seconds
       37.83 real         1.44 user        23.68 sys
Sequential read test of 2gb.test
2147483648 bytes read in 38.081452 seconds
       38.08 real         1.62 user        23.51 sys
Sequential read test of 2gb.test
2147483648 bytes read in 38.122105 seconds
       38.12 real         1.63 user        23.50 sys
Sequential read test of 2gb.test
2147483648 bytes read in 38.303999 seconds
       38.30 real         1.32 user        23.83 sys
Thu Apr 18 09:18:03 PDT 2002
BEGINNING RANDOM READ TESTS
Random read test of 2gb.test for 2048 mb
2147483648 bytes read in 1744.002332 seconds
     1744.01 real         4.33 user        36.47 sys
Random read test of 2gb.test for 2048 mb
2147483648 bytes read in 1744.797705 seconds
     1744.81 real         4.38 user        36.56 sys
Random read test of 2gb.test for 2048 mb
2147483648 bytes read in 1741.577362 seconds
     1741.58 real         4.58 user        36.18 sys
Random read test of 2gb.test for 2048 mb
2147483648 bytes read in 1741.497690 seconds
     1741.50 real         4.17 user        36.57 sys
Random read test of 2gb.test for 2048 mb
2147483648 bytes read in 1739.773354 seconds
     1739.78 real         4.41 user        36.36 sys
TESTS COMPLETED
Thu Apr 18 11:43:15 PDT 2002

Michael Loftis wrote:

> Numbers being run on a BSD box now...
>
> FreeBSD 4.3-p27 512MB RAM 2xPiii600 Xeon on a 4 disk RAID 5 ARRAY on a
> dedicated ICP Vortex card.  Sorry, no single drives on this box; I have
> an outboard Silicon Gear Mercury on a motherboard-based Adaptec
> controller I can test as well.  I'll post when the tests on the Vortex
> are done.  I'm using 2Gb files ATM; I'll look at the code and see if
> it can be made to work with large files.  At least for FreeBSD the
> change will mostly be a matter of doing s/fseek/fseeko/g and
> s/size_t/off_t/g or something similar.  FreeBSD seems to prefer the
> Open Unix standard in this regard...
>
> This will make it usable for much larger test files.
>
> Tom Lane wrote:
>
>> Michael Loftis <[EMAIL PROTECTED]> writes:
>>
>>> Something's wrong with the random numbers from the sun... re-run
>>> them, that first sample is insane.  Caching looks like it's
>>> affecting your results a lot...
>>
>> Yeah; it looks like the test case is not large enough to swamp out
>> caching effects on the Sun box.  It is on the Linux box, evidently,
>> since the 10:1 ratio appears very repeatable.
>>
>> regards, tom lane

---(end of broadcast)---
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Indeed - I had a delayed post (sent from the wrong email address) which mentioned that the cache is obviously at play here. I still find it amazing that the file system would cache 2gb :) The numbers are definitely correct though... they are actually the second set. I'm running a test with a larger file size to remove the cache effects (having realised that ulimit is the biz). Will post again shortly.

Tom - have you had a chance to look at the test prg I wrote? Is it working as desired?

Cheers, Mark

On Fri, 2002-04-19 at 00:56, Tom Lane wrote:
> Michael Loftis <[EMAIL PROTECTED]> writes:
> > Something's wrong with the random numbers from the sun... re-run them,
> > that first sample is insane.  Caching looks like it's affecting your
> > results a lot...
>
> Yeah; it looks like the test case is not large enough to swamp out
> caching effects on the Sun box.  It is on the Linux box, evidently,
> since the 10:1 ratio appears very repeatable.
>
> regards, tom lane
>
> ---(end of broadcast)---
> TIP 4: Don't 'kill -9' the postmaster

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Adrian 'Dagurashibanipal' von Bidder wrote:
>
> On Wed, 2002-04-17 at 19:43, Tom Lane wrote:
> > Hannu Krosing <[EMAIL PROTECTED]> writes:
> > > OTOH, it is also important where the file is on disk. As seen from disk
> > > speed test graphs on http://www.tomshardware.com , the speed difference
> > > of sequential reads is 1.5 to 2.5 between inner and outer tracks.
> >
> > True.  But if we use the same test file for both the sequential and
> > random-access timings, hopefully the absolute speed of access will
> > cancel out.  (Again, it's the sort of thing that could use some
> > real-world testing...)
>
> Not so sure about that. Random access basically measures latency,
> sequential access measures transfer speed. I'd argue that latency is
> more or less constant across the disk as it depends on head movement and
> the spindle turning.

The days when "head movement" is relevant are long over. Not a single drive sold today, or in the last 5 years, is a simple spindle/head system. Many are RLL encoded, some have RAID features across the various platters inside the drive. Many have dynamic remapping of sectors, all of them have internal caching, some of them have predictive read-ahead, some even use compression. The assumption that sequentially reading a file from a modern disk drive means that the head will move less often is largely bogus.

Now factor in a full RAID system where you have 8 of these disks. Random access of a drive may be slower than sequential access, but this has less to do with the drive, and more to do with OS level caching and I/O channel hardware. Factor in a busy multitasking system, and you have no way to really predict the state of a drive from one read to the next. (Rotational speed of the drive is still important in that it affects internal rotational alignment and data transfer.)

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Numbers being run on a BSD box now...

FreeBSD 4.3-p27 512MB RAM 2xPiii600 Xeon on a 4 disk RAID 5 ARRAY on a dedicated ICP Vortex card. Sorry, no single drives on this box; I have an outboard Silicon Gear Mercury on a motherboard-based Adaptec controller I can test as well. I'll post when the tests on the Vortex are done. I'm using 2Gb files ATM; I'll look at the code and see if it can be made to work with large files. At least for FreeBSD the change will mostly be a matter of doing s/fseek/fseeko/g and s/size_t/off_t/g or something similar. FreeBSD seems to prefer the Open Unix standard in this regard...

This will make it usable for much larger test files.

Tom Lane wrote:

>Michael Loftis <[EMAIL PROTECTED]> writes:
>
>>Something's wrong with the random numbers from the sun... re-run them,
>>that first sample is insane.  Caching looks like it's affecting your
>>results a lot...
>
>Yeah; it looks like the test case is not large enough to swamp out
>caching effects on the Sun box.  It is on the Linux box, evidently,
>since the 10:1 ratio appears very repeatable.
>
> regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw <[EMAIL PROTECTED]> writes:
> My one most important experience (I've had more than one) with this
> whole topic is DMN's music database: when PostgreSQL uses the index,
> the query executes in a fraction of a second. When
> "enable_seqscan=true" PostgreSQL refuses to use the index, and the
> query takes about a minute. No matter how much I analyze, I have to
> disable sequential scan for the system to work correctly.

It would be useful to see "explain analyze" not just "explain" for these cases. Also, what stats does pg_stats show for the variables used?

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw wrote: > Bruce Momjian wrote: > > > > mlw wrote: > > > I don't think we will agree, we have seen different behaviors, and our > > > experiences seem to conflict. This however does not mean that either of us is > > > in error, it just may mean that we use data with very different > > > characteristics. > > > > > > This thread is kind of frustrating for me because over the last couple years I > > > have seen this problem many times and the answer is always the same, "The > > > statistics need to be improved." Tom, you and I have gone back and forth about > > > this more than once. > > > > > > > Have you tried reducing 'random_page_cost' in postgresql.conf. That > > should solve most of your problems if you would like more index scans. > > My random page cost is 1 :-) Have you tried < 1. Seems that may work well for your case. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
...
> My one most important experience (I've had more than one) with this whole topic
> is DMN's music database: when PostgreSQL uses the index, the query executes in
> a fraction of a second. When "enable_seqscan=true" PostgreSQL refuses to use
> the index, and the query takes about a minute. No matter how much I analyze,
> I have to disable sequential scan for the system to work correctly.

How about contributing the data and a query? We've all got things that we would like to change or adjust in the PostgreSQL feature set. If you can't contribute code, how about organizing some choice datasets for testing purposes? If the accumulated set is too big for postgresql.org (probably not, but...) I can host them on my machine.

Most folks seem to not have to manipulate the optimizer to get good results nowadays. So to make more progress we need to have test cases...

- Thomas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Michael Loftis <[EMAIL PROTECTED]> writes:
> Something's wrong with the random numbers from the sun... re-run them,
> that first sample is insane.  Caching looks like it's affecting your
> results a lot...

Yeah; it looks like the test case is not large enough to swamp out caching effects on the Sun box. It is on the Linux box, evidently, since the 10:1 ratio appears very repeatable.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Bruce Momjian wrote:
>
> mlw wrote:
> > I don't think we will agree, we have seen different behaviors, and our
> > experiences seem to conflict. This however does not mean that either of us
> > is in error, it just may mean that we use data with very different
> > characteristics.
> >
> > This thread is kind of frustrating for me because over the last couple
> > years I have seen this problem many times and the answer is always the
> > same, "The statistics need to be improved." Tom, you and I have gone back
> > and forth about this more than once.
>
> Have you tried reducing 'random_page_cost' in postgresql.conf.  That
> should solve most of your problems if you would like more index scans.

My random page cost is 1 :-)

I had a database where I had to have "enable_seqscan=false" in the config file. The nature of the data always makes the statistics bogus, and it always refused to use the index.

It is frustrating because sometimes it *is* a problem for some unknown number of users (including myself), as evidenced by the perennial "why isn't postgres using my index" posts, and for the last two years you guys keep saying it isn't a problem, or that the statistics just need improvement. Sorry for my tone, but I have pulled out my hair numerous times on this very problem.

This whole process has led me to change my mind. I don't think adding weight to an index scan is the answer; I think having the ability to submit hints to the planner is the only way to really address this or any future issues.

Just so you understand my perspective, I am not thinking of the average web monkey. I am thinking of the expert DBA or architect who wants to deploy a system and needs to have real control over performance in critical areas.

My one most important experience (I've had more than one) with this whole topic is DMN's music database: when PostgreSQL uses the index, the query executes in a fraction of a second.
When "enable_seqscan=true" PostgreSQL refuses to use the index, and the query takes about a minute. No matter how much I analyze, I have to disable sequential scan for the system to work correctly.

cdinfo=# set enable_seqscan=false ;
SET VARIABLE
cdinfo=# explain select * from ztitles, zsong where ztitles.muzenbr = zsong.muzenbr and ztitles.artistid = 100 ;
NOTICE:  QUERY PLAN:

Merge Join  (cost=3134.95..242643.42 rows=32426 width=356)
  ->  Sort  (cost=3134.95..3134.95 rows=3532 width=304)
        ->  Index Scan using ztitles_artistid on ztitles  (cost=0.00..3126.62 rows=3532 width=304)
  ->  Index Scan using zsong_muzenbr on zsong  (cost=0.00..237787.51 rows=4298882 width=52)

EXPLAIN
cdinfo=# set enable_seqscan=true ;
SET VARIABLE
cdinfo=# explain select * from ztitles, zsong where ztitles.muzenbr = zsong.muzenbr and ztitles.artistid = 100 ;
NOTICE:  QUERY PLAN:

Hash Join  (cost=3126.97..61889.37 rows=32426 width=356)
  ->  Seq Scan on zsong  (cost=0.00..52312.66 rows=4298882 width=52)
  ->  Hash  (cost=3126.62..3126.62 rows=3532 width=304)
        ->  Index Scan using ztitles_artistid on ztitles  (cost=0.00..3126.62 rows=3532 width=304)

EXPLAIN
cdinfo=# select count(*) from zsong ;
  count
---------
 4298882
(1 row)

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Something's wrong with the random numbers from the sun... re-run them, that first sample is insane.  Caching looks like it's affecting your results a lot...

Mark Pritchard wrote:

>I threw together the attached program (compiles fine with gcc 2.95.2 on
>Solaris 2.6 and egcs-2.91.66 on RedHat Linux 6.2) and ran it a few
>times. Data is below. Usual disclaimers about hastily written code etc :)
>
>Machine = ghoul (generic intel, 384mb ram, dual p3-800, ide disk running dma)
>
>Sequential
>Bytes Read      Time      Bytes / Sec
>536870912       27.14     19783933.74
>536870912       27.14     19783990.60
>536870912       27.11     19801872.14
>536870912       26.92     19942928.41
>536870912       27.31     19657408.43
>                          19794026.66 (avg)
>
>Random
>Bytes Read      Time      Bytes / Sec
>1073741824      519.57    2066589.21
>1073741824      517.78    2073751.44
>1073741824      516.92    2077193.23
>1073741824      513.18    2092333.29
>1073741824      510.68    2102579.88
>                          2082489.41 (avg)
>
>Machine = jedi (Sun E420, 3gb ram, dual 400s, test on single scsi disk)
>
>Sequential
>Bytes Read      Time      Bytes / Sec
>2097152000      65.19     32167675.28
>2097152000      65.22     32154114.65
>2097152000      65.16     32182561.99
>2097152000      65.12     32206105.12
>2097152000      64.67     32429463.26
>                          32227984.06 (avg)
>
>Random
>Bytes Read      Time      Bytes / Sec
>4194304000      1522.22   2755394.79
>4194304000      278.18    15077622.05
>4194304000      91.43     45874730.07
>4194304000      61.43     68273795.19
>4194304000      54.55     76890231.51
>                          41774354.72
>
>If I interpret Tom's "divide" instruction correctly, is that a factor of
>10 on the linux box?

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Finally someone writes down what's been itching at my brain for a while. In a multi-tasking system it's always cheaper to fetch fewer blocks, no matter where they are. Because, as you said, it will end up more or less random on a system experiencing a larger number of queries.

mlw wrote:

>Bruce Momjian wrote:
>
>>My second point, that index scan is more risky than sequential scan, is
>>outlined above.  A sequential scan reads each page once, and uses the
>>file system read-ahead code to prefetch the disk buffers.  Index scans
>>are random, and could easily re-read disk pages to plow through a
>>significant portion of the table, and because the reads are random,
>>the file system will not prefetch the rows so the index scan will have
>>to wait for each non-cache-resident row to come in from disk.
>
>It took a bike ride to think about this one. The supposed advantage of a
>sequential read over a random read, in an active multitasking system, is a
>myth. If you are executing one query and the system is doing only that query,
>you may be right.
>
>Execute a number of queries at the same time, and the expected benefit of a
>sequential scan goes out the window. The OS will be fetching blocks, more or
>less, at random.

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Tom Lane wrote: > 2. The tag approach presumes that the query programmer is smarter > than the planner. This might be true under ideal circumstances, > but I have a hard time crediting that the planner looking at today's > stats is dumber than the junior programmer who left two years ago, > and no one's updated his query since then. The planner may not be > very bright, but it doesn't get bored, tired, or sick, nor move on > to the next opportunity. It will pick the best plan it can on the > basis of current statistics and the specific values appearing in > the given query. Every time. A tag-forced query plan doesn't > have that adaptability. Add to this that hand tuning would happem mostly queries where the two cost estimates are fairly close, and add the variability of a multi-user environment, a hard-coded plan may turn out to be faster only some of the time, and could change very quickly into something longer if the table changes. My point is that very close cases are the ones most likely to change over time. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw wrote: > I don't think we will agree, we have seen different behaviors, and our > experiences seem to conflict. This however does not mean that either of us is > in error, it just may mean that we use data with very different > characteristics. > > This thread is kind of frustrating for me because over the last couple years I > have seen this problem many times and the answer is always the same, "The > statistics need to be improved." Tom, you and I have gone back and forth about > this more than once. > Have you tried reducing 'random_page_cost' in postgresql.conf. That should solve most of your problems if you would like more index scans. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
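For anyone following along, the setting Bruce refers to lives in postgresql.conf; the 2.0 below is purely an illustrative value (the shipped default in this era was 4.0):

```
# postgresql.conf -- relative cost of a nonsequentially-fetched page;
# lower values bias the planner toward index scans (default 4.0)
random_page_cost = 2.0
```

It can also be tried per-session with `SET random_page_cost = 2.0;` before committing to a global change.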
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Wed, 2002-04-17 at 19:43, Tom Lane wrote: > Hannu Krosing <[EMAIL PROTECTED]> writes: > > OTOH, it is also important where the file is on disk. As seen from disk > > speed test graphs on http://www.tomshardware.com , the speed difference > > of sequential reads is 1.5 to 2.5 between inner and outer tracks. > > True. But if we use the same test file for both the sequential and > random-access timings, hopefully the absolute speed of access will > cancel out. (Again, it's the sort of thing that could use some > real-world testing...) Not so sure about that. Random access basically measures latency, sequential access measures transfer speed. I'd argue that latency is more or less constant across the disk as it depends on head movement and the spindle turning. cheers -- vbi signature.asc Description: This is a digitally signed message part
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Tom Lane wrote: > By and large this argument reminds me of the "compiler versus hand- > programmed assembler" argument. Which was pretty much a dead issue > when I was an undergrad, more years ago than I care to admit in a > public forum. Yes, a competent programmer who's willing to work > hard can out-code a compiler over small stretches of code. But no > one tries to write large systems in assembler anymore. Hand-tuned > SQL is up against that same John-Henry-vs-the-steam-hammer logic. > Maybe the current PG optimizer isn't quite in the steam hammer > league yet, but it will get there someday. I'm more interested > in revving up the optimizer than in betting on John Henry. I am not suggesting that anyone is going to write each and every query with hints, but a few select queries, yes, people will want to hand tune them. You are right no one uses assembler to create big systems, but big systems often have spot optimizations in assembler. Even PostgreSQL has assembler in it. No generic solution can be perfect for every specific application. There will always be times when hand tuning a query will produce better results, and sometimes that will make the difference between using PostgreSQL or use something else. For the two years I have been subscribed to this list, this is a fairly constant problem, and the answer is always the same, in effect, "we're working on it." If PostgreSQL had the ability to accept hints, one could say, "We are always working to improve it, but in your case you may want to give the optimizer a hint as to what you expect it to do." It may not be the "best" solution in your mind, but speaking as a long time user of PostgreSQL, it would be a huge help to me, and I'm sure I am not alone. ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Tom Lane wrote: > > mlw <[EMAIL PROTECTED]> writes: > > For instance: say we have two similarly performing plans, close to one another, > > say within 20%, one plan uses an index, and one does not. It is unlikely that > > the index plan will perform substantially worse than the non-index plan, right? > > This seems to be the crux of the argument ... but I've really seen no > evidence to suggest that it's true. The downside of improperly picking > an indexscan plan is *not* any less than the downside of improperly > picking a seqscan plan, in my experience. Our experiences differ. I have fought with PostgreSQL on a number of occasions when it would not use an index. Inevitably, I would have to set "enable_seqscan = false." I don't like doing that because it forces the use of an index when it doesn't make sense. I don't think we will agree, we have seen different behaviors, and our experiences seem to conflict. This however does not mean that either of us is in error, it just may mean that we use data with very different characteristics. This thread is kind of frustrating for me because over the last couple years I have seen this problem many times and the answer is always the same, "The statistics need to be improved." Tom, you and I have gone back and forth about this more than once. I submit to you that the statistics will probably *never* be right. They will always need improvement here and there. Perhaps instead of fighting over an algorithmic solution, and forcing the users to work around problems with choosing an index, should we not just allow the developer to place hints in the SQL, as: select /*+ INDEX(a_id, b_id) */ * from a, b where a.id = b.id; That way if there is a performance issue with using or not using an index, the developer can have better control over the evaluation of the query. ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw <[EMAIL PROTECTED]> writes:
> should we not just allow the developer to place hints in the
> SQL, as:
> select /*+ INDEX(a_id, b_id) */ * from a, b where a.id = b.id;

People have suggested that sort of thing from time to time, but I have a couple of problems with it:

1. It's unobvious how to tag the source in a way that is helpful for any but the most trivial queries. Moreover, reasonable sorts of tags would provide only partial specification of the exact query plan, which is a recipe for trouble --- an upgraded optimizer might make different choices, leading to a pessimized plan if some points are pinned down when others aren't.

2. The tag approach presumes that the query programmer is smarter than the planner. This might be true under ideal circumstances, but I have a hard time crediting that the planner looking at today's stats is dumber than the junior programmer who left two years ago, and no one's updated his query since then. The planner may not be very bright, but it doesn't get bored, tired, or sick, nor move on to the next opportunity. It will pick the best plan it can on the basis of current statistics and the specific values appearing in the given query. Every time. A tag-forced query plan doesn't have that adaptability.

By and large this argument reminds me of the "compiler versus hand-programmed assembler" argument. Which was pretty much a dead issue when I was an undergrad, more years ago than I care to admit in a public forum. Yes, a competent programmer who's willing to work hard can out-code a compiler over small stretches of code. But no one tries to write large systems in assembler anymore. Hand-tuned SQL is up against that same John-Henry-vs-the-steam-hammer logic. Maybe the current PG optimizer isn't quite in the steam hammer league yet, but it will get there someday. I'm more interested in revving up the optimizer than in betting on John Henry.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw <[EMAIL PROTECTED]> writes: > For instance: say we have two similarly performing plans, close to one another, > say within 20%, one plan uses an index, and one does not. It is unlikely that > the index plan will perform substantially worse than the non-index plan, right? This seems to be the crux of the argument ... but I've really seen no evidence to suggest that it's true. The downside of improperly picking an indexscan plan is *not* any less than the downside of improperly picking a seqscan plan, in my experience. It does seem (per Thomas' earlier observation) that we get more complaints about failure to use an index scan than the other case. Prior to 7.2 it was usually pretty obviously traceable to overestimates of the number of rows to be retrieved (due to inadequate data statistics). In 7.2 that doesn't seem to be the bottleneck anymore. I think now that there may be some shortcoming in the planner's cost model or in the adjustable parameters for same. But my reaction to that is to try to figure out how to fix the cost model. I certainly do not feel that we've reached a dead end in which the only answer is to give up and stop trusting the cost-based optimization approach. > Now, given the choice of the two strategies on a table, both pretty close to > one another, the risk of poor performance for using the index scan is minimal > based on the statistics, but the risk of poor performance for using the > sequential scan is quite high on a large table. You keep asserting that, and you keep providing no evidence. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Apologies for the naff double post, but I meant to add that obviously the figures for the solaris box are bogus after the first run...imagine a file system cache of an entire 2gb file. I tried creating a file of 4gb on this box, but it bombed with a "file too large" error. Unfortunately, I can't rip memory out of this box as I don't have exclusive access.

On Thu, 2002-04-18 at 11:49, Mark Pritchard wrote:
> I threw together the attached program (compiles fine with gcc 2.95.2 on
> Solaris 2.6 and egcs-2.91.66 on RedHat Linux 6.2) and ran it a few
> times. Data is below. Usual disclaimers about hastily written code etc :)
>
> Machine = ghoul (generic intel, 384mb ram, dual p3-800, ide disk running dma)
>
> Sequential
> Bytes Read    Time     Bytes / Sec
> 536870912     27.14    19783933.74
> 536870912     27.14    19783990.60
> 536870912     27.11    19801872.14
> 536870912     26.92    19942928.41
> 536870912     27.31    19657408.43
>                        19794026.66 (avg)
>
> Random
> Bytes Read    Time     Bytes / Sec
> 1073741824    519.57   2066589.21
> 1073741824    517.78   2073751.44
> 1073741824    516.92   2077193.23
> 1073741824    513.18   2092333.29
> 1073741824    510.68   2102579.88
>                        2082489.41 (avg)
>
> Machine = jedi (Sun E420, 3gb ram, dual 400s, test on single scsi disk)
>
> Sequential
> Bytes Read    Time     Bytes / Sec
> 2097152000    65.19    32167675.28
> 2097152000    65.22    32154114.65
> 2097152000    65.16    32182561.99
> 2097152000    65.12    32206105.12
> 2097152000    64.67    32429463.26
>                        32227984.06 (avg)
>
> Random
> Bytes Read    Time     Bytes / Sec
> 4194304000    1522.22  2755394.79
> 4194304000    278.18   15077622.05
> 4194304000    91.43    45874730.07
> 4194304000    61.43    68273795.19
> 4194304000    54.55    76890231.51
>                        41774354.72
>
> If I interpret Tom's "divide" instruction correctly, is that a factor of
> 10 on the linux box?
> > On Thu, 2002-04-18 at 01:16, Tom Lane wrote: > > "Luis Alberto Amigo Navarro" <[EMAIL PROTECTED]> writes: > > > On my own few experience I think this could be solved decreasing > > > random_page_cost, if you would prefer to use indexes than seq scans, then > > > you can lower random_page_cost to a point in which postgres works as you > > > want. So the planner would prefer indexes when in standard conditions it > > > would prefer seq scans. > > > > It's entirely possible that the default value of random_page_cost is too > > high, at least for many modern machines. The experiments I did to get > > the 4.0 figure were done a couple years ago, on hardware that wasn't > > exactly new at the time. I have not heard of anyone else trying to > > measure it though. > > > > I don't think I have the source code I used anymore, but the principle > > is simple enough: > > > > 1. Make a large file (several times the size of your machine's RAM, to > > ensure you swamp out kernel disk buffering effects). Fill with random > > data. (NB: do not fill with zeroes, some filesystems optimize this away.) > > > > 2. Time reading the file sequentially, 8K per read request. > > Repeat enough to get a statistically trustworthy number. > > > > 3. Time reading randomly-chosen 8K pages from the file. Repeat > > enough to get a trustworthy number (the total volume of pages read > > should be several times the size of your RAM). > > > > 4. Divide. > > > > The only tricky thing about this is making sure you are measuring disk > > access times and not being fooled by re-accessing pages the kernel still > > has cached from a previous access. (The PG planner does try to account > > for caching effects, but that's a separate estimate; the value of > > random_page_cost isn't supposed to include caching effects.) AFAIK the > > only good way to do that is to use a large test, which means it takes > > awhile to run; and you need enough spare disk space for a big test file. 
> > It'd be interesting to get some numbers for this across a range of
> > hardware, filesystems, etc ...
> >
> > regards, tom lane
> >
> > ---(end of broadcast)---
> > TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
>
> #include
> #include
> #include
> #include
> #include
> #include
>
> /**
>  * Constants
>  */
>
> #define BLOCK_SIZE (8192)
>
> /**
>  * Prototypes
>  */
>
> // Creates the test file filled with random data
> void createTestFile(char *testFileName, long long fileSize);
>
> // Handles runtime errors by displaying the function, activity and error number
> void handleError(char *functionName, char *activity);
>
> // Standard entry point
> int main(int argc, char *args[]);
>
> // Prints correct usage and quits
> void printUsageAndQuit();
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Thu, 18 Apr 2002 [EMAIL PROTECTED] wrote:
> > On 04/17/2002 01:44:46 PM Michael Loftis wrote:
> > In many of the cases where it is a primary key it is also there to
> > ensure fast lookups when referenced as a foreign key. Or for joins.
>
> Don't know if the optimizer takes this into consideration, but a query that uses a
> primary and/or unique key in the where-clause, should always choose to use
> the related indices (assuming the table size is above a certain threshold). Since a
> primary key/unique index always restricts the resultset to a single row.

I don't think so. eg. table with primary key "pk", taking values from 1 to 100 (so 100 records)

select * from table where pk > 5

should probably not use the index ...

Cheers
Tycho

--
Tycho Fruru [EMAIL PROTECTED]
"Prediction is extremely difficult. Especially about the future." - Niels Bohr

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On 04/17/2002 01:44:46 PM Michael Loftis wrote: > In many of the cases where it is a primary key it is also there to > ensure fast lookups when referenced as a foreign key. Or for joins. Don't know if the optimizer takes this into consideration, but a query that uses a primary and/or unique key in the where-clause, should always choose to use the related indices (assuming the table size is above a certain threshold). Since a primary key/unique index always restricts the resultset to a single row. Somebody else mentioned that after creating an index, he still had to run analyze in order to get the optimizer to choose to use the index. I thought that 'create index' also updated pg_stats? Maarten Maarten Boekhold, [EMAIL PROTECTED] Reuters Consulting Dubai Media City Building 1, 5th Floor PO Box 1426 Dubai, United Arab Emirates tel:+971(0)4 3918300 ext 249 fax:+971(0)4 3918333 mob:+971(0)505526539 - --- Visit our Internet site at http://www.reuters.com Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
> Look at the pgbench utility. I can't run that program without a +- 10%
> variation from run to run, no matter how many times I run vacuum and checkpoint.

It's pgbench's fault; TPC-B was replaced with TPC-C because it is not accurate enough. We run a pseudo TPC-H and it has almost no variations from one run to another.

Regards

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Thomas Lockhart wrote: > > ... > > I'm of a belief that *eventually* we really can take enough of the > > variables into consideration for planning the best query every time. I > > didn't say it was gunna be soon, nor easy though. > > I agree. But I'd like to eliminate the optimizer variability which > depends solely on the syntactical differences between traditional and > "join syntax" inner join queries. If the reason for these differences > are to allow explicit control over join order, let's get another > mechanism for doing that. Ok. I see what you mean now. That makes more sense. :) + Justin >- Thomas -- "My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there." - Indira Gandhi ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
...
> I'm of a belief that *eventually* we really can take enough of the
> variables into consideration for planning the best query every time. I
> didn't say it was gunna be soon, nor easy though.

I agree. But I'd like to eliminate the optimizer variability which depends solely on the syntactical differences between traditional and "join syntax" inner join queries. If the reason for these differences is to allow explicit control over join order, let's get another mechanism for doing that.

- Thomas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Thomas Lockhart wrote: > > If that were exposed, then folks could have additional control over the > optimizer no matter what syntax they prefer to use. And in fact could > alter the behavior without having to completely rewrite their query. > > One could also think about a threshold mechanism as you mention above, > but istm that allowing explicit control over reordering (fundamentally > different than, say, control over whether particular kinds of scans are > used) is the best first step. Not solely continuing to hide that control > behind heuristics involving query style and numbers of tables. A la Oracle... here we come :-/ If we go down this track, although it would be beneficial in the short term, is it the best long term approach? I'm of a belief that *eventually* we really can take enough of the variables into consideration for planning the best query every time. I didn't say it was gunna be soon, nor easy though. + Justin > - Thomas > > ---(end of broadcast)--- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html -- "My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there." - Indira Gandhi ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
> Would it make sense to flatten out INNER JOINs only when the total > number of tables involved is less than some parameter N? N > around six or eight would probably keep the complex-query crowd > happy, while not causing unintuitive behavior for simple queries. > Anybody who really likes the current behavior could set N=1 to force > the system to obey his join order. I'd like to see the "reorder, or not to reorder" to happen as a settable parameter, *not* as a side effect of choosing a particular should-be-equivalent syntax for a query. If that were exposed, then folks could have additional control over the optimizer no matter what syntax they prefer to use. And in fact could alter the behavior without having to completely rewrite their query. One could also think about a threshold mechanism as you mention above, but istm that allowing explicit control over reordering (fundamentally different than, say, control over whether particular kinds of scans are used) is the best first step. Not solely continuing to hide that control behind heuristics involving query style and numbers of tables. - Thomas ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
I threw together the attached program (compiles fine with gcc 2.95.2 on Solaris 2.6 and egcs-2.91.66 on RedHat Linux 6.2) and ran it a few times. Data is below. Usual disclaimers about hastily written code etc :)

Machine = ghoul (generic intel, 384mb ram, dual p3-800, ide disk running dma)

Sequential
Bytes Read    Time     Bytes / Sec
536870912     27.14    19783933.74
536870912     27.14    19783990.60
536870912     27.11    19801872.14
536870912     26.92    19942928.41
536870912     27.31    19657408.43
                       19794026.66 (avg)

Random
Bytes Read    Time     Bytes / Sec
1073741824    519.57   2066589.21
1073741824    517.78   2073751.44
1073741824    516.92   2077193.23
1073741824    513.18   2092333.29
1073741824    510.68   2102579.88
                       2082489.41 (avg)

Machine = jedi (Sun E420, 3gb ram, dual 400s, test on single scsi disk)

Sequential
Bytes Read    Time     Bytes / Sec
2097152000    65.19    32167675.28
2097152000    65.22    32154114.65
2097152000    65.16    32182561.99
2097152000    65.12    32206105.12
2097152000    64.67    32429463.26
                       32227984.06 (avg)

Random
Bytes Read    Time     Bytes / Sec
4194304000    1522.22  2755394.79
4194304000    278.18   15077622.05
4194304000    91.43    45874730.07
4194304000    61.43    68273795.19
4194304000    54.55    76890231.51
                       41774354.72

If I interpret Tom's "divide" instruction correctly, is that a factor of 10 on the linux box?

On Thu, 2002-04-18 at 01:16, Tom Lane wrote:
> "Luis Alberto Amigo Navarro" <[EMAIL PROTECTED]> writes:
> > On my own few experience I think this could be solved decreasing
> > random_page_cost, if you would prefer to use indexes than seq scans, then
> > you can lower random_page_cost to a point in which postgres works as you
> > want. So the planner would prefer indexes when in standard conditions it
> > would prefer seq scans.
>
> It's entirely possible that the default value of random_page_cost is too
> high, at least for many modern machines. The experiments I did to get
> the 4.0 figure were done a couple years ago, on hardware that wasn't
> exactly new at the time. I have not heard of anyone else trying to
> measure it though.
> > I don't think I have the source code I used anymore, but the principle > is simple enough: > > 1. Make a large file (several times the size of your machine's RAM, to > ensure you swamp out kernel disk buffering effects). Fill with random > data. (NB: do not fill with zeroes, some filesystems optimize this away.) > > 2. Time reading the file sequentially, 8K per read request. > Repeat enough to get a statistically trustworthy number. > > 3. Time reading randomly-chosen 8K pages from the file. Repeat > enough to get a trustworthy number (the total volume of pages read > should be several times the size of your RAM). > > 4. Divide. > > The only tricky thing about this is making sure you are measuring disk > access times and not being fooled by re-accessing pages the kernel still > has cached from a previous access. (The PG planner does try to account > for caching effects, but that's a separate estimate; the value of > random_page_cost isn't supposed to include caching effects.) AFAIK the > only good way to do that is to use a large test, which means it takes > awhile to run; and you need enough spare disk space for a big test file. > > It'd be interesting to get some numbers for this across a range of > hardware, filesystems, etc ... 
> regards, tom lane
>
> ---(end of broadcast)---
> TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

#include
#include
#include
#include
#include
#include

/**
 * Constants
 */

#define BLOCK_SIZE (8192)

/**
 * Prototypes
 */

// Creates the test file filled with random data
void createTestFile(char *testFileName, long long fileSize);

// Handles runtime errors by displaying the function, activity and error number
void handleError(char *functionName, char *activity);

// Standard entry point
int main(int argc, char *args[]);

// Prints correct usage and quits
void printUsageAndQuit();

// Tests performance of random reads of the given file
void testRandom(char *testFileName, long long amountToRead);

// Tests performance of sequential reads of the given file
void testSeq(char *testFileName);

/**
 * Definitions
 */

/**
 * createTestFile()
 */
void createTestFile(char *testFileName, long long fileSize)
{
    FILE *testFile;
    long long reps, i, j, bufferReps;
    time_t timetmp;
    long long *buffer;
    size_t written;

    // Indicate op
    printf("Creating test file %s of %lld mb\n", testFileName, fileSize);

    // Adjust file size to bytes
    fileSize *= (1024*1024);

    // Allocate a buffer for writing out random long longs
    if (!(buffer = malloc(BLOCK_SIZE)))
        handleError("createTestFile()", "malloc");

    // Open
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
> > ... I have seen many instances of when > > PostgreSQL refuses to use an index because the data distribution is uneven. > > This is fixed, or at least attacked, in 7.2. Again, I do not see > this as an argument for making the planner stupider instead of > smarter. Could someone fork out some decent criteria for these "stats" that way someone could generate a small app that would recommend these values on a per site basis. Having them hardwired and stuffed into a system catalog does no good to the newbie DBA. Iterating over a set of SQL statements, measuring the output, and then sending the user the results in the form of recommended values would be huge. Where could I look for an explanation of all of these values? -sc -- Sean Chittenden ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Tom Lane wrote:
> mlw <[EMAIL PROTECTED]> writes:
>
> > That is the difference, in another post Tom said he could not get
> > excited about 10.9 second execution time over a 7.96 execution
> > time. Damn!!! I would. That is wrong.
>
> Sure. Show us how to make the planner's estimates 2x more accurate
> (on average) than they are now, and I'll get excited too.
>
> But forcing indexscan to be chosen over seqscan does not count as
> making it more accurate. (If you think it does, then you don't
> need to be in this thread at all; set enable_seqscan = 0 and
> stop bugging us ;-))
>
> regards, tom lane

Do we have a tool that can analyze a table and its indexes to let the DBA choose when to add an index and when not to? DB2 has an index analyzer like this. Given a specific query and the current table stats, it can tell you which indexes would be most beneficial. Do we have something like this already? At least we could point those DBAs to a utility like this, and then they would not be too surprised when the optimizer didn't use the index.

- Bill

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Wed, 2002-04-17 at 22:43, Tom Lane wrote:
> Hannu Krosing <[EMAIL PROTECTED]> writes:
> > OTOH, it is also important where the file is on disk. As seen from disk
> > speed test graphs on http://www.tomshardware.com , the speed difference
> > of sequential reads is 1.5 to 2.5 between inner and outer tracks.
>
> True. But if we use the same test file for both the sequential and
> random-access timings, hopefully the absolute speed of access will
> cancel out. (Again, it's the sort of thing that could use some
> real-world testing...)

What I was trying to say was that if you test on one end you will get wrong data for the other end of the same disk.

> > (The PG planner does try to account
> > for caching effects, but that's a separate estimate;
>
> > Will it make the random and seq read cost equal when cache size >
> > database size and enough queries are performed to assume that all data
> > is in cache.
>
> There isn't any attempt to account for the effects of data having been
> read into cache by previous queries. I doubt that it would improve the
> model to try to keep track of what the recent queries were

Perhaps some simple thing, like number of pages read * cache size / database size. Or perhaps use some additional bookkeeping in cache logic, perhaps even on a per-table basis. If this can be made to use the same locks as cache loading/invalidation it may be quite cheap. It may even exist in some weird way already inside the LRU mechanism.

> --- for one
> thing, do you really want your plans changing on the basis of activity
> of other backends?

If I want the best plans then yes. The other backends do affect performance, so the best plan would be to account for their activities. If another backend is swapping like crazy the best plan may even be to wait for it to finish before proceeding :)

Hannu

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
D'Arcy J.M. Cain wrote: > On April 17, 2002 05:44 pm, mlw wrote: > > It took a bike ride to think about this one. The supposed advantage of a > > sequential read over an random read, in an active multitasking system, is a > > myth. If you are executing one query and the system is doing only that > > query, you may be right. > > > > Execute a number of queries at the same time, the expected benefit of a > > sequential scan goes out the window. The OS will be fetching blocks, more > > or less, at random. > > If it does you should look for another OS. A good OS will work with your > access requests to keep them as linear as possible. Of course it has a > slight effect the other way as well but generally lots of sequential reads > will be faster than lots of random ones. If you don't believe that then just > run the test that Tom suggested to calculate random_tuple_cost on your own > system. I bet your number is higher than 1. The two backends would have to be hitting the same table at different spots to turn off read-ahead, but it is possible. If the backends are hitting different tables, then they don't turn off read-ahead. Of course, for both backends to be hitting the disk, they both would have not found their data in the postgres or kernel cache. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw wrote: > Bruce Momjian wrote: > > > > > OK, yes, sequential scan _can_ be as slow as index scan, but sometimes > > it is faster. Can you provide reasoning why index scan should be > > preferred, other than the admin created it, which I already addressed? > > If you have a choice between two or more sub-plans, similar in cost, say within > 20% of one another. Choosing a plan which uses an index has a chance of > improved performance if the estimates are wrong where as choosing the > sequential scan will always have the full cost. And the chance of reduced performance if the estimate was too low. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Bruce Momjian wrote:
>
> OK, yes, sequential scan _can_ be as slow as index scan, but sometimes
> it is faster. Can you provide reasoning why index scan should be
> preferred, other than the admin created it, which I already addressed?

If you have a choice between two or more sub-plans similar in cost, say within 20% of one another, choosing the plan which uses an index has a chance of improved performance if the estimates are wrong, whereas choosing the sequential scan will always incur the full cost.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
-Original Message- From: Bruce Momjian [mailto:[EMAIL PROTECTED]] Sent: Wednesday, April 17, 2002 2:39 PM To: mlw Cc: Andrew Sullivan; PostgreSQL-development; Tom Lane Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE mlw wrote: > Bruce Momjian wrote: > > > > mlw wrote: > > > Now, given the choice of the two strategies on a table, both pretty close to > > > one another, the risk of poor performance for using the index scan is minimal > > > based on the statistics, but the risk of poor performance for using the > > > sequential scan is quite high on a large table. > > > My second point, that index scan is more risky than sequential scan, is > > outlined above. A sequential scan reads each page once, and uses the > > file system read-ahead code to prefetch the disk buffers. Index scans > > are random, and could easily re-read disk pages to plow through a > > significant portion of the table, and because the reads are random, > > the file system will not prefetch the rows so the index scan will have > > to wait for each non-cache-resident row to come in from disk. > > That is a very interesting point, but shouldn't that be factored into the cost > (random_tuple_cost?) In which case my point still stands. Yes, I see your point. I think on the high end that index scans can get very expensive if you start to do lots of cache misses and have to wait for i/o. I know the random cost is 4, but I think that number is not linear. It can be much higher for lots of cache misses and waiting for I/O, and think that is why it feels more risky to do an index scan on a sample size that is not perfectly known. Actually, you pretty much can know sequential scan size because you know the number of blocks in the table. It is index scan that is more unknown because you don't know how many index lookups you will need, and how well they will stay in the cache. Does that help? Wow, this _is_ confusing. 
I am still looking for that holy grail that will allow this all to be codified so others can learn from it and we don't have to rehash this repeatedly, but frankly, this whole discussion is covering new ground that we haven't covered yet. (Maybe TODO.detail this discussion and point to it from the FAQ.) >> General rules of thumb (don't know if they apply to postgres or not): Index scans are increasingly costly when the data is only a few types. For instance, an index on a single bit makes for a very expensive scan. After 5% of the data, it would be cheaper to scan the whole table without using the index. Now, if you have a clustered index (where the data is *physically* ordered by the index order) you can use index scans much more cheaply than when the data is not ordered in this way. A golden rule of thumb when accessing data in a relational database is to use the unique clustered index whenever possible (even if it is not the primary key). These decisions are always heuristic in nature. If a table is small it may be cheapest to load the whole table into memory and sort it. If the vacuum command could categorize statistically (some RDBMS systems do this) then you can look at the statistical data and make much smarter choices for how to use the index relations. The statistical information saved could be as simple as the min, max, mean, median, mode, and standard deviation, or it might also have quartiles or deciles, or some other measure to show even more data about the actual distribution. You could save what is essentially a binned histogram of the data that is present in the table. A bit of imagination will quickly show how useful this could be (some commercial database systems actually do this). Another notion is to super optimize some queries. 
By this, I mean that if someone says that a particular query is very important, it might be worthwhile to actually try a dozen or so different plans (of the potentially billions that are possible) against the query and store the best one. They could also run the super optimize feature again later or even automatically if the vacuum operation detects that the data distributions or cardinality have changed in some significant manner. Better yet, let them store the plan and hand edit it, if need be. Rdb is an example of a commercial database that allows this. A maintenance nightmare when dimwits do it, of course, but such is life. << ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Bruce Momjian wrote:
> mlw wrote:
> > Bruce Momjian wrote:
> > > mlw wrote:
> > > > Now, given the choice of the two strategies on a table, both pretty
> > > > close to one another, the risk of poor performance for using the
> > > > index scan is minimal based on the statistics, but the risk of poor
> > > > performance for using the sequential scan is quite high on a large
> > > > table.
> > >
> > > My second point, that index scan is more risky than sequential scan, is
> > > outlined above. A sequential scan reads each page once, and uses the
> > > file system read-ahead code to prefetch the disk buffers. Index scans
> > > are random, and could easily re-read disk pages to plow through a
> > > significant portion of the table, and because the reads are random,
> > > the file system will not prefetch the rows so the index scan will have
> > > to wait for each non-cache-resident row to come in from disk.
> >
> > That is a very interesting point, but shouldn't that be factored into
> > the cost (random_tuple_cost?) In which case my point still stands.
>
> Yes, I see your point. I think on the high end that index scans can get
> very expensive if you start to do lots of cache misses and have to wait
> for I/O. I know the random cost is 4, but I think that number is not
> linear. It can be much higher for lots of cache misses and waiting for
> I/O, and I think that is why it feels more risky to do an index scan on
> a sample size that is not perfectly known.

In an active system, sequential scans are still OS random access to a
file. Two or more queries running at the same time will blow out most of
the expected gain.

> Actually, you pretty much can know sequential scan size because you know
> the number of blocks in the table. It is index scan that is more
> unknown because you don't know how many index lookups you will need, and
> how well they will stay in the cache.

Again, shouldn't that be factored into the cost?

> Does that help? Wow, this _is_ confusing. I am still looking for that
> holy grail that will allow this all to be codified so others can learn
> from it and we don't have to rehash this repeatedly, but frankly, this
> whole discussion is covering new ground that we haven't covered yet.

Path planning by probabilities derived from statistical analysis is always
big science, regardless of application. The cost based optimizer will
*never* be finished because it can never be perfect. When all is said and
done, it could very well be as good as it ever needs to be, and a method
for giving hints to the optimizer, a la Oracle, may be the answer.
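The point above that the random cost "is not linear" can be sketched numerically. This is an illustrative model, not PostgreSQL's actual cost function; the cached-read and disk-seek cost constants are assumptions chosen for the example:

```python
# Illustrative model (not PostgreSQL's real cost machinery): the effective
# cost of a random page fetch depends on how often the page is already in
# cache. All constants here are assumptions for the sketch.

def effective_random_page_cost(cache_hit_ratio,
                               cached_cost=1.0,   # assumed cost of a cached read
                               disk_cost=40.0):   # assumed cost of a true disk seek
    """Blend cached and uncached fetch costs by the cache hit ratio."""
    return cache_hit_ratio * cached_cost + (1.0 - cache_hit_ratio) * disk_cost

# Around 92% cache hits the blended cost sits near the classic default of 4;
# as the working set outgrows the cache, the effective cost climbs steeply.
for hit in (0.99, 0.92, 0.50, 0.0):
    print(hit, effective_random_page_cost(hit))
```

A fixed random cost of 4 is therefore a point estimate on a steep curve, which is why a plan that depends on many random fetches feels riskier than one that does not.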
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On April 17, 2002 05:44 pm, mlw wrote:
> It took a bike ride to think about this one. The supposed advantage of a
> sequential read over a random read, in an active multitasking system, is
> a myth. If you are executing one query and the system is doing only that
> query, you may be right.
>
> Execute a number of queries at the same time, and the expected benefit of
> a sequential scan goes out the window. The OS will be fetching blocks,
> more or less, at random.

If it does, you should look for another OS. A good OS will work with your
access requests to keep them as linear as possible. Of course it has a
slight effect the other way as well, but generally lots of sequential
reads will be faster than lots of random ones.

If you don't believe that, then just run the test that Tom suggested to
calculate random_tuple_cost on your own system. I bet your number is
higher than 1. And when you are done, just plug the number into your
configuration and get the plans that you are looking for.

--
D'Arcy J.M. Cain                 |  Democracy is three wolves
http://www.druid.net/darcy/      |  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP) |  what's for dinner.
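The measurement alluded to here can be approximated with a short script. Tom's actual test is not shown in this thread, so the following is only a hedged sketch of the idea: time sequential versus random 8 KB page reads over the same file and take the ratio. On a small file that fits in the OS cache the ratio will be close to 1; meaningful numbers require a file much larger than RAM.

```python
import os
import random
import tempfile
import time

# Sketch of a random-vs-sequential read timing test (assumed shape; the
# real test referenced in the thread is not shown). With a file far larger
# than RAM, the random/sequential time ratio approximates a
# random_page_cost-style setting.

PAGE = 8192
NPAGES = 2048  # 16 MB scratch file; use a much larger one for real numbers

def build_file(path):
    with open(path, "wb") as f:
        f.write(os.urandom(PAGE) * NPAGES)

def timed_reads(path, offsets):
    start = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(PAGE)
    return time.perf_counter() - start

path = os.path.join(tempfile.mkdtemp(), "scratch")
build_file(path)
offsets = [i * PAGE for i in range(NPAGES)]
seq = timed_reads(path, offsets)                       # in-order pass
rnd = timed_reads(path, random.sample(offsets, NPAGES))  # shuffled pass
print("sequential: %.4fs  random: %.4fs  ratio: %.2f" % (seq, rnd, rnd / seq))
```

Whatever ratio such a test yields on a given machine is the number D'Arcy suggests plugging into the configuration.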
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw <[EMAIL PROTECTED]> writes:
> It took a bike ride to think about this one. The supposed advantage of a
> sequential read over a random read, in an active multitasking system, is
> a myth.

Disagree.

> Execute a number of queries at the same time, the expected benefit of a
> sequential scan goes out the window. The OS will be fetching blocks,
> more or less, at random.

If readahead is active (and it should be for sequential reads) there is
still a pretty good chance that the next few disk blocks will be in cache
next time you get scheduled. If your disk is thrashing that badly, you
need more RAM and/or more spindles; using an index will just put even more
load on the I/O system.

-Doug

--
Doug McNaught  |  Wireboard Industries  |  http://www.wireboard.com/
Custom software development, systems and network consulting.
Java PostgreSQL Enhydra Python Zope Perl Apache Linux BSD...
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw wrote:
> Bruce Momjian wrote:
> > My second point, that index scan is more risky than sequential scan, is
> > outlined above. A sequential scan reads each page once, and uses the
> > file system read-ahead code to prefetch the disk buffers. Index scans
> > are random, and could easily re-read disk pages to plow through a
> > significant portion of the table, and because the reads are random,
> > the file system will not prefetch the rows so the index scan will have
> > to wait for each non-cache-resident row to come in from disk.
>
> It took a bike ride to think about this one. The supposed advantage of a
> sequential read over a random read, in an active multitasking system, is
> a myth. If you are executing one query and the system is doing only that
> query, you may be right.
>
> Execute a number of queries at the same time, and the expected benefit of
> a sequential scan goes out the window. The OS will be fetching blocks,
> more or less, at random.

OK, yes, a sequential scan _can_ be as slow as an index scan, but
sometimes it is faster. Can you provide reasoning why an index scan should
be preferred, other than that the admin created it, which I already
addressed?

--
Bruce Momjian  |  http://candle.pha.pa.us
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Bruce Momjian wrote:
> My second point, that index scan is more risky than sequential scan, is
> outlined above. A sequential scan reads each page once, and uses the
> file system read-ahead code to prefetch the disk buffers. Index scans
> are random, and could easily re-read disk pages to plow through a
> significant portion of the table, and because the reads are random,
> the file system will not prefetch the rows so the index scan will have
> to wait for each non-cache-resident row to come in from disk.

It took a bike ride to think about this one. The supposed advantage of a
sequential read over a random read, in an active multitasking system, is a
myth. If you are executing one query and the system is doing only that
query, you may be right.

Execute a number of queries at the same time, and the expected benefit of
a sequential scan goes out the window. The OS will be fetching blocks,
more or less, at random.
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Michael Loftis wrote:
> As far as the 'planner benchmark suite' so we can start gathering more
> statistical data about what costs should be, or are better at, that's an
> excellent idea.

People with different hardware have different random page costs, clearly.
Even different workloads affect it. Added to TODO:

	* Add utility to compute accurate random_page_cost value

--
Bruce Momjian  |  http://candle.pha.pa.us
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw wrote:
> Bruce Momjian wrote:
> > mlw wrote:
> > > Now, given the choice of the two strategies on a table, both pretty
> > > close to one another, the risk of poor performance for using the
> > > index scan is minimal based on the statistics, but the risk of poor
> > > performance for using the sequential scan is quite high on a large
> > > table.
> >
> > My second point, that index scan is more risky than sequential scan, is
> > outlined above. A sequential scan reads each page once, and uses the
> > file system read-ahead code to prefetch the disk buffers. Index scans
> > are random, and could easily re-read disk pages to plow through a
> > significant portion of the table, and because the reads are random,
> > the file system will not prefetch the rows so the index scan will have
> > to wait for each non-cache-resident row to come in from disk.
>
> That is a very interesting point, but shouldn't that be factored into
> the cost (random_tuple_cost?) In which case my point still stands.

Yes, I see your point. I think on the high end that index scans can get
very expensive if you start to do lots of cache misses and have to wait
for I/O. I know the random cost is 4, but I think that number is not
linear. It can be much higher for lots of cache misses and waiting for
I/O, and I think that is why it feels more risky to do an index scan on a
sample size that is not perfectly known.

Actually, you pretty much can know sequential scan size because you know
the number of blocks in the table. It is index scan that is more unknown
because you don't know how many index lookups you will need, and how well
they will stay in the cache.

Does that help? Wow, this _is_ confusing. I am still looking for that holy
grail that will allow this all to be codified so others can learn from it
and we don't have to rehash this repeatedly, but frankly, this whole
discussion is covering new ground that we haven't covered yet. (Maybe
TODO.detail this discussion and point to it from the FAQ.)

--
Bruce Momjian  |  http://candle.pha.pa.us
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Bruce Momjian wrote:
> mlw wrote:
> > Now, given the choice of the two strategies on a table, both pretty
> > close to one another, the risk of poor performance for using the index
> > scan is minimal based on the statistics, but the risk of poor
> > performance for using the sequential scan is quite high on a large
> > table.
>
> My second point, that index scan is more risky than sequential scan, is
> outlined above. A sequential scan reads each page once, and uses the
> file system read-ahead code to prefetch the disk buffers. Index scans
> are random, and could easily re-read disk pages to plow through a
> significant portion of the table, and because the reads are random,
> the file system will not prefetch the rows so the index scan will have
> to wait for each non-cache-resident row to come in from disk.

That is a very interesting point, but shouldn't that be factored into the
cost (random_tuple_cost?) In which case my point still stands.
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw wrote:
> Andrew Sullivan wrote:
> > > Now, given the choice of the two strategies on a table, both pretty
> > > close to one another, the risk of poor performance for using the
> > > index scan is minimal based on the statistics, but the risk of poor
> > > performance for using the sequential scan is quite high on a large
> > > table.
> >
> > I thought that's what the various cost estimates were there to cover.
> > If this is all you're saying, then the feature is already there.
>
> The point is that if the index plan is < 20% more costly than the
> sequential scan, it is probably less risky.

I just posted on this topic. Index scan is more risky, no question about
it.

--
Bruce Momjian  |  http://candle.pha.pa.us
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
Andrew Sullivan wrote:
> > Now, given the choice of the two strategies on a table, both pretty
> > close to one another, the risk of poor performance for using the
> > index scan is minimal based on the statistics, but the risk of poor
> > performance for using the sequential scan is quite high on a large
> > table.
>
> I thought that's what the various cost estimates were there to cover.
> If this is all you're saying, then the feature is already there.

The point is that if the index plan is < 20% more costly than the
sequential scan, it is probably less risky.
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
mlw wrote:
> Now, given the choice of the two strategies on a table, both pretty
> close to one another, the risk of poor performance for using the index
> scan is minimal based on the statistics, but the risk of poor
> performance for using the sequential scan is quite high on a large
> table.

Wow, what did I start here?

OK, let me see if I can explain why the idea of an index being present is
not significant, and also explain why doing a sequential scan is _less_
risky than an index scan.

First, if an admin creates an index, it does mean he thinks it will help,
but is he right? You could say that if they create the index, use it, and
if the admin finds it makes the query slower, he can then remove it, and
this does give him some control over the optimizer. However, this assumes
two things. First, it assumes the admin will actually check to see if the
index helps, and if it doesn't, remove it, but more importantly, it
assumes there is only one type of query for that table. That is the
biggest fallacy. If I do:

	SELECT * FROM tab WHERE col = 0;

I may be selecting 70% of the table, and an index scan will take forever
if 70% of the table (plus index pages) is significantly larger than the
cache size; every row lookup will have to hit the disk! However, if I do:

	SELECT * FROM tab WHERE col = 89823;

and 89823 is a rare value, perhaps only one row in the table, then an
index would be good to use. So yes, indexes can be added by admins to
improve performance, but the admin is creating the index probably for the
second query, and certainly doesn't want the index used for the first
query.

Also, these are simple queries. Add multiple tables and join methods, and
the idea that an admin creating an index could in any way control these
cases is implausible.

My second point, that index scan is more risky than sequential scan, is
outlined above. A sequential scan reads each page once, and uses the file
system read-ahead code to prefetch the disk buffers. Index scans are
random, and could easily re-read disk pages to plow through a significant
portion of the table, and because the reads are random, the file system
will not prefetch the rows, so the index scan will have to wait for each
non-cache-resident row to come in from disk.

--
Bruce Momjian  |  http://candle.pha.pa.us
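The two queries above can be put into a toy planner-style comparison. This is not PostgreSQL's real cost model; the table shape and cost constants are assumptions chosen to illustrate the crossover:

```python
# Toy cost comparison (assumed model, not PostgreSQL's actual planner):
# why the same index is right for a rare-value lookup and wrong for a
# 70%-of-the-table scan.

TABLE_PAGES = 10_000        # heap pages, each read once by a sequential scan
SEQ_PAGE_COST = 1.0         # assumed cost of one sequential page read
RANDOM_PAGE_COST = 4.0      # assumed cost of one random page read

def seq_scan_cost():
    # Every page is read exactly once, in order, with read-ahead helping.
    return TABLE_PAGES * SEQ_PAGE_COST

def index_scan_cost(matching_rows):
    # Crude model: roughly one random heap-page fetch per matching row.
    return matching_rows * RANDOM_PAGE_COST

# "WHERE col = 0" matching 700,000 rows vs. a one-row "WHERE col = 89823":
print(seq_scan_cost())           # 10000.0
print(index_scan_cost(700_000))  # 2800000.0 -- index scan loses badly
print(index_scan_cost(1))        # 4.0       -- index scan wins easily
```

Even in this crude model the index scan is off by orders of magnitude in one direction or the other, which is the point: the right plan depends on selectivity, not on whether an index happens to exist.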
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Wed, Apr 17, 2002 at 04:28:03PM -0400, mlw wrote:
> Oracle has a cost based optimizer, and they allow you to override
> it, offer hints as to what it should do, or use the rules based
> optimizer. They know that a cost based optimizer can not generate
> the best query all the time.

Oracle's the wrong competition to cite here. IBM's optimiser and planner
in DB2 is rather difficult to override; IBM actively discourages doing
so. That's because it's the best there is. It's _far_ better than
Oracle's, and has ever been so. It just about _always_ gets it right.
Without presuming to speak for him, I'd suggest that Tom probably wants to
get the planner to that level, rather than adding band-aids.

> I say it is obvious it can never know enough, since statistics are

Enough for what? The idea is that the statistics will get you the
best-bet plan. You're trying to redefine what the best bet is; and Tom
and others have suggested that a simple rule of thumb, "All else being
more or less equal, prefer an index," is not a good one.

> Now, given the choice of the two strategies on a table, both pretty
> close to one another, the risk of poor performance for using the
> index scan is minimal based on the statistics, but the risk of poor
> performance for using the sequential scan is quite high on a large
> table.

I thought that's what the various cost estimates were there to cover.
If this is all you're saying, then the feature is already there.

--
Andrew Sullivan                         87 Mowat Avenue
Liberty RMS                             Toronto, Ontario Canada
<[EMAIL PROTECTED]>                     M6K 3E3
+1 416 646 3304 x110