Re: [HACKERS] Cause of moving-target FSM space-needed reports
Tom Lane wrote: In fact, maybe vacuum should just throw a WARNING when it finds a single rel with more than MaxFSMPages pages with useful free space? I fully agree with this. This (in particular, how many useful free pages a rel has) is information a good DBA can make very good use of, and is needed in the case where it exceeds MaxFSMPages. I would also suggest having vacuum verbose print an INFO message with the rel's number of free pages for rels that don't exceed MaxFSMPages (that number should be printed in the WARNING when the number exceeds MaxFSMPages). Are there other ways of getting this information from the system? If not, then I'd consider this proposed change important. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] advisory locks and permissions
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Doesn't creating many temp tables in a transaction do the same thing? True, but it's a tad harder/less likely that you'd accidentally cause a problem that way. Then why not use a GUC (that only an administrator can set) to control the maximum number of advisory locks a given backend can take at any one time? Then it becomes the DBA's problem (and solution) if someone manages to run the database out of shared memory through this mechanism. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] [HACKERS] Fix linking of OpenLDAP libraries
Tom Lane wrote: Albe Laurenz [EMAIL PROTECTED] writes: Let me expand a little on some of the peculiarities of shared libraries on AIX: - A normal AIX shared library is called libXX.a It is an 'ar' archive that contains the shared object(s). Ah, so the problem really boils down to funny naming conventions. If they use .a for both shared and static libraries, how does anyone tell the difference? It sounds to me like there is no difference. Notice how his example ldd output shows dependencies on specific .o entries within the various .a files that reside on the system, rather than on the .a files as a whole. If those entries had been statically linked then they wouldn't have shown up in the ldd output at all. So the difference is no longer between static libraries and shared libraries -- they're now just libraries. The only difference is how you link to them. What IBM has done here is very sensible, and is really what the other Unixes should have done from the beginning: a library is just a library, and what differs is how it's used. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Foreign keys
Joshua D. Drake wrote: In any case the same logic that leads to it being desirable to report all the errors to the user in a UI and not just report them one by one also applies to the database. I'm not sure it's the most important issue in the world, but it does seem like it would be a nice feature if it reported all the errors in the statement, not just the first one it finds. Seems kind of extraneous to me. I am guessing it would cause yet further overhead with our foreign key checks. But in this case, it would be (or should be) overhead only in the case of failure. In the case of success, all the constraints are checked anyway -- they just succeed. I would expect that the number of applications for which a constraint violation is the norm and not the exception is very small. But Tom's concern is a valid one. I expect a reasonable compromise would be to record and show the errors for only the non-deferred constraints in the currently executing statement, because after that point the transaction is in an error state anyway and thus can't continue without a rollback to a savepoint. It probably wouldn't make sense to evaluate the deferred constraints within the erroring statement anyway -- they're deferred, which by definition means they don't get evaluated until commit, so evaluating them at failure time could easily show errors that are only there because subsequent statements never got executed. As for the deferred constraints, it might be reasonable to show errors only up to some limit (controlled by a GUC, perhaps), with the default limit being 1, which is what we have now. Otherwise you run the risk of throwing millions of errors, which is surely not desirable. The downside to this is until you've hit the limit, you have to evaluate *all* the deferred constraints, which could take a while, whereas the current setup will return immediately upon encountering the first constraint error. 
My testing shows that the use of foreign keys on high velocity single transaction loads can easily cause a 50% reduction in performance. Why add to that? What we need to be doing is finding a way to decrease the impact of foreign key checks. I definitely agree here, but this should be independent of how foreign key failures are handled once they're detected. In other words, what you're experiencing is the performance hit that comes from evaluating the constraints, not from reporting the errors afterwards. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Fixed length data types issue
Tom Lane wrote: (does anyone know the cost of ntohl() on modern Intel CPUs?) I wrote a simple test program to determine this:

    #include <arpa/inet.h>
    #include <stdint.h>

    int
    main(int argc, char *argv[])
    {
        unsigned long i;
        uint32_t a;

        a = 0;
        for (i = 0 ; i < 40L ; ++i)
        {
    #ifdef CALL_NTOHL
            a = ntohl(i);
    #endif
        }
        return a;
    }

I have a system with an Athlon 64 3200+ (2.0 GHz) running in 64-bit mode, another one with the same processor running in 32-bit mode, a third running a Pentium 4 1.5 GHz processor, and a fourth running a pair of 2.8 GHz Xeons in hyperthreading mode. I compiled the test program on the 32-bit systems with the -std=c9x option so that the constant would be treated as unsigned. Other than that, the compilation method I used was identical: no optimization, since it would skip the loop entirely in the version without the ntohl() call. I compiled it both with and without defining CALL_NTOHL, and measured the difference in billed CPU seconds. Based on the above, on both Athlon 64 systems, each ntohl() invocation and assignment takes 1.04 nanoseconds to complete (I presume the assignment is to a register, but I'd have to examine the assembly to know for sure). On the 1.5 GHz P4 system, each iteration takes 8.49 nanoseconds. And on the 2.8 GHz Xeon system, each iteration takes 5.01 nanoseconds. That seems reasonably fast to me... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Getting a move on for 8.2 beta
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Tom Lane wrote: Well, it's taken us the full month to get this far through the queue, so I'd sure like to have more people on board reviewing next time. Alvaro helped but I miss Neil Conway, and some other people have helped in the past. However --- the present situation has nothing to do with lack of reviewers, it's lack of time to finish the patches. I did try to get us additional help in reviewing. Neil was unavailable, and Alvaro could only give part of his time. It strikes me that setting feature freeze in midsummer might not be the best strategy for having manpower available to review --- people tend to be on vacation in August. Maybe the answer is just to move the dates a bit one way or the other. Hmm...but if you're going to do that, why not do that now: push the beta date back by, say, a month (or however long you had in mind) for this cycle. That way, the two major patches that are likely to be dropped for this cycle stand a chance to make it into this release, and you accomplish your goal of moving the dates a bit all at the same time. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Coding style for emacs
Andrew Dunstan wrote: David Fetter wrote: If we put them in a place that's visible before you get the source, we can help people use the settings globally :) The likely level of our influence on someone who hasn't used the settings and isn't editing our source code is approximately equal to the influence of Pluto's gravitational field on your mood when you got out of bed this morning. I always wondered what affected his mood when he got out of bed this morning. Now I know! It's amazing the things you can learn by lurking here... :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [PATCHES] LWLock statistics collector
Tom Lane wrote: If I thought that average users would have a need for LWLock statistics, I'd be more sympathetic to expending effort on a nice frontend for viewing the statistics, but this is and always will be just a concern for hardcore hackers ... That may be true of the output, but that's not a very strong argument against making it much easier to gather and display the LWLock statistics. I can easily imagine the patch being a useful performance troubleshooting tool in a high load environment. Depends on how easy/intrusive it is to enable/use the stderr method on a production system, though, as well as how much of a performance impact the measurements have on overall operation... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Corrupt data pages...
Tom Lane wrote: It's fairly hard to see how that could happen inside Postgres. One can readily imagine bugs that might replace one whole page with another, but there aren't any operations that manipulate half-a-page. On the other hand, if your kernel uses 4K blocksize, this could be explained as substitution of one buffer for another at the kernel level. So my money is on a kernel bug. As I mentioned, we've once or twice before seen reports that looked like similar substitutions of single pages by the kernel. Yeah, I agree that's what's likely to have happened here. The kernel's page size is 4k, as is the default block size used by XFS. Tracking this one down any further is going to be nigh unto impossible, I think. There have been a pile of fixes to XFS in 2.6.16, so I'll start using that, at least... Thanks for looking at this. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Corrupt data pages...
Tom Lane wrote: You should at least show the page you think is corrupt. I attempted to send this additional info to the list but I think the message got dropped on the floor by the mailing list software or by the spam filter. I'll put the files on a web server and post links to them here. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Corrupt data pages...
I wrote: I attempted to send this additional info to the list but I think the message got dropped on the floor by the mailing list software or by the spam filter. I'll put the files on a web server and post links to them here. You can find them here: https://gazebo.sysexperts.com/~kevin/postgresql The files are bad-page-info.txt and bad-page.txt. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] Corrupt data pages...
possible that this is causing problems, but to be honest this is the first time I've ever seen this happen. If you guys have any suggestions on what I might do to help track this one down, please let me know. I tarred up the entire 'stagezilla' database after shutting down PostgreSQL so that I can do additional examination of the datafiles if that proves useful. Additionally, if you'd like to see the output of pg_filedump, I'll be happy to post it here (or email it separately). I hesitate to attach it here because it's not clear the mailing list's spam filtering would let it pass. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Unsplitting btree index leaf pages
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: We already do something similar for page deletions. Empty pages are not deleted right away, but they are marked with BTP_DEAD, and then deleted on a subsequent vacuum. Or something like that, I don't remember the exact details. Right, and the reason for that is exactly that there might be a concurrent indexscan already in flight to the newly-dead page. We must wait to recycle the page until we are certain no such scans remain. It doesn't matter whether a concurrent indexscan visits the dead page or not, *because it's empty* and so there's nothing to miss. So there's no race condition. But if you try to move valid data across pages then there is a race condition. Hmm... Well, REINDEX is apparently a very expensive operation right now. But how expensive would it be to go through the entire index and perform the index page merge operation being discussed here, and nothing else? If it's fast enough, might it be worthwhile to implement just this alone as a separate maintenance command (e.g., VACUUM INDEX) that acquires the appropriate lock (AccessExclusive, I'd expect) on the index to prevent exactly the issues you're concerned about? If it's fast enough even on large tables, it would be a nice alternative to REINDEX, I'd think. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Reducing relation locking overhead
Tom Lane wrote: The concern about deadlock applies to the various proposals that involve upgrading to a write-prevention lock at some late point in the process. That clearly has the potential to deadlock against relatively weak lock requests. After looking at the various lock types, I don't see how this is the case at all (which may mean that I'm more confused than ever. But please read on). It seems to me that only ops that promote to AccessExclusiveLock can deadlock against at least some of the proposed REINDEX implementations. REINDEX would have to initially grab AccessShareLock, of course, but AccessExclusiveLock is the only lock type that blocks against it, so in the case of lock promotion the only operations that would cause REINDEX to really deadlock (as opposed to simply blocking) are operations on the entire table (ALTER TABLE, DROP TABLE, etc.). None of the common operations block against an AccessShareLock, and the order of acquisition against objects (table vs index) is already enforced, so where's the deadlock potential? REINDEX would, I expect, promote its lock to ShareLock when it's time for it to block writers. That would block against quite a number of operations, of course, but that's not a problem in and of itself, because it need only wait until the operations in question are finished. The lock won't be granted until those other operations are finished, and nothing aside from table-level ops will block against the REINDEX until that lock is granted. A true deadlock won't happen against common operations unless REINDEX promotes its lock again to something stronger than ShareLock, and that's easy to avoid: just have REINDEX promote directly from AccessShareLock to the strongest lock it will ever need. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Reducing relation locking overhead
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: Tom Lane wrote: Even ignoring that, you *still* have a lock upgrade problem in this sketch. Hmm, well, I can see a deadlock potential for those operations that have to acquire multiple locks simultaneously, and I suppose that the table + FSM lock would qualify in the sequence here, but the rest of it involves just a single read lock against the table. What am I missing? At some point you have to lock out writers, else you can never be certain you have all the tuples. I was taking your read lock to actually mean lock out writers; otherwise the sketch doesn't work at all. Right, but the idea is to lock out writers for as brief a time as possible. That not only minimizes the possibility of lock contention but guarantees that REINDEX will get a complete view of the database. That said, it hinges on some sort of efficient way of identifying the new tuples created by other transactions that are/were running during the bulk of the time REINDEX was running. If there's no good way to do that, then there's no way to avoid blocking writers for an extended period of time. The real situation is that you must hold at least AccessShareLock on the table throughout the entire operation, else you have no defense against (say) someone dropping the index or the entire table out from under you. And when you add onto this lock in order to lock out writers temporarily, you are engaging in a lock upgrade, which can deadlock against any sort of exclusive lock request. But won't that depend on the order in which the lock requests appear? If locks A, B, and C are taken in that same order by every transaction, then there's no possibility of deadlock, right? The fact that you've been holding the AccessShareLock for quite a long time means that the window for deadlock problems is very wide. 
But with respect to deadlocks, that's true only if deadlocks are possible, which is true only if the order of lock acquisition differs between transactions. I guess the real question here is: is it possible to, in code, guarantee the order of lock acquisition by any given transaction? For REINDEX, the problem is simplified somewhat because it's operating against a single index and a single table, so the locks it has to acquire are somewhat limited in scope compared with a generic transaction. An endeavor to acquire all locks in the same order throughout the code would not only take care of this REINDEX deadlock problem but would essentially eliminate all possible deadlocks arising from code-ordered lock acquisition in the system, which I expect would be considered a very good thing indeed. But I expect it would be a lot of effort and wouldn't be worth it just to make REINDEX behave differently than it does now. So what am I missing/overlooking here? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Reducing relation locking overhead
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: I guess the real question here is: is it possible to, in code, guarantee the order of lock acquisition by any given transaction? Yes, but not in our code :-(. This is largely determined by what the application does. Yeah, that's what I figured. But what of the more limited problem of lock acquisition relative to the locks that REINDEX might need to acquire? Since those locks are limited to a single table and a single index, I'd think the problem wouldn't be insurmountable. No? Suppose the following rules were enforced in the code: - when requesting a type of lock, one must first acquire all lesser lock types against the object in order of strength. Hence, one must acquire AccessShareLock before acquiring AccessExclusiveLock. - one must always acquire a given lock type against the table before acquiring it against the index. The latter rule might be sufficient, if the former rule is already implied by the lock types and how they're acquired. Thus, acquisition of AccessExclusiveLock against the index automatically implies acquisition of AccessShareLock(table), then AccessShareLock(index), then AccessExclusiveLock(table), then AccessExclusiveLock(index). I could see that possibly causing performance problems (and would be interested in knowing what performance problems it would likely cause), but given the limited scope of the locks that REINDEX needs, wouldn't the above be sufficient to keep REINDEX deadlock-free against other transactions? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Reducing relation locking overhead
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: - when requesting a type of lock, one must first acquire all lesser lock types against the object in order of strength. Hence, one must acquire AccessShareLock before acquiring AccessExclusiveLock. This is exactly wrong ... And now I see why, since it will introduce deadlocks (sigh). But what of the other rule (always acquiring locks against the table before the index)? You may have stopped reading at the above... One thing I don't quite understand about the discussion is why there's particular attention being paid to deadlocks with respect to REINDEX when it clearly can happen in the general case when lock promotion is involved. Why is REINDEX special here? If the problem is that REINDEX has to hold an AccessShareLock to prevent the table or index from being dropped, but does not need to prevent writers in general (because the presumption is that there is some way of efficiently discovering the additional modifications made during the bulk of REINDEX processing), then doesn't that mean that an AccessShareLock is the wrong kind of lock for REINDEX to be holding, and that the appropriate type of lock should be created if it doesn't already exist? Additionally, I was under the impression that normal INSERTs, UPDATEs, and DELETEs didn't generally need to acquire AccessExclusiveLock, because of MVCC. If that's the case, then aren't the operations that could deadlock REINDEX relatively rare? And if those operations *do* need to acquire that lock type, then what exactly does MVCC buy you? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Reducing relation locking overhead
Greg Stark wrote: Tom Lane [EMAIL PROTECTED] writes: What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long. I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read. If REINDEX works by seqscanning the table then the inclusion of new tuples would happen for free if you turn off the FSM before beginning the REINDEX operation -- you're guaranteed to see them last. But that only works if REINDEX behaves this way. Then it's a question of what to do with in-flight updates at the time the REINDEX hits the end of the table. Even if REINDEX hits the table in non-sequential order, turning off the FSM should still work. REINDEX wouldn't need to acquire any additional locks until after it has scanned the appended area. So the way I (perhaps naively) envision it working is: 1. Acquire read lock on the table 2. Turn off FSM 3. Note the location of the end of the table 4. Release read lock on the table 5. Perform REINDEX operation 6. Read and index the bit of the table starting with the location noted previously. 7. Note new end of table 8. Acquire read lock on the table 9. Scan any entries that have been appended past new end of table. 10. Release read lock on table 11. Turn on FSM In the above for large relations, the bulk of the REINDEX should happen without any locks being held by the REINDEX operation. For small tables (where the amount of new insert activity can be a large percentage of the total table size), it would almost certainly be more efficient to just take a read lock for the whole operation. 
So it might be wise to set up some sort of threshold, and to take a read lock during the whole operation if the table size is smaller than the threshold. The reason the sequence I enumerate above involves taking any locks at all is to avoid the issues that Tom brought up about having to rescan the entire table to make sure nothing gets missed, to avoid possible race conditions between steps 2 and 3, and to allow step 9 to definitively complete, since otherwise in-flight updates would be missed. In the context of the original discussion (reduction of lock acquisition), REINDEX isn't a common operation even if it is a critical one, so acquisition of more than the usual number of locks here shouldn't be a big deal. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [pgsql-www] [HACKERS] Upcoming PG re-releases
David Fetter wrote: On Wed, Nov 30, 2005 at 11:56:33PM -0400, Marc G. Fournier wrote: So, if Sun, SRA, Pervasive, Command Prompt, etc were to submit a patch for v7.2, we'd refuse it? That depends on what you mean by refuse. Such a patch wouldn't resurrect the original Postgres with POSTQUEL and cause us to support it, and it won't cause us to start supporting PostgreSQL 7.2 again either. Okay, but suppose the patch in question breaks the version in question in some subtle but horrible way? If the community isn't supporting the release in question then it implies that it won't go to the effort of testing the patch, subjecting it to a beta period, etc. But since the patch would be applied by the community, the implication would be that the community *endorses* the patch in question, since the official source would be changed to reflect it. If the patch breaks the release horribly, just blindly accepting it wouldn't do good things to the community's reputation. And that means that the only really good way to guard against such an occurrence is to subject the patch to the same process that's used for officially supported releases. At that point, there's no real distinction between officially supported and not officially supported. I doubt the community wants to go down that road. The acceptance of a patch by the community probably implies a lot more than one would think at first glance, so this is certainly an issue that should be thought all the way through. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Reducing relation locking overhead
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: In the above for large relations, the bulk of the REINDEX should happen without any locks being held by the REINDEX operation. As I just pointed out to Greg, the arm-waving notion that you can turn off the FSM requires a great deal of low-level locking that is not there now. Yeah, I thought that the check for use of the FSM was already being done by the lower level operators, and that contention would only occur on the modification of the FSM enabled flag. Obviously this doesn't work at all if the FSM is just assumed to be in use at all times, or if the FSM values are read only at transaction start or something... Even ignoring that, you *still* have a lock upgrade problem in this sketch. Hmm, well, I can see a deadlock potential for those operations that have to acquire multiple locks simultaneously, and I suppose that the table + FSM lock would qualify in the sequence here, but the rest of it involves just a single read lock against the table. What am I missing? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] SIGSEGV taken on 8.1 during dump/reload
Martijn van Oosterhout wrote: None of this applies to PostgreSQL because we open the modules directly, and don't rely on the linker loader. Ah, right. I forgot the context was the server, not one of the utilities... Sorry for the waste of bandwidth... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] SIGSEGV taken on 8.1 during dump/reload
Tom Lane wrote: On the other hand, it'd be relatively easy for clueless lusers to defeat; I can readily imagine someone copying foo.so.8.2 to foo.so.8.3 when the backend complained that it couldn't find the latter. So maybe it's not what we want. Hmm...but isn't the version number also something that can be stored in the shared library itself during link time (e.g., via the -soname option to the linker)? The manpage for ld under Linux implies that this will cause the executable that's linked against the shared object to look explicitly for a library with the soname specified by the shared object. I don't know if that just causes the dynamic linker to look for a file with the specified soname or if it will actually examine the shared object under consideration to make sure it has the DT_SONAME field in question, however. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Question about Ctrl-C and less
Sean Utt wrote: If you send a recent version of vim a CONTROL-C, and you're just sitting there at a prompt, it gives you a hint: Type :quit<Enter> to exit Vim Any reason not to just trap the CONTROL-C in psql when paging and offer a hint? Especially since we don't really know that the user really wanted to type CONTROL-C instead of q for quit. I know that I have always meant to type q and was just distracted whenever I've typed CONTROL-C in the pager, and so passing the CONTROL-C on to less is not actually heeding my wishes, it is instead giving me enough rope to shoot myself in the foot. It won't work properly that way. SIGINT gets sent to all the members of the process group, not just the child. Psql isn't responsible for sending ctrl-c through to the child. When you're in an editor such as vi that makes use of the terminal, the editor itself is likely the only program that is doing anything. Its parent is doing a wait() on the editor. The parent in that instance can ignore SIGINT because it's not involved at all at that point. That's not the case here. Psql and the pager are really two cooperating parts of the same task. They just happen to be running in two different process spaces. Because they're both cooperatively active at the same time, they both need to handle SIGINT, because when the user invokes SIGINT, he intends for the overall task to return some kind of control to him. For psql, which is gathering data and sending it to the pager, that means that it needs to stop doing so and wait for the pager to finish. For the pager, it means at a minimum that it needs to display what it has so far and give interactive control to the user (it may or may not attempt to continue to read what's being sent to it). Some pagers (like more) will just exit. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Question about Ctrl-C and less
[EMAIL PROTECTED] wrote: On Thu, Oct 20, 2005 at 03:42:10PM -0700, Kevin Brown wrote: Martijn van Oosterhout wrote: You can't do a pclose in a signal handler, it's not one of the reentrant safe functions and could lead to deadlocks. The signal manpage documents the ones you can use. Just set a flag. Setting the descriptor to NULL is worse because then we have check before every output function. fprintf(NULL, ...) will segfault on most architectures I wager. Yeah, I was thinking that you'd do the check for the flag and invoke a cleanup handler after the write() to the output file descriptor. It's not clear that you'd need to do the check anyplace else. It's been a while since I've messed with this stuff, but if I recall correctly, the write() will return immediately after receipt of a signal, and will indicate how much was actually written. So receipt of a SIGINT should wind up being handled in a reasonably timely fashion. Additionally the normal SIGINT signal handler (the one that gets invoked when the pager is turned off) can be called from the cleanup handler in order to maintain the proper semantics. I disagree that psql should make *any* assumptions about what SIGINT means to the child process. Consider less again, and Control-C used to abort a search. You are suggesting that Control-C should not only abort the search, but should also cut off the input from less. Less won't die. Less will just see a terminated input stream. What has been gained from this? Is this intuitive behaviour? It's behaviour that's consistent with every other pipeline application set I've ever seen. The only difference between those and this situation is that for a standard pipeline set, SIGINT will kill all the processes in the pipeline. Less explicitly ignores (or traps) SIGINT, so the effect of generating a SIGINT where less is at the end of the pipeline is to kill everything that precedes it, and less will then show what results it received. 
And obviously, that implies that the pipeline gets closed. If psql does not close the pipeline right then and there, then its behaviour will obviously be different from what the user likely expects, based on other pipelined uses of less. After all, if they wanted to see the entire result set then they wouldn't have sent SIGINT, would they? If the pager does die in response to SIGINT, the write() will fail with SIGPIPE. Completely clean, without any need for psql to pay attention to SIGINT. We're not talking about the semantics of the pager, we're talking about the semantics of psql. You said it yourself: psql can't make any assumptions about what SIGINT means to the child process. So it has to consider what to do if the child does *not* die in response to SIGINT. What are the proper semantics for psql + the child in that situation? Well, I think it's clear that having psql ignore SIGINT in that situation is *not* correct, because it implies that whether or not SIGINT causes processing to stop (as the user would expect) depends entirely on the child. More specifically, it would depend on the child either dying or explicitly closing its stdin upon receipt of SIGINT. The bottom line is that, at least in my opinion, the semantics of SIGINT as regards psql should be the same whether or not there's a pager involved, with one crucial difference: if there's a pager involved, psql should wait for it to terminate before showing the prompt. But otherwise, the semantics should be identical, because they are what the user expects. I think the only reasonable behaviour is to ignore SIGINT within the parent, until the child exits. I don't see why other behaviours are even being considered. To me, it points at a misunderstanding of the problem. Not at all. When you send SIGINT to a process, you want that process to stop doing whatever it's doing and return control to you. That's what it means, and that's what it's for. 
If we ignore SIGINT then obviously we will *not* be heeding the wishes of the user who sends SIGINT, and that is not likely what the user expects. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Question about Ctrl-C and less
Andrew - Supernews wrote: On 2005-10-19, Kevin Brown [EMAIL PROTECTED] wrote: Making assumptions about what the pager will do upon receipt of SIGINT is folly as well. Setting up SIGINT to be ignored may be the right answer (I don't believe it is -- see below), but if so then it needs to be done properly. If it gets ignored prior to the popen(), then the child will also end up ignoring it by default, because signal disposition is inherited by child processes. If we ignore SIGINT, it should be after the popen(), not before. I do not believe it is possible to do the signal disposition correctly and still use popen() to run the pager. (You would need to reimplement popen using raw syscalls.) I'm not sure I see why this is so. popen() just creates the pipeline, fork()s, closes the proper file descriptor (depending on whether it's in the parent or the child and whether the pipe was open for read or write) and then exec()s in the child. In the parent, it returns control after the fork() and file descriptor cleanup. So the parent can set up its own internal signal disposition immediately after popen() returns. This sequence of events is exactly what we'd end up doing if we did everything ourselves using raw syscalls, save for the use of stdio instead of direct syscalls for the file operations. So I think the right answer here is for psql to handle SIGINT internally by doing a pclose() first The chances that psql can do this safely approach zero. pclose() is not a signal-safe function, so it can only be called from a signal handler if you _know_ that the signal did not interrupt any non-signal-safe function. (Nor can the signal handler longjmp out in such a case, unless the code is never again going to call any unsafe function.) I agree. I guess I need to be a little more explicit about what I envision here. There would be two possible signal handlers. The first is the one we have now, which cleans up various things upon receipt of SIGINT. 
The second simply sets a flag that says that SIGINT has been received. The signal handler that gets assigned to SIGINT depends on whether or not a pager is going to be used. If it's not, then we point it to the first signal handler. If it is, then we point it to the second, and clear the flag. When a pager is being used, we check for the flag immediately after doing a write()/fwrite() to the pipe. If it's set, we pclose(), clear the flag, and then manually invoke the non-pager signal handler. SIGINT should cause the write() to return immediately, possibly with EINTR. Make sense? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
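As a rough sketch (the identifiers here are illustrative, not psql's actual names), the two-handler scheme described above might look like this:

```c
#include <signal.h>
#include <stdio.h>

/* Sketch of the flag-based scheme; names are illustrative only. */
static volatile sig_atomic_t sigint_pending = 0;

/* The second handler: installed while the pager is active.  It only
 * sets a flag, which is all that is async-signal-safe to do here. */
static void pager_sigint_handler(int signo)
{
    (void) signo;
    sigint_pending = 1;
}

/* Called from ordinary code immediately after each write to the pager
 * pipe.  pclose() is safe here because we are no longer inside a
 * signal handler.  Returns 1 if an interrupt was handled. */
static int check_pager_interrupt(FILE *pager)
{
    if (!sigint_pending)
        return 0;
    sigint_pending = 0;
    if (pager != NULL)
        pclose(pager);
    /* ...then manually invoke the normal non-pager SIGINT cleanup... */
    return 1;
}
```

The key property is that nothing unsafe ever runs in signal context: the handler touches only a `sig_atomic_t`, and all cleanup happens synchronously in the main control flow.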
Re: [HACKERS] Question about Ctrl-C and less
Martijn van Oosterhout wrote: You can't do a pclose in a signal handler, it's not one of the reentrant-safe functions and could lead to deadlocks. The signal manpage documents the ones you can use. Just set a flag. Setting the descriptor to NULL is worse because then we have to check before every output function. fprintf(NULL, ...) will segfault on most architectures I wager. Yeah, I was thinking that you'd do the check for the flag and invoke a cleanup handler after the write() to the output file descriptor. It's not clear that you'd need to do the check anyplace else. It's been a while since I've messed with this stuff, but if I recall correctly, the write() will return immediately after receipt of a signal, and will indicate how much was actually written. So receipt of a SIGINT should wind up being handled in a reasonably timely fashion. Additionally the normal SIGINT signal handler (the one that gets invoked when the pager is turned off) can be called from the cleanup handler in order to maintain the proper semantics. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Question about Ctrl-C and less
Martijn van Oosterhout wrote: Very well, patch attached. It's quite simple actually. However, there seems to be some push back from people suggesting that jumping back to the main loop before the pager has quit is not buggy behaviour. Assuming that a ^C will kill the pager is just folly. Making assumptions about what the pager will do upon receipt of SIGINT is folly as well. Setting up SIGINT to be ignored may be the right answer (I don't believe it is -- see below), but if so then it needs to be done properly. If it gets ignored prior to the popen(), then the child will also end up ignoring it by default, because signal disposition is inherited by child processes. If we ignore SIGINT, it should be after the popen(), not before. When the user sends SIGINT, he means to interrupt whatever processing is currently occurring. He expects to regain control of the terminal. If psql is in the process of sending data to the pager, then a SIGINT should cause psql to stop doing so. So I think the right answer here is for psql to handle SIGINT internally by doing a pclose() first (and at this point, it probably should ignore SIGINT altogether), then returning to the main loop (and, of course, cleaning up anything that needs it along the way). If the child hasn't exited then pclose() will block until it has. The end result should be the semantics you want: if psql is in the middle of sending a bunch of rows of output to the pager, this will interrupt that process. If the pager remains running then it will hopefully give the user the ability to scroll through whatever results were sent to it. Tom asked if we should be blocking SIGQUIT and SIGHUP too. Standard procedure for spawning external interactive processes includes blocking SIGQUIT too (see system() manpage). SIGQUIT has a different standard meaning in Unix than SIGINT: it causes the process to drop core. We should not be blocking it -- we should be leaving it alone. 
The reason is that it's quite possible that the user wants to have psql generate a core file while it's writing output to the pager. Logically speaking, when the user sends an interrupt from the keyboard they expect to interrupt the currently active *interactive* process. They expect to interrupt the currently active processing. Not quite the same thing. Hence, once psql has spawned the pager, it should ignore such interrupts until control is returned (after pclose). So yes, I would suggest blocking SIGQUIT also, if only to prevent terminal corruption problems. Interactive programs like less trap SIGQUIT to restore the terminal settings on exit, but they exit anyway. They should be dropping core upon receipt of SIGQUIT. It might be nice if they cleaned up the terminal first, but receipt of a SIGQUIT generally means that the user wants to run the resulting core file through a debugger, and trapping the signal could alter the stack such that the resulting core would be less useful. I'd rather have to clean up the terminal manually than have an unusable core file. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] GRANT/roles problem: grant is shown as from login role
Tom Lane wrote: So I think we don't have much choice but to implement theory #2; which is essentially the same thing I said earlier, ie, ACLs have to record the grantor of a privilege as being the role actually holding the grant option, not the role-member issuing the GRANT. There are really two different considerations here. The first is the meaning of the role relationships involved. With respect to this, I'm in agreement that the recorded grantor of the privilege should be the role actually holding the option. But the second is auditing. It's useful to know which user/role actually performed the grant in question, independent of the grant relationships themselves. These two are at odds with each other only if the system can record only one of the two things. The auditing consideration really argues for the implementation of an audit trail table/structure, if one doesn't already exist (and if it already exists, then clearly the ACLs should be storing the id of the role holding the grant, since the audit structure will separately record the user/role issuing the grant). -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Checkpoint cost, looks like it is WAL/CRC
Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: I don't think we should care too much about indexes. We can rebuild them...but losing heap sectors means *data loss*. If you're so concerned about *data loss* then none of this will be acceptable to you at all. We are talking about going from a system that can actually survive torn-page cases to one that can only tell you whether you've lost data to such a case. Arguing about the probability with which we can detect the loss seems beside the point. I realize I'm coming into this discussion a bit late, and perhaps my thinking on this is simplistically naive. That said, I think I have an idea of how to solve the torn page problem. If the hardware lies to you about the data being written to the disk, then no amount of work on our part can guarantee data integrity. So the below assumes that the hardware doesn't ever lie about this. If you want to prevent a torn page, you have to make the last synchronized write to the disk as part of the checkpoint process a write that *cannot* result in a torn page. So it has to be a write of a buffer that is no larger than the sector size of the disk. I'd make it 256 bytes, to be sure of accommodating pretty much any disk hardware out there. In any case, the modified sequence would go something like:

1. write the WAL entry, and encode in it a unique magic number
2. sync()
3. append the unique magic number to the WAL again (or to a separate file if you like, it doesn't matter as long as you know where to look for it during recovery), using a 256-byte (at most) write buffer.
4. sync()

After the first sync(), the OS guarantees that the data you've written so far is committed to the platters, with the possible exception of a torn page during the write operation, which will only happen during a crash during step 2.
But if a crash happens here, then the second occurrence of the unique magic number will not appear in the WAL (or separate file, if that's the mechanism chosen), and you'll *know* that you can't trust that the WAL entry was completely committed to the platter. If a crash happens during step 4, then either the appended magic number won't appear during recovery, in which case the recovery process can assume that the WAL entry is incomplete, or it will appear, in which case it's *guaranteed by the hardware* that the WAL entry is complete, because you'll know for sure that the previous sync() completed successfully. The amount of time between steps 2 and 4 should be small enough that there should be no significant performance penalty involved, relative to the time it takes for the first sync() to complete. Thoughts? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
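That sequence could be sketched as follows (a sketch only: wal_write_entry and the 256-byte marker size are assumptions of this example, and real WAL code would carry CRCs and proper error reporting rather than a bare magic number):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MARKER_SIZE 256                 /* no larger than one disk sector */

/* Hypothetical sketch of the proposed two-sync commit protocol. */
int wal_write_entry(int fd, const void *entry, size_t len, uint64_t magic)
{
    char marker[MARKER_SIZE];

    /* 1. write the WAL entry, with the magic number embedded in it */
    if (write(fd, &magic, sizeof magic) != (ssize_t) sizeof magic)
        return -1;
    if (write(fd, entry, len) != (ssize_t) len)
        return -1;

    /* 2. first sync: the entry is on the platter, modulo a torn write */
    if (fsync(fd) != 0)
        return -1;

    /* 3. append the magic number again, in a single sub-sector write */
    memset(marker, 0, sizeof marker);
    memcpy(marker, &magic, sizeof magic);
    if (write(fd, marker, sizeof marker) != (ssize_t) sizeof marker)
        return -1;

    /* 4. second sync: if the trailing marker survives a crash, the
     *    entry must have been completely committed by the first sync */
    return fsync(fd);
}
```

Recovery would then scan for the trailing marker: absent means the entry cannot be trusted; present means the first sync() completed, so the entry is whole.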
Re: [HACKERS] Checkpoint cost, looks like it is WAL/CRC
Bruce Momjian wrote: I don't think our problem is partial writes of WAL, which we already check, but heap/index page writes, which we currently do not check for partial writes. Hmm...I've read through the thread again and thought about the problem further, and now think I understand what you're dealing with. Ultimately, the problem is that you're storing diffs in the WAL, so you have to be able to guarantee that every data/index page has been completely written, right? There's no way to detect a torn page without some sort of marker in each disk-indivisible segment of the page, unless you're willing to checksum the entire page. With that in mind, the method Microsoft uses for SQL Server is probably about as simple as it gets. In our case, I suppose we may as well allocate one byte per 256-byte segment for the torn page marker. Just increment the marker value each time you write the page (you'll have to read it from the page prior to incrementing it, of course). Other than that, torn page detection is really just a special case of page corruption detection. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
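A sketch of that per-segment marker idea (the sizes and function names here are assumptions of this example, not PostgreSQL's or SQL Server's actual layout): one byte at the start of each 256-byte segment is stamped with the same generation number on every write, so a torn write leaves mismatched markers behind.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192
#define SEG_SIZE  256
#define NSEGS     (PAGE_SIZE / SEG_SIZE)

/* Before writing the page: read the old marker, increment it, and
 * stamp every segment with the new generation number. */
void page_stamp_markers(uint8_t *page)
{
    uint8_t gen = (uint8_t) (page[0] + 1);
    for (int i = 0; i < NSEGS; i++)
        page[i * SEG_SIZE] = gen;
}

/* After reading the page back: a torn write shows up as segments
 * whose markers disagree with the first segment's. */
bool page_is_torn(const uint8_t *page)
{
    for (int i = 1; i < NSEGS; i++)
        if (page[i * SEG_SIZE] != page[0])
            return true;
    return false;
}
```

The marker byte costs 32 bytes per 8K page and needs no checksum computation, which is presumably why the scheme is "about as simple as it gets."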
Re: [HACKERS] fool-toleranced optimizer
Greg Stark wrote: Kevin Brown [EMAIL PROTECTED] writes: Hence, it makes sense to go ahead and run the query, but issue a warning at the very beginning, e.g. WARNING: query JOINs tables list of tables without otherwise referencing or making use of those tables. This may cause excessively poor performance of the query. Well the problem with a warning is what if it *is* intentional? It's not ok to fill my logs up with warnings for every time the query is executed. That just forces me to turn off warnings. WARNING is probably the wrong level (I wasn't thinking in terms of PG logging, though I probably should have been). What about NOTICE? Basically, you want something that will alert the interactive user that what they're doing is likely to be stupid, but at the same time won't be a burden on the system or the DBA... It would be ok to have an option to block cartesian joins entirely. I might even choose to run with that enabled normally. I can always disable it for queries I know need cartesian joins. Which wouldn't work all that well for people who are trying to write their software in a reasonably portable fashion, unfortunately. However, the number of people who care would now be much smaller. For that matter, I wonder whether it's time to consider an option to disable implicit (ie, pre-ansi join syntax) joins entirely. It seems like lots of shops are likely imposing coding standards that require ansi join syntax anyways. In environments like that you would expect a CROSS JOIN b not just select * from a,b anyways. Shops like that might appreciate the ability to enforce a blanket coding standard on that point and get protection from accidental cartesian joins as a side benefit. That could be handy, but of course it should default to off, which with respect to cross joins would unfortunately wind up benefiting only those people who already are potentially aware of the issue and care about it (or, at least, those people who have DBAs that care about it).
-- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] fool-toleranced optimizer
Neil Conway wrote: Simon Riggs wrote: Oleg is saying that the optimizer doesn't protect against foolish SQL requests. His query is an example of a foolishly written query. IMHO calling this a foolishly written query is completely arbitrary. I can imagine plenty of applications for which a cartesian join makes sense. In this case the user didn't write the query they meant to write -- but it is surely hopeless to prevent that in the general case :) Sure, but this case, at least, is (hopefully) easily detectable (as such things go), has a high cost when it occurs, and is *usually* not what the user intended. Hence, it makes sense to go ahead and run the query, but issue a warning at the very beginning, e.g. WARNING: query JOINs tables list of tables without otherwise referencing or making use of those tables. This may cause excessively poor performance of the query. That said, the real question is whether or not it's worth putting in the effort to detect this condition and issue the warning. I'd say probably not, but if this is a big enough itch for someone then why should we discourage them from coding up a fix? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] We are not following the spec for HAVING without GROUP BY
Tom Lane wrote: What that means is that neither the HAVING clause nor the targetlist can use any ungrouped columns except within aggregate calls; that is, select col from tab having 2 > 1 is in fact illegal per SQL spec, because col isn't a grouping column (there are no grouping columns in this query). [...] Comments? Can anyone confirm whether DB2 or other databases allow ungrouped column references with HAVING? Oracle does not allow such references. It issues "ORA-00979: not a GROUP BY expression" when you try to hand it such a reference. MS SQL Server does not allow such references either, yielding "columnname is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause". Can't comment about DB2. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Help me recovering data
Tom Lane wrote: Gaetano Mendola [EMAIL PROTECTED] writes: BTW, why not do an automatic vacuum instead of shutdown ? At least the DB do not stop working untill someone study what the problem is and how solve it. No, the entire point of this discussion is to whup the DBA upside the head with a big enough cluestick to get him to install autovacuum. Once autovacuum is default, it won't matter anymore. I have a concern about this that I hope is just based on some misunderstanding on my part. My concern is: suppose that a database is modified extremely infrequently? So infrequently, in fact, that over a billion read transactions occur before the next write transaction. Once that write transaction occurs, you're hosed, right? Autovacuum won't catch this because it takes action based on the write activity that occurs in the tables. So: will autovacuum be coded to explicitly look for transaction wraparound, or to automatically vacuum every N number of transactions (e.g., 500 million)? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] IBM patent
Marc G. Fournier wrote: On Wed, 26 Jan 2005, Christopher Browne wrote: Actually, the latter isn't so. If Mammoth or Pervasive or such release their own release of PostgreSQL, nothing has historically mandated that they make that release available under the BSD license. Presumably acceptance of the patent would change that. You and I might not have individual objections to this situation, but one or another of the companies putting together PostgreSQL releases very well might. But, there is nothing stopping them from replacing the ARC code with their own variant though ... Not only that, I'd go further and say that they have a duty to either do that or pay someone to do it. They are, after all, the entities that probably care about the situation the most. This type of situation seems to me to be one that has to be examined from a "greatest good" point of view. If IBM were to allow all open source projects to make free use of a patent (thus exposing only those entities which sell commercial versions under a non-open-source license to risk), then the PG group might be faced with the tradeoff of using a superior but patented (though free for open source use) algorithm, or using a possibly inferior but unencumbered one. I'd wager that the vast majority of PostgreSQL users received their copy via the open source license. Unless the encumbered algorithm is not significantly superior to the unencumbered one, the greater good is likely to be to make use of the patented algorithm and force the non-open-source vendors to deal with removing the algorithm themselves. None of that really applies to the specific situation we're discussing, however: the current ARC implementation is apparently not showing itself to be a clearly superior approach, so some other approach is probably warranted.
-- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Allow GRANT/REVOKE permissions to be applied to all schema objects with one command
Alvaro Herrera wrote: On Fri, Jan 28, 2005 at 09:17:46PM +0100, Matthias Schmidt wrote: a) accept some sort of wildcard for the grant on table syntax: GRANT ... ON TABLE schema.* What about a list, GRANT ... ON TABLE table1, table2, ... TO user1, user2, ...; It would be good if it was a list of wildcards. Not sure if that is workable. Actually, what I'd *love* to see is for statements such as GRANT to allow select result sets to be used in place of arguments, e.g.: GRANT ... ON TABLE (SELECT table_schema || '.' || table_name FROM information_schema.tables WHERE table_schema IN ('public', 'postgres')) TO (SELECT usename from PG_USER WHERE usecatupd = true); Actually, it would be very nice if all DDL statements could work that way. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] ARC patent
Nicolai Tufar wrote: Second, a pending patent is not a granted patent; one is not infringing anything by distributing technology based on a pending patent. Given the patents the USPTO has been granting in recent times, if a patent is pending, it's almost certainly going to be granted. Especially if it comes from an entity such as IBM (the USPTO wouldn't want to upset its biggest paying customers, would it?), and especially if it's on something that isn't completely trivial. For that reason, I think it's quite reasonable to treat any pending patent from IBM as if it were a granted patent. The only way I could see the patent not being granted is if some large corporate entity like Microsoft filed an objection. That's possible, I suppose, but not something I would want to count on. But objections raised by small entities such as individuals will almost certainly be dropped on the floor, because such entities don't matter to the USPTO (or the rest of the government, for that matter), unless they are flush with cash. IBM can NEVER sue customers for using infringing code before first informing them of infringement and giving reasonable time to upgrade to an uninfringing version. This is the United States. People (and especially large corporations) can sue anybody for anything anytime they wish. And they do. Reason doesn't enter into it. Only money. See the SCO debacle for proof, and note that they're not suing in any other countries. If I sound bitter and cynical, well, there's lots of good reason for it. You need only look around, at least if you're in the U.S. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] [pgsql-hackers-win32] [BUGS] More SSL questions..
Bruce Momjian wrote: FWIW, I've seen several apps that use .txt for config files, but I can't think of an example right now. Most don't though - .cfg or .conf is probably most common. Except for the majority of windows programs that don't use config files - they use the registry. But I see no reason *at all* for us to want to do that :-) It also more or less requires you to write a GUI to change the config stuff and in that case the file extension becomes irrelevant. Where are we on this? I think Andrew and I both think *.txt is confusing. We need to decide on Monday if we should change the current *.txt names. We can either leave it unchanged, remove *.txt, or change it to *.config. APPDATA/postgresql/pgpass.txt APPDATA/postgresql/psqlrc.txt Another idea is to use *.conf. For what it's worth, I always thought that text configuration files on Windows platforms generally used the '.ini' extension. I believe on most Windows systems that extension is by default associated with Notepad. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] psql \e broken again
Tom Lane wrote: Magnus Hagander [EMAIL PROTECTED] writes: Agreed, no quotes on Unix. I'm just wondering what to do on Windows. I don't think I've seen a single windows program that uses the EDITOR variable. There are some ported unix tools, but that's it. Native windows program will have a per-program setting for this. The system default is picked based on file extension. So I doubt it will break a lot of things. So are you in favor of not quoting at all --- ie, reverting to the Unix behavior? I'm pretty sure that psql got changed because someone complained, so it seems like we'd be going in circles if we just do that and don't have any special behavior at all on Windows. You probably already know this but others on the list may not, so... The behavior of EDITOR under Unix comes about as a result of how it's invoked -- usually via the system() library call, which invokes a shell to parse the command. The fact that spaces in EDITOR's value are treated as argument delimiters instead of part of the path of the editor itself is a direct result of how the shell interprets the command string. It almost certainly doesn't make sense to retain those semantics under Windows, because to achieve real equivalence we would have to expand other shell metacharacters ourselves. Since Windows generally doesn't even make use of EDITOR as such, it probably makes the most sense for \e on that platform to save the edit buffer to a .txt file and execute it -- Windows will then invoke whichever editor is associated with text files (Notepad by default). It would have to be invoked in such a way that psql could wait for it to complete, of course. I suppose it might be nice to be able to override that, and do something else if EDITOR is defined. In that event it's probably safer to avoid parsing EDITOR and instead have it just refer to the full path to the program to be used.
Someone who is savvy enough to define EDITOR is also savvy enough to throw together a simple batch file which invokes his real editor with whatever options he wants. It seems to me that being able to properly specify the path of the program to use with a minimum of fuss (hence no strange quoting conventions) takes precedence. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] code question: storing INTO relation
Bruce Momjian wrote: Added to TODO: * Eliminate WAL logging for CREATE INDEX/REINDEX/CREATE TABLE AS when not doing WAL archiving But that means that these operations can't be performed in a transaction unless WAL archiving is on, right? (If I misunderstand and thus am wrong about this then you can ignore the rest of what follows, of course) Suppose I do the following within a transaction with WAL archiving turned off: CREATE TABLE blah (x integer primary key, y varchar(32)); INSERT INTO blah SELECT x, y FROM foo; and then roll it back because something didn't work out the way I wanted it to. If CREATE INDEX isn't WAL logged, then either the rollback of the above will result in an inconsistent database, or the fact that CREATE INDEX isn't WAL logged will cause the CREATE TABLE to fail because the index creation (and thus the table creation) can't be done within a transaction. Being able to do DDL within a transaction is one of the coolest and (from the POV of a DBA) most useful features of the database. If we're going to eliminate WAL logging of certain operations, it should be done at the explicit request of the DBA, preferably through a GUC, and preferably with a relevant GUC for each operation. Since WAL archiving requires that the operations in question be WAL logged, the GUCs that control WAL logging of those statements would clearly be ineffective if WAL archiving is turned on. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Unixware 714 pthreads
Tom Lane wrote: [EMAIL PROTECTED] writes: On Thu, 28 Oct 2004, Tom Lane wrote: No. Why should the signal handler need re-arming? My impression was that once caught, signal handler for a particular signal is reset to SIG_DFL. No. If your signal support is POSIX-compatible, it should not do that because we don't set SA_RESETHAND when calling sigaction(2). If you don't have POSIX signals, you had better have BSD-style signal(2), which doesn't reset either. If this is not happening as expected, you will have much worse problems than whether statement_timeout works :-( SysV-style signal(2) handling does indeed require that the signal handler be re-enabled. The attached program demonstrates this on Solaris, and probably on Unixware as well (I don't have access to the latter). Just run it and interrupt it with ctrl-c. It should print something the first time around, and actually be interrupted the second time. So if Unixware doesn't have sigaction() or it's not being picked up by autoconf then yeah, he'll have big problems... -- Kevin Brown [EMAIL PROTECTED]

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void sighandler(int sig)
{
    printf("Received signal %d\n", sig);
}

int main(int argc, char *argv[])
{
    signal(SIGINT, sighandler);
    while (1)
    {
        sleep(1);
    }
}

---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PATCHES] [HACKERS] ARC Memory Usage analysis
Tom Lane wrote: Greg Stark [EMAIL PROTECTED] writes: So I would suggest using something like 100us as the threshold for determining whether a buffer fetch came from cache. I see no reason to hardwire such a number. On any hardware, the distribution is going to be double-humped, and it will be pretty easy to determine a cutoff after minimal accumulation of data. The real question is whether we can afford a pair of gettimeofday() calls per read(). This isn't a big issue if the read actually results in I/O, but if it doesn't, the percentage overhead could be significant. If we assume that the effective_cache_size value isn't changing very fast, maybe it would be good enough to instrument only every N'th read (I'm imagining N on the order of 100) for this purpose. Or maybe we need only instrument reads that are of blocks that are close to where the ARC algorithm thinks the cache edge is. If it's decided to instrument reads, then perhaps an even better use of it would be to tune random_page_cost. If the storage manager knows the difference between a sequential scan and a random scan, then it should easily be able to measure the actual performance it gets for each and calculate random_page_cost based on the results. While the ARC lists can't be tuned on the fly, random_page_cost can. One small problem is that the time measurement gives you only a lower bound on the time the read() actually took. In a heavily loaded system you might not get the CPU back for long enough to fool you about whether the block came from cache or not. True, but that's information that you'd want to factor into the performance measurements anyway. The database needs to know how much wall clock time it takes for it to fetch a page under various circumstances from disk via the OS. For determining whether or not the read() hit the disk instead of just OS cache, what would matter is the average difference between the two. 
That's admittedly a problem if the difference is less than the noise, though, but at the same time that would imply that given the circumstances it really doesn't matter whether or not the page was fetched from disk: the difference is small enough that you could consider them equivalent. You don't need 100% accuracy for this stuff, just statistically significant accuracy. Another issue is what we do with the effective_cache_size value once we have a number we trust. We can't readily change the size of the ARC lists on the fly. Compare it with the current value, and notify the DBA if the values are significantly different? Perhaps write the computed value to a file so the DBA can look at it later? Same with other values that are computed on the fly. In fact, it might make sense to store them in a table that gets periodically updated, and load their values from that table, and then the values in postgresql.conf or the command line would be the default that's used if there's nothing in the table (and if you really want fine-grained control of this process, you could stick a boolean column in the table to indicate whether or not to load the value from the table at startup time). -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Should libpq set close-on-exec flag on its socket?
Tom Lane wrote: Dennis Bjorklund [EMAIL PROTECTED] writes: On Thu, 21 Oct 2004, Tom Lane wrote: It was suggested to me off-list that libpq should do fcntl(fd, F_SETFD, FD_CLOEXEC) on the socket connecting to the server. This would prevent any child program from accidentally or maliciously interfering with the connection. Either way that the lib sets it, the client can alter the setting itself by issuing a new SETFD command. That's a fair point, and certainly passing it down to the child intentionally wouldn't be a common case. I'll put the change in. Since program authors who would care about this one way or another probably won't be expecting this behavior, it should also be documented reasonably well -- something which I'm rather sure you were going to do anyway. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [BUGS] BUG #1290: Default value and ALTER...TYPE
Tom Lane wrote: PostgreSQL Bugs List [EMAIL PROTECTED] writes:

troels=# create table lookat_feature(
troels(# feature_id char(4),
troels(# status varchar(2) default 'TODO'
troels(# );
CREATE TABLE
troels=# alter table lookat_feature
troels-# alter column status type varchar(4);
ALTER TABLE
troels=# insert into lookat_feature (feature_id) values('B034');
ERROR: value too long for type character varying(2)

Hmm. What's going on here is that the stored default expression is actually of the form ('TODO'::varchar)::varchar(2) where you don't see the coercion to varchar(2) in \d because ruleutils.c doesn't show implicit casts. After the ALTER COLUMN it's of the form (('TODO'::varchar)::varchar(2))::varchar(4) which of course will give an error when used. Possibly we should make ALTER COLUMN strip any implicit coercions that appear at the top level of the default expression before it adds on the implicit coercion to the new column datatype. I am not sure that this is a good idea, however; it seems like it might alter the semantics in unexpected ways. (The default expression could potentially come through differently than an actually stored value of the column would do.) The alternative would seem to be decreeing that this is not a bug. Comments anyone? I think the conversion from 'TODO' to varchar(2) and then to varchar(4) is much more surprising than converting 'TODO' directly to varchar(4) after the ALTER TABLE. In short, as a DBA I would expect the database to do any conversion into the column target type based on the original specified default value, and not some intermediate form that exists only because of the history of the column's datatype. So, my vote is for the form to be ('TODO'::varchar)::varchar(4) after the ALTER TABLE in the example. FWIW... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Tablespace issues (comment on ,moving indexes)
Christopher Kings-Lynne wrote: Errr, unlike all the other uses for alter table and friends? ie: OWNER TO Which changes the attributes of the table... And indexes. Sure. But not *just* indexes. RENAME TO Same. And indexes. It does? I thought the indexes pointed to relations directly, not to tables by name, and so changing the name of the table wouldn't have any effect on the indexes, right? SET TABLESPACE Which again changes the attributes of the table.. And indexes. But it does change more than just the indexes. But the context here is changing the tablespace of indexes independently of the tablespace for the table. For that, how exactly does it affect the table metadata? Not at all, I'd wager. If you're going to go use ALTER TABLE to make changes to the attributes of indexes, might I suggest that you also use ALTER TABLE to create and destroy them as well? Otherwise you end up with an inconsistent language, which is fine if the spec calls for it or if you somehow are attempting to maintain compatibility with something. But what we're talking about here is brand new functionality for which the language hasn't been defined yet. It would be a bit unfortunate to introduce inconsistencies where they're not needed, wouldn't you say? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Tablespace issues (comment on ,moving indexes)
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: ... But what we're talking about here is brand new functionality for which the language hasn't been defined yet. You're missing the point, which is that there *is* a precedent of long standing. ALTER TABLE has worked on indexes (and sequences, and views) for those cases in which the operation sensibly applied for a long time. In particular, the original 7.1 implementation of ALTER TABLE OWNER would work on tables, indexes, sequences, and views. Should we really have insisted on inventing four syntaxes for the identical operation? Maybe, but we didn't, and now there is a precedent to follow. And yet we have ALTER SEQUENCE. In 7.4, we seem to have:

ALTER AGGREGATE
ALTER CONVERSION
ALTER DATABASE
ALTER DOMAIN
ALTER FUNCTION
ALTER GROUP
ALTER LANGUAGE
ALTER OPERATOR CLASS
ALTER SCHEMA
ALTER SEQUENCE
ALTER TABLE
ALTER TRIGGER
ALTER USER

Within ALTER TABLE, you can change:

1. columns
2. the table name
3. constraints
4. table ownership
5. index clustering

and within those, only (2) and (4) apply to sequences and views, and (5) is the only ALTER TABLE operation that applies to indexes (corrections to this welcome). Furthermore, the rename operation for triggers, languages, groups, functions, databases, conversions, and aggregates are all implemented in their own ALTER statement (indeed, the rename operation is the only ALTER operation for some of those). The decision to roll some of the functionality affecting sequences and views into ALTER TABLE is at least somewhat sensible: those things look like tables in at least one key way, namely that they can be SELECTed from. That's not true of indexes, and so that reasoning does not apply to using ALTER TABLE to change an index's tablespace. It appears to me that the precedent for creating a new ALTER statement is actually much bigger than the precedent for rolling functionality into ALTER TABLE, based on the above. But that's just my bird's eye view on things. 
I'm sure lots of people disagree with me on this. :-) I'm certainly not arguing for a wholesale rework of the syntax in order to achieve maximum consistency (nice as that might be), but it seems to me that it would be a mistake to introduce more inconsistency than is already there when it's not necessary to do so. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Tablespace issues (comment on ,moving indexes)
I wrote: I'm certainly not arguing for a wholesale rework of the syntax in order to achieve maximum consistency (nice as that might be), but it seems to me that it would be a mistake to introduce more inconsistency than is already there when it's not necessary to do so. What I mean here is that I think it would be in our best interests to define the syntax for any new operation to be as easily guessed as possible. I believe that ALTER INDEX would be more easily guessed by more people as the means by which one would alter an index's tablespace than ALTER TABLE, even if those people have a decent amount of PG experience. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Tablespace issues (comment on ,moving indexes)
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: 1. there is no COMMENT ON TABLESPACE support That's right. 2. how is one supposed to move indexes(not tables) to another tablespace? Use ALTER TABLE on the index. Hmm...not ALTER INDEX? Now that there's an operation that actually modifies an index instead of the table itself, should there be an ALTER INDEX? It would be cleaner and more consistent, IMO... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Tablespace issues (comment on ,moving indexes)
Christopher Kings-Lynne wrote: Hmm...not ALTER INDEX? Now that there's an operation that actually modifies an index instead of the table itself, should there be an ALTER INDEX? It would be cleaner and more consistent, IMO... Errr, unlike all the other uses for alter table and friends? ie: OWNER TO Which changes the attributes of the table... RENAME TO Same. SET TABLESPACE Which again changes the attributes of the table.. But using ALTER TABLE to change the tablespace that an index belongs to doesn't change an attribute of a table, it changes the attribute of an index. etc. Lots of things against tables work against indexes and views. Some stuff for commenting on columns say works on views, composite types and indexes! No doubt. Of course, that something's been done a certain way in the past doesn't imply that it's the right way to do something new, nor does it imply that the new thing must be done that way. I mean, it's not a terribly big deal or anything, but since we're talking about stuff that isn't in the SQL spec it seems reasonable to define the commands in such a way that they don't violate the principle of least surprise. Using ALTER TABLE to alter the characteristics of an index violates that principle, at least in my opinion. It's not the first command I would have thought of when asking myself how do I change the tablespace of an index? -- ALTER INDEX is. And the reason is simple: we use CREATE INDEX to create an index and DROP INDEX to drop one -- we don't use ALTER TABLE subcommands to create or drop indexes. Why, then, should modification of an index's properties be treated any differently than the rest of the index manipulation commands? I just happen to like consistency. :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] DROP TABLESPACE causes panic during recovery
Tom Lane wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Maybe we could avoid removing it until the next checkpoint? Or is that not enough. Maybe it could stay there forever :/ Part of the problem here is that this code has to serve several purposes. We have different scenarios to worry about: * crash recovery from the most recent checkpoint * PITR replay over a long interval (many checkpoints) * recovery in the face of a partially corrupt filesystem It's the last one that is mostly bothering me at the moment. I don't want us to throw away data simply because the filesystem forgot an inode. Yeah, we might not have enough data in the WAL log to completely reconstruct a table, but we should push out what we do have, *not* toss it into the bit bucket. I like the idea tossed out by one of the others the most: create a recovery system tablespace, and use it to resolve issues like this. The question is: what do you do with the tables in that tablespace once recovery is complete? Leave them there? That's certainly a possibility (in fact, it seems the best option, especially now that we're doing PITR), but it means that the DBA would have to periodically clean up that tablespace so that it doesn't run out of space during a later recovery. Actually, it seems to me to be the only option that isn't the equivalent of throwing away the data... In the first case (straight crash recovery) I think it is true that any reference to a missing file is a reference to a file that will get deleted before recovery finishes. But I don't think that holds for PITR (we might be asked to stop short of where the table gets deleted) nor for the case where there's been filesystem damage. But doesn't PITR assume that a full filesystem-level restore of the database as it was prior to the events in the first event log being replayed has been done? In that event, wouldn't the PITR process Just Work? 
-- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] DROP TABLESPACE causes panic during recovery
Tom Lane wrote: In CVS tip, try running the regression tests against an installed postmaster (ie, make installcheck); then as soon as the tests are done, kill -9 the bgwriter process to force a database restart. Most of the time you'll get a PANIC during recovery: [...] This is impossible to fix nicely because the information to reconstruct the tablespace is simply not available. We could make an ordinary directory (not a symlink) under pg_tblspc and then limp along in the expectation that it would get removed before we finish replay. Or we could just skip logged operations on files within the tablespace, but that feels pretty uncomfortable to me --- it amounts to deliberately discarding data ... Any thoughts? How is a dropped table handled by the recovery code? Doesn't it present the same sort of issues (though on a smaller scale)? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Too-many-files errors on OS X
Larry Rosenman wrote: I had to hack on the code some more for FreeBSD: (the realloc call needed the multiplication). I ran this same code on UnixWare. I feel like a moron, having missed that. Probably explains the bad file number error I was getting on AIX, too... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Too-many-files errors on OS X
Tom Lane wrote: However, it seems that the real problem here is that we are so far off base about how many files we can open. I wonder whether we should stop relying on sysconf() and instead try to make some direct probe of the number of files we can open. I'm imagining repeatedly open() until failure at some point during postmaster startup, and then save that result as the number-of-openable-files limit. I strongly favor this method. In particular, the probe should probably be done after all shared libraries have been loaded and initialized. I originally thought that each shared library that was loaded would eat a file descriptor (since I thought it would be implemented via mmap()) but that doesn't seem to be the case, at least under Linux (for those who are curious, you can close the underlying file after you perform the mmap() and the mapped region still works). If it's true under any OS then it would certainly be prudent to measure the available file descriptors after the shared libs have been loaded (another reason is that the init function of a library might itself open a file and keep it open, but this isn't likely to happen very often). I also notice that OS X 10.3 seems to have working SysV semaphore support. I am tempted to change template/darwin to use SysV where available, instead of Posix semaphores. I wonder whether inheriting 100-or-so open file descriptors every time we launch a backend isn't in itself a nasty performance hit, quite aside from its effect on how many normal files we can open. I imagine this could easily be tested. I rather doubt that the performance hit would be terribly large, but we certainly shouldn't rule it out without testing it first. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] PostgreSQL configuration
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: The goal here is simply to make it obvious to a system administrator where the PG data directory that a given postmaster is using resides. Why would it not be sufficient to add a read-only GUC variable that tells that? Connect to the postmaster and do show datadir and you're done. (Without this, it's not clear you've made any particular gain anyway, since a given postmaster would typically mean the one I can connect to at this port, no?) That would probably be sufficient for most cases. It wouldn't take care of the case where there's a strict separation of powers between the system administrator and the DBA, but only if the system were managed badly (i.e., the SA and the DBA don't talk to each other very well). That's probably something we shouldn't concern ourselves with. In any case I don't see how removing PGDATA would make this more obvious. You yourself just pointed out that the command-line arguments of a postmaster aren't necessarily visible through ps; if they're not, what have you gained in transparency by forbidding PGDATA? I think you misunderstood what I was saying (which means I didn't say it right). There are ways within a program to change what 'ps' shows as the command line. We use those methods to make it possible to see what a given backend is doing by looking at the 'ps' output. It would be possible to have the postmaster use those ways in order to show which data directory it is using even if it wasn't specified on the command line. But in my experience, those ways don't work reliably on all systems. On the systems that those methods don't work, what 'ps' shows is the original command line that was used. So clearly, the only way 'ps' will show the data directory in that instance is if it was actually specified on the command line. 
In any case, I'm not at all opposed to having the backend stuff know about PGDATA during development, but for production you should have to explicitly specify the data directory on the command line. If you wish to do things that way, you can; but that doesn't mean that everyone else should have to do it that way too. If there were a security or reliability hazard involved, I might agree with taking the fascist approach, but I see no such hazard here ... Fair enough. The PGDATA issue isn't a big enough one that I'm terribly concerned about it, especially if a read-only GUC variable is available to give that information (something that, I think, should be there anyway). -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] PostgreSQL configuration
Simon Riggs wrote: Very much agreed. PGDATA is important, lets keep it, please. To me it's not so much whether or not PGDATA is kept around for the system as a whole so much as how it's used. In the general case, scripts are used to start the postmaster. So using PGDATA even if the postmaster doesn't directly make use of it is a simple matter of adding '-D $PGDATA' to the command that invokes the postmaster. The goal here is simply to make it obvious to a system administrator where the PG data directory that a given postmaster is using resides. We can't rely on the mechanism used to change the command string that ps shows for the process: in my experience it's something that often does not work. And in any case, the system administrator will also want to know exactly what options were passed to the postmaster when it was invoked. If there's any group that can figure out how to effortlessly get PGDATA onto the command line of the backend utilities, it's the developer group. :-) In any case, I'm not at all opposed to having the backend stuff know about PGDATA during development, but for production you should have to explicitly specify the data directory on the command line. That seems easy enough to do: #ifdef is your friend. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] PostgreSQL configuration
Tom Lane wrote: Well, the patch says that the command line switch wins, which is consistent with what we do for other command line switches (they all override the equivalent postgresql.conf entries). This does seem a bit at variance with the stated goal of making the configuration more clearly documented, though :-(. Hmm...well, think of it as a tool. It makes it *possible* to make the configuration more clearly documented, and in fact makes it easy to do so, but doesn't guarantee safety in all cases. If you actually use the capability then your config file will be lying to you about where things are. Of course. Just like your config file is lying about any configuration option that is overridden on the command line. I don't see why this is a problem, unless we intend to change the way the entire GUC system works. It's worth pointing out in this connection that for the most part I think people are moving *away* from using command line switches; it's better to set the value in postgresql.conf, both for documentation reasons and because that way you have some chance of changing the value via config file update and SIGHUP. The only way to change a value on the command line is to restart the postmaster. Plus, if you're using a distribution-supplied init script to start the postmaster, it's hard to get any switches in without hacking the script anyway. Now this raises a very interesting problem. Namely, what happens if you use the -C option to the postmaster as is being advocated, then change the datadir entry in the config file, and send SIGHUP to the postmaster? Ooops. Score one for Tom. :-) Most of these objections also apply to values obtained from environment variables (the exception is that postgresql.conf can override environment variables). To be honest, I think the use of the PGDATA environment variable is the biggest impediment to self-documentation - the postmaster should not use it. 
The reason is that if PGDATA is used to specify the location of the data directory, you won't be able to find out where a running postmaster's data directory is located without doing some heavy-duty investigation. Not all operating systems make it possible to determine the values of a particular process' environment variables. By requiring that the data directory be specified on the postmaster command line, it becomes possible to always determine where a postmaster's data directory resides just by looking at the ps output. Now, I know you guys who do heavy-duty development make use of PGDATA. I see no problem with having the code in postmaster that looks at PGDATA be surrounded by an #ifdef that is active whenever you're doing development work. But it should *not* be active on a production system. Oh, as to the safety issue of a config file not properly corresponding to a given data directory, that seems easy enough to solve: if a file (call it magic for the purposes of discussion, though perhaps a better name would be do_not_remove :-) ) exists in the data directory, then the value of a configuration variable (call it magic, too) must match the contents of that file. If the values don't match then the postmaster will issue an error and refuse to start. If the file doesn't exist then no magic configuration option need exist in the config file, and the postmaster will start as usual. So any administrator who wants to make sure that a configuration file has to explicitly be targeted at the data directory can do so. End result: if you use the -D option on the command line with an inappropriate -C option, the postmaster will refuse to run. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] PostgreSQL configuration
Tom Lane wrote: Honza Pazdziora [EMAIL PROTECTED] writes: On Thu, Apr 08, 2004 at 10:31:44AM -0400, Tom Lane wrote: It seems that MySQL *will* read /etc/my.cnf if it exists, whether it's appropriate or not, and so it's impossible to have a truly independent test installation, even though you can configure it to build/install into nonstandard directories. Let's not emulate that bit of brain damage. A counterexample of Apache shows that you can easily use -f or another command line option to point the server to alternate master config file (which I believe is the same with MySQL). According to http://www.mysql.com/documentation/mysql/bychapter/manual_Using_MySQL_Programs.html#Option_files /etc/my.cnf will be read if it exists, no matter what you say on the command line. So AFAICS the only way to make a private installation is to make sure that you have overridden each and every setting in /etc/my.cnf in a private config file that you do control. This is tedious and breakage-prone, of course. Yes. But we don't have to do that. If we're truly concerned about the possibility of multiple installations attempting to use the same config, then the answer is simple: require that the location of the config file be specified on the command line and don't compile a default location into the binary. Similarly, don't take the value from an environment variable. Packaged installations won't have trouble with this: they supply a startup script which would pass the appropriate argument to the postmaster. If we want to be a bit paranoid (justifiable if you've got really important data on the line), we could also require that a version string exist in the config file. If the version string doesn't match the version of the postmaster being started, the postmaster exits with an error (and a hint of what to set the version string to and what the name of the version string parameter is). 
That way, even if you screw up on the command line, you won't hose a database by starting the wrong version of the postmaster against it. Not sure if this would break anything, though. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Function to kill backend
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Not having a way to kill backends is like having no way to kill a process except rebooting the server. Some people think that making a database hard to kill is a good thing. Sure. But we're not talking about taking down the whole database, we're talking about taking down a connection. Killing the database is the equivalent of killing the OS. It should be hard to do either. But it should be easy to kill a process on an OS if you have the right permissions, and similarly it should be easy to kill a connection to the database if you have the right permissions. With respect to nested transactions and other things that might make properly shutting down difficult, it seems to me that the SIGINT case is actually a harder case to deal with. Why? Because for the SIGTERM case, you basically have to do whatever is done whenever the connection itself drops. If we can't handle the connection itself dropping out arbitrarily then we have more serious problems than just how to handle SIGTERM. :-) But for SIGINT you have to decide whether to just abort the innermost transaction or the outermost one, and if it's the outermost one you have to abort then you have to provide the mechanism for it -- something that you might not have to deal with otherwise. So it seems that handling SIGTERM might actually be easy: you have the signal handler close the backend's side of the connection and let the connection-dropping logic kick in automatically, no? Thoughts? Am I completely off my rocker here? :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] [HACKERS] fsync method checking
I wrote: Note, too, that the preferred method isn't likely to depend just on the operating system, it's likely to depend also on the filesystem type being used. Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs, and that's just off the top of my head. I imagine the performance of the various syncing methods will vary significantly between them. For what it's worth, my database throughput for transactions involving a lot of inserts, updates, and deletes is about 12% faster using fdatasync() than O_SYNC under Linux using JFS. I'll run the test program and report my results with it as well, so we'll be able to see if there's any consistency between it and the live database. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] [HACKERS] fsync method checking
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Well, I wrote the program to allow testing. I don't see a complex test as being that much better than simple one. We don't need accurate numbers. We just need to know if fsync or O_SYNC is faster. Faster than what? The thing everyone is trying to point out here is that it depends on context, and we have little faith that this test program creates a context similar to a live Postgres database. Note, too, that the preferred method isn't likely to depend just on the operating system, it's likely to depend also on the filesystem type being used. Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs, and that's just off the top of my head. I imagine the performance of the various syncing methods will vary significantly between them. It seems reasonable to me that decisions such as which sync method to use should initially be made at installation time: have the test program run on the target filesystem as part of the installation process, and build the initial postgresql.conf based on the results. You might even be able to do some additional testing such as measuring the difference between random block access and sequential access, and again feed the results into the postgresql.conf file. This is no substitute for experience with the platform, but I expect it's likely to get you closer to something optimal than doing nothing. The only question, of course, is whether or not it's worth going to the effort when it may or may not gain you a whole lot. Answering that is going to require some experimentation with such an automatic configuration system. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
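To make the filesystem-dependence point concrete, here is a deliberately naive sketch (not the actual test program from the thread) that times a burst of small writes under write()+fdatasync() versus a descriptor opened with O_SYNC; the numbers it prints will differ from platform to platform and filesystem to filesystem, which is the point.

```python
# A deliberately simple sketch of comparing sync methods; absolute and
# relative results depend heavily on OS and filesystem.
import os
import tempfile
import time

BLOCK = b"x" * 8192
COUNT = 20

def timed_writes(extra_flags, sync_call):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd = os.open(path, os.O_WRONLY | extra_flags)
    start = time.perf_counter()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
        if sync_call is not None:
            sync_call(fd)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

fdatasync = getattr(os, "fdatasync", os.fsync)  # not all platforms have it
t_fdatasync = timed_writes(0, fdatasync)
t_osync = timed_writes(getattr(os, "O_SYNC", 0), None)
print(f"fdatasync: {t_fdatasync:.4f}s  O_SYNC: {t_osync:.4f}s")
```

Feeding results like these into an installation-time step that writes the initial postgresql.conf is the kind of automatic configuration being proposed above.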
Re: [HACKERS] Too-many-files errors on OS X
I wrote: Larry Rosenman wrote: I had to hack on the code some more for FreeBSD: (the realloc call needed the multiplication). I ran this same code on UnixWare. I feel like a moron, having missed that. Probably explains the bad file number error I was getting on AIX, too... And sure enough, that was it. Got the same results on AIX 5 as on other systems: [EMAIL PROTECTED]:~$ ./eatfds /usr/lib/librpm.so.0 /usr/lib/librpmbuild.so.0 dup() failed: Too many open files Was able to use 1997 file descriptors dup() failed: Too many open files Was able to use 1997 file descriptors after opening 2 shared libs [EMAIL PROTECTED]:~$ uname -a AIX m048 1 5 0001063A4C00 -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Too-many-files errors on OS X
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: I originally thought that each shared library that was loaded would eat a file descriptor (since I thought it would be implemented via mmap()) but that doesn't seem to be the case, at least under Linux Hmm. This may be OS-specific. The shlibs certainly show up in the output of lsof in every variant I've checked, but do they count against your open-file limit? It seems not, for both shared libraries that are linked in at startup time by the dynamic linker and shared libraries that are explicitly opened via dlopen(). This seems to be true for Linux and Solaris (I wasn't able to test on HP-UX, and AIX yields a strange "bad file number" error that I've yet to track down). Attached is the test program I used. It takes as its arguments a list of files to hand to dlopen(), and will show how many files it was able to open before and after running a batch of dlopen() commands. -- Kevin Brown [EMAIL PROTECTED]

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <dlfcn.h>

int *fd;
int size = 1024;

int
eatallfds(void)
{
	int i = 0;
	int j, myfd;

	while (1) {
		myfd = dup(0);
		if (myfd < 0) {
			fprintf(stderr, "dup() failed: %s\n", strerror(errno));
			break;
		}
		fd[i++] = myfd;
		if (i >= size) {
			size *= 2;
			/* multiply by the element size, per the fix noted upthread */
			fd = realloc(fd, size * sizeof(*fd));
			if (fd == NULL) {
				fprintf(stderr, "Can't allocate: %s\n", strerror(errno));
				fprintf(stderr, "Had used %d descriptors\n", i);
				exit(1);
			}
		}
	}
	for (j = 0; j < i; ++j) {
		close(fd[j]);
	}
	return i;
}

int
main(int argc, char *argv[])
{
	int n, na;
	int i;
	void *addr;

	size = 1024;
	fd = malloc(size * sizeof(*fd));
	if (fd == NULL) {
		fprintf(stderr, "Can't allocate: %s\n", strerror(errno));
		return 1;
	}
	n = eatallfds();
	printf("Was able to use %d file descriptors\n", n);
	na = 0;
	for (i = 1; i < argc; ++i) {
		addr = dlopen(argv[i], RTLD_LAZY);
		if (addr != NULL)
			na++;
	}
	n = eatallfds();
	printf("Was able to use %d file descriptors after opening %d shared libs\n", n, na);
	return 0;
}

---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint
I wrote: But that someplace else could easily be a process forked by the backend in question whose sole purpose is to go through the list of files generated by its parent backend and fsync() them. The backend can then go about its business and upon receipt of the SIGCHLD notify anyone that needs to be notified that the fsync()s have completed. Duh, what am I thinking? Of course, the right answer is to have the child notify anyone that needs notification that fsync()s are done. No need for involvement of the parent (i.e., the backend in question) unless the architecture of PG requires it somehow. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: Well, running out of space in the list isn't that much of a problem. If the backends run out of list space (and the max size of the list could be a configurable thing, either as a percentage of shared memory or as an absolute size), then all that happens is that the background writer might end up fsync()ing some files that have already been fsync()ed. But that's not that big of a deal -- the fact they've already been fsync()ed means that there shouldn't be any data in the kernel buffers left to write to disk, so subsequent fsync()s should return quickly. Yes, it's a big deal. You're arguing as though the bgwriter is the thing that needs to be fast, when actually what we care about is the backends being fast. If the bgwriter isn't doing the vast bulk of the writing (and especially the fsync waits) then we are wasting our time having one at all. So we need a scheme that makes it as unlikely as possible that backends will have to do their own fsyncs. Small per-backend fsync lists aren't the way to do that. Ah, okay. Pardon me, I was writing on low sleep at the time. If we want to make the backends as fast as possible then they should defer synchronous writes to someplace else. But that someplace else could easily be a process forked by the backend in question whose sole purpose is to go through the list of files generated by its parent backend and fsync() them. The backend can then go about its business and upon receipt of the SIGCHLD notify anyone that needs to be notified that the fsync()s have completed. This approach on any reasonable OS will have minimal overhead because of copy-on-write page handling in the kernel and the fact that the child process isn't going to exec() or write to a bunch of memory. The advantage is that each backend can maintain its own list in per-process memory instead of using shared memory. 
The disadvantage is that a given file could have multiple simultaneous (or close to simultaneous) fsync()s issued against it. As noted previously, that might not be such a big deal. You could still build a list in shared memory of the files that backends are accessing but it would then be a cache of sorts because it would be fixed in size. As soon as you run out of space in the shared list, you'll have to expire some entries. An expired entry simply means that multiple fsync()s might be issued for the file being referred to. But I suspect that such a list would have far too much contention, and that it would be more efficient to simply risk issuing multiple fsync()s against the same file by multiple backend children. Another advantage to the child-of-backend-fsync()s approach is that it would cause simultaneous fsync()s to happen, and on more advanced OSes the OS itself should be able to coalesce the work to be done into a more efficient pattern of writes to the disk. That won't be possible if fsync()s are serialized by PG. It's not as good as a syscall that would allow you to fsync() a bunch of file descriptors simultaneously, but it might be close. I have no idea whether or not this approach would work in Windows. Perhaps a better way to do it would be to store the list of all the relfilenodes of everything in pg_class, with a flag for each indicating whether or not an fsync() of the file needs to take place. You're forgetting that we have a fixed-size workspace to do this in ... and no way to know at postmaster start how many relations there are in any of our databases, let alone predict how many there might be later on. Unfortunately, this is going to apply to most any approach. The number of blocks being dealt with is not fixed, because even though the cache itself is fixed in size, the number of block writes it represents (and thus the number of files involved) is not. The list of files itself is not fixed in size, either. 
However, this *does* suggest another possible approach: you set up a fixed size list and fsync() the batch when it fills up. It sounds like we need to define the particular behavior we want first. We're optimizing for some combination of throughput and responsiveness, and those aren't necessarily the same thing. I suppose this means that the solution chosen has to have enough knobs to allow the DBA to pick where on the throughput/responsiveness curve he wants to be. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
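The fork-a-child-to-fsync scheme discussed above can be sketched in a few lines (illustrative only; real backend code would hand over a whole list of files and use a SIGCHLD handler rather than a blocking waitpid()):

```python
# Sketch of the child-of-backend-fsync() idea: the parent forks, the
# child fsync()s the dirty file and exits, and the parent learns of
# completion when the child is reaped.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"dirty page image")

pid = os.fork()
if pid == 0:
    # Child: copy-on-write makes the fork cheap, since we neither exec()
    # nor write to much memory -- just flush and exit.
    os.fsync(fd)
    os._exit(0)

# Parent: free to continue immediately; in the scheme above, a SIGCHLD
# handler would notify whoever is waiting on the checkpoint.
_, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status)
print("fsync child exited with", exit_code)
os.close(fd)
os.unlink(path)
```

Because each backend's child carries its own file list in ordinary process memory, no shared-memory list (and none of the associated sizing problems) is needed for this variant.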
Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: Instead, have each backend maintain its own separate list in shared memory. The only readers of a given list would be the backend it belongs to and the bgwriter, and the only time bgwriter attempts to read the list is at checkpoint time. The sum total size of all the lists shouldn't be that much larger than it would be if you maintained it as a global list. I fear that is just wishful thinking. Consider the system catalogs as a counterexample of files that are likely to be touched/modified by many different backends. Oh, I'm not arguing that there won't be a set of files touched by a lot of backends, just that the number of such files is likely to be relatively small -- a few tens of files, perhaps. But that admittedly can add up fast. But see below. The bigger problem though with this is that it makes the problem of list overflow much worse. The hard part about shared memory management is not so much that the available space is small, as that the available space is fixed --- we can't easily change it after postmaster start. The more finely you slice your workspace, the more likely it becomes that one particular part will run out of space. So the inefficient case where a backend isn't able to insert something into the appropriate list will become considerably more of a factor. Well, running out of space in the list isn't that much of a problem. If the backends run out of list space (and the max size of the list could be a configurable thing, either as a percentage of shared memory or as an absolute size), then all that happens is that the background writer might end up fsync()ing some files that have already been fsync()ed. But that's not that big of a deal -- the fact they've already been fsync()ed means that there shouldn't be any data in the kernel buffers left to write to disk, so subsequent fsync()s should return quickly. 
How quickly depends on the individual kernel's implementation of the dirty buffer list as it relates to file descriptors. Perhaps a better way to do it would be to store the list of all the relfilenodes of everything in pg_class, with a flag for each indicating whether or not an fsync() of the file needs to take place. When anything writes to a file without O_SYNC or a trailing fsync(), it sets the flag for the relfilenode of what it's writing. Then at checkpoint time, the bgwriter can scan the list and fsync() everything that has been flagged. The relfilenode list should be relatively small in size: at most 16 bytes per item (and that on a 64-bit machine). A database that has 4096 file objects would have a 64K list at most. Not bad. Because each database backend can only see the class objects associated with the database it's connected to or the global objects (if there's a way to see all objects I'd like to know about it, but pg_class only shows objects in the current database or objects which are visible to all databases), the relfilenode list might have to be broken up into one list per database, with perhaps a separate list for global objects. The interesting question in that situation is how to handle object creation and removal, which should be a relatively rare occurrence (fortunately), so it supposedly doesn't have to be all that efficient. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Bruce Momjian wrote: Here is my new idea. (I will keep throwing out ideas until I hit on a good one.) The bgwriter is going to have to check before every write to determine if the file is already recorded as needing fsync during checkpoint. My idea is to have that checking happen during the bgwriter buffer scan, rather than at write time. If we add a shared memory boolean for each buffer, backends needing to write buffers can write buffers already recorded as safe to write by the bgwriter scanner. I don't think the bgwriter is going to be able to keep up with I/O bound backends, but I do think it can scan and set those booleans fast enough for the backends to then perform the writes. (We might need a separate bgwriter thread to do this or a separate process.) That seems a bit excessive. It seems to me that contention is only a problem if you keep a centralized list of files that have been written by all the backends. So don't do that. Instead, have each backend maintain its own separate list in shared memory. The only readers of a given list would be the backend it belongs to and the bgwriter, and the only time bgwriter attempts to read the list is at checkpoint time. At checkpoint time, for each backend list, the bgwriter grabs a write lock on the list, copies it into its own memory space, truncates the list, and then releases the write lock. It then deletes the entries out of its own list that have entries in the backend list it just read. It then fsync()s the files that are left, under the assumption that the backends will fsync() any file they write to directly. The sum total size of all the lists shouldn't be that much larger than it would be if you maintained it as a global list. I'd conjecture that backends that touch many of the same files are not likely to be touching a large number of files per checkpoint, and those systems that touch a large number of files probably do so through a lot of independent backends. 
One other thing: I don't know exactly how checkpoints are orchestrated between individual backends, but it seems clear to me that you want to do a sync() *first*, then the fsync()s. The reason is that sync() allows the OS to order the writes across all the files in the most efficient manner possible, whereas fsync() only takes care of the blocks belonging to the file in question. This won't be an option under Windows, but on Unix systems it should make a difference. On Linux it should make quite a difference, since its sync() won't return until the buffers have been flushed -- and then the following fsync()s will return almost instantaneously since their data has already been written (so there won't be any dirty blocks in those files). I suppose it's possible that on some OSes fsync()s could interfere with a running sync(), but for those OSes we can just drop back to doing only fsync()s. As usual, I could be completely full of it. Take this for what it's worth. :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
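In miniature, the ordering argued for above looks like this (a sketch, not PostgreSQL code; on Windows, where there is no sync(), the first step would simply be skipped):

```python
# sync()-before-fsync() ordering sketch: the kernel-wide flush goes
# first, so the per-file fsync() that the checkpoint must still issue
# for its durability guarantee has little left to do.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"checkpoint data")

os.sync()     # let the OS schedule all dirty buffers across all files
os.fsync(fd)  # per-file guarantee; cheap if the data is already flushed

os.close(fd)
os.unlink(path)
print("checkpoint flush complete")
```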
Re: [HACKERS] index scan with functional indexes -- solved
Dave Cramer wrote: Interesting it works now, and the good news is it is *WAY* faster, this might be able to speed up marc's doc search by orders of magnitude this is searching 100536 rows select * from url where fn_strrev(url) like fn_strrev('%beta12.html'); 1.57ms explain select * from url where url like '%beta12.html'; 3310.38 ms The nice thing about this is that you can create your query thusly: SELECT * from table WHERE column like 'string' AND fn_strrev(column) LIKE fn_strrev('string') and, if you have both a standard index on column and a functional index on fn_strrev(column), the query will be fast (well, as fast as the pattern in question allows) as long as 'string' is anchored on either end. I've implemented the 'locate' utility in Perl using a PG backend instead of the standard locate database. I internally convert globs given as arguments into LIKE strings, and with a functional index like that the searches are now blazingly fast -- faster than the original 'locate' utility. It has the added advantage that you can specify a file type to further narrow the search (thus 'locate --type file core' will find all regular files named 'core' in the database). I'll be happy to share my code with anyone who's interested. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
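A sketch of the setup being described (fn_strrev is the user-defined string-reversal function from the message; table and index names are illustrative):

```sql
-- Both indexes together let LIKE patterns anchored at either end use an
-- index scan.
CREATE INDEX url_url_idx ON url (url);
CREATE INDEX url_rev_idx ON url (fn_strrev(url));

-- '%beta12.html' reversed is 'lmth.21ateb%', i.e. anchored at the
-- front, so the second condition can use url_rev_idx.
SELECT *
FROM url
WHERE url LIKE '%beta12.html'
  AND fn_strrev(url) LIKE fn_strrev('%beta12.html');
```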
Re: [HACKERS] psql \d option list overloaded
in the know about how to show the available databases, or how to describe a table. sold me on MySQL when I first learned it. For me, it's like: 'dir' in DOS, 'ls' in Unix 'SHOW' in MySQL ??? in PostgreSQL ? We've been over this. It's \d*. For listing databases it's \l. Not exactly consistent with the rest of the related psql commands. Sure, with time as my database needs grew and I matured as a developer, I eventually gained more respect for PostgreSQL and have made the switch even without this feature, but to this day, I really think MySQL *did it right* with those extensions. You can't become a PostgreSQL guru without being a newbie first. I vote we make it easier for newbies. What really frightens me here is that I know of several applications (shudder, LAMP applications) which use the output of show tables or other of your extensions. The problem with this is precisely that it /isn't/ sql, and it can't be supported as a static command. Of course not. But applications which rely on information such as that provided by show tables will typically not be possible to write while adhering to the feature intersection of all major databases anyway. It is intended to be there for people to use interactively. Nonsense. It's there to be used. Whether it's used interactively or not is irrelevant. The command provides useful information. But see below. Making pseudo sql will encourage more developers to (and I'd apologize for this if it weren't true) code in Postgres the same lazy way they code in MySQL. This is a strawman argument, although I understand your concern here. To be honest, for application development I'd much rather see people use information_schema, but that's only because information_schema is in the SQL standard and as such should be the preferred way to retrieve the information that the SHOW commands in MySQL return. 
That said, the inclusion of information_schema is a very recent development on the PostgreSQL side of things, and doesn't even exist on some other major databases such as MSSQL. Of course, a PG equivalent to MySQL's show would be an even more recent development... :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
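For what it's worth, the information_schema equivalent of MySQL's SHOW TABLES is short (a sketch; the schema name is an assumption):

```sql
-- Roughly what MySQL's SHOW TABLES reports, via the SQL-standard route:
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
```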
Re: [HACKERS] pg_restore and create FK without verification check
Tom Lane wrote: Andreas Pflug [EMAIL PROTECTED] writes: In principle you could do this today, but we don't have enough support code in place to make it work smoothly, eg WAL segment files aren't labeled with enough identifying information to let you manage an archive full of 'em. Still it doesn't seem that far away. So I issue CHECKPOINT, and tar the cluster or database. Still, I got two questions: - how to restore a single database You don't. As I said, any physical backup is going to be all-or-nothing. These techniques are not a replacement for pg_dump. But this is just an artifact of the fact that the WAL is a single instance-wide entity, rather than a per-database entity. But since databases are completely separate entities that cannot be simultaneously accessed by any query (corrections welcome), there isn't any reason in principle that the WAL files cannot also be created on a per-database basis. I'm sure, of course, that doing so would bring with it a new set of problems and tradeoffs, so it might not be worth it... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] pg_restore and create FK without verification check
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: Tom Lane wrote: You don't. As I said, any physical backup is going to be all-or-nothing. These techniques are not a replacement for pg_dump. But this is just an artifact of the fact that the WAL is a single instance-wide entity, rather than a per-database entity. But since databases are completely separate entities that cannot be simultaneously accessed by any query (corrections welcome), there isn't any reason in principle that the WAL files cannot also be created on a per-database basis. WAL is not the bottleneck ... as I already mentioned today, pg_clog (and more specifically the meaning of transaction IDs) is what really makes a cluster an indivisible whole at the physical level. If you want to do separate physical dumps/restores, the answer is to set up separate clusters (separate postmasters). Not so hard, is it? Well, aside from the fact that separate clusters have completely separate user databases, listen on different ports, will compete with other clusters on the same system for resources that would be better managed by a single cluster, and generally have to be maintained as completely separate entities from start to finish, no it's not that hard. ;-) The ability to restore a single large database quickly is, I think, a reasonable request, it's just that right now it's difficult (perhaps impossible) to satisfy that request. It's probably something that we'll have to deal with if we want PG to be useful to people managing really large databases on really, really big iron, though. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Release cycle length
Neil Conway wrote: Peter Eisentraut [EMAIL PROTECTED] writes: First, if you develop something today, the first time users would realistically get a hand at it would be January 2005. Do you want that? Don't you want people to use your code? Sure :-) But I don't mind a long release cycle if it is better for users. Given that users can run whatever they like, it's not clear that a long release cycle is better for users. (c) As PostgreSQL gets more mature, putting out stable, production-worthy releases becomes even more important. In theory, longer release cycles contribute to higher quality releases: we have more time to implement new features properly, polish rough edges and document things, test and find bugs, and ensure that features we've implemented earlier in the release cycle are properly thought out, and so forth. On the other hand, the longer you wait to release a new feature, the longer it will be before you get your REAL testing done. You don't want to release something that hasn't at least been looked over and checked out by the development community first, of course, but waiting beyond that point to release a new version of PG doesn't help you that much, because most people aren't going to run the latest CVS version -- they'll run the latest released version, whatever that may be. So the time between the testing phase for the feature you implement and the version release is essentially dead time for testing of that feature, because most developers have moved on to working on and/or testing something else. That's why the release methodology used by the Linux kernel development team is a reasonable one. Because the development releases are still releases, people who wish to be more on the bleeding edge can do so without having to grab the source from CVS and compile it themselves. And package maintainers are more likely to package up the development version if it's given to them in a nice, prepackaged format, even if it's just a source tarball. 
Note that whether or not we are using those 355 days effectively is another story -- it may well be true that there are ways we could make parts of the development process much more efficient. Furthermore, longer release cycles reduce, to some degree, the pain of upgrades. Unless we make substantial improvements to the upgrade story any time soon, I wouldn't be surprised if many DBAs are relieved at only needing to upgrade once a year. But DBAs only need to upgrade as often as they feel like. Any reasonable distribution will give them an option of using either the stable version or the development version anyway, if we're talking about prepackaged versions. The longer you develop, the more parallel efforts are underway, and it becomes impossible to synchronize them to a release date. I think this is inherent to the way PostgreSQL is developed: Tom has previously compared PostgreSQL release scheduling to herding cats :-) As long as much of the work on the project is done by volunteers in their spare time, ISTM that coordinating everyone toward a single release date is going to be difficult, if not impossible. The length of the release cycle doesn't really affect this, IMHO. Linux, too, is done largely by volunteers in their spare time. Yet Linux kernel releases are much more frequent than PostgreSQL releases. One difference is that the Linux community makes a distinction between development releases and stable releases. The amount of time between stable releases is probably about the same as it is for PostgreSQL. The difference is that the *only* releases PostgreSQL makes are stable releases (or release candidates, when a stable release is close). That's something we might want to re-think. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Release cycle length
Matthew T. O'Connor wrote: I agree with Peter's other comment, that the longer the development cycle, the longer the beta / bug shakeout period, perhaps a shorter dev cycle would yield a shorter beta period, but perhaps it would also result in a less solid release. Perhaps. Perhaps not. The fewer the changes, the less complexity you have to manage. But it would certainly result in a smaller set of feature changes per release. Some people might regard that as a good thing. The advantage to doing more frequent releases is that new features end up with more real-world testing within a given block of time, on average, because a lot more people pick up the releases than the CVS snapshots or even release candidates. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Still a few flaws in configure's default CFLAGS selection
Bruce Momjian wrote: Well, we don't want to use debug for non-gcc (no optimization) so do we do -g for gcc, and then --enable-debug does nothing. Seems people can decide for themselves. But doesn't --enable-debug turn off optimization? It's really a question of what the default behavior should be for each option. Clearly for non-gcc compilers, the default should be -O only since they probably can't simultaneously handle -g. But gcc builds are an exception, one which I think is worth considering. Hence my opinion that for gcc builds, the default should be -g and -O. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Still a few flaws in configure's default CFLAGS selection
Bruce Momjian wrote: Peter Eisentraut wrote: Tom Lane writes: What Peter was advocating in that thread was that we enable -g by default *when building with gcc*. I have no problem with that, since there is (allegedly) no performance penalty for -g with gcc. However, the actual present behavior of our configure script is to default to -g for every compiler, and I think that that is a big mistake. On most non-gcc compilers, -g disables optimizations, which is way too high a price to pay for production use. You do realize that as of now, -g is the default for gcc? It is? [EMAIL PROTECTED]:~/tmp$ gcc -c foo.c [EMAIL PROTECTED]:~/tmp$ ls -l foo.o -rw-r--r-- 1 kevin kevin 876 Oct 26 18:52 foo.o [EMAIL PROTECTED]:~/tmp$ gcc -g -c foo.c [EMAIL PROTECTED]:~/tmp$ ls -l foo.o -rw-r--r-- 1 kevin kevin 12984 Oct 26 18:52 foo.o Reading specs from /usr/lib/gcc-lib/i386-linux/3.3/specs Doesn't look like it to me... If -g is the default, it must be very recent, in which case it's obviously not something for our configuration scripts to rely on. Was that the intent? I was going to ask that myself. It seems strange to include -g by default --- we have --enable-debug, and that should control -g on all platforms. I thought --enable-debug had other implications, e.g. enabling assert()s and other such things you might want enabled for debugging but not for production. It certainly makes sense for it to have such semantics even if it doesn't right now. When combined with gcc, -g is, IMO, too useful to eliminate: it makes it possible to get good stacktraces in the face of crashes, and makes it possible to examine variables and such when looking at core files. Also, -g bloats the executable, encouraging people/installers to run strip, which removes all symbols. Without -g and without strip, at least we get function names in the backtrace. This should be up to the individual. I'd argue that disk space is so plentiful and so cheap these days that executable bloat is hardly worth considering. 
But even if it were, a database tends to be so critical to so many things that you probably want to know why and how it crashes more than you would most other things. So even if you might be inclined to strip most of your binaries, you might think twice about doing the same for the PG binaries. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Call for port reports
Bruce Momjian wrote: How does everyone like this patch? It removes -g from non-debug compiles, and changes -O2 to -O for FreeBSD/Alpha. I'd be hesitant to remove -g from non-debug compiles. If something crashes, it's useful to be able to get a good stacktrace from the resulting core file. The -g option makes that possible for optimized code when compiling with gcc. Is there any way we can have configure put -g in when it detects gcc? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Dreaming About Redesigning SQL
Dawn M. Wolthuis wrote: So, nope, I'm not trolling. I've been doing some research the past couple of years and I'm convinced that it is time to do something new (and yet old) with data persistence. Perhaps. But before you go down that road, you have to answer the following simple, yet possibly difficult-to-answer, question: What problem are you trying to solve? "Data persistence" is far too vague a term to be meaningful in its own right -- the term needs some context to have any real meaning here. We store data for a reason. Similarly, we retrieve it for a reason. The data we're interested in looking for and the way we are interested in looking for it will have a huge impact on any data retrieval solution one could craft. The relational model of data storage and retrieval is just a model, highly suitable to some things and not suitable at all to others. The amount of development work that has gone into it and the amount of use it has gotten shows that the relational model is actually reasonably good at meeting many of the data storage and retrieval needs that people have. As with any method, it has tradeoffs and drawbacks. There is no magic bullet and there never will be (or so experience says). I have no reason to believe that the problem of data persistence and retrieval is any exception to that. If you have a particular set of data retrieval problems in mind that you wish to solve, by all means feel free to develop the mathematical foundation to solve them. Feel free to tell us that the relational model is not suitable for that set of problems -- we might even agree with you on that. But don't make the claim that the relational model is lacking as a result of not being a storage and retrieval method that is suitable to all problems, and that there is a Better Way that will Solve Everything. Many have made such claims about many different technologies. They were wrong, too. 
I may be misreading you and responding to arguments you aren't making or implying, but if so I doubt I'm alone, based on some of the other responses I've seen here. By the way, language is only a means of expression, and the only sort of question (relevant to this discussion, anyway) that a language is the answer to is "what's the best way for me to express X?". It is neither an answer to the question of how to retrieve data nor a solution to the problem of storing data in a persistent manner. The answer to the question of how best to query data is certainly going to be a language, but the specific language one comes up with in answer to the question will depend on what the person asking wants. English is likely to be the best answer only under certain circumstances. SQL is likely to be the best answer (or, at least, a very good answer) only under other circumstances. It just depends. But as with any solution to any problem, there is no one-size-fits-all solution. As a mathematician, you should know this: the English language is horribly inadequate for expressing mathematical concepts. That's why mathematicians don't use it as their primary language for doing math. Why, then, should we expect English, or Java, or any other language to be any better for performing certain kinds of queries against data than some other, more directed language? Say what you want about SQL, but at least it was designed with querying table-based data in mind and is at least somewhat good at its job. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] 2-phase commit
Bruce Momjian wrote: Kevin Brown wrote: Actually, all that's really necessary is the ability to call a stored procedure when some event occurs. The stored procedure can take it from there, and since it can be written in C it can do anything the postgres user can do (for good or for ill, of course). But the postmaster doesn't connect to any database, and in a serious failure, might not be able to start one. Ah, true. But I figured that in the context of 2PC and replication that most of the associated failures were likely to occur in an active backend or something equivalent, where a stored procedure was likely to be accessible. But yes, you certainly want to account for failures where the database itself is unavailable. So I guess my original comment isn't strictly true. :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] invalid tid errors in latest 7.3.4 stable.
Stephan Szabo wrote: The case at hand (with *'s on the ri queries), assuming pk already has an id=1 row, would be:
T1: begin;
T1: set transaction isolation level serializable;
T1: ... (something that does a select, not necessarily on either pk or fk)
T2: begin;
T2: insert into fk values (1);
T2*: select * from pk where id=1 for update;
T2: commit;
T1: delete from pk where id=1;
T1*: select * from fk where id=1 for update;
T1: commit;
If you want to treat the serial execution as T1 followed by T2, then T2* would have to show no rows for pk and T2 rolls back. If you want to treat the order as T2,T1, then T1* would have to see the row that T2 inserted and T1 rolls back. Right now you won't get either behavior: you'll get T2* showing 1 row and T1* showing 0 rows. Does it also behave this way *without* any actual foreign key constraints in place? In other words, if you perform the RI queries explicitly? If so, then the problem is with the serialization code. Sounds like that's pretty much what you're saying. The problem in the scenario you described should be solved if we mark any rows that are selected with the for update option (either implicitly, as with RI triggers, or explicitly) as having been modified by the selecting transaction, the equivalent of (in the case of T2*) update pk set id=id where id=1, but without firing any of the ON MODIFY triggers. A rollback would, of course, not have any effect on the data in those rows since there weren't any real changes. This fix won't work, of course, if the serialization code is so broken that it doesn't work properly even in the face of updates (something I'd find hard to believe). -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] invalid tid errors in latest 7.3.4 stable.
Stephan Szabo wrote: The problem in the scenario you described should be solved if we mark any rows that are selected with the for update option (either implicitly, as with RI triggers, or explicitly) as having been modified by the selecting transaction, the equivalent of (in the case of T2*) update pk set id=id where id=1 but without firing any of the ON MODIFY triggers. A rollback would, of course, not have any effect on the data in those rows since there weren't any real changes. This fix won't work, of course, if the serialization code is so broken that it doesn't work properly even in the face of updates (something I'd find hard to believe). That fixes the case above, which will fix the ri constraints for right now (although they really have to stop using for update eventually), but doesn't really solve the serialization problem, since it still exists AFAICS without for update. Without the for update, you still have T2* getting 1 row and T1* getting 0, which can't happen for either ordering of the transactions. It gets worse if that placeholder select at the beginning of T1 was, say, select * from fk where id=1, because SQL tells us that the later select can't see a different set of rows from the earlier one, so T2 shouldn't be allowed to commit before T1. That's what I was afraid of, and what I figured serialization really meant: what you see is a snapshot of the database as it was at transaction start time. I can't think of any good way to implement proper serialization without destroying a serialized transaction's read performance, because it seems to me that the only way to properly implement serialization is to somehow record on-disk all the rows a serializable transaction visits, which means that a serializable transaction is going to be *much* slower than a read-committed transaction. 
You have to mark such rows because other transactions (even read-committed transactions) have to abort if they attempt to modify such a row, and the list of such rows can grow far too large to record it in shared memory. Worse, you have to maintain a dynamic list of serializable transactions that have seen the row and remove a transaction from the list once it commits or rolls back, because the only time a transaction needs to care about this when changing a row is when there's a currently-running transaction that's seen it. We could use the MVCC mechanism to implement it: duplicate the row being examined and assign the reader's transaction ID to the duplicate just as if it had modified the row. But you also have to somehow flag the duplicate as being there as a result of a serializable read, so that other serializable transactions that try to modify the row after the one in question has committed won't themselves throw a serialization error (because without the flag they'd think they were attempting to read a row that had been modified by someone else during their lifetime). The other situation you have to deal with is when you have two transactions, 1 and 2, that start and commit in that order but which have overlapping times of execution. If transaction 1 modifies a row after transaction 2 starts, then commits before transaction 2 reads it, transaction 2 has to be able to detect that and throw a serialization error. The way around that problem is to assign a commit ID to each transaction at commit time. The commit ID is just the transaction ID that will be assigned to the next transaction that runs. It might make sense for assignment of commit IDs to increment the transaction ID counter the way assignment of a transaction ID does. Anyway, if a serializable transaction reads a row that has a commit ID greater than the reader's transaction ID, it throws a serialization error. 
It's probably sufficient to store the commit ID along with the transaction ID of the committer in the transaction log as well as in shared memory, so that the commit ID can be quickly looked up from the transaction ID. Maybe there's a better way around all this, but I certainly can't think of one at the moment. :-( -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] 2-phase commit
Bruce Momjian wrote: Marc G. Fournier wrote: On Sat, 27 Sep 2003, Bruce Momjian wrote: I have been thinking it might be time to start allowing external programs to be called when certain events occur that require administrative attention --- this would be a good case for that. Administrators could configure shell scripts to be run when the network connection fails or servers drop off the network, alerting them to the problem. Throwing things into the server logs isn't _active_ enough. Actually, apparently you can do this now ... there is apparently a mail module for PostgreSQL that you can use to have the database send emails out ... The only part that needs to be added is the ability to call an external program when some event occurs, like a database write failure. Actually, all that's really necessary is the ability to call a stored procedure when some event occurs. The stored procedure can take it from there, and since it can be written in C it can do anything the postgres user can do (for good or for ill, of course). -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] invalid tid errors in latest 7.3.4 stable.
Tom Lane wrote: Stephan Szabo [EMAIL PROTECTED] writes: I think theoretically in serializable the cases where the difference between the snapshot from this statement and the standard snapshot for the transaction are noticable we probably have a serialization failure Hmm, that is a good point. It would be cleaner to throw a can't serialize failure than have the RI triggers run under a different snapshot. I am not sure if we can implement that behavior easily, though. Can you think of a way to detect whether there's an RI conflict against a later-started transaction? Just some thoughts on this that, of course, could be wrong. So please don't be too hard on me if I'm full of it. :-) By a later-started transaction I assume you mean a later-started transaction that commits before yours does? I don't see how RI is any different than dealing with straight SQL in this regard. The effect of RI is to read/write/delete rows from a related table that you otherwise wouldn't read or modify, and that means that the RI mechanism needs to be treated in exactly the same way that the equivalent SELECT/UPDATE/DELETE would be. So the question I have is: what would PG do in the case that you SELECT the same row(s) that the RI triggers are reading implicitly? For instance, suppose we have two tables: CREATE TABLE corps (id integer PRIMARY KEY, name varchar(32)); CREATE TABLE widgets (id integer PRIMARY KEY, name varchar(32), corpid integer REFERENCES corps(id) ON DELETE CASCADE); When, within a transaction, I do: INSERT INTO widgets (id, name, corpid) VALUES (1, 'cpu', 3); the RI mechanism will automatically check to make sure that the value 3 is in the id column of the corps table. Put another way, it will do an implicit SELECT id FROM corps WHERE id = 3, right? 
So suppose that for the purposes of testing the serialization code I remove the RI triggers and then actually do the following: SELECT id FROM corps WHERE id = 3; INSERT INTO widgets (id, name, corpid) VALUES (1, 'cpu', 3); If my transaction is serializable then clearly, when another transaction does UPDATE corps SET id = 4 WHERE id = 3; and commits before my transaction commits, either the updating transaction is in violation of serializability rules or the inserting transaction is. Serialization is maintained if either of those transactions aborts with a serialization error. But note that whether or not RI is involved should be entirely irrelevant. What matters is what rows each transaction sees and modifies. How the row gets looked at doesn't matter; the only thing that matters is that the row *does* get looked at. The important thing here is that the effect of the RI mechanism MUST be the same as if the equivalent manual SQL statements were executed within the same transaction. If it's not, then the RI mechanism is broken and needs to be fixed at that level. But if PG exhibits exactly the same bug this thread refers to regardless of whether a row is examined/modified via directly issued SQL or via the RI mechanism, then the problem lies not within the RI code at all, but within the serialization code. I just hope I'm not merely stating the obvious here... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Killing the backend to cancel a long waiting query
Tom Lane wrote: Paulo Scardine [EMAIL PROTECTED] writes: I trust when you say kill, you really mean send SIGINT ... I'm sending a SIGTERM. Would SIGINT be more appropriate? Yes --- that would actually cancel the query, not cause the backend to shut down. Ahh...this is very useful information. Wouldn't it be useful, though, to implement a KILL or CANCEL SQL command that takes a backend ID as its argument (and, of course, does the appropriate checks of whether you're a superuser or the owner of the backend) and sends the appropriate signal to the target backend? That would make it possible for users to kill their own runaway queries without having to run as whatever user PG is running as. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] New thoughts about indexing cross-type comparisons
Tom Lane wrote: Dave Smith [EMAIL PROTECTED] writes: My point was that it was inconsistent behaviour. What exactly are you comparing with int2? To me the case without the cast should throw the same error as the statement with the cast. select * from test where f=1981928928921; I contend not. The above is perfectly well defined. It will always return false if f is int2, but that does not mean it should throw an error instead. In any standard programming language, you'd resolve the operator by up-converting f to the type of the constant, not by trying to down-convert the wider value. PG happens to have implementation reasons to wish to use the variable's datatype instead of the constant's, but that doesn't excuse us from obeying the ordinary laws of arithmetic. Hmm...but what if the cast were to return NULL in the event that the cast fails or cannot be done? Would that even be reasonable? I don't know what the standard says about this so my suggestion may be unreasonable (and it may break a lot of things as well). In a way, this would be consistent with the meaning of NULL ("no value"), and would also yield the desired effect in the example select (no matches). Of course, I could always be off my rocker here. :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Stats Collector Error 7.4beta1 and 7.4beta2
Tom Lane wrote: [EMAIL PROTECTED] writes: This analysis makes sense - I think using memcmp is clearly wrong here. Yeah, now that I think about it, we're betting on the kernel to faithfully zero all unused bits in addrinfo structures. In an ideal world, all kernels would do that, but in the real world it seems like a losing bet. Yeah, I've always been under the impression that it's a bad idea in general to memcmp() structs, if only because in doing so you make a lot of implicit assumptions about the structs in question that aren't necessarily true, especially when dealing with multiple architectures. Makes me wonder if there are other parts of the code where we're vulnerable to the same sort of issue... I could go for Jan's idea of putting a random key into the messages, if anyone feels that we should not trust to the kernel to enforce the packet source address restriction. But the memcmp() test seems a clear loser given today's discussions. The test in the 7.3.x code looked reasonable to me, especially if it's possible to make it work with IPV6 (if it doesn't already). It's doing basically the right thing, at any rate: directly comparing the actual fields that are relevant. Does this test represent a significant performance hit? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] set constraints docs page
Bruce Momjian wrote: Added to TODO: * Print table names with constraint names in error messages, or make constraint names unique within a schema Should the TODO also include adding ALTER TABLE x ALTER CONSTRAINT y RENAME TO z functionality if we don't make constraint names unique within a schema? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Date input changed in 7.4 ?
Tom Lane wrote: Mendola Gaetano [EMAIL PROTECTED] writes: I noticed that some dates are no longer accepted: test=# select '18/03/71'::date; ERROR: invalid input syntax for date: 18/03/71 Is this the intended behaviour? If it does not match your DateStyle setting, then yes. Umm...I hope this is controllable with a GUC variable then. There are some cases where it's extremely useful for PostgreSQL to accept dates of any format it knows about (ambiguities should be resolved either by looking at the current DateStyle or, failing that, by applying the recognition in a well-defined order). In my case I can probably code around it but it does require some extra effort. But I can easily imagine situations in which that wouldn't be an option. Whatever happened to "be liberal in what you accept and conservative in what you send"? :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] set constraints docs page
Bruce Momjian wrote: Kevin Brown wrote: The two approaches aren't necessarily mutually exclusive (though SQL99 compliance on constraint names would obviously make it unnecessary to specify a tablename along with a constraint name), so I see little problem here. But the current arrangement is obviously untenable, because it allows you to create a situation (multiple constraints by the same name) that you can't reasonably extricate yourself from. Well, it seems if we want to continue to allow the same constraint name to be used by different tables in the same schema, we have to print the tablename in the error message. Would someone actually be looking for a standards-compliant error string? We have already extended the standard --- either we revert that, or we have to go the entire way and print the table name. If PG were configurable in terms of how it manages constraint names, then it would depend on how the DBA had the database configured. With it configured to disallow name collisions, it would obviously be unnecessary to report the table name, though I still think it would be useful (if only because it gives a little extra context to work with). But if it's configured to allow name collisions, then it doesn't make sense not to print the table name in an error message, because that's the only way to guarantee that the DBA can identify which constraint is being referred to. The problem as things stand now is that even if we printed the table name involved, the DBA is placed in a difficult position if the constraint in question isn't uniquely named -- which is the only case where printing the table name would really matter. That's because he can't actually refer to the constraint in any unique way short of playing with the system tables; he'd have to rename the constraint first before being able to really do something with it (is this even possible for him to do without manipulating system tables? Is there an ALTER CONSTRAINT?). 
-- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] set constraints docs page
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: I object to creating gratuitous incompatibilities with the SQL standard, which will obstruct legitimate features down the road. The SQL standard says it is schema.constraint. Is there a case for enforcing uniqueness on constraint names, then? Other than SQL92 says so? Very little. This seems to me to be a design error in the spec. Per-table constraint names are easier to work with --- if they're global across a schema then you have a serious problem avoiding collisions. I assume that SQL99 and later don't specify anything different than what SQL92 calls for in this regard? Without any meaningful guidance from the spec, the best we can do is support per-table constraint names and provide optional (via a GUC variable) support for SQL92-compliant constraint names. Let the DBA decide which (if not both) is best for his situation. Inasmuch as one of our selling points is our compliance with the SQL spec, I see little reason to entirely avoid compliance with the spec on this issue -- just make it possible to do something else when/if necessary. The two approaches aren't necessarily mutually exclusive (though SQL99 compliance on constraint names would obviously make it unnecessary to specify a tablename along with a constraint name), so I see little problem here. But the current arrangement is obviously untenable, because it allows you to create a situation (multiple constraints by the same name) that you can't reasonably extricate yourself from. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] contrib compilation probs
Peter Eisentraut wrote: Tom Lane writes: I was afraid it was something like that. Can we leave the directory structure as-is and just make the .o (and .d) files get built in the upper directory, that is gcc ... -o english_stem.o snowball/english_stem.c That will fail for a more basic reason: not all compilers support the -o option. ...jaw drops to ground... I'm astounded. There are such compilers still in use?? Which ones? -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] reprise on Linux overcommit handling
Bruce Momjian wrote: Thanks. Interesting. Hard to imagine what they were thinking when they put this code in. Way back in the day, when dinosaurs ruled the earth, or at least the server room, many applications were written with rather bad memory allocation semantics: they'd grab a bunch of memory and not necessarily use it for anything. Typically you could specify a maximum memory allocation amount for the program but the problem was that it would grab exactly that amount, and it's obviously better for it to be a bit more dynamic. That in itself isn't a terribly bad thing ... if you have enough actual memory to deal with it. Problem is, back then most systems didn't have enough memory to deal with multiple programs behaving that way. Overcommit was designed to account for that behavior. It's not ideal at all but it's better to have that option than not. Overcommit isn't really necessary today because of the huge amount of memory that you can put into a system for cheap (HP servers excluded, they want some serious cash for memory). -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Two weeks to feature freeze
The Hermit Hacker wrote: On Wed, 25 Jun 2003, Kevin Brown wrote: So...would it make sense to create a gborg project to which people who have written their own test suites can contribute whatever code and data they feel comfortable releasing? As a gborg project, it would be separate from the main PG distribution and would thus have no impact on the build process or anything like that. But at the same time, if there are any ideas on testing that people have had, they could be shared with others through that mechanism. And any tests which prove to be particularly useful could make their way into the PG distribution if people here wish. Of course, like anything else this could be a bad (or perhaps redundant) idea. :-) It doesn't sound like a bad idea ... but, it pretty much comes down to the original thread: are you willing to step up and maintain such a project? Yes, I am ("how hard can it be?", he asks himself, knowing all the while that it's a really bad idea to be asking that question. :-). But I haven't the faintest idea of how or where to even start, so pointers would be appreciated. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Two weeks to feature freeze
Tom Lane wrote: Josh Berkus [EMAIL PROTECTED] writes: 3) Dann is proposing not just a feature but sweeping changes to the way our community works, despite having been a member of this community for about 3 weeks total. In Dann's defense, I didn't think I heard him proposing that we get rid of our existing testing methods, but rather that we see if we can't supplement them with something more formal. This strikes me as a perfectly reasonable proposal. However, he hasn't succeeded in convincing anyone else to put their time into it (for reasons that are also perfectly reasonable, namely that we find that our existing methods do pretty well, and we don't have the manpower to create a large formal testing structure ... even if we thought it would repay the effort, which many of us doubt). So it's his itch to scratch, or not. Unless there's something more profitable to be said on the subject, could we drop the thread? One thing that came out of the thread is the fact that many people who use PostgreSQL do testing in many different ways, and that much of the stability of PostgreSQL can be attributed to that. It occurs to me that there may be a lot of duplication of effort there that could be reduced a bit. So...would it make sense to create a gborg project to which people who have written their own test suites can contribute whatever code and data they feel comfortable releasing? As a gborg project, it would be separate from the main PG distribution and would thus have no impact on the build process or anything like that. But at the same time, if there are any ideas on testing that people have had, they could be shared with others through that mechanism. And any tests which prove to be particularly useful could make their way into the PG distribution if people here wish. Of course, like anything else this could be a bad (or perhaps redundant) idea. :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? 
http://archives.postgresql.org
Re: [HACKERS] Updating psql for features of new FE/BE protocol
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: I think 'is_superuser' is more appropriate. Okay, fine. I forgot one other thing that is available from the recent libpq additions and needs to be exposed by psql: error message verbosity setting. What's there now is described in http://candle.pha.pa.us/main/writings/pgsql/sgml/libpq-control.html to wit, terse, default, and verbose options. We have the choice of exposing this as a backslash command or as a special variable in psql --- any preferences? My preference for such things is to use variables. It seems to me that backslash commands should be reserved for actual actions, e.g. show me the list of tables or import data from stdin, etc. It seems to me that variables are a natural way of representing the state of psql, and that changing that state should be accomplished through the standard mechanisms, i.e. \set. Also, I would like to provide the same set of options w.r.t. messages logged in the server log. Here there is an additional frammish that could be imagined, ie, more detail for more-serious errors. Any opinions about what it should look like? Not sure exactly what you're asking for here. If you're asking what additional detail should be included for more serious errors, I'd say it should be things like the actual text of the query being executed and perhaps the file and line number of the code that threw the error. A stack trace could be useful in the most extreme cases (and, obviously, only when verbosity is maximized), too, but that may be too much to ask for. :-) -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Updating psql for features of new FE/BE protocol
[EMAIL PROTECTED] wrote: [EMAIL PROTECTED] writes: Is it too late to suggest that there be a way to have output displayed on screen AND output to a file? tee perhaps? Tee ALMOST does it. Try doing a \d while tee'ing the output, for example. Try using script (start it from the shell before invoking psql). It sounds like it'll do much of what you're after. Screen also has a logging option which may work just as well, if not better, than script, and has the additional advantage that the session will continue (and can be reattached to) even if your terminal window dies for whatever reason. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Two weeks to feature freeze
Tom Lane wrote: I have been through crash-me in some detail, and it left a very bad taste in my mouth. Don't bother holding it up as an example of good practice. You seem to miss Dann's point. The specific implementation of crashme is undoubtedly flawed in a number of ways, but the idea is very useful as part of an acceptance testing suite. In short, it would probably be beneficial to us to fix crashme so that it tests the proper, standards-compliant things and reports the actual results, and then include it in the test suite. Indeed, we could even go so far as to use it for our own marketing purposes! Have it cite, for each test, which part of the SQL spec it's testing and what the result should be. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Two weeks to feature freeze
I wrote: Tom Lane wrote: I have been through crash-me in some detail, and it left a very bad taste in my mouth. Don't bother holding it up as an example of good practice. You seem to miss Dan's point. The specific implementation of crashme is undoubtedly flawed in a number of ways, but the idea is very useful as part of an acceptance testing suite. In short, it would probably be beneficial to us to fix crashme so that it tests the proper, standards-compliant things and reports the actual results, and then include it in the test suite. Actually, now that I think about it, it would probably be more beneficial to merge any correct tests that we aren't already performing into our existing regression test framework, provided that the end result doesn't take too long to run (as you pointed out elsewhere, regression tests that take a really long time to run simply won't be run by most people, except perhaps in a tinderbox type of environment). Overall, it might be of some benefit to mark individual regression tests with a priority, and then make it possible to run only those tests of a specified priority or higher. That way, the individual developer may decide for himself which group of regression tests to run based on the amount of time he's willing to let it take and how much hardware he has to throw at it. And at the same time, it would make it easier for new tests to be included in the suite without worrying about the impact they would have on people running the tests. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
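The priority idea can be sketched in a few lines. Everything here is illustrative (the Test/run_suite names are not part of any real harness): each test carries a priority tag, and the runner executes only those at or above a chosen threshold, so a developer short on time skips the expensive stress tests.

```python
# Minimal sketch of priority-tagged regression tests. A hypothetical
# harness, not PostgreSQL's actual regression driver.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Test:
    name: str
    priority: int                 # higher = more essential to run
    fn: Callable[[], bool]        # the test body; True means pass

def run_suite(tests: List[Test], min_priority: int) -> List[str]:
    """Run every test whose priority >= min_priority, most important first.
    Returns the names of the tests that were actually run."""
    ran = []
    for t in sorted(tests, key=lambda t: -t.priority):
        if t.priority >= min_priority:
            assert t.fn(), f"test {t.name} failed"
            ran.append(t.name)
    return ran

suite = [
    Test("basic-select", 3, lambda: 1 + 1 == 2),
    Test("join-corner-cases", 2, lambda: True),
    Test("huge-sort-stress", 1, lambda: True),   # slow; for thorough runs only
]

# A quick pre-commit run at priority >= 2 skips the slow stress test:
print(run_suite(suite, 2))
```

New tests could then enter the suite at a low priority without slowing down everyone's default runs.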
Re: [HACKERS] Two weeks to feature freeze
Dann Corbit wrote: -Original Message- From: Jason Earl [mailto:[EMAIL PROTECTED] Sent: Friday, June 20, 2003 10:45 AM To: The Hermit Hacker Cc: Robert Treat; Tom Lane; Christopher Kings-Lynne; Bruce Momjian; PostgreSQL-development Subject: Re: [HACKERS] Two weeks to feature freeze [...] Why couldn't you just release the win32 version of 7.4 when it was finished. If it takes an extra month then that just gives you guys the chance to circulate *two* press releases. The Native Win32 port is likely to make a big enough splash all by itself. A formal release needs a big testing effort. Two separate releases will double the work of validation. That's true in the general case. But in this case we're talking about releasing for a completely new platform. That's much different than doing another release for the same platform set. The testing that needs to be done for the Win32 release has to be done separately *anyway*, so there's nothing lost by releasing the Win32 port separately. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] contrib and licensing
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: But if both of these paragraphs are simultaneously true, then why put *anything* in contrib? Don't say that too loudly, or Marc may take it upon himself to make it happen ;-). Well, I hope he's not so eager to do so that he does it before this licensing thing is hammered out. :-) I think having contrib is a very good idea, but that's only *because* it currently allows non-BSD licenses. Take that away, and I think you've taken away the entire purpose for contrib (versus simply including something in the main source tree with a configure switch to enable/disable it). There are a number of reasons to keep things in contrib. One is that the code may be too tightly tied to backend innards to be appropriate to maintain separately (the GIST extension modules are a good example, and most of the modules that include server-side code are easier to maintain with the server than not). Shouldn't this stuff (if we decide to make contrib BSD-only) become part of the main source tree, then, with a configure option to enable/disable it at compile time? Another is that small modules may not have enough critical mass to get maintained at all, if they're kicked out to live or die on their own. That's certainly a problem, but only if the modules are also tightly tied to the backend. But then, does that mean that anything which is strongly tied to the backend must be included in contrib, no matter how unpopular? Aside from that and licensing, what other criteria would you use to decide on such inclusion? Otherwise, perhaps you're more concerned about the licensing issues in contrib than you need to be? The way I see it, the "only BSD stuff in contrib" rule is designed precisely to save us from having to think too hard about licensing issues. I'm not interested in getting into lawyeristic arguments about how it's okay to distribute something with a different license if only we don't do XYZ with it.
Yeah, but what I'm saying is that *we* don't have to think about this lawyeristic stuff regardless. All we have to care about is whether or not the contrib item in question has a license that allows source distribution. The rest of it is a problem for whoever wants to build binary distributions, and such people *already* know what the issues are and know that they have to think about them. The only case where that might not be true is the case of the individual who is building a binary distribution for use within his own group. But such a person is easily covered by simply disabling compilation of contrib by default. If he has to go to the trouble of enabling that explicitly, then he can be warned that the stuff in contrib might be covered by a different license. But such a person is likely to be in a role where he's had to deal with licensing issues elsewhere, so it's more likely than not that he'll be aware of such things. The only thing he needs to be made aware of is that contrib contains items that fall under different licenses. That alone isn't, IMO, justification for removing non-BSD items from contrib. So in the end, keeping contrib BSD-only doesn't help *us*, it only helps the people who build binary distributions. But because they're already used to dealing with the problem, they don't need our help on this. And that means that kicking non-BSD stuff out of contrib doesn't really help anyone very much, if at all... but it does hurt us in that some potentially very valuable things will no longer be considered for inclusion in the distribution. So from here it looks like there's more (perhaps much more) to be lost by making contrib BSD-only than there is to be gained. It would be one thing if we had a lot of people clamoring for removal of non-BSD stuff from contrib because they'd actually been burned by licensing issues. But I haven't seen anything to that effect on this list, at least, and we've had at least one GPL item in there (pgcrypto) since late 2000.
-- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Deadlock while doing VACUUM??
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: When a heavy INSERT or UPDATE load on a table is occurring (lots of quick INSERTs or UPDATEs within a single transaction), a VACUUM ANALYZE (or just straight VACUUM) has a really good chance (10% or so) of causing either the INSERT/UPDATE or the VACUUM to fail with a "deadlock detected" error. I was unable to replicate this in CVS tip, using pgbench -c 10 -t 1000 as the load generator. I finally got 7.3.2 installed, and confirmed that the problem does not exist on that version. So this is something that's limited to the 7.2.x tree. Which, I guess, means that it's not going to get fixed for that tree (I assume that 7.2.x is effectively end-of-lifed)... On 7.3.2, a concurrent VACUUM appears to slow inserts down A LOT, but it won't deadlock them. -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Detecting corrupted pages earlier
Andrew Sullivan wrote: On Thu, Apr 03, 2003 at 02:39:17PM -0500, Tom Lane wrote: just not listing zero_damaged_pages in postgresql.conf.sample? We already have several variables deliberately not listed there ... Hey, that might be a good solution. Of course, it doesn't solve the doomsday device problem, but nobody who uses it can complain that they didn't know what the thing would do. Shouldn't each variable listed in postgresql.conf.sample have comments right above it explaining what it does anyway? A self-documenting configuration file is a really handy thing to have. If it's documented that way in postgresql.conf.sample (and adjacent to the variable itself, even) then nobody who changed it would have grounds to complain about not knowing what the variable did. I'm much more in favor of being lucid and upfront about everything than hiding things just because they might be dangerous. That said, all sorts of warnings and such should be in that bit of documentation in postgresql.conf.sample, so that it's made abundantly clear that this particular option is not one to be messing with except when you know exactly what you're doing... -- Kevin Brown [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
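A self-documenting entry of the kind proposed might look like the fragment below. The wording is only a sketch of what such a comment block could say; zero_damaged_pages is the real variable under discussion:

```
# zero_damaged_pages - WARNING: DATA-DESTROYING OPTION
#
# When on, a page with a corrupted header is zeroed out instead of the
# command being aborted. This DESTROYS all rows on the damaged page.
# Use only as a last resort to salvage what remains of a corrupted
# table, and turn it back off immediately afterwards.
#
#zero_damaged_pages = false
```

Keeping the warning adjacent to the commented-out default means nobody can flip the switch without reading what it does.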