Re: [HACKERS] WAL SHM principles
On Thu, 8 Mar 2001, Martin Devera wrote:
> > > Bruce Momjian [EMAIL PROTECTED] writes:
> > > Unfortunately, this alone is a *fatal* objection. See nearby discussions about WAL behavior: we must be able to control the relative timing of WAL write/flush and data page writes.
> > Bummer.
> BTW, what means "bummer" ?

It's a Postgres-specific extension to the SQL standard. It means "I am disappointed". As far as I can tell, you _may_ use it as a column or table name. :-)

Tim

--
Tim Allen          [EMAIL PROTECTED]
Proximity Pty Ltd  http://www.proximity.com.au/
http://www4.tpg.com.au/users/rita_tim/

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Performance monitor
On Wed, Mar 07, 2001 at 10:06:38PM -0500, Bruce Momjian wrote:
> > I think Bruce wants per-backend data, and this approach would seem to only get the data for the current backend. Also, I really don't like the proposal to write files to /tmp.
> If we want a perf tool, then we need to have something like 'top', which will continuously update. With 40 backends, the idea of writing 40 files to /tmp every second seems a little excessive to me. My idea was to use 'ps' to gather most of the information, and just use the internal stats when someone clicked on a backend and wanted more information.

Are you sure the 'ps' approach is portable? I don't know what data you want to read from 'ps', but the /proc utilities are very OS-specific; on Linux, for example, libproc has been overhauled several times within a few years. I spent several years working with /proc (a process manager: http://home.zf.jcu.cz/~zakkr/kim).

Karel

--
Karel Zak  [EMAIL PROTECTED]  http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
[HACKERS] Memory management, palloc
Hi guys,

I've been looking through the memory management system today. When a request is made for a memory chunk larger than ALLOC_CHUNK_LIMIT, AllocSetAlloc() uses malloc() to give the request its own block. The result is tested by AllocSetAlloc() to see if the memory was allocated. Irrespective of this, a chunk can be returned which has not had memory allocated to it. There is no testing of the return status of palloc() throughout the code. Was/has this been addressed?

Thanks

Gavin
[HACKERS] Deactivate Rules
Hi!

I am a student of systems engineering and I carried out a project with PostgreSQL in which I had to implement inclusion dependencies using rules. During this work I ran into a problem that I solved by adding to Postgres the ability to disable (deactivate) a rule temporarily. In what follows I detail the problem and define the deactivation. If you are interested in incorporating this concept into the official version, I can send you a diff. Moreover, I suppose this concept may be useful in other contexts, since STARBURST, for instance, has a similar statement to activate/deactivate rules.

With the following tables:

test=# create table table_a(a_cod int primary key,a_desc char);
NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index 'table_a_pkey' for table 'table_a'
CREATE
test=# create table table_b(b_cod int primary key,b_desc char,a_cod int);
NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index 'table_b_pkey' for table 'table_b'
CREATE
test=# insert into table_a values (1,'O');
INSERT 138026 1
test=# insert into table_a values (2,'T');
INSERT 138027 1
test=# insert into table_b values (100,'O',1);
INSERT 138028 1

And the inclusion dependency defined as follows:

    table_b(a_cod) <= table_a(a_cod)

The UPDATE with "restricted" option can be implemented with the rule:

CREATE RULE u_table_b_res AS ON UPDATE TO table_b
  WHERE (OLD.a_cod <> NEW.a_cod OR OLD.a_cod IS NULL)
    AND NEW.a_cod IS NOT NULL
    AND NOT EXISTS (SELECT table_a.a_cod FROM table_a
                    WHERE table_a.a_cod = new.a_cod)
  DO INSTEAD
    SELECT pg_abort_with_msg(new.a_cod || ' NOT EXIST IN TABLE table_a');
-- pg_abort_with_msg(msg) is a function that calls elog(ERROR, msg)

This rule works as expected, but then I define a "cascade" action for table_b when table_a is updated:

CREATE RULE u_table_a_cas AS ON UPDATE TO table_a
  DO update table_b set a_cod = NEW.a_cod where table_b.a_cod = OLD.a_cod;

And I execute:

test=# update table_a set a_cod=100 where a_cod=1;
ERROR:  100 NOT EXIST IN TABLE table_a
This result is not the expected one. It happens because of a characteristic of the rewriting system: the query tree of the rule u_table_b_res is executed first, and therefore the execution fails.

To solve this problem I added to the grammar the statements DEACTIVATE RULE rulename and REACTIVATE RULE rulename. DEACTIVATE RULE allows me to disable the rule u_table_b_res and thus avoid the interference. REACTIVATE RULE turns the rule active again. These new statements never reach the executor. DEACTIVATE only prevents the triggering of the rule during the rewriting process (i.e. its action is not included in the query tree to be executed), and it only affects the current session. The rule remains disabled only during the rewriting phase of the original statement (the UPDATE of table_a in this example), from the point where the DEACTIVATE is detected (in fireRules, in rewriteHandler.c) until a REACTIVATE is found or the end of the rule is reached. With the new statement the rule would be:

CREATE RULE u_table_a_cas AS ON UPDATE TO table_a
  DO (
    deactivate rule u_table_b_res;
    update table_b set a_cod = NEW.a_cod where table_b.a_cod = OLD.a_cod;
  )

It is necessary to keep in mind that a rule should only be disabled when its action is no longer necessary, as in this case. A rule cannot be disabled indiscriminately: it is only possible to disable a rule if the user who created the rule (in whose action the DEACTIVATE is executed) has RULE permission on the table owning the rule to be disabled. In the previous case, if 'userA' is the user who creates the rule 'u_table_a_cas', 'userA' must have RULE permission on 'table_a' and also on 'table_b' (which owns the rule 'u_table_b_res').

That is all. Thanks

Sergio.
Re: [HACKERS] Proposed WAL changes
What do we debate? I never said that we shouldn't worry about WAL's current inability to restart. And WAL already works correctly in the situation of "failing to write a couple of disk blocks when the system crashes". My statement in the first place, that "WAL can't help in the event of disk errors", was to remind us that we should think over how much we are going to guarantee, and by what means, in the event that our *base* requirements are not met (guaranteed -:)).

My POV is that keeping two checkpoints increases disk space requirements for *everyday usage* while buying near nothing, because data consistency cannot be guaranteed anyway. On the other hand, this is the fastest way to implement WAL restart-ability - *which is the real problem we have to fix*.

Vadim
Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space
> Was the following bug already fixed ?

I was going to ask the same question. I see that seek+write was changed to writes in XLogFileInit (that was prompted by the subject bug, right?), but what about the problem itself?

DEBUG:  redo starts at (0, 21075520)
The Data Base System is starting up
DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
                     ^^^
Redo started in seg 1 and shouldn't try to read seg 0...

BTW, were performance tests run after the seek+write --> writes change? Writes were not obviously faster to me, which is why I used seek+write, but I never tested that area -:(

Vadim
Re: [HACKERS] Memory management, palloc
On Thu, Mar 08, 2001 at 10:28:50PM +1100, Gavin Sherry wrote:
> Hi guys, I've been looking through the memory management system today. When a request is made for a memory chunk larger than ALLOC_CHUNK_LIMIT, AllocSetAlloc() uses malloc() to give the request its own block. The result is tested by AllocSetAlloc() to see if the memory was allocated. Irrespective of this, a chunk can be returned which has not had memory allocated to it. There is no testing of the return status of palloc() throughout the code.

I don't understand. If memory cannot be obtained, AllocSetAlloc() finishes with elog(ERROR). There is no way for it to return without sufficient space. Or is there?

Karel

--
Karel Zak  [EMAIL PROTECTED]  http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
[HACKERS] Depending on system install scripts (was Re: [BUGS] COBOL)
Jarom Hagen [EMAIL PROTECTED] writes:
> Yes, we have an evil COBOL compiler from MicroFocus that put that install script there. I was really confused why postgres wanted a COBOL system. :-)

I've suggested a couple of times that since we include install-sh in our distro anyway, it's pointless and unnecessarily risky to go looking for a platform-supplied install program. However, I could never quite get anyone else to see the reasoning. Now that I have this sterling example to point to, I'm going to start rattling the cage again. Why don't we get rid of the configure-time search for 'install', and just always use our own script?

regards, tom lane

> On Wed, Mar 07, 2001 at 07:38:30PM -0500, Tom Lane wrote:
> > Jarom Hagen [EMAIL PROTECTED] writes:
> > > /usr/local/bin/install -c -m 555 postgres /usr/local/pgsql/bin/postgres
> > > You must have a COBOL system present to install this product
> > Weird. It looks like you have some exceedingly nonstandard program in /usr/local/bin/install --- certainly not what configure thought that that program would do, anyway. Do you know where that program came from (perhaps a Sun COBOL package)? A nondestructive workaround would be to hand-edit src/Makefile.global's INSTALL variable to refer to our install-sh script (also in src/) rather than /usr/local/bin/install. However, that install is going to bite a lot of other open-source packages that expect to find a standard-ish install script available, so I'd suggest deleting or at least renaming it...
Re: [HACKERS] Memory management, palloc
Karel,

On Thu, 8 Mar 2001, Karel Zak wrote:
> > [palloc question snipped]
> I don't understand. If memory cannot be obtained, AllocSetAlloc() finishes with elog(ERROR). There is no way for it to return without sufficient space. Or is there?

Ahh. Of course. My mistake =)

Gavin
Re: [HACKERS] Memory management, palloc
Gavin Sherry [EMAIL PROTECTED] writes:
> I've been looking through the memory management system today. When a request is made for a memory chunk larger than ALLOC_CHUNK_LIMIT, AllocSetAlloc() uses malloc() to give the request its own block. The result is tested by AllocSetAlloc() to see if the memory was allocated. Irrespective of this, a chunk can be returned which has not had memory allocated to it. There is no testing of the return status of palloc() throughout the code.

What's your point? palloc() does not have the same specification as malloc(). It guarantees to return allocated memory, or elog trying.

regards, tom lane
Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space
Hiroshi Inoue [EMAIL PROTECTED] writes:
> Was the following bug already fixed ?

Dunno. I've changed the WAL ReadRecord code so that it fails soft (no Asserts or elog(STOP)s) for all failure cases, so the particular crash mode exhibited here should be gone. But I'm not sure why the code appears to be trying to open the wrong log segment, as Vadim comments. That bug might still be there. Need to try to reproduce the problem with new code.

regards, tom lane
Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space
"Vadim Mikheev" [EMAIL PROTECTED] writes: I see that seek+write was changed to write-s in XLogFileInit (that was induced by subj, right?), but what about problem itself? BTW, were performance tests run after seek+write -- write-s change? That change was for safety, not for performance. It might be a performance win on systems that support fdatasync properly (because it lets us use fdatasync), otherwise it's probably not a performance win. But we need it regardless --- if you didn't want a fully-allocated WAL file, why'd you bother with the original seek-and-write-1-byte code? regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] WAL SHM principles
> > > Bruce Momjian [EMAIL PROTECTED] writes:
> > > > The only problem is that we would no longer have control over which pages made it to disk. The OS would perhaps write pages as we modified them. Not sure how important that is.
> > > Unfortunately, this alone is a *fatal* objection. See nearby discussions about WAL behavior: we must be able to control the relative timing of WAL write/flush and data page writes.
> > Bummer.
> BTW, what means "bummer" ?

Sorry, it means, "Oh, I am disappointed."

> But for many OSes you CAN control when to write data - you can mlock individual pages.

mlock() controls locking in physical memory. I don't see it controlling write().

--
Bruce Momjian                   | http://candle.pha.pa.us
[EMAIL PROTECTED]               | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup.    | Drexel Hill, Pennsylvania 19026
Re: [HACKERS] Proposed WAL changes
"Mikheev, Vadim" [EMAIL PROTECTED] writes: And how well will that approach work if the last checkpoint record got written near the start of a log segment file, and then the checkpointer discarded all your prior log segments because "you don't need those anymore"? If the checkpoint record gets corrupted, you have no readable log at all. The question - why should we have it? It is assumed that data files are flushed before checkpoint appears in log. If this assumtion is wrong due to *bogus* fsync/disk/whatever why should we increase disk space requirements which will affect *good* systems too? What will we buy with extra logs? Just some data we can't guarantee consistency anyway? It seems that you want guarantee more than me, Tom -:) No, but I want a system that's not brittle. You seem to be content to design a system that is reliable as long as the WAL log is OK but loses the entire database unrecoverably as soon as one bit goes bad in the log. I'd like a slightly softer failure mode. WAL logs *will* go bad (even without system crashes; what of unrecoverable disk read errors?) and we ought to be able to deal with that with some degree of grace. Yes, we lost our guarantee of consistency. That doesn't mean we should not do the best we can with what we've got left. BTW, in some my tests size of on-line logs was ~ 200Mb with default checkpoint interval. So, it's worth to care about on-line logs size. Okay, but to me that suggests we need a smarter log management strategy, not a management strategy that throws away data we might wish we still had (for manual analysis if nothing else). Perhaps the checkpoint creation rule should be "every M seconds *or* every N megabytes of log, whichever comes first". It'd be fairly easy to signal the postmaster to start up a new checkpoint process when XLogWrite rolls over to a new log segment, if the last checkpoint was further back than some number of segments. Comments? Please convince me that NEXTXID is necessary. 
Why add anything that is not useful? I'm not convinced that it's not necessary. In particular, consider the case where we are trying to recover from a crash using an on-line checkpoint as our last readable WAL entry. In the pre-NEXTXID code, this checkpoint would contain the current XID counter and an advanced-beyond-current OID counter. I think both of those numbers should be advanced beyond current, so that there's some safety margin against reusing XIDs/OIDs that were allocated by now-lost XLOG entries. The OID code is doing this right, but the XID code wasn't. Again, it's a question of brittleness. Yes, as long as everything operates as designed and the WAL log never drops a bit, we don't need it. But I want a safety margin for when things aren't perfect. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
RE: [HACKERS] WAL does not recover gracefully from out-of-disk-space
> > Was the following bug already fixed ?
> Dunno. I've changed the WAL ReadRecord code so that it fails soft (no Asserts or elog(STOP)s) for all failure cases, so the particular crash mode exhibited here should be gone. But I'm not sure why the code appears to be trying to open the wrong log segment, as Vadim comments. That bug might still be there. Need to try to reproduce the problem with new code.

Did you try to start up with wal-debug?

Vadim
RE: [HACKERS] WAL does not recover gracefully from out-of-disk-space
> > I see that seek+write was changed to writes in XLogFileInit (that was prompted by the subject bug, right?), but what about the problem itself? BTW, were performance tests run after the seek+write --> writes change?
> That change was for safety, not for performance. It might be a performance win on systems that support fdatasync properly (because it lets us use fdatasync), otherwise it's probably not a performance win.

Even with a true fdatasync it's not obviously good for performance - it takes too long to write 16MB files, and it fills the OS buffer cache with trash -:( Probably we need a separate process like Oracle's LGWR (log writer). I also like Andreas' idea about re-using log files.

> But we need it regardless --- if you didn't want a fully-allocated WAL file, why'd you bother with the original seek-and-write-1-byte code?

I considered this mostly as a hint to the OS about how the log file should be allocated (to decrease fragmentation). Not sure how OSes use such hints, but seek+write costs nothing.

Vadim
[HACKERS] Checkpoint process signal handling seems wrong
I am currently looking at a frozen system: a backend crashed during XLOG write (which I was deliberately provoking, via running it out of disk space), and now the postmaster is unable to recover because it's waiting around for a checkpoint process that it had launched milliseconds before the crash. The checkpoint process, unfortunately, is not going to quit anytime soon because it's hung up trying to get a spinlock that the crashing backend left locked.

Eventually the checkpoint process will time out the spinlock and abort (but please note that this is true only because I insisted --- Vadim wanted to have infinite timeouts on the WAL spinlocks. I think this is good evidence that that's a bad idea).

However, while sitting here looking at it I can't help wondering whether the checkpoint process shouldn't have responded to the SIGTERM that the postmaster sent it when the other backend crashed. Is it really such a good idea for the checkpoint process to ignore SIGTERM?

While we're at it: is it really such a good idea to use elog(STOP) all over the place in the WAL stuff? If XLogFileInit had chosen to exit with elog(FATAL), then we would have released the spinlock on the way out of the failing backend, and the checkpointer wouldn't be stuck.

regards, tom lane
Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space
"Mikheev, Vadim" [EMAIL PROTECTED] writes: But we need it regardless --- if you didn't want a fully-allocated WAL file, why'd you bother with the original seek-and-write-1-byte code? I considered this mostly as hint for OS about how log file should be allocated (to decrease fragmentation). Not sure how OSes use such hints but seek+write costs nothing. Doing a seek to a large value and doing a write is not a hint to a Unix system that you are going to write a large sequential file. If anything, it's a hint that you are going to write a sparse file. A Unix kernel will optimize by not allocating blocks you aren't going to write to. Ian ---(end of broadcast)--- TIP 97: Oh this age! How tasteless and ill-bred it is. -- Gaius Valerius Catullus ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] Query not using index, please explain.
Greetings,

I have a real simple table with a timestamp field. The timestamp field has an index on it. But the index does not seem to be taken into account for selects that return rows:

pglog=# explain select time_stamp from history_entries where time_stamp < '03-01-2000';
NOTICE:  QUERY PLAN:
Index Scan using hist_entries_timestamp on history_entries (cost=0.00..12810.36 rows=3246 width=8)
EXPLAIN
pglog=# explain select time_stamp from history_entries where time_stamp < '04-01-2000';
NOTICE:  QUERY PLAN:
Seq Scan on history_entries (cost=0.00..160289.71 rows=138215 width=8)
EXPLAIN
pglog=# set enable_seqscan to off;
SET VARIABLE
pglog=# explain select time_stamp from history_entries where time_stamp < '04-01-2000';
NOTICE:  QUERY PLAN:
Index Scan using hist_entries_timestamp on history_entries (cost=0.00..368241.51 rows=138215 width=8)
EXPLAIN
pglog=# set enable_seqscan to on;
SET VARIABLE
pglog=#

The query with time_stamp < '03-01-2000' does not return any rows; the '04-01-2000' date does return rows. When I disable seqscan the query is almost instant, but with it on, it takes about 3 or 4 minutes. Why can't the query planner use the index in the latter case?

Thanks, Matthew
RE: [HACKERS] Proposed WAL changes
> > It seems that you want to guarantee more than me, Tom -:)
> No, but I want a system that's not brittle. You seem to be content to design a system that is reliable as long as the WAL log is OK but loses the entire database unrecoverably as soon as one bit goes bad in the log.

I don't see how the absence of the old checkpoint forces losing the entire database. You will probably get better consistency by re-applying modifications which are supposed to be in the data files already, but it seems questionable to me.

> > BTW, in some of my tests the size of the on-line logs was ~ 200MB with the default checkpoint interval. So, it's worth caring about on-line log size.
> Okay, but to me that suggests we need a smarter log management strategy, not a management strategy that throws away data we might wish we still had (for manual analysis if nothing else).

This is what should be covered by archiving log files. Unimplemented -:(

> Perhaps the checkpoint creation rule should be "every M seconds *or* every N megabytes of log, whichever comes first".

I like this! Regardless of the usability of keeping the older checkpoint (especially in the future, with log archiving), your rule is worthwhile in any case. (Nevertheless, keeping two checkpoints still increases disk requirements -:) But it seems I have to waive my objection if I didn't convince you - it's really the simplest way to get WAL restart-ability, and I personally have no ability to implement log scanning now.)

> > Please convince me that NEXTXID is necessary. Why add anything that is not useful?
> I'm not convinced that it's not necessary. In particular, consider the case where we are trying to recover from a crash using an on-line checkpoint as our last readable WAL entry. In the pre-NEXTXID code, this checkpoint would contain the current XID counter and an advanced-beyond-current OID counter. I think both of those numbers should be advanced beyond current, so that there's some safety margin against reusing XIDs/OIDs that were allocated by now-lost XLOG entries. The OID code is doing this right, but the XID code wasn't. Again, it's a question of brittleness. Yes, as long as everything operates as designed and the WAL log never drops a bit, we don't need it. But I want a safety margin for when things aren't perfect.

Once again - my point is that in the event of a lost log record one shouldn't try to use the existing database but just dump it, etc. - i.e. there is no reason to keep info about allocated XIDs. But keeping NEXTXID costs nothing, at least -:)

Vadim
RE: [HACKERS] Checkpoint process signal handling seems wrong
> Eventually the checkpoint process will time out the spinlock and abort (but please note that this is true only because I insisted --- Vadim wanted to have infinite timeouts on the WAL spinlocks. I think this is good evidence that that's a bad idea).

I disagree - this is evidence of a bug in the implementation -:) The timeout takes too long, so it's not a solution. Years ago we used timeouts for deadlock detection; spin timeouts are yesterday's technique too -:)

> However, while sitting here looking at it I can't help wondering whether the checkpoint process shouldn't have responded to the SIGTERM that the postmaster sent it when the other backend crashed. Is it really such a good idea for the checkpoint process to ignore SIGTERM?

Seems not; SIGTERM --> elog(STOP) should be OK here.

> While we're at it: is it really such a good idea to use elog(STOP) all over the place in the WAL stuff? If XLogFileInit had chosen

I just hadn't time to consider each particular case. It's better to restart than to break WAL rules (only bogus disks are allowed to do that -:)).

> to exit with elog(FATAL), then we would have released the spinlock on the way out of the failing backend, and the checkpointer wouldn't be stuck.

I didn't use elog(FATAL) exactly because it releases spinlocks! Who knows what, and how much, another backend will have time to do if we release spinlocks when things are bad. Each particular case must be carefully considered. Example: one backend fails to insert a log record for a modified data page and releases that page's write spinlock; a concurrent checkpoint maker takes the read spinlock on that buffer and writes it to disk before the postmaster kills it...

Vadim
Re: [HACKERS] Proposed WAL changes
"Mikheev, Vadim" [EMAIL PROTECTED] writes: No, but I want a system that's not brittle. You seem to be content to design a system that is reliable as long as the WAL log is OK but loses the entire database unrecoverably as soon as one bit goes bad in the log. I don't see how absence of old checkpoint forces losing entire database. As the code stood last week that's what would happen, because the system would not restart unless pg_control pointed to a valid checkpoint record. I addressed that in a way that seemed good to me. Now, from what you've said in this conversation you would rather have the system scan XLOG to decide where to replay from if it cannot read the last checkpoint record. That would be okay with me, but even with that approach I do not think it's safe to truncate the log to nothing as soon as we've written a checkpoint record. I want to see a reasonable amount of log data there at all times. I don't insist that "reasonable amount" necessarily means "back to the prior checkpoint" --- but that's a simple and easy-to-implement interpretation. You probably will get better consistency by re-applying modifications which supposed to be in data files already but it seems questionable to me. It's not a guarantee, no, but it gives you a better probability of recovering recent changes when things are hosed. BTW, can we really trust checkpoint to mean that all data file changes are down on disk? I see that the actual implementation of checkpoint is write out all dirty shmem buffers; sync(); if (IsUnderPostmaster) sleep(2); sync(); write checkpoint record to XLOG; fsync XLOG; Now HP's man page for sync() says The writing, although scheduled, is not necessarily complete upon return from sync. I can assure you that 2 seconds is nowhere near enough to ensure that a sync is complete on my workstation... and I doubt that "scheduled" means "guaranteed to complete before any subsequently-requested I/O is done". 
I think it's entirely possible that the checkpoint record will hit the disk before the last heap buffer does. Therefore, even without considering disk drive write reordering, I do not believe that a checkpoint guarantees very much, and so I think it's pretty foolish to delete the preceding XLOG data immediately afterwards. Perhaps the checkpoint creation rule should be "every M seconds *or* every N megabytes of log, whichever comes first". I like this! Regardless usability of keeping older checkpoint (especially in future, with log archiving) your rule is worth in any case. Okay, I'll see if I can do something with this idea. Other than what we've discussed, do you have any comments/objections to my proposed patch? I've been holding off committing it so that you have time to review it... regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Query not using index, please explain.
Richard, Thanks for the response, I guess I should have included a little more information. The table contains 3.5 million rows. The indexes were created after the data was imported into the table and I had just run vacuum and vacuum analyze on the database before trying the queries and sending this question to hackers. When I turned the seqscan variable off and ran the query with the '04-01-2000' date the results were literally instantaneous. Turn the seqscan back on and it takes right around 3 minutes. Also, the query for any date older than the '04-01-2000' returns zero rows. The actual number of rows for the '04-01-2000' select is right around 8300. Here is the table for more information: pglog=# \d history_entries Table "history_entries" Attribute |Type | Modifier +-+-- domain | varchar(80) | time_stamp | timestamp | response | integer | transfered | integer | reqtime| integer | entry | text| Indices: hist_entries_domain, hist_entries_timestamp I'm also having problems with this query: select domain from history_entries group by domain; To me, since there is an index on domain, it seems like this should be a rather fast thing to do? It takes a *very* long time, no matter if I turn seqscan on or off. pglog=# select version(); version - PostgreSQL 7.0.3 on i386-unknown-freebsdelf3.4, compiled by gcc 2.7.2.3 (1 row) Thanks, Matthew At 07:18 PM 3/8/2001 +, you wrote: On Thu, Mar 08, 2001 at 01:49:42PM -0500, Matthew Hagerty wrote: Greetings, I have a real simple table with a timestamp field. The timestamp field has an index on it. 
But, the index does not seem to be taken into account for selects that return rows:

pglog=# explain select time_stamp from history_entries where time_stamp < '03-01-2000';
NOTICE: QUERY PLAN:
Index Scan using hist_entries_timestamp on history_entries (cost=0.00..12810.36 rows=3246 width=8)
EXPLAIN
pglog=# explain select time_stamp from history_entries where time_stamp < '04-01-2000';
NOTICE: QUERY PLAN:
Seq Scan on history_entries (cost=0.00..160289.71 rows=138215 width=8)
EXPLAIN
pglog=# set enable_seqscan to off;
SET VARIABLE
pglog=# explain select time_stamp from history_entries where time_stamp < '04-01-2000';
NOTICE: QUERY PLAN:
Index Scan using hist_entries_timestamp on history_entries (cost=0.00..368241.51 rows=138215 width=8)
EXPLAIN
pglog=# set enable_seqscan to on;
SET VARIABLE
pglog=#

The query where time_stamp < '03-01-2000' does not return any rows; the 04-01-2000 date does return rows. When I disable seqscan the query is almost instant, but with it on, it takes about 3 or 4 minutes. Why can't the query planner use the index in the latter case? Well, it can, it just chooses not to. Your second EXPLAIN shows that it thinks it's going to get 138215 rows from that select; it then calculates that it would be more expensive to use the index than simply to scan the table. Presumably it actually returns many fewer rows than that. Have you done a VACUUM ANALYZE recently? If you get plans this badly wrong immediately after a VACUUM ANALYZE, *then*'s the time to ask -hackers about it (FAQ item 4.9). Richard ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
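The planner's decision above is just a comparison of estimated costs, and disabling seqscans works by making the sequential scan look absurdly expensive rather than by forbidding it outright. A toy sketch of that choice (the penalty constant and function are illustrative, not PostgreSQL's internals):

```python
def choose_plan(seqscan_cost, indexscan_cost, enable_seqscan=True):
    """Pick whichever plan has the lower estimated cost."""
    if not enable_seqscan:
        # "Disabling" a plan type just inflates its estimated cost,
        # so it loses every comparison but remains a last resort.
        seqscan_cost += 1.0e8
    return "indexscan" if indexscan_cost < seqscan_cost else "seqscan"
```

With the EXPLAIN numbers in the session above, choose_plan(160289.71, 368241.51) picks the seqscan; with enable_seqscan=False the index scan wins, which matches the plans shown.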
RE: [HACKERS] Proposed WAL changes
Other than what we've discussed, do you have any comments/objections to my proposed patch? I've been holding off committing it so that you have time to review it... Sorry - I'm heading to meet Marc, Thomas & Geoff right now, will try to comment asap. Vadim ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space
"Mikheev, Vadim" [EMAIL PROTECTED] writes: Even with true fdatasync it's not obviously good for performance - it takes too long time to write 16Mb files and fills OS buffer cache with trash-:( True. But at least the write is (hopefully) being done at a non-performance-critical time. Probably, we need in separate process like LGWR (log writer) in Oracle. I think the create-ahead feature in the checkpoint maker should be on by default. But we need it regardless --- if you didn't want a fully-allocated WAL file, why'd you bother with the original seek-and-write-1-byte code? I considered this mostly as hint for OS about how log file should be allocated (to decrease fragmentation). Not sure how OSes use such hints but seek+write costs nothing. AFAIK, extant Unixes will not regard this as a hint at all; they'll think it is a great opportunity to not store zeroes :-(. One reason that I like logfile fill to be done separately is that it's easier to convince ourselves that failure (due to out of disk space) need not require elog(STOP) than if we have the same failure during XLogWrite. You are right that we don't have time to consider each STOP in the WAL code, but I think we should at least look at that case... regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Checkpoint process signal handling seems wrong
"Mikheev, Vadim" [EMAIL PROTECTED] writes: However, while sitting here looking at it I can't help wondering whether the checkpoint process shouldn't have responded to the SIGTERM that the postmaster sent it when the other backend crashed. Is it really such a good idea for the checkpoint process to ignore SIGTERM? Seems not, SIGTERM -- elog(STOP) should be Ok here. Yes, after further thought this seems not only desirable but *necessary*. Else the checkpoint maker might be writing bad data from corrupted shmem structures, which is exactly what the system-wide restart mechanism is supposed to prevent. I'll fix the checkpoint process to accept SIGTERM and SIGUSR1 (but not SIGINT) from the postmaster. While we're at it: is it really such a good idea to use elog(STOP) all over the place in the WAL stuff? If XLogFileInit had chosen I just hadn't time to consider each particular case. Okay. You're right, that probably needs case-by-case thought that we haven't time for right now. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Query not using index, please explain.
Matthew Hagerty [EMAIL PROTECTED] writes: The query where time_stamp < '03-01-2000' does not return any rows, the 04-01-2000 date does return rows. When I disable seqscan the query is almost instant, but with it on, it takes about 3 or 4 minutes. Why can't the query planner use the index in the latter case? It *can* (and did, in two of the three examples you gave). It just doesn't think the indexscan is faster --- note the cost estimates. Evidently the cost estimates are way off, probably because the estimated number of selected rows is way off. Have you done a VACUUM ANALYZE lately? Not that that will help if the distribution of timestamps is highly irregular :-(. See the many past discussions of these issues. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Query not using index, please explain.
On Thu, Mar 08, 2001 at 02:43:54PM -0500, Matthew Hagerty wrote: Richard, Thanks for the response, I guess I should have included a little more information. The table contains 3.5 million rows. The indexes were created after the data was imported into the table and I had just run vacuum and vacuum analyze on the database before trying the queries and sending this question to hackers. When I turned the seqscan variable off and ran the query with the '04-01-2000' date the results were literally instantaneous. Turn the seqscan back on and it takes right around 3 minutes. Also, the query for any date older than the '04-01-2000' returns zero rows. The actual number of rows for the '04-01-2000' select is right around 8300. This is where you need an expert. :) But I'll have a go and someone will correct me if I'm wrong... The statistics which are kept aren't fine-grained enough to be right here. All the optimiser knows are the highest and lowest values of the attribute, the most common value (not really useful here), the number of nulls in the column, and the "dispersion" (a sort of handwavy measure of how bunched-together the values are). So in a case like this, where effectively the values are all different over a certain range, all it can do is (more or less) linearly interpolate in the range to guess how many tuples are going to be returned. Which means it's liable to be completely wrong if your values aren't evenly distributed over their whole range, which it seems they aren't. It thinks you're going to hit around 1/28 of the tuples in this table, presumably because '04/01/2000' is about 1/28 of the way from your minimum value to your maximum. This sort of thing will all become much better one fine day when we have much better statistics available, and so many of us want such things that that fine day will surely come. Until then, I think you're best off turning off seqscans from your client code when you know they'll be wrong. 
(That's what we do here in several similar cases). Can someone who really knows this stuff (Tom?) step in if what I've just said is completely wrong? select domain from history_entries group by domain; To me, since there is an index on domain, it seems like this should be a rather fast thing to do? It takes a *very* long time, no matter if I turn seqscan on or off. The reason this is slow is that Postgres always has to look at heap tuples, even when it's been sent there by indexes. This in turn is because of the way the storage manager works (only by looking in the heap can you tell for sure whether a tuple is valid for the current transaction). So a "group by" always has to look at every heap tuple (that hasn't been eliminated by a where clause). "select distinct" has the same problem. I don't think there's a way to do what you want here with your existing schema without a sequential scan over the table. Richard ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
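Richard's linear-interpolation description can be made concrete: with only the column's min, max, and an assumption of even distribution, the estimate for `col < bound` is just the fraction of the range below the bound times the row count. A sketch (treating timestamps as day numbers; the specific values are illustrative, not the actual optimizer code):

```python
def estimate_rows(col_min, col_max, bound, ntuples):
    """Linearly interpolate the selectivity of `col < bound` given only
    the column's min and max -- roughly what a min/max-only statistics
    model can do."""
    if bound <= col_min:
        return 0
    if bound >= col_max:
        return ntuples
    selectivity = (bound - col_min) / (col_max - col_min)
    return round(selectivity * ntuples)

# If '04-01-2000' sits about 1/28 of the way through the timestamp
# range, the estimate for a 3.5-million-row table is ~125,000 rows,
# even though only ~8,300 actually qualify -- hence the bad plan.
```

The estimate is off by a factor of fifteen precisely because the data is bunched at one end of the range, which this model cannot see.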
[HACKERS] Use SIGQUIT instead of SIGUSR1?
To implement the idea of performing a checkpoint after every so many XLOG megabytes (as well as after every so many seconds), I need to pick an additional signal number for the postmaster to accept. Seems like the most appropriate choice for this is SIGUSR1, which isn't currently being used at the postmaster level. However, if I just do that, then SIGUSR1 and SIGQUIT will have completely different meanings for the postmaster and for the backends, in fact SIGQUIT to the postmaster means send SIGUSR1 to the backends. This seems hopelessly confusing. I think it'd be a good idea to change the code so that SIGQUIT is the per-backend quickdie() signal, not SIGUSR1, to bring the postmaster and backend signals back into some semblance of agreement. For the moment we could leave the backends also accepting SIGUSR1 as quickdie, just in case someone out there is in the habit of sending that signal manually to individual backends. Eventually backend SIGUSR1 might be reassigned to mean something else. (I suspect Bruce is coveting it already ;-).) Any objections? regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
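The proposed compatibility shim — quickdie on SIGQUIT, with SIGUSR1 kept as an alias for now — is easy to picture as signal wiring. A Python sketch; the handler body is illustrative (a real backend's quickdie() would report the error and exit immediately, not record the signal):

```python
import signal

received = []

def quickdie(signum, frame):
    # A real quickdie() exits without cleanup; here we only record
    # that the signal arrived, so the wiring can be demonstrated.
    received.append(signum)

# SIGQUIT becomes the per-backend quickdie signal...
signal.signal(signal.SIGQUIT, quickdie)
# ...while SIGUSR1 stays accepted too, in case anyone is in the habit
# of sending it manually to individual backends.
signal.signal(signal.SIGUSR1, quickdie)
```

Both routes reach the same handler, which is exactly the transition state the proposal describes before SIGUSR1 is eventually reassigned.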
Re: [HACKERS] Use SIGQUIT instead of SIGUSR1?
On Thu, Mar 08, 2001 at 04:06:16PM -0500, Tom Lane wrote: To implement the idea of performing a checkpoint after every so many XLOG megabytes (as well as after every so many seconds), I need to pick an additional signal number for the postmaster to accept. Seems like the most appropriate choice for this is SIGUSR1, which isn't currently being used at the postmaster level. However, if I just do that, then SIGUSR1 and SIGQUIT will have completely different meanings for the postmaster and for the backends, in fact SIGQUIT to the postmaster means send SIGUSR1 to the backends. This seems hopelessly confusing. I think it'd be a good idea to change the code so that SIGQUIT is the per-backend quickdie() signal, not SIGUSR1, to bring the postmaster and backend signals back into some semblance of agreement. For the moment we could leave the backends also accepting SIGUSR1 as quickdie, just in case someone out there is in the habit of sending that signal manually to individual backends. Eventually backend SIGUSR1 might be reassigned to mean something else. (I suspect Bruce is coveting it already ;-).) The number and variety of signals used in PG is already terrifying. Attaching a specific meaning to SIGQUIT may be dangerous if the OS and its daemons also send SIGQUIT to mean something subtly different. I'd rather see a reduction in the use of signals, and a movement toward more modern, better behaved interprocess communication mechanisms. Still, "if it were done when 'tis done, then 'twere well It were done" cleanly. -- Nathan Myers [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
[HACKERS] Is INSERT FROM considered a transaction?
Greetings, Sorry about all the posts lately, but things seem to be running *really* slow on my database. I have two tables; both are identical, and one is used to hold entries older than a certain date, i.e. the history table. I use this query to move the old records from one to the other. In this case, is each insert part of a big transaction that commits when it is done, or is each insert its own transaction? Is there anything I can do to make this faster? On average the entries table has about 50,000 records and the history_entries table has about 3.5 million. insert into history_entries select * from entries where domain='somevalue' and time_stamp between 'date1' and 'date2' Thanks, Matthew ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
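To the transaction question: an INSERT ... SELECT is a single SQL statement, and a single statement executes atomically — either every qualifying row is copied or none are; there is no per-row commit. A small illustration using Python's stdlib sqlite3 as a stand-in for any SQL database (schema abbreviated from the one in the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (domain TEXT, time_stamp TEXT, entry TEXT);
    CREATE TABLE history_entries (domain TEXT, time_stamp TEXT, entry TEXT);
    INSERT INTO entries VALUES
        ('somevalue', '2000-03-01', 'a'),
        ('somevalue', '2000-03-02', 'b'),
        ('other',     '2000-03-01', 'c');
""")

# One statement, one atomic unit: all matching rows are inserted,
# or (on error) none of them are.
with conn:  # commits on success, rolls back on exception
    conn.execute("""
        INSERT INTO history_entries
        SELECT * FROM entries
        WHERE domain = 'somevalue'
          AND time_stamp BETWEEN '2000-03-01' AND '2000-03-02'
    """)

moved = conn.execute("SELECT COUNT(*) FROM history_entries").fetchone()[0]
```

Note that INSERT ... SELECT only copies; a true "move" also needs a DELETE on the source table, and wrapping both in one explicit transaction keeps the pair atomic as well.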
Re: [HACKERS] Performance monitor
Tom Lane writes: How many of our supported platforms actually have working ps-status code? (This is an honest question: I don't know.) BeOS, DG/UX, and Cygwin don't have support code, the rest *should* work. Seems we will find out when people complain my performance monitor doesn't show the proper columns. :-) -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Performance monitor
I don't believe that UnixWare will take the PS change without having ROOT. LER Original Message On 3/8/01, 3:54:31 PM, Peter Eisentraut [EMAIL PROTECTED] wrote regarding Re: [HACKERS] Performance monitor : Tom Lane writes: How many of our supported platforms actually have working ps-status code? (This is an honest question: I don't know.) BeOS, DG/UX, and Cygwin don't have support code, the rest *should* work. -- Peter Eisentraut [EMAIL PROTECTED] http://yi.org/peter-e/ ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Use SIGQUIT instead of SIGUSR1?
Tom Lane writes: I think it'd be a good idea to change the code so that SIGQUIT is the per-backend quickdie() signal, not SIGUSR1, to bring the postmaster and backend signals back into some semblance of agreement. I think we agreed on this already when someone wanted to use a signal for synchronizing "near-committers". Still seems like a good idea. -- Peter Eisentraut [EMAIL PROTECTED] http://yi.org/peter-e/ ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
[HACKERS] Internationalized error messages
I really feel that translated error messages need to happen soon. Managing translated message catalogs can be done easily with available APIs. However, translatable messages really require an error code mechanism (otherwise it's completely impossible for programs to interpret error messages reliably). I've been thinking about this for much too long now and today I finally settled on the simplest possible solution. Let the actual method of allocating error codes be irrelevant for now, although the ones in the SQL standard are certainly to be considered for a start. Essentially, instead of writing elog(ERROR, "disaster struck"); you'd write elog(ERROR, "XYZ01", "disaster struck"); Now you'll notice that this approach doesn't make the error message text functionally dependent on the error code. The alternative would have been to write elog(ERROR, "XYZ01"); which makes the code much less clear. Additionally, most of the elog() calls use printf style variable argument lists. So maybe elog(ERROR, "XYZ01", (arg + 1), foo); This is not only totally obscure, but also incredibly cumbersome to maintain and very error prone. One earlier idea was to make the "XYZ01" thing a macro instead that expands to a string with % arguments, that GCC can check as it does now. But I don't consider this a lot better, because the initial coding is still obscured, and additionally the list of those macros needs to be maintained. (The actual error codes might still be provided as readable macro names similar to the errno codes, but I'm not sure if we should share these between server and client.) Finally, there might also be legitimate reasons to have different error message texts for the same error code. For example, "type errors" (don't know if this is an official code) can occur in a number of places that might warrant different explanations. Indeed, this approach would preserve "artistic freedom" to some extent while still maintaining some structure alongside.
And it would be rather straightforward to implement, too. Those who are too bored to assign error codes to new code can simply pick some "zero" code as default. On the protocol front, this could be pretty easy to do. Instead of "message text" we'd send a string "XYZ01: message text". Worst case, we pass this unfiltered to the client and provide an extra function that returns only the first five characters. Alternatively we could strip off the prefix when returning the message text only. At the end, the i18n part would actually be pretty easy, e.g., elog(ERROR, "XYZ01", gettext("stuff happened")); Comments? Better ideas? -- Peter Eisentraut [EMAIL PROTECTED] http://yi.org/peter-e/ ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
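The proposed wire format — "XYZ01: message text", plus a client helper that peels off the five-character code — is trivial to mock up. A sketch; the function names here are invented for illustration and are not part of any actual PostgreSQL API:

```python
def format_elog(code, text):
    """Server side: prefix the (possibly translated) message text with
    the five-character error code, as in the proposed protocol."""
    return f"{code}: {text}"

def error_code(message):
    """Client side: return only the first five characters (the code)."""
    return message[:5]

def error_text(message):
    """Client side: strip the code prefix, leaving the message text."""
    return message.split(": ", 1)[1]
```

With this scheme, format_elog("XYZ01", "disaster struck") yields "XYZ01: disaster struck", and a client can branch on error_code() — e.g. to distinguish a unique-index violation from other failures — without parsing the human-readable text.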
Re: Depending on system install scripts (was Re: [BUGS] COBOL)
Tom Lane writes: I've suggested a couple of times that since we include install-sh in our distro anyway, it's pointless and unnecessarily risky to go looking for a platform-supplied install program. However, I could never quite get anyone else to see the reasoning. Now that I have this sterling example to point to, I'm going to start rattling the cage again. Why don't we get rid of the configure-time search for 'install', and just always use our own script? I've sent this to the Autoconf list for some comment, but in general I agree with you. -- Peter Eisentraut [EMAIL PROTECTED] http://yi.org/peter-e/ ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: [HACKERS] Internationalized error messages
Peter Eisentraut [EMAIL PROTECTED] writes: Let the actual method of allocating error codes be irrelevant for now, although the ones in the SQL standard are certainly to be considered for a start. Essentially, instead of writing elog(ERROR, "disaster struck"); you'd write elog(ERROR, "XYZ01", "disaster struck"); I like this approach. One of the nice things about Oracle is that they have an error manual. All Oracle errors have an associated number. You can look up that number in the error manual to find a paragraph giving details and workarounds. Admittedly, sometimes the further details are not helpful, but sometimes they are. The basic idea of being able to look up an error lets programmers balance the need for a terse error message with the need for a fuller explanation. Ian ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Internationalized error messages
I like this approach. One of the nice things about Oracle is that they have an error manual. All Oracle errors have an associated number. You can look up that number in the error manual to find a paragraph giving details and workarounds. Admittedly, sometimes the further details are not helpful, but sometimes they are. The basic idea of being able to look up an error lets programmers balance the need for a terse error message with the need for a fuller explanation. One example of when you need an exact error message code is when you want to separate unique index violations from other errors. This is often needed when you want to just do the insert and leave all constraint checking to the database... -- Sincerely Yours, Denis Perchine -- E-Mail: [EMAIL PROTECTED] HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 -- ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: [HACKERS] How to handle waitingForLock in LockWaitCancel()
Tom Lane wrote: Hiroshi Inoue [EMAIL PROTECTED] writes: Is it OK to commit the change before 7.1 release ? I want to do it before forgetting this issue. If that fixes the problem for you, then commit it. I was waiting to hear back whether you still saw a crash or not... I see no crash in my test case. I would commit the change. regards, Hiroshi Inoue ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster