[HACKERS] Farewell
Title: Farewell It's timefor formalacknowledgement that I'm not in The Project any more. I'm not interested in small features/fixes and have no time for big ones. It was this way for very long time and I don't see how/when that could change. My participation in The Project was one of the greatest adventures in my life. Thanks to everyone! Good luck on your ways. And - long live to Postgres!!! Vadim
Re: [HACKERS] WAL replay logic (was Re: [PERFORM] Mount options f
So if you do this, do you still need to store that information in pg_control at all? Yes: to speeds up the recovery process. If it's going to slow down the performance of my database when not doing recovery (because I have to write two files for every transaction, rather than one) Control file is not updated for every transaction, only on a few special events like checkpoint. Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Vacation
I'll be on vacation from 12/27/02 till 01/20/03. Vadim ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] PostgreSQL-R
It seems that locking tuples via LockTable at Phase 1 is not required anymore, right? We haven't put those hooks in yet, so the current version is master/slave. So, you are not going to use any LockTable in Phase 1 on master right now but you still need some LockTable in Phase 3 on slaves. Are you going to use pg lock manager table in Phase 3? Shouldn't ordering in Phase 3 be implemented using special LockTable, totally separated from pg lock manager? (if it's right that Phase 1 doesn't require Phase 3 LockTable at all.) Also it seems that we could perform Phases 2 3 periodically during transaction execution. This would make WS smaller and conflicts between long running transactions from different sites would be resoved faster. And it would increase commit chances for long running transactions: due to async notification to other nodes about changes made by transaction, short transactions may have noticeably higher chances to commit ... and abort conflicting long transactions. Seems like a good idea to me, but we won't know for sure until we implement the multi-master hooks. Is it about periodic Phases 2 3 or about using Phase 3' LockTable in Phase 1? The first one definitely can wait but the second one should be resolved before merging pg-r code with main CVS, imo. Vadim ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] [BUGS] Bug #613: Sequence values fall back to previously chec
| Placing a restriction on an application that says it must treat the values | returned from a sequence as if they might not be committed is absurd. | | Why? The fact that you are not able to rollback sequences does not | necessary mean that you are not required to perform commit to ensure | permanent storage of changes made to database. I use sequences to generate message identifiers for a simple external-to-database message passing system. I also use them for file upload identifiers. In both cases, if the external action (message or file upload) succeeds, I commit; otherwise I roll-back. I assume that the datbase won't give me a duplicate sequence... otherwise I'd have to find some So can you do select nextval() in *separate* (committed) transaction *before* external action and real transaction where you store information (with sequence number) about external action in database? BEGIN; SELECT NEXTVAL(); COMMIT; BEGIN; -- Do external actions and store info in DB -- COMMIT/ROLLBACK; Is this totally unacceptable? Is it really *required* to call nextval() in *the same* transaction where you store info in DB? Why? other way go get sequences or I'd have duplicate messages or non-unique file identifiers. With these changes is this assumption no longer valid? If 1. It's not valid to assume that sequences will not return duplicate numbers if there was no commit after nextval. 2. It doesn't matter when sequence numbers are stored in database objects only. 3. But if you're going to use sequence numbers in external objects you must (pre)fetch those numbers in separate committed transaction. (Can we have this in FAQ?) so, this change will break alot of user programs. | And why? Just for convenience of 1% applications which need | to use sequences in their own, non-database, external objects? I think you may be underestimating the amount of external resources which may be associated with a datbase object. Regardless, may of the database features in PostgreSQL are there for 1% or less of the user base... Please note that I was talking about some *inconvenience*, not about *inability* of using sequence numbers externally (seems my words were too short). Above is how to do this. And though I agreed that it's not very convenient/handy/cosy to *take care* and fetch numbers in separate committed transaction, but it's required only in those special cases and I think it's better than do fsync() per each nextval() call what would affect other users/applications where storing sequence numbers outside of database is not required. Vadim ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Spinlock performance improvement proposal
I have committed changes to implement this proposal. I'm not seeing any significant performance difference on pgbench on my single-CPU system ... but pgbench is I/O bound anyway on this hardware, so that's not very surprising. I'll be interested to see what other people observe. (Tatsuo, care to rerun that 1000-client test?) What is your system? CPU, memory, IDE/SCSI, OS? Scaling factor and # of clients? BTW1 - shouldn't we rewrite pgbench to use threads instead of libpq async queries? At least as option. I'd say that with 1000 clients current pgbench implementation is very poor. BTW2 - shouldn't we learn if there are really portability/performance issues in using POSIX mutex-es (and cond. variables) in place of TAS (and SysV semaphores)? Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Status of index location patch
Just wondering what is the status of this patch. Is seems from comments that people like the idea. I have also looked in the archives for other people looking for this kind of feature and have found alot of interest. If you think it is a good idea for 7.2, let me know what needs to be changed and I will work on it this weekend. Just change index' dir naming as was already discussed. Vadim ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Status of index location patch
Can you explain how I would get the tblNode for an existing database index files if it doesn't have the same OID as the database entry in pg_databases. Well, keeping in mind future tablespace implementation I would add tblNode to pg_class and in pg_databases I'd have defaultTblNode and indexTblNode. If it's too late to do for 7.2 then let's wait till 7.3. Vadim ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
RE: [HACKERS] User locks code
Well, ability to lock only unlocked rows in select for update is useful, of course. But uniq features of user'locks are: 1. They don't interfere with normal locks hold by session/transaction. 2. Share lock is available. 3. User can lock *and unlock objects* inside transaction, which is not (and will not be) available with locks held by transactions. They are interesting too and proposed implementation will not impact lock manager (just additional 4 bytes in LOCKTAG = same size of LOCKTAG on machines with 8 bytes alignment). An interesting method would be to allow users to simply avoid locked rows: SELECT * FROM queue FOR UPDATE LIMIT 1 UNLOCKED; Unlocked, return immediately, whatever could be used as a keyword to avoid rows that are locked (skipping over them). For update locks the row of course. Currently for the above type of thing I issue an ORDER BY random() which avoids common rows enough, the queue agent dies if queries start taking too long (showing it's waiting for other things) and tosses up new copies if it goes a while without waiting at all (showing increased load). -- Rod Taylor This message represents the official view of the voices in my head - Original Message - From: Mikheev, Vadim [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, August 17, 2001 2:48 PM Subject: [HACKERS] User locks code 1. Just noted this in contrib/userlock/README.user_locks: User locks, by Massimo Dal Zotto [EMAIL PROTECTED] Copyright (C) 1999, Massimo Dal Zotto [EMAIL PROTECTED] This software is distributed under the GNU General Public License either version 2, or (at your option) any later version. Well, anyone can put code into contrib with whatever license he/she want but user locks package includes interface functions in contrib *and* changes in our lock manager, ie changes in backend code. I wonder if backend' part of package is covered by the same license above? And is it good if yes? 2. Not good implementation, imho. It's too complex (separate lock method table, etc). Much cleaner would be implement this feature the same way as transactions wait other transaction commit/abort: by locking objects in pseudo table. We could get rid of offnum and lockmethod from LOCKTAG and add struct Oid RelId; Oid ObjId; } userObjId; to objId union of LOCKTAG. This way user could lock whatever object he/she want in specified table and note that we would be able to use table access rights to control if user allowed to lock objects in table - missed in 1. One could object that 1. is good because user locks never wait. I argue that never waiting for lock is same bad as always waiting. Someday we'll have time-wait etc features for general lock method and everybody will be happy -:) Comments? Vadim P.S. I could add 2. very fast, no matter if we'll keep 1. or not. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
[HACKERS] Re: Strangeness in xid allocation / snapshot setup
1. Tx Old is running. 2. Tx S reads new transaction ID in GetSnapshotData() and swapped away before SInval acquired. 3. Tx New gets new transaction ID, makes changes and commits. 4. Tx Old changes some row R changed by Tx New and commits. 5. Tx S gets snapshot data and now sees R changed by *both* Tx Old and Tx New *but* does not see *other* changes made by Tx New = Tx S reads unconsistent data. Hmm, but that doesn't seem to have anything to do with the way that GetSnapshotData operates. If Tx New has an XID = xmax read by Tx S' GetSnapshotData, then Tx New will be considered uncommitted by S no matter which order we get the locks in; it hardly matters whether Tx New You forget about Tx Old! The point is that changes made by Tx Old *over* Tx New' changes effectively make those Tx New' changes *visible* to Tx S! And this is not good: Tx New inserts PK and corresponding FK and commits; Tx Old changes some field in row with that FK and commits - now Tx S will see FK row *but not PK one* (and what if Tx S was serializable Tx run by pd_dump...) SInval lock prevents Tx Old from commit (xact.c:CommitTransaction()) in points 2. - 4. above and so Tx Old' changes will not be visible to Tx S. manages to physically commit before we finish building the snapshot for S. On the other side of the coin, if Tx New's XID xmax for S, then *with the GetNewTransactionId change that I want* we can be sure that Tx New will be seen running by S when it does get the SInval lock (unless New has managed to finish before S gets the lock, in which case it's perfectly reasonable for S to treat it as committed or aborted). And this is how it worked (MyProc-xid was updated while holding XXXGenLockId) in varsup.c from version 1.21 (Jun 1999) till version 1.36 (Mar 2001) when you occasionally moved it outside of locked code part: http://www.ca.postgresql.org/cgi/cvsweb.cgi/pgsql/src/backend/access/transam /varsup.c.diff?r1=1.35r2=1.36 Anyway, it seems to me that the possibility of inconsistent data is inherent in the way we handle updated rows in Read Committed mode --- you can always get to see a row that was emitted by a transaction you don't see the other effects of. If I correctly understand meaning of emitted then sentence above is not correct: set of rows to be updated can only be shortened by concurrent transactions. Yes, changes can be made over changes from concurrent transactions but only for rows from original set defined by query snapshot and only if concurrently updated rows (from that set) satisfy query qual = a row must satisfy snapshot *and* query qual = double satisfaction guaranteed -:)) And let's remember that this behaviour is required for current RI constraints implementation. Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Now it's my turn...
Baby girl on Jun 27. Vadim ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Plans for solving the VACUUM problem
Yes, that is a good description. And old version is only required in the following two cases: 1. the txn that modified this tuple is still open (reader in default committed read) 2. reader is in serializable transaction isolation and has earlier xtid Seems overwrite smgr has mainly advantages in terms of speed for operations other than rollback. ... And rollback is required for 5% transactions ... Vadim ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: AW: [HACKERS] Plans for solving the VACUUM problem
You mean it is restored in session that is running the transaction ? Depends on what you mean with restored. It first reads the heap page, sees that it needs an older version and thus reads it from the rollback segment. So are whole pages stored in rollback segments or just the modified data? This is implementation dependent. Storing whole pages is much easy to do, but obviously it's better to store just modified data. Vadim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Plans for solving the VACUUM problem
Hm. On the other hand, relying on WAL for undo means you cannot drop old WAL segments that contain records for any open transaction. We've already seen several complaints that the WAL logs grow unmanageably huge when there is a long-running transaction, and I think we'll see a lot more. It would be nicer if we could drop WAL records after a checkpoint or two, even in the presence of long-running transactions. We could do that if we were only relying on them for crash recovery and not for UNDO. As you understand this is old, well-known problem in database practice, described in books. Two ways - either abort too long running transactions or (/and) compact old log segments: fetch and save (to use for undo) records of long-running transactions and remove other records. Neither way is perfect but nothing is perfect at all -:) 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can make lightweight VACUUM work well. (I definitely don't like the idea Sorry, but I'm going to consider background vacuum as temporary solution only. As I've already pointed, original PG authors finally became disillusioned with the same approach. What is good in using UNDO for 1. is the fact that WAL records give you *direct* physical access to changes which should be rolled back. that after a very long transaction fails and aborts, I'd have to wait another very long time for UNDO to do its thing before I could get on with my work. Would much rather have the space reclamation happen in background.) Understandable, but why other transactions should read dirty data again and again waiting for background vacuum? I think aborted transaction should take some responsibility for mess made by them -:) And keeping in mind 2. very long transactions could be continued -:) 2. SAVEPOINTs would be awfully nice to have, I agree. 3. Reusing xact IDs would be nice, but there's an answer with a lot less impact on the system: go to 8-byte xact IDs. Having to shut down the postmaster when you approach the 4Gb transaction mark isn't going to impress people who want a 24x7 commitment, anyway. +8 bytes in tuple header is not so tiny thing. 4. Recycling pg_log would be nice too, but we've already discussed other hacks that might allow pg_log to be kept finite without depending on UNDO (or requiring postmaster restarts, IIRC). We did... and didn't get agreement. I'm sort of thinking that undoing back to a savepoint is the only real usefulness of WAL-based UNDO. Is it practical to preserve the WAL log just back to the last savepoint in each xact, not the whole xact? No, it's not. It's not possible in overwriting systems at all - all transaction records are required. Another thought: do we need WAL UNDO at all to implement savepoints? Is there some way we could do them like nested transactions, wherein each savepoint-to-savepoint segment is given its own transaction number? Committing multiple xact IDs at once might be a little tricky, but it seems like a narrow, soluble problem. Implicit savepoints wouldn't be possible - this is very convenient feature I've found in Oracle. And additional code in tqual.c wouldn't be good addition. Implementing UNDO without creating lots of performance issues looks a lot harder. What *performance* issues?! The only issue is additional disk requirements. Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Plans for solving the VACUUM problem
1. Space reclamation via UNDO doesn't excite me a whole lot, if we can make lightweight VACUUM work well. Sorry, but I'm going to consider background vacuum as temporary solution only. As I've already pointed, original PG authors finally became disillusioned with the same approach. How could they become disillusioned with it, when they never tried it? I know of no evidence that any version of PG has had backgroundable (non-blocking-to-other-transactions) VACUUM, still less within-relation space recycling. They may have become disillusioned with the form of VACUUM that they actually had (ie, the same one we've inherited) --- but please don't call that the same approach I'm proposing. Pre-Postgres'95 (original) versions had vacuum daemon running in background. I don't know if that vacuum shrinked relations or not (there was no shrinking in '95 version), I know that daemon had to do some extra work in moving old tuples to archival storage, but anyway as you can read in old papers in the case of consistent heavy load daemon was not able to cleanup storage fast enough. And the reason is obvious - no matter how optimized your daemon will be (in regard to blocking other transactions etc), it will have to perform huge amount of IO just to find space available for reclaiming. Certainly, doing VACUUM this way is an experiment that may fail, or may require further work before it really works well. But I'd appreciate it if you wouldn't prejudge the results of the experiment. Why not, Tom? Why shouldn't I say my opinion? Last summer your comment about WAL, may experiment that time, was that it will save just a few fsyncs. It was your right to make prejudment, what's wrong with my rights? And you appealed to old papers as well, BTW. Understandable, but why other transactions should read dirty data again and again waiting for background vacuum? I think aborted transaction should take some responsibility for mess made by them -:) They might read it again and again before the failed xact gets around to removing the data, too. You cannot rely on UNDO for correctness; at most it can be a speed/space optimization. I see no reason to assume that it's a more effective optimization than a background vacuum process. Really?! Once again: WAL records give you *physical* address of tuples (both heap and index ones!) to be removed and size of log to read records from is not comparable with size of data files. Another thought: do we need WAL UNDO at all to implement savepoints? Is there some way we could do them like nested transactions, wherein each savepoint-to-savepoint segment is given its own transaction number? Implicit savepoints wouldn't be possible - this is very convenient feature I've found in Oracle. Why not? Seems to me that establishing implicit savepoints is just a user-interface issue; you can do it, or not do it, regardless of the underlying mechanism. Implicit savepoints are setted by server automatically before each query execution - you wouldn't use transaction IDs for this. Implementing UNDO without creating lots of performance issues looks a lot harder. What *performance* issues?! The only issue is additional disk requirements. Not so. UNDO does failed-transaction cleanup work in the interactive backends, where it necessarily delays clients who might otherwise be issuing their next command. A VACUUM-based approach does the cleanup work in the background. Same work, more or less, but it's not in the clients' critical path. Not same work but much more and in the critical pathes of all clients. And - is overall performance of Oracle or Informix worse then in PG? Seems delays in clients for rollback doesn't affect performance so much. But dirty storage does it. BTW, UNDO for failed transactions alone will not eliminate the need for VACUUM. Will you also make successful transactions go back and physically remove the tuples they deleted? They can't do this, as you know pretty well. But using WAL to get TIDs to be deleted is considerable, no? Vadim ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: [HACKERS] Plans for solving the VACUUM problem
Were you going to use WAL to get free space from old copies too? Considerable approach. Vadim, I think I am missing something. You mentioned UNDO would be used for these cases and I don't understand the purpose of adding what would seem to be a pretty complex capability: Yeh, we already won title of most advanced among simple databases, -:) Yes, looking in list of IDs assigned to single transaction in tqual.c is much easy to do than UNDO. As well as couple of fsyncs is easy than WAL. 1. Reclaim space allocated by aborted transactions. Is there really a lot to be saved here vs. old tuples of committed transactions? Are you able to protect COPY FROM from abort/crash? Vadim ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Plans for solving the VACUUM problem
Really?! Once again: WAL records give you *physical* address of tuples (both heap and index ones!) to be removed and size of log to read records from is not comparable with size of data files. You sure? With our current approach of dumping data pages into the WAL on first change since checkpoint (and doing so again after each checkpoint) it's not too difficult to devise scenarios where the WAL log is *larger* than the affected datafiles ... and can't be truncated until someone commits. Yes, but note mine size of log to read records from - each log record has pointer to previous record made by same transaction: rollback must not read entire log file to get all records of specific transaction. Why not? Seems to me that establishing implicit savepoints is just a user-interface issue; you can do it, or not do it, regardless of the underlying mechanism. Implicit savepoints are setted by server automatically before each query execution - you wouldn't use transaction IDs for this. If the user asked you to, I don't see why not. Example of one of implicit savepoint usage: skipping duplicate key insertion. Using transaction IDs when someone want to insert a few thousand records? Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Plans for solving the VACUUM problem
If its an experiment, shouldn't it be done outside of the main source tree, with adequate testing in a high load situation, with a patch released to the community for further testing/comments, before it is added to the source tree? From reading Vadim's comment above (re: pre-Postgres95), this daemonized approach would cause a high I/O load on the server in a situation where there are *alot* of UPDATE/DELETEs happening to the database, which should be easily recreatable, no? Or, Vadim, am I misundertanding? It probably will not cause more IO than vacuum does right now. But unfortunately it will not reduce that IO. Cleanup work will be spreaded in time and users will not experience long lockouts but average impact on overall system throughput will be same (or maybe higher). My point is that we'll need in dynamic cleanup anyway and UNDO is what should be implemented for dynamic cleanup of aborted changes. Plus UNDO gives us natural implementation of savepoints and some abilities in transaction IDs management, which we may use or not (though, 4. - pg_log size management - is really good thing). Vadim ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Plans for solving the VACUUM problem
Seriously, I don't think that my proposed changes need be treated with quite that much suspicion. The only part that is really intrusive is Agreed. I fight for UNDO, not against background vacuum -:) the shared-memory free-heap-space-management change. But AFAICT that will be a necessary component of *any* approach to getting rid of VACUUM. We've been arguing here, in essence, about whether a background or on-line approach to finding free space will be more useful; but that still leaves you with the question of what you do with the free space after you've found it. Without some kind of shared free space map, there's not anything you can do except have the process that found the space do tuple moving and file truncation --- ie, VACUUM. So even if I'm quite wrong about the effectiveness of a background VACUUM, the FSM code will still be needed: an UNDO-style approach is also going to need an FSM to do anything with the free space it finds. It's equally clear Unfortunately, I think that we'll need in on-disk FSM and that FSM is actually the most complex thing to do in space reclamation project. Besides which, Vadim has already said that he won't have time to do anything about space reclamation before 7.2. So even if background vacuum does end up getting superseded by something better, we're going to need it for a release or two ... Yes. Vadim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] 7.1 startup recovery failure
There's a report of startup recovery failure in Japan. Redo done but ... Unfortunately I have no time today. Please ask to start up with wal_debug = 1... Vadim ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Stuck spins in current
BTW, I've got ~320tps with 50 clients inserting (int4, text[1-256]) records into 50 tables (-B 16384, wal_buffers = 256) on Ultra10 with 512Mb RAM, IDE (clients run on the same host as server). Not bad. What were you getting before these recent changes? As I already reported - with O_DSYNC this test shows 30% better performance than with fsync. (BTW, seems in all my tests I was using -O0 flag...) Vadim ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] BufferSync() FlushRelationBuffers() conflict
Just committed changes in bufmgr.c Regress tests passed but need more specific tests, as usually. Descr as in CVS: Check bufHdr-cntxDirty and call StartBufferIO in BufferSync() *before* acquiring shlock on buffer context. This way we should be protected against conflicts with FlushRelationBuffers. (Seems we never do excl lock and then StartBufferIO for the same buffer, so there should be no deadlock here, - but we'd better check this very soon). Vadim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] BufferSync() FlushRelationBuffers() conflict
Tom, since you appear to be able to recreate the bug, can you comment on this, as to whether we are okay now? Sorry for the delay --- I was down in Norfolk all day, and am just now catching up on email. I will pull Vadim's update and run the test some more. However, last night I only saw the failure once in about an hour's worth of testing, so it's not that easy to reproduce anyway... I saw ~ 10 failures with -B 32 in ~ 3 minutes of testing. With old code, of course -:) Vadim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Stuck spins in current
At this point I must humbly say "yes, you told me so", because if I No, I didn't - I must humbly say that I didn't foresee this deadlock, so "I didn't tell you so" -:) Anyway, deadlock in my tests are very correlated with new log file creation - something probably is still wrong... Vadim ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: [HACKERS] RE: xlog loose ends, continued
Anyway I like idea of StartUpID in page headers - this will help Can you please describe StartUpID for me ? Ideal would be a stamp that has the last (smallest open) XID, or something else that has more or less timestamp characteristics (without the actual need of wallclock) in regard to the WAL. StartUpID counts database startups and so has timestamp characteristics. Actually, idea is to use SUI in future to allow reusing XIDs after startup: seeing old SUI in data pages we'll know that all transaction on this page was committed "long ago" (ie visible from MVCC POV). This requires UNDO, of course. This could then also be used to scan all pages for modification since last backup, to make incremental backups possible. (note, that incremental backup is not WAL backup) We can scan log itself to get all pages modified since last backup or whatever point we want - thanks to your idea about data pages backup. Vadim ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] RE: xlog loose ends, continued
StartUpID counts database startups and so has timestamp characteristics. Actually, idea is to use SUI in future to allow reusing XIDs after startup: seeing old SUI in data pages we'll know that all transaction on this page was committed "long ago" (ie visible from MVCC POV). This requires UNDO, of course. First thanx for the description, but db startup would only count to 5-7 per year :-), is that sufficient ? It hardly sounds like anything useful to include in page header. It will be sufficient if DB will not use all 2^32 XIDs without shutdown. Removing pg_log *segments* for old XIDs is another story. What about the xlog id, that is also used for xlog file name, but I still think a xid would be the best candidate. logid would be ok too, xid is not - we have to shorten xids lifetime in near future. Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: AW: AW: AW: [HACKERS] WAL does not recover gracefully from ou t-of -dis k-sp ace
Ok, I've made changes in xlog.c and run tests: Could you send me your diffs? Sorry, Monday only. Vadim ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Proposed WAL changes
What do we debate? I never told that we shouldn't worry about current WAL disability to restart. And WAL already correctly works in situation of "failing to write a couple of disk blocks when the system crashes". My statement at first place that "WAL can't help in the event of disk errors" was to remind that we should think over how much are we going to guarantee and by what means in the event our *base* requirements were not answered (guaranteed -:)). My POV is that two checkpoints increases disk space requirements for *everyday usage* while buying near nothing because of data consistency cannot be guaranteed anyway. On the other hand this is the fastest way to implement WAL restart-ability - *which is the real problem we have to fix*. Vadim ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space
Was the following bug already fixed ? I was going to ask same Q. I see that seek+write was changed to write-s in XLogFileInit (that was induced by subj, right?), but what about problem itself? DEBUG: redo starts at (0, 21075520) The Data Base System is starting up DEBUG: open(logfile 0 seg 0) failed: No such file or directory ^^^ redo started in seg 1 and shouldn't try to read seg 0... BTW, were performance tests run after seek+write -- write-s change? Write-s were not obviously faster to me, that's why I've used seek+write, but never tested that area -:( Vadim ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: AW: [HACKERS] Proposed WAL changes
I thought the intended way to change a GUC parameter permanently was to edit data/postgresql.conf . No ? What I've thought is to implement a new command to change archdir under WAL's control. If it's different from Vadim's plan I don't object. Actually, I have no concrete plans for archdir yet - this one is for WAL based BAR we should discuss in future. So, I don't see why to remove archdir from pg_control now. Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: AW: AW: AW: AW: [HACKERS] WAL-based allocation of XIDs is insecur e
Before commit or rollback the xlog is not flushed to disk, thus you can loose those xlog entries, but the index page might already be on disk because of LRU buffer reuse, no ? No. Buffer page is written to disk *only after corresponding records are flushed to log* (WAL means Write-Ahead-Log - write log before modifying data pages). Another example would be a btree reorg, like adding a level, that is partway through before a crash. And this is what I hopefully fixed recently with btree runtime recovery. Vadim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Proposed WAL changes
I have just sent to the pgsql-patches list a rather large set of Please send it to me directly - pgsql-patches' archieve is dated by Feb -:( proposed diffs for the WAL code. These changes: * Store two past checkpoint locations, not just one, in pg_control. On startup, we fall back to the older checkpoint if the newer one is unreadable. Also, a physical copy of the newest checkpoint record And what to do if older one is unreadable too? (Isn't it like using 2 x CRC32 instead of CRC64 ? -:)) And what to do if pg_control was lost? (We already discussed that we should read all logs from newest to oldest ones to find checkpoint). And why to keep old log files with older checkpoint? is kept in pg_control for possible use in disaster recovery (ie, complete loss of pg_xlog). Also add a version number for pg_control Mmmm, how recovery is possible if log was lost? All what could be done with DB in the event of corrupted/lost log is dumping data from tables *asis*, without any guarantee about consistency. How checkpoint' content could be useful? I feel that the fact that WAL can't help in the event of disk errors is often overlooked. itself. Remove archdir from pg_control; it ought to be a GUC parameter, not a special case (not that it's implemented yet anyway). I would discuss WAL based BAR management before deciding how to store/assign archdir. On the other hand it's easy to add archdir to pg_control later -:) * Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs on alternate bytes. Great if you've found reliable CRC64 impl! * Fix XLOG record length handling so that it will work at BLCKSZ = 32k. Case I've overlooked -:( (Though, I always considered BLCKSZ 8K as temp hack -:)) * Change XID allocation to work more like OID allocation, so that we can flush XID alloc info to the log before there is any chance an XID will appear in heap files. I didn't read you postings about this yet. * Add documentation and clean up some coding infelicities; move file format declarations out to include files where planned contrib utilities can get at them. Thanks for that! Before committing this stuff, I intend to prepare a contrib utility that can be used to reset pg_control and pg_xlog. This is mainly for disaster recovery purposes, but as a side benefit it will allow people Once again, I would call this "disaster *dump* purposes" -:) After such operation DB shouldn't be used for anything but dump! to update 7.1beta installations to this new code without doing initdb. I need to update contrib/pg_controldata, too. Vadim ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: AW: AW: AW: AW: [HACKERS] WAL-based allocation of XIDs is insecur e
Before commit or rollback the xlog is not flushed to disk, thus you can loose those xlog entries, but the index page might already be on disk because of LRU buffer reuse, no ? No. Buffer page is written to disk *only after corresponding records are flushed to log* (WAL means Write-Ahead-Log - write log before modifying data pages). You mean, that for each dirty buffer that is reused, the reusing backend fsyncs the xlog before writing the buffer to disk ? In short - yes. To be accurate - XLogFlush is called to ensure that records reflecting buffer' modifications are on disk. That's how it works everywhere. And that's why LRU is not good policy for bufmgr anymore (we discussed this already). Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] WAL-based allocation of XIDs is insecure
Consider the following scenario: 1. A new transaction inserts a tuple. The tuple is entered into its heap file with the new transaction's XID, and an associated WAL log entry is made. Neither one of these are on disk yet --- the heap tuple is in a shmem disk buffer, and the WAL entry is in the shmem WAL buffer. 2. Now do a lot of read-only operations, in the same or another backend. The WAL log stays where it is, but eventually the shmem disk buffer will get flushed to disk so that the buffer can be re-used for some other disk page. 3. Assume we now crash. Now, we have a heap tuple on disk with an XID that does not correspond to any XID visible in the on-disk WAL log. Impossible (with fsync ON -:)). Seems my description of core WAL rule was bad, I'm sorry -:( WAL = Write-*Ahead*-Log = Write data pages *only after* log records reflecting data pages modifications are *flushed* on disk = If a modification was not logged then it's neither in data pages. No matter when bufmgr writes data buffer (at commit time or to re-use it) bufmgr first ensures that buffer' modifications are logged. Vadim ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] WAL-based allocation of XIDs is insecure
The point is to make the allocation of XIDs and OIDs work the same way. In particular, if we are forced to reset the XLOG using what's stored in pg_control, it would be good if what's stored in pg_control is a value beyond the last-used XID/OID, not a value less than the last-used ones. If we're forced to reset log (ie it's corrupted/lost) then we're forced to dump, and only dump, data *because of they are not consistent*. So, I wouldn't worry about XID/OID/anything - we can only provide user with way to restore data ... *manually*. If user really cares about his data he must U1. Buy good disks for WAL (data may be on not so good disks). U2. Set up distributed DB if U1. is not enough. To help user with above we must D1. Avoid bugs in WAL D2. Implement WAL based BAR (so U1 will have sence). D3. Implement distributed DB. There will be no D2 D3 in 7.1, and who knows about D1. So, manual restoring data is the best we can do for 7.1. And actually, "manual restoring" is what we had before, anyway. Vadim ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] WAL-based allocation of XIDs is insecure
On third thought --- we could still log the original page contents and the modification log record atomically, if what were logged in the xlog record were (essentially) the parameters to the operation being logged, not its results. That is, make the log entry before you start doing the mod work, not after. This might also simplify redo, since redo would be no different from the normal case. I'm not sure why Vadim didn't choose to do it that way; maybe there's some other fine point I'm missing. There is one - indices over user defined data types: catalog is not available at the time of recovery, so, eg, we can't know how to order keys of "non-standard" types. (This is also why we have to recover aborted index split ops at runtime, when catalog is already available.) Also, there is no point why should we log original page content and the next modification record separately. Vadim ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] WAL-based allocation of XIDs is insecure
Hm, wasn't it handling non-atomic disk writes, Andreas? Yes, but for me, that was only one (for me rather minor) issue. I still think that the layout of PostgreSQL pages was designed to reduce the risc of a (heap) page beeing inconsistent because it is only partly written to an acceptable minimum. If your hw and os can I believe that I explained why it's not minor issue (and never was). Eg - PageRepaireFragmentation "compacts" page exactly like other, overwriting, DBMSes do and partial write of modified page means lost page content. And for what else "physical log" could be used? 1. create a consistent state if rollforward bails out for some reason but log is still readable What difference between consistent state as it was before checkpoint and after that? Why should we log old page images? New (after modification) page images are also consistent and can be used to create consistent state. 2. have an easy way to handle index rollforward/abort (might need to block some index modifications during checkpoint though) There is no problems now. Page is either splitted (new page created/ properly initialized, right sibling updated) or not. 3. ease the conversion to overwrite smgr ? 4. ease the creation of BAR to create consistent snapshot without need for log rollforward Isn't it the same as 1. with "snapshot" == "state"? Now, why should we log page as it was *before* modification? We would log modification anyway (yet another log record!) and Oh, so currently you only do eighter ? I would at least add the info which slot was inserted/modified (maybe that is already there (XID)). Relfilenode + TID are saved, as well as anything else that would required to UNDO operation, in future. would apply it to page, so result would be the same as now when we log page after modification - consistent *modifyed* page. Maybe I am too focused on the implementation of one particular db, that I am not able to see this without prejudice, and all is well as is :-) ^ I hope so -:) Vadim ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] I'm back
Hi! Snow in New York - I'm arrived only today. Reading mail... Vadim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
[HACKERS] Re: Uh, this is *not* a 64-bit CRC ...
This isn't a 64-bit CRC. It's two independent 32-bit CRCs, one done on just the odd-numbered bytes and one on just the even-numbered bytes of the datastream. That's hardly any stronger than a single 32-bit CRC; I believe that the longer data the more chance to get same CRC/hash for different data sets (if data length CRC/hash length). Or am I wrong? Having no crc64 implementation (see below) I decided to use 2 CRC32 instead of one - it looked better, without any additional cost (record header is 8 byte aligned anyway, on, mmm, most platform). it's certainly not what I thought we had agreed to implement. I've asked if anyone can send crc64 impl to me and got only one from Nathan Myers. Unfortunately, SWISS-PROT impl assumes that long long is 8 bytes - is it portable? Vadim ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] RE: [CORE] WAL RC1 status
I've reported the major problems to the mailing lists but gotten almost no feedback about what to do. I can't comment without access to code -:( commit: 2001-02-26 17:19:57 0/0059996C: prv 0/00599948; xprv 0/; xid 0; RM 0 info 00 len 32 checkpoint: redo 0/0059996C; undo 0/; sui 29; nextxid 18903; nextoid 35195; online -- this is the last normal-looking checkpoint record. -- Judging from the commit timestamps surrounding prior -- checkpoints, checkpoints were happening every five -- minutes approximately on the 5-minute mark, so You can't count on this: postmaster runs checkpoint "maker" in 5 minutes *after* prev checkpoint was created, not from the moment "maker" started. And checkpoint can take *minutes*. -- this one happened about 17:20. -- (There really should be a timestamp -- in the checkpoint records...) Agreed. commit: 2001-02-26 17:26:02 ReadRecord: record with zero len at 0/005A4B4C -- My dump program is unhappy here because the rest -- of the page is zero. Given that there is a -- continuation record at the start of the next -- page, there certainly should have been record(s) -- here. But it's worse than that: check the commit -- timestamps and the xid numbers before and after the -- discontinuity. Did time go backwards here? Commit timestamps are created *before* XLogInsert call, which can suspend backend for some time (in multi-user env). Random xid-s are also ok, generally. -- Also notice the back-pointers in the first valid -- record on the next page; they point not into the -- zeroed space, which would suggest a mere failure -- to write a buffer after filling it, but into the -- middle of one of the valid records on the prior -- page. It almost looks like page 5A6000 came from -- a completely different run than page 5A4000. Unexpected page info flags 0001 at offset 5A6000 Skipping unexpected continuation record at offset 5A6000 0/005A6904: prv 0/005A48B4(?); xprv 0/005A48B4; xid 19047; ^^ ^^ Same. So, TX 19047 really inserted record at 0/005A48B4 position. -- What's even nastier (and the immediate cause of -- Scott's inability to restart) is that the pg_control -- file's checkPoint pointer points to 0/005AF9F0, which -- is *not* the location of this checkpoint, but of -- the record after it. Well, well. Checkpoint position is taken from MyLastRecord - I wonder how could this internal var take "invalid" data from concurrent backend. Ok, we're leaving Krasnoyarsk in 8 hrs and should arrive SF Feb 5 ~ 10pm. Vadim --- FREE! The World's Best Email Address @email.com Reserve your name now at http://www.email.com ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] RE: Re: [ADMIN] v7.1b4 bad performance
So 7.0.3 is twice as fast only with fsync off. Are there FK updates/deletes in pgbench' tests? Remember how SELECT FOR UPDATE in FK triggers affects performance... Also, 5 clients is small number. Vadim P.S. Sorry for delays with my replies - internet connection is pain here: it takes 5-10 minutes to read each message -:((( --- FREE! The World's Best Email Address @email.com Reserve your name now at http://www.email.com
Re: [HACKERS] WAL does not recover gracefully from out-of-disk-space
Regardless of whether this particular behavior is fixable, this brings up something that I think we *must* do before 7.1 release: create a utility that blows away a corrupted logfile to allow the system to restart with whatever is in the datafiles. Otherwise, there is no recovery technique for WAL restart failures, short of initdb and restore from last backup. I'd rather be able to get at data of questionable up-to-dateness than not have any chance of recovery at all. I've asked 2 or 3 times how to recover from recovery failure but got no answer. We should some recipi for the failure before 7.1 release. And I answered 2 or 3 times with fixes for each reported recovery failure -:) (And asked to help with testing...) Seems to me that "fixing" is the only "answer" you would get asking the same question to Oracle, Informix or any other system with transaction log. Does anybody know how "big boys" deal with this issue? Vadim --- FREE! The World's Best Email Address @email.com Reserve your name now at http://www.email.com
Re: [HACKERS] RE: Re: [ADMIN] v7.1b4 bad performance
It may be that WAL has changed the rollback time-characteristics to worse than pre-wal ? Nothing changed ... yet. And in future rollbacks of read-only transactions will be as fast as now, anyway. What about rollbacks of a bunch uf inserts/updates/deletes? I remember a scenario where an empty table was used by several backends for gathering report data, and when the report is done they will rollback to keep the table empty. Should this kind of usage be replaced in the future by having backend id as a key and then doing delete by that key in the end ? Isn't it what we have right now? But I believe that in future we must remove modifications made by aborted transactions immediately, without keeping them till vacuum. So - yes: rollback of read-write transactions will take longer time. Vadim --- FREE! The World's Best Email Address @email.com Reserve your name now at http://www.email.com
Re: [HACKERS] RE: Re: [ADMIN] v7.1b4 bad performance
It may be that WAL has changed the rollback time-characteristics to worse than pre-wal ? Nothing changed ... yet. And in future rollbacks of read-only transactions will be as fast as now, anyway. So my guess is that the 7.1 updates (with default fsync) are significantly slower than 7.0.3 fsync=off now. Do you update tables with foreign keys? Did you run tests in multi-user or single-user environment? Vadim --- FREE! The World's Best Email Address @email.com Reserve your name now at http://www.email.com
[HACKERS] Re: Recovery of PGSQL after system crash failing!!!
It removes the need to disable fsync to get best performance! -F performance is still better, only the difference is not so big as before. Well, when "checkpoint seek in logs" will be implemented difference will be the same - lost consistency. Since there is a fundamental recovery problem if the WAL file disappears, then perhaps we should have a workaround which can ignore the requirement for that file on startup? Or maybe we do already? Vadim?? This was discussed, but iirc not yet implemented. Yes yes. Also, could the "-F" option be disabled now that WAL is enabled? Or is there still some reason to encourage/allow folks to use it? I've used it when testing btree runtime recovery to increase concurrence. I use it, since I restore after a system crash (which never happens). I think all that is probably missing in -F mode is probably 2-3 fsyncs during checkpoint. One for the xlog, and one for pg_control (maybe also pg_log). All other fsyncs are only to not buffer transactions. Probably we could just force fsync during checkpoint, for the moment. Thanks to all for help! Vadim
Re: [HACKERS] Open 7.1 items
As you can see from the current open items list, there isn't much left to do for the 7.1 release. I am going to suggest we remove the LAZY VACUUM option at this point. I know Tom Lane posted an item about the Well, leaving for vacation tomorrow I have to agree -:( LAZY patch will be available in a few days after 7.1 release. Vadim
Re: [HACKERS] RE: [ADMIN] SOS !!: Porstgress forgot all ! Help !
O, your system reached max transaction ID -:( That's two reports now of people who have managed to wrap around the XID counter. It doesn't seem that hard to do in a heavily used database. Does anyone want to take more seriously the stopgap solution I proposed for this problem (pghackers archives around 3-Nov-00)? I believe you shot it down that time, but I don't think that ignoring the problem for another release cycle is a better answer. Actually, I believed that you've done this temp solution till I've found that it's not true couple weeks ago. If you'll do this please don't forget about reusing ID of *committed* transactions and crashes - this should be handled somehow on recovery. Vadim
[HACKERS] Leaving for vacation
from Feb 15 till Mar 6... I'll not be able to read mail lists, so in the event of needs please use [EMAIL PROTECTED] address. Regards! Vadim
Re: [HACKERS] Recovery of PGSQL after system crash failing!!!
#2 0x20dc71 in abort () from /lib/libc.so.6 #3 0x8080495 in XLogFileOpen () Hm. Evidently it's failing to open the xlog file, but the code is set up in such a way that it dies before telling you why :-( Take a look at XLogFileOpen in src/backend/access/transam/xlog.c and tweak the code to tell you the path and errno it's failing on before it abort()s. I don't remember why there is abort() in XLogFileOpen just before appropriate elog(STOP) there - I'll remove it in few moments, - but it's already obvious why open failed: there is no file with checkpoint record pointed by pg_control - data/pg_xlog/002F. So, the question is who removed this file - PG or Linux? Ryan, do you have postmaster' log before crash (where MoveOfflineLogs reports WAL files to be deleted) and/or some logs from Linux' startup? And meanwhile I'll take a look arround MoveOfflineLogs... Vadim
Re: [HACKERS] Recovery of PGSQL after system crash failing!!!
DEBUG: starting up DEBUG: database system was interrupted at 2001-02-11 04:08:12 DEBUG: Checkpoint record at (0, 805076492) postmaster: reaping dead processes... Startup failed - abort And that is it, from running 'postmaster -D /usr/local/pgsql/data/'. I get the same thing each time I run it. I assume that WAL is for some reason failing to restore/recover the database. The system is a stock Debian 2.2 system, Dual PPro200, w/pgsql 7.1beta4. The system crash occured during the nightly update of the databases (from another, internal, non-pgsql, database system). Is there anyway to recover the database, or do I need to do a 'rm -rf Please try to restart with option wal_debug = 1 so postmaster log will be more informative and send this log me. data; initdb'? A quick response would be greatly appreciated. Thanks. Please archieve PG' data dir - it probably will be useful to find bug. Vadim
Re: [HACKERS] Btree runtime recovery. Stuck spins.
Hm. It was OK to use spinlocks to control buffer access when the max delay was just the time to read or write one disk page. But it sounds Actually, btree split requires 3 simult. buffers locks and after that _bt_getstackbuf may read *many* parent buffers while holding locks on 2 buffers. AFAIR, the things are even worse in hash. And anyway, there is always probability that someone else will get just freed lock while you're waiting next 0.01 sec. The problem is that there is no priority/ordering while waiting for spin lock. like we've pushed the code way past what it was designed to do. I think this needs some careful thought, not just a quick hack like increasing the timeout interval. I fear there is not enough time -:( After thinking more about this, simply increasing S_MAX_BUSY is clearly NOT a good answer. If you are under heavy load then processes that are spinning are making things worse, not better, because they are sucking CPU cycles that would be better spent on the processes that are holding the locks. It would not be very difficult to replace the per-disk-buffer spinlocks with regular lockmanager locks. Advantages: * Processes waiting for a buffer lock aren't sucking CPU cycles. * Deadlocks will be detected and handled reasonably. (The more stuff that WAL does while holding a buffer lock, the bigger the chances of deadlock. I think this is a significant concern now.) I disagree. Lmgr needs in deadlock detection code because of deadlock may be caused by *user application* design and we must not count on *user application* correctness. But we must not use deadlock detection code when we protected from deadlock by *our* design. Well, anyone can make mistake and break order of lock acquiring - we should just fix those bugs -:) So, it doesn't matter *how much stuff that WAL does while holding buffer locks* as long as WAL itself doesn't acquire buffer locks. Of course the major disadvantage is: * the lock setup/teardown overhead is much greater than for a spinlock, and the overhead is just wasted when there's no contention. Exactly. A reasonable alternative would be to stick with the spinlock mechanism, but use a different locking routine (maybe call it S_SLOW_LOCK) that is designed to deal with locks that may be held for a long time. It would use much longer delay intervals than the regular S_LOCK code, and would have either a longer time till ultimate timeout, or no timeout at all. The main problem with this idea is choosing an appropriate timeout behavior. As I said, I am concerned about the possibility of deadlocks in WAL-recovery scenarios, so I am not very happy with the thought of no timeout at all. But it's hard to see what a reasonable timeout would And I'm unhappy with timeouts -:) It's not solution at all. We should do right design instead. be if a minute or more isn't enough in your test cases; seems to me that that suggests that for very large indexes, you might need a *long* time. Comments, preferences, better ideas? For any spins which held while doing IO ops we should have queue of waiting backend' PROCs. As I said - some kind of lightweight lock manager. Just two kind of locks - shared exclusive. No structures to find locked objects. No deadlock detection code. Backends should wait on their semaphores, without timeouts. For "true" spins (held for really short time when accessing control structures in shmem) we should not sleep 0.01 sec! tv_usec == 1 would be reasonable - just to yield CPU. Actually, mutexes would be much better... Vadim
Re: [HACKERS] Btree runtime recovery. Stuck spins.
Shouldn't we increase S_MAX_BUSY and use ERROR instead of FATAL? No. If you have delays exceeding a minute, or that are even a visible fraction of a minute, then a spinlock is NOT the correct mechanism to be using to wait ... because guess what, it's spinning, and consuming processor time to no purpose. You should be using a lock instead for anything that involves more than a trivial amount of delay. "Amount of delay" depends on system load - something we can't control... Btree uses spins to lock buffers (as all other access methods) and so I could use only spins in new code. And though tree recovery locks buffers for longer time than normal insert operations it's possible to get "stuck" spins when using concurrent buffers locks *everywhere* under heavy load (especially with WAL which requires holding buffer locks for duration of logging). So, probably we have to use some kind of light locks (without deadlock detection) for buffers, in future. Vadim
Re: [HACKERS] WAL Crash during index vacuuming 7.1beta4
during the nightly vacuum pgsql closed and do not start any more. Attached the log. Seems the problem was rebuilding an Index, There is a way to force wal to ignore indexes ? The problem was in redoing tuple movement in *table*. Can I delete it ? ... DEBUG: redo starts at (6, 4144414796) NOTICE: PageAddItem: tried overwrite of used ItemId ^^^ FATAL 2: heap_update_redo: failed to add tuple I think that I've just fixed this problem (must not check itemid' flag in PageAddItem in overwrite mode when offset number == maxoff + 1). I hope that Giuseppe will check new code soon. Thanks to Giuseppe for help! Vadim
Re: [HACKERS] Open 7.1 items
Here are the open items for 7.1. Much shorter: + Runtime btree recovery Vadim
[HACKERS] Out of town these long weekends...
Re: [HACKERS] Pg7.1beta3: connect failed: The DB System is starting up.
With Apache Mod Perl, Apache::DBI, stress test with apache bench (ab -n 10 -c 4) in apache error_log i've got: [Pg7.1beta3 with standard conf files.] And how many simult connections you did? .. [Fri Jan 12 07:48:58 2001] [error] DBI-connect(dbname=mydb) failed: The Data Base System is starting up Also messages: "DB in recovery ...". Looks like server was crashed and now is in recovery. Vadim
Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)
No, I thought we agreed disk block CRC was way overkill. If the CRC on the WAL log checks for errors that are not checked anywhere else, then fine, but I thought disk CRC would just duplicate the I/O subsystem/disk checks. A disk-block CRC would detect partially written blocks (ie, power drops after disk has written M of the N sectors in a block). The disk's own checks will NOT consider this condition a failure. I'm not convinced that WAL will reliably detect it either (Vadim?). Certainly WAL will Idea proposed by Andreas about "physical log" is implemented! Now WAL saves whole data blocks on first after checkpoint modification. This way on recovery modified data blocks will be first restored *as a whole*. Isn't it much better than just detection of partially writes? Only one type of modification isn't covered at the moment - updated t_infomask of heap tuples. not help for corruption caused by external agents, away from any updates that are actually being performed/logged. What do you mean by "external agents"? Vadim
[HACKERS] Re: CRCs (was Re: [GENERAL] Re: Loading optimization)
But CRC is used in WAL records only. Oh. I thought we'd agreed that a CRC on each stored disk block would be a good idea as well. I take it you didn't do that. Do we want to consider doing this (and forcing another initdb)? Or shall we say "too late for 7.1"? I personally was never agreed to this. Reasons? Vadim
Re: [HACKERS] Quite strange crash
Well, it's not good idea because of SIGTERM is used for ABORT + EXIT (pg_ctl -m fast stop), but shouldn't ABORT clean up everything? Er, shouldn't ABORT leave the system in the exact state that it's in so that one can get a crashdump/traceback on a wedged process without it trying to clean up after itself? Sorry, I've meant "transaction abort"... Vadim
[HACKERS] New WAL version (CRC data pages backup)
just committed. initdb is required. *I didn't do any serious tests yet*, simple regression only. I'll run more tests in the next couple of days. Plase, help with this. Vadim
Re: [HACKERS] switching txlog file in 7.1beta
when doing txlog switches there seems to be a problem with remembering the correct = active logfile, when the postmaster crashes. This is one of the problems I tried to show up previously: You cannot rely on writes to other files except the txlog itself !!! Why? If you handle those files specifically, as txlog itself. Thus the current way of recording the active txlog seg and position in pg_control is busted, and must be avoided. I would try to not use pg_control for this at all, but scan the pg_xlog directory for this purpose. cusejoua=# update journaleintrag set txt_funktion=trim(txt_funktion); FATAL 2: write(logfile 0 seg 2 off 4612096) failed: No such file or directory pqReadData() -- backend closed the channel unexpectedly. This probably means the backend terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. Can you start up db with --wal_debug=1 and send me output? Vadim
Re: [HACKERS] beta testing version
I totaly missed your point here. How closing source of ERserver is related to closing code of PostgreSQL DB server? Let me clear things: (not based on WAL) That's wasn't clear from the blurb. Still, this notion that PG, Inc will start producing closed-source products poisons the well. It strengthens FUD arguments of the "open source can't provide enterprise solutions" variety. "Look, even PostgreSQL, Inc realizes that you must follow a close sourced model in order to provide tools for the corporate world." Did you miss Thomas' answer? Wasn't it clear that the order is to provide income? Vadim
Re: [HACKERS] beta testing version
There is risk here. It isn't so much in the fact that PostgreSQL, Inc is doing a couple of modest closed-source things with the code. After all, the PG community has long acknowleged that the BSD license would allow others to co-op the code and commercialize it with no obligations. It is rather sad to see PG, Inc. take the first step in this direction. How long until the entire code base gets co-opted? I totaly missed your point here. How closing source of ERserver is related to closing code of PostgreSQL DB server? Let me clear things: 1. ERserver isn't based on WAL. It will work with any version = 6.5 2. WAL was partially sponsored by my employer, Sectorbase.com, not by PG, Inc. Vadim
Re: [HACKERS] beta testing version
As for replaying logs against a restored snapshot dump... AIUI, a dump records tuples by OID, but the WAL refers to TIDs. Therefore, the WAL won't work as a re-do log to recover your transactions because the TIDs of the restored tables are all different. True for current way of backing up - ie saving data in "external" (sql) format. But there is another way - saving data files in their natural (binary) format. WAL records may be applyed to such dump, right? But (AIUI) you can only safely/usefully copy those files when the database is shut down. No. You can read/save datafiles at any time. But block reads must be "atomic" - no one should be able to change any part of a block while we read it. Cp tar are probably not suitable for this, but internal BACKUP command could do this. Restoring from such backup will like recovering after pg_ctl -m i stop: all data blocks are consistent and WAL records may be applyed to them. Many people hope to run PostgreSQL 24x7x365. With vacuuming, you might just as well shut down afterward; but when that goes away (in 7.2?), when will you get the chance to take your backups? Ability to shutdown 7.2 will be preserved -:)) But it's not required for backup. To get replaying we need an "update log", something that might be in 7.2 if somebody does a lot of work. What did you mean by "update log"? Are you sure that WAL is not "update log" ? -:) No, I'm not sure. I think it's possible that a new backup utility could be written to make a hot backup which could be restored and then replayed using the current WAL format. It might be easier to add another log which could be replayed against the existing form of backups. That last is what I called the "update log". Consistent read of data blocks is easier to implement, sure. The point is, WAL now does one job superbly: maintain a consistent on-disk database image. Asking it to do something else, such as supporting hot BAR, could interfere with it doing its main job. Of course, only the person who implements hot BAR can say. There will be no interference because of BAR will not ask WAL to do anything else it does right now - redo-ing changes. Vadim
Re: [HACKERS] Are pg_control contents really variable-length?
In xlog.c, the declaration of struct ControlFileData says: /* * MORE DATA FOLLOWS AT THE END OF THIS STRUCTURE - locations of data * dirs */ Is this comment accurate? I don't see any sign in the code of placing extra data after the declared structure. If you're planning to do it in future, I think it would be a bad idea. I'd prefer to see all the That was my thought but as you see nothing was done for it, so feel free to change anything you want there. Vadim
Re: [HACKERS] WAL
Now WAL is ON by default. make distclean + initdb are required. Vadim
Re: [HACKERS] Re: [COMMITTERS] pgsql/src/backend/access/transam (xlog.c)
Nope. Still fails... I know, but looks better, eh? -:) *** ./expected/opr_sanity.out Tue Nov 14 13:32:58 2000 --- ./results/opr_sanity.out Mon Nov 20 20:27:46 2000 *** *** 482,489 (p2.pronargs = 1 AND p1.aggbasetype = 0))); oid | aggname | oid | proname ---+-+-+- ! 16998 | max | 768 | int4larger ! 17012 | min | 769 | int4smaller (2 rows) -- Cross-check finalfn (if present) against its entry in pg_proc. --- 482,489 (p2.pronargs = 1 AND p1.aggbasetype = 0))); oid | aggname | oid | proname ---+-+-+- ! 16997 | max | 768 | int4larger ! 17011 | min | 769 | int4smaller (2 rows) -- Cross-check finalfn (if present) against its entry in pg_proc. == * [EMAIL PROTECTED] [EMAIL PROTECTED] [001120 20:11]: Date: Monday, November 20, 2000 @ 21:11:06 Author: vadim Update of /home/projects/pgsql/cvsroot/pgsql/src/backend/access/transam from hub.org:/home/projects/pgsql/tmp/cvs-serv62721/src/backend/access/transam Modified Files: xlog.c - Log Message - Init ShmemVariableCache in BootStrapXLOG() (should fix OID bootstraping). -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 (voice) Internet: [EMAIL PROTECTED] US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Re: [HACKERS] Re: [COMMITTERS] pgsql/src/backend/access/transam (xlog.c)
Larry Rosenman [EMAIL PROTECTED] writes: Nope. Still fails... You should've said that the OIDs are now just off-by-one from where they were before, instead of off by several thousand. That I'm willing to accept as an implementation change ;-) I've updated the expected file. Actually, pg_shadow' oid for DBA inserted by initdb is 2 now - I'm fixing this now... Vadim
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam (xact.c xlog.c)
Ok, so with CHECKPOINTS, we could move the offline log files to somewhere else so that we could archive them, in my undertstanding. Now question is, how we could recover from disaster like losing every table files except log files. Can we do this with WAL? If so, how can we do it? Not currently. WAL based BAR is required. I think there will be no BAR in 7.1, but it may be added in 7.1.X (no initdb will be required). Anyway BAR implementation is not in my plans. All in your hands, guys -:) Vadim Cam I ask what BAR is ? Backup And Restore. Vadim
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
Earlier, Vadim was talking about arranging to share fsyncs of the WAL log file across transactions (after writing your commit record to the log, sleep a few milliseconds to see if anyone else fsyncs before you do; if not, issue the fsync yourself). That would offer less-than- one-fsync-per-transaction performance without giving up any guarantees. If people feel a compulsion to have a tunable parameter, let 'em tune the length of the pre-fsync sleep ... Already implemented (without ability to tune this parameter - xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so backend sleeps 1/200 sec before checking/forcing log fsync. But it returns _completed_ to the client before sleeping, right? No. Vadim
[HACKERS] WAL
Hi! I'll be in Las Vegas (Comdex) till next week Wednesday. I wasn't able to implement redo for sequences but was going to turn WAL on by default anyway. Unfortunately, I've got core in regress:opr_sanity test (in FileSeek code), so WAL still is not default. See you! Vadim
Re: [HACKERS] Transaction ID wraparound: problem and proposed solution
One idea I had from this is actually truncating pg_log at some point if we know all the tuples have the special committed xid. It would prevent the file from growing without bounds. Not truncating, but implementing pg_log as set of files - we could remove files for old xids. Vadim, can you explain how WAL will make pg_log unnecessary someday? First, I mentioned only that having undo we could remove old pg_log after postmaster startup because of only committed changes would be in data files and they would be visible to new transactions (small changes in tqual will be required to take page' startup id into account) which would reuse xids. While changing a page first time in current startup, server would do exactly what Tom is going to do at vacuuming - just update xmin/xmax to "1" in all items (or setting some flag in t_infomask), - and change page' startup id to current. I understand that this is not complete solution for xids problem, I just wasn't going to solve it that time. Now after Tom' proposal I see how to reuse xids without vacuuming (but having undo): we will add XidWrapId (XWI) - xid wrap counter - to pages and set it when we change page. First time we do this for page with old XWI we'll mark old items (to know later that they were changed by xids with old XWI). Each time we change page we can mark old xmin/xmax with xid = current xid as committed long ago (basing on xact TTL restrinctions). All above assumes that there will be no xids from aborted transactions in pages, so we need not lookup in pg_log to know is a xid committed/aborted, - there will be only xids from running or committed xactions there. And we need in undo for this. Vadim
Re: [HACKERS] Transaction ID wraparound: problem and proposed solution
So, we'll have to abort some long running transaction. Well, yes, some transaction that continues running while ~ 500 million other transactions come and go might give us trouble. I wasn't really planning to worry about that case ;-) Agreed, I just don't like to rely on assumptions -:) Required frequency of *successful* vacuum over *all* tables. We would have to remember something in pg_class/pg_database and somehow force vacuum over "too-long-unvacuumed-tables" *automatically*. I don't think this is a problem now; in practice you couldn't possibly go for half a billion transactions without vacuuming, I'd think. Why not? And once again - assumptions are not good for transaction area. If your plans to eliminate regular vacuuming become reality, then this scheme might become less reliable, but at present I think there's plenty of safety margin. If undo would be implemented then we could delete pg_log between postmaster startups - startup counter is remembered in pages, so seeing old startup id in a page we would know that there are only long ago committed xactions (ie only visible changes) there and avoid xid comparison. But ... there will be no undo in 7.1. And I foresee problems with WAL based BAR implementation if we'll follow proposed solution: redo restores original xmin/xmax - how to "freeze" xids while restoring DB? So, we might eventually have a better answer from WAL, but not for 7.1. I think my idea is reasonably non-invasive and could be removed without much trouble once WAL offers a better way. I'd really like to have some answer for 7.1, though. The sort of numbers John Scott was quoting to me for Verizon's paging network throughput make it clear that we aren't going to survive at that level with a limit of 4G transactions per database reload. Having to vacuum everything on at least a 1G-transaction cycle is salable, dump/initdb/reload is not ... Understandable. And probably we can get BAR too but require full backup every WRAPLIMIT/2 (or better /4) transactions. Vadim
Re: [HACKERS] Alternative database locations are broken
I think that to handle locations we could symlink catalogs - ln -s path_to_database_in_some_location .../base/DatabaseOid But that's a kludge. We ought to discourage people from messing with the storage internals. It's not a kluge, it's a perfectly fine implementation. The only kluge here is if people have to reach in and establish such symlinks by hand. We want to set up a user interface that hides the implementation. Agreed. And I don't see problems with handling this at CREATE DATABASE time. Create database dir in specified location, create symlink from base dir and remember location name in pg_database.datpath. Vadim
Re: [HACKERS] WAL status update
"Mikheev, Vadim" [EMAIL PROTECTED] writes: I think that at least 1 2 from WAL todo (checkpoints and port to machines without TAS) is required before beta. I'm not sure that you do need to add support for machines without TAS. I pointed out a couple months ago that the non-TAS support code hasn't even compiled for the past release or three, and proposed ripping it all out rather than fixing code that clearly isn't being used. No one objected. I haven't got round to doing the ripping yet, but as far as I know there is no reason to expend work on adding more code for the non-TAS case. Oh, it's great! Thanks! one todo item is gone -:) I have a few other things I still need to do before 7.1 beta. Do you think you'll be ready in, say, a week? yes. Vadim
Re: [HACKERS] WAL status update
First, as I've already mentioned in answer to Tom about DROP TABLE, undo logic will not be implemented in 7.1 -:( Doable for tables but for indices we would need either in compensation records or in xmin/cmin in index tuples. So, we'll still live with dust from aborted xactions in our tables/indices. Does it mean that there would still be inconsistency between tables and their indexes ? Not related. I just meant to say that tuples inserted into tables/indices by aborted transactions will stay there till vacuum. Redo should guarantee that index tuples will not be lost in split operation (what's possible now), but not that an index will have correct structure after crash - parent page may be unupdated, what could be handled at run time. Vadim
Re: [HACKERS] Proposal for DROP TABLE rollback mechanism
Mmmm, why not call FlushRelationBuffers? Calling bufmgr from smgr doesn't look like right thing. ? Yes, it's a little bit ugly, but if we call FlushRelationBuffers then we will likely be doing some useless writes (to flush out pages that we are only going to throw away anyway). If we leave the buffers alone till commit, then we'll only write out pages if we need to recycle a buffer for another use during that transaction. BTW, why do we force buffers to disk in FlushRelationBuffers at all? Seems all what is required is to flush them *from* pool, not *to* disk immediately. In WAL bufmgr version (temporary in xlog_bufmgr.c) I've changed this. Actually I wouldn't worry about these writes at all - drop relation is rare case, - but ok. Also, I don't feel comfortable with the idea of doing FlushRelationBuffers mid-transaction and then relying on the buffer cache to still be empty of pages for that relation much later on when we finally commit. Sure, it *should* be empty, but I'll be happier if we flush the buffer cache immediately before deleting the file. (It *must* be empty -:) Sorry, I don't understand "*immediately* before" in multi-user environment. Relation must be excl locked and this must be only guarantee for us, not time of doing anything with this relation.) What might make sense is to make a pass over the buffer cache at the time of DROP (inside the transaction) to make sure there are no pinned buffers for the rel --- if so, we want to elog() during the transaction not after commit. We could also release any non-dirty buffers at that point. Then after commit we know we don't care about the dirty buffers anymore, so we come back and discard them. Please note that there is xlog_bufmgr.c If you'll add/change something in bufmgr please let me know later. Vadim
Re: [HACKERS] Proposal for DROP TABLE rollback mechanism
BTW, why do we force buffers to disk in FlushRelationBuffers at all? Seems all what is required is to flush them *from* pool, not *to* disk immediately. Good point. Seems like it'd be sufficient to do a standard async write rather than write + fsync. We'd still need some additional logic at commit time, however, because we want to make sure there are no BufferDirtiedByMe bits set for the file we're about to delete... Note that there is no BufferDirtiedByMe in WAL bufmgr version. How about this: change FlushRelationBuffers so that it does standard async writes for dirty buffers and then removes all the rel's buffers ^ When it's used from vacuum no reason to do this. from the pool. This is invoked inside the transaction, same as now. Make DROP TABLE call this routine, *not* RemoveRelationBuffers. Then call RemoveRelationBuffers from smgr during transaction commit or abort. In the commit case, there really shouldn't be any buffers for the rel, so we can emit an elog NOTICE (it's too late for ERROR, no?) ^ Sure, but good time for Assert -:) if we find any. But in the abort case, we'd not be surprised to find buffers, even dirty buffers, and we just want to throw 'em away. This would also be the place to clean out the BufferDirtiedByMe state. BTW, currently smgrcommit is called twice on commit *before* xact status in pg_log updated and so you can't use it to remove files. In WAL version smgrcommit isn't called at all now but we easy can add smgrcommit call after commit record is logged. Vadim
Re: [HACKERS] Numeric file names
Now that we have numeric file names, I would like to have a command I can run from psql that will dump a mapping of numeric file name to table name, i.e., 121233 pg_proc 143423 pg_index select oid, relname from pg_class; No. select relfilenode, relname from pg_class - in theory relfilenode may differ from relation oid. Vadim
Re: [HACKERS] relation ### modified while in use
In my understanding,locking levels you provided contains an implicit share/exclusive lock on the corrsponding pg_class tuple i.e. AccessExclusive Lock acquires an exclusive lock on the corresping pg_class tuple and other locks acquire a share lock, Is it right ? No. Access...Locks are acquired over target table (table' oid is used as key for lmgr hash table), not over corresponding pg_class tuple, in what case we would use pg_clas' oid + table' oid as key (possibility I've described below). Yes,I know that "lock table" doesn't touch the correpon ding pg_class tuple at all. However isn't it equivalent ? From what POV? Lock manager will allow two simultaneous exclusive locks using these different methods (keys) and so we can interpret (use) them differently. Vadim
Re: [HACKERS] relation ### modified while in use
in general. What I'm proposing is that once an xact has touched a table, other xacts should not be able to apply schema updates to that table until the first xact commits. I agree with you. I don't know. We discussed this issue just after 6.5 and decided to allow concurrent schema modifications. Oracle has disctionary locks but run each DDL statement in separate xaction, so - no deadlock condition here. OTOH, I wouldn't worry about deadlock - one just had to follow common anti-deadlock rules. I've wondered why AccessShareLock is a short term lock. MUST BE. AccessShare-/Exclusive-Locks are *data* locks. If one want to protect schema then new schema share/excl locks must be inroduced. There is no conflict between data and schema locks - they are orthogonal. We use AccessShare-/Exclusive-Locks for schema because of... we allow concurrent schema modifications and no true schema locks were required. If we have a mechanism to acquire a share lock on a tuple,we could use it for managing system info generally. However the only allowed lock on a tuple is exclusive. Access(Share/Exclusive) Actually, just look at lock.h:LTAG structure - lock manager supports locking of "some objects" inside tables: typedef struct LTAG { OidrelId; OiddbId; union { BlockNumberblkno; Transactionxid; } objId; ... - we could add oid to union above and lock tables by acquiring lock on pg_class with objId.oid = table' oid. Same way we could lock indices and whatever we want... if we want -:) Lock on tables would give us a restricted solution about pg_class tuples. Thers'a possibility of deadlock in any case but there are few cases when AccessExclusiveLock is really needed and we could acquire an AccessExclusiveLock manually from the first if necessary. I'm not sure about the use of AccessShareLock in parse-analyze- optimize phase however. There is notion about breakable (parser) locks in Oracle documentation -:) Vadim
Re: AW: [HACKERS] relation ### modified while in use
As for locks,weak locks doesn't pass intensive locks. Dba seems to be able to alter a table at any time. Sorry, I don't understand this sentence. Tom suggested placing a shared lock on any table that is accessed until end of tx. Noone can alter table until all users have closed their txns and not accessed tables again. More of that - while one xaction will wait to alter a table no new xaction will be allowed to access this table too. Remember that this would include creating an index ... I don't think so. Index creation requires 1. share lock on schema 2. share lock on data Vadim
Re: [HACKERS] Re: [COMMITTERS] pgsql/src/test/regress/expected (plpgsql.out inet.out foreign_key.out errors.out)
Bruce Momjian [EMAIL PROTECTED] writes: Speaking of error messages, one idea for 7.2 might be to prepended numbers to the error messages. Isn't that long since on the TODO list? I know we've had long discussions about a thoroughgoing revision of error reporting. Yes, yes, yes! We need in numbers especially because of we hopefully will have savepoints in 7.2 and so we would get powerful error handling by *applications* not by *human* only. Vadim
Re: [HACKERS] SetQuerySnapshot() for utility statements
I notice that ProcessUtility() calls SetQuerySnapshot() for FETCH and COPY TO statements, and nothing else. Seems to me this is very broken. Isn't a query snapshot needed for any utility command that might do database accesses? Not needed. We don't support multi-versioning for schema operations. More of that, sometimes it would be better to read *dirty* data from system tables - so, no snapshot required. What is really, hm, not good is that first SetQuerySnapshot defines serializable snapshot for *all* transactions, even for ones with read committed isolevel: in the times of 6.5 I thought about ability to switch between isolevels inside single xaction - this is not required by standard and *bad* for system: just remember that vacuum doesn't clean up deleted tuples if there is some transaction *potentially* interested in them. For read committed xactions must be no serializable snapshot defined and MyProc-xmin must be updated when *each* top-level query begins. Vadim
Re: [HACKERS] time stops within transaction
Snapshot is made per top-level statement and functions/subqueries use the same snapshot as that of top-level statement. Not so. SetQuerySnapshot is executed per querytree, not per top-level statement --- for example, if a rule generates multiple queries from a user statement, SetQuerySnapshot is called again for each query. This is true. I just make it to work as it was in pre-6.5 times - each query of *top level* query list uses own snapshot (in read committed mode only) as if they were submitted by user one by one. But functions/subqueries called while executing query uses same snapshot as query itself. With the current structure of pg_exec_query_string(), an operation executed in the outer loop, rather than the inner, would more or less correspond to one "top level" query --- if you want to assume that pg_exec_query_string() is only called from PostgresMain. That's true today but hasn't always been true --- I believe it used to be used to parse SPI commands, and someday it may be again. It was never used in SPI. Just look at _SPI_execute. Same parent query snapshot is used in SPI functions. *But* SPI' queries *see* changes made by parent query - I never was sure about this and think I've asked other opinions. No opinions - no changes -:) If there's no concept of top-level statement,there's no concept of read consistency and MVCC isn't needed. Except of the fact that SPI' queries see changes made by parent same snapshot is used all time while executing top-level query (single query, not query list). Vadim
Re: [HACKERS] SetQuerySnapshot() for utility statements
Seems to me this is very broken. Isn't a query snapshot needed for any utility command that might do database accesses? Not needed. We don't support multi-versioning for schema operations. No? Seems to me we're almost there. Look for instance at that DROP USER bug I just fixed: it was failing because it wasn't careful to make sure that during "DROP USER foo,bar", the loop iteration to delete user bar would see the changes the first loop iteration had made. So even ^^^ Snapshot defines visibility of changes made by other transactions. Seems that you talk here about self-visibility, defined by CommandId. though we use a lot of table-level locking rather than true MVCC behavior for schema changes, ISTM that we still have to play by all the rules when it comes to tuple visibility. In particular I suspect we ought to be using standard query snapshot behavior... What would it buy for us? MVCC lies to user - it returns view of data as they were some time ago. What would we get by seeing old view of catalog? More of that, sometimes it would be better to read *dirty* data from system tables - so, no snapshot required. There may be a small number of places like that, but for generic utility operations like CREATE/DROP USER, I don't see that this is a good idea. But your fix for DROP USER didn't change anything about snapshot used by scans so it's not good example. Vadim
Re: [HACKERS] time stops within transaction
I am inclined to think that we should do SetQuerySnapshot in the outer loop of pg_exec_query_string, just before calling pg_analyze_and_rewrite. This would ensure that parse/plan accesses to ^^ Actually not - snapshot is passed as parameter to heap_beginscan... And currently SnapshotNow is used everywhere. If we did that, then SetQuerySnapshot would be called once per user- written command (defining a command as whatever the grammar produces a single parsetree for, which is probably OK) so long as SPI functions don't try to use pg_exec_query_string... SPI doesn't try this from its birthday in ~6.2 Vadim
[HACKERS] WAL status todo
Well, hopefully WAL will be ready for alpha testing in a few days. Unfortunately at the moment I have to step side from main stream to implement new file naming, the biggest todo for integration WAL into system. I would really appreciate any help in the following issues (testing can start regardless of their statuses but they must be resolved anyway): 1. BTREE: sometimes WAL can't guarantee right order of items on leaf pages after recovery - new flag BTP_REORDER introduced to mark such pages. Btree should be changed to handle this case in normal processing mode. 2. HEAP: like 1., this issue is result of attempt to go without compensation records (ie without logging undo operations): it's possible that sometimes in redo there will be no space for new records because of in recovery we don't undo changes for aborted xactions immediately - function like BTREE' _bt_cleanup_page_ required for HEAP as well as general inspection of all places where HEAP' redo ops try to insert records (initially I thought that in recovery we'll undo changes immediately after reading abort record from log - this wouldn't work for BTREE: splits must be redo-ne before undo). 3. There are no redo/undo for HASH, RTREE GIST yet. This would be *really really great* if someone could implement it using BTREE' redo/undo code as prototype. These are the most complex parts of this todo. Probably, something else will follow later. Regards, Vadim