Re: [HACKERS] Nested transactions: low level stuff
> I see no concurrency problems. If two or more backends visit the same
> tuple, they either write the same value to the same position, which
> doesn't hurt, or one sees the other's changes, which is a good thing.

AFAIR, on multi-CPU platforms it's possible that a second transaction could see the COMMITTED state but still the old (subtransaction id) in xmin: it's not guaranteed that changes made on CPU1 (V1 changed first, then V2) will become visible in the same order on CPU2 (V2 may appear first, then V1).

Vadim

_
Revere Data, LLC, formerly known as Sector Data, LLC, is not affiliated with Sector, Inc., or SIAC.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
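The idempotent-write argument above can be sketched in C. The struct, flag value, and function name below are invented stand-ins for illustration, not PostgreSQL's actual HeapTupleHeader definitions:

```c
#include <stdint.h>

#define XMIN_COMMITTED 0x0100   /* hypothetical hint-bit value */

typedef struct {
    uint32_t xmin;      /* inserting transaction id */
    uint16_t infomask;  /* commit-status hint bits */
} TupleHeader;

/* Both backends derive the same values from the commit log, so
 * concurrent unsynchronized writes are idempotent. */
void set_commit_hint(TupleHeader *tup, uint32_t parent_xid)
{
    tup->xmin = parent_xid;           /* V1: replace the subtrans xid */
    tup->infomask |= XMIN_COMMITTED;  /* V2: mark committed */
    /* On weakly ordered multi-CPU hardware, another backend may observe
     * V2 before V1 unless a write barrier separates the two stores --
     * exactly the hazard Vadim describes. */
}
```

Two backends calling set_commit_hint() with the same parent xid leave the tuple in the same state regardless of interleaving; the ordering concern only arises for a third, concurrent reader.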
Re: [HACKERS] WAL replay logic (was Re: [PERFORM] Mount options f
> Added to TODO:
>
> * Allow WAL information to recover corrupted pg_controldata
>
> ... Using pg_control to get the checkpoint position speeds up the
> recovery process, but to handle possible corruption of pg_control, we
> should actually implement the reading of existing log segments in
> reverse order -- newest to oldest -- in order to find the last
> checkpoint. This has not been implemented yet.
>
> So if you do this, do you still need to store that information in
> pg_control at all?

Yes: to speed up the recovery process.

Vadim

_
Sector Data, LLC, is not affiliated with Sector, Inc., or SIAC
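The newest-to-oldest scan described in the TODO item can be sketched as follows. The array of flags stands in for actually parsing each segment's records, so everything here is an illustrative assumption:

```c
/* Scan WAL segment numbers from newest to oldest and return the highest
 * segment containing a checkpoint record, or -1 if none is found.
 * has_ckpt[i] != 0 stands in for "segment i contains a checkpoint". */
int find_last_checkpoint(const int *has_ckpt, int newest_seg)
{
    for (int seg = newest_seg; seg >= 0; seg--)
        if (has_ckpt[seg])
            return seg;   /* first hit from the new end = last checkpoint */
    return -1;            /* no checkpoint found: recovery cannot start */
}
```

This is the fallback path; reading the position straight out of pg_control skips the scan entirely, which is the speedup Vadim refers to.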
[HACKERS] PostgreSQL-R
> http://www.cs.mcgill.ca/~kemme/papers/vldb00.html

Thanks for the link, Darren; I think everyone interested in this discussion should read it.

First, I like the approach.

Second, I don't understand why people oppose 2PC for PG-R. 2PC is just a simple protocol to perform distributed commits *after* distributed conflicts have been resolved. It says nothing about *how* to resolve conflicts. Commonly, distributed locks are used; PG-R uses a GCS kind of batch locking to order distributed transactions and serialize execution of the conflicting ones. Actually, this serialization is the only drawback I see at the moment: due to the batching of writes/locks, PG-R will not allow execution of transactions from different sites in read-committed mode -- one of the conflicting transactions will be aborted instead of waiting for the abort/commit of the other and continuing execution after that.

Because it resolves conflicts *before* commit, PG-R is not an async solution. But it's not true sync replication either, due to its async commit (read the Jan vs. Darren discussion in http://archives.postgresql.org/pgsql-hackers/2002-12/msg00754.php). What's the problem with using 2PC for commit in PG-R? We could make it optional (and can discuss it later).

Next, PG-R was originally based on 6.4, so what has changed for current PG versions, where MVCC is used for concurrency control? It seems that locking tuples via the LockTable at Phase 1 is not required anymore, right? Upon receiving its local WS in Phase 3, a local transaction should just check that there are no conflicting locks from remote transactions in the LockTable, and it can commit after that. Remote transactions will not see conflicts from local ones in the LockTable, but they will notice them during execution and will be able to abort the local transactions. (I hope I didn't miss something here.)

Also, it seems that we could perform Phases 2 and 3 periodically during transaction execution. This would make the WS smaller, and conflicts between long-running transactions from different sites would be resolved faster.

Comments?
Vadim
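The Phase 3 rule proposed above -- a local transaction commits only if no remote transaction holds a conflicting lock -- might be sketched like this. The lock-table representation is invented for illustration and is not PG-R's actual structure:

```c
#include <stdbool.h>

/* Hypothetical flattened lock table: one entry per locked item. */
typedef struct {
    int  item;            /* identifier of the locked row/key */
    bool held_by_remote;  /* lock taken on behalf of a remote xact? */
} Lock;

/* Check the local write set (ws) against the lock table: any item
 * locked by a remote transaction forces the local one to abort. */
bool local_can_commit(const Lock *locks, int nlocks,
                      const int *ws, int nws)
{
    for (int i = 0; i < nws; i++)
        for (int j = 0; j < nlocks; j++)
            if (locks[j].held_by_remote && locks[j].item == ws[i])
                return false;   /* conflicting remote lock: abort */
    return true;                /* no conflicts: commit */
}
```

Running this check periodically, as Vadim suggests for Phases 2 and 3, would surface conflicts between long-running transactions earlier.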
Re: [HACKERS] RI_FKey_check: foreign key constraint blocks parall
> void
> heap_mark4fk_lock_acquire(Relation relation, HeapTuple tuple)
> {

Just wondering how you are going to implement it -- is it by using some kind of read locks, i.e. the FK transaction locks the PK to prevent delete (this is known as the pessimistic approach)?

About two years ago Jan and I discussed an optimistic approach using dirty reads, in which PK/FK transactions do not check the existence of the FK/PK until the constraint must be checked (after the statement is processed, for immediate mode; at commit time / SET CONSTRAINTS IMMEDIATE for deferred constraints). So, at check time, the FK transaction uses dirty reads to learn the existence/status of the PK:

1. No PK - abort.
2. PK (inserted?/)deleted/updated/selected for update by a concurrent transaction P - wait for P to commit/abort (just as transactions do for a concurrent same-row update); go to 1.
3. Else (PK exists and no one is changing it right now) - proceed.

The PK transaction does the same:

1. No FK - proceed.
2. FK inserted/updated/selected for update by a concurrent transaction F - wait for F to commit/abort; go to 1.

This would be more in MVCC style -:)

Vadim
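The FK-side check above is a simple wait-and-retry loop. Sketched in C with invented enum and function names (a real implementation would perform dirty heap reads and lock waits where the comments indicate):

```c
typedef enum { PK_ABSENT, PK_BEING_CHANGED, PK_STABLE } PkDirtyState;
typedef enum { FK_ABORT, FK_WAIT, FK_PROCEED } FkDecision;

/* reads[] is the sequence of dirty-read results for the PK row; each
 * PK_BEING_CHANGED result means "wait for the concurrent transaction
 * to commit/abort, then re-read". */
FkDecision fk_check(const PkDirtyState *reads, int n)
{
    for (int i = 0; i < n; i++) {
        if (reads[i] == PK_ABSENT)
            return FK_ABORT;    /* step 1: no PK */
        if (reads[i] == PK_STABLE)
            return FK_PROCEED;  /* step 3: PK exists, nobody changing it */
        /* step 2: PK_BEING_CHANGED -- wait, then go back to step 1 */
    }
    return FK_WAIT;             /* still blocked after n reads */
}
```

The PK-side check is symmetric, with ABORT and PROCEED swapped in steps 1 and 3.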
Re: [HACKERS] Idea for better handling of cntxDirty
> Wouldn't it work for cntxDirty to be set not by LockBuffer, but by
> XLogInsert for each buffer that is included in its argument list?

I thought to add a separate call to mark the context dirty, but the above should work if all callers of XLogInsert always pass all modified buffers -- please check.

Vadim
Re: [HACKERS] [GENERAL] Database replication... - Mission Critica
> My presumption would be that if you initialize 2 databases to a known
> identical start, have all the same triggers and rules on both, then
> send all queries to both databases, you will have 2 identical
> databases at the end.

This is a wrong assumption. If the 1st client executes

    UPDATE t SET a = 1 WHERE b = 2;

and the 2nd client executes

    UPDATE t SET a = 2 WHERE b = 2;

at the same time, you don't know in what order these queries will be executed on two different servers (because you can't control which transaction will lock the record(s) for update first).

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
>> The predicate for files we MUST (fuzzy) copy is:
>> File exists at start of backup
>> File exists at end of backup
>
> Right, which seems to me to negate all these claims about needing a
> (horribly messy) way to read uncommitted system catalog entries, do
> blind reads, etc. What's wrong with just exec'ing tar after having
> done a checkpoint?

Right. It looks like insert/update/etc. ops on local relations are WAL-logged, and that's OK (we have to do this). So, we only have to use the shared buffer pool for local (but probably not for temporary) relations to close this issue, yes? I personally don't see any performance issues if we do this.

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
>> (In particular, I *strongly* object to using the buffer manager at
>> all for reading files for backup. That's pretty much guaranteed to
>> blow out buffer cache. Use plain OS-level file reads. An OS
>> directory search will do fine for finding what you need to read,
>> too.)
>
> How do you get atomic block copies otherwise?

You don't need them. As long as the whole block is saved in the log on the first change to the block after the checkpoint (which you made before the backup).

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
>> So, we only have to use the shared buffer pool for local (but
>> probably not for temporary) relations to close this issue, yes? I
>> personally don't see any performance issues if we do this.
>
> Hmm. Temporary relations are a whole different story. It would be
> nice if updates on temp relations never got WAL-logged at all, but I'm
> not sure how feasible that is. Right now we don't really

There is no point in logging them.

> distinguish temp relations from ordinary ones --- in particular, they
> have pg_class entries, which surely will get WAL-logged even if we
> persuade the buffer manager not to do it for the data pages. Is that
> a problem? Not sure.

It was not about any problem. I just meant that the local buffer pool could still be used for temporary relations if someone thinks that makes any sense. Anyone?

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
>> You don't need it. As long as the whole block is saved in the log on
>> the first change to the block after the checkpoint (made before the
>> backup).
>
> I thought half the point of PITR was to be able to turn off pre-image
> logging so you can trade potential recovery

Correction - *after*-image.

> time for speed without fear of data-loss. Didn't we have this
> discussion before?

Sorry, I missed that. So, has it already been discussed what to do about partial block writes? Say the system crashed just after the LSN, but not the actual tuple etc., was stored in the on-disk block; on restart you compare the log record's LSN with the data block's LSN, they are equal, and so you *assume* that the actual data are in place too -- which is not the case. I always thought that the whole point of PITR is to be able to restore the DB fast (faster than pg_restore) *AND* up to the last committed transaction (assuming the log is OK).

Vadim
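The LSN comparison Vadim is questioning is the standard WAL redo rule; a one-line sketch, with types simplified to plain integers:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* simplified: real code uses a struct */

/* Redo rule: apply a WAL record to a page only if the page's LSN is
 * older than the record's LSN. With a torn (partial) write, the page
 * can carry the new LSN while the rest of the block is stale -- the
 * hazard described above; after-image (full-page) logging following
 * each checkpoint is what makes this rule safe. */
bool need_redo(XLogRecPtr page_lsn, XLogRecPtr record_lsn)
{
    return page_lsn < record_lsn;
}
```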
Re: [HACKERS] PITR, checkpoint, and local relations
>> How do you get atomic block copies otherwise?
>
> Eh? The kernel does that for you, as long as you're reading the
> same-size blocks that the backends are writing, no?

Good point.

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
>> As long as the whole block is saved in the log on the first change to
>> the block after the checkpoint (made before the backup).
>
> I thought half the point of PITR was to be able to turn off pre-image
> logging so you can trade potential recovery time for speed without
> fear of data-loss. Didn't we have this discussion before?
>
> Suppose you can turn PostgreSQL's atomic write on and off on the fly
> -- which means turning on or off whether XLogInsert writes a copy of
> the block into the log file upon first modification after a
> checkpoint. Then ALTER SYSTEM BEGIN BACKUP would turn on atomic write
> and then checkpoint the database. So while the OS copy of the data
> files is going on, atomic write would be enabled, and any read of a
> partial write would be fixed up by the usual crash-recovery mechanism.

Yes, a simple way to satisfy everyone.

Vadim
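Richard's switchable "atomic write" might reduce to a predicate like the following inside XLogInsert. The names and signature here are assumptions for illustration, not the actual backend API:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* simplified LSN type */

/* Decide whether this buffer modification must log a full copy of the
 * block: only when the feature is switched on (e.g. during a hot
 * backup) and this is the first touch of the page since the last
 * checkpoint's redo pointer. */
bool need_full_page_image(XLogRecPtr page_lsn, XLogRecPtr redo_ckpt,
                          bool atomic_write_on)
{
    if (!atomic_write_on)
        return false;               /* feature off: trade safety for speed */
    return page_lsn <= redo_ckpt;   /* first modification since checkpoint */
}
```

With this in place, a torn page in the backup copy is repaired on restore by replaying the full-page image, which is why the usual crash-recovery mechanism suffices.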
Re: [HACKERS] PITR, checkpoint, and local relations
> Are you sure this is true for all ports?

Well, maybe you're right and it's not. But with after-image blocks in the log after a checkpoint you really shouldn't have to worry about block atomicity, right? And the ability to turn block logging on/off, as suggested by Richard, looks appropriate for everyone, no?

> And if so, why would it be cheaper for the kernel to do it in its
> buffer manager, compared to us doing it in ours? This just seems
> bogus to rely on. Does anyone know what POSIX has to say about this?

Does doing it in ours mean reading all data files through our shared buffer pool? Sorry, I just don't see the point of that when tar etc. will work just fine. At least for the first release tar is super-OK, because there must be, and will be, other problems/bugs unrelated to how we read data files, and so the sooner we start testing the better.

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
> So I think what will work then is pg_copy (hot backup) would:
>
> 1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic
>    write, checkpoints the database and disables further checkpoints
>    (so WAL files won't be reused) until the backup is complete.
>
> 2) Change ALTER SYSTEM BACKUP DATABASE TO directory to read the
>    database directory to find which files it should back up, rather
>    than pg_class, and for each file just use system(cp ...) to copy it
>    to the backup directory.

Did you consider saving the backup on the client host (i.e. where pg_copy was started)?

> 3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the
>    pg_xlog directory and re-enables database checkpointing.

Well, wouldn't a single command ALTER SYSTEM BACKUP be enough? What's the point of having three commands? (If all of this has already been discussed, then sorry -- I'm not going to start a new discussion.)

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
> I really dislike the notion of turning off checkpointing. What if the
> backup process dies or gets stuck (eg, it's waiting for some operator
> to change a tape, but the operator has gone to lunch)? IMHO, backup
> systems that depend on breaking the system's normal operational
> behavior are broken. It should be sufficient to force a checkpoint
> when you start and when you're done --- altering normal operation in
> between is a bad design.

But you have to prevent log-file reuse while you copy the data files. That's why I asked whether the three commands from pg_copy are required and whether the backup couldn't be accomplished by issuing a single command, ALTER SYSTEM BACKUP dir | stdout (to copy data to the client side), even from psql -- so the backup process would die with the entire system -:) As for tape changing, maybe we could use some timeout and then just stop the backup process.

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
>>> It should be sufficient to force a checkpoint when you start and
>>> when you're done --- altering normal operation in between is a bad
>>> design.
>>
>> But you have to prevent log-file reuse while you copy the data files.
>
> No, I don't think so. If you are using PITR then you presumably have
> some process responsible for archiving off log files on a continuous
> basis. The backup process should leave that normal operational
> behavior in place, not muck with it.

Well, PITR without log archiving could be an alternative to pg_dump/pg_restore, but I agree that it's not a big feature to worry about.

Vadim
Re: [HACKERS] PITR, checkpoint, and local relations
>> Well, PITR without log archiving could be an alternative to
>> pg_dump/pg_restore, but I agree that it's not a big feature to worry
>> about.
>
> Seems like a pointless feature to me. A pg_dump dump serves just as
> well to capture a snapshot --- in fact better, since it's likely
> smaller, definitely more portable, amenable to selective restore, etc.

But pg_restore will probably take longer than copying the data files back and re-applying the log.

> I think we should design the PITR dump to do a good job for PITR, not
> a poor job of both PITR and pg_dump.

As I already said -- agreed -:)

Vadim
Re: [HACKERS] [BUGS] Bug #613: Sequence values fall back to previously chec
> Attached is a patch against current CVS that fixes both of the known
> problems with sequences: failure to flush XLOG after a transaction

Great! Thanks... and sorry for missing these cases a year ago -:)

Vadim
Re: [HACKERS] [BUGS] Bug #613: Sequence values fall back to previously chec
>> This isn't an issue for a SELECT nextval() standing on its own --
>> AFAIK the result will not be transmitted to the client until after
>> the commit happens. But it would be an issue for a SELECT executed
>> inside a transaction block (begin/commit).
>
> The behavior of SELECT nextval() should not be conditional on being in
> or out of a transaction block.

And it's not. But the behaviour of the application *must* be conditional on whether the transaction committed or not. What's the problem for an application that needs nextval() for external (out-of-database) purposes to use sequence values only after the transaction commits? What's *wrong* with such an application behaving the same way as when dealing with other database objects which are under transaction control (e.g. only after commit can you report to the user that $100 was successfully added to his/her account)?

I agree that if nextval-s were the only write actions in a transaction and they made some XLogInsert-s, then WAL must be flushed at commit time. But that's it. Was this fixed? Very easy.

Vadim
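For context, sequences avoid handing out duplicate values after a crash by pre-logging values ahead of the one actually returned (32 per WAL record in the PostgreSQL sources of that era, if memory serves) -- which is exactly why the XLOG record must reach disk at commit. A sketch, with the constant treated as an assumption:

```c
#include <stdint.h>

#define SEQ_LOG_VALS 32  /* values claimed ahead per WAL record (assumed) */

/* nextval() writes a WAL record claiming values up to this point, so
 * crash recovery restarts the sequence past anything that might already
 * have been handed out -- provided the record was flushed, hence the
 * commit-time flush fix discussed above. */
int64_t next_logged_value(int64_t current, int64_t increment)
{
    return current + SEQ_LOG_VALS * increment;
}
```

If the WAL record is not flushed, recovery falls back to an older logged value and the sequence can "fall back", which is the symptom in the bug report's subject line.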
Re: [HACKERS] [GENERAL] Database corruption?
>>> Um, Vadim? Still of the opinion that elog(STOP) is a good idea
>>> here? That's two people now for whom that decision has turned
>>> localized corruption into complete database failure. I don't think
>>> it's a good tradeoff.
>>
>> One is able to use pg_resetxlog, so I don't see the point of removing
>> the elog(STOP) there. What do you think?
>
> Well, pg_resetxlog would get around the symptom, but at the cost of
> possibly losing updates that are further along in the xlog than the
> update for the corrupted page. (I'm assuming that the problem here is
> a page with a corrupt LSN.) I think it's better to treat flush

On restart, the entire content of all pages modified after the last checkpoint should be restored from WAL. In Denis' case it looks like a page newly allocated for an update was somehow corrupted before heapam.c:2235 (7.1.2 sources), and so there was no XLOG_HEAP_INIT_PAGE flag in the WAL record => the page content was not initialized on restart. Denis reported a system crash -- very likely due to a memory problem.

> request past end of log as a DEBUG or NOTICE condition and keep going.
> Sure, it indicates badness somewhere, but we should try to have some
> robustness in the face of that badness. I do not see any reason why
> XLOG has to declare defeat and go home because of this condition.

OK -- what about setting some flag there on restart and aborting the restart after all records from WAL have been applied? Then the DBA would have the choice either to run pg_resetxlog after that and try to dump the data, or to restore from an old backup. I still object to just a NOTICE there -- it's too easy to miss. And in normal processing mode I'd leave the elog(STOP) there.

Vadim

P.S. Further discussion will be in the hackers list, sorry.
Re: [HACKERS] Bulkloading using COPY - ignore duplicates?
> The effects don't stop propagating there, either. The decision not to
> insert the tuple must be reported up still further, so that the
> executor knows not to run any AFTER INSERT/UPDATE triggers and knows
> not to count the tuple as inserted/updated for the command completion
> report.

But what about BEFORE insert/update triggers, which could insert records too?

Vadim
Re: [HACKERS] Bulkloading using COPY - ignore duplicates?
>> But what about BEFORE insert/update triggers, which could insert
>> records too?
>
> Well, what about them? It's already possible for a later BEFORE
> trigger to cause the actual insertion to be suppressed, so I don't see
> any difference from what we have now. If a BEFORE trigger takes
> actions on the assumption that the insert will happen, it's busted
> already.

That problem can be solved now by implementing a *single* trigger. In the future, we could give users the ability to specify trigger execution order. But with the proposed feature...

> Mind you, I'm not actually advocating that we do any of this ;-).

I understand -:)

> I was just sketching a possible implementation approach in case
> someone wants to try it.

And I'm just sketching possible problems -:)

Vadim
Re: [HACKERS] Index location patch for review (more pgbench resul
> Moving the test to a system with SCSI disks gave different results.
> There is NO difference between having the indexes on the same disk or
> a different disk from the data while running pgbench. So I leave it
> up to you guys as to include the patch or not. I do believe that even
> if performance doesn't increase, this patch has a lot of other
> benefits for admins.

Agreed.

Vadim
Re: [HACKERS] Index location patch for review
>> The more general and standard way to go is TABLESPACEs. But probably
>> the proposed feature will be compatible with tablespaces, when we get
>> them:
>
> Will it be? I'm afraid of creating a backwards-compatibility problem
> for ourselves when it comes time to implement tablespaces.

As I said, INDEX_LOCATION in CREATE DATABASE could in the future mean the location of the default tablespace for indexes, and one would be able to override the tablespace for a particular index with a TABLESPACE clause in the CREATE INDEX command.

> At the very least I'd like to see some information demonstrating how
> much benefit there is to this proposed patch, before we consider
> whether to adopt it. If there's a significant performance benefit to
> splitting a PG database along the table-vs-index divide, then it's
> interesting as a short-term improvement ... but Jim didn't even make
> that assertion, let alone provide evidence to back it up.

Agreed. He mentioned a significant performance difference, but it would be great to see results of pgbench tests with a scaling factor >= 10. Jim?

Also, after reviewing the patch I have to say that it will NOT work with WAL. Jim, please do not name the index directory TBL_NODE_index. Instead, just use a different TBL_NODE for indexes (a different number). It's not good to put if (reln->rd_rel->relkind == RELKIND_INDEX) stuff into the storage manager -- only two numbers (tblnode, relnode) must be used to identify a file, no other logical information totally unrelated to storage issues.

Vadim
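Vadim's point -- that the storage manager should identify a file by the two numbers alone -- amounts to a path rule like this. The layout string is an assumption for illustration, not the exact PostgreSQL on-disk format:

```c
#include <stdio.h>

/* Build the on-disk path for a relation file from (tblnode, relnode)
 * only; no relkind or other catalog knowledge is needed. This is what
 * lets WAL recovery open files before the catalogs are readable. */
void relation_path(char *buf, size_t len,
                   unsigned tblnode, unsigned relnode)
{
    snprintf(buf, len, "base/%u/%u", tblnode, relnode);
}
```

Under this rule, putting indexes elsewhere means giving them a different tblnode, never encoding "index-ness" into the directory name.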
Re: [HACKERS] Index location patch for review
> I don't understand the WAL issue below; can you explain? The dir name
> is the same name as the database with _index added to it. This is how
> the current datpath stuff works. I really just copied the datpath
> code to get this patch to work...

At the time of after-crash recovery, WAL is not able to read the relation description from the catalog, and so only the relfilenode is provided to the storage manager in the relation structure (look at backend/access/transam/xlogutils.c:XLogOpenRelation). Well, we could add an index/table file-type identifier to RmgrData (rmgr.c in the same dir) to set relkind in the relation structure, but I don't see any reason to do so when we can just use a different tblnode number for indexes and name index dirs just like the other dirs under 'base' are named -- i.e. only the tblnode number is used for dir names, without any additions unrelated to storage issues.

Vadim
Re: [HACKERS] Index location patch for review
> Also I have been running this patch (both 7.1.3 and 7.2devel) against
> some of my company's applications. I have loaded a small database 10G

We are not familiar with your applications. It would be better to see results of a test suite available to the community -- pgbench is the first to come to mind. Such tests would be more valuable.

Vadim
Re: [HACKERS] Index location patch for review
> Attached is a patch that adds support for specifying a location for
> indexes via the create database command. I believe this patch is
> complete, but it is my first. This patch allows index locations to be
> specified as different from data locations. Is this a feature
> direction we want to go in? Comments?

The more general and standard way to go is TABLESPACEs. But probably the proposed feature will be compatible with tablespaces, when we get them: we could use the new CREATE DATABASE syntax to specify the default tablespace for indexes.

Unfortunately I removed the message with the patch; can you send it to me, Bruce?

Vadim
Re: [HACKERS] Problems starting up postgres
> Sep 6 02:09:30 mx postgres[13468]: [9] FATAL 2: XLogFlush:
> request(1494286336, 786458) is not satisfied -- flushed to
> (23, 2432317444)

First note that Denis could just restart with wal_debug = 1 to see the bad request, without a code change. (We should ask people to set wal_debug ON in the case of any WAL problem...) Denis, could you provide us with the debug output?

> Yeek. Looks like you have a page somewhere in the database with a
> bogus LSN value (xlog pointer) ... and, most likely, other corruption
> as well.

We got the error during a checkpoint, when the backend flushes pages changed by REDO (and *only those pages*). So, page X (with the bad LSN) was recovered from the log. We didn't see CRC errors, so the log is OK, physically. We should find out what page X is (by setting a breakpoint as suggested by Tom) and then look into the debug output to see where we got the bad LSN. Maybe it comes from restored pages or from the checkpoint LSN, due to errors in XLogCtl initialization, but it surely looks like a bug in the WAL code.

> Vadim, what do you think of reducing this elog from STOP to a notice
> on a permanent basis? ISTM we saw cases during 7.1 beta where this

And increase the probability that people will just miss/ignore the NOTICE and the bug in WAL will continue to harm others?

> STOP prevented people from recovering, so I'm thinking it does more

And we fixed a bug in WAL that time...

> harm than good to overall system reliability.

There is no reliability while there are bugs in the WAL code, so I object. But I'd move the check into the XLogWrite code, to STOP if the flush request is beyond the write point.

Denis, please help us to fix this bug. Some GDB-ing will probably be required. If you don't have enough time/disk resources but are able to give us a copy of the data dir, that would be great (I have RedHat 7.? and Solaris 2.6 hosts; Tom?). In any case the debug output is the first thing I'd like to see. If it's big, please send it to Tom and me only. And of course you can contact me in Russian -:)

Vadim
Re: [HACKERS] [GENERAL] getting the oid for a new tuple in a BEFORE trigger
> We need to control database changes within BEFORE triggers. There is
> no problem with triggers called by update, but there is a problem with
> triggers called by insert. We strongly need to know the oid of a
> newly inserted tuple. In this case, we use tg_newtuple of the
> TriggerData structure passed to the trigger function, and its
> t_data->t_oid will have the value '0'. Using BEFORE and AFTER
> triggers would make our lives much harder. Is there any way (even a
> hack) to get the oid the newly inserted tuple will receive?

Just set t_data->t_oid = newoid() -- this is what the backend does in heapam.c:heap_insert().

Vadim
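Vadim's suggested hack can be modeled outside the backend like this; newoid() here is a stand-in counter, not the backend's real OID generator, and the wrapper function is invented for illustration:

```c
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

static Oid oid_counter = 1000;            /* stand-in OID generator */
static Oid newoid(void) { return ++oid_counter; }

/* In a BEFORE INSERT trigger the tuple's OID is still 0; assigning it
 * yourself -- the same call heap_insert() would make -- means the
 * trigger body knows the OID the tuple will receive, and heap_insert()
 * keeps the assigned value. */
Oid ensure_tuple_oid(Oid *t_oid)
{
    if (*t_oid == InvalidOid)
        *t_oid = newoid();
    return *t_oid;
}
```

In a real C trigger the field would be tuple->t_data->t_oid on the HeapTuple from TriggerData.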
Re: [HACKERS] CURRENT OF cursor without OIDs
> Oops, I'm referring to client-side cursors in our ODBC driver. We
> have no cross-transaction cursors yet, though I'd like to see a
> backend cross-transaction cursor also.

Oops, sorry. BTW, what are the "visibility" rules for an ODBC cross-transaction cursor? No repeatable reads, no serializability? Do you hold some locks on the table while the cursor is open (I noticed session locking in lmgr recently)? Could ODBC cross-transaction cursors be implemented using server cross-transaction cursors? I think we'll be able to restore the old TID along with other tuple data from rollback segments, so I don't see any problem from osmgr...

> How do we detect the change of tuples from clients?

What version of the tuple must the client see? The new one? TIDs are invariant under osmgr; xmin is about to become unreliable for that purpose. Seems I have to learn more about ODBC cross-transaction cursors -:(

Anyway, *MSQL*, Oracle, Informix -- they all have an osmgr. Do they have cross-transaction cursors in their ODBC drivers?

Vadim
RE: [HACKERS] User locks code
>> Besides, anyone who actually wanted to use the userlock code would
>> need only to write their own wrapper functions to get around the GPL
>> license.
>
> This is a part of copyright law that eludes me -- can I write a
> replacement function for something so simple that it can essentially
> be done in one way only (like incrementing a value by one)?

Yes, this is what bothers me in the user-lock case. On the other hand, contrib/userlock's licence cannot cover usage of LOCKTAG and LockAcquire (because that code is from the backend), and that is all the user_lock functions use. So, that licence is unenforceable for everything... except the function names -:)

Vadim
RE: [OT] Re: [HACKERS] User locks code
> Because the code we got from Berkeley was BSD-licensed, we can't
> change it, and because many of us like the BSD license better: we
> don't want to require them to release the source code, we just want
> them to use PostgreSQL.

And we think they will release the source code eventually anyway. And we think that no one will try to fork and commercialize the server code -- today, when SAP and InterBase open their DB code, it seems a no-brainer.

Vadim
RE: [HACKERS] User locks code
> So, rather than going over everyone's IANAL opinions about mixing
> licenses, let's just let Massimo know that it'd just be a lot easier
> to PostgreSQL/BSD-license the whole thing, if he doesn't mind too
> much.

Yes, it would be better.

Vadim
RE: [HACKERS] User locks code
>> If the licence becomes a problem I can easily change it, but I prefer
>> the GPL if possible.
>
> We just wanted to make sure the backend changes were not under the
> GPL.

No, Bruce -- the backend part of the code is useless without the interface functions, and I wonder whether the GPL-ed interface implementation doesn't prevent using user locks in *commercial* applications. For example, one could use user locks for processing incoming orders by multiple operators:

    SELECT * FROM orders WHERE user_lock(orders.oid) = 1 LIMIT 1;

so each operator would lock one order for processing, and operators wouldn't block each other. So, could such an application be commercial under the current licence of user_lock()? (Sorry, I'm not a licence guru.)

DISCLAIMER (to avoid ungrounded rumors -:)): I have no plans to use user locks in applications of *any* kind (free/commercial). It's just a matter of principle -- anything in/from the backend code may be used for any purpose; that's the nature of our licence.

DISCLAIMER II (to avoid ungrounded rumors in the future -:)): I would be interested in using the proposed key-locking in one particular commercial application, but this feature is not a must-have for that application -- for my purposes this is enough:

    LOCKTAG tag;

    tag.relId = XactLockTableId;
    tag.dbId = _tableId_;       /* tag.dbId = InvalidOid is used in
                                 * XactLockTableInsert and no valid OID
                                 * is ever used for XactLock purposes,
                                 * so no problem */
    tag.objId.xid = _user_key_;

-- but I like standard solutions more -:) (BTW, key-locking was requested by others long ago.)

Vadim
RE: [HACKERS] User locks code
> > For example, one could use user locks for processing incoming orders by multiple operators: select * from orders where user_lock(orders.oid) = 1 LIMIT 1 - so each operator would lock one order for processing and operators wouldn't block each other. So, could such an application be commercial under the current licence of user_lock()? (Sorry, I'm not a licence guru.)
>
> I assume any code that uses contrib/userlock has to be GPL'ed, meaning it can be used for commercial purposes but can't be sold as binary-only, and actually can't be sold for much because you have to make the code available for near-zero cost.

I'm not talking about selling contrib/userlock separately, but about the ability to sell applications which use contrib/userlock. Sorry if that was not clear.

Vadim
RE: [HACKERS] User locks code
> > > I assume any code that uses contrib/userlock has to be GPL'ed, meaning it can be used for commercial purposes but can't be sold as binary-only, and actually can't be sold for much because you have to make the code available for near-zero cost.
> >
> > I'm not talking about selling contrib/userlock separately, but about the ability to sell applications which use contrib/userlock. Sorry if that was not clear.
>
> No, you were clear.

So I missed your "near-zero cost" sentence.

> My assumption is that once you link that code into the backend, the entire backend is GPL'ed, and any other application code you link into it is also (stored procedures, triggers, etc.). I don't think your client application will be GPL'ed, assuming you didn't link in libreadline.

The application would explicitly call user_lock() functions in queries, so the issue is still not clear to me. And once again - compare the complexities of contrib/userlock and the backend's userlock code: what's the reason to cover contrib/userlock by the GPL?

Vadim
RE: [HACKERS] User locks code
> > The application would explicitly call user_lock() functions in queries, so the issue is still not clear to me. And once again -
>
> Well, yes, it calls user_lock(), but the communication is not OS-linked, it is linked over a network socket, so I don't think the GPL spreads over a socket. Just as telnet'ing somewhere and typing 'bash' doesn't make your telnet GPL'ed, so I think the client code is safe. To the client, the backend is just returning information. You don't really link to the query results.

Ah, ok.

> > compare the complexities of contrib/userlock and the backend's userlock code: what's the reason to cover contrib/userlock by the GPL?
>
> Only because Massimo prefers it. I can think of no other reason. It clearly GPL-stamps any backend that links it in.

Ok, let's take one step back - you wrote:

> My assumption is that once you link that code into the backend, the entire backend is GPL'ed, and any other application code you link into it is also (stored procedures, triggers, etc.)

So, one would have to open-source and GPL all procedures/triggers used by an application, just because the application uses user_lock() in queries?! Is that good?

Vadim
RE: [HACKERS] CURRENT OF cursor without OIDs
> > AFAICS, if you are holding an open SQL cursor, it is sufficient to check that ctid hasn't changed to know that you have the same, un-updated tuple. Under MVCC rules, VACUUM will be unable to delete any tuple that is visible to your open transaction, and so new-style VACUUM cannot recycle the ctid. ...
>
> As Tom mentioned once in this thread, I've referred to non-SQL cursors which could go across transaction boundaries. TIDs aren't that reliable across transactions.

We could avoid reassignment of MyProc->xmin while cursors are held open across transaction boundaries, and so new-style vacuum wouldn't remove the old tuple versions...

> OIDs and xmin have already lost a part of their nature. Probably I have to guard myself beforehand, and so would have to mention repeatedly from now on that, if we switch to an overwriting smgr, there's no system item to detect the change of tuples.

So, is tid ok to use for your purposes? I think we'll be able to restore the old tid along with the other tuple data from rollback segments, so I don't see any problem from the overwriting smgr...

Vadim
[HACKERS] RE: User locks code
> yep:
>
>   lock tablename.colname.val=1
>   select count(*) from tablename where colname=1
>
> If no rows, insert, else update. (Dunno if the locks would scale to a scenario with hundreds of concurrent inserts - how many user locks max?)

I don't see a problem here - just a few bytes in shmem per key. An auxiliary table would keep refcounters for the keys.

> Why wouldn't it work with serializable isolevel?

Because selects see an old database snapshot, and so you wouldn't see a key inserted+committed by a concurrent transaction.

Vadim
RE: [HACKERS] User locks code
> Regarding the licencing of the code, I always release my code under GPL, which is the licence I prefer, but my code in the backend is obviously released under the original postgres licence. Since the module is loaded dynamically and not linked into the backend I don't see a problem here.

The problem is how to use user locks in commercial projects. Some loadable interface functions are required to use the in-backend user lock code, but the interface is so simple that if one wrote new functions they would look the same as yours, which are covered by the GPL.

> If the licence becomes a problem I can easily change it, but I prefer the GPL if possible.

Actually, I don't see why your contrib module should be covered by the GPL. There is not so much IP (intellectual property) there. The real new things which make the feature possible are in the lock manager.

Vadim
RE: [HACKERS] RE: User locks code
> > I don't see a problem here - just a few bytes in shmem per key. An auxiliary table would keep refcounters for the keys.
>
> I think that running out of shmem *would* be a problem for such a facility. We have a hard enough time now sizing the lock table for

The auxiliary table would have a fixed size, and so no new keys would be added when there is no space. I don't see a problem with a default 8Kb aux table, do you?

> system locks, even though they use fixed-size keys and the system as a whole is designed to ensure that not too many locks will be held simultaneously. (For example, SELECT FOR UPDATE doesn't try to use per-tuple locks.) Earlier in this thread, someone proposed using user locks as a substitute for SELECT FOR UPDATE. I can guarantee you that that someone will run out of shared memory before long, if the userlock table resides in shared memory.

How is the proposed key locking different from the user locks we have right now? Anyone can try to acquire many, many user locks.

Vadim
[HACKERS] RE: User locks code
> Would your suggested implementation allow locking on an arbitrary string?

Well, placing a string in LOCKTAG is not good, so we could create an auxiliary hash table in shmem to keep such strings and use the string's address as part of the LOCKTAG. A new function (LockRelationKey?) in lmgr.c would first find/place the key in that table, then construct the LOCKTAG and call LockAcquire. Possible syntax:

  LOCK TABLE relation IN {SHARE | EXCLUSIVE} MODE ON KEY user-string [FOR TRANSACTION | FOR SESSION];
  UNLOCK (RELEASE?) TABLE relation {SHARE | EXCLUSIVE} LOCK ON KEY user-string;

(or just some built-in functions).

> If it does then one of the things I'd use it for is to insert unique data without having to lock the table or roll back on a failed insert (unique index still kept as a guarantee).

(A classic example of how SAVEPOINTs could be used -:)) So, in your application you would first lock a key in exclusive mode (for the duration of the transaction), then try to select, and insert unless found? (Note that this will not work with the serializable isolevel.)

Comments?

Vadim
[HACKERS] User locks code
1. Just noted this in contrib/userlock/README.user_locks:

   User locks, by Massimo Dal Zotto [EMAIL PROTECTED]
   Copyright (C) 1999, Massimo Dal Zotto [EMAIL PROTECTED]
   This software is distributed under the GNU General Public License
   either version 2, or (at your option) any later version.

Well, anyone can put code into contrib with whatever license he/she wants, but the user locks package includes interface functions in contrib *and* changes in our lock manager, ie changes in backend code. I wonder if the backend part of the package is covered by the same license above? And is it good if yes?

2. Not a good implementation, imho. It's too complex (separate lock method table, etc). It would be much cleaner to implement this feature the same way as transactions wait for another transaction's commit/abort: by locking objects in a pseudo table. We could get rid of offnum and lockmethod in LOCKTAG and add

   struct {
       Oid RelId;
       Oid ObjId;
   } userObjId;

to the objId union of LOCKTAG. This way a user could lock whatever object he/she wants in a specified table, and note that we would be able to use table access rights to control whether a user is allowed to lock objects in a table - missing in 1.

One could object that 1. is good because user locks never wait. I argue that never waiting for a lock is just as bad as always waiting. Someday we'll have time-wait etc features for the general lock method and everybody will be happy -:)

Comments?

Vadim

P.S. I could add 2. very fast, no matter if we keep 1. or not.
[HACKERS] Using POSIX mutex-es
1. Just changed

   TAS(lock)      to pthread_mutex_trylock(lock)
   S_LOCK(lock)   to pthread_mutex_lock(lock)
   S_UNLOCK(lock) to pthread_mutex_unlock(lock)

   (and S_INIT_LOCK to share mutexes between processes).

2. pgbench was initialized with scale 10. SUN WS 10 (512Mb), Solaris 2.6 (I'm unable to test on the E4500 -:(). -B 16384, wal_files 8, wal_buffers 256, checkpoint_segments 64, checkpoint_timeout 3600. 50 clients x 100 transactions (after initialization the DB dir was saved, and before each test copied back and vacuum-ed).

3. No difference. The mutex version is maybe 0.5-1% faster (eg: 37.264238 tps vs 37.083339 tps).

So - no gain, but no performance loss from using the pthread library (I've also run tests with 1 client), at least on Solaris. And so - it looks like we can use POSIX mutexes and condition variables (not semaphores; man pthread_cond_wait) and should implement a light lmgr, probably with priority locking.

Vadim
RE: [HACKERS] Performance TODO items
> > > * Order duplicate index entries by tid
> >
> > In other words - add tid to the index key: a very old idea.
>
> I was thinking during index creation, it would be nice to order them by tid, but not do lots of work to keep it that way.

I hear this "not do lots of work" so often from you -:) The days of simplicity are gone, Bruce. To continue, this project requires more and more complex solutions.

> > > * Add queue of backends waiting for spinlock
> >
> > We shouldn't mix two different approaches for different kinds of short-time internal locks - in some cases we need a light lmgr (when we're going to keep a lock long enough, eg for IO), and in other cases we'd better proceed with POSIX mutexes or semaphores instead of spinlocks. Queueing backends waiting for a spinlock sounds like nonsense - how are you going to protect such a queue? With spinlocks? -:)
>
> Yes, I guess so, but hopefully we can spin waiting for the queue lock rather than sleep. We could use POSIX spinlocks/semaphores now but we don't because of performance, right?

No. As long as no one has proved with a test that mutexes are bad for performance... Funny, such a test would require ~ 1 day of work.

> Should we be spinning waiting for spinlock on multi-cpu machines? Is that the answer?

What do you mean?

Vadim
RE: [HACKERS] Performance TODO items
> > > We could use POSIX spinlocks/semaphores now but we don't because of performance, right?
> >
> > No. As long as no one has proved with a test that mutexes are bad for performance... Funny, such a test would require ~ 1 day of work.
>
> Good question. I know the number of function calls to spinlock stuff is huge. Seems real semaphores may be a big win on multi-cpu boxes.

Ok, being tired of endless discussions, I'll try to use mutexes instead of spinlocks and run pgbench on my Solaris WS 10 and E4500 (4 CPU) boxes.

Vadim
RE: [HACKERS] Performance TODO items
> New TODO entries are:
>
> * Order duplicate index entries by tid

In other words - add tid to the index key: a very old idea.

> * Add queue of backends waiting for spinlock

We shouldn't mix two different approaches for different kinds of short-time internal locks - in some cases we need a light lmgr (when we're going to keep a lock long enough, eg for IO), and in other cases we'd better proceed with POSIX mutexes or semaphores instead of spinlocks. Queueing backends waiting for a spinlock sounds like nonsense - how are you going to protect such a queue? With spinlocks? -:)

Vadim
RE: OID wraparound (was Re: [HACKERS] pg_depend)
> Yes, nowhere near, and yes. Sequence objects require disk I/O to update; the OID counter essentially lives in shared memory, and can be bumped for the price of a spinlock access.

Sequences also cache values (32, afair) - ie one log record is required for 32 nextval-s. A sequence's data file is updated at checkpoint time, so - not so much IO. I really think that using sequences for system table IDs would be good.

Vadim
RE: OID wraparound (was Re: [HACKERS] pg_depend)
> If you want to make oids optional on user tables, we can vote on that.

Let's vote. I've been proposing optional oids for 2-3 years, so you know how I'll vote -:)

> However, OID's keep our system tables together.

How?! If we want to find a function with oid X we query pg_proc; if we want to find a table with oid Y we query pg_class - we always use oids in the context of the class to which an object belongs. This means that two tuples from different system tables could have the same oid values and everything would work perfectly. There is no magic around OIDs.

Vadim
RE: OID wraparound (was Re: [HACKERS] pg_depend)
> OK, we need to vote on whether Oid's are optional, and whether we can have them not created by default.

Optional OIDs: YES
No OIDs by default: YES

> > > However, OID's keep our system tables together.
> >
> > How?! If we want to find a function with oid X we query pg_proc; if we want to find a table with oid Y we query pg_class - we always use oids in the context of the class to which an object belongs. This means that two tuples from different system tables could have the same oid values and everything would work perfectly.
>
> I meant we use them in many cases to link entries, and in pg_description for descriptions and lots of other things that may use them in the future for system table use.

So, add the class' ID (unique id from pg_class) when linking.

Vadim
RE: [HACKERS] Re: Strangeness in xid allocation / snapshot setup
> > You forget about Tx Old! The point is that changes made by Tx Old *over* Tx New's changes effectively make those Tx New changes *visible* to Tx S!
>
> Yes, but what's that got to do with the order of operations in GetSnapshotData? The scenario you describe can occur anyway.

Try to describe it step by step.

> Only if Tx Old is running in Read Committed mode, of course. But if it is, then it's capable of deciding to update a row updated by Tx New. Whether Tx S's xmax value is before or after Tx New's ID is not going to change the behavior of Tx Old.

1. I consider the particular case when Tx S's xmax is before Tx New's ID.

1.1. For this case, acquiring the SInval lock before ReadNewTransactionId() changes the behavior of Tx Old: it postpones the change of Tx Old's (and Tx New's) MyProc->xid in xact.c:CommitTransaction(), so Tx S will see Tx Old as running, ie Tx Old's changes will be invisible to Tx S on the basis of analyzing MyProc xids, just like Tx New's changes will be invisible on the basis of analyzing the next Tx ID.

2. If you can find examples where the current code is not able to provide a consistent snapshot of running transactions (out of interest), let's think how to fix the code. Until then, my example shows why we cannot move the SInval lock request after ReadNewTransactionId().

> I'd still like to change GetSnapshotData to read the nextXid before it acquires SInvalLock, though. If we did that, it'd be safe to make GetNewTransactionId be
>
>   SpinAcquire(XidGenLockId);
>   xid = nextXid++;
>   SpinAcquire(SInvalLockId);
>   MyProc->xid = xid;
>   SpinRelease(SInvalLockId);
>   SpinRelease(XidGenLockId);
>
> which is really necessary if you want to avoid assuming that TransactionIds can be fetched and stored atomically.

To avoid that assumption one would have to add a per-MyProc spinlock.

Vadim
RE: [HACKERS] Re: Strangeness in xid allocation / snapshot setup
> Oh, now I get it: the point is to prevent Tx Old from exiting the set of still-running xacts as seen by Tx S. Okay, it makes sense. I'll try to add some documentation to explain it.

TIA! I've had no time since '99 -:)

> Given this, I'm wondering why we bother with having a separate XidGenLock spinlock at all. Why not eliminate it and use the SInval spinlock to lock GetNewTransactionId and ReadNewTransactionId?

Reading all the MyProc entries in GetSnapshot may take a long time - why disallow a new Tx to begin?

> What did you think about reordering the vacuum qual tests and the AbortTransaction sequence?

Sorry, no time at the moment.

> BTW, I'm starting to think that it would be really nice if we could replace our spinlocks with not just a semaphore, but something that has a notion of shared and exclusive lock requests. For example, if GetSnapshotData could use a shared lock on SInvalLock, it'd improve concurrency.

Yes, we've already talked about a light lock manager (no deadlock detection etc).

Vadim
RE: [HACKERS] Re: Strangeness in xid allocation / snapshot setup
> > Isn't a spinlock just a few ASM instructions?... on most platforms...
>
> If we change over to something that supports read vs write locking, it's probably going to be rather more than that ... right now, I'm pretty dissatisfied with the performance of our spinlocks under load.

We shouldn't use light locks everywhere. Updating/reading MyProc.xid is a very good place to use simple spinlocks... or, even better, mutexes.

Vadim
RE: [HACKERS] Re: Strangeness in xid allocation / snapshot setup
> > > Given this, I'm wondering why we bother with having a separate XidGenLock spinlock at all. Why not eliminate it and use the SInval spinlock to lock GetNewTransactionId and ReadNewTransactionId?
> >
> > Reading all the MyProc entries in GetSnapshot may take a long time - why disallow a new Tx to begin?
>
> Because we need to synchronize? It bothers me that we're assuming that fetching/storing XIDs is atomic. There's no possibility at all of going to 8-byte XIDs as long as the code is like this. I doubt that a spinlock per PROC structure would be a better answer, either; the overhead of getting and releasing each lock would be nontrivial, considering the small number of instructions spent at each PROC in these routines.

Isn't a spinlock just a few ASM instructions?... on most platforms...

Vadim
RE: [HACKERS] Rule recompilation
> Anyway, what's the preferred syntax for triggering the rule recompilation? I thought about
>
>   ALTER RULE {rulename|ALL} RECOMPILE;
>
> where ALL triggers only those rules where the user actually has RULE access right on a relation.

In a good world, rules (PL functions etc) would be automatically marked as dirty (ie recompilation required) whenever the referenced objects are changed.

Vadim
[HACKERS] RE: Strangeness in xid allocation / snapshot setup
> I am trying to understand why GetSnapshotData() needs to acquire the SInval spinlock before it calls ReadNewTransactionId, rather than after. I see that you made it do so --- in the commit at http://www.ca.postgresql.org/cgi/cvsweb.cgi/pgsql/src/backend/storage/ipc/shmem.c.diff?r1=1.41&r2=1.42 but I don't understand why the loss of concurrency is necessary. Since we are going to treat all xids >= xmax as in-progress anyway, what's wrong with reading xmax before we acquire the SInval lock?

AFAIR, I made it so to prevent the following:

1. Tx Old is running.
2. Tx S reads the new transaction ID in GetSnapshotData() and is swapped away before SInval is acquired.
3. Tx New gets a new transaction ID, makes changes and commits.
4. Tx Old changes some row R changed by Tx New and commits.
5. Tx S gets its snapshot data and now sees R changed by *both* Tx Old and Tx New, *but* does not see *other* changes made by Tx New => Tx S reads inconsistent data.

As for the issue below - I don't remember why I decided that it's not important, and I will need some time to remember.

> Also, it seems to me that in GetNewTransactionId(), it's important for MyProc->xid to be set before releasing XidGenLockId, not after. Otherwise there is a narrow window for failure:
>
> 1. Process A calls GetNewTransactionId. It allocates an xid of, say, 1234, and increments nextXid to 1235. Just after releasing the XidGenLock spinlock, but before it can set its MyProc->xid, control swaps away from it.
>
> 2. Process B gets to run. It runs GetSnapshotData. It sees nextXid = 1235, and it does not see xid = 1234 in any backend's proc->xid. Therefore, B will assume xid 1234 has already terminated, when it hasn't. Isn't this broken?
>
> The problem would be avoided if GetNewTransactionId sets MyProc->xid before releasing the spinlock, since then, after GetSnapshotData has called ReadNewTransactionId, we know that all older XIDs that are still active are recorded in proc structures. Comments?
>
> regards, tom lane
RE: [HACKERS] Re: Buffer access rules, and a probable bug
> On further thought, btbuild is not that badly broken at the moment, because CREATE INDEX acquires ShareLock on the relation, so there can be no concurrent writers at the page level. Still, it seems like it'd be a good idea to do LockBuffer(buffer, BUFFER_LOCK_SHARE) here, and probably also to invoke HeapTupleSatisfiesNow() via the HeapTupleSatisfies() macro so that the infomask update is checked for. Vadim, what do you think?

Looks like there is no drawback in locking the buffer, so let's lock it.

Vadim
RE: [HACKERS] RE: [BUGS] Update is not atomic
> > Incrementing the command counter is not enough - dirty reads are required to handle concurrent PK updates.
>
> What's that with you and dirty reads? Every so often you tell me that something would require them - you really like to read dirty things - no? :-)

Dirty things occur - I like to handle them -:) All the MVCC stuff is just the ability to handle dirties, unlike the old locking behaviour, when a transaction closed the doors to a table while doing its dirty things. Welcome to the open world, but be ready to handle dirty things -:)

> So let me get it straight: I execute the entire UPDATE SET A=A+1, then increment the command counter and don't see my own results? So an index scan with heap tuple check will return OLD (+NEW?) rows? Last time I fiddled around with Postgres it didn't, but I could be wrong.

How are you going to see concurrent PK updates without dirty reads? If two transactions inserted the same PK and perform the duplicate check at the same time - how will they see the duplicates if no one has committed yet?

Look - there is a very good example of using dirty reads in the current system: unique indices, from where we started this thread. So, how does the unique btree handle concurrent (and its own!) duplicates? Btree calls heap_fetch with SnapshotDirty to see valid and *going to be valid* tuples with a duplicate key. If VALID -- ABORT; if UNCOMMITTED (going to be valid) -- wait for the concurrent transaction's commit/abort (note that, for obvious reasons, heap_fetch(SnapshotDirty) doesn't return OLD rows modified by the current transaction). I had to add all this SnapshotDirty stuff to get the unique btree working right with MVCC.

All that I propose now is to add the ability to perform dirty scans to SPI (and so to PL/*), to be able to make the right decisions in SPI functions and triggers, and to make those decisions *at the right time*, unlike the unique btree, which makes its decision too soon. Is it clear now how to use dirty reads for PK *and* FK? You proposed using shared *row* locks for FK before. I objected then and object now.

It will not work for PK, because PK rows do not exist for concurrent transactions. What would work here is *key* locks (locks placed for some key in a table, no matter whether a row with this key exists or not). This is what good locking systems, like Informix, use. But PG is not a locking system; there is no reason to add key lock overhead, because the PG internals are able to handle dirties and we need just add the same abilities to the externals.

Vadim
RE: [HACKERS] Good name for new lock type for VACUUM?
> Any better ideas out there?

Names were always hard for me -:)

> Where did the existing lock type names come from, anyway? (Not SQL92 or SQL99, for sure.)

Oracle. Except for the Access Exclusive/Share locks.

Vadim
[HACKERS] RE: [BUGS] Update is not atomic
> The problem can be demonstrated by the following example:
>
>   create table a (a numeric primary key);
>   insert into a values (1);
>   insert into a values (2);
>   insert into a values (3);
>   insert into a values (4);
>   update a set a=a+1 where a>2;
>   ERROR: Cannot insert a duplicate key into unique index a_pkey

We use a unique index for UK/PK but shouldn't. Jan?

Vadim
[HACKERS] RE: [BUGS] Update is not atomic
> > update a set a=a+1 where a>2;
> > ERROR: Cannot insert a duplicate key into unique index a_pkey
>
> This is a known problem with unique constraints, but it's not easy to fix it.

Yes, it requires dirty reads.

Vadim
RE: [HACKERS] RE: [BUGS] Update is not atomic
> > > update a set a=a+1 where a>2;
> > > ERROR: Cannot insert a duplicate key into unique index a_pkey
> >
> > We use a unique index for UK/PK but shouldn't. Jan?
>
> What else can you use than an index? A deferred-until-statement-end trigger checking for duplicates? Think it'd have a real bad performance impact.

AFAIR, the standard requires deferred (until statement/transaction(?) end) as the default behaviour for RI (all?) constraints. But no matter what the default is, deferred *must* be available => unique indices must not be used.

> Whatever the execution order might be, the update of '3' to '4' will see the other '4' as existent WRT the scan commandId and given snapshot - right? If we, at the time we now fire up the ERROR, add the key, the index and heap to a list of possible dupkeys that we'll check at the end of the actual command, the above would work. The check at statement end would have to increment the command counter and, for each entry, do an index scan with the key, counting the number of found, valid heap tuples.

Incrementing the command counter is not enough - dirty reads are required to handle concurrent PK updates.

> Well, with some million rows, doing a set a = a + 1 could run out of memory. So this would be something that'd work in the sandbox and for non-broken applications (tm).

How is this different from the (deferred) updates of a million FKs we allow right now? Let the user decide what behaviour (deferred/immediate) he needs. The point is that now the user has no ability to choose what's right for him.

> Maybe at some level (when we escalate the lock to a full table lock?) we simply forget about single keys, but have a new index access function that checks the entire index for uniqueness.

I wouldn't bother to implement this. The user always has the ability to exclusively lock the table, drop the constraints, update whatever he wants and recreate the constraints again.

Vadim
RE: AW: [HACKERS] Postgres Replication
> Here are some disadvantages to using a trigger based approach:
>
> 1) Triggers simply transfer individual data items when they are modified, they do not keep track of transactions.

I don't know about other *async* replication engines, but Rserv keeps track of transactions (if I understood you correctly). Rserv transfers not individual modified data items but a *consistent* snapshot of changes, to move the slave database from one *consistent* state (when all RI constraints are satisfied) to another *consistent* state.

> 4) The activation of triggers in a database cannot be easily rolled back or undone.

What do you mean?

Vadim
RE: [HACKERS] Baby girl
> I had a baby girl on Tuesday. I am working through my backlogged emails today.

Congratulations -:)

Vadim
RE: AW: [HACKERS] Plans for solving the VACUUM problem
> > > So are whole pages stored in rollback segments or just the modified data?
> >
> > This is implementation dependent. Storing whole pages is much easier to do, but obviously it's better to store just the modified data.
>
> I am not sure it is necessarily better. Seems to be a tradeoff here. Pros of whole pages: a possible merge with the physical log (for the first modification of a page after a checkpoint there would be no overhead compared to current, since it is already written now).

Using WAL as the RS data storage is questionable.

> In a clever implementation, a page already in the rollback segment might satisfy the modification of another row on that page, and thus would not need any additional io.

This would be possible only if there was no commit (same SCN) between the two modifications. But aren't we too deep into overwriting smgr (O-smgr) implementation? It's doable. It has advantages in terms of the IO active transactions must do to follow MVCC. It has drawbacks in terms of required disk space (and, oh yeh, it's not easy to implement -:)). So, any other opinions about the value of O-smgr?

Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
Seems overwrite smgr has mainly advantages in terms of speed for operations other than rollback. ... And rollback is required for 5% transactions ... This obviously depends on the application. A small number of aborted transactions was used to show the uselessness of UNDO in terms of space cleanup - that's why I use the same argument to show the usefulness of O-smgr -:) I know people who roll back most of their transactions (actually they use it to emulate temp tables when reporting). Shouldn't they use TEMP tables? -:) OTOH it is possible to do without rolling back at all as the MySQL folks have shown us ;) Not with SDB tables, which support transactions. Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
OTOH it is possible to do without rolling back at all as the MySQL folks have shown us ;) Not with SDB tables, which support transactions. My point was that MySQL was used quite a long time without it and still quite many useful applications were produced. And my point was that there is no need to talk about rollbacks in a non-transaction system, and in a transaction system one has to implement rollback somehow. BTW, do you know what strategy is used by BSDDB/SDB for rollback/undo? AFAIR, they use O-smgr = UNDO is required. Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
Do we want to head for an overwriting storage manager? Not sure. Advantages: UPDATE has easy space reuse because it is usually done in-place; no index change on UPDATE unless the key is changed. Disadvantages: Old records have to be stored somewhere for MVCC use. Could limit transaction size. Really? Why is it assumed that we *must* limit the size of rollback segments? We can let them grow without bounds, as we do now, keeping old records in datafiles and letting them eat all of the disk space. UNDO disadvantages are: Limits the size of transactions to log storage size. Don't be kidding - in any system transaction size is limited by available storage. So we should just say that more disk space is required for UNDO. From my POV, putting down $100 to buy a 30Gb disk is not a big deal, keeping in mind that PGSQL requires $ZERO to be used. Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
Oracle has MVCC? With restrictions, yes. What restrictions? Rollback segments size? No, that is not the whole story. The problem with their rollback segment approach is that they do not guard against overwriting a tuple version in the rollback segment. They simply recycle each segment in a wrap-around manner. Thus there could be an open transaction that still wanted to see a tuple version that was already overwritten, leading to the feared "snapshot too old" error. Copying their rollback segment approach is imho the last thing we want to do. So, they limit the size of rollback segments, and we don't limit how big our datafiles may grow if there is some long running transaction in serializable mode. We could allow our rollback segments to grow without limits as well. Non-overwriting smgr can eat all disk space... You didn't know that? Vadim did ... Didn't I mention a few times that I was inspired by Oracle? -:) Looking at what they supply in the feature area is imho good. Copying their technical architecture is not so good in general. Copying is not inspiration -:) Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
Removing dead records from rollback segments should be faster than from datafiles. Is it for better locality or are they stored in a different way? Locality - all dead data would be localized in one place. Do you think that there is some fundamental performance advantage in making a copy to a rollback segment and then deleting it from there vs. reusing space in datafiles? As shown by WAL, additional writes don't mean worse performance. As for deleting from the RS (rollback segment) - we could remove or reuse RS files as a whole. How does it do MVCC with an overwriting storage manager? 1. System Change Number (SCN) is used: the system increments it on each transaction commit. 2. When a scan meets a data block with an SCN greater than the SCN as of when the query/transaction started, the old block image is restored using rollback segments. You mean it is restored in the session that is running the transaction? I guess that it could be slower than our current way of doing it. Yes, for older transactions which *really* need *particular* old data, but not for newer ones. Look - now transactions have to read dead data again and again, even if some of them (newer) need not see those data at all, and we keep dead data as long as required for other old transactions *just in case* they will look there. But who knows?! Maybe those old transactions will not read from the table with the big amount of dead data at all! So - why keep dead data in datafiles for a long time? This obviously affects overall system performance. Vadim
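The SCN scheme described in the message above can be sketched in a few lines. This is a conceptual toy, not Oracle's or PostgreSQL's actual code: `Block`, `rollback_segment`, `overwrite` and `read` are all invented names, and the rollback segment here is an unbounded dict rather than a recycled wrap-around file.

```python
# Toy model of SCN-based visibility: a reader whose snapshot SCN is older
# than a block's SCN reconstructs the old version from a rollback segment.

class Block:
    def __init__(self, data, scn):
        self.data, self.scn = data, scn

rollback_segment = {}   # block id -> list of (scn, old data), newest first

def overwrite(blocks, blkno, new_data, commit_scn):
    blk = blocks[blkno]
    # Save the before-image, then overwrite in place (O-smgr behavior).
    rollback_segment.setdefault(blkno, []).insert(0, (blk.scn, blk.data))
    blocks[blkno] = Block(new_data, commit_scn)

def read(blocks, blkno, snapshot_scn):
    blk = blocks[blkno]
    if blk.scn <= snapshot_scn:
        return blk.data              # current version is visible
    for scn, data in rollback_segment.get(blkno, []):
        if scn <= snapshot_scn:      # walk back to an old-enough image
            return data
    raise LookupError("snapshot too old")  # image already recycled

blocks = {1: Block("v1", scn=10)}
overwrite(blocks, 1, "v2", commit_scn=20)
```

A transaction that started at SCN 15 still sees "v1"; one that started at SCN 25 sees "v2". The "snapshot too old" branch is exactly the failure mode discussed above when old images get recycled.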
RE: [HACKERS] Plans for solving the VACUUM problem
Impractical ? Oracle does it. Oracle has MVCC? With restrictions, yes. What restrictions? Rollback segments size? Non-overwriting smgr can eat all disk space... You didn't know that? Vadim did ... Didn't I mention a few times that I was inspired by Oracle? -:) Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
If PostgreSQL wants to stay MVCC, then we should imho forget overwriting smgr very fast. Let me try to list the pros and cons that I can think of: Pro: no index modification if key stays the same; no search for free space for update (if tuple still fits into page); no pg_log. Con: additional IO to write the before-image to the rollback segment (every before-image, not only the first after a checkpoint) (also a before-image of every index page that is updated!) I don't think that Oracle writes the entire page as a before-image - just tuple data and some control info. As for additional IO - we'll do it anyway to remove the before-image (deleted tuple data) from data files. Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
I think so too. I've never said that an overwriting smgr is easy and I don't love it particularly. What I'm objecting to is avoiding UNDO without giving up an overwriting smgr. We shouldn't be noncommittal now. Why not? We could decide to do an overwriting smgr later and implement UNDO then. For the moment we could just change the checkpointer to use checkpoint.redo instead of checkpoint.undo when defining what log files should be deleted - it's a few minutes' work, and so is changing it back. Vadim
RE: AW: [HACKERS] Plans for solving the VACUUM problem
- A simple typo in psql can currently cause a forced rollback of the entire TX. UNDO should avoid this. Yes, I forgot to mention this very big advantage, but undo is not the only possible way to implement savepoints. Solutions using CommandCounter have been discussed. This would be hell. Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
Oracle has MVCC? With restrictions, yes. What restrictions? Rollback segments size? Non-overwriting smgr can eat all disk space... Isn't the same true for an overwriting smgr? ;) Removing dead records from rollback segments should be faster than from datafiles. You didn't know that? Vadim did ... Didn't I mention a few times that I was inspired by Oracle? -:) How does it do MVCC with an overwriting storage manager? 1. System Change Number (SCN) is used: the system increments it on each transaction commit. 2. When a scan meets a data block with an SCN greater than the SCN as of when the query/transaction started, the old block image is restored using rollback segments. Could it possibly be a Postgres-inspired bolted-on hack needed for better concurrency? -:)) Oracle has had MVCC for years, probably from the beginning and for sure before Postgres. BTW, are you aware how Interbase does its MVCC - is it more like Oracle's way or like PostgreSQL's? Like ours. Vadim
RE: [HACKERS] ? potential bug in LockBuffer ?
(buf->r_locks)--; if (!buf->r_locks) *buflock &= ~BL_R_LOCK; Or am I missing something... buflock is a per-backend flag, it's not in shmem. A backend is allowed only a single lock per buffer. Vadim
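The point of Vadim's answer above is the split between shared and private state: the buffer header's `r_locks` counter lives in shared memory and counts all readers, while `buflock` is each backend's private record of whether *it* holds the lock. A minimal sketch of that split (names like `BufferHeader` and `lock_buffer_shared` are illustrative, not the actual bufmgr API):

```python
# Shared reader count in the buffer header vs. a private per-backend flag.
BL_R_LOCK = 0x01

class BufferHeader:
    def __init__(self):
        self.r_locks = 0              # shared: number of read-lockers

def lock_buffer_shared(buf, backend_flags):
    # A backend is allowed only a single lock per buffer, so its private
    # flag must be clear before it takes the shared read lock.
    assert not backend_flags[0] & BL_R_LOCK
    buf.r_locks += 1
    backend_flags[0] |= BL_R_LOCK

def unlock_buffer(buf, backend_flags):
    if backend_flags[0] & BL_R_LOCK:
        buf.r_locks -= 1              # decrement the shared count
        backend_flags[0] &= ~BL_R_LOCK  # clear only *our* private flag

buf = BufferHeader()
backend1, backend2 = [0], [0]         # each backend's private buflock word
lock_buffer_shared(buf, backend1)
lock_buffer_shared(buf, backend2)
unlock_buffer(buf, backend1)          # backend2's flag and count survive
```

Clearing the private flag unconditionally (as the questioner's reading of the code suggested) would be wrong only if a backend could hold more than one read lock on the same buffer, which, per the answer, it cannot.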
RE: [HACKERS] Plans for solving the VACUUM problem
And, I cannot say that I would implement UNDO because of 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management) but because of ALL of 1., 2., 4. OK, I understand your reasoning here, but I want to make a comment. Looking at the previous features you added, like subqueries, MVCC, or WAL, these were major features that greatly enhanced the system's capabilities. Now, looking at UNDO, I just don't see it in the same league as those other additions. Of course, you can work on whatever you want, but I was hoping to see another major feature addition for 7.2. We know we badly need auto-vacuum, improved replication, and point-in-time recovery. I don't like the auto-vacuum approach in the long term, WAL-based BAR is too easy to do -:) (and you know that there is a man who will do it, probably), bidirectional sync replication is good to work on, but I'm more interested in storage/transaction management now. And I'm not sure if I'll have enough time for another major feature in 7.2 anyway. It would be better to put work into one mechanism that would reuse all tuples. This is what we're discussing now -:) If the community doesn't like UNDO then I'll probably try to implement a dead space collector which will read log files and so on. Easy to #ifdef it in 7.2 to use in 7.3 (or so) with on-disk FSM. Also, I have to implement logging for non-btree indices (anyway required for UNDO, WAL-based BAR, WAL-based space reusing). Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
We could keep the share buffer lock (or add some other kind of lock) until the tuple is projected - after projection we need not read data for the fetched tuple from the shared buffer, and the time between fetching the tuple and projection is very short, so keeping a lock on the buffer will not impact concurrency significantly. Or drop the pin on the buffer to show we no longer have a pointer to it. This is not good for seqscans, which will return to that buffer anyway. Or we could register a callback cleanup function with the buffer so bufmgr would call it when refcnt drops to 0. Hmm ... might work. There's no guarantee that the refcnt would drop to zero before the current backend exits, however. Perhaps set a flag in the shared buffer header, and the last guy to drop his pin is supposed to do the cleanup? This is what I meant - set (register) some pointer in the buffer header to a cleanup function. But then you'd be pushing VACUUM's work into productive transactions, which is probably not the way to go. Not big work - I wouldn't worry about it. Two ways: hold the index page lock until the heap tuple is checked, or (rough schema) store info in shmem (just IndexTupleData.t_tid and a flag) that an index tuple is used by some scan, so the cleaner could change the stored TID (get one from the prev index tuple) and set the flag to help the scan restore its current position on return. Another way is to mark the index tuple gone but not forgotten, so to speak --- mark it dead without removing it. (We could know that we need to do that if we see someone else has a buffer pin on the index page.) Register a cleanup function just like with the heap above. None of these seem real clean though. Needs more thought. Vadim
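The "last guy to drop his pin does the cleanup" idea from the exchange above can be sketched directly. This is a conceptual model only: `Buffer`, `pin`, `unpin` and the `cleanup` slot are illustrative stand-ins for the shared buffer header and its flag, not bufmgr's real interface.

```python
# Deferred page cleanup: register a callback on the buffer; the backend
# whose unpin drops the reference count to zero runs it.

class Buffer:
    def __init__(self):
        self.refcnt = 0
        self.cleanup = None           # pending cleanup function, if any

    def pin(self):
        self.refcnt += 1

    def unpin(self):
        self.refcnt -= 1
        if self.refcnt == 0 and self.cleanup:
            fn, self.cleanup = self.cleanup, None
            fn(self)                  # last unpinner performs the cleanup

done = []
buf = Buffer()
buf.pin(); buf.pin()                  # two backends hold pins
buf.cleanup = lambda b: done.append("compacted")   # VACUUM couldn't compact
buf.unpin()                           # still pinned elsewhere: no cleanup
buf.unpin()                           # refcnt hits 0: cleanup runs now
```

Tom's caveat above still applies to the model: nothing guarantees refcnt reaches zero promptly, and the cleanup cost lands on an ordinary transaction rather than on VACUUM.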
RE: [HACKERS] Plans for solving the VACUUM problem
We could keep the share buffer lock (or add some other kind of lock) until the tuple is projected - after projection we need not read data for the fetched tuple from the shared buffer, and the time between fetching the tuple and projection is very short, so keeping a lock on the buffer will not impact concurrency significantly. Or drop the pin on the buffer to show we no longer have a pointer to it. I'm not sure that the time to do the projection is short though --- what if there are arbitrary user-defined functions in the quals or the projection targetlist? Well, while we are on this subject I finally should mention an issue that has bothered me for a long time: only simple functions should be allowed to deal with data in shared buffers directly. Simple means: no SQL queries there. Why? One reason: we hold a shlock on the buffer while evaluating a seqscan qual - what if the qual's SQL queries try to acquire an exclock on the same buffer? Another reason - concurrency. I think that such heavy functions should be provided with a copy of the data. Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
It probably will not cause more IO than vacuum does right now. But unfortunately it will not reduce that IO. Uh ... what? Certainly it will reduce the total cost of vacuum, because it won't bother to try to move tuples to fill holes. Oh, you're right here, but the daemon will most likely read data files again and again even with an in-memory FSM. Also, if we do partial table scans then we'll probably re-read indices more than once. The index cleanup method I've proposed should be substantially more efficient than the existing code, as well. Not in the IO area. My point is that we'll need dynamic cleanup anyway, and UNDO is what should be implemented for dynamic cleanup of aborted changes. UNDO might offer some other benefits, but I doubt that it will allow us to eliminate VACUUM completely. To do that, you would need to I never said this -:) keep track of free space using exact, persistent (on-disk) bookkeeping data structures. The overhead of that will be very substantial: more, I predict, than the approximate approach I proposed. I doubt that the big guys use an in-memory FSM. If they were able to do this... Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
I hope we can avoid on-disk FSM. Seems to me that that would create problems both for performance (lots of extra disk I/O) and reliability (what happens if FSM is corrupted? A restart won't fix it). We can use WAL for FSM. Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
Really?! Once again: WAL records give you the *physical* address of tuples (both heap and index ones!) to be removed, and the size of the log to read records from is not comparable with the size of the data files. So how about a background vacuum-like process that reads the WAL and does the cleanup? Seems that would be great, since it then does not need to scan, and does not make foreground cleanup necessary. The problem is when cleanup cannot keep up with WAL files that already want to be removed. I would envision a config parameter that says how many Mb of WAL are allowed to queue up before clients are blocked. Yes, some daemon could read logs and gather cleanup info. We could activate it when switching to a new log file segment, and synchronization with the checkpointer is not a big deal. That daemon would also archive log files for WAL-based BAR, if archiving is ON. But this will be useful only with an on-disk FSM. Vadim
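The daemon sketched in the message above would scan each completed WAL segment and group the physical tuple addresses it finds into per-relation cleanup work lists. A toy version, assuming an invented record shape of `{"op", "rel", "tid"}` dicts (real WAL records are binary and far richer):

```python
# Background-cleanup sketch: harvest dead-tuple TIDs from a log segment,
# grouped by relation, so a cleaner can visit each relation's pages directly.

def gather_cleanup(wal_records):
    """Collect heap TIDs of deleted tuples, keyed by relation."""
    dead = {}                                  # relation -> list of TIDs
    for rec in wal_records:
        if rec["op"] == "delete":
            dead.setdefault(rec["rel"], []).append(rec["tid"])
    return dead

# One finished log segment's worth of (simplified) records.
segment = [
    {"op": "insert", "rel": "t1", "tid": (1, 1)},
    {"op": "delete", "rel": "t1", "tid": (1, 2)},
    {"op": "delete", "rel": "t2", "tid": (7, 3)},
]
```

Because the TIDs are physical (block, offset) addresses, the cleaner never needs a full table or index scan, which is exactly the advantage claimed over VACUUM in this thread.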
RE: [HACKERS] Plans for solving the VACUUM problem
My point is that we'll need dynamic cleanup anyway and UNDO is what should be implemented for dynamic cleanup of aborted changes. I do not yet understand why you want to handle aborts differently from outdated tuples. Maybe because aborted tuples have a shorter time-to-live. And the probability of finding pages for them in the buffer pool is higher. The ratio in a well tuned system should well favor outdated tuples. If someone ever adds dirty read it is also not the case that it is guaranteed that nobody accesses the tuple you currently want to undo. So I really fail to see the big difference. It will not be guaranteed anyway as soon as we start removing tuples without exclusive access to the relation. And, I cannot say that I would implement UNDO because of 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management) but because of ALL of 1., 2., 4. Vadim
RE: [HACKERS] Plans for solving the VACUUM problem
From: Mikheev, Vadim Sent: Monday, May 21, 2001 10:23 AM To: 'Jan Wieck'; Tom Lane Cc: The Hermit Hacker; 'Bruce Momjian'; [EMAIL PROTECTED] Strange address, Jan? Subject: RE: [HACKERS] Plans for solving the VACUUM problem I think the in-shared-mem FSM could have some max-per-table limit and the background VACUUM just skips the entire table as long as nobody reused any space. Also it might only compact pages that lead to 25 or more percent of freespace in the first place. That makes it more likely that if someone looks for a place to store a tuple that it'll fit into that block (remember that the toaster tries to keep main tuples below BLKSZ/4). This should be a configurable parameter like PCTFREE (or something like that) in Oracle: consider a page for insertion only if it's PCTFREE % empty. Vadim
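The PCTFREE-style rule proposed above is a one-line predicate: a page is an insertion candidate only if at least the configured percentage of it is free. A sketch with an invented page representation (page number mapped to used bytes); the 25% default mirrors the threshold Jan suggests, not any actual PostgreSQL parameter:

```python
# Free-space threshold check: only pages with >= pctfree percent of free
# room are offered to inserters, so almost-full pages are skipped.
BLCKSZ = 8192                         # PostgreSQL's default block size

def candidate_pages(pages, pctfree=25):
    """pages: dict page_no -> used bytes; return page numbers with enough
    free space, in page order."""
    return [no for no, used in sorted(pages.items())
            if (BLCKSZ - used) * 100 >= pctfree * BLCKSZ]

# Page 0 is nearly full, page 1 is half empty, page 2 has 24.3% free.
pages = {0: 8000, 1: 4000, 2: 6200}
```

With the 25% threshold only page 1 qualifies; page 2 just misses the cut, which shows why the threshold keeps marginal pages out of the FSM instead of handing inserters space that a BLKSZ/4 tuple could not use.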
RE: AW: [HACKERS] Plans for solving the VACUUM problem
Correct me if I am wrong, but both cases do present a problem currently in 7.1. The WAL log will not remove any WAL files for transactions that are still open (even after a checkpoint occurs). Thus if you do a bulk insert of gigabyte size you will require a gigabyte sized WAL directory. Also if you have a simple OLTP transaction that the user started and walked away from for his one week vacation, then no WAL log files can be deleted until that user returns from his vacation and ends his transaction. Todo: 1. Compact log files after checkpoint (save records of uncommitted transactions and remove/archive others). 2. Abort long running transactions. Vadim
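Todo item 1 above amounts to a filter pass over a finished log segment: keep only the records of still-open transactions, discard (or archive) the rest. A toy sketch, with an invented record shape of `{"xid", "op"}` dicts standing in for real WAL records:

```python
# Log compaction after checkpoint: retain only records whose transaction
# is still uncommitted, so the segment shrinks to what rollback may need.

def compact_segment(records, open_xids):
    """Keep records belonging to still-open transactions, preserving order."""
    return [r for r in records if r["xid"] in open_xids]

segment = [
    {"xid": 10, "op": "ins"},
    {"xid": 11, "op": "upd"},   # xid 11 already committed: removable
    {"xid": 10, "op": "del"},
]
```

If xid 10 is the vacationing user's open transaction, the compacted segment holds just its two records instead of the whole gigabyte-scale directory the message complains about.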
RE: [HACKERS] Plans for solving the VACUUM problem
I have been thinking about the problem of VACUUM and how we might fix it for 7.2. Vadim has suggested that we should attack this by implementing an overwriting storage manager and transaction UNDO, but I'm not totally comfortable with that approach: it seems to me that it's an awfully large change in the way Postgres works. I'm not sure if we should implement an overwriting smgr at all. I was/am going to solve the space reusing problem with the non-overwriting one, though I'm sure that we have to reimplement it (> 1 table per data file, FSM stored on disk, etc). Second: if VACUUM can run in the background, then there's no reason not to run it fairly frequently. In fact, it could become an automatically scheduled activity like CHECKPOINT is now, or perhaps even a continuously running daemon (which was the original conception of it at Berkeley, BTW). And the original authors concluded that the daemon was very slow in reclaiming dead space, BTW. 3. Lazy VACUUM processes a table in five stages: A. Scan relation looking for dead tuples;... B. Remove index entries for the dead tuples... C. Physically delete dead tuples and compact free space... D. Truncate any completely-empty pages at relation's end. E. Create/update FSM entry for the table. ... If a tuple is dead, we care not whether its index entries are still around or not; so there's no risk to logical consistency. What does this sentence mean? We canNOT remove a dead heap tuple until we know that there are no index tuples referencing it, and your A, B, C reflect this, so ..? Another place where lazy VACUUM may be unable to do its job completely is in compaction of space on individual disk pages. It can physically move tuples to perform compaction only if there are not currently any other backends with pointers into that page (which can be tested by looking to see if the buffer reference count is one). Again, we punt and leave the space to be compacted next time if we can't do it right away. 
We could keep the share buffer lock (or add some other kind of lock) until the tuple is projected - after projection we need not read data for the fetched tuple from the shared buffer, and the time between fetching the tuple and projection is very short, so keeping a lock on the buffer will not impact concurrency significantly. Or we could register a callback cleanup function with the buffer so bufmgr would call it when refcnt drops to 0. Presently, VACUUM deletes index tuples by doing a standard index scan and checking each returned index tuple to see if it points at any of the tuples to be deleted. If so, the index AM is called back to delete the tested index tuple. This is horribly inefficient: ... This is mainly a problem of a poorly chosen API. The index AMs should offer a bulk delete call, which is passed a sorted array of main-table TIDs. The loop over the index tuples should happen internally to the index AM. I agree with others who think that the main problem of index cleanup is reading all index data pages to remove some index tuples. You yourself mentioned partial heap scanning - so for each scanned part of the table you'll have to read all index pages again and again - a very good way to trash the buffer pool with big indices. 
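The bulk-delete API proposed above replaces one index scan per dead tuple with a single pass over the index, probing a sorted array of dead heap TIDs by binary search. A sketch of that shape (function and tuple layout are invented for illustration; the real AM call works on index pages, not Python lists):

```python
# Bulk index delete: one pass over the index tuples, each probed against a
# sorted array of dead heap TIDs with binary search.
import bisect

def index_bulk_delete(index_tuples, dead_tids):
    """index_tuples: list of (key, heap_tid); dead_tids: sorted list of
    (block, offset) TIDs. Returns the surviving index tuples."""
    survivors = []
    for key, tid in index_tuples:
        i = bisect.bisect_left(dead_tids, tid)
        if i < len(dead_tids) and dead_tids[i] == tid:
            continue                  # points at a dead heap tuple: drop it
        survivors.append((key, tid))
    return survivors

idx = [("a", (1, 1)), ("b", (1, 2)), ("c", (2, 5))]
dead = sorted([(1, 2)])               # caller supplies the TIDs pre-sorted
```

This is the CPU win Tom claims; Vadim's counterpoint stands unchanged, since the sketch still touches every index tuple, i.e. every index page, per pass.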
Well, probably it's ok for a first implementation and you'll win some CPU with bulk delete - I'm not sure how much, though, and there is a more significant issue with index cleanup if the table is not locked exclusively: a concurrent index scan returns a tuple (and unlocks the index page), heap_fetch reads the table row and finds that it's dead, now the index scan *must* find the current index tuple to continue, but background vacuum could already have removed that index tuple = elog(FATAL, "_bt_restscan: my bits moved..."); Two ways: hold the index page lock until the heap tuple is checked, or (rough schema) store info in shmem (just IndexTupleData.t_tid and a flag) that an index tuple is used by some scan, so the cleaner could change the stored TID (get one from the prev index tuple) and set the flag to help the scan restore its current position on return. I'm particularly interested in discussing this issue because it must be resolved for UNDO, and the chosen way will affect to what extent we'll be able to implement dirty reads (the first way doesn't allow implementing them in full - ie selects with joins - but is good enough to resolve the RI constraints concurrency issue). There you have it. If people like this, I'm prepared to commit to making it happen for 7.2. Comments, objections, better ideas? Well, my current TODO looks like (ORDER BY PRIORITY DESC): 1. UNDO; 2. New SMGR; 3. Space reusing. and at this point I cannot commit to anything about 3. So, why not refine vacuum if you want it. I, personally, was never able to convince myself to spend time on this. Vadim
RE: [HACKERS] Re: [SQL] possible row locking bug in 7.0.3 & 7.1
I see postgres 7.1.1 is out now. Was the fix for this problem included in the new release? I fear it will be in 7.2 only. On Thursday 29 March 2001 20:02, Philip Warner wrote: At 19:14 29/03/01 -0800, Mikheev, Vadim wrote: The reported problem is caused by a bug (only one tuple version must be returned by SELECT) and this is the way to fix it. I assume this is not possible in 7.1? Just looked in heapam.c - I can fix it in two hours. The question is - should we do this now? Comments? It's a bug; how confident are you of the fix?
RE: [HACKERS] Plans for solving the VACUUM problem
Vadim, can you remind me what UNDO is used for? Ok, last reminder -:)) On transaction abort, read WAL records and undo (roll back) changes made in storage. Would allow: 1. Reclaim space allocated by aborted transactions. 2. Implement SAVEPOINTs. Just to remind -:) - in the event of an error discovered by the server - duplicate key, deadlock, command mistyping, etc - the transaction will be rolled back to the nearest implicit savepoint set just before the query execution; - or the transaction can be rolled back by a ROLLBACK TO savepoint_name command to some explicit savepoint set by the user. A transaction rolled back to a savepoint may be continued. 3. Reuse transaction IDs on postmaster restart. 4. Split pg_log into small files with the ability to remove old ones (which do not hold statuses for any running transactions). Vadim
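Point 2 above, savepoints on top of UNDO, has a pleasantly small core: a savepoint is just a remembered position in the undo log, and rolling back to it pops before-images newest-first until the log is back at that position. A toy sketch (the `Transaction` class and dict-backed "heap" are illustrative, not PostgreSQL structures):

```python
# Savepoints via an undo log: a savepoint is a log position; rollback-to
# undoes only the records written after that position.

class Transaction:
    def __init__(self, rows):
        self.rows = rows               # key -> value ("heap")
        self.undo = []                 # (key, before-image) records

    def update(self, key, val):
        self.undo.append((key, self.rows.get(key)))  # log before-image
        self.rows[key] = val

    def savepoint(self):
        return len(self.undo)          # current position in the undo log

    def rollback_to(self, sp):
        while len(self.undo) > sp:     # undo newest-first back to savepoint
            key, old = self.undo.pop()
            if old is None:
                del self.rows[key]     # row didn't exist before: remove it
            else:
                self.rows[key] = old

tx = Transaction({"x": 1})
tx.update("x", 2)
sp = tx.savepoint()
tx.update("x", 3)
tx.update("y", 9)
tx.rollback_to(sp)    # x keeps its pre-savepoint update; y disappears
```

`rollback_to(0)` is a full transaction abort, which is why points 1 and 2 in the reminder come from the same mechanism; the implicit per-statement savepoint is just a `savepoint()` taken before each query.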
RE: [HACKERS] 7.1 startup recovery failure
There's a report of startup recovery failure in Japan. DEBUG: redo done at (1, 3923880100) FATAL 2: XLogFlush: request is not satisfied postmaster: Startup proc 4228 exited with status 512 - abort Is this person using the 7.1 release, or a beta/RC version? That looks just like the last WAL bug Vadim fixed before final ... No, it doesn't. That bug was related to cases when there is no room on the last log page for the startup checkpoint. ~5k is free in this case. Vadim
RE: [HACKERS] Re: 7.1 vacuum
What's the deal with vacuum lazy in 7.1? I was looking forward to it. It was never clear whether or not you guys decided to put it in. If it is in as a feature, how does one use it? If it is a patch, how does one get it? If it is neither a patch nor an existing feature, has development stopped? I still have had no time to port it to 7.1 -:( I'll post a message to -hackers when it's ready. Vadim
RE: [HACKERS] WAL feature
WAL was a difficult feature to add to 7.1. Currently, it is only used as a performance benefit, but I expect it will be used in the future to Not only. Did you forget about btree stability? Partial disk writes? add new features like: Advanced Replication I'm for sure not a fan of this. Point-in-time recovery Row reuse without vacuum Yes, it will help to remove uncommitted rows. And don't forget about SAVEPOINTs. Vadim
RE: [HACKERS] WAL feature
Yep, WAL collects all database changes into one file. Easy to see how some other host trying to replicate a different host would find the WAL contents valuable. Unfortunately, slave database(s) would have to be on the same platform (hardware+OS) to be able to use information about *physical* changes in data files. Also, this would still be *async* replication. Maybe faster than rserv, maybe with less space requirements (no rserv log table), but maybe not. I believe that making efforts to implement (bi-directional) *sync* replication would be more valuable. Vadim
RE: [HACKERS] v7.1.1 branched and released on Tuesday ...
As Tom mentioned the other day, we're looking at doing up v7.1.1 on Tuesday, and starting in on v7.2 ... Does anyone have any outstanding fixes for v7.1.x that they want to see in *before* we do this release? Any points unresolved that anyone knows about that we need to look at? Hiroshi reported a startup problem yesterday - we should fix this for 7.1.1... Vadim
RE: AW: [HACKERS] timeout on lock feature
This is the real reason why I've been holding out for restricting the feature to a specific LOCK TABLE statement: if it's designed that way, at least you know which lock you are applying the timeout to, and have some chance of being able to estimate an appropriate timeout. As I pointed out before - it's half useless. And I totally do not understand why one would object to a feature that 1. affects users *only when explicitly requested*; 2. whose implementation costs nothing - ie has no drawbacks for the overall system. It was general practice in the project so far: if a user wants some feature and it doesn't affect others - let's do it. What's changed? Vadim
RE: AW: [HACKERS] timeout on lock feature
One idea Tom had was to make it only active in a transaction, so you do: BEGIN WORK; SET TIMEOUT TO 10; UPDATE tab SET col = 3; COMMIT; Tom is concerned people will do the SET and forget to RESET it, causing all queries to be affected by the timeout. And what? Queries would just be aborted. It's not such a critical event that we must take responsibility away from the user. Vadim
RE: AW: [HACKERS] timeout on lock feature
Added to TODO: * Add SET parameter to timeout if waiting for lock too long I repeat my strong objection to any global (ie, affecting all locks) timeout. Such a "feature" will have unpleasant consequences. But LOCK TABLE T IN ROW EXCLUSIVE MODE WITH TIMEOUT X will not give the required results, not only due to parser/planner locks - what if the UPDATE of T has to wait for another transaction's commit/abort (same-row update)? A lock on a pseudo-table is acquired in this case... Vadim
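The behavior debated in this sub-thread - fail with an error instead of waiting forever on a lock - is a bounded acquire. A minimal sketch of the client-visible semantics using Python's stock `threading.Lock` (illustrative only; PostgreSQL's lock manager and any eventual SET TIMEOUT syntax work nothing like this internally):

```python
# Lock wait with a timeout: give up with an error rather than block forever.
import threading

def acquire_with_timeout(lock, timeout_s):
    # threading.Lock.acquire(timeout=...) returns False when the wait expires.
    if not lock.acquire(timeout=timeout_s):
        raise TimeoutError("could not obtain lock within timeout")

holder = threading.Lock()
holder.acquire()                 # some other "backend" already holds the lock
try:
    acquire_with_timeout(holder, 0.01)
    timed_out = False
except TimeoutError:
    timed_out = True             # the waiter aborts instead of queueing forever
```

Vadim's objection in the message above is visible even in this toy: the timeout only covers the one lock you name, while a blocked UPDATE may actually be waiting on a different lock (the row-level pseudo-table lock), which a per-statement LOCK TABLE ... WITH TIMEOUT never touches.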
RE: AW: AW: [HACKERS] timeout on lock feature
The timeout will be useful to let the client or user decide on an alternate course of action other than killing his application (without the need for timers or threads in the client program). This assumes (without evidence) that the client has a good idea of what the timeout limit ought to be. I think this "feature" has no real use other than encouraging application programmers to shoot themselves in the foot. I see no reason that we should make it easy to misdesign applications. AFAIR, the Big Boys have this feature. If its implementation is safe, ie will not affect applications not using it, why not implement it? Vadim
RE: [HACKERS] Re: TODO list
So, what could CRC be used for? To catch disk damage? The disk has its own CRC for this. OK, this was already discussed, maybe while Vadim was absent. Should I re-post the previous text? Let's return to this discussion *after* the 7.1 release. My main objection was (and is) - no time to deal with this issue for 7.1. Vadim
RE: [HACKERS] Re: TODO list
Blocks that have recently been written, but failed to make it down to the disk platter intact, should be restorable from the WAL log. So we do not need a block-level CRC to guard against partial writes. If a block is missing some sectors in the middle, how would you know to reconstruct it from the WAL, without a block CRC telling you that the block is corrupt? On recovery we unconditionally copy the *entire* block content from the log for each block modified since the last checkpoint. And we do not write a new checkpoint record (ie do not advance the recovery start point) until we know that all data blocks are flushed to disk (including blocks modified before the checkpointer started). Vadim
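Vadim's answer above is that torn pages need no CRC because redo is unconditional: every block touched since the last checkpoint has a full page image in the log, and recovery simply overwrites the on-disk block wholesale, garbage sectors and all. A toy sketch (dict-backed "datafile" and record shape are invented for illustration):

```python
# Full-page-image redo: recovery replaces each logged block outright, so it
# never needs to detect whether the on-disk copy is torn.

def redo(datafile, full_page_records):
    """Unconditionally install each logged block's full image."""
    for rec in full_page_records:
        datafile[rec["blkno"]] = rec["page_image"]
    return datafile

# Block 1 was torn by a crash mid-write: part new data, part old garbage.
datafile = {0: "good-page", 1: "TORN....garbage"}
wal = [{"blkno": 1, "page_image": "new-page-image"}]
```

The second half of the argument is the checkpoint rule: the recovery start point advances only after all earlier dirty blocks are known flushed, so every block that could possibly be torn is guaranteed to still have its image in the replayed portion of the log.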