Re: [HACKERS] win32 tablespace handling
> hardlinks and junctions don't work across physical disks, only symlinks.

Where did you read this? I just looked and can see no such restriction. There is no such restriction for junctions; I just tried it to be safe.

Andreas

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] AIX and V8 beta 3
> > Have you tried using cc_r for that compile line? Does that help?
>
> Alas, that option is not available. cc_r is specific to the AIX xlc compiler; we're using GCC, and xlc is not available to us.

What is missing is a -lpthread.

Andreas
Re: [HACKERS] psql questions: SQL, progname, copyright dates
> I think adding 'S' to \df confuses more than it helps.

Why is that? Imho it would be consistent.

Andreas
Re: [HACKERS] APR 1.0 released
> > Well, it's easily changed, if all that's needed is a search-and-replace. Suggestions for a better name?
>
> MINGW32

I think that is a bad idea. That symbol sure suggests that you are using MinGW. Are you expecting someone who creates a Visual Studio project to define MINGW32?

Andreas
Re: [HACKERS] APR 1.0 released
> > > Personally I don't think that any rename()-usleep loop is necessary. I'll check the archives.
> >
> > I agree the rename loop seems unnecessary. I kept it in case we hadn't dealt with all the failure places. Should we remove them now or wait for 8.1?
>
> Seems we should keep them in and see if we get reports from users of looping forever, and if not we can remove them in 8.1.

What I do not understand is that Windows has rename() and _unlink(). Are we using those or not? Looping forever is certainly not good, but I thought the current code had a limited loop. I think a limited loop is required, since neither rename() nor _unlink() can cope with a locked file.

Andreas
Re: [HACKERS] PITR: XLog File compression on Archive
> PS: but something you *could* do in 8.0 is replace cp by gzip to archive compressed files that way.

How about replacing the page-image records with same-size dummy records that contain only a dummy header and all zeros? If the archiver cares about space he will use some sort of compression anyway.

Andreas
Re: [HACKERS] 8.0 beta 1 on linux-mipsel R5900
> > Think harder... one processor != one process...
>
> Well sure, but you don't want a spinlock in that case.
>
> Actually you do, when the normal case is that you don't have to block. You want it to fall through as quickly as possible in the success case (the blocking case is going to suck no matter what). Given the present set of available/portable technologies, spinlocks win.

I guess it could still save some CPU cycles in the single-CPU case if you yield() instead of looping tightly around TAS in the failure case. The problem is that yield() and detecting a single CPU are not portable.

Andreas
Re: [HACKERS] PITR: XLog File compression on Archive
> Re-thinking the whole purpose of the additional full page images appended to the xlog records, I now understand and agree with Tom's comment in the docs that we don't need to include those additional full page images for PITR - they only exist to correctly recover the database in the event of a crash. The reason for this is that the base backup provides a full set of blocks on which to recover - there is no danger that the backup contains bad blocks, as would be the case for crash recovery.

It is correct that the base backup cannot contain bad blocks (on OS's we know). But it can contain newer blocks than the WAL records would expect. Will it not matter if, e.g., a page split for a btree index is already in the index file, but a WAL record exists that references the not-yet-split page? I think not, but I am not sure, since normal crash recovery won't see this situation because of the page images (which can be older than what is on disk). Also, is there not now code that logs index recreation by only logging page images? We would still need those, no?

Andreas
Re: [HACKERS] Point in Time Recovery
> > I was wondering about this point - might it not be just as reasonable for the copied file to *be* an exact image of pg_control? Then a very simple variant of pg_controldata (or maybe even just adding switches to pg_controldata itself) would enable the relevant info to be extracted.
>
> We didn't do that so admins could easily read the file contents.

If you use a readable file you will also need a feature in restore (or a tool) to create an appropriate pg_control file - or are you intending to still require that pg_control be the first file backed up? Another possibility would be that the start function writes the readable file and also copies pg_control.

Andreas
Re: [HACKERS] Sketch of extending error handling for subtransactions
> I was just looking around the net to see exactly what Oracle's PL/SQL syntax is. It doesn't seem too unreasonable syntax-wise:
>
> BEGIN
>     ... controlled statements ...
> EXCEPTION
>     WHEN exception_name THEN
>         ... error handling statements ...
>     WHEN exception_name THEN
>         ... error handling statements ...
>     ...
>     WHEN OTHERS THEN
>         ... error handling statements ...
> END;
>
> There's nothing here we couldn't do. However, it seems that Oracle thinks you should throw in explicit SAVEPOINT and ROLLBACK statements on top of this! That's just weird. It might be that we should deliberately *not* adopt the exact syntax they are using, just so we don't create compatibility gotchas.

That is because they usually use this to handle the exception of only one potentially failing statement, which does not roll back any previous statements (except in pg). Thus in Oracle you need savepoints in a lot fewer cases. It is only in those seldom cases that you add savepoints on top of blocks with exceptions in Oracle.

But yes, I think doing implicit savepoints for plpgsql blocks that contain an EXCEPTION ... handler is absolutely the way to proceed. For maximum portability that savepoint should probably travel along with every successful statement after the BEGIN (that, of course, is debatable):

BEGIN              -- implicit savepoint x
    update 1;      -- succeeds; implicit release of old x, new savepoint x
    update 2;      -- fails; control transfers to position y
    update 3;      -- succeeds
EXCEPTION          -- rollback to savepoint x
    WHEN ...       -- position y
END;               -- implicit RELEASE of savepoint x

Doing this only for blocks with an exception handler would not impose any overhead for conventional plpgsql funcs.

Andreas

PS: can someone please help me get through the list's spam blocker, get Marc to contact me, or I don't know what else I can do
Re: [HACKERS] Fixing PKs and Uniques in tablespaces
> > Also, since I checked and it seems that our syntax for putting tables and indexes in tablespaces at creation time is identical to Oracle's, perhaps we should copy them on constraints as well.
>
> Since we're getting close to beta, can we have consensus on what I'm to do about this?

The Oracle 10g documentation has: USING INDEX TABLESPACE blabla - none of the words are optional.

Andreas
Re: [HACKERS] localhost redux
> > I don't know if the problem is isolated to just me, but I wanted to suggest that we use a parameter for that, which can be configured in postgresql.conf, with a default value of localhost if it's not set.
>
> I think you should first trace down what the problem really is --- is your system just misconfigured or is there some fundamental issue that we really ought to answer to?

The trouble on AIX is that getaddrinfo only does a nameservice lookup, not /etc/hosts. So if your nameservice does not have localhost ... Same for `hostname`, if the nameservice does not resolve `hostname` ...

Andreas
Re: [HACKERS] Point in Time Recovery
> Hang on, are you supposed to MOVE or COPY away WAL segments?

Copy. pg will recycle them once they are archived.

Andreas
Re: [HACKERS] localhost redux
> If localhost isn't being resolved correctly are you seeing error messages like this on the server log (from pqcomm.c)? could not translate host name "%s", service "%s" to address: %s

After poking around, I found the following: the Windows PDC (== name server :-( ) really does not resolve localhost (this might be common for Windows name servers). I do not have a services entry for 5432, and don't have a DNS record for this RS/6000.

LOG: could not translate service "5432" to address: Host not found
WARNING: could not create listen socket for *
LOG: could not bind socket for statistics collector: Can't assign requested address

Setting to a port that is listed in /etc/services does not change anything (except the port, of course). I do not have these problems on a machine where DNS lists both the machine and localhost. This machine has a long-obsolete oslevel of 4.3.2, so I am not sure it is worth pursuing the issue. I only wanted to state that it does not work here under certain circumstances either.

Andreas
Re: [HACKERS] Point in Time Recovery
> > I'm aiming for the minimum feature set - which means we do need to take care over whether that set is insufficient and also to pull any part that doesn't stand up to close scrutiny over the next few days. As you can see, we are still chewing on NT.
>
> What PITR features are missing? I assume because we can't stop file system writes during backup that we will need a backup parameter file like I described. Is there anything else that PITR needs?

No, we don't need to stop writes! Not even to split a mirror; other DBs need that to be able to restore, but we don't. We only need to tell people to back up pg_control first. The rest was only intended to enforce:

1. that pg_control is the first file backed up
2. that the DBA uses a large enough PIT (or xid) for restore

I think the idea of an extra file with the WAL start position was overly complicated, since all you need is pg_control (plus the WAL end position to enforce 2.). If we don't want to tell people to back up pg_control first, imho the next best plan would be to add a WAL start input parameter (e.g. an xlog name) to recovery.conf that fixes pg_control.

Andreas
Re: [HACKERS] Point in Time Recovery
> then on restore once all the files are restored move the pg_control.backup to its original name. That gives us the checkpoint wal/offset but how do we get the start/stop information. Is that not required?

The checkpoint wal/offset in pg_control is sufficient start information. The stop info is only necessary as a safeguard.

> Do we need a checkpoint after the archiving starts but before the backup begins?

No.

> Also, when you are in recovery mode, how do you get out of recovery mode, meaning if you have a power failure, how do you prevent the system from doing another recovery? Do you remove the recovery.conf file?

pg_control could be updated during rollforward (only if that actually does a checkpoint). So if pg_control is also the recovery start info, then we can continue from there if we have a power failure. For the first release it would imho also be OK to simply start over if you lose power.

I think the filename 'recovery.conf' is misleading, since it is not a static configuration file but a command file for one recovery. How about 'recovery.command', then 'recovery.inprogress', and on recovery completion it should be renamed to 'recovery.done'?

Andreas
Re: [HACKERS] Point in Time Recovery
> > > Do we need a checkpoint after the archiving starts but before the backup begins?
> >
> > No.
>
> Actually yes.

Sorry, I incorrectly did not connect 'archiving' with the backed-up xlogs :-( So yes, you need one checkpoint after archiving starts. Imho turning on xlog archiving should issue such a checkpoint just to be sure.

Andreas
Re: [HACKERS] Point in Time Recovery
> Other db's have commands for: start/end external backup

I see that the analogy to external backup was not good, since you are correct that DBAs would expect that to stop all writes, so they can safely split their mirror or some such. Usually the time from start until end of an external backup is expected to be only seconds.

I actually think we do not need this functionality, since in pg you can safely split the mirror any time you like. My comment was meant to give DBAs a familiar tool. The effect of it would only have been to create a separate backup of pg_control. We might as well tell people to always back up pg_control first.

I originally thought you would require restore to specify an xlog id from which recovery will start. You would search this log for the first checkpoint record, create an appropriate pg_control, and start rollforward. I still think this would be a nice feature, since then all that would be required for restore is a system backup (that includes the pg data files) and the xlogs.

Andreas
Re: [HACKERS] Point in Time Recovery
Sorry for the stupid question, but how do I get this patch if I do not receive the patches mails? The web interface HTML-ifies it, thus making it unusable.

Thanks
Andreas
Re: [HACKERS] Point in Time Recovery
> The recovery mechanism doesn't rely upon you knowing 1 or 3. The recovery reads pg_control (from the backup) and then attempts to de-archive the appropriate xlog segment file and then starts rollforward.

Unfortunately this only works if pg_control was the first file to be backed up (or, by chance, no checkpoint happened between backup start and the pg_control backup).

Other DBs have commands for: start/end external backup. Maybe we should add those two commands, which would initially only do the following:

start external backup:
- (checkpoint as an option)
- make a copy of pg_control

end external backup:
- record the WAL position (helps choose an allowed minimum PIT)

Those commands would not actually be obligatory, but recommended, and would only help with the restore process. Restore would then either take the existing pg_control backup, or ask the DBA where rollforward should start and create a corresponding pg_control. A helper utility could list possible checkpoints in a given xlog.

Andreas
Re: [HACKERS] Portals and nested transactions
My answers:

> Q1: Should Portals successfully created within the failed subxact be closed? Or should they remain open?

No for the protocol level. I can understand a yes to this one for the SQL level, because it will be hard to clean up by hand :-( But I like the analogy to hold cursors, so I would also say no for the SQL level. Is the pro-yes argument ACID allowed here? I thought ACID is about data integrity and not flow control, and also deals with main transactions and not subtransactions.

> Q2: If the subxact changed the state of a pre-existing Portal, should that state change roll back? In particular, can a Close Portal operation roll back?

NO for both SQL and protocol level. The analogy is imho that closing a 'hold cursor' is also never rolled back.

> How to do it non-transactionally

Sounds like a good plan, but also sounds like a lot of work.

Andreas
Re: [HACKERS] Point in Time Recovery
> The starting a new timeline thought works for xlogs, but not for clogs. No matter how far you go into the future, there is a small (yet vanishing) possibility that there is a yet undiscovered committed transaction in the future. (Because transactions are ordered in the clog because xids are assigned sequentially at txn start, but not ordered in the xlog where they are recorded in the order the txns complete.)

Won't ExtendCLOG take care of this with ZeroCLOGPage? Otherwise the same problem would arise at xid wraparound, no?

Andreas
Re: [HACKERS] Nested Transactions, Abort All
> My proposal would be:
> 1. Begin main transaction: BEGIN { TRANSACTION | WORK }
> 2. Commit main (all) transaction: COMMIT { TRANSACTION | WORK }
> 3. Rollback main (all) transaction: ROLLBACK { TRANSACTION }
> 4. Begin inner transaction: BEGIN NESTED { TRANSACTION | WORK }
> 5. Commit inner transaction: COMMIT NESTED { TRANSACTION | WORK }
> 6. Rollback inner transaction: ROLLBACK NESTED { TRANSACTION }
> 1, 2 and 3 are not negotiable. 4, 5 and 6 are.

Hmm, 1-3 are at least negotiable for the abbreviated forms 'BEGIN' and 'END'. I think we could differentiate those. The standard only has 'BEGIN TRANSACTION', 'COMMIT [WORK]' and 'ROLLBACK [WORK]'. I agree that we are not allowed to change the semantics of those non-abbreviated forms. How about:

1. Begin main tx: BEGIN WORK | BEGIN TRANSACTION
2. Commit main (all) transaction: COMMIT [ TRANSACTION | WORK ]
3. Rollback main (all) transaction: ROLLBACK [ TRANSACTION | WORK ]
4. BEGIN: starts either a main or a subtransaction (for plpgsql: BEGIN SUB)
5. END: commits a nested tx; maybe also aborts a nested tx that is already in abort state (for plpgsql: END SUB)
6. ROLLBACK SUB[TRANSACTION]: rolls a subtx back

Andreas
Re: [HACKERS] Nested Transactions, Abort All
> As far as implementing only savepoints, look at this:
> BEGIN; BEGIN; INSERT INTO ...; COMMIT; BEGIN; INSERT INTO ...; COMMIT; BEGIN; INSERT INTO ...; COMMIT;
> With savepoints, it looks pretty strange:
> BEGIN; SAVEPOINT x1; INSERT INTO ...; SAVEPOINT x2;

If you meant the same as your nested example, it would be:

BEGIN TRANSACTION;
SAVEPOINT x;
INSERT INTO ...;
SAVEPOINT x;  -- this implicitly commits the previous subtxn x
INSERT INTO ...;
SAVEPOINT x;
INSERT INTO ...;
COMMIT;

Andreas
Re: [HACKERS] Nested Transactions, Abort All
> I'd opt for BEGIN as a start of a subtransaction (no need for special semantics in plpgsql), the corresponding END simply changes the transaction context to the parent level.

But 'BEGIN' in plpgsql does not start a [sub]transaction, it starts a statement block. Are we intending to change that? I think not.

Andreas
Re: [HACKERS] Nested Transactions, Abort All
> > But 'BEGIN' in plpgsql does not start a [sub]transaction, it starts a statement block. Are we intending to change that? I think not.
>
> There are two possibilities: Either BEGIN *does* start a subtransaction, or BEGIN does not. I don't see how two nesting level hierarchies in a function should be handleable, i.e. having independent levels of statement blocks and subtransactions. BEGIN [whatever] suggests that there's also a statement closing that block of [whatever], but it's very legal for subtransactions to have no explicit end; the top level COMMIT does it all.

An 'END SUB' after a 'BEGIN SUB' in plpgsql could be required, and could mean start/end of both the block and the subtx. I do not really see a downside. But it would imho only make sense if the 'END SUB' would commit the sub, or abort it iff the subtx is in aborted state (see my previous posting).

Andreas
Re: [HACKERS] Point in Time Recovery
> Well, Tom does seem to have something with regard to StartUpIds. I feel it is easier to force a new timeline by adding a very large number to the LogId IF, and only if, we have performed an archive recovery. That way, we do not change at all the behaviour of the system for people that choose not to implement archive_mode.

Imho you should take a close look at StartUpId; I think it is exactly this large number. Maybe you can add +2 to intentionally leave a hole. Once you increment, I think it is very essential to checkpoint and double-check pg_control, because otherwise a crash recovery would read the wrong xlogs.

> Should we implement timelines?

Yes :-)

Andreas
Re: [HACKERS] Point in Time Recovery
> - by time - but the time stamp on each xlog record only specifies to the second, which could easily be 10 or more commits (we hope)
>
> Should we use a different datatype than time_t for the commit timestamp, one that offers more fine-grained differentiation between checkpoints?

Imho seconds really is sufficient. If you know a more precise position you will probably know it from the backend log or an xlog sniffer. With those you can easily use the TransactionId way.

> - when we stop, keep reading records until EOF, just don't apply them. When we write a checkpoint at end of recovery, the unapplied transactions are buried alive, never to return.
> - stop where we stop, then force zeros to EOF, so that no possible record remains of previous transactions.
>
> I'm tempted by the first plan, because it is more straightforward and stands much less chance of me introducing 50 weird bugs just before close.

But what if you restore because after that PIT everything went haywire, including the log? Did you mean to apply the remaining changes but not commit those xids? I think it is essential to only leave xlogs around that allow a subsequent rollforward from the same old full backup. Or is an instant new full backup required after restore?

Andreas
Re: subtransactions and FETCH behaviour (was Re: [HACKERS] PREPARE and transactions)
> Well, the proposal of implementing it like holdable cursors means using a Materialize node which, if I understand correctly, means taking the whole result set and storing it in memory (or on disk).

Would it help to hold the lock for the record at the current cursor position, iff this record was updated (and subsequently rolled back) by this subtxn, and release that lock as soon as you fetch the next one?

Andreas
Re: [HACKERS] nested-xacts cursors (was Re: Performance with new nested-xacts code)
> begin;
> declare cursor c ...;
> fetch 1 from c;  -- returns tuple 1
> begin;
> fetch 1 from c;  -- returns tuple 2
> rollback;
> fetch 1 from c;  -- returns tuple 1 again
>
> This is mighty ugly but I think it's the most usable of the options seen so far.

Imho most usable would be to handle the cursor like a hold cursor:

begin;
declare cursor c ...;
fetch 1 from c;  -- returns tuple 1
begin;
fetch 1 from c;  -- returns tuple 2
rollback;
fetch 1 from c;  -- returns tuple 3

For me the reason is that most likely you are not going to roll back because the fetch did not work or returned something you don't like. Most likely some consequent action did not work out, and the next step will be to correct (or ignore) the problem. You can do that without an extra fetch, because you still have the values in host variables. Resetting to tuple 1 imho opens the door for endless loops.

Andreas
Re: [HACKERS] signal 11 on AIX: 7.4.2
> My only guess is that getaddrinfo in your libc has a bug somehow that is corrupting the stack (hence the improper backtrace), then crashing. It could be libc on AIX, I suppose, but it strikes me as sort of odd that nobody else ever sees this. Unless nobody else is using AIX 5.1, which is of course possible.

I can confirm that AIX 4.3.2's getaddrinfo is at least a bit *funny*. It seems to not honour nsorder and only does DNS, even though the manual says: "Should there be any discrepancies between this description and the POSIX description, the POSIX description takes precedence." The function does return multiple entries; often the first is not the best. The log is:

LOG: could not translate service 5432 to address: Host not found
WARNING: could not create listen socket for *
LOG: could not bind socket for statistics collector: Can't assign requested address
LOG: disabling statistics collector for lack of working socket

This area probably needs a fix/workaround on AIX :-(

Andreas
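A quick standalone diagnostic for the behaviour described above might look like this — a sketch, assuming the standard POSIX getaddrinfo interface; on a correctly configured system "localhost" should resolve via /etc/hosts even with no DNS record, while the AIX behaviour described would make it fail:

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    struct addrinfo hints, *res = NULL;

    /* Ask for a stream socket on the postgres port, like the backend does. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo("localhost", "5432", &hints, &res);
    if (rc == 0) {
        puts("localhost: resolved");
        freeaddrinfo(res);
    } else {
        /* gai_strerror() turns the EAI_* code into readable text. */
        printf("localhost: %s\n", gai_strerror(rc));
    }
    return 0;
}
```

Running this on the affected box would separate a libc resolver problem from a PostgreSQL one.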
Re: [HACKERS] Tablespaces
> > With the rule system and two underlying tables one could make it work by hand I think.
>
> The rule system could be used to do this, but there was some discussion of using inherited tables to handle it. However neither handles the really hard part of detecting queries that use only a part of the table and taking that into account in generating the plan.

I think the consensus should be to add smarts to the planner to include static constraint information to reduce table access. E.g. if you have a column (acol integer) with a constraint CHECK (acol < 5), and you have a query with WHERE acol = 10, you could reduce that to WHERE false. This would help in all sorts of situations, not only partitioned/inherited tables. I am not sure what the runtime cost of such an inclusion would be, so maybe it needs smarts to only try it in certain cases?

Andreas
Re: [HACKERS] Why hash indexes suck
> We could safely sort on the hash value, but I'm not sure how effective that would be, considering that we're talking about values that already hashed into the same bucket --- there's likely not to be very many distinct hash values there.

I think we can safely put that on the todo list. The existing hash algorithm is very good, so I would on the contrary believe that only a few keys share a hash value per page-sized bucket. For the equal-keys case it does not matter, since we want all of the rows anyway. For the equal-hash-value case it would probably be best to sort by ctid.

TODO?: order heap pointers inside hash index pages by hash value and ctid

Andreas
Re: [HACKERS] Fast index build vs. PITR
> I think an actually implementable version of this would be:
> 1. Don't log any index operations at all in WAL.
> 2. When recovering from WAL, restore all the table contents by WAL replay. (This would of course include the system catalog contents that describe the indexes.) Then sit there and do a global REINDEX to rebuild all the indexes.
> This would gain a reduction of some percentage in WAL traffic, at the cost of a hugely expensive recovery cycle any time you actually needed to use the WAL. I guess this could be attractive to some installations, but I'm not sure very many people would want it ...

I think only the global part of it is not really acceptable. If we had a flag for each index that marks it inconsistent, reindexing only those that are marked would be great. Could we log a WAL record that basically only marks an index for deferred reindex after WAL recovery? During WAL replay all records for this index could be ignored (this is not a must, because of the post-update page images in WAL; the index would still stay inconsistent until reindex, of course). I think such a reindex step could also be responsible for those non-btree indexes that don't fully support WAL (gist?).

Andreas
Re: [HACKERS] SELECT * FROM table LIMIT 1; is really slow
> What would be useful for that (as of 7.5) is a dummy ALTER TABLE TYPE, viz ALTER TABLE tab ALTER COLUMN col TYPE same-type-it-already-has

I think we should add special syntax for this purpose, since I would like to see (or someone else later on) all possible cases of ALTER COLUMN short-circuited. The syntax I would see fit is something like:

ALTER TABLE tab [MOVE] TABLESPACE xy;

For the above special case the tablespace would be the same as before.

Andreas
Re: [HACKERS] Timezone fun (bugs and a request)
> LOG: database system was shut down at 2004-05-25 15:15:44 GMT-12
>
> For comparison, 7.4.1 on the same system says:
>
> LOG: database system was shut down at 2004-05-25 16:03:43 NZST
>
> Can we keep the zic database convention unchanged but change the display format in the logs to be consistent with the SQL conventions?

I think what was meant is that if it displays GMT+-something it should convert the sign. Is that zic filename exposed anywhere else? It is common practice that + is East and - is West, no?

Andreas
Re: [HACKERS] Table Spaces
> > If you run NTFS, it's still possible to use arbitrary links. In the Windows world, they are called junctions. Microsoft does not provide a junction tool for some reason (perhaps because it's limited to NTFS). A good tool, free and with source, can be found here: http://www.sysinternals.com/ntw2k/source/misc.shtml#junction I use this tool myself. Works like a charm.
>
> We've looked at it before. Apart from anything else I don't think its license is compatible with PostgreSQL's.

The tool was suggested for use as-is, not for inclusion in the pg source. It can be used to e.g. symlink xlog manually. If you want something in the pg source, the relevant lines (about 20) can be copied from MSDN.

Andreas
Re: [HACKERS] Call for 7.5 feature completion
> It is too late to think about pushing back another month. We had this discussion already. June 1 is it.

I thought the outcome of that discussion was June 15? Can we try to do the 2PC patch now instead of waiting for subtransactions?

Andreas
Re: [HACKERS] relcache refcount
> BTW, what are your plans for state saving/reversion for the lock manager and buffer manager? The lock state, in particular, makes these other problems look trivial by comparison.

Why can't we keep all locks until main tx end? Locks are not self-conflicting, are they? So the only reason to free them would be to improve concurrency, and imho we don't need that. I guess I am just not seeing this correctly. (I am assuming that a deadlock will still break the whole tx.)

Andreas
Re: [HACKERS] threads stuff/UnixWare
I know, this sucks, but, I don't see any other way, other than linking *ALL* libpq-using programs (including initdb and friends) with -K pthread. How about making a libpq.so (without pthread) and a thread-safe libpq_r.so? Andreas
Re: [HACKERS] XLog: how to log?
I think this argument is largely a red herring ... but if it makes you feel better, we could change the contents of the commit timestamp to be gettimeofday() output (seconds+microseconds) instead of just time() output. That should be precise enough for practical purposes. I am saying that a timestamp as used for specifying a recovery location might not be unique enough, no? Maybe the API should allow an xid, but only in addition to a timestamp, for use when the timestamp alone is not precise enough. That would solve the problem of not finding the xid. I would stay away from microseconds. Andreas
Re: [HACKERS] ALTER TABLE TODO items
FireBird: ALTER COLUMN column TYPE type
DB2: ALTER COLUMN column SET DATA TYPE type
Oracle: MODIFY column type
MSSQL: ALTER COLUMN column type constraints
MySQL: both Oracle and MSSQL
SAP: MODIFY column type
Spec: nothing (obvious) on changing column types
MODIFY is horrible. It seems to drop all constraints, defaults, etc. that are not specified in the second definition. It is essentially a replacement of the column. In Oracle, MODIFY leaves omitted parts unchanged; the syntax is actually ALTER TABLE table MODIFY (column type default constraint). I think the parentheses are optional if only one column is modified. Andreas
Re: [HACKERS] ANALYZE locks pg_listener in EXCLUSIVE for long time?
* Is it really a good idea for database-wide ANALYZE to run as a single transaction? Holding all those locks is a recipe for deadlocks, even if they're as inoffensive as AccessShareLocks normally are. Wasn't one idea behind that change also to not make the planner create a plan from mixed old and new statistics? I guess that could later be accomplished with begin work; analyze; commit work; (with subtransactions) though. Andreas
Re: [HACKERS] PITR logging control program
Basically it is updating the logs as soon as it receives the notifications. Writing 16 MB of xlogs could take some time. In my experience with archiving logs, 16 MB is on the contrary way too small for a single log. The overhead of starting e.g. a tape session is so high (a few seconds) that you cannot keep up. Once the tape is streaming it is usually quite fast. So imho it is not really practical to have logs so small that they can fill in less than 20 seconds. Andreas
Re: [HACKERS] pre-loading a user table.
Why do you think it useful to preload something during InitPostgres, anyway? Any heavily used table will certainly be present in shared buffers already, and even more surely present in kernel buffers. And if you really want it preloaded you can issue dummy selects with a client right after startup. I really think locking a table into memory is a worthless feature if a good buffer manager is at work. Andreas
Re: [HACKERS] [pgsql-hackers-win32] fsync with sync, and Win32 unlink
Consider either a box with many different postgresql instances, or one that runs both postgresql and other software. Issuing sync() in that situation will cause sync of a lot of data that probably doesn't need syncing. But it'd probably be a very good thing on a dedicated server, giving the kernel the chance to optimise. It is not like the sync is done every few seconds! It is currently done every 5 minutes (I actually think this is too frequent now that we have bgwriter; 10-20 min would be sufficient). So imho even on a heavily otherwise used system the sync will be better. Andreas
Re: [HACKERS] PITR Functional Design v2 for 7.5
The only way we can support file-level hot backup is in conjunction with PITR-style WAL log archiving. It is okay for the data area dump to be inconsistent, so long as your recovery process includes replay of WAL starting at some checkpoint before the filesystem dump started, and extending to some point after the filesystem dump finished. Replaying WAL will correct the inconsistencies. And the last checkpoint info resides in pg_control, not in pg_clog, no? So basically a PITR restore would need to adjust the pg_control file after the filesystem restore and before starting recovery. Maybe it can take that info from the oldest available WAL? The OP would only need to ensure that only those logs that need to be rolled forward are visible (in the correct directory) to the recovery. Andreas
Re: [HACKERS] PITR Functional Design v2 for 7.5
To clarify: I'd expect a cluster to be workable, if I
- disable VACUUM until backup completed
- issue CHECKPOINT
- backup clog (CHECKPOINT and backup clog are the backup checkpoint)
- backup all datafiles (which include at least all completed transaction data at checkpoint time)
and then
- restore datafiles and clog
- bring up pgsql.
Why is that a useful approach? You might as well shut down the postmaster and do a cold filesystem backup, because you are depending on the data files (including clog) not to change after the checkpoint. You cannot make such an assumption in a running database. I think there is a misunderstanding here. What I think is possible is the following (continuous backup of WAL assumed):
- disable VACUUM
- issue CHECKPOINT C1
- backup all files
- reenable VACUUM
- restore files
- adapt pg_control (checkpoint C1)
- recover WAL until at least end of backup
The db is inconsistent until you have recovered all WAL (PITR) that accumulated during the file backup. I am not sure about clog; isn't clog logged in xlog? Andreas
Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
First of all, symlinks are a pretty popular feature. Even Windows supports what would be needed. Second of all, PostgreSQL will still run on OSes without symlinks; tablespaces won't be available, but PostgreSQL will still run. Since we are all using PostgreSQL without My idea for platforms that don't support symlinks would be to simply create a tblspaceoid directory in place instead of the symlink (maybe throw a warning). My feeling is that using the same syntax on such platforms is important, but the actual distribution is not (since they will most likely be small systems). Andreas
Re: [HACKERS] Thread safe connection-name mapping in ECPG. Is it
- Dispose names of connections and replace them with a pointer. You cannot dispose of the names; you can only add something to also allow pointers. The names are in the ESQL/C standard. Andreas
Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
For tablespaces on OSes that don't support symlinks, I think we will have to store the path name in the file and read it via the backend. Somehow we should cache those lookups. My feeling is that we need not support tablespaces on OSes without symlinks. To create symlinked directories on Win2k NTFS see: http://www.sysinternals.com/ntw2k/source/misc.shtml#junction I think Win2000 or XP would be a reasonable restriction for Win32 PG installations that want tablespaces. Andreas
Re: [HACKERS] Tablespaces
Ahh. I forgot to detail my ideas on this. It seems to me that we cannot drop a table space until the directory is empty. Agreed. How would it get to be empty? Are you thinking of some sort of connect database to tablespace and disconnect database from tablespace commands that would respectively create and delete the per-database subdirectory? That seems moderately reasonable to me. We could then I would only allow the drop if the directory only contains empty db oid directories. Andreas
Re: [HACKERS] Thread safe connection-name mapping in ECPG. Is it required?
I'm not sure I understand you correctly. The SQL standard says you can call your statement as this: exec sql at CONNECTION select 1; Here CONNECTION of course is a string, the name of the connection. So, yes, we have to maintain that list to make sure we get the right connection. I thought the main problem was an exec sql set connection :hostvar, where hostvar is a string determined at runtime. Otherwise the above could be translated at precompile time to something more efficient than a string search. Oh, I just noticed the above is not standard, but it sure is very useful. Andreas
Re: [HACKERS] Thread safe connection-name mapping in ECPG. Is it
I am asking for CONNECTION being a variable of data type 'connection *' rather than 'const char *'. That would avoid name lookups. Is that out of spec? Yes, but the preprocessor could still add an optimization à la 'connection *' for the hardcoded cases (exec sql set connection 'myconn1'; exec sql at 'myconn1' ...). It needs to maintain the string list for the non-hardcoded cases though. Andreas
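The handle-vs-string-lookup tradeoff can be sketched in a few lines (a toy model in Python; `Connection` and `Registry` are hypothetical names, the real ECPG code is C):

```python
class Connection:
    """Stand-in for an ECPG connection handle (hypothetical)."""
    def __init__(self, name):
        self.name = name

class Registry:
    """Keeps the standard name -> connection list for runtime strings."""
    def __init__(self):
        self._by_name = {}

    def connect(self, name):
        conn = Connection(name)
        self._by_name[name] = conn
        return conn

    def lookup(self, name):
        # runtime path: exec sql at :hostvar must search by name
        return self._by_name[name]

reg = Registry()
myconn1 = reg.connect("myconn1")   # hardcoded case: keep the handle itself
hostvar = "myconn1"                # non-hardcoded case: name known only at runtime
assert reg.lookup(hostvar) is myconn1
```

The point of the optimization is that the hardcoded branch can skip `lookup()` entirely and use `myconn1` directly, while the name list must stay around for the `:hostvar` case.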
Re: [HACKERS] Tablespaces
I do not intend to undertake raw disk tablespaces for 7.5. I'd be interested if anyone could provide some real-world benchmarking of file system vs. raw disk. Postgres benefits a lot from the kernel file system cache at the moment. Yes, and don't forget that pg also relies on the OS for grouping and sorting the physical writes and doing readahead where appropriate. The use of raw disks is usually paired with the use of kernel aio. The difference is said to be up to 30% on Solaris. I can attest that it made the difference between a bogged-down system and a much better behaved DB on Sun here. My experience with kaio on AIX Informix is that kaio is faster as long as IO is not the bottleneck (disk 100% busy is the metric to watch, not MB/s), while for an IO-bound system the Informix builtin IO threads that can be used instead win. (Since they obviously do better at grouping, sorting and readahead than the AIX kernel does for kaio.) Overall I think the price and complexity are too high, especially since there are enough platforms where the kernel does a pretty good job at grouping, sorting and readahead. Additionally the kernel takes non-PostgreSQL IO into account. Andreas
Re: [HACKERS] OIDs, CTIDs, updateable cursors and friends
I believe the ODBC driver uses CTID for this sort of problem. CTID is guaranteed to exist and to be fast to access (since it's a physical locator). Against this you have the problem that concurrent updates of the record will move it, leaving your CTID invalid. However, IIRC the ctid access follows the chain up to the currently valid tuple? I thought the only enemy of ctid access was vacuum? Andreas
Re: [HACKERS] [PATCHES] NO WAIT ...
The question is whether we should have a GUC variable to control no waiting on locks or add NO WAIT to specific SQL commands. Does anyone want to vote _against_ the GUC idea for nowait locking? (We already have two voting for such a variable.) I vote against. We got bit by both the regex and the autocommit GUC vars and this is setting up to cause a similar headache with old code on new platforms. I vote for the GUC. Imho it is not comparable to the autocommit case, since it does not change the way your appl needs to react (the appl needs to react to deadlocks already). I personally think a wait period in seconds would be more useful. Millisecond timeouts tend to be misused with way too low values in this case, imho. Andreas
Re: [HACKERS] [PATCHES] NO WAIT ...
I personally think a wait period in seconds would be more useful. Millisecond timeouts tend to be misused with way too low values in this case, imho. I understand, but the GUC lost the vote. I have updated the TODO list to indicate this. Tatsuo posted a patch to add NO WAIT to the LOCK command, so we will see if we can get that into CVS. Ok, I can see the advantages of that approach too. Too bad there is no standard for this. And it is probably really true that statement_timeout solves the problem of very long (indefinite :-) waits for locks. Andreas
Re: [HACKERS] Transaction aborts on syntax error.
In both cases, the transaction either commits or rollback occurs. No other option is possible at the end of the transaction, but in the first style of transaction semantics you get a mid-way decision point. This only refers to retryable errors, since errors like access rights violations and many other system errors aren't retryable. In the example You seem to ignore the fact that a lot of errors (and I assume most of the real-world cases where the appl actually reacts inside a transaction) are repaired by these applications by doing something else instead. Like the application does an insert, gets a duplicate key error, and does an update instead. Or it does an insert, gets a foreign key constraint violation, populates the foreign table and does the insert again. Note that this programming practice is more efficient than letting the appl check beforehand, if the error cases are seldom. It seems to me that leaving all this to the client (which implicitly inserts savepoints) can never be as efficient as a server-side feature. Andreas
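The insert-then-repair pattern described above can be sketched with client-side savepoints (a minimal illustration using SQLite's savepoint support as a stand-in for the proposed behavior; table and savepoint names are made up):

```python
import sqlite3

# Autocommit mode so we control the transaction explicitly.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT)")
db.execute("BEGIN")
db.execute("INSERT INTO t VALUES (1, 'old')")

def upsert(db, k, v):
    db.execute("SAVEPOINT s1")           # client-side savepoint
    try:
        db.execute("INSERT INTO t VALUES (?, ?)", (k, v))
    except sqlite3.IntegrityError:
        # duplicate key: roll back only the failed statement, then repair
        db.execute("ROLLBACK TO s1")
        db.execute("UPDATE t SET v = ? WHERE k = ?", (v, k))
    db.execute("RELEASE s1")

upsert(db, 1, "new")                     # duplicate key -> update path
upsert(db, 2, "fresh")                   # plain insert path
db.execute("COMMIT")
```

With server-side per-statement rollback, the SAVEPOINT/ROLLBACK TO pair would be implicit, which is exactly the efficiency argument being made.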
Re: [HACKERS] Transaction aborts on syntax error.
It seems to me that leaving all this to the client (which implicitly inserts savepoints) can never be as efficient as a server-side feature. I think this is an overly narrow view of efficiency. With client control, the client can insert savepoints wherever it needs them, Yes, but not if the client API does implicit savepoints. So imho if it is not cheap we should not start to supply APIs that do them implicitly. which might not be for every statement. Savepoints that you don't actually need are going to be a fairly expensive overhead, AFAICS. Well, with other DBs per-statement rollback is a no-overhead feature, so this is pg-specific. (In the sense of: nothing is done that would not need to be done anyway, since they all undo the changes.) Imho the 80% main use case for applications would be for duplicate key to not abort. For interactive psql it would probably be the syntax error. Maybe something can be done about those special cases to make partial rollback cheaper for those. Andreas
Re: [HACKERS] Transaction aborts on syntax error.
Improving on not ideal would be good, and would get even closer to full Oracle/SQLServer migration/compatibility. However, since I've never looked at that section of code, I couldn't comment on any particular approach nor implement such a change, so I'll shut up and be patient. Imagine this: BEGIN WORK; LOCK oldtab; CREATE TABLE newtab AS SELECT * FROM oldtab; DELETE oldtab; COMMIT; In this case, you would want the database to abort on a syntax error, right? Yeah, but in other DBs this is solved by the frontend, e.g. in Informix dbaccess has a mode that simply stops execution upon the first error. So I don't think this is a no-go argument, if we added such a feature to psql. Imagine your script continuing with insert into newtab ... after the commit; wouldn't you actually want that to not run either? Andreas
Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint
I don't think the bgwriter is going to be able to keep up with I/O bound backends, but I do think it can scan and set those booleans fast enough for the backends to then perform the writes. As long as the bgwriter does not do sync writes (which it does not, since that would need a whole lot of work to be performant) it calls write(), which returns more or less at once. So the bottleneck can only be the fsync. Of those you would want at least one per pg disk in progress in parallel. But I think it should really be left to the OS when it actually does the IO for the writes from the bgwriter in between checkpoints. So imho the target should be to have not much IO still outstanding at checkpoint time, so the fsync is fast enough even if serial. Andreas
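The write()-returns-at-once behavior referred to above can be illustrated with a minimal sketch (not PostgreSQL code; just the POSIX pattern): write() hands the page to the OS cache, and fsync() is the point where the process actually waits on the disk.

```python
import os
import tempfile

# One 8 kB "buffer", written then synced, as a backend or bgwriter would.
fd, path = tempfile.mkstemp()
page = b"x" * 8192
os.write(fd, page)              # returns more or less at once (OS cache)
os.fsync(fd)                    # the (potentially slow) sync point
size = os.path.getsize(path)    # data is durably on disk at this point
os.close(fd)
os.remove(path)
```

Deferring the fsync while letting the OS schedule the already-issued writes is exactly the "leave it to the OS in between checkpoints" argument.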
Re: [HACKERS] [pgsql-hackers-win32] Sync vs. fsync during checkpoint
People keep saying that the bgwriter mustn't write pages synchronously because it'd be bad for performance, but I think that analysis is faulty. Performance of what --- the bgwriter? Nonsense, the *point* Imho that depends on the workload. For a normal OLTP workload this is certainly correct. I do not think it is correct for mass loading, or an otherwise IO-bound db. of the bgwriter is to do the slow tasks. The only argument that has any merit is that O_SYNC or immediate fsync will prevent us from having multiple writes outstanding and thus reduce the efficiency of disk write scheduling. This is a valid point but there is a limit to how many writes we need to have in flight to keep things flowing smoothly. But that is imho the main point. The difference for modern disks is 1 MB/s for random 8k writes vs. 20 MB/s for random 256k writes. Don't get me wrong: I think sync writing would achieve maximum performance, but you have to try to write physically adjacent 256k chunks, and you need a vague idea which blocks to write in parallel. And since that is not so easy I think we could leave it to the OS. And as an aside I think 20-30 minute checkpoint intervals would be sufficient with a bgwriter. Andreas PS: don't most syncers have 60s intervals, not 30?
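The batching argument can be shown mechanically (sizes taken from the post; this toy sketch only demonstrates that grouping 32 random 8 kB page writes into one contiguous 256 kB write with a single sync produces the same bytes on disk, it does not measure real throughput):

```python
import os
import tempfile

PAGE = 8192

def write_per_page(path, pages):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    for p in pages:
        os.write(fd, p)
        os.fsync(fd)               # one sync per 8 kB page: 32 syncs
    os.close(fd)

def write_grouped(path, pages):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    os.write(fd, b"".join(pages))  # one contiguous 256 kB write
    os.fsync(fd)                   # a single sync
    os.close(fd)

pages = [bytes([i]) * PAGE for i in range(32)]
fd_a, path_a = tempfile.mkstemp(); os.close(fd_a)
fd_b, path_b = tempfile.mkstemp(); os.close(fd_b)
write_per_page(path_a, pages)
write_grouped(path_b, pages)
data_a = open(path_a, "rb").read()
data_b = open(path_b, "rb").read()
os.remove(path_a); os.remove(path_b)
```

The grouped variant is the "physically adjacent 256k" case; the per-page variant models what immediate O_SYNC-style writing forces on you.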
Re: [pgsql-hackers-win32] [HACKERS] [PATCHES] fork/exec patch
How about the typical answer on Windows? Create an invisible window with an event handler and send it a Windows message? Andreas
Re: [HACKERS] 7.4 include file conflict
Does anyone know how Informix, where this file comes from, handles this? Informix puts those files in $INFORMIXDIR/incl/esql (e.g. /usr/informix/incl/esql), so imho a /usr/postgres installation could have them somewhere under /usr/postgres Andreas
Re: [HACKERS] fsync method checking
Ideally that path isn't taken very often. But I'm currently having a discussion off-list with a CMU student who seems to be seeing a case where it happens a lot. (She reports that both WALWriteLock and WALInsertLock are causes of a lot of process blockages, which seems to mean that a lot of the WAL I/O is being done with both held, which would have to mean that AdvanceXLInsertBuffer is doing the I/O. More when we figure out what's going on exactly...) I would figure that this is in a situation where a large transaction fills one XLInsertBuffer and a lot of WAL buffers are not yet written. Andreas
Re: [HACKERS] fsync method checking
Running the attached test program shows on BSD/OS 4.3:
write 0.000360
write fsync 0.001391
I think the write fsync pays for the previous write test (same filename).
write, close fsync 0.001308
open o_fsync, write 0.000924
I have tried to modify the program to more closely resemble WAL writes (all writes to WAL are 8k), the file is usually already open, and to test larger (16k) transactions.
[EMAIL PROTECTED]:~ test_sync1
write 0.000625
write fsync 0.016748
write fdatasync 0.006650
write, close fsync 0.017084
write, close fdatasync 0.006890
open o_dsync, write 0.015997
open o_dsync, one write 0.007128
For the last line xlog.c would need to be modified, but the measurements seem to imply that it is only worth it on platforms that have O_DSYNC but not fdatasync. Andreas
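A rough Python analogue of such a timing test (not the original attached test_sync1.c; numbers will differ by platform, and os.fdatasync falls back to os.fsync where the platform lacks it) looks like this:

```python
import os
import tempfile
import time

def timed(sync, n=50, size=8192):
    """Average seconds per 8 kB write + sync, over n iterations."""
    fd, path = tempfile.mkstemp()
    buf = b"\0" * size
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, buf)
        sync(fd)
    elapsed = (time.perf_counter() - start) / n
    os.close(fd)
    os.remove(path)
    return elapsed

fdatasync = getattr(os, "fdatasync", os.fsync)  # not on all platforms
t_fsync = timed(os.fsync)
t_fdatasync = timed(fdatasync)
print(f"write fsync     {t_fsync:.6f}")
print(f"write fdatasync {t_fdatasync:.6f}")
```

On platforms with a real fdatasync the second number is typically smaller, since only data (not metadata like mtime) must reach disk, matching the table above.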
Re: [HACKERS] pg_restore and create FK without verification check
Q2: New situation: Why is it not a good idea to back up the database files of a cluster incl. all c_log and x_log (log files last) to get a physically hot backup? In principle it is the same situation as a server which is crashing (not at once but over some time). After restoring, it should do a redo and rollback automatically like after a crash. This method (physical hot backup) would improve backup and restore times dramatically. The answer from Robert Treat was: Essentially I think you're right, it should behave much like a crashing server. The main reason why people don't recommend it is that (depending on your os setup) there is the potential to lose data that has been committed but not actually written to disk. Note that you shouldn't get corrupted data from this, but in many cases losing data is just as bad, so we don't recommend it. If you really want to do this, you should really either shut down the database or get LVM going. The key issue here is to have a pg_control file to start from with a finished checkpoint from before you start to back up. Then you need to ensure that you have all logfiles from the checkpoint until the backup finishes. The last thing to back up must be the last active x_log. It would probably be a good idea to not have a vacuum running concurrently :-) And then you need to do a lot of tests, since nobody else does it yet. I think this is an issue, since it is such a high risk that nobody will step up easily and say that it is safe. Andreas
Re: [HACKERS] A rough roadmap for internationalization fixes
There are no such libraries. I keep hearing ICU, but that is much too bloated. At least it is kind of a standard and also something that will be maintained for the foreseeable future; it also has a compatible license and is available on all platforms of interest to postgresql. And it is used for DB/2 and Informix, so it must be quite feature-complete for DB-relevant stuff. Andreas
Re: [HACKERS] A rough roadmap for internationalization fixes
Have you looked at what is available from http://oss.software.ibm.com/icu/ ? Seems they have a compatible license, but use some C++. Andreas
splitting WAL (was RE: [HACKERS] Providing anonymous mmap as an option of sharing memory)
In case of WAL per database, the operations done on a shared catalog from a backend would need flushing of both the system WAL and the database WAL to ensure such a transaction commits. Otherwise flushing only the database WAL would do. I don't think that is a good idea. If you want databases separated you should install more than one instance. That gives you way more flexibility. Imho per-database WAL is a deficiency, not a feature. Andreas PS: the problem with mmap was that it has no attached process count
Re: [HACKERS] Sponsoring enterprise features
The main needs partitioning is useful for:
- partition elimination for queries (e.g. seq scans only scan relevant partitions)
- deleting/detaching huge parts of a table in seconds
- attaching huge parts to a table in seconds (that may have been loaded with a fast loading utility (e.g. loading without index, prebuilding indexes, attaching table + prebuilt partitioned indexes))
- achieving [heap and index] per-page data locality (for better cache rates)
- allowing partial restores (for defect disks) while the rest of the db is still online
- in pg, allowing partial vacuums (only partitions that see changes)
People needing those features usually have data with more than 10-50 GB per partition. I'm a little unclear, personally, about what can be accomplished through table partitioning that we can't currently do through partial indexes and inherited tables, especially after Gavin finishes his tablespaces patch Well, sure the goal needs to be to make use of what already exists, but a few things are still missing, e.g.:
- unique indexes that span the hierarchy (and do not contain the partitioning column[s])
- partition elimination (imho we should use check constraints for that)
- physical backups :-)
- tablespaces?
Note that these would all be useful for non-partitioning use cases also. Andreas
Re: [HACKERS] Background writer process
1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC (Not sure if Without grouping WAL writes that does not fly. If, however, such grouping is implemented, that should deliver optimal performance. I don't think flushing WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k to disk with one call takes nearly the same time. The WAL write would need to be done as soon as either 256k fill up or a txn commits. That means no special treatment for WAL files? If it works, great. There would be a single class of files to take care of w.r.t. the sync issue. Even simpler. No, WAL needs special handling. Either leave it as is with write + f[data]sync, or implement O_SYNC|O_DIRECT with grouping of writes (the current O_SYNC implementation is only good for small (8kb) transactions). Andreas
Re: [HACKERS] Background writer process
If the background writer uses fsync, it can write and allow the buffer to be reused and fsync later, while if we use O_SYNC, we have to wait for the O_SYNC write to happen before reusing the buffer; that will be slower. You can forget O_SYNC for datafiles for now. There would simply be too much to do currently to allow decent performance, like scatter/gather IO, ... Imho the reasonable target should be to write from all backends but sync (fsync) from the background writer only. (Tune the OS if it actually waits until the pg-invoked sync (every 5 minutes by default).) Andreas
Re: [HACKERS] Background writer process
1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC (Not sure if Without grouping WAL writes that does not fly. If, however, such grouping is implemented, that should deliver optimal performance. I don't think flushing WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k to disk with one call takes nearly the same time. The WAL write would need to be done as soon as either 256k fill up or a txn commits. Andreas
Re: [HACKERS] 7.4RC2 regression failur and not running stats collector process on Solaris
LOG: could not bind socket for statistics collector: Cannot assign requested address Hmm ... that's sure the problem, but what can we do about it? ISTM that any non-broken system ought to be able to resolve localhost. Actually it's worse than that: your system resolved localhost and then refused Are we using an API that only returns nslookup responses and not /etc/hosts entries? At least on AIX it looks like it. Andreas
Re: [HACKERS] Experimental patch for inter-page delay in VACUUM
that works well enough to make it uncommon for backends to have to write dirty buffers for themselves. If we can, then doing all the writes O_SYNC would not be a problem. One problem with O_SYNC would be that the OS does not group writes any more. So the code would need to either do its own sorting and grouping (256k) or use aio, or you won't be able to get the maximum out of the disks. Andreas
Re: [HACKERS] Experimental patch for inter-page delay in VACUUM
One problem with O_SYNC would be that the OS does not group writes any more. So the code would need to either do its own sorting and grouping (256k) or use aio, or you won't be able to get the maximum out of the disks. Or just run multiple writer processes, which I believe is Oracle's solution. That does not help, since for O_SYNC the OSes (those I know) do not group those writes together. Oracle allows more than one writer to keep more than one disk (subsystem) busy and to circumvent other per-process limitations (mainly on platforms without AIO). Andreas
Re: [HACKERS] Experimental patch for inter-page delay in VACUUM
The only idea I have come up with is to move all buffer write operations into a background writer process, which could easily keep track of every file it's written into since the last checkpoint. I fear this approach. It seems to limit a lot of design flexibility later. But I can't come up with any concrete way it limits things, so perhaps that instinct is just FUD. A lot of modern disk subsystems can only be saturated with more than one parallel IO request. So it would at least need a tunable number of parallel writer processes, or one writer that uses AIO to dump all outstanding IO requests out at once. (Optimal would be all; in reality it would need to be batched into groups of n pages, since most systems have a max aio request queue size, e.g. 8192.) Andreas
Re: [HACKERS] Experimental ARC implementation
My plan is to create another background process very similar to the checkpointer and to let that run forever, basically looping over that BufferSync() with a bool telling it that it's the bg_writer. Why not use the checkpointer itself in between checkpoints? Use a min and a max dirty setting like Informix: start writing when more than max pages are dirty, stop at min. This avoids writing single pages (which is slow, since they cannot be grouped together by the OS). Andreas
Re: [HACKERS] Experimental ARC implementation
Why not use the checkpointer itself in between checkpoints? Use a min and a max dirty setting like Informix: start writing when more than max pages are dirty, stop at min. This avoids writing single pages (which is slow, since they cannot be grouped together by the OS). Current approach is similar ... if I stretch the IO and syncing over the entire 150-300 second checkpoint interval, grouping in 50 pages then sync()+nap, the system purrs pretty nicely and without any peaks. But how do you handle a write-IO-bound system then? My thought was to let the checkpointer write dirty pages in between checkpoints with a min and max, but still try to do the checkpoint itself as fast as possible. I don't think stretching the checkpoint is good, since it needs to write hot pages, which the in-between IO should avoid doing. The checkpointer would have two tasks that it handles alternately: checkpoint, or flush LRU from max down to min. Andreas
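The alternating min/max scheme could look roughly like this (service_round(), the threshold values, and the bookkeeping are all illustrative assumptions, not actual checkpointer code):

```c
#define MAX_DIRTY 1000          /* start flushing above this */
#define MIN_DIRTY 200           /* stop flushing at this */

/*
 * One round of the dual-task loop: a due checkpoint is done as fast as
 * possible (writes everything); otherwise, once more than MAX_DIRTY pages
 * are dirty, flush LRU pages until only MIN_DIRTY remain.  Returns the
 * number of pages written this round.
 */
int service_round(int *ndirty, int checkpoint_due)
{
    int written = 0;

    if (checkpoint_due)
    {
        written = *ndirty;      /* checkpoint: write all, no napping */
        *ndirty = 0;
    }
    else if (*ndirty > MAX_DIRTY)
    {
        written = *ndirty - MIN_DIRTY;
        *ndirty = MIN_DIRTY;    /* LRU flush from max down to min */
    }
    return written;
}
```

With 1500 dirty pages and no checkpoint due, one round flushes 1300 pages down to the min of 200; below the max threshold, nothing is written at all.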
Re: [HACKERS] Experimental patch for inter-page delay in VACUUM
Or... It seems to me that we have been observing something on the order of a 10x-20x slowdown for vacuuming a table. I think this is WAY overcompensating for the original problems, and would cause its own problems as mentioned above. Since the granularity of the delay seems to be the problem, can we do more work between delays? Instead of sleeping after every page (I assume this is what it's doing) perhaps we should sleep every 10 pages. I also think doing more than one page per sleep is advantageous, since it would still allow the OS to do its readahead optimizations. I suspect those would fall flat if only one page is fetched per sleep. Andreas
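A sketch of the grouped-sleep idea (the names and the group size of 10 are illustrative, not the actual vacuum code):

```c
#define GROUP 10                /* pages processed per nap */

static int naps = 0;

static void nap(void)           /* stand-in for the vacuum delay/usleep */
{
    naps++;
}

/*
 * Process npages pages but nap only once per GROUP pages, so the OS still
 * sees sequential multi-page reads and its readahead stays effective.
 */
int vacuum_pages(int npages)
{
    int processed;

    for (processed = 0; processed < npages; processed++)
    {
        /* ... scan and repair one page here ... */
        if ((processed + 1) % GROUP == 0)
            nap();
    }
    return processed;
}
```

For a 100-page table this means 10 sleeps instead of 100, with 10-page sequential bursts in between for readahead to latch onto.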
Re: [HACKERS] 2-phase commit
I was wondering whether we need to keep WAL online for 2PC, or whether only something like clog is sufficient. What if: 1. the phase 1 commit must pass the slave xid that will be used for the 2nd phase (it needs to return some sort of identification anyway), and 2. the coordinator must keep a list of slave xids along with the corresponding (commit/rollback) info? Is that not sufficient? Why would WAL be needed in the first place? This is not replication; the slave has its own WAL anyway. I also don't buy the argument about the lockup. If today somebody connects with psql, starts a transaction, modifies something and then never commits or aborts, there is also no built-in mechanism that will eventually kill it automatically. 2PC will simply need to provide means for the administrator to rollback/commit an in-doubt transaction manually. Andreas
Re: [HACKERS] 2-phase commit
Why would you spend time on implementing a mechanism whose ultimate benefit is supposed to be increasing reliability and performance, when you already realize that it will have to lock up at the slightest sign of trouble? There are better mechanisms out there that you can use instead. If you want cross-server transactions, what other methods are there that are more reliable? It seems network unreliability is going to be a problem no matter what method you use. And unless you have 2-phase (or 3-phase) commit, all other methods are going to be worse, since their time window for possible critical failure is going to be substantially larger. (Extending 2-phase to 3-phase should not be too difficult.) A lot of use cases for 2PC are not about manipulating the same data on more than one server (replication), but about different data that needs to be manipulated in an all-or-nothing transaction. In this scenario it is not about reliability but about physically locating data (e.g. in LA vs New York) where it is needed most often. Andreas
Re: [HACKERS] new initdb.c available
Using wb for writing out on Windows is so that we don't get Windows' gratuitous addition of carriage returns. I will document that. Please use the #define PG_BINARY_W from c.h, which is defined with the correct letters for all platforms (wb on Windows). That is how Peter's comment was meant. Andreas
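For illustration, a sketch of the pattern (the real macro lives in src/include/c.h and may differ in detail; write_version_file() is a made-up call site):

```c
#include <stdio.h>

/*
 * The idea behind PG_BINARY_W: one macro that expands to the right
 * fopen() mode per platform, so call sites never hardcode "wb".
 * (Sketch only; the authoritative definition is in c.h.)
 */
#ifdef WIN32
#define PG_BINARY_W "wb"        /* suppress CR insertion on Windows */
#else
#define PG_BINARY_W "w"
#endif

/* hypothetical call site, as initdb might write PG_VERSION */
int write_version_file(const char *path, const char *version)
{
    FILE *fp = fopen(path, PG_BINARY_W);

    if (fp == NULL)
        return -1;
    fprintf(fp, "%s\n", version);
    return fclose(fp);
}
```

On Windows the "wb" mode keeps the file byte-for-byte what was written; on Unix "w" and "wb" behave identically, so the macro costs nothing there.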
Re: [HACKERS] 2-phase commit
The simplest scenario (though there could be variations) is: [At the participant (master)'s side] Because the commit operation is done, it does nothing. [At the coordinator (slave)'s side] 1) After a while, 2) re-establish the communication path to the participant (master)'s TM, and 3) resend the commit request to the participant's TM. 1)-3) would be repeated until the coordinator receives the commit-ok message from the participant. [ scratches head ] I think you are using the terms master and slave oppositely than I would. Oops, my mistake, sorry. But is it the 2-phase commit protocol in the first place? That is, in your example below

Master                    Slave
  -----commit ready----->      This is the commit for phase 1. This commit is
                               allowed to return all sorts of errors, like
                               violated deferred checks, out of disk space, ...
  <--------OK------------
  -----commit done------>XX    This is the commit for phase 2; the slave *must*
                               answer with success in all but hardware failure
                               cases. (Note that instead the master could send
                               rollback, e.g. because some other slave aborted.)

is the commit done message needed? So, yes, this is needed. Andreas
Re: [HACKERS] 2-phase commit
I don't think there is any way to handle cases where the master or slave just disappears. The other machine isn't under the server's control, so it has no way of knowing. I think we have to allow the administrator to set a timeout, or ask to wait indefinitely, and allow them to call an external program to record the event or notify administrators. Multi-master replication has the same issues. It needs to wait indefinitely; a timeout is not acceptable since it leads to inconsistent data. Human (or monitoring software) intervention is needed if they can't reach each other in a reasonable time. I think this needs to be kept dumb. Different sorts of use cases will simply need different answers to resolve in-doubt transactions. What is needed is an interface that allows listing and commit/rollback of in-doubt transactions (preferably from a newly started client, or a direct command to the postmaster). Andreas
Re: [HACKERS] 2-phase commit
Master                    Slave
  -----commit ready----->
  <--------OK------------
  -----commit done------>XX

is the commit done message needed? Of course ... how else will the Slave commit? From my understanding, the concept is that the master sends a commit ready to the slave, and the OK back means OK, I'm ready to commit whenever you are, at which point the master does its commit and tells the slave to do its ... Or the slave could reject the request. At this point only because of a hardware error. In case of network problems, either the commit done did not reach the slave or the success answer did not reach the master. That is what it's all about. Phase 2 is supposed to be low overhead and very fast, to keep the time window for failure (which produces in-doubt transactions) as short as possible. Andreas
Re: [HACKERS] 2-phase commit
Or the slave could reject the request. Huh? The slave has that option?? In what circumstance? I thought the slave could reject if someone local already had the row locked. No, not at all. The slave would need to reject the phase 1 commit ready for this. Andreas
Re: [HACKERS] Threads vs Processes
When the address-of operator is applied to a thread-local variable, it is evaluated at run time and returns the address of the current thread's instance of that variable. An address so obtained may be used by any thread. When a thread terminates, any pointers to thread-local variables in that thread become invalid. Bummer; I would have thought one advantage of using TLS would surely be memory protection? So the only usage of TLS useful for pg seems to be __declspec(thread) and __declspec(thread) static (both for stuff that does not need runtime pre-initialization). Maybe the techniques of Electric Fence could be used to protect the shmem at least a little bit. Andreas
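A small illustration of those TLS semantics, using C11 _Thread_local as a portable stand-in for __declspec(thread) (all the names here are hypothetical; note there is no memory protection between the instances, only distinct addresses):

```c
#include <pthread.h>

/*
 * &my_counter evaluated in a thread yields that thread's instance, so a
 * worker thread and the main thread see different addresses -- and either
 * thread could still use the other's pointer (no protection), as long as
 * the owning thread is alive.
 */
static _Thread_local int my_counter;

static void *worker(void *main_addr)
{
    int *mine = &my_counter;    /* evaluated at run time, per thread */

    /* compare while this thread is still alive; its pointer dies with it */
    return (void *) (long) (mine != (int *) main_addr);
}

/* returns 1 if the worker saw a distinct instance from the main thread */
int tls_instances_distinct(void)
{
    pthread_t t;
    void *differs;

    pthread_create(&t, NULL, worker, &my_counter);
    pthread_join(t, &differs);
    return differs != NULL;
}
```

The comparison is done inside the worker deliberately: per the semantics quoted above, the worker's pointer would become invalid once the thread terminates.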
Re: [HACKERS] More Preliminary DBT-2 Test Results with PostgreSQL 7.3.4
Are those response times in the right unit? 7-10s? No problem: http://developer.osdl.org/markw/misc/plana.out OK, I guess I misunderstood you. These queries are taking 0.5ms - 300ms, except for the last aggregate query, which takes just over 1s. Yes, but because this is a benchmark he probably has many clients running in parallel. It would be interesting to see the 'explain analyze' from a session that runs concurrently with the benchmark. Andreas
Re: [HACKERS] AIX 4.2.1 CVS head and SSL
I get the following errors:

gmake -C ecpglib all
gmake[4]: Entering directory `/usr/local/postgres/pgsql/src/interfaces/ecpg/ecpglib'
../../../../src/backend/port/aix/mkldexport.sh libecpg.a libecpg.exp
gcc -O2 -pipe -Wall -Wmissing-prototypes -Wmissing-declarations -Wl,-bnoentry -Wl,-H512 -Wl,-bM:SRE -o libecpg.so libecpg.a -L../../../../src/port -lm -L/usr/local/ssl/lib -L../../../../src/port -L/usr/local/ssl/lib -L../pgtypeslib -lpgtypes -L../../../../src/interfaces/libpq -lpq -Wl,-bI:../../../../src/backend/postgres.imp -Wl,-bE:libecpg.exp
ld: 0711-317 ERROR: Undefined symbol: .SSL_new
ld: 0711-317 ERROR: Undefined symbol: .SSL_set_ex_data
ld: 0711-317 ERROR: Undefined symbol: .SSL_set_fd
ld: 0711-317 ERROR: Undefined symbol: .SSL_read
ld: 0711-317 ERROR: Undefined symbol: .SSL_get_error
ld: 0711-317 ERROR: Undefined symbol: .SSL_write

You can get around that problem by adding the SSL library to the link line manually (-lssl). AIX wants all symbols resolved at link time (or at least to be told where to load them from at runtime with an *.imp file). All platforms with that requirement will have this problem when linking libecpg.so with ssl. Sorry, I don't have time to fix it right now. Andreas
Re: [HACKERS] Win32 native port
Below is the email that prompted me to add the derived files to WIN32_DEV CVS. However, most people don't want them in there, so I have removed them, updated the web page to recommend the nightly snapshots (which have the derived files), and mentioned the tools that will be needed for a CVS build: Here: http://unxutils.sourceforge.net/ are ports of several unix utility programs (including bison and flex) for win32. From my experience compiling the Peer Direct port, this is the easiest way to get started. While I agree that those utilities are really fine, the trouble with this bison is that it is version 1.28; pg needs a newer bison version. The bison from MinGW really only seems to work under obscure preconditions :-( Andreas
Re: [HACKERS] Vote: Adding flex/bison derived files in WIN32_DEV
I'm confused. Right on the MinGW download page is a link for bison-1.875. Yep, but I had problems with it. The author confirmed that there could be some problems creating processes (for example M4). However, if you make it work, I'll be interested to know how. Check the MinGW mailing list for the last few weeks and see what I'm talking about. By the way, the author's last tip, reconfiguring make, did not work for me :-( Download the binary package at http://prdownloads.sf.net/mingw/bison-1.875.0-2003.02.10-1.exe?download and use it. You do not need to compile bison yourself. Andreas
Re: [HACKERS] 2-phase commit
From our previous discussion of 2-phase commit, there was concern that the failure modes of 2-phase commit were not solvable. However, I think multi-master replication is going to have similar non-solvable failure modes, yet people still want multi-master replication. No. The real problem with 2PC in my mind is that its failure modes occur *after* you have promised commit to one or more parties. In multi-master, if you fail, you know it before you have told the client his data is committed. Hmm? The application cannot take the first-phase commit as its commit info. It needs to wait for the second-phase commit. The second phase is only finished when all coservers have reported back. 2PC is synchronous. The problems with 2PC arise when, after the second-phase commit was sent to all servers and before all have reported back, one of them becomes unreachable/down ... (did it receive and perform the 2nd commit or not?). Such a transaction must stay open until the coserver is reachable again or an administrator has committed/aborted it. It is multi-master replication that usually has an asynchronous mode for performance, and there the trouble starts. Andreas
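The synchronous flow described here can be modelled as a toy state machine (run_2pc() and the stub coservers are illustrative assumptions, not a real implementation): any phase 1 refusal aborts everywhere, while a phase 2 failure leaves the transaction in doubt rather than resolved.

```c
/*
 * Toy model of synchronous 2PC: prepare() is phase 1 and may refuse;
 * commit() is phase 2 and fails only if a coserver is unreachable, which
 * leaves the transaction in doubt for the administrator to resolve.
 */
typedef enum { TX_COMMITTED, TX_ABORTED, TX_IN_DOUBT } TxOutcome;

TxOutcome run_2pc(int nservers, int (*prepare)(int), int (*commit)(int))
{
    int i;

    for (i = 0; i < nservers; i++)
        if (!prepare(i))
            return TX_ABORTED;      /* any refusal aborts everywhere */
    for (i = 0; i < nservers; i++)
        if (!commit(i))
            return TX_IN_DOUBT;     /* stays open until resolved */
    return TX_COMMITTED;            /* only now tell the client */
}

/* stub coservers for illustration */
static int always_ok(int i)      { (void) i; return 1; }
static int second_refuses(int i) { return i != 1; }
```

Note that the model reports commit to the caller only after every coserver has acknowledged phase 2, which is exactly why the application must not treat the phase 1 acknowledgement as its commit.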
Re: [HACKERS] Index creation takes for ever
I don't think so, because the patch does nothing to keep the sort order once the index is initially created. As Tom mentioned, we might not want to keep the tids in order after the index is created, because he wants the most recent tids first, so the expired ones migrate to the end. But on average this argument only holds true for unique indexes, no? Is there any code that stops the heap lookup after the visible tuple is found? At least in an index with more rows per key you will fetch all heap tuples after the first one anyway to get at the next row. This is better done in heap order, no? And the bitmap approach will not work for large result sets. All in all, I would leave the TODO item (maybe adding a comment: only for non-unique indexes; evaluate performance). Andreas
Re: [HACKERS] Correlation in cost_index()
In both cases ANALYZE will calculate correlation 1.0 for column X, and something near zero for column Y. We would like to come out with index correlation 1.0 for the left-hand case and something much less (but, perhaps, not zero) for the right-hand case. I don't really see a way to do this without actually examining the multi-column ordering relationship during ANALYZE. The only way the second column's correlation will be irrelevant is if the first column is already (nearly) unique (enough so that the second column won't scatter fetches enough to fill the buffer before seeing cache hits). Thus I think when merging correlations you could take nunique into account: corr = corr_1 * (corr_2 * (1 - nunique_1 / nrows)). But I think one (new) correlation metric for the whole index (whole key) and the data pages would actually be sufficient. This metric could imho always be used instead of the per-column correlations to calculate index cost. This holds true as long as you walk an index range, and that is what it is all about, no? Andreas
Re: [HACKERS] Changing behavior of BEGIN...sleep...do something...COMMIT
Also, per other discussions, we are removing backend autocommit support in 7.4. It was the wrong way to do it. Somehow I did not see that conclusion being reached. I thought, at least for JDBC, it is already used successfully? I think the backend autocommit is useful. Maybe only the installation/database/user-wide GUC setting should be deprecated/disabled, so it is only used via a session SET? Andreas
Re: [HACKERS] Nested transactions: low level stuff
In fact, I had proposed a simpler UNDO capability that revisited tuples and set their XID to a fixed aborted XID to clean up aborted subtransactions, but most now like the multiple-XID solution. I think for the implicit subtransactions that we will want (with error codes coming), using a different xid for every command inside a transaction is not so sexy, no? Andreas
Re: [HACKERS] Roadmap for FE/BE protocol redesign
for that, we get what exactly? Fetching one row at a time is *guaranteed* to be inefficient. The correct response, if that bothers you, is to fetch multiple rows at a time, not to make a less robust protocol. I don't feel strongly either way on this one, but IIRC the SQL standard for cursors only specifies fetching one record at a time (at least that is how MSSQL and DB2 implement it). Thus portable code is likely to fetch only one record at a time. The current row description isn't too big, but with the changes being suggested it might become so. Yes, I think it should be the responsibility of the higher-level interfaces (ODBC, esql/c, ...) to fetch rows in optimal packets (e.g. 4-16k at a time). This is what other DBs' interfaces do for you. The first fetch gets a packet from the db; subsequent fetches are answered from the client-side cache until it is empty, then the next fetch gets the next packet from the db ... The trick would be to ask for n bytes of row data (maybe a range), which might be 1-n rows (or maybe even only part of one row, if it is very wide?). I think this is one of the things the new protocol should be able to cope with. Andreas
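The client-side caching scheme sketched above might look like this (RowCache, fill_from_server(), and the packet size are illustrative assumptions, not any interface's actual API): the interface asks the server for up to one packet of row data at a time and answers application-level FETCHes from that cache until it is empty.

```c
#define CACHE_BYTES 8192        /* one packet of row data from the server */

typedef struct
{
    int nrows;                  /* rows currently cached */
    int next;                   /* next cached row to hand out */
} RowCache;

/* stand-in for the protocol round trip: server hands back one packet */
static int fill_from_server(RowCache *c, int *remaining, int rowsize)
{
    int fit = CACHE_BYTES / rowsize;    /* rows that fit in one packet */

    c->nrows = (*remaining < fit) ? *remaining : fit;
    *remaining -= c->nrows;
    c->next = 0;
    return c->nrows;
}

/*
 * FETCH 1 as the application sees it: a server round trip happens only
 * when the client-side cache is empty.  Returns 1 if a row was delivered.
 */
int fetch_one(RowCache *c, int *remaining, int rowsize, int *roundtrips)
{
    if (c->next >= c->nrows)
    {
        if (*remaining == 0)
            return 0;                   /* result set exhausted */
        fill_from_server(c, remaining, rowsize);
        (*roundtrips)++;
    }
    c->next++;
    return 1;
}
```

With 512-byte rows, 16 rows fit per 8k packet, so fetching 100 rows one at a time costs only 7 round trips instead of 100.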