Re: [HACKERS] floor function in 7.3b2
Tom Lane [EMAIL PROTECTED] writes: Mario Weilguni [EMAIL PROTECTED] writes: Why is floor not working anymore? Mph. Seems we have floor(numeric) but not floor(float8), and the latter is what you need here. Sorry, I missed much of the casting discussion -- but is there a reason why we can't cast from float8 -> numeric implicitly? IIRC the idea was to allow implicit casts from lower-precision types to higher-precision ones. Cheers, Neil -- Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
[HACKERS] Threaded Sorting
Did anybody think about threaded sorting so far? Assume an SMP machine. In the case of building an index or in the case of sorting a lot of data there is just one backend working. Therefore just one CPU is used. What about starting a thread for every temporary file being created? This way CREATE INDEX could use many CPUs. Maybe this is worth thinking about because it will speed up huge databases and enterprise-level computing. Best regards, Hans-Jürgen Schönig -- *Cybertec Geschwinde u Schoenig* Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria Tel: +43/1/913 68 09; +43/664/233 90 75 http://www.postgresql.at, http://cluster.postgresql.at, http://www.cybertec.at, http://kernel.cybertec.at ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Threaded Sorting
On 4 Oct 2002 at 9:46, Hans-Jürgen Schönig wrote: Did anybody think about threaded sorting so far? Assume an SMP machine. In the case of building an index or in the case of sorting a lot of data there is just one backend working. Therefore just one CPU is used. What about starting a thread for every temporary file being created? This way CREATE INDEX could use many CPUs. Maybe this is worth thinking about because it will speed up huge databases and enterprise level computing. I have a better plan. I have a thread architecture ready which acts as a set of generic thread templates. Even the function pointers in the thread can be altered on the fly. I suggest we use some such architecture for threading. It can be used in any module without hardcoding things. Say, in sorting we assign exclusive jobs/data ranges to threads; then there would be minimal locking, and one thread could merge the results. Something like that. All it takes is to change entry functions to accept one more parameter that indicates the range of values to act upon. In the non-threaded version, that parameter is absent because the function acts on the entire data set. Furthermore, with this model threading support can be turned off easily: in the non-threaded model, a wrapper function can call the entry point in series with the necessary arguments. So PostgreSQL does not have to deal with not-so-good-enough thread implementations. Keeping with the tradition of conservative defaults, we can set threads to off by default. The code is in C++ but it's hardly a couple of pages. I can convert it to C and post it if required. Let me know. Bye Shridhar -- Parkinson's Fourth Law: The number of people in any working group tends to increase regardless of the amount of work to be done. ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] Bad rules
I am using postgres 7.2, and have a rule on a table which causes a notify if an insert/update/delete is performed on the table. The table is very, very small. When performing a simple (very simple) update on the table, this takes about 3 seconds; when I remove the rule it is virtually instantaneous. The rest of the database seems to perform fine. Have you any ideas, or come across this before? Regards Steve ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] (Fwd) Re: Any Oracle 9 users? A test please...
Mike Mascari [EMAIL PROTECTED] writes: Tom Lane wrote: Yury Bokhoncovich [EMAIL PROTECTED] writes: As reported by my friend: Oracle 8.1.7 (ver. 9 behaves the same way): [ to_char(sysdate) advances in a transaction ] Now I'm really confused; this directly contradicts the report of Oracle 8's behavior that we had earlier from Roland Roberts. Can someone explain the different results? Roland used an anonymous PL/SQL procedure: You're right, and I didn't think enough about what was happening. This also explains why I so often see the same timestamp throughout a transaction---the transaction is all taking place inside a PL/SQL procedure. roland -- PGP Key ID: 66 BC 3B CD Roland B. Roberts, PhD RL Enterprises [EMAIL PROTECTED] 76-15 113th Street, Apt 3B [EMAIL PROTECTED] Forest Hills, NY 11375 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)
Neil Conway [EMAIL PROTECTED] writes: Sorry, I missed much of the casting discussion -- but is there a reason why we can't cast from float8 -> numeric implicitly? IIRC the idea was to allow implicit casts from lower precision types to higher precision ones. The implicit casting hierarchy is now int2 -> int4 -> int8 -> numeric -> float4 -> float8. Moving to the left requires an explicit cast (or at least an assignment to a column). I know this looks strange to someone who knows that our numeric type beats float4/float8 on both range and precision, but it's effectively mandated by the SQL spec. Any combination of exact and inexact numeric types is supposed to yield an inexact result per spec, thus numeric + float8 yields float8, not numeric. Another reason for doing it this way is that a numeric literal like 123.456 can be initially typed as numeric, and later implicitly promoted to float4 or float8 if context demands it. Doing that the other way 'round would introduce problems with precision loss. We had speculated about introducing an unknown_numeric pseudo-type to avoid that problem, but the above hierarchy eliminates the need for unknown_numeric. We can initially type a literal as the smallest thing it will fit in, and then do implicit promotion as needed. (7.3 is not all the way there on that plan, but 7.4 will be.) regards, tom lane
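The one-directional hierarchy Tom describes can be modeled as a simple promotion chain. The sketch below is purely illustrative — the names mirror the SQL type names, not any actual PostgreSQL parser code — and shows how the common type of a binary numeric operator falls out of the chain:

```python
# Illustrative model of the implicit numeric promotion chain described
# above. These names mirror the SQL types, not PostgreSQL internals.
PROMOTION_CHAIN = ["int2", "int4", "int8", "numeric", "float4", "float8"]

def resolve_common_type(a: str, b: str) -> str:
    """Pick the more-promoted (rightmost in the chain) of the two types."""
    return max(a, b, key=PROMOTION_CHAIN.index)
```

With this model, `resolve_common_type("numeric", "float8")` yields `"float8"` — exact plus inexact gives inexact, as the spec mandates — while moving leftward (e.g. float8 to numeric) never happens implicitly.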
Re: [HACKERS] Threaded Sorting
Hans-Jürgen Schönig [EMAIL PROTECTED] writes: Did anybody think about threaded sorting so far? Assume an SMP machine. In the case of building an index or in the case of sorting a lot of data there is just one backend working. Therefore just one CPU is used. What about starting a thread for every temporary file being created? This way CREATE INDEX could use many CPUs. In my experience, once you have enough data to force a temp file to be used, the sort algorithm is I/O bound anyway. Throwing more CPUs at it won't help much. regards, tom lane
Re: [HACKERS] Threaded Sorting
I wouldn't hold your breath for any form of threading. Since PostgreSQL is process based, you might consider having a pool of sort processes which address this, but I doubt you'll get anywhere talking about threads here. Greg On Fri, 2002-10-04 at 02:46, Hans-Jürgen Schönig wrote: Did anybody think about threaded sorting so far? Assume an SMP machine. In the case of building an index or in the case of sorting a lot of data there is just one backend working. Therefore just one CPU is used. What about starting a thread for every temporary file being created? This way CREATE INDEX could use many CPUs. Maybe this is worth thinking about because it will speed up huge databases and enterprise level computing. Best regards, Hans-Jürgen Schönig
Re: [HACKERS] Bad rules
Steve King [EMAIL PROTECTED] writes: I am using postgres 7.2, and have rule on a table which causes a notify if an insert/update/delete is performed on the table. The table is very very small. When performing a simple (very simple) update on the table this takes about 3 secs, when I remove the rule it is virtually instantaneous. The rest of the database seems to perform fine, have you any ideas or come across this before?? Let's see the rule exactly? NOTIFY per se is not slow in my experience. (One thing to ask: have you done a VACUUM FULL on pg_listener in recent memory? Heavy use of LISTEN/NOTIFY does tend to bloat that table if you don't keep after it with VACUUM.) regards, tom lane
Re: [HACKERS] Correlation in cost_index()
On Thu, 03 Oct 2002 14:50:00 -0400, Tom Lane [EMAIL PROTECTED] wrote: indexCorrelation is calculated by dividing the correlation of the first index column by the number of index columns. Yeah, I concluded later that that was bogus. I've been thinking of just using the correlation of the first index column and ignoring the rest; that would not be great, but it's probably better than what's there. Have you got a better formula? Unfortunately not. I think such a formula does not exist for the information we have. What we'd need is a notion of correlation of the nth (n > 1) index column for constant values of the first n-1 index columns; or a combined correlation for the first n index columns (1 < n <= number of index columns). I try to understand the problem with the help of use cases. [ Jump to the end, if this looks too long-winded. ] 1) Have a look at invoice items with an index on (fyear, invno, itemno). Invoice numbers start at 1 for each financial year, item numbers start at 1 for each invoice. In a typical scenario correlations for fyear, (fyear, invno), and (fyear, invno, itemno) are close to 1; invno correlation is expected to be low; and itemno looks totally chaotic to the analyzer. When we SELECT * FROM item WHERE fyear = 2001 AND invno > 1000, dividing the correlation of the first column by the number of columns gives 1/3, which has nothing to do with what we want. (And then the current implementation of cost_index() squares this and gets 1/9, which is even farther from the truth.) Just using the correlation of the first index column seems right here. 2) OTOH consider bookkeeping with entries identified by (fyear, account, entryno). Again fyear has a correlation near 1. For account we can expect something near 0, and entryno has a distribution comparable to invno in the first use case, i.e. starting at 1 for each year. 
SELECT * FROM entry WHERE fyear = 2001 AND account = whatever Taking the correlation of fyear would imply that the tuples we are looking for are close to each other, which again can turn out to be wrong. So what do we know now? Even less than before :-( I have just one idea that might help a bit: If the correlation of the first index column is near +/-1, cost_index() should not use baserel->pages, but baserel->pages * selectivity of the first column. Servus Manfred
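Manfred's idea can be sketched numerically. The snippet below is an illustration only — the function and threshold names are made up for the example, not planner internals — computing the Pearson correlation between a column's values and their physical row positions, and scaling the page count by the first column's selectivity only when that correlation is near +/-1:

```python
# Sketch of the proposal above: if the first index column tracks physical
# row order closely, charge only the matching fraction of the table's
# pages to the scan. Names and the 0.95 cutoff are illustrative.

def correlation(values):
    """Pearson correlation between values and their physical positions."""
    n = len(values)
    positions = range(n)
    mean_v = sum(values) / n
    mean_p = (n - 1) / 2
    cov = sum((v - mean_v) * (p - mean_p) for v, p in zip(values, positions))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_p = sum((p - mean_p) ** 2 for p in positions)
    if var_v == 0 or var_p == 0:
        return 0.0
    return cov / (var_v * var_p) ** 0.5

def effective_pages(pages, first_col_corr, first_col_selectivity):
    """Scale the page estimate only when the column is nearly ordered."""
    if abs(first_col_corr) > 0.95:  # "near +/-1"; the cutoff is arbitrary
        return pages * first_col_selectivity
    return pages
```

A perfectly ordered column (like fyear in the examples) gives a correlation of 1.0 and so gets the reduced page estimate; a chaotic column (like account) leaves the full page count in place.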
Re: [HACKERS] Threaded Sorting
On Fri, 2002-10-04 at 09:40, Hans-Jürgen Schönig wrote: I had a brief look at the code used for sorting. It is very well documented so maybe it is worth thinking about a parallel algorithm. When talking about threads: A pool of processes for sorting? Maybe this could be useful but I doubt it is the best solution to avoid overhead. Somewhere in the TODO it says that there will be experiments with a threaded backend. This makes me think that threads are not a big no-no. Hans That was a fork IIRC. Threading is not used in baseline PostgreSQL, nor are there any such plans that I'm aware of. People from time to time ask about threads for this or that and are always told what I'm telling you. The use of threads leads to portability issues, not to mention PostgreSQL is entirely built around the process model. Tom is right to dismiss the notion of adding additional CPUs to something that is already I/O bound; however, the concept itself should not be dismissed. Applying multiple CPUs to a sort operation is well accepted and understood technology. At this point, perhaps Tom or one of the other core developers having insight in this area would be willing to address how readily such a mechanism could be put in place. Also, don't be so fast to dismiss what the process model can do. There is no reason to believe that a process pool would not be able to perform wonderful things if implemented properly. Basically, the notion would be that the backend processing the query would solicit assistance from the sort pool if one or more processes were available. At that point, several methods could be employed to divide the work. Some form of threshold would also have to be created to prevent the pool from being used when a single backend is capable of addressing the need. Basically the idea is, you only have the pool assist with large tuple counts, and then only when resources are available from within the pool. 
By doing this, you avoid additional overhead for small sort efforts and gain when it matters the most. Regards, Greg
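The pool idea being discussed can be sketched roughly as follows. This is an illustration, not backend code: the pool size, the threshold, and the division into ranges are all made-up knobs, and Python's multiprocessing stands in for a pool of PostgreSQL worker processes:

```python
# Rough sketch of the sort-pool idea above: below a threshold the
# "backend" sorts alone; above it, the input is split into ranges,
# each range is sorted in a separate worker process, and one process
# merges the sorted runs. All knob values are illustrative.
from heapq import merge
from multiprocessing import Pool

SORT_POOL_SIZE = 4
TUPLE_COUNT_THRESHOLD = 10_000  # below this, don't bother the pool

def parallel_sort(data):
    if len(data) < TUPLE_COUNT_THRESHOLD:
        return sorted(data)  # single-backend path: whole data set at once
    chunk = (len(data) + SORT_POOL_SIZE - 1) // SORT_POOL_SIZE
    ranges = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with Pool(SORT_POOL_SIZE) as pool:
        runs = pool.map(sorted, ranges)   # workers sort their ranges
    return list(merge(*runs))             # one process merges the runs
```

The threshold implements Greg's point directly: small sorts never pay the coordination overhead, and the pool is only solicited when the tuple count is large enough for the parallelism to matter.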
Re: [HACKERS] Threaded Sorting
On Fri, 2002-10-04 at 10:37, Hans-Jürgen Schönig wrote: My concern was that a process model might be a bit too slow for that but if we had processes in memory this would be a wonderful thing. Yes, that's the point of having a pool. The idea is not only do you avoid process creation and destruction, which is notoriously expensive on many platforms, but the processes would sit idle until signaled to begin working on their assigned sort operations. Ideally, these would be configurable options, including items such as: pool size (maximum number of processes in the pool), max concurrency level (maximum number of processes from the pool which can contribute to a single backend), and tuple count threshold (the threshold which triggers solicitation of assistance from the sort pool). Using it for small amounts of data is pretty useless - I totally agree but when it comes to huge amounts of data it can be useful. It is a mechanism for huge installations with a lot of data. Hans Agreed. Thus the importance of being able to specify some type of meaningful threshold. Any of the core developers wanna chime in here on this concept? Greg
Re: [HACKERS] Correlation in cost_index()
On Thu, 3 Oct 2002 10:45:08 -0600 (MDT), scott.marlowe [EMAIL PROTECTED] wrote: effective cache size is the default (i.e. commented out) The default is 1000, meaning ca. 8 MB, which seems to be way too low. If your server is (almost) exclusively used by Postgres, try setting it to represent most of your OS cache (as reported by free on Linux). Otherwise you have to estimate the fraction of the OS cache that gets used by Postgres. I'm still trying to get a feeling for how these settings play together, so I'd be grateful if you report back the effects this has on your cost estimates. Caveat: effective_cache_size is supposed to be the number of cache pages available to one query (or is it one scan?). So if you have several concurrent queries (or complex queries with several scans), you should choose a lower value. OTOH if most of your queries operate on the same data, one query could benefit from pages cached by other queries ... You have to experiment a little. Servus Manfred
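The arithmetic behind Manfred's advice is simple: effective_cache_size is counted in 8 KB buffer pages, so the default of 1000 pages is about 8 MB. A small helper makes the conversion from an OS cache figure explicit (the 512 MB figure below is a hypothetical example, not real `free` output):

```python
# effective_cache_size is measured in 8 KB pages, so the default of
# 1000 pages is ~8 MB. This helper converts an OS cache size (bytes)
# into a page count, optionally scaled by the fraction of the cache
# Postgres is expected to use. The sizes used are hypothetical.
PAGE_SIZE = 8 * 1024  # PostgreSQL buffer page size in bytes

def cache_pages(os_cache_bytes, postgres_fraction=1.0):
    return int(os_cache_bytes * postgres_fraction) // PAGE_SIZE
```

For a machine reporting ~512 MB cached and dedicated to Postgres, `cache_pages(512 * 1024 * 1024)` gives 65536 pages; if Postgres is expected to get only half the cache, pass `postgres_fraction=0.5`.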
Re: [HACKERS] Threaded Sorting
On Fri, 2002-10-04 at 12:26, Bruce Momjian wrote: Added to TODO: * Allow sorting to use multiple work directories Why wouldn't that fall under the table space effort??? Greg
Re: [HACKERS] Threaded Sorting
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Bingo! Want to increase sorting performance, give it more I/O bandwidth, and it will take 1/100th of the time to do threading. Added to TODO: * Allow sorting to use multiple work directories Yeah, I like that. Actually it should apply to all temp files, not only sorting. A crude hack would be to allow there to be multiple pg_temp_NNN/ subdirectories (read symlinks) in a database, and then the code would automatically switch among these. TODO updated: * Allow sorting/temp files to use multiple work directories Tom, what temp files do we use that aren't for sorting; I forgot. Probably a cleaner idea would be to somehow integrate this with tablespace management --- if you could mark some tablespaces as intended for temp stuff, the system could round-robin among those as it creates temp files and/or temp tables. Yes, tablespaces would be the place for this. -- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
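The round-robin scheme Tom sketches — cycling temp file creation across several work directories — is tiny in concept. This is an illustrative sketch only (the class and directory names are invented for the example, not backend code):

```python
# Minimal sketch of round-robin temp-file placement across several
# pg_temp_NNN-style work directories, as proposed above. The class
# and directory names are illustrative, not PostgreSQL code.
from itertools import cycle

class TempFileAllocator:
    def __init__(self, directories):
        self._next_dir = cycle(directories)  # endless round-robin

    def next_path(self, filename):
        return f"{next(self._next_dir)}/{filename}"
```

Each successive temp file lands in the next directory, so if the directories are symlinks to different spindles, concurrent sorts spread their I/O across devices — which is exactly the bandwidth argument Bruce makes.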
[HACKERS] ALTER TABLE ... ADD COLUMN
Hello hackers, I'm thinking about ALTER TABLE ... ADD COLUMN working properly when child tables already contain the column. I have two proposals. First one: There are two cases: one when specifying ALTER TABLE ONLY, and the other when specifying recursive (not ONLY). In the first (ONLY) case, child tables are checked for existence of the column. If any child does not contain the column, or it has a different atttypid or atttypmod (any other field that should be checked?), the request is aborted. Else the column is added in the parent only, and the child attributes have their attinhcount incremented. In the second case, child tables are checked for existence of the column. If the column doesn't exist, the procedure is called recursively. If the column exists but has a different atttypid or atttypmod, the request is aborted. Else, the attribute has its attinhcount incremented and attislocal reset. There are two differences between these two cases: - ONLY does not add a column in children. This seems natural. Obviously the scenario where this can lead to trouble produces an error (inexistent attr in child). - ONLY does not touch attislocal in children. The second may seem odd, but consider the following scenario: CREATE TABLE p1 (f1 int); CREATE TABLE p2 (f2 int); CREATE TABLE c (f1 int) INHERITS (p1, p2); In this case, c.f1.attislocal is true. Now suppose the user wants to create p2.f1. If the recursive version is used, attislocal will be reset and the scenario will be equivalent to CREATE TABLE p1 (f1 int); CREATE TABLE p2 (f1 int, f2 int); CREATE TABLE c () INHERITS (p1, p2); but the natural way would be CREATE TABLE p1 (f1 int); CREATE TABLE p2 (f1 int, f2 int); CREATE TABLE c (f1 int) INHERITS (p1, p2); which is what the ONLY case does. Second proposal: Another way of doing this would be resetting attislocal iff attinhcount is exactly zero. 
This assumes that if attislocal is true and attinhcount is not zero, the user wants the column as locally defined and we should keep it that way. On the other hand, if a child table already has the column, there's no way to push the definition up the inheritance tree. What do you think? I actually have a patch ready that implements the first proposal; the second one, while simpler conceptually and in terms of code, occurred to me just as I was writing this, and has an issue of completeness. -- Alvaro Herrera (alvherre[a]atentus.com) Things are good or bad according to the opinion we form of them (Lisias)
[HACKERS] New lock types
Hello hackers, I'm looking at implementing the btree reorganizer described in On-line reorganization of sparsely-..., ACM SIGMOD proceedings 1996, by Zou and Salzberg. It seems to me I'll have to add a number of lock types to the lock manager. Does that bother you? -- Alvaro Herrera (alvherre[a]atentus.com) On the other flipper, one wrong move and we're Fatal Exceptions (T.U.X.: Term Unit X - http://www.thelinuxreview.com/TUX/)
Re: [HACKERS] ALTER TABLE ... ADD COLUMN
Alvaro Herrera [EMAIL PROTECTED] writes: I'm thinking about ALTER TABLE ... ADD COLUMN working properly when child tables already contain the column. There are two cases: one when specifying ALTER TABLE ONLY, and the other when specifying recursive (not ONLY). I think ALTER TABLE ONLY ... ADD COLUMN is nonsensical and should be rejected out of hand. That solves that part of the problem easily ;-) The comparison case in my mind is ALTER TABLE ONLY ... RENAME COLUMN, which has to be rejected. (Surely you're not going to say we should support this by allowing the parent column to become associated with some other child columns than before...) ALTER ONLY ADD COLUMN cannot add any functionality that's not perfectly well available with ALTER ADD COLUMN. In the second case, child tables are checked for existence of the column. If the column doesn't exist, the procedure is called recursively. If the column exists but has a different atttypid or atttypmod, the request is aborted. Else, the attribute has its attinhcount incremented and attislocal reset. I don't like resetting attislocal here. If you do that, then DROPping the parent column doesn't return you to the prior state. I think I gave this example before, but consider CREATE TABLE p (f1 int); CREATE TABLE c (f2 int) INHERITS (p); ALTER TABLE p ADD COLUMN f2 int; -- sometime later, realize that the ADD was a mistake, so: ALTER TABLE p DROP COLUMN f2; If you reset attislocal then the ending state will be that c.f2 is gone. That seems clearly wrong to me. The second may seem odd, but consider the following scenario: CREATE TABLE p1 (f1 int); CREATE TABLE p2 (f2 int); CREATE TABLE c (f1 int) INHERITS (p1, p2); In this case, c.f1.attislocal is true. Now suppose the user wants to create p2.f1. If the recursive version is used, attislocal will be reset and the scenario will be equivalent to CREATE TABLE p1 (f1 int); CREATE TABLE p2 (f1 int, f2 int); CREATE TABLE c () INHERITS (p1, p2); ... which is wrong also. 
c had a local definition before and should still, IMHO. What's the argument for taking away its local definition? regards, tom lane
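The bookkeeping Tom argues for can be captured in a toy model. This sketch mirrors the two catalog fields (attinhcount, attislocal) but is otherwise invented for illustration — it is not backend code. ADD COLUMN on the parent only bumps the child's inheritance count; DROP decrements it, and the column survives if it is still locally defined:

```python
# Toy model of the attinhcount/attislocal bookkeeping discussed above.
# ADD COLUMN on a parent increments the matching child column's
# inheritance count but leaves attislocal untouched, so a later DROP on
# the parent restores the prior state. Illustrative, not backend code.

class Column:
    def __init__(self, islocal=True, inhcount=0):
        self.attislocal = islocal    # column has a local definition
        self.attinhcount = inhcount  # number of parents providing it

def add_inherited(col):
    col.attinhcount += 1             # parent now also provides the column

def drop_inherited(col):
    col.attinhcount -= 1
    # column survives if still inherited elsewhere or locally defined
    return col.attinhcount > 0 or col.attislocal

# Tom's example: CREATE TABLE c (f2 int) INHERITS (p) -- f2 is local
f2 = Column(islocal=True, inhcount=0)
add_inherited(f2)                    # ALTER TABLE p ADD COLUMN f2 int
f2_survives = drop_inherited(f2)     # ALTER TABLE p DROP COLUMN f2
```

Because attislocal was never reset, the DROP leaves c.f2 in place with inhcount back at zero — the prior state — which is exactly the behavior Tom says resetting attislocal would break.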
Re: [HACKERS] New lock types
Alvaro Herrera [EMAIL PROTECTED] writes: I'm looking at implementing the btree reorganizer described in On-line reorganization of sparsely-..., ACM SIGMOD proceedings 1996, by Zou and Salzberg. It seems to me I'll have to add some amount of lock types in the lock manager. Does that bother you? Such as? regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I wrote: ... most file systems can't process fsyncs simultaneously with other writes, so those writes block because the file system grabs its own internal locks. tom lane replies: Oh? That would be a serious problem, but I've never heard that asserted before. Please provide some evidence. Well, I'm basing this on past empirical testing and having read some man pages that describe fsync under this exact scenario. I'll have to write a test to prove this one way or another. I'll also try and look into the Linux/BSD source for the common file systems used for PostgreSQL. On a filesystem that does have that kind of problem, can't you avoid it just by using O_DSYNC on the WAL files? Then there's no need to call fsync() at all, except during checkpoints (which actually issue sync() not fsync(), anyway). No, they're not exactly the same thing. Consider the synchronous case:

Process A                       File System
---------                       -----------
Writes index buffer             idling
Writes entry to log cache       idling
Writes another index buffer     idling
Writes another log entry        idling
Writes tuple buffer             idling
Writes another log entry        idling
Index scan, large table sort    idling
Writes more buffers             idling
Writes more log entries         idling
Commit:
File Write Log Entry            Write to cache
File Write Log Entry            Write to cache
File Write Log Entry            Write to cache
File Write Log Entry            Write to cache
Write Commit Log Entry          Write to cache
Call fsync                      Write all buffers to device
DONE

In this case, Process A is waiting for all the buffers to be written at the end of the transaction. With asynchronous I/O this becomes:

Process A                       File System
---------                       -----------
Writes entry to log cache       Queue up write; move head to cylinder
Writes another index buffer     Write log entry to media
Writes another log entry        Immediate write (head still on cylinder)
Writes tuple buffer             Queue up write; move head to cylinder
Index scan, large table sort    Write log entry to media
(more buffer and log writes)    (log entries queued and written to media)
Commit:
Write Commit Log Entry          Immediate write (head still on cylinder)
DONE

Effectively the real work of writing the cache is done while the CPU for the process is busy doing index scans, sorts, etc. With the WAL log on another device and SCSI I/O, the log writing should almost always be done except for the final commit write. Whether by threads or multiple processes, there is the same contention on the file through multiple writers. The file system can decide to reorder writes before they start but not after. If a write comes after an fsync starts, it will have to wait on that fsync. AFAICS we cannot allow the filesystem to reorder writes of WAL blocks, on safety grounds (we want to be sure we have a consistent WAL up to the end of what we've written). Even if we can allow some reordering when a single transaction puts out a large volume of WAL data, I fail to see where any large gain is going to come from. We're going to be issuing those writes sequentially and that ought to match the disk layout about as well as can be hoped anyway. My comment was applying to reads and writes of other processes, not the WAL log. In my original email, recall I mentioned using the O_APPEND open flag, which will ensure that all log entries are done sequentially. 
Likewise a given process's writes can NEVER be reordered if they are submitted synchronously, as is done in the calls to flush the log as well as the dirty pages in the buffer in the current code. We do not fsync buffer pages; in fact a transaction commit doesn't write buffer pages at all. I think the above is just a misunderstanding of what's really
Re: [HACKERS] ALTER TABLE ... ADD COLUMN
On Fri, Oct 04, 2002 at 05:57:02PM -0400, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: I'm thinking about ALTER TABLE ... ADD COLUMN working properly when child tables already contain the column. There are two cases: one when specifying ALTER TABLE ONLY, and other when specifying recursive (not ONLY). I don't like resetting attislocal here. If you do that, then DROPping the parent column doesn't return you to the prior state. I think I gave this example before, but consider Huh, I don't know where I got the idea you were (or someone else was?) in the position that attislocal should be reset. I'll clean everything up and submit the patch I had originally made. -- Alvaro Herrera (alvherre[a]atentus.com) Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end. (2nd Commandment for C programmers)
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes: It looks to me like BufferAlloc will simply result in a call to BufferReplace -> smgrblindwrt -> write for md storage manager objects. This means that a process will block while the write of dirty cache buffers takes place. I think Tom was suggesting that when a buffer is written out, the write() call only pushes the data down into the filesystem's buffer -- which is free to then write the actual blocks to disk whenever it chooses to. In other words, the write() returns, the backend process can continue with what it was doing, and at some later time the blocks that we flushed from the Postgres buffer will actually be written to disk. So in some sense of the word, that I/O is asynchronous. Cheers, Neil -- Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC
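The distinction Neil draws is observable from user space. The sketch below demonstrates it with Python's thin wrappers over the same system calls: write() returns once the data is in the OS cache, while fsync() blocks until the cached blocks have been pushed to the device. (The file name is a throwaway temp file; timings are deliberately not asserted since they vary by machine.)

```python
# Demonstration of the write()-vs-fsync() distinction discussed above:
# os.write returns after copying data into the OS cache; os.fsync
# blocks until the cached blocks are forced out to stable storage.
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"wal record")  # returns once data is in the OS cache
    os.fsync(fd)                 # blocks until the blocks hit the disk
    with open(path, "rb") as f:
        data = f.read()          # read back what was durably written
finally:
    os.close(fd)
    os.remove(path)
```

A WAL-style writer thus pays the device latency only at the fsync (commit) point; all the earlier write() calls are "asynchronous" in the sense Neil describes, as long as the OS cache has room to absorb them.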
Re: [HACKERS] Threaded Sorting
Greg Copeland wrote: -- Start of PGP signed section. On Fri, 2002-10-04 at 12:26, Bruce Momjian wrote: Added to TODO: * Allow sorting to use multiple work directories Why wouldn't that fall under the table space effort??? Yes, but we make it a separate item so we are sure it is implemented as part of tablespaces. -- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
[HACKERS] fsync exlusive lock evidence WAS: Potential Large Performance Gain in WAL synching
After some research I still hold that fsync blocks, at least on FreeBSD. Am I missing something? Here's the evidence: Code from: /usr/src/sys/syscalls/vfs_syscalls

int
fsync(p, uap)
	struct proc *p;
	struct fsync_args /* { syscallarg(int) fd; } */ *uap;
{
	register struct vnode *vp;
	struct file *fp;
	vm_object_t obj;
	int error;

	if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
		return (error);
	vp = (struct vnode *)fp->f_data;
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
	if (VOP_GETVOBJECT(vp, &obj) == 0)
		vm_object_page_clean(obj, 0, 0, 0);
	if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
	    vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
	    bioops.io_fsync)
		error = (*bioops.io_fsync)(vp);
	VOP_UNLOCK(vp, 0, p);
	return (error);
}

Notice the calls to vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p) and VOP_UNLOCK(vp, 0, p) surrounding the call to VOP_FSYNC. From the man pages for VOP_UNLOCK: HEADER STUFF . VOP_LOCK(struct vnode *vp, int flags, struct proc *p); int VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p); int VOP_ISLOCKED(struct vnode *vp, struct proc *p); int vn_lock(struct vnode *vp, int flags, struct proc *p); DESCRIPTION These calls are used to serialize access to the filesystem, such as to prevent two writes to the same file from happening at the same time. 
The arguments are: vp - the vnode being locked or unlocked; flags - one of the lock request types:

	LK_SHARED       Shared lock
	LK_EXCLUSIVE    Exclusive lock
	LK_UPGRADE      Shared-to-exclusive upgrade
	LK_EXCLUPGRADE  First shared-to-exclusive upgrade
	LK_DOWNGRADE    Exclusive-to-shared downgrade
	LK_RELEASE      Release any type of lock
	LK_DRAIN        Wait for all lock activity to end

The lock type may be or'ed with these lock flags:

	LK_NOWAIT       Do not sleep to wait for lock
	LK_SLEEPFAIL    Sleep, then return failure
	LK_CANRECURSE   Allow recursive exclusive lock
	LK_REENABLE     Lock is to be reenabled after drain
	LK_NOPAUSE      No spinloop

The lock type may be or'ed with these control flags:

	LK_INTERLOCK    Specify when the caller already has a simple lock (VOP_LOCK will unlock the simple lock after getting the lock)
	LK_RETRY        Retry until locked
	LK_NOOBJ        Don't create object

p - process context to use for the locks. Kernel code should use vn_lock() to lock a vnode rather than calling VOP_LOCK() directly.
Re: [HACKERS] Potential Large Performance Gain in WAL synching
On Fri, 2002-10-04 at 18:03, Neil Conway wrote: Curtis Faith [EMAIL PROTECTED] writes: It looks to me like BufferAlloc will simply result in a call to BufferReplace -> smgrblindwrt -> write for md storage manager objects. This means that a process will block while the write of dirty cache buffers takes place. I think Tom was suggesting that when a buffer is written out, the write() call only pushes the data down into the filesystem's buffer -- which is free to then write the actual blocks to disk whenever it chooses to. In other words, the write() returns, the backend process can continue with what it was doing, and at some later time the blocks that we flushed from the Postgres buffer will actually be written to disk. So in some sense of the word, that I/O is asynchronous. Isn't that true only as long as there is buffer space available? When there isn't buffer space available, it seems the window for blocking comes into play? So I guess you could say it is optimally asynchronous and worst-case synchronous. I think the worst-case situation is the one he's trying to address. At least that's how I interpret it. Greg signature.asc Description: This is a digitally signed message part
Re: numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)
Bruce Momjian [EMAIL PROTECTED] writes: Yes, I realize it is during parsing. I was just wondering if making constants coming in from the parser NUMERIC is a performance hit? Offhand I don't see a reason to think that coercing to NUMERIC (and then something else) is slower than coercing to FLOAT (and then something else). Yeah, you would come out a little behind when the final destination type is FLOAT, but on the other hand you win a little when it's NUMERIC. I see no reason to think this isn't a wash overall. I see in gram.y that FCONST comes in as a Float so I don't even see where we make it NUMERIC. It's make_const in parse_node.c that has the first contact with the grammar's output. Up to that point the value's just a string, really. The grammar does *not* coerce it to float8. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Threaded Sorting
I see. I just always assumed that it would be done as part of the table space effort as it's such a de facto feature. I am curious as to why no one has commented on the other rather obvious performance enhancement which was brought up in this thread. Allowing for parallel sorting seems rather obvious and is a common enhancement, yet it seems to have been completely dismissed as people seem to be fixated on I/O. Go figure. Greg On Fri, 2002-10-04 at 14:02, Bruce Momjian wrote: Greg Copeland wrote: -- Start of PGP signed section. On Fri, 2002-10-04 at 12:26, Bruce Momjian wrote: Added to TODO: * Allow sorting to use multiple work directories Why wouldn't that fall under the table space effort??? Yes, but we make it a separate item so we are sure that is implemented as part of tablespaces. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org signature.asc Description: This is a digitally signed message part
Re: [HACKERS] Threaded Sorting
Greg Copeland wrote: -- Start of PGP signed section. I see. I just always assumed that it would be done as part of the table space effort as it's such a de facto feature. I am curious as to why no one has commented on the other rather obvious performance enhancement which was brought up in this thread. Allowing for parallel sorting seems rather obvious and is a common enhancement, yet it seems to have been completely dismissed as people seem to be fixated on I/O. Go figure. We think we are fixated on I/O because we think that is where the delay is. Is there a reason we shouldn't think that? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Neil Conway [EMAIL PROTECTED] writes: Curtis Faith [EMAIL PROTECTED] writes: It looks to me like BufferAlloc will simply result in a call to BufferReplace -> smgrblindwrt -> write for md storage manager objects. This means that a process will block while the write of dirty cache buffers takes place. I think Tom was suggesting that when a buffer is written out, the write() call only pushes the data down into the filesystem's buffer -- which is free to then write the actual blocks to disk whenever it chooses to. Exactly --- in all Unix systems that I know of, a write() is asynchronous unless one takes special pains (like opening the file with O_SYNC). Pushing the data from userspace to the kernel disk buffers does not count as I/O in my mind. I am quite concerned about Curtis' worries about fsync, though. There's not any fundamental reason for fsync to block other operations, but that doesn't mean that it's been implemented reasonably everywhere :-(. We need to take a look at that. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I resent this since it didn't seem to get to the list. After some research I still hold that fsync blocks, at least on FreeBSD. Am I missing something? Here's the evidence. Code from /usr/src/sys/syscalls/vfs_syscalls:

int
fsync(p, uap)
	struct proc *p;
	struct fsync_args /* { syscallarg(int) fd; } */ *uap;
{
	register struct vnode *vp;
	struct file *fp;
	vm_object_t obj;
	int error;

	if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
		return (error);
	vp = (struct vnode *)fp->f_data;
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
	if (VOP_GETVOBJECT(vp, &obj) == 0)
		vm_object_page_clean(obj, 0, 0, 0);
	if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
	    vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
	    bioops.io_fsync)
		error = (*bioops.io_fsync)(vp);
	VOP_UNLOCK(vp, 0, p);
	return (error);
}

Notice the calls to:

	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
	...
	VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC. From the man pages for VOP_UNLOCK:

HEADER STUFF ...

int
VOP_LOCK(struct vnode *vp, int flags, struct proc *p);

int
VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);

int
VOP_ISLOCKED(struct vnode *vp, struct proc *p);

int
vn_lock(struct vnode *vp, int flags, struct proc *p);

DESCRIPTION
     These calls are used to serialize access to the filesystem, such as to
     prevent two writes to the same file from happening at the same time.

     The arguments are:

     vp      the vnode being locked or unlocked

     flags   One of the lock request types:

             LK_SHARED       Shared lock
             LK_EXCLUSIVE    Exclusive lock
             LK_UPGRADE      Shared-to-exclusive upgrade
             LK_EXCLUPGRADE  First shared-to-exclusive upgrade
             LK_DOWNGRADE    Exclusive-to-shared downgrade
             LK_RELEASE      Release any type of lock
             LK_DRAIN        Wait for all lock activity to end

             The lock type may be or'ed with these lock flags:

             LK_NOWAIT       Do not sleep to wait for lock
             LK_SLEEPFAIL    Sleep, then return failure
             LK_CANRECURSE   Allow recursive exclusive lock
             LK_REENABLE     Lock is to be reenabled after drain
             LK_NOPAUSE      No spinloop

             The lock type may be or'ed with these control flags:

             LK_INTERLOCK    Specify when the caller already has a simple
                             lock (VOP_LOCK will unlock the simple lock
                             after getting the lock)
             LK_RETRY        Retry until locked
             LK_NOOBJ        Don't create object

     p       process context to use for the locks

     Kernel code should use vn_lock() to lock a vnode rather than calling
     VOP_LOCK() directly.

---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] Small patch for PL/Perl Misbehavior with
On Thu, 3 Oct 2002, Tom Lane wrote: That seems a little weird. Does Perl really expect people to do that (ie, is it a documented part of some API)? I wonder whether there is some other action that we're supposed to take instead, but are missing... Not that I know of: clearing out the $@ variable manually was just my way of getting around the problem in practice. I think the underlying issue may be tied to the fact that it's running a function generated within a Safe Module, but I'm not enough of a Perl Guru to say anything more decisive than that. If this is what we'd have to do, I think a better way would be

	perlerrmsg = pstrdup(SvPV(ERRSV, PL_na));
	sv_setpv(perl_get_sv("@", FALSE), "");
	elog(ERROR, "plperl: error from function: %s", perlerrmsg);

Splitting the ERROR into a NOTICE with the useful info and an ERROR without any isn't real good, because the NOTICE could get dropped on the floor (either because of min_message_level or a client that just plain loses notices). Yeah, that's a cleaner solution. I take it anything pstrdup'd by PostgreSQL gets freed automatically by the backend? (I wasn't familiar enough with the backend to know how to ask for memory confident in the understanding that it would at some point be freed. ;) Jw. -- John Worsley - [EMAIL PROTECTED] http://www.openvein.com/ ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [GENERAL] Small patch for PL/Perl Misbehavior with Runtime Error Reporting
John Worsley [EMAIL PROTECTED] writes: Yeah, that's a cleaner solution. I take it anything pstrdup'd by PostgreSQL gets freed automatically by the backend? Pretty much. The only situation where it wouldn't be is if CurrentMemoryContext is pointing at TopMemoryContext or another long-lived context --- but we are *very* chary about how much code we allow to run with such a setting. User-definable functions can safely assume that palloc'd space will live only long enough for them to return something to their caller in it. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes: After some research I still hold that fsync blocks, at least on FreeBSD. Am I missing something? Here's the evidence: [ much snipped ] vp = (struct vnode *)fp-f_data; vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p); Hm, I take it a vnode is what's usually called an inode, ie the unique identification data for a specific disk file? This is kind of ugly in general terms but I'm not sure that it really hurts Postgres. In our present scheme, the only files we ever fsync() are WAL log files, not data files. And in normal operation there is only one WAL writer at a time, and *no* WAL readers. So an exclusive kernel-level lock on a WAL file while we fsync really shouldn't create any problem for us. (Unless this indirectly blocks other operations that I'm missing?) As I commented before, I think we could do with an extra process to issue WAL writes in places where they're not in the critical path for a foreground process. But that seems to be orthogonal from this issue. regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Threaded Sorting
Well, that's why I was soliciting developer input as to exactly what goes on with sorts. From what I seem to be hearing, all sorts result in temp files being created and/or used. If that's the case then yes, I can understand the fixation. Of course that opens the door for it being a horrible implementation. If that's not the case, then parallel sorts still seem like a rather obvious route to look into. Greg On Fri, 2002-10-04 at 14:15, Bruce Momjian wrote: Greg Copeland wrote: -- Start of PGP signed section. I see. I just always assumed that it would be done as part of the table space effort as it's such a de facto feature. I am curious as to why no one has commented on the other rather obvious performance enhancement which was brought up in this thread. Allowing for parallel sorting seems rather obvious and is a common enhancement, yet it seems to have been completely dismissed as people seem to be fixated on I/O. Go figure. We think we are fixated on I/O because we think that is where the delay is. Is there a reason we shouldn't think that? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 signature.asc Description: This is a digitally signed message part
Re: [HACKERS] Threaded Sorting
Bruce Momjian [EMAIL PROTECTED] writes: Tom, what temp files do we use that aren't for sorting; I forgot. MATERIALIZE plan nodes are the only thing I can think of offhand that uses a straight temp file. But ISTM that if this makes sense for our internal temp files, it makes sense for user-created temp tables as well. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Threaded Sorting
Greg Copeland wrote: -- Start of PGP signed section. Well, that's why I was soliciting developer input as to exactly what goes on with sorts. From what I seem to be hearing, all sorts result in temp files being created and/or used. If that's the case then yes, I can understand the fixation. Of course that opens the door for it being a horrible implementation. If that's not the case, then parallel sorts still seem like a rather obvious route to look into. We use tape sorts, ala Knuth, meaning we sort in memory as much as possible, but when there is more data than fits in memory, rather than swapping, we write to temp files then merge the temp files (aka tapes). -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Threaded Sorting
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Tom, what temp files do we use that aren't for sorting; I forgot. MATERIALIZE plan nodes are the only thing I can think of offhand that uses a straight temp file. But ISTM that if this makes sense for our internal temp files, it makes sense for user-created temp tables as well. Yes, I was thinking that, but of course, those are real tables, rather than just files. Not sure how clean it will be to mix those in the same directory. We haven't in the past. Is it a good idea? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
It appears the fsync problem is pervasive. Here's Linux 2.4.19's version from fs/buffer.c:

	lock ->     down(&inode->i_sem);
	            ret = filemap_fdatasync(inode->i_mapping);
	            err = file->f_op->fsync(file, dentry, 1);
	            if (err && !ret)
	                ret = err;
	            err = filemap_fdatawait(inode->i_mapping);
	            if (err && !ret)
	                ret = err;
	unlock ->   up(&inode->i_sem);

But this is probably not a big factor as you outline below because the WALWriteLock is causing the same kind of contention. Tom Lane wrote: This is kind of ugly in general terms but I'm not sure that it really hurts Postgres. In our present scheme, the only files we ever fsync() are WAL log files, not data files. And in normal operation there is only one WAL writer at a time, and *no* WAL readers. So an exclusive kernel-level lock on a WAL file while we fsync really shouldn't create any problem for us. (Unless this indirectly blocks other operations that I'm missing?) I hope you're right but I see some very similar contention problems in the case of many small transactions because of the WALWriteLock. Assume Transaction A writes a lot of buffers and XLog entries, so the commit forces a relatively lengthy fsync. Transactions B through E block not on the kernel lock from fsync but on the WALWriteLock. When A finishes the fsync and subsequently releases the WALWriteLock, B unblocks and gets the WALWriteLock for the fsync of its flush. C blocks on the WALWriteLock waiting to write its XLOG_XACT_COMMIT. B releases and now C writes its XLOG_XACT_COMMIT. There now seems to be a lot of contention on the WALWriteLock. This is a shame for a system that has no locking at the logical level and therefore seems like it could be very, very fast and offer incredible concurrency. As I commented before, I think we could do with an extra process to issue WAL writes in places where they're not in the critical path for a foreground process. But that seems to be orthogonal from this issue. It's only orthogonal to the fsync-specific contention issue.
We now have to worry that the WALWriteLock semantics cause the same contention. Your idea of a separate LogWriter process could very nicely solve this problem and accomplish a few other things at the same time if we make a few enhancements. Back-end servers would not issue fsync calls. They would simply block waiting until the LogWriter had written their record to the disk, i.e. until the sync'd block # was greater than the block that contained the XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends after its log write returns. The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter would issue writes of the optimal size when enough data was present or of smaller chunks if enough time had elapsed since the last write. The nice part is that the WALWriteLock semantics could be changed to allow the LogWriter to write to disk while WALWriteLocks are acquired by back-end servers. WALWriteLocks would only be held for the brief time needed to copy the entries into the log buffer. The LogWriter would only need to grab a lock to determine the current end of the log buffer. Since it would be writing blocks that occur earlier in the cache than the XLogInsert log writers it won't need to grab a WALWriteLock before writing the cache buffers. Many transactions would commit on the same fsync (now really a write with O_DSYNC) and we would get optimal write throughput for the log system. This would handle all the issues I had and it doesn't sound like a huge change. In fact, it ends up being almost semantically identical to the aio_write suggestion I made originally, except the LogWriter is doing the background writing instead of the OS and we don't have to worry about aio implementations and portability. - Curtis ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Threaded Sorting
On Fri, 2002-10-04 at 14:31, Bruce Momjian wrote: We use tape sorts, ala Knuth, meaning we sort in memory as much as possible, but when there is more data than fits in memory, rather than swapping, we write to temp files then merge the temp files (aka tapes). Right, which is what I originally assumed. On lower end systems, that works great. Once you allow that people may actually have high-end systems with multiple CPUs and lots of memory, wouldn't it be nice to allow for huge improvements on large sorts? Basically, you have two ends of the spectrum. One, where you don't have enough memory and become I/O bound. The other is where you have enough memory but are CPU bound; where potentially you have extra CPUs to spare. Seems to me they are not mutually exclusive. Unless I've missed something, the ideal case is to never use tapes for sorting. Which is to say, you're trying to optimize an already less than ideal situation (which is of course good). I'm trying to discuss making it a near ideal use of available resources. I can understand why addressing the seemingly more common I/O bound case would receive priority, however, I'm at a loss as to why the other would be completely ignored. Seems to me, implementing both would even work in a complimentary fashion on the low-end cases and yield more breathing room for the high-end cases. What am I missing for the other case to be completely ignored? Greg signature.asc Description: This is a digitally signed message part
Re: [HACKERS] Threaded Sorting
Bruce Momjian [EMAIL PROTECTED] writes: Tom Lane wrote: ... But ISTM that if this makes sense for our internal temp files, it makes sense for user-created temp tables as well. Yes, I was thinking that, but of course, those are real tables, rather than just files. Not sure how clean it will be to mix those in the same directory. We haven't in the past. Is it a good idea? Sure we have --- up till recently, pg_temp files just lived in the database directory. I think it was you that added the pg_temp subdirectory, and the reason you did it was to let people symlink the temp files to someplace else. But that's just a zeroth-order approximation to providing a tablespace facility for these things. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Bruce Momjian wrote: I am again confused. When we do write(), we don't have to lock anything, do we? (Multiple processes can write() to the same file just fine.) We do block the current process, but we have nothing else to do until we know it is written/fsync'ed. Does aio more easily allow the kernel to order those writes? Is that the issue? Well, certainly the kernel already orders the writes. Just because we write() doesn't mean it goes to disk. Only fsync() or the kernel do that. We don't have to lock anything, but most file systems can't process fsyncs simultaneously with other writes, so those writes block because the file system grabs its own internal locks. The fsync call is more contentious than typical writes because its duration is usually longer, so it holds the locks longer over more pages and structures. That is the real issue: the contention caused by fsync'ing very frequently, which blocks other writers and readers. For the buffer manager, the blocking of readers is probably even more problematic when the cache is a small percentage (say 10% to 15%) of the total database size because most leaf node accesses will result in a read. Each of these reads will have to wait on the fsync as well. Again, a very well written file system probably can minimize this but I've not seen any. Further comment on: We do block the current process, but we have nothing else to do until we know it is written/fsync'ed. Writing out a bunch of calls at the end, after having consumed a lot of CPU cycles, and then waiting is not as efficient as writing them out while those CPU cycles are being used. We are currently wasting the time it takes for a given process to write. The thinking probably has been that this is no big deal because other processes, say B, C and D, can use the CPU cycles while process A blocks. This is true UNLESS the other processes are blocking on reads or writes caused by process A doing the final writes and fsync.
Yes, but Oracle is threaded, right, so, yes, they clearly could win with it. I read the second URL and it said we could issue separate writes and have them be done in an optimal order. However, we use the file system, not raw devices, so don't we already have that in the kernel with fsync()? Whether by threads or multiple processes, there is the same contention on the file through multiple writers. The file system can decide to reorder writes before they start but not after. If a write comes after an fsync starts it will have to wait on that fsync. Likewise a given process's writes can NEVER be reordered if they are submitted synchronously, as is done in the calls to flush the log as well as the dirty pages in the buffer in the current code. Probably. Having seen the Informix 5/7 debacle, I don't want to fall into the trap where we add stuff that just makes things faster on SMP/threaded systems when it makes our code _slower_ on single CPU systems, which is exactly what Informix did in Informix 7, and we know how that ended (lost customers, bought by IBM). I don't think that's going to happen to us, but I thought I would mention it. Yes, I hate improvements that make things worse for most people. Any changes I'd contemplate would be simply another configuration driven optimization that could be turned off very easily. - Curtis ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Large databases, performance
On 3 Oct 2002 at 18:53, Manfred Koizar wrote: On Thu, 03 Oct 2002 21:47:03 +0530, Shridhar Daithankar [EMAIL PROTECTED] wrote: I believe that was vacuum analyze only. Well there is VACUUM [tablename]; and there is ANALYZE [tablename]; And VACUUM ANALYZE [tablename]; is VACUUM followed by ANALYZE. I was using vacuum analyze. Good that you pointed out. Now I will modify the postgresql auto vacuum daemon that I wrote to analyze only in case of excessive inserts. I hope that's lighter on performance compared to vacuum analyze.. Bye Shridhar -- Mix's Law: There is nothing more permanent than a temporary building. There is nothing more permanent than a temporary tax. ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP
I'd give you the first and third of those. As Andrew noted, the argument that it's more standard-compliant is not very solid. The standard doesn't say anything about transaction in this regard. Yes, it says statement. Note also that a typical SELECT-only session would not advance CURRENT_TIMESTAMP at all in the typical autocommit-off mode that the Spec is all about. What do others think? I liked your proposal to advance CURRENT_TIMESTAMP at each statement start. (It would not advance inside a stored procedure). Andreas ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] (Fwd) Re: Any Oracle 9 users? A test please...
Howdy All, You have to explicitly commit transactions in oracle using SQL*Plus. However, DUAL (eg. SELECT current_timestamp FROM DUAL;) is special in this case. It is a table in the sys schema, used for selecting constants, pseudo-columns, etc. I'm not sure if this helps but see: http://download-east.oracle.com/otndoc/oracle9i/901_doc/server.901/a90125/queries2.htm#2054162 rob 'Oracle 9 tester' :P On Mon, 30 Sep 2002, Bruce Momjian wrote: OK, I just received this answer from an Oracle 9 tester. It shows CURRENT_TIMESTAMP changing during the transaction. Thanks, Dan. Dan, it wasn't clear if this was in a transaction or not. Does Oracle have autocommit off by default so you are always in a transaction? --- Dan Langille wrote: A very quick answer: --- Forwarded message follows --- Date: Mon, 30 Sep 2002 13:03:51 -0400 (EDT) From: Agent Drek [EMAIL PROTECTED] To: [EMAIL PROTECTED] [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] [EMAIL PROTECTED] Subject: Re: Any Oracle 9 users? A test please... In-Reply-To: 3D984877.19685.801EEC30@localhost Message-ID: Pine.BSF.4.44.0209301303030.50384- [EMAIL PROTECTED] MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: [EMAIL PROTECTED] On Mon, 30 Sep 2002, Dan Langille wrote: Date: Mon, 30 Sep 2002 12:49:59 -0400 From: Dan Langille [EMAIL PROTECTED] Reply-To: [EMAIL PROTECTED] [EMAIL PROTECTED] To: [EMAIL PROTECTED] [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] [EMAIL PROTECTED] Subject: Any Oracle 9 users? A test please... Followups to [EMAIL PROTECTED] please! Any Oracle 9 users out there? I need this run: BEGIN; SELECT CURRENT_TIMESTAMP; -- wait 5 seconds SELECT CURRENT_TIMESTAMP; Are those two timestamps the same?
Thanks Our DBA says: snip from irc data SQL SELECT current_timestamp FROM DUAL; data CURRENT_TIMESTAMP data -- - data 30-SEP-02 01.06.42.660969 PM -04:00 data SQL SELECT current_timestamp FROM DUAL; data CURRENT_TIMESTAMP data -- - data 30-SEP-02 01.06.48.837372 PM -04:00 data (you have to include 'from dual' for 'non-table' selects) -- Derek Marshall Smash and Pow Inc 'digital plumber' http://www.smashpow.net To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-database in the body of the message --- End of forwarded message --- -- Dan Langille I'm looking for a computer job: http://www.freebsddiary.org/dan_langille.php -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] [COMMITTERS] pgsql/contrib/rserv ApplySnapshot.in CleanLog. ...
Hello Peter, Tuesday, October 01, 2002, 1:42:46 AM, you wrote: PE Bruce Momjian writes: Peter, the author is questioning why his Makefile changes were wrong. Would you elaborate? PE Because we rely on the built-in library lookup functionality instead of PE hardcoding the full file name. Agh! I finally read up on module loading http://developer.postgresql.org/docs/postgres/xfunc-c.html#XFUNC-C-DYNLOAD and now I seem to understand. You see, the problem with the current Makefile is as follows: it substitutes '$libdir' into both .sql and perl files. While this is good enough for sql, $libdir is consumed by Perl and thus perl scripts do NOT work. Thanks for elaborating, I'll try to produce a better patch ASAP. -- Best regards, Alexey Borzov Internet projects department, RDV-Media Ltd http://www.rdw.ru http://www.vashdosug.ru ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] (Fwd) Re: Any Oracle 9 users? A test please...
The original tester says this is an anonymous procedure. On 30 Sep 2002 at 15:07, Bruce Momjian wrote: It is not clear to me; is this its own transaction or a function call? -- - Dan Langille wrote: And just for another opinion, which supports the first. From now, unless you indicate otherwise, I'll only report tests which have both values the same. From: Shawn O'Connor [EMAIL PROTECTED] To: Dan Langille [EMAIL PROTECTED] Subject: Re: Any Oracle 9 users? A test please... In-Reply-To: 3D985663.24174.80554E83@localhost Message-ID: [EMAIL PROTECTED] MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-PMFLAGS: 35127424 0 1 P2A7A0.CNM Okay, here you are: -- DECLARE time1 TIMESTAMP; time2 TIMESTAMP; sleeptime NUMBER; BEGIN sleeptime := 5; SELECT CURRENT_TIMESTAMP INTO time1 FROM DUAL; DBMS_LOCK.SLEEP(sleeptime); SELECT CURRENT_TIMESTAMP INTO time2 FROM DUAL; DBMS_OUTPUT.PUT_LINE(TO_CHAR(time1)); DBMS_OUTPUT.PUT_LINE(TO_CHAR(time2)); END; / 30-SEP-02 11.54.09.583576 AM 30-SEP-02 11.54.14.708333 AM PL/SQL procedure successfully completed. -- Hope this helps! -Shawn On Mon, 30 Sep 2002, Dan Langille wrote: We're testing this just to see what Oracle does. What you are saying is what we expect to happen. But could you do that test for us from the command line? Thanks. On 30 Sep 2002 at 10:31, Shawn O'Connor wrote: I'm assuming you're doing this as some sort of anonymous PL/SQL function: Don't you need to do something like: SELECT CURRENT_TIMESTAMP FROM DUAL INTO somevariable? and to wait five seconds probably: EXECUTE DBMS_LOCK.SLEEP(5); But to answer your question-- When this PL/SQL function is run the values of current_timestamp are not the same, they will be separated by five seconds or so. Hope this helps! -Shawn On Mon, 30 Sep 2002, Dan Langille wrote: Followups to [EMAIL PROTECTED] please! Any Oracle 9 users out there? I need this run: BEGIN; SELECT CURRENT_TIMESTAMP; -- wait 5 seconds SELECT CURRENT_TIMESTAMP; Are those two timestamps the same?
Thanks -- Dan Langille I'm looking for a computer job: http://www.freebsddiary.org/dan_langille.php --- End of forwarded message --- -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [GENERAL] [HACKERS] Anyone want to assist with the translation
Hi Justin, --On Thursday, October 3, 2002 09:23 +1000 Justin Clift [EMAIL PROTECTED] wrote: Hi Michael, Michael Paesold wrote: snip Hi Justin, I am from Austria, and I would like to help. I could provide a German translation. The Babelfish translation is really funny. Machine translation is readable, but it is no advertisement. ;-) I do not really need an interface; just tell me in what way you want the texts. Cool. Could you deal with an OpenOffice Calc or M$ Excel file having the lines of English text in one column, and doing the German translation into a second column? That might be easiest, and will allow a cut-n-paste of the German version straight into the database backend. Haha, cut&paste ;-) Ever heard of CSV? :-)) However, I can also have a look at it, if desired. Regards Tino Sound workable to you? :-) Regards and best wishes, Justin Clift Best Regards, Michael Paesold -- My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there. - Indira Gandhi ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [GENERAL] [HACKERS] Anyone want to assist with the
Hi Justin, Good point. For the moment we've whipped up that MS Excel document (created in OpenOffice of course) of all the English text strings in the site and emailed it to the volunteers. :) Btw. did you ever unzip the native OpenOffice (aka StarOffice) file? So far community members have volunteered for German, Turkish, French, Spanish, Brazilian Portuguese, and Polish. Cool. :) Want to co-ordinate with the other two German language volunteers? Sure. So I'm here :-) Regards Tino ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [GENERAL] Large databases, performance
Can you comment on the tools you are using to do the insertions (Perl, Java?), the distribution of data (all random, all static), and the transaction scope (all inserts in one transaction, each insert as a single transaction, some group of inserts as a transaction)? I'd be curious what happens when you submit more queries than you have processors (you had four concurrent queries and four CPUs), if you care to run any additional tests. Also, I'd report the query time in absolute terms (like you did) and also as 'time / number of concurrent queries'. This will give you a sense of how the system is scaling as the workload increases. Personally I am more concerned about this aspect than the load time, since I am going to guess that this is where all the time is spent. Was the original posting on GENERAL or HACKERS? Is this moving to PERFORMANCE for follow-up? I'd like to follow this discussion and want to know if I should join another group. Thanks, Charlie P.S. Anyone want to comment on their expectations for 'commercial' databases handling this load? I know that we cannot speak about specific performance metrics on some products (licensing restrictions), but I'd be curious if folks have seen some of the databases out there handle these dataset sizes and respond reasonably. Shridhar Daithankar wrote: Hi, Today we concluded a test of database performance. Attached are the results and the schema, for those who have missed the earlier discussion on this. We have (almost) decided that we will partition the data across machines. The scheme is that after some short interval, a burst of data will be entered into a new table in the database, then indexed and vacuumed. The table(s) will be inherited so that a query on the base table will fetch results from all the children. The application has to consolidate all the data on a per-node basis. If the database is not postgresql, the application has to consolidate data across partitions as well.
Now we need to investigate whether selecting on the base table to include children would use the indexes created on the children tables. It's estimated that when the entire data set is gathered, the total number of children tables would be around 1K-1.1K across all machines. This is based on the average rate of data insertion, i.e. 5K records/sec, and the total data size, estimated to be 9 billion rows max, i.e. an estimated database size of 900GB. Obviously it's impossible to keep the insertion rate on an indexed table high as the data grows, so partitioning/inheritance looks like the better approach. Postgresql is not the final winner as yet; Mysql is in close range. I will keep you guys posted about the result. Let me know about any comments.. Bye Shridhar -- Price's Advice: It's all a game -- play it to have fun.

Machine: Compaq Proliant Server ML 530, 4 x Intel Xeon 2.4 GHz, 4 GB RAM, 5 x 72.8 GB SCSI HDD, RAID 0 (striping) hardware setup, Mandrake Linux 9.0. Cost: $13,500 ($1,350 for each additional 72GB HDD).

Performance parameter          MySQL 3.23.52         MySQL 3.23.52        PostgreSQL 7.2.2
                               (without InnoDB       (with InnoDB for     (with built-in
                               support for           transactional        support for
                               transactions)         support)             transactions)
--------------------------------------------------------------------------------------------
Complete data inserts +        3738 secs             18720 secs           20628 secs
building a composite index
(40 GB data, 432,000,000
tuples, about 100 bytes each,
schema on 'schema' sheet,
composite index on 3 fields:
esn, min, datetime)
Load speed                     115570 tuples/second  23076 tuples/second  20942 tuples/second
Database size on disk          48 GB                 87 GB                111 GB
Average per partition
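As a back-of-envelope check of the figures quoted above (about 100 bytes per row, a 5K records/sec insertion rate, 9 billion rows max):

```python
# Sanity-check the quoted estimates; units are the posting's own.
rows = 9_000_000_000          # 9 billion rows max
bytes_per_row = 100           # ~100 bytes each
insert_rate = 5_000           # records/sec

raw_gb = rows * bytes_per_row / 10**9      # raw data volume in GB
load_days = rows / insert_rate / 86_400    # sustained-load time in days
```

Raw data alone is the quoted 900 GB, and sustaining 5K records/sec puts the full load at roughly three weeks, which is why keeping the insertion rate up on an indexed table matters so much here.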
Re: [GENERAL] [HACKERS] Anyone want to assist with the
Hi Michael, yeah, I got it :-) I'm busy reviewing :-) Regards Tino --On Thursday, October 3, 2002 21:54 +0200 Michael Paesold [EMAIL PROTECTED] wrote: Tino Wildenhain [EMAIL PROTECTED] wrote: Hi Justin, Good point. For the moment we've whipped up that MS Excel document (created in OpenOffice of course) of all the English text strings in the site and emailed it to the volunteers. :) Btw. did you ever unzip the native OpenOffice (aka StarOffice) file? So far community members have volunteered for German, Turkish, French, Spanish, Brazilian Portuguese, and Polish. Cool. :) Want to co-ordinate with the other two German language volunteers? Sure. So I'm here :-) Regards Tino You should already have got at least two mails, haven't you? Michael ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes: Note also, that a typical SELECT only session would not advance CURRENT_TIMESTAMP at all in the typical autocommit off mode that the Spec is all about. True, but the spec also says to default to serializable transaction mode. So in a single-transaction session like you are picturing, the successive SELECTs would all see a frozen snapshot of the database. Freezing CURRENT_TIMESTAMP goes right along with that, and in fact makes a lot of sense, because it tells you exactly what time your snapshot of the database state was taken. This line of thought opens another can of worms: should the behavior of CURRENT_TIMESTAMP depend on serializable vs. read-committed mode? Maybe SetQuerySnapshot is the routine that ought to capture the statement-start-time timestamp value. We could define CURRENT_TIMESTAMP as the time of the active database snapshot. Or at least offer a fourth parameter to that parameterized now() to return this time. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Return of INSTEAD rules
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: I am confused how yours differs from mine. I don't see how the last matching tagged query would not be from an INSTEAD rule. You could have both INSTEAD and non-INSTEAD rules firing for the same original query. If the alphabetically-last rule is a non-INSTEAD rule, then there's a difference. How do we get multiple rules on a query? I thought it was mostly INSERT/UPDATE/DELETE, and those all operate on a single table. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith wrote: Bruce Momjian wrote: I may be missing something here, but other backends don't block while one writes to WAL. I don't think they'll block until they get to the fsync or XLogWrite call while another transaction is fsync'ing. I'm no Unix filesystem expert but I don't see how the OS can handle multiple writes and fsyncs to the same file descriptors without blocking other processes from writing at the same time. It may be that there are some clever data structures they use but I've not seen huge praise for most of the file systems. A well written file system could minimize this contention but I'll bet it's there with most of the ones that PostgreSQL most commonly runs on. I'll have to write a test and see if there really is a problem. Yes, I can see some contention, but what does aio solve? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Threaded Sorting
Hans-Jürgen Schönig wrote: Did anybody think about threaded sorting so far? Assume an SMP machine. In the case of building an index or in the case of sorting a lot of data there is just one backend working. Therefore just one CPU is used. What about starting a thread for every temporary file being created? This way CREATE INDEX could use many CPUs. Maybe this is worth thinking about because it will speed up huge databases and enterprise level computing. We haven't thought about it yet because there are too many buggy thread implementations. We are probably just now getting to a point where we can consider it. However, lots of databases have moved to threads for all sorts of things and ended up with a royal mess of code. Threads can only improve things in a few areas of the backend, so it would be nice if we could limit the exposure to threads to those areas; sorting could certainly be one of them, but frankly, I think disk I/O is our limiting factor there. I would be interested to see some tests that showed otherwise. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
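The per-range sort-plus-merge scheme described in this thread can be sketched outside the backend. This is a toy Python illustration under the assumption "sort disjoint chunks concurrently, then one merge pass"; the function name, chunking, and worker count are invented for the example, and it is not backend code:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(data, nworkers=4):
    """Sort disjoint ranges of the input concurrently, then merge the
    sorted runs -- the 'exclusive data ranges per thread, one merger'
    structure discussed above. (In CPython the threads mostly show the
    shape of the approach; the GIL limits real CPU parallelism.)"""
    chunk = max(1, (len(data) + nworkers - 1) // nworkers)
    runs = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        sorted_runs = list(pool.map(sorted, runs))
    return list(heapq.merge(*sorted_runs))
```

Turning threading off, as suggested above, would reduce to sorting each run in series and merging the results the same way.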
Re: numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)
Tom Lane wrote: Moving to the left requires an explicit cast (or at least an assignment to a column). I know this looks strange to someone who knows that our numeric type beats float4/float8 on both range and precision, but it's effectively mandated by the SQL spec. Any combination of exact and inexact numeric types is supposed to yield an inexact result per spec, thus numeric + float8 yields float8 not numeric. Another reason for doing it this way is that a numeric literal like 123.456 can be initially typed as numeric, and later implicitly promoted to float4 or float8 if context demands it. Doing that the other way 'round would introduce problems with precision loss. We had speculated about introducing an unknown_numeric pseudo-type to avoid that problem, but the above hierarchy eliminates the need for unknown_numeric. We can initially type a literal as the smallest thing it will fit in, and then do implicit promotion as needed. (7.3 is not all the way there on that plan, but 7.4 will be.) Do we know that defaulting floating constants will not be a performance hit? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
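The promotion rules Tom describes (a literal starts at the smallest exact type that holds it; mixing exact and inexact types yields an inexact result) can be modeled in a few lines. This is a toy sketch only; the type names and rank order mirror the discussion, not the parser's actual code:

```python
# Toy model of the numeric hierarchy discussion: exact types promote
# rightward, and any inexact operand makes the result inexact.
EXACT = ["int2", "int4", "int8", "numeric"]
INEXACT = ["float4", "float8"]
RANK = {t: i for i, t in enumerate(EXACT + INEXACT)}

def literal_type(text):
    """Type a literal as the smallest thing it will fit in."""
    if "." in text or "e" in text.lower():
        return "numeric"  # exact until context demands an inexact type
    n = int(text)
    if -2**15 <= n < 2**15:
        return "int2"
    if -2**31 <= n < 2**31:
        return "int4"
    return "int8" if -2**63 <= n < 2**63 else "numeric"

def result_type(a, b):
    """Combine two operand types per the spec rule described above."""
    if (a in INEXACT) != (b in INEXACT):
        return a if a in INEXACT else b  # exact + inexact -> inexact
    return a if RANK[a] >= RANK[b] else b
```

So in this model 123.456 starts as numeric, but numeric combined with float8 yields float8, as the message describes.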
Re: [HACKERS] Return of INSTEAD rules
Bruce Momjian [EMAIL PROTECTED] writes: Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: I am confused how yours differs from mine. I don't see how the last matching tagged query would not be from an INSTEAD rule. You could have both INSTEAD and non-INSTEAD rules firing for the same original query. If the alphabetically-last rule is a non-INSTEAD rule, then there's a difference. How do we get multiple rules on a query? I thought it was mostly INSERT/UPDATE/DELETE, and those all operate on a single table. You can create as many rules as you want. One reasonably likely scenario is that you have a view, you make an ON INSERT DO INSTEAD rule to support insertions into the view (by inserting into some underlying table(s) instead), and then you add some not-INSTEAD rules to perform logging into other tables that aren't part of the view but just keep track of activity. You'd not want the logging activity to usurp the count result for this setup, I think, even if it happened last. (Indeed, that might be *necessary*, if for some reason it needed to access the rows inserted into the view's base table.) This approach would give us a general principle that applies in all cases: not-INSTEAD rules don't affect the returned command result. Perhaps that would answer Manfred's thought that we should be able to label which rules affect the result. If you have any INSTEAD rules, then it doesn't matter exactly how many you have, so you can mark them INSTEAD or not to suit your fancy. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)
Bruce Momjian [EMAIL PROTECTED] writes: Do we know that defaulting floating constants will not be a performance hit? Uh ... what's your concern exactly? The datatype coercion (if any) will happen once at parse time, not at runtime. regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] CURRENT_TIMESTAMP
Bruce Momjian [EMAIL PROTECTED] wrote: Yes, clearly, we will need to have all three time values available to users. With three people now suggesting we don't change, I will just add to TODO: Add now(transaction|statement|clock) functionality Is that good? CURRENT_TIMESTAMP etc. are converted to now()::TIMESTAMP, at least in 7.2, right? So when there are all three options available, it would be easy to change the behaviour of CURRENT_DATE/TIME/TIMESTAMP, right? SET .. or GUC would be options, no? Best Regards, Michael Paesold ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
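The three values in that TODO item can be mocked up to show how they would differ. This is a hypothetical model of now(transaction|statement|clock), not the actual backend behavior or API:

```python
import time

class Session:
    """Toy session: transaction time freezes at BEGIN, statement time
    freezes at each statement start, clock time is always current."""
    def begin(self):
        self.txn_ts = time.time()
        self.start_statement()

    def start_statement(self):
        self.stmt_ts = time.time()

    def now(self, mode="transaction"):
        if mode == "transaction":
            return self.txn_ts      # frozen for the whole transaction
        if mode == "statement":
            return self.stmt_ts     # frozen per statement
        return time.time()          # "clock": the wall clock
```

In this model, CURRENT_TIMESTAMP keeping its transaction-start behavior corresponds to the default mode, while the other two are opt-in.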
Re: [HACKERS] Return of INSTEAD rules
On Thu, 3 Oct 2002 22:21:27 -0400 (EDT), Bruce Momjian [EMAIL PROTECTED] wrote: so I propose we handle INSTEAD rules this way: that we return the oid and tuple count of the last INSTEAD rule query with a tag matching the main query. Bruce, this won't work for this example CREATE RULE visible_delete AS -- DELETE rule ON DELETE TO visible DO INSTEAD COUNT UPDATE table3 SET deleted = 1 WHERE pk = old.pk; because here we don't have a rule query with a matching tag. Same applies for CREATE RULE v_update AS -- UPDATE rule ON UPDATE TO v DO INSTEAD NOTHING; I wrote: One argument against automatically don't count non-INSTEAD rules and count the last statement in INSTEAD rules Seems I introduced a little bit of confusion here by arguing against something that has never been proposed before. Funny that this non-existent proposal is now seriously discussed :-( Has the idea of extending the syntax to explicitly mark queries as COUNTed already been rejected? If yes, I cannot help here. If no, I keep telling you that this approach can emulate most of the other possible solutions still under discussion. Bruce wrote: If there is more than one matching tag in the INSTEAD rule, the user has the option to place the query he wants for the return at the end of the rule. Are you sure this is always possible without unwanted side effects? Servus Manfred ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I wrote: I'm no Unix filesystem expert, but I don't see how the OS can handle multiple writes and fsyncs to the same file descriptors without blocking other processes from writing at the same time. It may be that there are some clever data structures they use, but I've not seen huge praise for most of the file systems. A well-written file system could minimize this contention, but I'll bet it's there with most of the ones that PostgreSQL most commonly runs on. I'll have to write a test and see if there really is a problem. Bruce Momjian wrote: Yes, I can see some contention, but what does aio solve? Well, theoretically, aio lets the file system handle the writes without requiring any locks to be held by the processes issuing those writes. The disk I/O scheduler can therefore issue the writes using spinlocks or something very fast, since it controls the timing of each of the actual writes. In some systems this is handled by the kernel and can be very fast. I suspect that with large RAID controllers or intelligent disk systems like EMC this is even more important, because they should be able to handle a much higher level of concurrent I/O. Now whether or not the common file systems handle this well, I can't say. Take a look at some comments on how Oracle uses asynchronous I/O: http://www.ixora.com.au/notes/redo_write_multiplexing.htm http://www.ixora.com.au/notes/asynchronous_io.htm http://www.ixora.com.au/notes/raw_asynchronous_io.htm It seems that OS support for this will likely increase and that this issue will become more and more important as users contemplate SMP systems or if threading is added to certain PostgreSQL subsystems. It might be easier for me to implement the change I propose and then see what kind of difference it makes. I wanted to run the idea past this group first. We can all postulate whether or not it will work, but we won't know unless we try it. My real issue is what happens in the event that it does work.
I've had very good luck implementing this sort of thing for other systems, but I don't yet know the range of I/O requests that PostgreSQL makes. Assuming we can demonstrate no detrimental effects on system reliability, and that the change is implemented in such a way that it can be turned on or off easily, will a 50% or better increase in speed for updates justify the sort of change I am proposing? 20%? 10%? - Curtis ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
[HACKERS] any experience with IA-64
I will probably have a chance to work with postgres on Linux IA-64. Is there any optimization for postgresql? Regards, Oleg _ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP
Tom Lane [EMAIL PROTECTED] wrote: Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes: Note also, that a typical SELECT only session would not advance CURRENT_TIMESTAMP at all in the typical autocommit off mode that the Spec is all about. True, but the spec also says to default to serializable transaction mode. So in a single-transaction session like you are picturing, the successive SELECTs would all see a frozen snapshot of the database. Freezing CURRENT_TIMESTAMP goes right along with that, and in fact makes a lot of sense, because it tells you exactly what time your snapshot of the database state was taken. This line of thought opens another can of worms: should the behavior of CURRENT_TIMESTAMP depend on serializable vs. read-committed mode? Maybe SetQuerySnapshot is the routine that ought to capture the statement-start-time timestamp value. We could define CURRENT_TIMESTAMP as the time of the active database snapshot. Or at least offer a fourth parameter to that parameterized now() to return this time. regards, tom lane That is a very good point. At least with serializable transactions it seems perfectly reasonable to return a frozen CURRENT_TIMESTAMP. What do you think about read-committed level? Can time be committed? ;-) It would be even more surprising to new users if the implementation of CURRENT_TIMESTAMP depended on the transaction serialization level. Regards, Michael Paesold ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Threaded Sorting
On Fri, 4 Oct 2002, Bruce Momjian wrote: Hans-Jürgen Schönig wrote: Did anybody think about threaded sorting so far? Assume an SMP machine. In the case of building an index or in the case of sorting a lot of data there is just one backend working. Therefore just one CPU is used. What about starting a thread for every temporary file being created? This way CREATE INDEX could use many CPUs. Maybe this is worth thinking about because it will speed up huge databases and enterprise level computing. We haven't thought about it yet because there are too many buggy thread implementations. We are probably just now getting to a point where we can consider it. However, lots of databases have moved to threads for all sorts of things and ended up with a royal mess of code. Threads can only improve things in a few areas of the backend, so it would be nice if we could limit the exposure to threads to those areas; sorting could certainly be one of them, but frankly, I think disk I/O is our limiting factor there. I would be interested to see some tests that showed otherwise. Wouldn't the type of disk subsystem really make a big difference here? With a couple of U160 cards and a dozen 15krpm hard drives, I would imagine I/O would no longer be as much of an issue as on a single-drive system. It seems like sometimes we consider these issues more from the one-or-two-SCSI-drives perspective instead of the big-box-o'-drives perspective. ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Return of INSTEAD rules
Tom Lane wrote: You can create as many rules as you want. One reasonably likely scenario is that you have a view, you make an ON INSERT DO INSTEAD rule to support insertions into the view (by inserting into some underlying table(s) instead), and then you add some not-INSTEAD rules to perform logging into other tables that aren't part of the view but just keep track of activity. You'd not want the logging activity to usurp the count result for this setup, I think, even if it happened last. (Indeed, that might be *necessary*, if for some reason it needed to access the rows inserted into the view's base table.) This approach would give us a general principle that applies in all cases: not-INSTEAD rules don't affect the returned command result. Perhaps that would answer Manfred's thought that we should be able to label which rules affect the result. If you have any INSTEAD rules, then it doesn't matter exactly how many you have, so you can mark them INSTEAD or not to suit your fancy. Oh, I like that, and rules fire alphabetically, right? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Return of INSTEAD rules
Manfred Koizar wrote: On Thu, 3 Oct 2002 22:21:27 -0400 (EDT), Bruce Momjian [EMAIL PROTECTED] wrote: so I propose we handle INSTEAD rules this way: that we return the oid and tuple count of the last INSTEAD rule query with a tag matching the main query. Bruce, this won't work for this example CREATE RULE visible_delete AS -- DELETE rule ON DELETE TO visible DO INSTEAD COUNT UPDATE table3 SET deleted = 1 WHERE pk = old.pk; because here we don't have a rule query with a matching tag. Same applies for True, but because we have said we are going to return the tag of the original command, I don't think we have anything valid to return in this case to match the tag. CREATE RULE v_update AS -- UPDATE rule ON UPDATE TO v DO INSTEAD NOTHING; This is OK because the default is to return zeros. I wrote: One argument against automatically don't count non-INSTEAD rules and count the last statement in INSTEAD rules Seems I introduced a little bit of confusion here by arguing against something that has never been proposed before. Funny that this non-existent proposal is now seriously discussed :-( Has the idea of extending the syntax to explicitly mark queries as COUNTed already been rejected? If yes, I cannot help here. If no, I Well, I am hoping to find something that is automatic. If we do our best, and we still get complaints, we can add some syntax. I am concerned that adding syntax is just over-designing something that isn't necessary. keep telling you that this approach can emulate most of the other possible solutions still under discussion. Bruce wrote: If there is more than one matching tag in the INSTEAD rule, the user has the option to place the query he wants for the return at the end of the rule. Are you sure this is always possible without unwanted side effects? I am sure it isn't always possible, but let's do our best and see how people react.
-- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP
Michael Paesold wrote: Tom Lane [EMAIL PROTECTED] wrote: Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes: Note also, that a typical SELECT only session would not advance CURRENT_TIMESTAMP at all in the typical autocommit off mode that the Spec is all about. True, but the spec also says to default to serializable transaction mode. So in a single-transaction session like you are picturing, the successive SELECTs would all see a frozen snapshot of the database. Freezing CURRENT_TIMESTAMP goes right along with that, and in fact makes a lot of sense, because it tells you exactly what time your snapshot of the database state was taken. This line of thought opens another can of worms: should the behavior of CURRENT_TIMESTAMP depend on serializable vs. read-committed mode? Maybe SetQuerySnapshot is the routine that ought to capture the statement-start-time timestamp value. We could define CURRENT_TIMESTAMP as the time of the active database snapshot. Or at least offer a fourth parameter to that parameterized now() to return this time. regards, tom lane That is a very good point. At least with serializable transactions it seems perfectly reasonable to return a frozen CURRENT_TIMESTAMP. What do you think about read-committed level? Can time be committed? ;-) It would be even more surprising to new users if the implementation of CURRENT_TIMESTAMP depended on the transaction serialization level. Yes, CURRENT_TIMESTAMP changing based on transaction serializable/read-committed mode would be quite confusing. Also, because our default is read committed, we would end up with CURRENT_TIMESTAMP being statement-level, which actually does give us a logical place to allow CURRENT_TIMESTAMP to change, but I thought people voted against that. However, imagine a query that used CURRENT_TIMESTAMP in the WHERE clause to find items that were not in the future.
Would a CURRENT_TIMESTAMP test in a multi-statement transaction want to check based on transaction start, or on the tuples visible at the time the statement started? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith wrote: Yes, I can see some contention, but what does aio solve? Well, theoretically, aio lets the file system handle the writes without requiring any locks to be held by the processes issuing those writes. The disk I/O scheduler can therefore issue the writes using spinlocks or something very fast, since it controls the timing of each of the actual writes. In some systems this is handled by the kernel and can be very fast. I am again confused. When we do write(), we don't have to lock anything, do we? (Multiple processes can write() to the same file just fine.) We do block the current process, but we have nothing else to do until we know it is written/fsync'ed. Does aio more easily allow the kernel to order those writes? Is that the issue? Well, certainly the kernel already orders the writes. Just because we write() doesn't mean it goes to disk. Only fsync() or the kernel does that. I suspect that with large RAID controllers or intelligent disk systems like EMC this is even more important, because they should be able to handle a much higher level of concurrent I/O. Now whether or not the common file systems handle this well, I can't say. Take a look at some comments on how Oracle uses asynchronous I/O: http://www.ixora.com.au/notes/redo_write_multiplexing.htm http://www.ixora.com.au/notes/asynchronous_io.htm http://www.ixora.com.au/notes/raw_asynchronous_io.htm Yes, but Oracle is threaded, right? So, yes, they clearly could win with it. I read the second URL and it said we could issue separate writes and have them be done in an optimal order. However, we use the file system, not raw devices, so don't we already have that in the kernel with fsync()? It seems that OS support for this will likely increase and that this issue will become more and more important as users contemplate SMP systems or if threading is added to certain PostgreSQL subsystems. Probably.
Having seen the Informix 5/7 debacle, I don't want to fall into the trap where we add stuff that just makes things faster on SMP/threaded systems when it makes our code _slower_ on single-CPU systems, which is exactly what Informix did in Informix 7, and we know how that ended (lost customers, bought by IBM). I don't think that's going to happen to us, but I thought I would mention it. Assuming we can demonstrate no detrimental effects on system reliability, and that the change is implemented in such a way that it can be turned on or off easily, would a 50% or better increase in speed for updates justify the sort of change I am proposing? 20%? 10%? Yea, let's see what boost we get, and the size of the patch, and we can review it. It is certainly worth researching.
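The write()/fsync() distinction the thread keeps circling back to can be sketched in a few lines. This is purely an illustration of the Unix semantics (Python used for brevity, not PostgreSQL code; the function name is made up): write() returns once the data is in kernel buffers, and only the fsync() forces it to stable storage, so one fsync can cover several preceding writes.

```python
import os
import tempfile

def wal_append(fd, record, sync):
    """Append a WAL-style record; optionally force it to stable storage."""
    os.write(fd, record)  # returns once data is copied into kernel buffers
    if sync:
        os.fsync(fd)      # blocks until the kernel flushes the file to disk

# Two records, but only one fsync: the final fsync covers both writes.
fd, path = tempfile.mkstemp()
wal_append(fd, b"record-1\n", sync=False)
wal_append(fd, b"record-2\n", sync=True)
os.close(fd)

with open(path, "rb") as f:
    contents = f.read()
os.unlink(path)
```

Opening the file with O_DSYNC instead, as suggested above, would make each write() itself wait for the disk, removing the separate fsync() call entirely.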
Re: [HACKERS] Threaded Sorting
Hans-Jürgen Schönig wrote: Threads are bad - I know ... I like the idea of a pool of processes instead of threads - from my point of view this would be useful. I am planning to run some tests (GEQO, AIX, sorts) as soon as I have time to do so (still too much work ahead before :( ...). If I had time I'd love to do something for the PostgreSQL community :(. As far as sorting is concerned: It would be fine if it was possible to define an alternative location for temporary sort files using SET. If you had multiple disks this would help in the case of concurrent sorts, because this way people could insert and index many tables at once without having to access just one storage system. This would be an easy way out of the I/O limitation ... - at least for some problems. Bingo! If you want to increase sorting performance, give it more I/O bandwidth, and it will take 1/100th of the effort of doing threading. Ingres had a nice feature where you could specify sort directories, and it would cycle through those directories while it did the tape sort. Added to TODO: * Allow sorting to use multiple work directories
Re: [HACKERS] Threaded Sorting
scott.marlowe wrote: We haven't thought about it yet because there are too many buggy thread implementations. We are probably just now getting to a point where we can consider it. However, lots of databases have moved to threads for all sorts of things and ended up with a royal mess of code. Threads can only improve things in a few areas of the backend, so it would be nice if we could limit the exposure to threads to those areas; sorting could certainly be one of them, but frankly, I think disk I/O is our limiting factor there. I would be interested to see some tests that showed otherwise. Wouldn't the type of disk subsystem really make a big difference here? With a couple of U160 cards and a dozen 15krpm hard drives, I would imagine I/O would no longer be as much of an issue as on a single-drive system. It seems like sometimes we consider these issues more from the one-or-two-SCSI-drives perspective instead of the big-box-o'-drives perspective. Yes, it is mostly for non-RAID drives, but also, sometimes single drives can be faster. When you have a drive array, it isn't as easy to hit each drive and keep it running sequentially. Of course, I don't have any hard numbers on that. ;-)
Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP
Bruce Momjian [EMAIL PROTECTED] wrote: That is a very good point. At least with serializable transactions it seems perfectly reasonable to return a frozen CURRENT_TIMESTAMP. What do you think about read-committed level? Can time be committed? ;-) It would be even more surprising to new users if the implementation of CURRENT_TIMESTAMP depended on the transaction isolation level. Yes, CURRENT_TIMESTAMP changing based on transaction serializable/read committed would be quite confusing. Also, because our default is read committed, we would end up with CURRENT_TIMESTAMP being statement-level, which actually does give us a logical place to allow CURRENT_TIMESTAMP to change, but I thought people voted against that. However, imagine a query that used CURRENT_TIMESTAMP in the WHERE clause to find items that were not in the future. Would a CURRENT_TIMESTAMP test in a multi-statement transaction want to check based on transaction start, or on the tuples visible at the time the statement started? Well, in a serializable transaction there would be no difference at all, at least if CURRENT_TIMESTAMP is consistent within the transaction. Any changes outside the transaction after SetQuerySnapshot would not be seen by the transaction anyway. In read committed, I think it's different. If CURRENT_TIMESTAMP is frozen, then the behavior would be the same as in serializable level; if CURRENT_TIMESTAMP advances with each statement, the result would also change. That is an inherent problem with read committed, though, and has not so much to do with the timestamp behavior. Regards, Michael Paesold
Re: [HACKERS] Threaded Sorting
Bruce Momjian wrote: scott.marlowe wrote: snip It seems like sometimes we consider these issues more from the one-or-two-SCSI-drives perspective instead of the big-box-o'-drives perspective. Yes, it is mostly for non-RAID drives, but also, sometimes single drives can be faster. When you have a drive array, it isn't as easy to hit each drive and keep it running sequentially. Of course, I don't have any hard numbers on that. ;-) Arrghh... please remember that big bunch of drives != all in one array. It's common to have a bunch of drives and allocate different ones for different tasks appropriately, whether in array sets, individually, mirrored, etc. 100% totally feasible to have a separate 15k SCSI drive or two just purely for doing sorts if it would assist in throughput. :-) Regards and best wishes, Justin Clift -- My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there. - Indira Gandhi
Re: numeric hierarchy again (was Re: [HACKERS] floor function in 7.3b2)
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Do we know that defaulting floating constants will not be a performance hit? Uh ... what's your concern exactly? The datatype coercion (if any) will happen once at parse time, not at runtime. Yes, I realize it is during parsing. I was just wondering if making constants coming in from the parser NUMERIC is a performance hit. I see in gram.y that FCONST comes in as a Float, so I don't even see where we make it NUMERIC.
Re: [HACKERS] any experience with IA-64
Oleg Bartunov wrote: I, probably, will have a chance to work with postgres on Linux IA-64. Is there any optimization for postgresql ? None we know of. I think people have already gotten it working.
Re: [HACKERS] [GENERAL] Anyone want to assist with the translation of the
Hi Justin, you probably want to use language negotiation rather than a query variable :-) Regards Tino --On Donnerstag, 3. Oktober 2002 08:53 +1000 Justin Clift [EMAIL PROTECTED] wrote: Hi everyone, Have just put together a prototype page to show off the multi-lingual capabilities that the Advocacy site's infrastructure has: http://advocacy.postgresql.org/?lang=de The text was translated to German via Altavista's Babelfish, so it's probably only about 80% accurate, but it conveys the concept. Is anyone interested in translating the English version to other languages? All Latin-based languages should be fine (German, French, Italian, Spanish, Portuguese, Turkish, Greek, etc). If there's strong interest, then an interface to let volunteer translators do it easily can be constructed over the next fortnight or so. :-) Regards and best wishes, Justin Clift -- My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there. - Indira Gandhi
Re: [HACKERS] [GENERAL] Anyone want to assist with the translation of the Advocacy
Hi everyone, Have just put together a prototype page to show off the multi-lingual capabilities that the Advocacy site's infrastructure has: http://advocacy.postgresql.org/?lang=de The text was translated to German via Altavista's Babelfish, so it's probably only about 80% accurate, but it conveys the concept. Is anyone interested in translating the English version to other languages? All Latin-based languages should be fine (German, French, Italian, Spanish, Portuguese, Turkish, Greek, etc). If there's strong interest, then an interface to let volunteer translators do it easily can be constructed over the next fortnight or so. :-) Regards and best wishes, Justin Clift Justin, I would be glad to translate it to Brazilian Portuguese. Here we have a lot of companies starting to use PostgreSQL, including the one where I work and some of our clients. It would be very nice to have this site translated. Cheers, -- Diogo de Oliveira Biazus [EMAIL PROTECTED] Ikono Sistemas e Automação http://www.ikono.com.br
Re: [HACKERS] Threaded Sorting
Bruce Momjian [EMAIL PROTECTED] writes: Bingo! Want to increase sorting performance, give it more I/O bandwidth, and it will take 1/100th of the time to do threading. Added to TODO: * Allow sorting to use multiple work directories Yeah, I like that. Actually it should apply to all temp files, not only sorting. A crude hack would be to allow there to be multiple pg_temp_NNN/ subdirectories (read: symlinks) in a database, and then the code would automatically switch among these. Probably a cleaner idea would be to somehow integrate this with tablespace management --- if you could mark some tablespaces as intended for temp stuff, the system could round-robin among those as it creates temp files and/or temp tables. regards, tom lane
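The round-robin idea above is simple enough to sketch. This is an illustration only (Python, hypothetical names, not PostgreSQL's fd.c): each new temp sort file is created in the next directory in the cycle, so concurrent sorts spread their I/O across spindles.

```python
import itertools
import os
import tempfile

def make_temp_allocator(directories):
    """Return a function that creates each new temp file in the next
    directory, round-robin, spreading sort I/O across spindles."""
    cycle = itertools.cycle(directories)

    def create_temp(prefix="pgsql_tmp"):
        d = next(cycle)
        return tempfile.mkstemp(prefix=prefix, dir=d)  # (fd, path)

    return create_temp

# Demo: four temp files over two directories land two in each.
d1, d2 = tempfile.mkdtemp(), tempfile.mkdtemp()
alloc = make_temp_allocator([d1, d2])
paths = []
for _ in range(4):
    fd, p = alloc()
    os.close(fd)
    paths.append(p)
counts = [sum(os.path.dirname(p) == d for p in paths) for d in (d1, d2)]
```

The tablespace-integrated version Tom describes would do the same cycling, just keyed off tablespaces marked for temp use rather than a fixed directory list.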
Re: [HACKERS] Threaded Sorting
Greg Copeland [EMAIL PROTECTED] writes: ... I can understand why addressing the seemingly more common I/O bound case would receive priority, however, I'm at a loss as to why the other would be completely ignored. Bruce already explained that we avoid threads because of portability and robustness considerations. The notion of a sort process pool seems possibly attractive. I'm unconvinced that it's going to be a win though because of the cost of shoving data across address-space boundaries. Another issue is that the sort comparison function can be anything, including user-defined code that does database accesses or other interesting stuff. This would mean that the sort auxiliary process would have to adopt the database user identity of the originating process, and quite possibly absorb a whole lot of other context information before it could correctly/safely execute the comparison function. That pushes the overhead up a lot more. (The need to allow arbitrary operations in the comparison function would put a pretty substantial crimp on a thread-based approach, too, even if we were willing to ignore the portability issue.) Still, if you want to try it out, feel free ... this is an open-source project, and if you can't convince other people that an idea is worth implementing, that doesn't mean you can't implement it yourself and prove 'em wrong. regards, tom lane
Re: [HACKERS] Threaded Sorting
On Fri, 2002-10-04 at 15:07, Tom Lane wrote: the sort comparison function can be anything, including user-defined code that does database accesses or other interesting stuff. This This is something that I'd not considered. would mean that the sort auxiliary process would have to adopt the database user identity of the originating process, and quite possibly absorb a whole lot of other context information before it could correctly/safely execute the comparison function. That pushes the overhead up a lot more. Significantly! Agreed. Still, if you want to try it out, feel free ... this is an open-source project, and if you can't convince other people that an idea is worth implementing, that doesn't mean you can't implement it yourself and prove 'em wrong. No Tom, my issue wasn't whether I could convince someone, but rather that something had been put on the table requesting additional feedback on its feasibility and had been completely ignored. Fact is, I knew I didn't know enough about the implementation details to even attempt to convince anyone of anything. I simply wanted to explore the idea, or rather the feasibility of the idea. In theory, it's a great idea. In practice, I had no idea, thus my desire to seek additional input. As such, it seems a practical implementation may prove difficult. I now understand. Thank you for taking the time to respond in a manner that satisfies my curiosity. That's all I was looking for. :) Best Regards, Greg
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes: ... most file systems can't process fsync's simultaneous with other writes, so those writes block because the file system grabs its own internal locks. Oh? That would be a serious problem, but I've never heard that asserted before. Please provide some evidence. On a filesystem that does have that kind of problem, can't you avoid it just by using O_DSYNC on the WAL files? Then there's no need to call fsync() at all, except during checkpoints (which actually issue sync() not fsync(), anyway). Whether by threads or multiple processes, there is the same contention on the file through multiple writers. The file system can decide to reorder writes before they start but not after. If a write comes after a fsync starts it will have to wait on that fsync. AFAICS we cannot allow the filesystem to reorder writes of WAL blocks, on safety grounds (we want to be sure we have a consistent WAL up to the end of what we've written). Even if we can allow some reordering when a single transaction puts out a large volume of WAL data, I fail to see where any large gain is going to come from. We're going to be issuing those writes sequentially and that ought to match the disk layout about as well as can be hoped anyway. Likewise a given process's writes can NEVER be reordered if they are submitted synchronously, as is done in the calls to flush the log as well as the dirty pages in the buffer in the current code. We do not fsync buffer pages; in fact a transaction commit doesn't write buffer pages at all. I think the above is just a misunderstanding of what's really happening. We have synchronous WAL writing, agreed, but we want that AFAICS. Data block writes are asynchronous (between checkpoints, anyway). 
There is one thing in the current WAL code that I don't like: if the WAL buffers fill up then everybody who would like to make WAL entries is forced to wait while some space is freed, which means a write, which is synchronous if you are using O_DSYNC. It would be nice to have a background process whose only task is to issue write()s as soon as WAL pages are filled, thus reducing the probability that foreground processes have to wait for WAL writes (when they're not committing that is). But this could be done portably with one more postmaster child process; I see no real need to dabble in aio_write. regards, tom lane
Re: [HACKERS] Potential Large Performance Gain in WAL synching
... most file systems can't process fsync's simultaneous with other writes, so those writes block because the file system grabs its own internal locks. Oh? That would be a serious problem, but I've never heard that asserted before. Please provide some evidence. On a filesystem that does have that kind of problem, can't you avoid it just by using O_DSYNC on the WAL files? To make this competitive, the WAL writes would need to be improved to do more than one block (up to 256k or 512k per write) with one write call (if that much is to be written for this tx to be able to commit). This should actually not be too difficult since the WAL buffer is already contiguous memory. If that is done, then I bet O_DSYNC will beat any other config we currently have. With this, a separate disk for WAL and large transactions, you should be able to see your disks hit the max IO figures they are capable of :-) Andreas
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes: To make this competitive, the WAL writes would need to be improved to do more than one block (up to 256k or 512k per write) with one write call (if that much is to be written for this tx to be able to commit). This should actually not be too difficult since the WAL buffer is already contiguous memory. Hmmm ... if you were willing to dedicate a half meg or meg of shared memory for WAL buffers, that's doable. I was originally thinking of having the (still hypothetical) background process wake up every time a WAL page was completed and available to write. But it could be set up so that there is some slop, and it only wakes up when the number of writable pages exceeds N, for some N that's still well less than the number of buffers. Then it could write up to N sequential pages in a single write(). However, this would only be a win if you had few and large transactions. Any COMMIT will force a write of whatever we have so far, so the idea of writing hundreds of K per WAL write can only work if it's hundreds of K between commit records. Is that a common scenario? I doubt it. If you try to set it up that way, then it's more likely that what will happen is the background process seldom awakens at all, and each committer effectively becomes responsible for writing all the WAL traffic since the last commit. Wouldn't that lose compared to someone else having written the previous WAL pages in background? We could certainly build the code to support this, though, and then experiment with different values of N. If it turns out N==1 is best after all, I don't think we'd have wasted much code. regards, tom lane
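The "write up to N sequential pages in a single write()" scheme Tom describes can be sketched as a small buffer-coalescing writer. This is an illustrative sketch only (Python, made-up class name, tiny demo pages instead of real 8K WAL pages): completed pages are accumulated, and one large sequential write() is issued once N pages are ready or on an explicit flush at commit.

```python
import os
import tempfile

class WalWriter:
    """Accumulate completed WAL-style pages; issue one write() once
    n pages are ready (or on an explicit flush, e.g. at commit)."""

    def __init__(self, fd, n=4):
        self.fd = fd
        self.n = n
        self.pages = []

    def page_ready(self, page: bytes):
        self.pages.append(page)
        if len(self.pages) >= self.n:
            self.flush()

    def flush(self):
        if self.pages:
            # One large sequential write instead of n small ones.
            os.write(self.fd, b"".join(self.pages))
            self.pages.clear()

# Demo with n=2 and 16-byte "pages": nothing hits the file until the
# second page completes, then both go out in a single write().
fd, path = tempfile.mkstemp()
w = WalWriter(fd, n=2)
w.page_ready(b"A" * 16)
size_before = os.fstat(fd).st_size
w.page_ready(b"B" * 16)
size_after = os.fstat(fd).st_size
os.close(fd)
```

Tom's caveat applies directly: a commit would call flush() immediately, so the batching only pays off when many pages complete between commit records.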
Re: [HACKERS] Improving backend startup interlock
Tom Lane writes: $ man flock No manual entry for flock. $ HPUX has generally taken the position of adopting both BSD and SysV features, so if it doesn't exist here, it's not portable to older Unixen ... If only local locking is at issue then finding any one of fcntl() locking, flock(), or lockf() would do. All Unixen will have one or more of these and autoconf machinery exists to find them. The issue Tom raised about NFS support remains: locking over NFS introduces new failure modes. It also only works for NFS clients that support NFS locking, which not all do. Mind you NFS users are currently entirely unprotected from someone starting a postmaster on a different NFS client using the same data directory right now, which file locking would prevent. So there is some win for NFS users as well as local filesystem users. (Anyone using NFS care to put their hand up? Maybe nobody does?) Is the benefit of better local filesystem behaviour plus multiple client protection for NFS users who have file locking enough to outweigh the drawbacks? My two cents says it is, but my two cents are worth approximately USD$0.01, which is to say not very much ... Regards, Giles
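For illustration, here is what the advisory-lock interlock Giles describes looks like with BSD-style flock() (a Python sketch with made-up names, not a proposed patch; per the thread, platforms without flock() would need fcntl()/lockf() instead, and the NFS caveats above still apply since flock() historically does not lock across NFS):

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Try to take an exclusive advisory lock on a data-directory lock
    file. Returns the open fd on success, or None if another holder
    already has the lock (e.g. a running postmaster)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # keep this fd open for the lifetime of the postmaster
    except BlockingIOError:
        os.close(fd)
        return None

# Demo: the first locker wins; a second attempt on the same file fails
# immediately instead of blocking.
fd0, lockpath = tempfile.mkstemp()
os.close(fd0)
holder = try_lock(lockpath)
intruder = try_lock(lockpath)
```

Unlike a pid file, the lock vanishes automatically when the holding process dies, which is the main attraction over the existing stale-pid heuristics.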
Re: [HACKERS] Improving backend startup interlock
Giles Lean [EMAIL PROTECTED] wrote: Tom Lane writes: $ man flock No manual entry for flock. $ HPUX has generally taken the position of adopting both BSD and SysV features, so if it doesn't exist here, it's not portable to older Unixen ... If only local locking is at issue then finding any one of fcntl() locking, flock(), or lockf() would do. All Unixen will have one or more of these and autoconf machinery exists to find them. The issue Tom raised about NFS support remains: locking over NFS introduces new failure modes. It also only works for NFS clients that support NFS locking, which not all do. Mind you NFS users are currently entirely unprotected from someone starting a postmaster on a different NFS client using the same data directory right now, which file locking would prevent. So there is some win for NFS users as well as local filesystem users. (Anyone using NFS care to put their hand up? Maybe nobody does?) Is the benefit of better local filesystem behaviour plus multiple client protection for NFS users who have file locking enough to outweigh the drawbacks? My two cents says it is, but my two cents are worth approximately USD$0.01, which is to say not very much ... Well, I am going to do some tests with postgresql and our netapp filer later in October. If that setup proves to work fast and reliable I would also be interested in such a locking. I don't care about the feature if I find the postgresql/NFS/netapp-filer setup to be unreliable or bad performing. I'll see. Regards, Michael Paesold
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Hmmm ... if you were willing to dedicate a half meg or meg of shared memory for WAL buffers, that's doable. Yup, configuring Informix to three 2 Mb buffers (LOGBUF 2048) here. However, this would only be a win if you had few and large transactions. Any COMMIT will force a write of whatever we have so far, so the idea of writing hundreds of K per WAL write can only work if it's hundreds of K between commit records. Is that a common scenario? I doubt it. It should help most for data loading, or mass updating, yes. Andreas
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith writes: I'm no Unix filesystem expert but I don't see how the OS can handle multiple writes and fsyncs to the same file descriptors without blocking other processes from writing at the same time. Why not? Other than the necessary synchronisation for attributes such as file size and modification times, multiple processes can readily write to different areas of the same file at the same time. fsync() may not return until after the buffers it schedules are written, but it doesn't have to block subsequent writes to different buffers in the file either. (Note too Tom Lane's responses about when fsync() is used and not used.) I'll have to write a test and see if there really is a problem. Please do. I expect you'll find things aren't as bad as you fear. In another posting, you write: Hmm, I keep hearing that buffer block writes are asynchronous but I don't read that in the code at all. There are simple write calls with files that are not opened with O_NOBLOCK, so they'll be done synchronously. The code for this is relatively straightforward (once you get past the storage manager abstraction) so I don't see what I might be missing. There is a confusion of terminology here: the write() is synchronous from the point of the application only in that the data is copied into kernel buffers (or pages remapped, or whatever) before the system call returns. For files opened with O_DSYNC the write() would wait for the data to be written to disk. Thus O_DSYNC is synchronous I/O, but there is no equivalently easy name for the regular flush to disk after write() returns that the Unix kernel has done ~forever. The asynchronous I/O that you mention (aio) is a third thing, different from both regular write() and write() with O_DSYNC. I understand that with aio the data is not even transferred to the kernel before the aio_write() call returns, but I've never programmed with aio and am not 100% sure how it works.
Regards, Giles
Re: [HACKERS] Improving backend startup interlock
Michael Paesold wrote: Giles Lean [EMAIL PROTECTED] wrote: Mind you NFS users are currently entirely unprotected from someone starting a postmaster on a different NFS client using the same data directory right now, which file locking would prevent. So there is some win for NFS users as well as local filesystem users. (Anyone using NFS care to put their hand up? Maybe nobody does?) Is the benefit of better local filesystem behaviour plus multiple client protection for NFS users who have file locking enough to outweigh the drawbacks? My two cents says it is, but my two cents are worth approximately USD$0.01, which is to say not very much ... Well, I am going to do some tests with postgresql and our netapp filer later in October. If that setup proves to work fast and reliable I would also be interested in such a locking. I don't care about the feature if I find the postgresql/NFS/netapp-filer setup to be unreliable or bad performing. We have multiple Oracle databases running over NFS from an HPUX server to a netapp and have been pleased with the performance overall. It does require some tuning to get it right, and it hasn't been entirely without issues, but I don't see us going back to local storage. We also just recently set up a Linux box running Oracle against an NFS mounted netapp. Soon I'll be adding Postgres on the same machine, initially using locally attached storage, but at some point I may need to shift to the netapp due to data volume. If you do try Postgres on the netapp, please post your results/experience and I'll do the same. Anyway, I guess I qualify as interested in an NFS safe locking method. Joe
Re: [HACKERS] Threaded Sorting
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Tom Lane wrote: ... But ISTM that if this makes sense for our internal temp files, it makes sense for user-created temp tables as well. Yes, I was thinking that, but of course, those are real tables, rather than just files. Not sure how clean it will be to mix those in the same directory. We haven't in the past. Is it a good idea? Sure we have --- up till recently, pg_temp files just lived in the database directory. I think it was you that added the pg_temp subdirectory, and the reason you did it was to let people symlink the temp files to someplace else. But that's just a zeroth-order approximation to providing a tablespace facility for these things. OK, TODO updated: * Allow sorting, temp files, temp tables to use multiple work directories FYI, I originally created that directory so a postmaster startup could clear that dir.
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Tom Lane wrote: Curtis Faith [EMAIL PROTECTED] writes: After some research I still hold that fsync blocks, at least on FreeBSD. Am I missing something? Here's the evidence: [ much snipped ] vp = (struct vnode *)fp->f_data; vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p); Hm, I take it a vnode is what's usually called an inode, ie the unique identification data for a specific disk file? Yes, Virtual Inode. I think it is virtual because it is used for NFS, where the handle really isn't an inode. This is kind of ugly in general terms but I'm not sure that it really hurts Postgres. In our present scheme, the only files we ever fsync() are WAL log files, not data files. And in normal operation there is only one WAL writer at a time, and *no* WAL readers. So an exclusive kernel-level lock on a WAL file while we fsync really shouldn't create any problem for us. (Unless this indirectly blocks other operations that I'm missing?) I think the small issue is: proc1: write, fsync; proc2: write, fsync. Proc2 has to wait for proc1's fsync, but the write is so short compared to the fsync that I don't see an issue. Now, if someone would come up with code that did only one fsync for the above case, that would be a big win.
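The "only one fsync for the above case" Bruce asks for is essentially group commit, and the core bookkeeping fits in a few lines. This is an illustrative sketch (Python, made-up names, not PostgreSQL's XLOG code): each committer records how far the log must be durable, and whoever flushes first makes every earlier write durable, so later committers find their target already covered and skip the fsync.

```python
import os
import tempfile
import threading

class GroupCommitLog:
    """Sketch of sharing one fsync among several committers: appends go
    out immediately, but a flush up to offset X is a no-op if an earlier
    fsync already covered X."""

    def __init__(self, fd):
        self.fd = fd
        self.lock = threading.Lock()
        self.written = 0   # bytes appended so far
        self.flushed = 0   # bytes known to be on stable storage
        self.fsyncs = 0    # fsync() calls actually issued

    def append(self, record: bytes) -> int:
        """Write the record; return the offset to flush up to."""
        with self.lock:
            os.write(self.fd, record)
            self.written += len(record)
            return self.written

    def flush(self, target: int) -> None:
        """Make everything up to target durable, piggybacking on any
        fsync another committer already issued."""
        with self.lock:
            if self.flushed < target:
                os.fsync(self.fd)   # covers every write() before it
                self.fsyncs += 1
                self.flushed = self.written

# Demo: two "transactions" append, then both flush -- one fsync total.
fd, path = tempfile.mkstemp()
log = GroupCommitLog(fd)
t1 = log.append(b"rec-1\n")
t2 = log.append(b"rec-2\n")
log.flush(t2)  # one fsync makes both records durable
log.flush(t1)  # already covered: no second fsync
os.close(fd)
```

The real backend would additionally need the waiting and wakeup machinery (proc2 sleeping while proc1's fsync is in flight), but the covered-offset check above is the part that turns two fsyncs into one.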
Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP
OK, are we agreed to leave CURRENT_TIMESTAMP/now() alone and just add now(string)? If no one replies, I will assume that is a yes and I will add it to TODO. --- Michael Paesold wrote: Bruce Momjian [EMAIL PROTECTED] wrote: That is a very good point. At least with serializable transactions it seems perfectly reasonable to return a frozen CURRENT_TIMESTAMP. What do you think about read-committed level? Can time be committed? ;-) It would be even more surprising to new users if the implementation of CURRENT_TIMESTAMP depended on the transaction isolation level. Yes, CURRENT_TIMESTAMP changing based on transaction serializable/read committed would be quite confusing. Also, because our default is read committed, we would end up with CURRENT_TIMESTAMP being statement-level, which actually does give us a logical place to allow CURRENT_TIMESTAMP to change, but I thought people voted against that. However, imagine a query that used CURRENT_TIMESTAMP in the WHERE clause to find items that were not in the future. Would a CURRENT_TIMESTAMP test in a multi-statement transaction want to check based on transaction start, or on the tuples visible at the time the statement started? Well, in a serializable transaction there would be no difference at all, at least if CURRENT_TIMESTAMP is consistent within the transaction. Any changes outside the transaction after SetQuerySnapshot would not be seen by the transaction anyway. In read committed, I think it's different. If CURRENT_TIMESTAMP is frozen, then the behavior would be the same as in serializable level; if CURRENT_TIMESTAMP advances with each statement, the result would also change. That is an inherent problem with read committed, though, and has not so much to do with the timestamp behavior.
Regards, Michael Paesold