Re: [HACKERS] Merge algorithms for large numbers of tapes
> Two pass will create the count of subfiles proportional to:
> Subfile_count = original_stream_size / sort_memory_buffer_size
> The merge pass requires (sizeof record * subfile_count) memory.

That is true from an algorithmic perspective. But to make the merge efficient you would need to have enough RAM to cache a reasonably large block per subfile. Else you would need to reread the same page/block from one subfile multiple times. (If you had one disk per subfile you could also rely on the disk's own cache, but I think we can rule that out.)

Example: You have a 7 gigabyte table to sort and you have a 100 MB sort buffer. The number of subfiles will be: 7000 MB / 100 MB = 70 files. To be efficient you need (70 + 1) * max(record_size, 256k) = ~18 MB. Plus you need a structure per subfile that points to the current record in the buffer.

Andreas
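To make the arithmetic above concrete, here is a minimal sketch of the calculation (the numbers come from Andreas's example; the 256k block size is his assumption, not a PostgreSQL constant):

#include <stdio.h>

/* Rough model of the two-pass external sort sizing discussed above:
 * subfile_count = input_size / sort_buffer; the merge pass then wants
 * one reasonably large read buffer per subfile plus one for output.
 */
int main(void)
{
    double input_mb   = 7000.0;   /* 7 GB table                    */
    double sortbuf_mb = 100.0;    /* 100 MB sort buffer            */
    double block_kb   = 256.0;    /* assumed per-subfile read block */

    double subfiles = input_mb / sortbuf_mb;                /* 70       */
    double merge_mb = (subfiles + 1) * block_kb / 1024.0;   /* ~17.75   */

    printf("subfiles: %.0f, merge buffers: ~%.1f MB\n", subfiles, merge_mb);
    return 0;
}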
[HACKERS] Function's final statement must not be a SELECT
Hello,

I want to write an SQL wrapper for a void plpgsql function, but a void SQL function must not finish with a SELECT command, and I don't know any other command I could use. Can you help me?

Thank you,
Pavel Stehule
Re: [HACKERS] Merge algorithms for large numbers of tapes
On Fri, Mar 10, 2006 at 09:57:28AM +0100, Zeugswetter Andreas DCP SD wrote:
>> Two pass will create the count of subfiles proportional to:
>> Subfile_count = original_stream_size / sort_memory_buffer_size
>> The merge pass requires (sizeof record * subfile_count) memory.
>
> That is true from an algorithmic perspective. But to make the merge efficient you would need to have enough RAM to cache a reasonably large block per subfile. Else you would need to reread the same page/block from one subfile multiple times. (If you had one disk per subfile you could also rely on the disk's own cache, but I think we can rule that out.)

But what about the OS cache? Linux will read up to the next 128KB of a file if it's contiguous on disk, which is likely with modern filesystems. It's likely to be much fairer than any way we can come up with to share memory.

Question is, do we want our algorithm to rely on that caching?

--
Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
Re: [HACKERS] Coverity Open Source Defect Scan of PostgreSQL
On Wed, Mar 08, 2006 at 10:34:38PM -0800, Ben Chelf wrote:
>> On 3/8/06, Josh Berkus josh ( at ) agliodbs ( dot ) com wrote:
>> Actually, I thought that Neil/eDB did this with their copy. Is there any way to get a copy of that training configuration?
>
> Just to jump in on this thread, we can absolutely configure elog -- if you have the config already, great. If not, if you can just send me the prototype/macro expansion for 'elog' and the constant values that are passed in the case where it exits, I'll add that config. Thanks!

I don't know if anyone has responded to this, but it works as follows. The actual source can be seen here:

http://developer.postgresql.org/cvsweb.cgi/pgsql/src/include/utils/elog.h?rev=1.82

elog expands to elog_finish, which doesn't return if the first argument is >= ERROR (which is the number 20). ereport(X) is more complex. You want the first argument of that, but it expands to something similar to: errstart(X), errfinish(). If X >= ERROR in the call to errstart, errfinish doesn't return.

Hope this helps,
--
Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
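For readers following along, a simplified sketch of the expansion Martijn describes (this is the shape of the macros, not a verbatim copy of elog.h; the linked revision is authoritative):

#include <stdbool.h>

/* Simplified sketch of the elog/ereport expansion.  Severity levels are
 * small integers; ERROR is 20, and any call with a level >= ERROR does
 * not return to the caller -- which is what a static analyzer needs to
 * be told.
 */
#define ERROR 20

extern void elog_start(const char *filename, int lineno, const char *funcname);
extern void elog_finish(int elevel, const char *fmt, ...);
extern bool errstart(int elevel, const char *filename, int lineno, const char *funcname);
extern void errfinish(int dummy, ...);

#define elog   elog_start(__FILE__, __LINE__, __func__), elog_finish

#define ereport(elevel, rest) \
    (errstart(elevel, __FILE__, __LINE__, __func__) ? \
     (errfinish rest) : (void) 0)

/* So elog(ERROR, ...) becomes elog_finish(20, ...), which never returns,
 * and ereport(ERROR, (...)) becomes errstart(20, ...) followed by
 * errfinish(...), where errfinish never returns if errstart's level
 * was >= 20.
 */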
[HACKERS] Proposal for updatable views
Hi folks,

Please find attached a patch that implements SQL92-compatible updatable views. The patch introduces new semantics into the rule system: implicit and explicit rules. Implicit rules are created to implement updatable views:

_INSERT / _NOTHING_INSERT (unconditional DO INSTEAD NOTHING rule)
_DELETE / _NOTHING_DELETE (unconditional DO INSTEAD NOTHING rule)
_UPDATE / _NOTHING_UPDATE (unconditional DO INSTEAD NOTHING rule)

These rules are marked 'implicit' in pg_rewrite, and the rewriter is taught to handle them differently depending on whether they are created with a rule condition (a view's CHECK OPTION) or not. The CHECK OPTION itself is implemented with a new system function (pg_view_update_error()) and a conditional rule that evaluates the view's WHERE condition. The supported syntax is

CREATE VIEW foo AS SELECT ... [WITH [LOCAL | CASCADED] CHECK OPTION];

The LOCAL and CASCADED keywords are optional when a CHECK OPTION is specified; the default is CASCADED (this syntax creates a shift/reduce conflict in the grammar file that I don't know how to fix). If a user wants to create his own rules with CREATE RULE, the implicit rule gets dropped, depending on what action the user selects. The patch introduces support for pg_dump as well.

Please note that the patch isn't complete yet, but it seems necessary to discuss its implementation on -hackers now.

Bernd

pgsql-view_update_8.2dev.tar.bz2 Description: Binary data
Re: [HACKERS] Proposal for SYNONYMS
--On Donnerstag, März 09, 2006 17:23:11 -0500 Tom Lane [EMAIL PROTECTED] wrote:
> (BTW, there was some work being done on updatable views, but I think it's stalled. I suspect the reason is that our current rule system is just too odd to support updatable views reasonably. I've been wondering if an implementation based on allowing triggers on views would be any more manageable.)

It has stalled because I want more than SQL92 to be implemented... but that's another story. I've sent my latest patches against HEAD to -hackers for discussion.

Bernd
Re: [HACKERS] Updateable views was:(Re: [HACKERS] Proposal for SYNONYMS)
Jaime Casanova wrote:
> On 3/9/06, Tom Lane [EMAIL PROTECTED] wrote:
>> Josh Berkus josh@agliodbs.com writes:
>>> Eh? I thought that it was just syntactic sugar that was missing. I've built lots of updatable views manually; I don't see what's difficult about it.
>>
>> I think you'll find that corner cases like inserts involving nextval() don't work real well with a rule-based updatable view. But perhaps I'm just scarred by the many complaints we've had about rules. With a plain unconditional DO INSTEAD rule it might be OK ...
>
> The last time I talked with Bernd Helmle, he was preparing the code to send to -patches for discussion... that was two months ago... the current code had problems with casts and I think with domains too... I will contact Bernd to find out whether he has done some more work; if not, I can send to -patches the latest patch he sent me...

I'd certainly be interested in having auto-updatable views in 8.2 - even if it was only for the simplest of cases. If I can be of any help testing etc. let me know.

--
Richard Huxton Archonet Ltd
Re: [HACKERS] Updateable views was:(Re: [HACKERS] Proposal for
--On Freitag, März 10, 2006 09:43:04 + Richard Huxton dev@archonet.com wrote:
> I'd certainly be interested in having auto-updatable views in 8.2 - even if it was only for the simplest of cases. If I can be of any help testing etc. let me know.

Yeah, that would be cool. I've sent the latest patch to -hackers. Feel free to check it out. I currently know that array fields (e.g. field[1]) cause problems, but I'm pretty sure there's much more work left...

Bernd
Re: [HACKERS] Where Can I Find The Code Segment For WAL Control?
Charlie,

I'm currently working on a similar solution (it's true I'm only in the thinking phase). I don't have too much time to spend on it (~a few hours per week, on the train while commuting), so it's not progressing too fast... Nevertheless, I would like to do a first proof-of-concept version in a contrib module with minimal coding involved. I don't have time right now to explain all the details of how I would like to do it, but in essence I want to:

- create a few functions which expose the WAL files and the data base files as streams;
- use the archive_command framework to control when the WAL files can be recycled, in combination with a WAL subscription system to make sure the standby won't be missing WAL files (with safeguards so the primary system will not run out of WAL file system space if the standby can't keep up with the WAL stream);
- create a standby manager program which only needs to know how to access the primary server in order to create the standby (by connecting to it through normal data base connections and using the above mentioned functions to stream the files).

These are just the main ideas; the devil is in the details. As for answering your question, a possible starting point would be to take a look at:

src/backend/postmaster/pg_arch.c

This one deals with archiving WAL files, so it could give you some insight into how you can deal with the WAL files. Or just search the source tree for 'WAL'... I guess there are quite a few places in the code where WAL is involved.

Next week I'll post a more detailed design (some charts I did to better understand the idea, and some more text to add details) if anybody is interested in co-working on it; otherwise I'll just keep working at my own pace until it gets done...

Cheers,
Csaba.

On Fri, 2006-03-10 at 07:34, 王宝兵 wrote:
> I am now trying to develop PG to support real-time backup. My architecture is somewhat similar to the Database Mirroring technology of SQL Server 2005. The server end of the system consists of two DB servers: one is the Principal server, the other is the Mirror server. Whenever the Principal flushes its log buffer to the local log file, it must send the content of its buffer to the Mirror simultaneously. After the Mirror receives the buffer, it writes it to its own log file and sends a response message to the Principal; then the Mirror redoes/undoes its log. This way, we can guarantee that the database instances of the two servers are identical.
>
> But now I encounter a problem. I don't know where the functions to control the log buffer are. Which code segment may I refer to? I have just participated in the PG project for a short time, and I will appreciate your help very much!
>
> Look forward to you all!
>
> Charlie Wang
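For the narrower question of where the log buffer is controlled: the low-level entry points live in src/backend/access/transam/xlog.c. A rough sketch of the call pattern follows (simplified; the real code carries more bookkeeping, and ship_to_mirror() is purely hypothetical, shown only to mark where a mirroring hook would have to sit):

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xlog.h"

/*
 * Sketch of the WAL write path relevant to the question above:
 * a record is built as a chain of XLogRecData and handed to
 * XLogInsert(), which copies it into the shared WAL buffers;
 * XLogFlush() later forces those buffers out to the WAL file
 * (e.g. at commit).  A synchronous mirroring scheme would hook
 * in around the flush.
 */
static void
log_and_flush_example(RmgrId rmid, uint8 info, XLogRecData *rdata)
{
    XLogRecPtr recptr;

    recptr = XLogInsert(rmid, info, rdata); /* record goes into WAL buffers */
    XLogFlush(recptr);                      /* buffers forced to the WAL file */

    /* hypothetical hook: ship the just-flushed WAL range to the mirror */
    /* ship_to_mirror(recptr); */
}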
Re: [HACKERS] Enhanced containment selectivity function
Bruce Momjian ha scritto:
> Your patch has been added to the PostgreSQL unapplied patches list at:
> http://momjian.postgresql.org/cgi-bin/pgpatches
> It will be applied as soon as one of the PostgreSQL committers reviews and approves it.

Great. I would just like to remind you that Tom said:

> I'd be willing to consider exporting those functions from selfuncs.c.

so that the selectivity function could be moved to contrib/ltree, which is its natural place. It could also be noted that a similar feature could be useful outside ltree: I guess there are plenty of cases where scanning statistics would give a better result than using a constant selectivity.

Best regards
--
Matteo Beccati http://phpadsnew.com http://phppgads.com
Re: [HACKERS] Merge algorithms for large numbers of tapes
>>> Two pass will create the count of subfiles proportional to:
>>> Subfile_count = original_stream_size / sort_memory_buffer_size
>>> The merge pass requires (sizeof record * subfile_count) memory.
>>
>> That is true from an algorithmic perspective. But to make the merge efficient you would need to have enough RAM to cache a reasonably large block per subfile. Else you would need to reread the same page/block from one subfile multiple times. (If you had one disk per subfile you could also rely on the disk's own cache, but I think we can rule that out.)
>
> But what about the OS cache? Linux will read up to the next 128KB of a file if it's contiguous on disk, which is likely with modern filesystems. It's likely to be much fairer than any way we can come up with to share memory.

We were discussing how much RAM is needed, not how much the backend allocates itself. So if the backend needs to duplicate some of the OS cache, that will only add to the memory requirement. The most likely scenario is that the backend additionally holds one page per subfile.

> Question is, do we want our algorithm to rely on that caching?

Currently we do, and I don't think that is so bad actually. The only optimization I would consider is adding a sequential access hint to the "tape" file :-) open.

Andreas
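For what it's worth, the "sequential access hint" Andreas jokes about does exist on POSIX systems as posix_fadvise. A minimal sketch (the temp-file name is made up, and whether the hint actually helps the merge access pattern is an open question, not something this thread establishes):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Advise the kernel that a tape/temp file will be read sequentially,
 * so its readahead can be more aggressive.
 */
int main(void)
{
    int fd = open("pgsql_tmp_tape", O_RDONLY);  /* hypothetical file name */
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise failed: error %d\n", rc);

    /* ... read the file sequentially here ... */

    close(fd);
    return 0;
}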
Re: [HACKERS] problem with large maintenance_work_mem settings and
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> LOG: begin index sort: unique = f, workMem = 8024000, randomAccess = f
> LOG: switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec elapsed 32.43 sec
> LOG: finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed 3814.82 sec
> LOG: performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec
> LOG: finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec elapsed 7827.25 sec
> LOG: performsort done (except 2-way final merge): CPU 348.25s/7132.99u sec elapsed 7846.47 sec
>
> after that the postmaster is now consuming 99% CPU for about 22 hours(!)

I'll look into it, but I was already wondering if we shouldn't bound the number of tapes somehow. It's a bit hard to believe that 28000 tapes is a sane setting.

regards, tom lane
Re: [HACKERS] problem with large maintenance_work_mem settings and
Tom Lane wrote:
> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>> LOG: begin index sort: unique = f, workMem = 8024000, randomAccess = f
>> LOG: switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec elapsed 32.43 sec
>> LOG: finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed 3814.82 sec
>> LOG: performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec
>> LOG: finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec elapsed 7827.25 sec
>> LOG: performsort done (except 2-way final merge): CPU 348.25s/7132.99u sec elapsed 7846.47 sec
>>
>> after that the postmaster is now consuming 99% CPU for about 22 hours(!)
>
> I'll look into it, but I was already wondering if we shouldn't bound the number of tapes somehow. It's a bit hard to believe that 28000 tapes is a sane setting.

heh - I don't think it is a sane setting either (and it doesn't look like pg is using more than 2GB anyway). If this testing helps with defining appropriate upper bounds to prevent bad behaviour like this (not responding to signals any more and eating CPU like mad) I'm more than happy. And the ltsReleaseBlock fix already reduced dump/restore times for one of our production databases by about 15%, which is quite an impressive improvement on its own ;-)

Stefan
Re: [HACKERS] problem with large maintenance_work_mem settings and
On Fri, 2006-03-10 at 09:31 -0500, Tom Lane wrote:
> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>> LOG: begin index sort: unique = f, workMem = 8024000, randomAccess = f
>> LOG: switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec elapsed 32.43 sec
>> LOG: finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed 3814.82 sec
>> LOG: performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec
>> LOG: finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec elapsed 7827.25 sec
>> LOG: performsort done (except 2-way final merge): CPU 348.25s/7132.99u sec elapsed 7846.47 sec
>>
>> after that the postmaster is now consuming 99% CPU for about 22 hours(!)
>
> I'll look into it, but I was already wondering if we shouldn't bound the number of tapes somehow. It's a bit hard to believe that 28000 tapes is a sane setting.

Well, since they are not actually tapes, why not?

I thought you had changed the memory settings so that the 28658 was a maximum, not the actual number it used. In the above example we are only using 2 runs (tapes)... so it should just read in run 1 and then run 2 into memory and complete the final merge.

Seems reasonable to limit tapes to say 10,000. 28000 tapes allows you to sort 448 TB without any merging...

Best Regards, Simon Riggs
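A quick sanity check of Simon's 448 TB figure, under the common assumption that replacement selection produces initial runs of roughly twice the sort memory (the workMem value is taken from the log excerpt above; the 2x factor is an assumption, not something stated in the thread):

#include <stdio.h>

/* With workMem of ~8 GB (8024000 kB in the log) and runs of roughly
 * 2 x workMem, 28000 one-run tapes can be merged in a single pass
 * covering on the order of 28000 * 16 GB of input.
 */
int main(void)
{
    double workmem_gb = 8024000.0 / 1024.0 / 1024.0;  /* ~7.65 GB          */
    double run_gb     = 2.0 * workmem_gb;             /* ~15.3 GB per run  */
    double tapes      = 28000.0;

    printf("single-pass capacity: ~%.0f TB\n", tapes * run_gb / 1024.0);
    return 0;
}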
Re: [HACKERS] problem with large maintenance_work_mem settings and
Simon Riggs [EMAIL PROTECTED] writes:
> On Fri, 2006-03-10 at 09:31 -0500, Tom Lane wrote:
>> I'll look into it, but I was already wondering if we shouldn't bound the number of tapes somehow. It's a bit hard to believe that 28000 tapes is a sane setting.
>
> I thought you had changed the memory settings so that the 28658 was a maximum, not the actual number it used.

Well, it does set up the control structures with 28658 entries, but the I/O buffers and so on are not allocated unless used (in this instance only two will get used). logtape.c itself does not look like it has any serious problem with too many tapes, but maybe tuplesort.c does. Or the problem Stefan has stumbled across might be unrelated to the number of tapes, anyway --- we still need to dig.

> Seems reasonable to limit tapes to say 10,000. 28000 tapes allows you to sort 448 TB without any merging...

Yeah, I was thinking MAXTAPES = 10000 might not be a bad idea.

regards, tom lane
Re: [HACKERS] problem with large maintenance_work_mem settings and
>> I'll look into it, but I was already wondering if we shouldn't bound the number of tapes somehow. It's a bit hard to believe that 28000 tapes is a sane setting.
>
> Well, since they are not actually tapes, why not?

I wonder what the OS does when we repeatedly open and close those files because we are short on file descriptors? Will it replace cached pages of a file that we have closed *more* aggressively? Maybe we should limit the files to how many files we would actually be able to hold open in parallel? Or keep more than one tape in one file and remember a start offset into the file per tape.

Andreas
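A minimal sketch of the "many tapes in one file" idea Andreas floats here (names and layout entirely hypothetical; as Tom notes in the next message, logtape.c already multiplexes its tapes over a single file set, so this is only an illustration of the suggestion):

#include <stdio.h>

/* Hypothetical layout for keeping several logical "tapes" inside one
 * physical file: each tape remembers where its data starts and how far
 * the reader has advanced, so only one file descriptor is needed.
 */
typedef struct LogicalTape
{
    long start_offset;   /* byte offset of this tape's first block */
    long read_pos;       /* current read position within the tape  */
    long nbytes;         /* total bytes written to this tape       */
} LogicalTape;

typedef struct TapeSet
{
    FILE        *fp;     /* the single underlying file        */
    int          ntapes; /* number of logical tapes inside it */
    LogicalTape *tapes;
} TapeSet;

/* Reading from tape i is then just a seek to that tape's private position. */
static size_t
tape_read(TapeSet *ts, int i, void *buf, size_t len)
{
    LogicalTape *t = &ts->tapes[i];

    fseek(ts->fp, t->start_offset + t->read_pos, SEEK_SET);
    len = fread(buf, 1, len, ts->fp);
    t->read_pos += (long) len;
    return len;
}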
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include
On 3/10/2006 10:53 AM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> On 3/8/2006 5:31 PM, Bruce Momjian wrote:
>>> Alvaro Herrera wrote:
>>>> Bruce Momjian wrote:
>>>>> Log Message: --- Remove Christof Petig copyright on include file, per author request.
>>>>
>>>> Huh, I thought what he actually told was that the file was released under BSD license. Maybe I missed it, but I didn't see him asking to remove the copyright. We certainly have copyrights attributed to individual people. Jan Wieck has his name on the PL/Tcl and PL/pgSQL files, for example.
>>>
>>> We should not have individual copyrights to individuals in our source tree. If Jan's is in there, it should be removed too (with his approval). The only copyright holder should be PostgreSQL Global Development Group. Jan, would you fix that?
>>
>> We've been there before, and not only once. I wonder why we still find more of these, because every time it comes up I agree to removal of it.
>
> Perhaps your name reproduces somehow. :-)

Scary to see it plastered all over the place.

Jan

--
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
# [EMAIL PROTECTED] #
Re: [HACKERS] problem with large maintenance_work_mem settings and
Zeugswetter Andreas DCP SD [EMAIL PROTECTED] writes:
>>> I'll look into it, but I was already wondering if we shouldn't bound the number of tapes somehow. It's a bit hard to believe that 28000 tapes is a sane setting.
>>
>> Well, since they are not actually tapes, why not?
>
> I wonder what the OS does when we repeatedly open and close those files because we are short on file descriptors?

At the moment, nothing, because all the tapes are just I/O buffers on the same OS-level file (or more accurately, one file per gigabyte of data). If we get rid of logtape.c as Luke wants to do, then we might have some issues here.

regards, tom lane
[HACKERS] random observations while testing with a 1,8B row table
Hi all!

During my testing of large work_mem and maintenance_work_mem settings wrt CREATE INDEX and sorting, I encountered a number of things when doing various operations on such a large table (about 106GB on disk with no dead tuples). I will summarize some of them just in case somebody is interested:

- the table used has 5 integer columns, non-indexed during the loads
- hardware is a Dual Opteron 280 with 4 cores@2,4GHz and 16GB RAM; data is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL and data.

1. data loading - I'm using COPY with batches of 300M rows

*) with one COPY running it takes about 20 minutes/batch to load the data (~250k rows/sec) and virtually no context switches.
*) with two COPYs running concurrently it takes a bit less than 30 minutes/batch and a steady 40k context switches/sec (~340k rows/sec overall)
*) with three COPYs it takes about 40 min/batch at 140k context switches/sec (380k rows/sec overall)

the profiles for those runs look very similar to:

samples   %        symbol name
5065118   20.9607  XLogInsert
3496868   14.4709  DoCopy
2807313   11.6174  CopyReadLine
1373621    5.6844  PageAddItem
1227069    5.0779  heap_formtuple
1193319    4.9383  LWLockAcquire
894243     3.7006  hash_search
717427     2.9689  LWLockRelease
699359     2.8941  pg_atoi
691385     2.8611  FunctionCall3
640383     2.6501  heap_insert
579331     2.3974  int4in
411286     1.7020  AllocSetReset
376452     1.5579  hash_any
349220     1.4452  RelationGetBufferForTuple
261568     1.0824  AllocSetAlloc
257511     1.0656  ReadBuffer

while the amount of IO going on is quite a lot, it looks like we are still mostly CPU-bound for COPY.

2. updating all of the rows in the table:

I updated all of the rows in the table with a simple UPDATE testtable SET a=a+1; this took about 2,5 hours (~200k rows/sec) with a profile looking like:

samples   %        symbol name
27860285  26.5844  XLogInsert
4828077    4.6070  PageAddItem
4490535    4.2849  heap_update
4267647    4.0722  slot_deform_tuple
3996750    3.8137  LWLockAcquire
3716184    3.5460  slot_getattr
3454679    3.2965  hash_search
2998742    2.8614  hash_any
2909261    2.7760  heap_fill_tuple
2825256    2.6959  LWLockRelease
2283086    2.1785  LockBuffer
2135048    2.0373  ExecTargetList
1636017    1.5611  ExecEvalVar
1632377    1.5576  UnpinBuffer
1566087    1.4944  RelationGetBufferForTuple
1561378    1.4899  ExecMakeFunctionResultNoSets
1511366    1.4421  ReadBuffer
1381614    1.3183  heap_compute_data_size

3. vacuuming this table - it turned out that VACUUM FULL is completely unusable on a table of this size (which I actually expected before), not only due to the locking involved but rather due to a gigantic memory requirement and unbelievable slowness. It seems that the heap-scan part of VACUUM FULL completed after about 2 hours, ending up with a postmaster having a resident size of about 8,5GB(!!!) with maintenance_work_mem set to 1GB.
The profile for this stage looks like:

samples   %        symbol name
941058    26.0131  scan_heap
35        12.2852  HeapTupleSatisfiesVacuum
242117     6.6927  TransactionIdIsInProgress
220044     6.0825  _mdfd_getseg
212571     5.8760  hash_search
186963     5.1681  TransactionIdPrecedes
176016     4.8655  SetBufferCommitInfoNeedsSave
137668     3.8055  TransactionIdDidCommit
137068     3.7889  PageRepairFragmentation
111474     3.0814  TransactionLogFetch
103814     2.8697  LWLockAcquire
102925     2.8451  LWLockRelease
102456     2.8321  hash_any
67199      1.8575  BufferAlloc

after that the postmaster started slowly consuming more and more memory, doing virtually no IO and eating CPU like mad with a profile similar to:

samples     %        symbol name
2708391248  94.1869  repair_frag
155395833    5.4040  enough_space
5707137      0.1985  XLogInsert
1410703      0.0491  PageAddItem
691616       0.0241  BgBufferSync

I actually ended up canceling the VACUUM FULL after about 50 hours of runtime with a resident size of ~11,5GB.

Stefan
Re: [HACKERS] PostgreSQL Anniversary Summit, Call for Contributions
Alvaro,

> Thanks for the pointers. I'm looking forward to finding somebody who wants to sponsor me on this issue ... or maybe get me a passport from the Holy See.
>
> The other pointy bit is that the letter of invitation needs to indicate the inviter's relationship to the person being invited. I expect that would need to be a tad more specific than merely he's some guy from Sweden that I heard about on the Internet...
>
> Rats :-(

This isn't going to be a problem for you. We'll put your stuff in the works; Andrew or Neil will have to invite you.

--
--Josh
Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] random observations while testing with a 1,8B row
Stefan,

On 3/10/06 9:40 AM, Stefan Kaltenbrunner [EMAIL PROTECTED] wrote:
> I will summarize some of them just in case somebody is interested:

I am!

> - the table used has 5 integer columns, non-indexed during the loads
> - hardware is a Dual Opteron 280 with 4 cores@2,4GHz and 16GB RAM; data is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL and data.

How many connections out of the machine? How many disks behind each LUN?

So - about 20 bytes per row (5*4) unless those are int8, but on disk it's 108GB/1.8B = 60 bytes per row on disk. I wonder what all that overhead is?

> 1. data loading - I'm using COPY with batches of 300M rows
> *) with one COPY running it takes about 20 minutes/batch to load the data (~250k rows/sec) and virtually no context switches.
> *) with two COPYs running concurrently it takes a bit less than 30 minutes/batch and a steady 40k context switches/sec (~340k rows/sec overall)
> *) with three COPYs it takes about 40 min/batch at 140k context switches/sec (380k rows/sec overall)

So, from 15 MB/s up to about 20 MB/s.

> while the amount of IO going on is quite a lot, it looks like we are still mostly CPU-bound for COPY.

It's what we see almost always. In this case, if your I/O configuration is capable of performing at about 3x the 20MB/s max parsing rate, or 60MB/s, you will be CPU limited. The 3x is approximate and based on observations; the reasoning underneath it is that Postgres is writing the data several times, once to the WAL, then from the WAL to the heap files.

> 2. updating all of the rows in the table: I updated all of the rows in the table with a simple UPDATE testtable SET a=a+1; this took about 2,5 hours (~200k rows/sec)

Ugh. This is where Bizgres MPP shines; I'll try to recreate your test and post results. This scales linearly in Bizgres MPP with the number of disks and CPUs available, but I would hope for much more than that.

> 3. vacuuming this table - it turned out that VACUUM FULL is completely unusable on a table of this size (which I actually expected before), not only due to the locking involved but rather due to a gigantic memory requirement and unbelievable slowness.

Simple vacuum should be enough IMO.

- Luke
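A rough accounting for the ~60 bytes/row that Luke wonders about (the figures below are my own back-of-the-envelope assumptions about the on-disk layout of that era, not numbers taken from the thread):

#include <stdio.h>

/* Back-of-the-envelope per-row footprint for a 5-int4 heap row,
 * using approximate overheads (assumptions, not measurements):
 * a ~27-byte heap tuple header MAXALIGNed together with the data,
 * plus a 4-byte line pointer in the page header.  Page headers and
 * unusable free space per page add the rest of the gap.
 */
int main(void)
{
    int data    = 5 * 4;                          /* five int4 columns     */
    int header  = 27;                             /* tuple header, approx. */
    int aligned = ((header + data + 7) / 8) * 8;  /* MAXALIGN to 8 bytes   */
    int itemid  = 4;                              /* line pointer          */

    printf("~%d bytes/row before page-level overhead (observed: ~60)\n",
           aligned + itemid);
    return 0;
}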
Re: [HACKERS] [COMMITTERS] pgsql: Remove Christof Petig copyright on include file,
On Friday 10 March 2006 07:53, Bruce Momjian wrote:
> Jan Wieck wrote:
>> On 3/8/2006 5:31 PM, Bruce Momjian wrote:
>>> Alvaro Herrera wrote:
>>>> Bruce Momjian wrote:
>>>>> Log Message: --- Remove Christof Petig copyright on include file, per author request.
>>>>
>>>> Huh, I thought what he actually told was that the file was released under BSD license. Maybe I missed it, but I didn't see him asking to remove the copyright. We certainly have copyrights attributed to individual people. Jan Wieck has his name on the PL/Tcl and PL/pgSQL files, for example.
>>>
>>> We should not have individual copyrights to individuals in our source tree. If Jan's is in there, it should be removed too (with his approval). The only copyright holder should be PostgreSQL Global Development Group. Jan, would you fix that?
>>
>> We've been there before, and not only once. I wonder why we still find more of these, because every time it comes up I agree to removal of it.
>
> Perhaps your name reproduces somehow. :-)

That pesky replication daemon... ;-)

--
Darcy Buskermolen Wavefire Technologies Corp. http://www.wavefire.com ph: 250.717.0200 fx: 250.763.1759
Re: [HACKERS] PostgreSQL Anniversary Summit, Call for Contributions
Just curious, but how do ppl come to Canada as tourists from other countries? I don't imagine they need to be invited by a Canadian, do they?

On Fri, 10 Mar 2006, Josh Berkus wrote:
> Alvaro,
>
> Thanks for the pointers. I'm looking forward to finding somebody who wants to sponsor me on this issue ... or maybe get me a passport from the Holy See.
>
> The other pointy bit is that the letter of invitation needs to indicate the inviter's relationship to the person being invited. I expect that would need to be a tad more specific than merely he's some guy from Sweden that I heard about on the Internet...
>
> Rats :-(
>
> This isn't going to be a problem for you. We'll put your stuff in the works; Andrew or Neil will have to invite you.
>
> --
> --Josh
> Josh Berkus Aglio Database Solutions San Francisco

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664
Re: [HACKERS] random observations while testing with a 1,8B row table
Luke Lonergan wrote:
> Stefan,
>
> On 3/10/06 9:40 AM, Stefan Kaltenbrunner [EMAIL PROTECTED] wrote:
>> I will summarize some of them just in case somebody is interested:
>
> I am!

heh - not surprised :-)

>> - the table used has 5 integer columns, non-indexed during the loads
>> - hardware is a Dual Opteron 280 with 4 cores@2,4GHz and 16GB RAM; data is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL and data.
>
> How many connections out of the machine? How many disks behind each LUN?

2 HBAs in the server, 2x2 possible paths to each LUN. 6 disks for the WAL and 12 disks for the data.

> So - about 20 bytes per row (5*4) unless those are int8, but on disk it's 108GB/1.8B = 60 bytes per row on disk. I wonder what all that overhead is?
>
>> 1. data loading - I'm using COPY with batches of 300M rows
>> *) with one COPY running it takes about 20 minutes/batch to load the data (~250k rows/sec) and virtually no context switches.
>> *) with two COPYs running concurrently it takes a bit less than 30 minutes/batch and a steady 40k context switches/sec (~340k rows/sec overall)
>> *) with three COPYs it takes about 40 min/batch at 140k context switches/sec (380k rows/sec overall)
>
> So, from 15 MB/s up to about 20 MB/s.
>
>> while the amount of IO going on is quite a lot, it looks like we are still mostly CPU-bound for COPY.
>
> It's what we see almost always. In this case, if your I/O configuration is capable of performing at about 3x the 20MB/s max parsing rate, or 60MB/s, you will be CPU limited.

the IO-System I use should be capable of doing that if pushed hard enough :-)

> The 3x is approximate and based on observations; the reasoning underneath it is that Postgres is writing the data several times, once to the WAL, then from the WAL to the heap files.
>
>> 2. updating all of the rows in the table: I updated all of the rows in the table with a simple UPDATE testtable SET a=a+1; this took about 2,5 hours (~200k rows/sec)
>
> Ugh. This is where Bizgres MPP shines; I'll try to recreate your test and post results. This scales linearly in Bizgres MPP with the number of disks and CPUs available, but I would hope for much more than that.

interesting to know, but still I'm testing/playing with postgresql here, not Bizgres MPP ...

>> 3. vacuuming this table - it turned out that VACUUM FULL is completely unusable on a table of this size (which I actually expected before), not only due to the locking involved but rather due to a gigantic memory requirement and unbelievable slowness.
>
> Simple vacuum should be enough IMO.

sure, that was mostly meant as an experiment; if I had to do this on a production database I would most likely use CLUSTER to get the desired effect (which in my case was purely getting back the diskspace wasted by dead tuples)

Stefan
Re: [HACKERS] random observations while testing with a 1,8B row table
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>> 3. vacuuming this table - it turned out that VACUUM FULL is completely unusable on a table of this size (which I actually expected before), not only due to the locking involved but rather due to a gigantic memory requirement and unbelievable slowness.
>
> sure, that was mostly meant as an experiment; if I had to do this on a production database I would most likely use CLUSTER to get the desired effect (which in my case was purely getting back the diskspace wasted by dead tuples)

Yeah, the VACUUM FULL algorithm is really designed for situations where just a fraction of the rows have to be moved to re-compact the table. It might be interesting to teach it to abandon that plan and go to a CLUSTER-like table rewrite once the percentage of dead space is seen to reach some suitable level. CLUSTER has its own disadvantages though (2X peak disk space usage, doesn't work on core catalogs, etc).

regards, tom lane
Re: [HACKERS] random observations while testing with a 1,8B row
Stefan,

On 3/10/06 11:48 AM, Stefan Kaltenbrunner [EMAIL PROTECTED] wrote:
> 2 HBAs in the server, 2x2 possible paths to each LUN. 6 disks for the WAL and 12 disks for the data.

So - you have 18 disks worth of potential bandwidth, not factoring loss due to RAID. That's roughly 18 * 60 = 1,080 MB/s. If we organized that into four banks, one for each CPU, made each one RAID5, and left two disks for spares, you'd have 12 disks working for you at 720MB/s, which is possibly double the number of active FC channels you have, unless they are all active, in which case you have a nicely matched 800MB/s of FC.

>> So, from 15 MB/s up to about 20 MB/s.

Gee - seems a long distance from 700 MB/s potential :-)

> the IO-System I use should be capable of doing that if pushed hard enough :-)

I would expect some 10x this if configured well.

> interesting to know, but still I'm testing/playing with postgresql here, not Bizgres MPP ...

Sure. Still, what I'd expect is something like 10x this update rate using the parallelism buried in your hardware. If you configure the same machine with 4 Bizgres MPP segments running on 4 LUNs I think you'd be shocked at the speedups.

- Luke
Re: [HACKERS] problem with large maintenance_work_mem settings and
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> samples    %        symbol name
> 350318533  98.8618  mergepreread
> 971822      0.2743  tuplesort_gettuple_common
> 413674      0.1167  tuplesort_heap_siftup

I don't have enough memory to really reproduce this, but I've come close enough that I believe I see what's happening. It's an artifact of the code I added recently to prevent the SortTuple array from growing during the merge phase, specifically the mergeslotsfree logic. You can get into a state where mergeslotsfree is at the lower limit of what the code will allow, and then if there's a run of tuples that should come from a single tape, mergepreread ends up sucking just one tuple per call from that tape --- and with the outer loop over 28000 tapes that aren't doing anything, each call is pretty expensive.

I had mistakenly assumed that the mergeslotsfree limit would be a seldom-hit corner case, but it seems it's not so hard to get into that mode after all. The code really needs to do a better job of sharing the available array slots among the tapes. Probably the right answer is to allocate so many free array slots to each tape, similar to the per-tape limit on memory usage --- I had thought that the memory limit would cover matters but it doesn't.

Another thing I am wondering about is the code's habit of prereading from all tapes when one goes empty. This is clearly pretty pointless in the final-merge-pass case: we might as well just reload from the one that went empty, and not bother scanning the rest. However, in the scenario where we are rewriting the data to tape, I think we still need the preread-from-all behavior in order to keep things efficient in logtape.c. logtape likes it if you alternate a lot of reads with a lot of writes, so once you've started reading you really want to refill memory completely.

It might also be worth remembering the index of the last active tape so that we don't iterate over thousands of uninteresting tapes in mergepreread.

regards, tom lane
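A schematic sketch of the per-tape slot budgeting Tom describes (hypothetical structure and names, not the actual tuplesort.c code or the committed fix; it only illustrates the shape of the idea):

/* Hypothetical sketch: give each tape its own share of the SortTuple
 * array, analogous to the existing per-tape memory limit, so one tape
 * can never starve the others down to a one-tuple-per-call preread.
 */
typedef struct TapeSlots
{
    int allowed;   /* array slots this tape may occupy       */
    int used;      /* slots currently holding preread tuples */
} TapeSlots;

static int
tape_can_preread(const TapeSlots *t)
{
    return t->used < t->allowed;
}

/* In the final merge, when tape i runs dry we refill only tape i,
 * instead of scanning all ~28000 tapes on every call.
 */
static void
refill_tape(TapeSlots *tapes, int i)
{
    while (tape_can_preread(&tapes[i]))
    {
        /* read the next tuple from logical tape i into a free slot;
         * a real implementation would also stop when the tape is
         * exhausted (omitted here)
         */
        tapes[i].used++;
    }
}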
Re: [HACKERS] random observations while testing with a 1,8B row table
On Mar 10, 2006, at 11:54 AM, Tom Lane wrote:
> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>>> 3. vacuuming this table - it turned out that VACUUM FULL is completely unusable on a table of this size (which I actually expected before), not only due to the locking involved but rather due to a gigantic memory requirement and unbelievable slowness.
>>
>> sure, that was mostly meant as an experiment; if I had to do this on a production database I would most likely use CLUSTER to get the desired effect (which in my case was purely getting back the diskspace wasted by dead tuples)
>
> Yeah, the VACUUM FULL algorithm is really designed for situations where just a fraction of the rows have to be moved to re-compact the table. It might be interesting to teach it to abandon that plan and go to a CLUSTER-like table rewrite once the percentage of dead space is seen to reach some suitable level. CLUSTER has its own disadvantages though (2X peak disk space usage, doesn't work on core catalogs, etc).

I get bitten by this quite often (customer machines, one giant table, purge out a lot of old data). CLUSTER is great for that, given the headroom, though I've often resorted to a dump and restore because I've not had the headroom for CLUSTER, and it's a lot less downtime than a full vacuum.

While the right fix there is to redo the application engine side to use table partitioning, I keep wondering whether it would be possible to move rows near the end of the table to the beginning in one, non-locking phase (vacuum to populate the FSM with free space near the beginning of the table, touch rows starting at the end of the table, repeat) and then finish off with a vacuum full to tidy up the remainder and truncate the files (or a simpler lock-the-table-and-truncate-anything-unused-at-the-end).

Cheers, Steve
Re: [HACKERS] random observations while testing with a 1,8B row table
Luke Lonergan wrote:
> Stefan,
>
> On 3/10/06 11:48 AM, Stefan Kaltenbrunner [EMAIL PROTECTED] wrote:
>> 2 HBAs in the server, 2x2 possible paths to each LUN. 6 disks for the WAL and 12 disks for the data.
>
> So - you have 18 disks worth of potential bandwidth, not factoring loss due to RAID. That's roughly 18 * 60 = 1,080 MB/s. If we organized that into four banks, one for each CPU, made each one RAID5, and left two disks for spares, you'd have 12 disks working for you at 720MB/s, which is possibly double the number of active FC channels you have, unless they are all active, in which case you have a nicely matched 800MB/s of FC.

wrong (or rather extremely optimistic) - the array itself only has two (redundant) FC loops (@2Gb) to the attached expansion chassis. The array has 2 active/active controllers (with a failover penalty) with two host interfaces each; furthermore, it has write-cache mirroring (to the standby controller) enabled, which means the traffic has to go over the internal FC loop too. Besides that, the host (as I said) itself only has two HBAs @2Gb which are configured for failover, which limits the host's maximum available bandwidth to less than 200MB/s per LUN.

>>> So, from 15 MB/s up to about 20 MB/s.
>
> Gee - seems a long distance from 700 MB/s potential :-)

well, the array is capable of about 110MB/s write per controller head (a bit more than half the possible rate, due to write mirroring being enabled, which uses delta-synchronisation). WAL and data are on different controllers though by default.

>> the IO-System I use should be capable of doing that if pushed hard enough :-)
>
> I would expect some 10x this if configured well.

see above ...

>> interesting to know, but still I'm testing/playing with postgresql here, not Bizgres MPP ...
>
> Sure. Still, what I'd expect is something like 10x this update rate using the parallelism buried in your hardware. If you configure the same machine with 4 Bizgres MPP segments running on 4 LUNs I think you'd be shocked at the speedups.

that might be true, though it might sound a bit harsh: I really prefer to spend the small amount of spare time I have testing (and helping to improve if possible) postgresql rather than playing with a piece of commercial software I'm not going to use anyway ...

Stefan
Re: [HACKERS] Proposal for SYNONYMS
On Thu, 9 Mar 2006, Jonah H. Harris wrote:
> 2) For my comprehension, what's the difference between a SYNONYM and a single-object (possibly updatable) view?

Not a whole lot, actually. If we had updatable views, I'd suggest that people change their CREATE SYNONYM syntax to CREATE VIEW. One key difference would be that synonyms track schema updates to the referenced object, like adding a column, which a view would not.

Kris Jurka
Re: [HACKERS] random observations while testing with a 1,8B row
Stefan,

On 3/10/06 12:23 PM, Stefan Kaltenbrunner [EMAIL PROTECTED] wrote:
> wrong (or rather extremely optimistic) - the array itself only has two (redundant) FC loops (@2Gb) to the attached expansion chassis. The array has 2 active/active controllers (with a failover penalty) with two host interfaces each; furthermore, it has write-cache mirroring (to the standby controller) enabled, which means the traffic has to go over the internal FC loop too. Besides that, the host (as I said) itself only has two HBAs @2Gb which are configured for failover, which limits the host's maximum available bandwidth to less than 200MB/s per LUN.

Wow - the ickiness of SAN from a performance/value standpoint never ceases to astound me.

>> Gee - seems a long distance from 700 MB/s potential :-)
>
> well, the array is capable of about 110MB/s write per controller head (a bit more than half the possible rate, due to write mirroring being enabled, which uses delta-synchronisation). WAL and data are on different controllers though by default.

So - you're getting 20MB/s on loading from a potential of 200MB/s?

>> I would expect some 10x this if configured well.
>
> see above ...

OTOH - "configured well" could include taking the disks out of the smart (?) chassis, plugging them into a dumb chassis, and deploying 2 dual-channel U320 SCSI adapters - total cost of about $3,000.

> that might be true, though it might sound a bit harsh: I really prefer to spend the small amount of spare time I have testing (and helping to improve if possible) postgresql rather than playing with a piece of commercial software I'm not going to use anyway ...

No problem - that's our job anyway - to make the case for Postgres' use in typical large-scale use-cases like the one you describe.

- Luke
Re: [HACKERS] Enhanced containment selectivity function
Matteo Beccati wrote:
> Bruce Momjian ha scritto:
>> Your patch has been added to the PostgreSQL unapplied patches list at:
>> http://momjian.postgresql.org/cgi-bin/pgpatches
>> It will be applied as soon as one of the PostgreSQL committers reviews and approves it.
>
> Great. I would just like to remind you that Tom said:
>
>> I'd be willing to consider exporting those functions from selfuncs.c.
>
> so that the selectivity function could be moved to contrib/ltree, which is its natural place. It could also be noted that a similar feature could be useful outside ltree: I guess there are plenty of cases where scanning statistics would give a better result than using a constant selectivity.

Yes, I talked to Tom today about moving that stuff into /contrib. I will work on the patch to do that.

--
Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Proposal for SYNONYMS
Kris Jurka [EMAIL PROTECTED] writes:
> One key difference would be that synonyms track schema updates to the referenced object, like adding a column, which a view would not.

That raises a fairly interesting point, actually. What would you expect to happen here:

CREATE TABLE foo ...;
CREATE SYNONYM bar FOR foo;
CREATE VIEW v AS SELECT * FROM bar;
DROP SYNONYM bar;

With the implementations being proposed, v would effectively be stored as SELECT * FROM foo and thus would be unaffected by the DROP SYNONYM. Is that what people will expect? Is it what happens in Oracle?

regards, tom lane
Re: [HACKERS] Proposal for SYNONYMS
On fös, 2006-03-10 at 16:51 -0500, Tom Lane wrote:
> Kris Jurka [EMAIL PROTECTED] writes:
>> One key difference would be that synonyms track schema updates to the referenced object, like adding a column, which a view would not.
>
> That raises a fairly interesting point, actually. What would you expect to happen here:
>
> CREATE TABLE foo ...;
> CREATE SYNONYM bar FOR foo;
> CREATE VIEW v AS SELECT * FROM bar;
> DROP SYNONYM bar;
>
> With the implementations being proposed, v would effectively be stored as SELECT * FROM foo and thus would be unaffected by the DROP SYNONYM. Is that what people will expect? Is it what happens in Oracle?

At least on Oracle 8, you could create a synonym on a non-existing table, so if table FOO does not exist:

CREATE SYNONYM BAR FOR FOO;          -- no error
SELECT * FROM BAR;                   -- error: synonym translation is no longer valid
CREATE TABLE FOO (a varchar2(10));
INSERT INTO FOO VALUES ('a');
SELECT * FROM BAR;                   -- no error
CREATE VIEW X AS SELECT * FROM BAR;
SELECT * FROM X;                     -- no error
DROP SYNONYM BAR;                    -- no error
SELECT * FROM BAR;                   -- error

gnari
Re: [HACKERS] problem with large maintenance_work_mem settings and
I wrote:
> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>> samples    %        symbol name
>> 350318533  98.8618  mergepreread
>> 971822      0.2743  tuplesort_gettuple_common
>> 413674      0.1167  tuplesort_heap_siftup
>
> I don't have enough memory to really reproduce this, but I've come close enough that I believe I see what's happening. It's an artifact of the code I added recently to prevent the SortTuple array from growing during the merge phase, specifically the mergeslotsfree logic.

I've committed a fix for this; if you're still interested, please retest.

regards, tom lane
[HACKERS] There is a problem with the download site?
Hi,

I have been trying to download the Windows version for 3 hours and just get an error page, no matter whether I try the FTP browser, the FTP mirrors, or BitTorrent.

--
regards, Jaime Casanova

What they (MySQL) lose in usability, they gain back in benchmarks, and that's all that matters: getting the wrong answer really fast. Randal L. Schwartz
Re: [HACKERS] There is a problem with the download site?
fixing some typos, sorry it's late...

> I have been trying to download the Windows version for 3 hours and just get an error page, no matter whether I try the FTP browser, the FTP mirrors, or BitTorrent.

More on this: I was able to download from BitTorrent when I manually pasted the URL, removed the http://wwwmaster.postgresql.org/redir? part, and changed %2F to / in the rest of the URL, so it goes from:

http://bt.postgresql.org/binary%2Fv8.1.3%2Fwin32%2Fpostgresql-8.1.3-1.zip.torrent

to

http://bt.postgresql.org/binary/v8.1.3/win32/postgresql-8.1.3-1.zip.torrent

--
regards, Jaime Casanova

What they (MySQL) lose in usability, they gain back in benchmarks, and that's all that matters: getting the wrong answer really fast. Randal L. Schwartz