Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
Keep in mind that we support platforms without O_DSYNC. I am not sure whether there are any that don't have O_SYNC either, but I am fairly sure that we measured O_SYNC to be slower than fsync() on some platforms. This measurement is quite understandable, since the current software does 8k writes, and the OS only has a chance to write bigger blocks in the write+fsync case. In the O_SYNC case you need to group bigger blocks yourself (bigger blocks are essential for maximum I/O).

I am still convinced that writing bigger blocks would allow the fastest solution, but reading the recent posts, the solution might only be to change the current loop

    foreach dirty 8k WAL buffer
        write 8k

into one or two large write calls.

Andreas

---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [GENERAL] Large databases, performance
On 3 Oct 2002 at 8:54, Charles H. Woloszynski wrote:
> I'd be curious what happens when you submit more queries than you have
> processors (you had four concurrent queries and four CPUs), if you care
> to run any additional tests. Also, I'd report the query time in absolute
> (like you did) and also in time/number of concurrent queries. This will
> give you a sense of how the system is scaling as the workload increases.
> Personally I am more concerned about this aspect than the load time,
> since I am going to guess that this is where all the time is spent.

OK. I am back from my cave after some more tests are done. Here are the results. I am not repeating a large part of it, but answering your questions. Don't ask me how these numbers changed; I am not the person who conducts the tests, nor do I have access to the system. The rest (or most) of the things remain the same.

MySQL 3.23.52 with InnoDB transaction support:
    4 concurrent queries:  257.36 ms
    40 concurrent queries:  35.12 ms

PostgreSQL 7.2.2:
    4 concurrent queries:  257.43 ms
    40 concurrent queries:  41.16 ms

Though I can not report Oracle numbers, suffice it to say that they fall in between these two. Oracle seems to be a hell of a lot faster than mysql/postgresql at loading raw data, even when it's installed on reiserfs. We plan to run XFS tests later in the hope that that would improve mysql/postgresql load times.

In this run postgresql has a better load time than mysql/innodb (18270 sec v/s 17031 sec). Index creation times are faster as well (100 sec v/s 130 sec). Don't know what parameters were changed.

The only worry is database size. Postgresql is 111 GB v/s 87 GB for mysql. All numbers include indexes. This is really going to be a problem when things are deployed. Any idea how it can be taken down? WAL is out; it's not counted. Schema optimisation is a later issue; right now all three databases are using the same schema.

Will it help in this situation if I recompile postgresql with a block size of say 32K rather than the 8K default? Will it save some overhead and offer better performance in data load etc.?

Will keep you guys updated..

Regards,
Shridhar

---
Shridhar Daithankar
LIMS CPE Team Member, PSPL.
mailto:[EMAIL PROTECTED]
Phone: +91-20-5678900 Extn.270  Fax: +91-20-5678901
---
Re: [HACKERS] Analysis of ganged WAL writes
Tom Lane kirjutas E, 07.10.2002 kell 01:07:
> To test this, I made a modified version of pgbench in which each
> transaction consists of a simple
>     insert into table_NNN values(0);
> where each client thread has a separate insertion target table. This is
> about the simplest transaction I could think of that would generate a
> WAL record each time.
>
> Running this modified pgbench with postmaster parameters
>     postmaster -i -N 120 -B 1000 --wal_buffers=250
> and all other configuration settings at default, CVS tip code gives me
> a pretty consistent 115-118 transactions per second for anywhere from 1
> to 100 pgbench client threads. This is exactly what I expected, since
> the database (including WAL file) is on a 7200 RPM SCSI drive. The
> theoretical maximum rate of sync'd writes to the WAL file is therefore
> 120 per second (one per disk revolution), but we lose a little because
> once in awhile the disk has to seek to a data file.
>
> Inserting the above patch, and keeping all else the same, I get:
>
> $ mybench -c 1 -t 1 bench1
> number of clients: 1
> number of transactions per client: 1
> number of transactions actually processed: 1/1
> tps = 116.694205 (including connections establishing)
> tps = 116.722648 (excluding connections establishing)
>
> $ mybench -c 5 -t 2000 -S -n bench1
> number of clients: 5
> number of transactions per client: 2000
> number of transactions actually processed: 1/1
> tps = 282.808341 (including connections establishing)
> tps = 283.656898 (excluding connections establishing)

In an ideal world this would be 5*120=600 tps. Have you any good ideas what holds it back for the other 300 tps? If it has CPU utilisation of only 50% then there must still be some moderate lock contention.

btw, what is the number for 1-5-10 clients with fsync off?

> $ mybench -c 10 -t 1000 bench1
> number of clients: 10
> number of transactions per client: 1000
> number of transactions actually processed: 1/1
> tps = 443.131083 (including connections establishing)
> tps = 447.406534 (excluding connections establishing)
>
> CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients to
> 10% idle at 10 or more.
>
> So this does seem to be a nice win, and unless I hear objections I will
> apply it ...

3x speedup is not just nice, it's great ;)

--
Hannu
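For anyone wanting to reproduce Tom's test, the per-client setup he describes could look roughly like this (the table names and the two-client count are illustrative; the actual modified pgbench is not published here):

```sql
-- One trivially small target table per pgbench client thread:
CREATE TABLE table_001 (f int);
CREATE TABLE table_002 (f int);

-- The whole transaction each client repeats; the single-row insert
-- generates exactly one WAL record, so commit rate is bounded by the
-- rate of sync'd WAL writes:
BEGIN;
INSERT INTO table_001 VALUES (0);
COMMIT;
```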
Re: [HACKERS] cross-posts (was Re: [GENERAL] Large databases,
On Sun, 2002-10-06 at 22:20, Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > ... Avoiding cross-posting would be nice, since I am getting lots of
> > duplicate messages these days.
>
> Cross-posting is a fact of life, and in fact encouraged, on the pg
> lists. I suggest adapting. Try sending set all unique
> your-email-address to the PG majordomo server; this sets you up to get
> only one copy of each cross-posted message.

That doesn't seem to work any more:

    set all unique [EMAIL PROTECTED]
    The all mailing list is not supported at PostgreSQL User Support Lists.

What do I need to send now? Marc?

--
Larry Rosenman  http://www.lerctr.org/~ler
Phone: +1 972-414-9812  E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Re: [HACKERS] cross-posts (was Re: [GENERAL] Large databases,
Larry Rosenman wrote:
> On Sun, 2002-10-06 at 22:20, Tom Lane wrote:
> > Try sending set all unique your-email-address to the PG majordomo
> > server; this sets you up to get only one copy of each cross-posted
> > message.
>
> That doesn't seem to work any more:
>     set all unique [EMAIL PROTECTED]
>     The all mailing list is not supported at PostgreSQL User Support Lists.
> What do I need to send now? Marc?

It is:

    set ALL unique your-email

If you also don't want to get emails that have already been cc'd to you, you can use:

    set ALL eliminatecc your-email

For a full list of set options send:

    help set

to majordomo.

Regards,
Michael Paesold
Re: [HACKERS] cross-posts (was Re: [GENERAL] Large databases,
On Mon, 2002-10-07 at 07:01, Michael Paesold wrote:
> It is:
>     set ALL unique your-email
> If you also don't want to get emails that have already been cc'd to
> you, you can use:
>     set ALL eliminatecc your-email
> For a full list of set options send:
>     help set
> to majordomo.

Thanks. That worked great. (I use Mailman, and didn't realize the ALL needed to be capitalized.)

LER

--
Larry Rosenman  http://www.lerctr.org/~ler
Phone: +1 972-414-9812  E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Re: [HACKERS] Analysis of ganged WAL writes
Hannu Krosing [EMAIL PROTECTED] writes:
> In an ideal world this would be 5*120=600 tps. Have you any good ideas
> what holds it back for the other 300 tps?

Well, recall that the CPU usage was about 20% in the single-client test. (The reason I needed a variant version of pgbench is that this machine is too slow to do more than 120 TPC-B transactions per second anyway.) That says that the best possible throughput on this test scenario is 5 transactions per disk rotation --- the CPU is just not capable of doing more. I am actually getting about 4 xact/rotation for 10 or more clients (in fact it seems to reach that plateau at 8 clients, and be close to it at 7). I'm inclined to think that the fact that it's 4 not 5 is just a matter of not quite getting there --- there's some additional CPU overhead due to lock contention, etc, and any slowdown at all will cause it to miss making 5. The 20% CPU figure was approximate to begin with, anyway.

The other interesting question is why we're not able to saturate the machine with only 4 or 5 clients. I think pgbench itself is probably to blame for that: it can't keep all its backend threads constantly busy ... especially not when several of them report back transaction completion at essentially the same instant, as will happen under ganged-commit conditions. There will be intervals where multiple backends are waiting for pgbench to send a new command. That delay in starting a new command cycle is probably enough for them to miss the bus of getting included in the next commit write. That's just a guess though; I don't have tools that would let me see exactly what's happening. Anyone else want to reproduce the test on a different system and see what it does?

> If it has CPU utilisation of only 50% then there must still be some
> moderate lock contention.

No, that's I/O wait I think, forced by the quantization of the number of transactions that get committed per rotation.

> btw, what is the number for 1-5-10 clients with fsync off?

About 640 tps at 1 and 5, trailing off to 615 at 10, and down to 450 at 100 clients (now that must be lock contention...).

regards, tom lane
Re: [HACKERS] Threaded Sorting
Bingo = great :). The I/O problem seems to be solved :). A table space concept would be at the top of the hitlist :). The symlink version is not very comfortable, and I think it would be a real hack. Also: if we had a clean table space concept it would be a real advantage. In the first place it would be enough to define a directory (ALTER TABLESPACE, changing sizes etc. could be a lot of work).

What could CREATE TABLESPACE look like? Personally I like the Oracle syntax. Is it already time to work on the parser for CREATE/ALTER/DROP TABLESPACE?

Hans

Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
> > Bingo! Want to increase sorting performance, give it more I/O
> > bandwidth, and it will take 1/100th of the time to do threading.
> > Added to TODO:
> > * Allow sorting to use multiple work directories
>
> Yeah, I like that. Actually it should apply to all temp files, not only
> sorting. A crude hack would be to allow there to be multiple
> pg_temp_NNN/ subdirectories (read symlinks) in a database, and then the
> code would automatically switch among these. Probably a cleaner idea
> would be to somehow integrate this with tablespace management --- if
> you could mark some tablespaces as intended for temp stuff, the system
> could round-robin among those as it creates temp files and/or temp
> tables.
>
> regards, tom lane

--
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at
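Since the question of syntax is raised above: PostgreSQL has no tablespace support at this point, so any syntax is purely hypothetical. An Oracle-flavoured sketch of what the commands under discussion might look like (every name and keyword here is invented for illustration):

```sql
-- Hypothetical: bind a tablespace name to a directory on disk.
CREATE TABLESPACE fastdisk LOCATION '/mnt/raid0/pgdata';

-- Hypothetical: place a table into that tablespace at creation time.
CREATE TABLE big_log (id int, msg text) TABLESPACE fastdisk;
```

Whether quota clauses, default tablespaces per user, etc. belong in such a command is exactly the kind of thing that would need discussion on -hackers first.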
Re: [HACKERS] Threaded Sorting
Greg Copeland wrote:
> I wouldn't hold your breath for any form of threading. Since PostgreSQL
> is process based, you might consider having a pool of sort processes
> which address this, but I doubt you'll get anywhere talking about
> threads here.
>
> Greg

I came across the problem yesterday. We thought about SMP and did some tests on huge tables. The postmaster was running at full speed to get the stuff sorted - even on an IDE system. I asked my friends who are doing a lot of work with Oracle on huge SMP machines. I was told that Oracle has a mechanism which can run efficient sorts on SMP machines. It seems to speed up sorting a lot.

If we could reduce the time needed to build an index by 25% it would be a wonderful thing. Just think of a scenario:

    1 thread:     24 hours
    many threads: 18 hours

We could gain 6 hours, which is a LOT. We have many people running PostgreSQL on systems having wonderful I/O systems - in this case I/O is not the bottleneck any more. I had a brief look at the code used for sorting. It is very well documented, so maybe it is worth thinking about a parallel algorithm.

When talking about threads: a pool of processes for sorting? Maybe this could be useful, but I doubt it is the best solution to avoid overhead. Somewhere in the TODO it says that there will be experiments with a threaded backend. This makes me think that threads are not a big no-no.

Hans
Re: [HACKERS] Threaded Sorting
Threads are not the best solution when it comes to portability. I prefer a process model as well. My concern was that a process model might be a bit too slow for that, but if we had processes in memory this would be a wonderful thing. Using it for small amounts of data is pretty useless - I totally agree - but when it comes to huge amounts of data it can be useful. It is a mechanism for huge installations with a lot of data.

Hans

Greg Copeland wrote:
> That was a fork IIRC. Threading is not used in baseline PostgreSQL, nor
> are there any such plans that I'm aware of. People from time to time
> ask about threads for this or that and are always told what I'm telling
> you. The use of threads leads to portability issues, not to mention
> PostgreSQL is entirely built around the process model.
>
> Tom is right to dismiss the notion of adding additional CPUs to
> something that is already I/O bound; however, the concept itself should
> not be dismissed. Applying multiple CPUs to a sort operation is well
> accepted and understood technology. At this point, perhaps Tom or one
> of the other core developers having insight in this area would be
> willing to address how readily such a mechanism could be put in place.
>
> Also, don't be so fast to dismiss what the process model can do. There
> is no reason to believe that a process pool would not be able to
> perform wonderful things if implemented properly. Basically, the notion
> would be that the backend processing the query would solicit assistance
> from the sort pool if one or more processes were available. At that
> point, several methods could be employed to divide the work. Some form
> of threshold would also have to be created to prevent the pool from
> being used when a single backend is capable of addressing the need.
> Basically the idea is, you only have the pool assist with large tuple
> counts, and then only when resources are available from within the
> pool. By doing this, you avoid additional overhead for small sort
> efforts and gain when it matters the most.
>
> Regards, Greg
[HACKERS] Moving to PostGres
Hello,

I am looking at moving our company away from MS SQL. We have been looking at DB2 and it looks to have some good features. Now I am wondering if PostgreSQL could be a viable alternative. I have a few questions though:

1. What is the PostgreSQL equivalent to stored procedures, and can they be written in another language such as Java?
2. How well is Java supported for developing DB applications using PG?
3. What are the limitations of PG compared to DB2, Oracle, Sybase?

Thanks
Ben
Re: [HACKERS] Moving to PostGres
In article XU5o9.11017$[EMAIL PROTECTED], Benjamin Stewart [EMAIL PROTECTED] wrote:
> I am looking at moving our company away from MS SQL.

Here's a good place to start:
http://techdocs.postgresql.org/redir.php?link=/techdocs/sqlserver2pgsql.php

--
http://www.spinics.net/linux/
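On question 1 above: PostgreSQL's nearest equivalent to stored procedures is user-defined functions, which can be written in SQL, PL/pgSQL, PL/Tcl, or C (server-side Java is not part of the standard distribution). A minimal PL/pgSQL sketch, with a made-up function name and tax rate (PL/pgSQL must first be enabled in the database with createlang):

```sql
-- PL/pgSQL of this era takes positional parameters ($1, $2, ...),
-- aliased to readable names in the DECLARE section:
CREATE FUNCTION add_tax(numeric) RETURNS numeric AS '
DECLARE
    amount ALIAS FOR $1;
BEGIN
    RETURN amount * 1.2;  -- illustrative 20% tax rate
END;
' LANGUAGE 'plpgsql';

-- Called like any other function:
SELECT add_tax(100);
```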
Re: [HACKERS] Bad rules
Thank you very much for your enlightened comment; it worked a treat. I do not seem to be able to find references to this kind of useful information in the PostgreSQL online manual or in books such as Bruce Momjian's 'PostgreSQL: Introduction and Concepts'. Where is this info to be found, other than the mailing list? Thanks again.

Regards
Steve

-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]]
Sent: 04 October 2002 15:48
To: Steve King
Cc: PostgreSQL-development
Subject: Re: [HACKERS] Bad rules

Steve King [EMAIL PROTECTED] writes:
> I am using postgres 7.2, and have a rule on a table which causes a
> notify if an insert/update/delete is performed on the table. The table
> is very very small. When performing a simple (very simple) update on
> the table this takes about 3 secs; when I remove the rule it is
> virtually instantaneous. The rest of the database seems to perform
> fine. Have you any ideas or come across this before?

Let's see the rule exactly? NOTIFY per se is not slow in my experience. (One thing to ask: have you done a VACUUM FULL on pg_listener in recent memory? Heavy use of LISTEN/NOTIFY does tend to bloat that table if you don't keep after it with VACUUM.)

regards, tom lane
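For the archives, the kind of rule under discussion, together with the maintenance Tom suggests, looks something like this (table and condition names are made up for illustration):

```sql
-- A rule that fires a NOTIFY whenever the table is updated;
-- similar rules can be created ON INSERT and ON DELETE:
CREATE RULE config_notify AS ON UPDATE TO config
    DO NOTIFY config_changed;

-- Heavy LISTEN/NOTIFY traffic bloats the pg_listener system table;
-- reclaim the dead space periodically:
VACUUM FULL pg_listener;
```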
[HACKERS] Implicit Lock Row
I am currently developing an interface to simulate indexed sequential file management with PostgreSQL, and I must reproduce the same philosophy of record locking. I am looking for a way to lock and unlock a row of a table implicitly. The locking of several rows, in the same table or in various tables, can last a long time; consequently the locking cannot be wrapped in a transaction, so as not to lock the whole table for other users.

Is there a viable solution for this with PostgreSQL? Is there an accessible basic locking mechanism?

Thank you.
Re: [HACKERS] Use of sync() [was Re: Potential Large Performance Gain in WAL synching]
Tom Lane [EMAIL PROTECTED] writes:
> Doug McNaught [EMAIL PROTECTED] writes:
> > In my understanding, it means all currently dirty blocks in the file
> > cache are queued to the disk driver. The queued writes will
> > eventually complete, but not necessarily before sync() returns. I
> > don't think subsequent write()s will block, unless the system is low
> > on buffers and has to wait until dirty blocks are freed by the
> > driver.
>
> We don't need later write()s to block. We only need them to not hit
> disk before the sync-queued writes hit disk. So I guess the question
> boils down to what queued to the disk driver means --- has the order
> of writes been determined at that point?

It's certainly possible that new write()s get put into the queue alongside old ones--I think the Linux block layer tries to do this when it can, for one. According to the manpage, Linux used to wait until everything was written before returning from sync(), though I don't *think* it does any more. But that's not mandated by the specs. So I don't think we can rely on such behavior (not reordering writes across a sync()), though it will probably happen in practice a lot of the time.

AFAIK there isn't anything better than sync() + sleep() as far as the specs go. Yes, it kinda sucks. ;)

-Doug
[HACKERS] Case insensitive columns
I have a problem similar to that described by Shaw Terwilliger on 2001-03-14 (Subject: Case Insensitive CHECK CONSTRAINTs): I need some case insensitive char/varchar columns. A unique index on lower(col_name) wo
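The message above is truncated, but the usual approach it starts to describe is a functional unique index on lower() of the column, plus lower() in comparisons so the index can be used (table and column names here are hypothetical):

```sql
-- Enforce case-insensitive uniqueness on the column:
CREATE UNIQUE INDEX users_name_lower_idx ON users (lower(name));

-- A case-insensitive lookup that can use that index:
SELECT * FROM users WHERE lower(name) = lower('Shaw');
```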
Re: [HACKERS] Threaded Sorting
Threads are bad - I know ... I like the idea of a pool of processes instead of threads - from my point of view this would be useful. I am planning to run some tests (GEQO, AIX, sorts) as soon as I have time to do so (still too much work ahead before :( ...). If I had time I'd love to do something for the PostgreSQL community :(.

As far as sorting is concerned: it would be fine if it were possible to define an alternative location for temporary sort files using SET. If you had multiple disks this would help in the case of concurrent sorts, because people could insert into and index many tables at once without having to access just one storage system. This would be an easy way out of the I/O limitation ... at least for some problems.

Hans

Bruce Momjian wrote:
> We haven't thought about it yet because there are too many buggy thread
> implementations. We are probably just now getting to a point where we
> can consider it. However, lots of databases have moved to threads for
> all sorts of things and ended up with a royal mess of code. Threads can
> only improve things in a few areas of the backend, so it would be nice
> if we could limit the exposure to threads to those areas; sorting could
> certainly be one of them, but frankly, I think disk I/O is our limiting
> factor there. I would be interested to see some tests that showed
> otherwise.
Table spaces again [was Re: [HACKERS] Threaded Sorting]
On 4 Oct 2002 at 21:13, Hans-Jürgen Schönig wrote:
> Bingo = great :). The I/O problem seems to be solved :). A table space
> concept would be at the top of the hitlist :). The symlink version is
> not very comfortable and I think it would be a real hack. Also: if we
> had a clean table space concept it would be a real advantage. In the
> first place it would be enough to define a directory (alter tablespace,
> changing sizes etc. could be a lot of work). What could CREATE
> TABLESPACE look like? Personally I like the Oracle syntax.

Well, I (hopefully) understand the need to get table spaces. But I absolutely hate the way Oracle does it. I am repeating all the points I posted before. There was no follow-up; I hope I get some on this.

1) It tries to be a volume, assuming the OS handles volumes inefficiently - the same mentality as handling all disk I/O by itself. Maybe that was worthwhile when Oracle did it, but is it worth it now?

2) It allows joining multiple volumes for performance reasons. If you want to join multiple volumes for performance, let RAID handle it. Is that the job of the RDBMS?

3) It puts multiple objects together. Why? I never fully understood having an opaque file sitting on a drive v/s a neatly laid out directory structure. I would always prefer the directory structure.

Can anybody please tell me in detail (not just by pointing towards TODO items):

1) What is a table space supposed to offer?
2) What does a directory structure not offer that a table space does?
3) How do they compare for advantages/disadvantages?

Oracle familiarity is out; that's not even close to being a good merit IMO. If postgresql moves to the oracle way of doing things ... well, I won't be as happy as I am now.

Thanks for your patience.

Bye
Shridhar

--
Newton's Little-Known Seventh Law: A bird in the hand is safer than one overhead.
Re: [HACKERS] Implicit Lock Row
On 5 Oct 2002 at 23:56, Antoine Lobato wrote:
> I am currently developing an interface to simulate indexed sequential
> file management with PostgreSQL, and I must reproduce the same
> philosophy of record locking. I am looking for a way to lock and unlock
> a row of a table implicitly. Is there a viable solution for this with
> PostgreSQL? Is there an accessible basic locking mechanism?

You can use select for update to lock rows.

HTH

Bye
Shridhar

--
Strategy: A long-range plan whose merit cannot be evaluated until sometime after those creating it have left the organization.
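Concretely, the suggestion looks like this (table and column names are made up). Note one caveat relevant to the original question: the row lock only lasts until the enclosing transaction commits or aborts, so it cannot outlive a transaction — but it locks only the selected rows, not the whole table:

```sql
BEGIN;
-- Lock just this row; other transactions can still read it, but
-- competing SELECT ... FOR UPDATE / UPDATE / DELETE on it will block:
SELECT * FROM records WHERE id = 42 FOR UPDATE;

-- ... work with the row ...
UPDATE records SET payload = 'new value' WHERE id = 42;
COMMIT;  -- releases the row lock
```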
Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]
> Can anybody please tell me in detail (not just by pointing towards TODO
> items):
>
> 1) What is a table space supposed to offer?

They allow you to define a maximum amount of storage for a certain set of data. They help you to define the location of data. They help you to define how much data can be used by which resource.

> 2) What does a directory structure not offer that a table space does?

You need to use the command line in order to manage quotas - you might not want that. Quotas are handled differently on every platform (if available). With tablespaces you can assign 30mb to user a, 120mb to user b etc. ... Table spaces are a nice abstraction layer on top of the file system.

> 3) How do they compare for advantages/disadvantages? Oracle familiarity
> is out; that's not even close to being a good merit IMO. If postgresql
> moves to the oracle way of doing things ... well, I won't be as happy
> as I am now. Thanks for your patience.

How would you handle table spaces? Just propose it to the hackers' list ... we should definitely discuss that ... a bad implementation of table spaces would be painful ...

Hans
Re: [HACKERS] [GENERAL] Large databases, performance
On Mon, 07 Oct 2002 15:07:29 +0530, Shridhar Daithankar [EMAIL PROTECTED] wrote:
> Only worry is database size. Postgresql is 111GB v/s 87 GB for mysql.
> All numbers include indexes. This is really going to be a problem when
> things are deployed. Any idea how can it be taken down?

Shridhar, if I'm not mistaken, a char(n)/varchar(n) column is stored as a 32-bit integer specifying the length, followed by as many characters as the length tells. On 32-bit Intel hardware this structure is aligned on a 4-byte boundary.

For your row layout this gives the following sizes (look at the "phys size" column):

    Field     Field      Null  Indexed  phys  mini
    Name      Type                      size
    --------  ---------  ----  -------  ----  ----
    type      int        no    no          4     4
    esn       char(10)   no    yes        16    11
    min       char(10)   no    yes        16    11
    datetime  timestamp  no    yes         8     8
    opc0      char(3)    no    no          8     4
    opc1      char(3)    no    no          8     4
    opc2      char(3)    no    no          8     4
    dpc0      char(3)    no    no          8     4
    dpc1      char(3)    no    no          8     4
    dpc2      char(3)    no    no          8     4
    npa       char(3)    no    no          8     4
    nxx       char(3)    no    no          8     4
    rest      char(4)    no    no          8     5
    field0    int        yes   no          4     4
    field1    char(4)    yes   no          8     5
    field2    int        yes   no          4     4
    field3    char(4)    yes   no          8     5
    field4    int        yes   no          4     4
    field5    char(4)    yes   no          8     5
    field6    int        yes   no          4     4
    field7    char(4)    yes   no          8     5
    field8    int        yes   no          4     4
    field9    char(4)    yes   no          8     5
                                         ----  ----
                                          176   116

Ignoring nulls for now, you have to add 32 bytes for a v7.2 heap tuple header and 4 bytes of ItemIdData per tuple, ending up with 212 bytes per tuple, or ca. 85 GB heap space for 432000000 tuples. Depending on fill factor, similar calculations give some 30 GB for your index.

Now if we had a datatype with only one byte for the string length, char columns could be byte aligned and we'd have the column sizes given under "mini" in the table above. The columns would have to be rearranged according to alignment requirements. Thus 60 bytes per heap tuple and 8 bytes per index tuple could be saved, resulting in a database size of ~ 85 GB (index included). And I bet this would be significantly faster, too.

Hackers, do you think it's possible to hack together a quick and dirty patch, so that string length is represented by one byte? IOW, can a database be built that doesn't contain any char/varchar/text value longer than 255 characters in the catalog? If I'm not told that this is impossible, I'd give it a try.

Shridhar, if such a patch can be made available, would you be willing to test it?

What can you do right now? Try using the v7.3 beta and creating your table WITHOUT OIDS. This saves 8 bytes per tuple; not much, but better to save 4% than nothing.

Servus
Manfred
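Manfred's last suggestion, spelled out (the table and column names are illustrative, echoing the layout discussed above):

```sql
-- v7.3: omit the per-row OID column, saving 8 bytes per tuple
-- (the OID itself plus alignment padding):
CREATE TABLE calls (
    type     int,
    esn      char(10),
    min      char(10),
    datetime timestamp
    -- ... remaining columns ...
) WITHOUT OIDS;
```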
Re: [HACKERS] pg_filedump
Alvaro Herrera [EMAIL PROTECTED] writes:
> I'm trying to get something from pg_filedump. However, the version
> published in sources.redhat.com/rhdb doesn't grok a lot of changes in
> current CVS. I changed all those and made it compile... but looks like
> that's only the easy part. I get bogus values everywhere (block sizes,
> item numbers, etc). Does somebody know whether it's maintained for
> current versions?

AFAIK, no one has yet updated it for 7.3's changes in tuple header format. That needs to get done sometime soon ...

regards, tom lane
Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]
2) What a directory structure does not offer that table space does? You need to the command line in order to manage quotas - you might not want that. Mount a directory on a partition. If the data exceeds on that partition, there would be disk error. Like tablespace getting overflown. I have seen both the scenarios in action.. Of course it can be done somehow. However, with tablespaces it is more db-like and you need not be familiar with the operating system itself. Just think of a company having several different operating systems (suns, linux, bsd, ...). what do you think could be done in this case? my answer would be an abstraction layer called table spaces ... Quotas are handled differently on ever platform (if available). Yeah. But that's sysadmins responsibility not DBA's. Maybe many people ARE the sysadmins of their PostgreSQL box ... When developing a database with an open mind people should try to see a problem from more than just one perspective. Why should anybody just forget about sysdbas??? With tablespaces you can assign 30mb to use a, 120mb to user b etc. ... Table spaces are a nice abstraction layer to the file system. Hmm.. And how does that fit in database metaphor? What practical use is that? I can't imagine as I am a developer and not a DBA. One of our customers did some minor hosting projects with PostgreSQL. That's what he wanted to have because it is a practical issue. a. you don't want to have more than one instance per machine. b. you want to assign a certain amount of space to a certain user without using quotas. just think of administration tools - tablespaces are as simple as a select. per directory is a first step - a good step and a good idea but tablespaces are a useful invention. just think of hosting companies, hybrid environments, etc ... tablespaces or not a devil and sysdbas may be developers ... 
Hans -- *Cybertec Geschwinde u Schoenig* Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria Tel: +43/1/913 68 09; +43/664/233 90 75 www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at
Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]
On 7 Oct 2002 at 16:49, Hans-Jürgen Schönig wrote: Mount a directory on a partition. If the data exceeds that partition, there would be a disk error, like a tablespace overflowing. I have seen both scenarios in action.. Of course it can be done somehow. However, with tablespaces it is more db-like and you need not be familiar with the operating system itself. Just think of a company having several different operating systems (Suns, Linux, BSD, ...). What do you think could be done in this case? My answer would be an abstraction layer called table spaces ... OK. Point noted. Suspended till next point. Quotas are handled differently on every platform (if available). Yeah. But that's the sysadmin's responsibility, not the DBA's. Maybe many people ARE the sysadmins of their PostgreSQL box ... When developing a database with an open mind people should try to see a problem from more than just one perspective. Why should anybody just forget about sysdbas??? If the DBA is the sysadmin, does it make a difference whether he learns about mount/ln or table spaces? Yes it does. Table spaces are limited to databases, but mount/ln is useful for any general-purpose sysadmin work. That answers the last point as well, I guess.. With tablespaces you can assign 30mb to user a, 120mb to user b etc. ... Table spaces are a nice abstraction layer to the file system. Hmm.. And how does that fit in the database metaphor? What practical use is that? I can't imagine, as I am a developer and not a DBA. One of our customers did some minor hosting projects with PostgreSQL. That's what he wanted to have because it is a practical issue. a. you don't want to have more than one instance per machine. b. you want to assign a certain amount of space to a certain user without using quotas. Just think of administration tools - tablespaces are as simple as a select. Agreed. Perfect point and I didn't think of it. But it can be done in a directory structure as well.
Of course it's quite a deviation from what one thinks of as a plain old directory structure. But if this is one point where table spaces win, let's borrow that. There is a lot of baggage in table spaces that can be left out.. Besides, AFAIU, tablespaces implement quotas using data files which are pre-allocated. Pre-claiming space/resources is the kind of thing the likes of Oracle do, and it runs in the exact opposite direction of the postgresql philosophy. If postgresql has to implement quotas on objects, it should do so without preclaiming space. Besides, if postgresql offers quotas on a per-object basis in the directory/object scheme, I am sure that's far more granular than tablespaces. Choice is good.. Per directory is a first step - a good step and a good idea - but tablespaces are a useful invention. Just think of hosting companies, hybrid environments, etc ... Tablespaces are not a devil, and sysdbas may be developers ... It's not about devils. It's about re-evaluating the need once again, especially at the level of the tablespace concept itself. Bye Shridhar -- Oblivion together does not frighten me, beloved.-- Thalassa (in Anne Mulhall's body), Return to Tomorrow, stardate 4770.3.
Re: [HACKERS] [pgsql-performance] [GENERAL] Large databases, performance
Shridhar Daithankar [EMAIL PROTECTED] writes: MySQL 3.23.52 with innodb transaction support: 4 concurrent queries :- 257.36 ms 40 concurrent queries :- 35.12 ms Postgresql 7.2.2 4 concurrent queries :- 257.43 ms 40 concurrent queries :- 41.16 ms I find this pretty fishy. The extreme similarity of the 4-client numbers seems improbable, from what I know of the two databases. I suspect your numbers are mostly measuring some non-database-related overhead --- communications overhead, maybe? Only worry is database size. Postgresql is 111GB v/s 87 GB for mysql. All numbers include indexes. This is really going to be a problem when things are deployed. Any idea how it can be taken down? 7.3 should be a little bit better because of Manfred's work on reducing tuple header size --- if you create your tables WITHOUT OIDS, you should save 8 bytes per row compared to earlier releases. regards, tom lane
Re: [HACKERS] [pgsql-performance] [GENERAL] Large databases, performance
On 7 Oct 2002 at 10:30, Tom Lane wrote: Shridhar Daithankar [EMAIL PROTECTED] writes: MySQL 3.23.52 with innodb transaction support: 4 concurrent queries :- 257.36 ms 40 concurrent queries :- 35.12 ms Postgresql 7.2.2 4 concurrent queries :- 257.43 ms 40 concurrent queries :- 41.16 ms I find this pretty fishy. The extreme similarity of the 4-client numbers seems improbable, from what I know of the two databases. I suspect your numbers are mostly measuring some non-database-related overhead --- communications overhead, maybe? I don't know, but all three 4-client numbers, postgresql/mysql/oracle, are 25x.xx ms. The clients were on the same machine as the server, so there is no real area to point at.. Only worry is database size. Postgresql is 111GB v/s 87 GB for mysql. All numbers include indexes. This is really going to be a problem when things are deployed. Any idea how it can be taken down? 7.3 should be a little bit better because of Manfred's work on reducing tuple header size --- if you create your tables WITHOUT OIDS, you should save 8 bytes per row compared to earlier releases. Got it.. Bye Shridhar -- Sweater, n.: A garment worn by a child when its mother feels chilly.
Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]
Hans-Jürgen Schönig [EMAIL PROTECTED] writes: how would you handle table spaces? The plan that's been discussed simply defines a tablespace as being a directory somewhere; physical storage of individual tables would remain basically the same, one or more files under the containing directory. The point of this being, of course, that the DBA could create the tablespace directories on different partitions or volumes in order to provide the behavior he wants. In my mind this would be primarily a cleaner, more flexible reimplementation of the existing database location feature. regards, tom lane
Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]
Quotas are handled differently on every platform (if available). Yeah. But that's the sysadmin's responsibility, not the DBA's. Maybe many people ARE the sysadmins of their PostgreSQL box ... When developing a database with an open mind people should try to see a problem from more than just one perspective. Why should anybody just forget about sysdbas??? If the DBA is the sysadmin, does it make a difference whether he learns about mount/ln or table spaces? Yes it does. Table spaces are limited to databases, but mount/ln is useful for any general-purpose sysadmin work. That answers the last point as well, I guess.. I agree but still, think of hybrid systems .. Agreed. Perfect point and I didn't think of it. But it can be done in a directory structure as well. Of course it's quite a deviation from what one thinks of as a plain old directory structure. But if this is one point where table spaces win, let's borrow that. There is a lot of baggage in table spaces that can be left out.. Besides, AFAIU, tablespaces implement quotas using data files which are pre-allocated. Pre-claiming space/resources is the kind of thing the likes of Oracle do, and it runs in the exact opposite direction of the postgresql philosophy. If postgresql has to implement quotas on objects, it should do so without preclaiming space. Besides, if postgresql offers quotas on a per-object basis in the directory/object scheme, I am sure that's far more granular than tablespaces. Choice is good.. I didn't think of pre-allocation - this is pretty much like Oracle would do it. I was thinking of having a maximum size or something like that. Overhead such as EXTENTS and things like that doesn't seem too useful to me - that's what a filesystem can be used for. I agree with Tom: If a tablespace was a directory it would be pretty simple and pretty useful. If people could define a maximum size it would be more than perfect.
All I think is necessary is: - having data in different, user defined, locations - having the chance to define a maximum size for that tablespace. Suggestion: CREATE TABLESPACE: Create a directory with a certain size (optional) - nothing special here. ALTER TABLESPACE: resize table space. Resizing is possible if the amount of data in the tablespace is smaller than the new size of the tablespace. DROP TABLESPACE: remove table space. The question in this case is - what about the objects in the tablespace? Objects can not always be deleted (just think of inheritance and parent tables). It's not about devils. It's about re-evaluating the need once again, especially at the level of the tablespace concept itself. That's why people should discuss it and think about it :). People want a good implementation or no implementation :). This is Open Source - it is designed to be discussed :). Hans -- *Cybertec Geschwinde u Schoenig* Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria Tel: +43/1/913 68 09; +43/664/233 90 75 www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at
Re: Table spaces again [was Re: [HACKERS] Threaded Sorting]
Is this NOT what I have been after for many months now? I dropped the tablespace/location idea before 7.2 because there didn't seem to be any interest. Please see my past emails for the SQL commands and on-disk directory layout I have proposed. I have a working 7.2 system with tablespaces/locations (whatever you want to call them; I like locations because tablespaces are an Oracle thing). I would like to get this code ported into 7.4. Jim Hans-Jürgen Schönig [EMAIL PROTECTED] writes: how would you handle table spaces? The plan that's been discussed simply defines a tablespace as being a directory somewhere; physical storage of individual tables would remain basically the same, one or more files under the containing directory. The point of this being, of course, that the DBA could create the tablespace directories on different partitions or volumes in order to provide the behavior he wants. In my mind this would be primarily a cleaner, more flexible reimplementation of the existing database location feature. regards, tom lane
Re: [HACKERS] Moving to PostGres
Benjamin Stewart [EMAIL PROTECTED] writes: 1. What is the postgresql equivalent to stored procedures, and can they be written in another language such as Java? PostgreSQL supports user-defined functions; in 7.3 (currently in beta) they can return sets of tuples. You can define functions in Java using http://pljava.sf.net , or in a variety of other languages (Perl, Python, Tcl, Ruby, C, PL/pgSQL, SQL, sh, etc.) 2. How well is Java supported for developing DB applications using PG? Pretty well, I guess. If you have a specific question, ask it. Cheers, Neil -- Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC
[HACKERS] reminder for those working on docs
Since 7.4 is getting real close and the docs are going to be going through their final once-overs, please remember to have a look at the DocNote comments that have been submitted. Once 7.4 is released the current notes will be gone. http://www.postgresql.org/idocs/checknotes.php The above url will show the notes and what they're in relation to, with a link to that particular piece of documentation. Vince. -- == Vince Vielhaber -- KA8CSH email: [EMAIL PROTECTED] http://www.pop4.net 56K Nationwide Dialup from $16.00/mo at Pop4 Networking http://www.camping-usa.com http://www.cloudninegifts.com http://www.meanstreamradio.com http://www.unknown-artists.com ==
Re: [HACKERS] [GENERAL] Large databases, performance
On Mon, 07 Oct 2002 19:48:31 +0530, Shridhar Daithankar [EMAIL PROTECTED] wrote: I say if it's a char field, there should be no indicator of length as it's not required. Just store those many characters straight ahead.. This is out of reach for a quick hack ... Sure. But the server machine is not available this week. Some other project is using it. So the results won't be out until at least a week from now. :-) This time the results are correct. Postgresql loads data faster, indexes it faster and queries in almost the same time.. Way to go.. Great! And now let's work on making selects faster, too. Servus Manfred
Re: [HACKERS] [pgsql-performance] [GENERAL] Large databases, performance
Shridhar Daithankar [EMAIL PROTECTED] writes: I say if it's a char field, there should be no indicator of length as it's not required. Just store those many characters straight ahead.. Your assumption fails when considering UNICODE or other multibyte character encodings. regards, tom lane
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On 6 Oct 2002, Greg Copeland wrote: On Sat, 2002-10-05 at 14:46, Curtis Faith wrote: 2) aio_write vs. normal write. Since as you and others have pointed out aio_write and write are both asynchronous, the issue becomes one of whether or not the copies to the file system buffers happen synchronously or not. Actually, I believe that write will be *mostly* asynchronous while aio_write will always be asynchronous. In a buffer poor environment, I believe write will degrade into a synchronous operation. In an ideal situation, I think they will prove to be on par with one another with a slight bias toward aio_write. In less than ideal situations where buffer space is at a premium, I think aio_write will get the leg up. Browsed web and came across this piece of text regarding a Linux-KAIO patch by Silicon Graphics... The asynchronous I/O (AIO) facility implements interfaces defined by the POSIX standard, although it has not been through formal compliance certification. This version of AIO is implemented with support from kernel modifications, and hence will be called KAIO to distinguish it from AIO facilities available from newer versions of glibc/librt. Because of the kernel support, KAIO is able to perform split-phase I/O to maximize concurrency of I/O at the device. With split-phase I/O, the initiating request (such as an aio_read) truly queues the I/O at the device as the first phase of the I/O request; a second phase of the I/O request, performed as part of the I/O completion, propagates results of the request. The results may include the contents of the I/O buffer on a read, the number of bytes read or written, and any error status. Preliminary experience with KAIO have shown over 35% improvement in database performance tests. Unit tests (which only perform I/O) using KAIO and Raw I/O have been successful in achieving 93% saturation with 12 disks hung off 2 X 40 MB/s Ultra-Wide SCSI channels. 
We believe that these encouraging results are a direct result of implementing a significant part of KAIO in the kernel using split-phase I/O while avoiding or minimizing the use of any globally contended locks. Well... In a worst case scenario, it seems that aio_write does get a win. I personally would at least like to see an aio implementation and would be willing to help benchmark it to validate any returns in performance. Surely if testing reflected a performance boost it would be considered for baseline inclusion?
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Mon, 2002-10-07 at 10:38, Antti Haapala wrote: Browsed web and came across this piece of text regarding a Linux-KAIO patch by Silicon Graphics... Ya, I have read this before. The problem here is that I'm not aware of which AIO implementation on Linux is the forerunner, nor do I have any idea how its implementation or performance details differ from those of other implementations on other platforms. I know there are at least two aio efforts underway for Linux. There could yet be others. Attempting to cite specifics that only pertain to Linux, and then only to a specific implementation which may or may not be in general use, is questionable. Because of this I simply left it as saying that I believe my analysis is pessimistic. Anyone have any idea if Red Hat's Advanced Server uses KAIO or what? Preliminary experience with KAIO has shown over 35% improvement in database performance tests. Unit tests (which only perform I/O) using KAIO and Raw I/O have been successful in achieving 93% saturation with 12 disks hung off 2 X 40 MB/s Ultra-Wide SCSI channels. We believe that these encouraging results are a direct result of implementing a significant part of KAIO in the kernel using split-phase I/O while avoiding or minimizing the use of any globally contended locks. The problem here is, I have no idea what they are comparing to (worst case read/writes, which we know PostgreSQL *mostly* isn't suffering from). If we assume that PostgreSQL's read/write operations are somewhat optimized (as it currently sounds like they are), I'd seriously doubt we'd see that big of a difference. On the other hand, I'm hoping that if an aio postgresql implementation does get done we'll see something like a 5%-10% performance boost. Even still, I have nothing to pin that on other than hope. If we do see a notable performance increase for Linux, I have no idea what it will do for other platforms. Then, there are all of the issues that Tom brought up about bloat/uglification and maintainability.
So, while I certainly do keep those remarks in my mind, I think it's best to simply encourage the effort (or something like it) and help determine where we really sit by means of empirical evidence. Greg
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
Greg Copeland [EMAIL PROTECTED] writes: Ya, I have read this before. The problem here is that I'm not aware of which AIO implementation on Linux is the forerunner, nor do I have any idea how its implementation or performance details differ from those of other implementations on other platforms. The implementation of AIO in 2.5 is the one by Ben LaHaise (not SGI). Not sure what the performance is like -- although it's been merged into 2.5 already, so someone can do some benchmarking. Can anyone suggest a good test? Keep in mind that glibc has had a user-space implementation for a little while (although I'd guess the performance to be unimpressive), so AIO would not be *that* kernel-version specific. Anyone have any idea if Red Hat's Advanced Server uses KAIO or what? RH AS uses Ben LaHaise's implementation of AIO, I believe. Cheers, Neil -- Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC
Re: [HACKERS] Table spaces again [was Re: Threaded Sorting]
Shridhar Daithankar [EMAIL PROTECTED] wrote: [snip] On 7 Oct 2002 at 15:52, Hans-Jürgen Schönig wrote: [snip] With tablespaces you can assign 30mb to user a, 120mb to user b etc. ... Table spaces are a nice abstraction layer to the file system. Hmm.. And how does that fit in the database metaphor? What practical use is that? I can't imagine, as I am a developer and not a DBA. Virtual hosting at ISPs, for example. I believe giving each database its own transaction log would be a great advantage of this scheme. Well, if you think of Tom's recent patch (ganged WAL writes), from a performance point of view this would only be good if each transaction log had its own disk. Otherwise a single transaction log is still better. I think tablespaces are a good idea. I also prefer associating tablespaces with directory structures over the Oracle style. Regards, Michael Paesold
[HACKERS] BTree metapage lock and freelist structure
Hello hackers, I'm thinking about the btree metapage locking problem. In the current situation there are three operations that lock the metapage: 1. when looking for the root page (share lock, getroot) 2. when setting a new root page (exclusive lock, setroot) 3. when extending the relation (exclusive lock, extend). Now, I want to add three more operations: 4. add a page to the freelist (addfree) 5. get a page from the freelist (getfree) 6. shrink a relation (shrink). The shrink operation only happens when one tries to add the last page of the relation to the freelist. I don't know if that's very common, but in case of relation truncation or massive deletion this is important. What I want is to be able to do getroot and setroot without being blocked by any of the other four. Of course the other four are all mutually exclusive. There doesn't seem to be a way to acquire two different locks on the same page, so I propose to lock the InvalidBlockNumber for the btree, and use that as the lock to be obtained before doing extend, addfree, getfree or shrink. The setroot/getroot operations would still use the locking on BTREE_METAPAGE, so they can proceed whether the InvalidBlockNumber is blocked or not. On a different topic: the freelist structure I think should be represented as an int32 count (btm_numfreepages) in BTMetaPageData, and a pointer to the first BlockNumber. Adding a new page is done by sequentially scanning the list until a zero is found or the end of the block is reached. Getting a page sequentially scans the same list until a nonzero block number is found (iff btm_numfreepages is greater than zero). This allows for ~ 2000 free pages (very unlikely to actually fill up if the shrink operation is in place). Comments? Another solution would be to have a separate page for the freelist, but it seems to be a waste. -- Alvaro Herrera (alvherre[a]dcc.uchile.cl) Because frankly, if you had to pass an exam to learn to manage yourself... who would be the tough guy holding a license? (Mafalda)
Re: Parallel Executors [was RE: [HACKERS] Threaded Sorting]
Curtis Faith wrote: The current transaction/user state seems to be stored in process global space. This could be changed to be a pointer to a struct stored in a back-end specific shared memory area which would be accessed by the executor process at execution start. The backend would destroy and recreate the shared memory and restart execution in the case where an executor process dies, much like the postmaster does with backends now. To the extent the executor process might make changes to the state, which I'd try to avoid if possible (don't know if it is), the executors could obtain locks; otherwise, if the executions were constrained to isolated elements (changes to different indexes, for example), it seems like it would be possible using an architecture where you have: Imagine there is a PL/Tcl function. On the first call in a session, the PL/Tcl interpreter gets created (that's during execution, okay?). Now the procedure that's called inside of that interpreter creates a global variable ... a global Tcl variable inside of that interpreter, which is totally unknown to the backend since it doesn't know what Tcl is at all, and that variable is nothing but an entry in a private hash table inside of that interpreter. On a subsequent call to any PL/Tcl function during that session, it might be good if that darn hashtable entry exists. How do you propose to let this happen? And while at it, the Tcl procedure next calls spi_exec, causing the PL/Tcl function handler to call SPI_exec(), so your isolated executor all of a sudden becomes a fully operational backend, doing the parsing, planning and optimizing, or what? Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] #
Re: Parallel Executors [was RE: [HACKERS] Threaded Sorting]
Curtis Faith wrote: The current transaction/user state seems to be stored in process global space. This could be changed to be a pointer to a struct stored in a back-end specific shared memory area which would be accessed by the executor process at execution start. The backend would destroy and recreate the shared memory and restart execution in the case where an executor process dies, much like the postmaster does with backends now. To the extent the executor process might make changes to the state, which I'd try to avoid if possible (don't know if it is), the executors could obtain locks; otherwise, if the executions were constrained to isolated elements (changes to different indexes, for example), it seems like it would be possible using an architecture where you have: Jan Wieck replied: Imagine there is a PL/Tcl function. On the first call in a session, the PL/Tcl interpreter gets created (that's during execution, okay?). Now the procedure that's called inside of that interpreter creates a global variable ... a global Tcl variable inside of that interpreter, which is totally unknown to the backend since it doesn't know what Tcl is at all, and that variable is nothing but an entry in a private hash table inside of that interpreter. On a subsequent call to any PL/Tcl function during that session, it might be good if that darn hashtable entry exists. How do you propose to let this happen? And while at it, the Tcl procedure next calls spi_exec, causing the PL/Tcl function handler to call SPI_exec(), so your isolated executor all of a sudden becomes a fully operational backend, doing the parsing, planning and optimizing, or what? You bring up a good point; we couldn't do what I propose for all situations. I had never anticipated that splitting things up would be the rule. For example, the optimizer would have to decide whether it made sense to split up a query from a strictly performance perspective.
So now, if we consider the fact that some things could not be done with split backend execution, the logic becomes: if (splitting is possible && splitting is faster) do the split execution; else do the normal execution; Since the design already splits the backend internally into a separate execution phase, it seems like one could keep the current implementation for the typical case where splitting doesn't buy anything, or cases where there is complex state information that needs to be maintained. If there are no triggers or functions that will be accessed by a given query then I don't see your concerns applying. If there are triggers or other conditions which preclude multi-process execution, we can keep exactly the same behavior as now. The plan execution entry could easily be a place where it either A) did the same thing it currently does or B) passed execution off to a pool as per the original proposal. I have to believe that most SELECTs won't be affected by your concerns. Additionally, even in the case of an UPDATE, many times there are large portions of the operation's actual work that wouldn't be affected even if there are lots of triggers on the tables being updated. The computation of the inside of the WHERE could often be split out without causing any problems with context or state information. The master executor could always be the original backend, as it is now, and this would be the place where the UPDATE part would be processed after the WHERE tuples had been identified. As with any optimization, it is more complicated and won't handle all the cases. It's just an idea to handle common cases that would otherwise be much slower. - Curtis
[HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
On Sun, 2002-10-06 at 11:46, Tom Lane wrote: I can't personally get excited about something that only helps if your server is starved for RAM --- who runs servers that aren't fat on RAM anymore? But give it a shot if you like. Perhaps your analysis is pessimistic. snipped I don't find it far fetched to imagine situations where people may commit large amounts of memory for the database yet marginally starve available memory for file system buffers. Especially so on heavily I/O bound systems, or where sporadically other types of non-database file activity may occur. snipped Of course, that opens the door for simply adding more memory and/or slightly reducing the amount of memory available to the database (thus making it available elsewhere). Now, after all that's said and done, having something like aio in use would seemingly allow it to be somewhat more self-tuning from a potential performance perspective. Good points. Now for some surprising news (at least it surprised me). I researched the file system source on my system (FreeBSD 4.6) and found that the behavior was optimized for non-database access, to eliminate unnecessary writes when temp files are created and deleted rapidly. It was not optimized to get data to the disk in the most efficient manner. The syncer on FreeBSD appears to place dirtied filesystem buffers into work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer processes one of the queues and increments a counter syncer_delayno. On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd of the writes that were buffered are processed. If the syncer gets behind and the writes for a given second take more than one second to process, the syncer does not wait but begins processing the next queue. AFAICT this means that there is no opportunity to have writes combined by the disk, since they are processed in buckets based on the time the writes came in.
Also, it seems very likely that many installations won't have enough buffers for 30 seconds worth of changes, and that there would be some level of SYNCHRONOUS writing because of this delay and the syncer process getting backed up. This might happen once per second as the buffers get full and the syncer has not yet started for that second's interval. Linux might handle this better. I saw some emails exchanged a year or so ago about starting writes immediately in a low-priority way, but I'm not sure if those patches got applied to the linux kernel or not. The source I had access to seems to do something analogous to FreeBSD, but using fixed percentages of the dirty blocks or a minimum number of blocks. They appear to be handled in LRU order, however. On-disk caches are much much larger these days, so it seems that some way of getting the data out sooner would result in better write performance for the cache. My newer drive is a 10K RPM IBM Ultrastar SCSI and it has a 4M cache. I don't see these caches getting smaller over time, so not letting the disk see writes will become more and more of a performance drain. - Curtis
[HACKERS] Hot Backup
Hello to all the Doers of Postgres!!! Last time I went through forums, people spoke highly about 7.3 and its capability to do hot backups. My problem is that if the database goes down and I lose my main data store, then I will lose all transactions back to the time I did the pg_dump. Other databases (e.g., Oracle) solve this by retaining their archive logs in some physically separate storage. So, when you lose your data, you can restore the data from backup, and then apply your archive log, and avoid losing any committed transactions. Postgresql has been lacking this all along. I've installed postgres 7.3b2 and still don't see any archives flushed to any other place. Please let me know how the hot backup procedure is implemented in the current 7.3 beta(2) release. Thanks. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Analysis of ganged WAL writes
I wrote: That says that the best possible throughput on this test scenario is 5 transactions per disk rotation --- the CPU is just not capable of doing more. I am actually getting about 4 xact/rotation for 10 or more clients (in fact it seems to reach that plateau at 8 clients, and be close to it at 7). After further thought I understand why it takes 8 clients to reach full throughput in this scenario. Assume that we have enough CPU oomph so that we can process four transactions, but not five, in the time needed for one revolution of the WAL disk. If we have five active clients then the behavior will be like this: 1. Backend A becomes ready to commit. It locks WALWriteLock and issues a write/flush that will only cover its own commit record. Assume that it has to wait one full disk revolution for the write to complete (this will be the steady-state situation). 2. While A is waiting, there is enough time for B, C, D, and E to run their transactions and become ready to commit. All eventually block on WALWriteLock. 3. When A finishes its write and releases WALWriteLock, B will acquire the lock and initiate a write that (with my patch) will cover C, D, and E's commit records as well as its own. 4. While B is waiting for the disk to spin, A receives a new transaction from its client, processes it, and becomes ready to commit. It blocks on WALWriteLock. 5. When B releases the lock, C, D, E acquire it and quickly fall through, seeing that they need do no work. Then A acquires the lock. GOTO step 1. So with five active threads, we alternate between committing one transaction and four transactions on odd and even disk revolutions. It's pretty easy to see that with six or seven active threads, we will alternate between committing two or three transactions and committing four. Only when we get to eight threads do we have enough backends to ensure that four transactions are available to commit on every disk revolution. 
This must be so because the backends that are released at the end of any given disk revolution will not be able to participate in the next group commit, if there is already at least one backend ready to commit. So this solution isn't perfect; it would still be nice to have a way to delay initiation of the WAL write until just before the disk is ready to accept it. I dunno any good way to do that, though. I went ahead and committed the patch for 7.3, since it's simple and does offer some performance improvement. But maybe we can think of something better later on... regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
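Tom's alternation argument can be captured in a toy model. The function and its `capacity` parameter are illustrative, not PostgreSQL code; it assumes, as in his scenario, that the disk absorbs at most `capacity` commit records per revolution and that backends released at the end of one revolution miss the group commit already in progress:

```c
#include <assert.h>

/* Average committed transactions per disk revolution.  With
 * capacity < clients < 2*capacity the system alternates between
 * committing (clients - capacity) and capacity transactions on
 * odd and even revolutions; at clients >= 2*capacity it plateaus. */
double avg_commits_per_rev(int clients, int capacity)
{
    if (clients >= 2 * capacity)
        return (double) capacity;       /* a full gang every revolution */
    if (clients <= capacity)
        return -1.0;                    /* outside the range modeled here */
    return ((clients - capacity) + capacity) / 2.0;
}
```

For capacity 4 this reproduces the progression in the text: 2.5 per revolution at five clients, rising to the plateau of 4 at eight.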
Re: [HACKERS] Hot Backup
Sandeep Chadha [EMAIL PROTECTED] writes: Postgresql has been lacking this all along. I've installed postgres 7.3b2 and still don't see any archive's flushed to any other place. Please let me know how is hot backup procedure implemented in current 7.3 beta(2) release. AFAIK no such hot backup feature has been implemented for 7.3 -- you appear to have been misinformed. That said, I agree that would be a good feature to have. Cheers, Neil -- Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Analysis of ganged WAL writes
Tom, first of all, excellent job improving the current algorithm. I'm glad you looked at the WALCommitLock code. This must be so because the backends that are released at the end of any given disk revolution will not be able to participate in the next group commit, if there is already at least one backend ready to commit. This is the major reason for my original suggestion about using aio_write. The writes don't block each other and there is no need for a kernel-level exclusive locking call like fsync or fdatasync. Even the theoretical limit you mention of one transaction per revolution per committing process seems like a significant bottleneck. Is committing 1 and 4 transactions on every revolution good? It's certainly better than 1 per revolution. However, what if we could have done 3 transactions per process in the time it took for a single revolution? Then we are looking at (1 + 4) / 2 = 2.5 transactions per revolution versus the theoretical maximum of (3 * 5) = 15 transactions per revolution if we can figure out a way to do non-blocking writes that we can guarantee are on the disk platter so we can return from commit. Setting aside whether or not aio is viable: do you not agree that eliminating the blocking would result in potentially a 6X improvement for the 5-process case? So this solution isn't perfect; it would still be nice to have a way to delay initiation of the WAL write until just before the disk is ready to accept it. I dunno any good way to do that, though. I still think that it would be much faster to just keep writing the WAL log blocks when they fill up and have a separate process wake the committing process when the write completes. This would eliminate WAL writing as a bottleneck. I have yet to hear anyone say that this can't be done, only that we might not want to do it because the code might not be clean. 
I'm generally only happy when I can finally remove a bottleneck completely, but speeding one up by 3X like you have done is pretty damn cool for a day or two's work. - Curtis ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
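Curtis's aio_write proposal, reduced to a minimal sketch using the POSIX aio interface. The `wal_aio_write` wrapper is hypothetical (not PostgreSQL code), and for simplicity it blocks in `aio_suspend`; his actual scheme would have the backend go do other work and be woken on completion:

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <sys/types.h>

/* Issue an asynchronous write of 'len' bytes of WAL at 'offset' and
 * wait for completion.  Returns bytes written, or -1 on error. */
ssize_t wal_aio_write(int fd, const void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = (void *) buf;      /* aiocb's buffer field is non-const */
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_write(&cb) != 0)
        return -1;

    /* Only this committing backend waits here; writes issued by other
     * backends to other offsets proceed without an exclusive lock. */
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    return aio_return(&cb);
}
```

The contrast with the current scheme is that there is no single WALWriteLock serializing the flushes; each write carries its own completion state.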
Re: [HACKERS] v7.2.3 - tag'd, packaged ... need it checked ...
Marc G. Fournier writes: Looks good from my end, Peter, I pulled the same docs that I pulled for v7.2.2, which I hope is okay? Probably not, because the version number needs to be changed and they need to be rebuilt for each release. -- Peter Eisentraut [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Hot Backup
Hmmm, Then are there any new enhancements as far as backups are concerned between current 7.2.x to 7.3.x. Like can we do a tar when database is up and running or another feature. Thanks a bunch in advance. -Original Message- From: Neil Conway [mailto:[EMAIL PROTECTED]] Sent: Monday, October 07, 2002 1:48 PM To: Sandeep Chadha Cc: Tom Lane; [EMAIL PROTECTED]; pgsql-general Subject: Re: [HACKERS] Hot Backup Sandeep Chadha [EMAIL PROTECTED] writes: Postgresql has been lacking this all along. I've installed postgres 7.3b2 and still don't see any archive's flushed to any other place. Please let me know how is hot backup procedure implemented in current 7.3 beta(2) release. AFAIK no such hot backup feature has been implemented for 7.3 -- you appear to have been misinformed. That said, I agree that would be a good feature to have. Cheers, Neil -- Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Analysis of ganged WAL writes
Curtis Faith [EMAIL PROTECTED] writes: Even the theoretical limit you mention of one transaction per revolution per committing process seem like a significant bottleneck. Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-)) An application that is willing to have multiple transactions in flight at the same time can open up multiple backend connections to issue those transactions, and thereby perhaps beat the theoretical limit. But for serial transactions, there is not anything we can do to beat that limit. (At least not with the log structure we have now. One could imagine dropping a commit record into the nearest one of multiple buckets that are carefully scattered around the disk. But exploiting that would take near-perfect knowledge about disk head positioning; it's even harder to solve than the problem we're considering now.) I still think that it would be much faster to just keep writing the WAL log blocks when they fill up and have a separate process wake the commiting process when the write completes. This would eliminate WAL writing as a bottleneck. You're failing to distinguish total throughput to the WAL drive from response time seen by any one transaction. Yes, a policy of writing each WAL block once when it fills would maximize potential throughput, but it would also mean a potentially very large delay for a transaction waiting to commit. The lower the system load, the worse the performance on that scale. The scheme we now have (with my recent patch) essentially says that the commit delay seen by any one transaction is at most two disk rotations. Unfortunately it's also at least one rotation :-(, except in the case where there is no contention, ie, no already-scheduled WAL write when the transaction reaches the commit stage. 
It would be nice to be able to say at most one disk rotation instead --- but I don't see how to do that in the absence of detailed information about disk head position. Something I was toying with this afternoon: assume we have a background process responsible for all WAL writes --- not only filled buffers, but the currently active buffer. It periodically checks to see if there are unwritten commit records in the active buffer, and if so schedules a write for them. If this could be done during each disk rotation, just before the disk reaches the active WAL log block, we'd have an ideal solution. And it would not be too hard for such a process to determine the right time: it could measure the drive rotational speed by observing the completion times of successive writes to the same sector, and it wouldn't take much logic to empirically find the latest time at which a write can be issued and have a good probability of hitting the disk on time. (At least, this would work pretty well given a dedicated WAL drive, else there'd be too much interference from other I/O requests.) However, this whole scheme falls down on the same problem we've run into before: user processes can't schedule themselves with millisecond accuracy. The writer process might be able to determine the ideal time to wake up and make the check, but it can't get the Unix kernel to dispatch it then, at least not on most Unixen. The typical scheduling slop is one time slice, which is comparable to if not more than the disk rotation time. ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the ganged write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are. 
regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
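The scheduling objection Tom raises comes down to simple arithmetic: a wakeup can be late by up to one time slice, so the writer process can only aim "just before the head arrives" if a slice is comfortably shorter than a revolution. A sketch (the factor-of-two margin is an invented rule of thumb, not a measured value):

```c
#include <assert.h>

double rotation_ms(int rpm) { return 60.0 * 1000.0 / rpm; }
double timeslice_ms(int hz) { return 1000.0 / hz; }

/* Can a user process plausibly time its WAL writes to the platter?
 * Only if the scheduler granularity is well under one revolution. */
int can_time_writes(int rpm, int hz)
{
    return timeslice_ms(hz) < rotation_ms(rpm) / 2.0;
}
```

At the traditional 100 Hz slice (10 ms) even a 10,000 rpm drive's 6 ms revolution is unreachable, which is Tom's point; a 500 Hz kernel against a 7200 rpm drive is more promising.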
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
Curtis Faith wrote: Good points. Now for some surprising news (at least it surprised me). I researched the file system source on my system (FreeBSD 4.6) and found that the behavior was optimized for non-database access to eliminate unnecessary writes when temp files are created and deleted rapidly. It was not optimized to get data to the disk in the most efficient manner. The syncer on FreeBSD appears to place dirtied filesystem buffers into work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer processes one of the queues and increments a counter syncer_delayno. On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd of the writes that were buffered are processed. If the syncer gets behind and the writes for a given second exceed one second to process the syncer does not wait but begins processing the next queue. AFAICT this means that there is no opportunity to have writes combined by the disk since they are processed in buckets based on the time the writes came in. This is the trickle syncer. It prevents bursts of disk activity every 30 seconds. It is for non-fsync writes, of course, and I assume if the kernel buffers get low, it starts to flush faster. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Analysis of ganged WAL writes
On Tue, 2002-10-08 at 00:12, Curtis Faith wrote: Tom, first of all, excellent job improving the current algorithm. I'm glad you looked at the WALCommitLock code. This must be so because the backends that are released at the end of any given disk revolution will not be able to participate in the next group commit, if there is already at least one backend ready to commit. This is the major reason for my original suggestion about using aio_write. The writes don't block each other and there is no need for a kernel level exclusive locking call like fsync or fdatasync. Even the theoretical limit you mention of one transaction per revolution per committing process seems like a significant bottleneck. Is committing 1 and 4 transactions on every revolution good? It's certainly better than 1 per revolution. Of course committing all 5 at each rev would be better ;) However, what if we could have done 3 transactions per process in the time it took for a single revolution? I may be missing something obvious, but I don't see a way to get more than 1 trx/process/revolution, as each previous transaction in that process must be written to disk before the next can start, and the only way it can be written to the disk is when the disk heads are in the right place, and that happens exactly once per revolution. In theory we could devise some clever page interleave scheme that would allow us to go like this: fill one page - write the page to disk, commit trx's - fill the next page in the next 1/3 of a rev - write the next page to disk ..., but this will work only for some limited set of WAL page sizes. It could be possible to get near 5 trx/rev for 5 backends if we do the following (A-E are backends from Tom's explanation): 1. write the page for A's trx to its proper position P (where P is the page number) 2. if, after the sync for A returns, we already have more transactions waiting for write()+sync() of the same page, immediately write the _same_ page to position P+N (where N is a tunable parameter). 
If N is small enough then P+N will be on the same cylinder in most cases and thus will get transactions B-E also committed on the same rev. 3. make sure that the last version will also be written to its proper place before the end of the log overwrites P+N. (This may be tricky.) 4. When restoring from WAL, always check for a page at EndPos+N for a possible newer version of the last page. This scheme requires page numbers + page versions to be stored in each page and could get us near 1 trx/backend/rev performance, but it's hard to tell if it is really useful in real life. This could also possibly be extended to more than one end page and more than one continuation end-page copy to get better than 1 trx/backend/rev. - Hannu ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
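The recovery check in Hannu's step 4 might look like the following. The header fields here are invented for illustration (real WAL pages carry different metadata); the rule is just "the copy at EndPos+N wins if it is the same logical page with a higher version":

```c
#include <assert.h>

typedef struct {
    unsigned page_no;   /* P: logical page number in the log */
    unsigned version;   /* bumped each time the page is rewritten */
} wal_page_hdr;

/* During replay: should the copy found at EndPos+N replace the copy
 * found at EndPos?  Only if it is the same logical page and newer. */
int shadow_copy_is_newer(const wal_page_hdr *at_end,
                         const wal_page_hdr *at_end_plus_n)
{
    return at_end_plus_n->page_no == at_end->page_no
        && at_end_plus_n->version > at_end->version;
}
```

The page-number check matters: once the log wraps, position P+N may legitimately hold an unrelated page, which must not be mistaken for a newer copy.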
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
This is the trickle syncer. It prevents bursts of disk activity every 30 seconds. It is for non-fsync writes, of course, and I assume if the kernel buffers get low, it starts to flush faster. AFAICT, the syncer only speeds up when virtual memory paging fills the buffers past a threshold and even in that event it only speeds it up by a factor of two. I can't find any provision for speeding up flushing of the dirty buffers when they fill for normal file system writes, so I don't think that happens. - Curtis ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Analysis of ganged WAL writes
On Tue, 2002-10-08 at 01:27, Tom Lane wrote: The scheme we now have (with my recent patch) essentially says that the commit delay seen by any one transaction is at most two disk rotations. Unfortunately it's also at least one rotation :-(, except in the case where there is no contention, ie, no already-scheduled WAL write when the transaction reaches the commit stage. It would be nice to be able to say at most one disk rotation instead --- but I don't see how to do that in the absence of detailed information about disk head position. Something I was toying with this afternoon: assume we have a background process responsible for all WAL writes --- not only filled buffers, but the currently active buffer. It periodically checks to see if there are unwritten commit records in the active buffer, and if so schedules a write for them. If this could be done during each disk rotation, just before the disk reaches the active WAL log block, we'd have an ideal solution. And it would not be too hard for such a process to determine the right time: it could measure the drive rotational speed by observing the completion times of successive writes to the same sector, and it wouldn't take much logic to empirically find the latest time at which a write can be issued and have a good probability of hitting the disk on time. (At least, this would work pretty well given a dedicated WAL drive, else there'd be too much interference from other I/O requests.) However, this whole scheme falls down on the same problem we've run into before: user processes can't schedule themselves with millisecond accuracy. The writer process might be able to determine the ideal time to wake up and make the check, but it can't get the Unix kernel to dispatch it then, at least not on most Unixen. The typical scheduling slop is one time slice, which is comparable to if not more than the disk rotation time. Standard for Linux has been 100Hz time slice, but it is configurable for some time. 
The latest RedHat (8.0) is built with 500Hz, which makes about 4 slices/rev for 7200 rpm disks (2 for 15000 rpm). ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the ganged write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are. At least for open source kernels it could be possible to 1. write a patch for the kernel, or 2. get the authors of kernel aio interested in doing it, or 3. use some real-time (RT) OS or a mixed RT/conventional OS where some threads can be scheduled hard-RT. In an RT OS you are supposed to be able to do exactly what you describe. I think that 2 and 3 could be outsourced (the respective developers talked into supporting it), as both KAIO and RT Linuxen/BSDs are probably also interested in high-profile applications, so they could boast that using their stuff enables a PostgreSQL database to run twice as fast. Anyway, getting to near-hardware speeds for a database will need more specific support from the OS than web browsing or compiling does. --- Hannu ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Analysis of ganged WAL writes
Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-)) I've never disputed this, so if I seem to be suggesting that, I've been unclear. I'm just assuming that the disk can get a confirmation back to the INSERTing process in much less than one rotation. This would allow that process to start working again, perhaps in time to complete another transaction. An application that is willing to have multiple transactions in flight at the same time can open up multiple backend connections to issue those transactions, and thereby perhaps beat the theoretical limit. But for serial transactions, there is not anything we can do to beat that limit. (At least not with the log structure we have now. One could imagine dropping a commit record into the nearest one of multiple buckets that are carefully scattered around the disk. But exploiting that would take near-perfect knowledge about disk head positioning; it's even harder to solve than the problem we're considering now.) Consider the following scenario, with time measured in disk rotations:

Time 1.00 - Process A commits, causing an aio_write to the log; A waits
Time 1.03 - aio completes for Process A's write and wakes Process A
Time 1.05 - Process A starts another transaction
Time 1.08 - Process A commits
etc.

I agree that a process can't proceed from a commit until it receives confirmation of the write, but if the write has hit the disk before a full rotation then the process should be able to continue processing new transactions. You're failing to distinguish total throughput to the WAL drive from response time seen by any one transaction. Yes, a policy of writing each WAL block once when it fills would maximize potential throughput, but it would also mean a potentially very large delay for a transaction waiting to commit. The lower the system load, the worse the performance on that scale. 
You are assuming fsync or fdatasync behavior; I am not. There would be no delay under the scenario I describe. The transaction would exit commit as soon as the confirmation of the write is received from the aio system. I would hope that with a decent aio implementation this would generally be much less than one rotation. I think that the single-transaction response time is very important, and that's one of the chief problems I sought to solve when I proposed aio_writes for logging in my original email many moons ago. ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the ganged write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are. All aio_write has to do is pass the write off to the device as soon as aio_write gets it, bypassing the system buffers. The code on the disk's hardware is very good at knowing when the disk head is coming. IMHO, bypassing the kernel's less-than-enlightened writing system is the main point of using aio_write. - Curtis ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Analysis of ganged WAL writes
On Mon, 2002-10-07 at 16:06, Curtis Faith wrote: Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-)) At this point, I think we've come full circle. Can we all agree that this concept is a *potential* source of improvement in a variety of situations? If we can agree on that, perhaps we should move to the next stage in the process, validation? How long do you think it would take to develop something worthy of testing? Do we have known test cases which will properly (in)validate the approach that everyone will agree to? If code is reasonably clean so as to pass the smell test and shows a notable performance boost, will it be seriously considered for inclusion? If so, I assume it would become a configure option (--with-aio)? Regards, Greg
Re: [HACKERS] Analysis of ganged WAL writes
Greg Copeland wrote: snip If so, I assume it would become a configure option (--with-aio)? Or maybe a GUC use_aio ? :-) Regards and best wishes, Justin Clift Regards, Greg -- My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there. - Indira Gandhi ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
On Mon, 2002-10-07 at 15:28, Bruce Momjian wrote: This is the trickle syncer. It prevents bursts of disk activity every 30 seconds. It is for non-fsync writes, of course, and I assume if the kernel buffers get low, it starts to flush faster. Doesn't this also increase the likelihood that people will be running in a buffer-poor environment more frequently than I previously asserted, especially in very heavily I/O-bound systems? Unless I'm mistaken, that opens the door for a general case of why an aio implementation should be looked into. Also, on a side note, IIRC, linux kernel 2.5.x has a new priority elevator which is said to be MUCH better at saturating disks than ever before. Once 2.6 (or whatever its number will be) is released, it may not be as much of a problem as it seems to be for FreeBSD (I think that's the one you're using). Greg
Re: [HACKERS] Analysis of ganged WAL writes
Curtis Faith [EMAIL PROTECTED] writes: Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-)) I've never disputed this, so if I seem to be suggesting that, I've beee unclear. I'm just assuming that the disk can get a confirmation back to the INSERTing process in much less than one rotation. You've spent way too much time working with lying IDE drives :-( Do you really trust a confirm-before-write drive to make that write if it loses power? I sure don't. If you do trust your drive to hold that data across a crash, then ISTM the whole problem goes away anyway, as writes will complete quite independently of disk rotation. My Linux box has no problem claiming that it's completing several thousand TPS with a single client ... and yes, fsync is on, but it's using an IDE drive, and I don't know how to disable confirm-before-write on that drive. (That's why I did these tests on my old slow HP hardware.) Basically, the ganging of commit writes happens inside the disk controller on a setup like that. You still don't need aio_write --- unless perhaps to reduce wastage of IDE bus bandwidth by repeated writes, but that doesn't seem to be a scarce resource in this context. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Analysis of ganged WAL writes
Well, I was thinking that aio may not be available on all platforms, thus the conditional compile option. On the other hand, wouldn't you pretty much want it either on or off for all instances? I can see that it would be nice for testing though. ;) Greg On Mon, 2002-10-07 at 16:23, Justin Clift wrote: Greg Copeland wrote: snip If so, I assume it would become a configure option (--with-aio)? Or maybe a GUC use_aio ? :-) Regards and best wishes, Justin Clift Regards, Greg -- My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there. - Indira Gandhi
Re: [HACKERS] 7.2.3 patching done
On Mon, Oct 07, 2002 at 09:22:38PM +0200, Peter Eisentraut wrote: Tom Lane writes: But the source distribution hasn't *got* any binary files. There are some under doc/src/graphics, and then there are doc/postgres.tar.gz and doc/man.tar.gz. And what about publishing xdelta patches? -- Alvaro Herrera (alvherre[a]dcc.uchile.cl) Escucha y olvidarás; ve y recordarás; haz y entenderás (Confucio) ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Analysis of ganged WAL writes
I may be missing something obvious, but I don't see a way to get more than 1 trx/process/revolution, as each previous transaction in that process must be written to disk before the next can start, and the only way it can be written to the disk is when the disk heads are in the right place and that happens exactly once per revolution. Okay, consider the following scenario:

1) Process A commits when the platter is at 0 degrees.
2) There are enough XLog writes from other processes to fill 1/4 of a platter rotation worth of log, or 90 degrees. The SCSI drive writes the XLog commit record and keeps writing other log entries as the head rotates.
3) Process A receives a confirmation of the write before the platter rotates 60 degrees.
4) Process A continues and adds another commit before the platter rotates to 90 degrees.

This should be very possible and more and more likely in the future as CPUs get faster and faster relative to disks. I'm not suggesting this would happen all the time, just that it's possible and that an SMP machine with good CPUs and a fast I/O subsystem should be able to keep the log writing at close to I/O bandwidth limits. The case of bulk inserts is one where I would expect that for simple tables we should be able to peg the disks given today's hardware and enough inserting processes. - Curtis ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
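The degrees in this scenario follow directly from rotation speed; for a hypothetical 10,000 rpm drive (6 ms per revolution), a confirmation arriving 1 ms after the commit finds the platter at 60 degrees. A small helper makes the arithmetic explicit:

```c
#include <assert.h>

/* Platter position, in degrees, 'ms' milliseconds after a reference
 * point at 0 degrees, for a drive spinning at 'rpm'. */
double platter_angle(double ms, int rpm)
{
    double rev_ms = 60.0 * 1000.0 / rpm;    /* one full revolution */
    double revs = ms / rev_ms;
    return (revs - (long) revs) * 360.0;    /* wrap into [0, 360) */
}
```

So the window between "write confirmed" and "end of the 90-degree log burst" is the slack in which Process A must run its next transaction.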
[HACKERS] Where to call SetQuerySnapshot
I noticed that the new EXECUTE statement does not call SetQuerySnapshot, which seems like a bad thing. The omission is masked by the defensive code in CopyQuerySnapshot, which will automatically do SetQuerySnapshot if it hasn't been done yet in the current transaction. However, this doesn't provide the right behavior in read-committed mode: if we are inside a transaction and not the first query, I'd think EXECUTE should take a new snapshot; but it won't. Comparable code can be found in COPY OUT, for which tcop/utility.c does SetQuerySnapshot() before calling the command-specific routine. Questions for the list: 1. Where is the cleanest place to call SetQuerySnapshot() for utility statements that need it? Should we follow the lead of the existing COPY code, and add the call to the ExecuteStmt case in utility.c? Or should we move the calls into the command-specific routines (DoCopy and ExecuteQuery)? Or perhaps it should be done in postgres.c, which has this responsibility for non-utility statements? 2. Would it be a good idea to change CopyQuerySnapshot to elog(ERROR) instead of silently creating a snapshot when none has been made? I think I was the one who put in its auto-create-snapshot behavior, but I'm now feeling like that was a mistake. It hides omissions that we should want to find. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
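The two policies in question 2 can be modeled in miniature. The `_model`, `_auto`, and `_strict` names are invented, and this mimics only the control flow, not the real snapshot code; the point is how the auto-create branch makes a missing SetQuerySnapshot call invisible:

```c
#include <assert.h>
#include <stdbool.h>

static bool snapshot_set = false;

void SetQuerySnapshot_model(void)  { snapshot_set = true; }
void start_transaction_model(void) { snapshot_set = false; }

/* Current policy: silently create a snapshot if none exists.
 * Returns true if it had to auto-create one (a masked missing call). */
bool CopyQuerySnapshot_auto(void)
{
    if (!snapshot_set)
    {
        SetQuerySnapshot_model();
        return true;
    }
    return false;
}

/* Proposed policy: a missing snapshot is a bug.  Returns false where
 * the real code would elog(ERROR), to keep this model runnable. */
bool CopyQuerySnapshot_strict(void)
{
    return snapshot_set;
}
```

Under the auto-create policy the omission is papered over on the first call and undetectable afterwards; under the strict policy it surfaces immediately, which is the behavior Tom is leaning toward.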
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
I sent this yesterday, but it seems not to have made it to the list... I have a couple of comments orthogonal to the present discussion.

1) It would be fairly easy to write log records over a network to a dedicated process on another system. If the other system has an uninterruptible power supply, this is about as safe as writing to disk, and it would get rid of the need for any fsync on the log at all. There would be some extra code needed on restart to fetch the end of the log from the other system, but it doesn't seem like much. I think this would be an attractive option for a lot of people: most sites have at least two systems, and the demands on the logging system would be minimal.

2) It is also possible, with kernel modifications, to have special logging partitions where log records are written wherever the head happens to be. Tzi-cker Chiueh and Lan Huang at Stony Brook (http://www.cs.sunysb.edu/~lanhuang/research.htm) have built this, although I don't think they have released any code. (A similar idea, called WADS, is described in Gray and Reuter's book.) If the people at Red Hat are interested in adding value for PostgreSQL on Red Hat Linux, this would be one idea. It could also be used to speed up ext3 and other journaling file systems.
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
Greg Copeland [EMAIL PROTECTED] writes:
> Doesn't this also increase the likelihood that people will be running in a buffer-poor environment more frequently than I previously asserted, especially on very heavily I/O-bound systems? Unless I'm mistaken, that opens the door for a general case of why an aio implementation should be looked into.

Neil Conway replies:
> Well, at least for *this specific situation*, it doesn't really change anything -- since FreeBSD doesn't implement POSIX AIO as far as I know, we can't use that as an alternative.

I haven't tried it yet, but there does seem to be an aio implementation in FreeBSD 4.6.2 that conforms to POSIX. It's part of the kernel and can be found in: /usr/src/sys/kern/vfs_aio.c

> However, I'd suspect that the FreeBSD kernel allows for some way to tune the behavior of the syncer. If that's the case, we could do some research into what settings are more appropriate for FreeBSD, and recommend those in the docs. I don't run FreeBSD, however -- would someone like to volunteer to take a look at this?

I didn't see anything obvious in the docs, but I still believe there's some way to tune it. I'll let everyone know if I find better settings.

> BTW Curtis, did you happen to check whether this behavior has been changed in FreeBSD 5.0?

I haven't checked, but I will.
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
Greg Copeland [EMAIL PROTECTED] writes:
> Doesn't this also increase the likelihood that people will be running in a buffer-poor environment more frequently than I previously asserted, especially on very heavily I/O-bound systems? Unless I'm mistaken, that opens the door for a general case of why an aio implementation should be looked into.

Well, at least for *this specific situation*, it doesn't really change anything -- since FreeBSD doesn't implement POSIX AIO as far as I know, we can't use that as an alternative. However, I'd suspect that the FreeBSD kernel allows for some way to tune the behavior of the syncer. If that's the case, we could do some research into what settings are more appropriate for FreeBSD, and recommend those in the docs. I don't run FreeBSD, however -- would someone like to volunteer to take a look at this?

BTW Curtis, did you happen to check whether this behavior has been changed in FreeBSD 5.0?

> Also, on a side note, IIRC, linux kernel 2.5.x has a new priority elevator which is said to be MUCH better at saturating disks than ever before.

Yeah, there are lots of interesting new features for database systems in the new kernel -- I'm looking forward to when 2.6 is widely deployed...

Cheers, Neil

--
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
Curtis Faith wrote:
> > This is the trickle syncer. It prevents bursts of disk activity every 30 seconds. It is for non-fsync writes, of course, and I assume that if the kernel buffers get low, it starts to flush faster.
> AFAICT, the syncer only speeds up when virtual memory paging fills the buffers past a threshold, and even then it only speeds up by a factor of two. I can't find any provision for speeding up the flushing of dirty buffers when they fill from normal file system writes, so I don't think that happens.

So you think that if I try to write a 1 GB file, it will write enough to fill up the buffers, then wait while the syncer writes out a few blocks every second, free up some buffers, then write some more? Take a look at vfs_bio::getnewbuf() on *BSD and you will see that when it can't get a buffer, it will async-write a dirty buffer to disk.

As far as this AIO conversation is concerned, I want to see someone come up with a performance improvement that we can only get with AIO. Unless I see it, I am not interested in pursuing this thread.

--
Bruce Momjian | http://candle.pha.pa.us | [EMAIL PROTECTED] | (610) 359-1001
+ If your life is a hard drive, Christ can be your backup. + 13 Roberts Road, Newtown Square, Pennsylvania 19073
Re: [HACKERS] [pgsql-performance] [GENERAL] Large databases, performance
On 7 Oct 2002 at 11:21, Tom Lane wrote:
> Shridhar Daithankar [EMAIL PROTECTED] writes:
> > I say if it's a char field, there should be no length indicator, as it's not required. Just store that many characters straight ahead..
> Your assumption fails when considering UNICODE or other multibyte character encodings.

Correct, but is it possible to have a true fixed-width char string when the database encoding is not Unicode, or when the locale defines the size of a char, to be exact? In my case varchar does not make sense, as all strings are guaranteed to be of the defined length. While the argument you have put forward is correct, it is wasting disk space, so to speak.

Bye, Shridhar

--
Boucher's Observation: He who blows his own horn always plays the music several octaves higher than originally written.
Re: [GENERAL] [HACKERS] Hot Backup
On 7 Oct 2002 at 13:48, Neil Conway wrote:
> Sandeep Chadha [EMAIL PROTECTED] writes:
> > Postgresql has been lacking this all along. I've installed postgres 7.3b2 and still don't see any archives flushed to any other place. Please let me know how the hot backup procedure is implemented in the current 7.3 beta(2) release.
> AFAIK no such hot backup feature has been implemented for 7.3 -- you appear to have been misinformed.

Is replication an answer to hot backup?

Bye, Shridhar

--
ink, n.: A villainous compound of tannogallate of iron, gum-arabic, and water, chiefly used to facilitate the infection of idiocy and promote intellectual crime. -- H.L. Mencken