Re: [PATCHES] [HACKERS] Full page writes improvement, code update
On Tue, 2007-04-03 at 19:45 +0900, Koichi Suzuki wrote: Bruce Momjian wrote: Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches Thank you very much for including it. Attached is an update of the patch according to Simon Riggs's comment about the GUC name.

The patch comes with its own install kit, which is great for review (many thanks), but makes it hard to determine where you think the code should go when committed. My guess based on your patch: the patch itself gets applied to core :-) and pg_compresslog *and* pg_decompresslog go to a contrib directory called contrib/lesslog? Can you please produce a combined patch that does all of the above, plus edits the contrib Makefile to add all of those, as well as editing the README so it doesn't mention the patch, just the contrib executables? The patch looks correct to me now. I haven't tested it yet, but will be doing so in the last week of April, which is when I'll be doing docs for this and other stuff, since time is pressing now. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load distributed checkpoint V3
ITAGAKI Takahiro wrote: Here is the latest version of Load distributed checkpoint patch. Unfortunately because of the recent instrumentation and CheckpointStartLock patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix the bitrot and send an updated patch, please? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] [HACKERS] --enable-xml instead of --with-libxml?
Bruce Momjian wrote: I have applied the following patch which adds documentation and an improved error message for an installation that does not use --with-libxml. and you broke the buildfarm ... you need to fix the regression test. cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Load distributed checkpoint V3
On Thu, 5 Apr 2007, Heikki Linnakangas wrote: Unfortunately because of the recent instrumentation and CheckpointStartLock patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix the bitrot and send an updated patch, please? The "Logging checkpoints and other slowdown causes" patch I submitted touches some of the same code as well, so that's another possible merge coming, depending on what order this all gets committed in. Running into what I dubbed "perpetual checkpoints" was one of the reasons I started logging timing information for the various portions of the checkpoint, to tell when it was bogged down with slow writes versus being held up in sync for various (possibly fixed with your CheckpointStartLock) issues. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] CREATE TABLE LIKE INCLUDING INDEXES support
Gregory Stark wrote: Bruce Momjian [EMAIL PROTECTED] writes: Uh, shouldn't CREATE TABLE LIKE INCLUDING CONSTRAINTS already be including any indexes in the parent table? You could argue it should for unique indexes since our unique indexes are how we implement unique constraints. But I see no particular reason to expect it to copy random other indexes. At least its name doesn't lead one to expect it to. I also thought it was sort of strange to have a command that otherwise is just copying definitions suddenly start building whole new objects. I think I was thinking it would be a long slow operation but I suppose creating an empty index isn't really noticeably slow. It could print a NOTICE similar to what's printed when you create a primary key or unique constraint. It does mean that users would be unable to create a partition, load data, then build indexes. Perhaps it would be nice to have an ALTER TABLE foo LIKE bar INCLUDING CONSTRAINTS as well.

The patch already _only_ does constraint (unique) indexes: "So, that's what this patch does. When a table is created with 'CREATE TABLE ... LIKE parent INCLUDING INDEXES' this iterates over the parent table indexes looking for constraint indexes, and alters the CreateStmtContext to include equivalent indexes on the child table." So I am just suggesting it do that always for INCLUDING CONSTRAINTS, with a notice as you suggest. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Load distributed checkpoint V3
Greg Smith wrote: On Thu, 5 Apr 2007, Heikki Linnakangas wrote: Bgwriter has two goals: 1. keep enough buffers clean that normal backends never need to do a write 2. smooth checkpoints by writing buffers ahead of time Load distributed checkpoints will do 2. in a much better way than the bgwriter_all_* guc options. I think we should remove that aspect of bgwriter in favor of this patch. ...

Let me suggest a different way of looking at this problem. At any moment, some percentage of your buffer pool is dirty. Whether it's 0% or 100% dramatically changes what the background writer should be doing. Whether most of the data is usage_count > 0 or not also makes a difference. None of the current code has any idea what type of buffer pool it's working with, and therefore doesn't have enough information to make a well-informed prediction about what is going to happen in the near future. The purpose of the bgwriter_all_* settings is to shorten the duration of the eventual checkpoint. The reason to shorten the checkpoint duration is to limit the damage it causes to other I/O activity.

My thinking is that, assuming the LDC patch is effective (agreed, needs more testing) at smoothing the checkpoint, the duration doesn't matter anymore. Do you want to argue there are other reasons to shorten the checkpoint duration?

I'll tell you what I did to the all-scan. I ran a few hundred hours' worth of background writer tests to collect data on what it does wrong, then wrote a prototype automatic background writer that resets the all-scan parameters based on what I found. It keeps a running estimate of how dirty the pool at large is, using a weighted average of the most recent scan with the past history. From there, I have a simple model that predicts how much of the buffer pool we can scan in any interval, and intends to enforce a maximum bound on the amount of physical I/O you're willing to stream out.
The beta code is sitting at http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want to see what I've done so far. The parts that are done work fine--as long as you give it a reasonable % to scan by default, it will correct all_max_pages and the interval in real-time to meet the scan rate you requested given how much is currently dirty; the I/O rate is computed but doesn't limit properly yet. Nice. Enforcing a max bound on the I/O seems reasonable, if we accept that shortening the checkpoint is a goal.

Why haven't I brought this all up yet? Two reasons. The first is that it doesn't work on my system; checkpoints and overall throughput get worse when you try to shorten them by running the background writer at optimal aggressiveness. Under really heavy load, the writes slow down as all the disk caches fill, the background writer fights with reads on the data that isn't in the mostly dirty cache (introducing massive seek delays), it stops cleaning effectively, and it's better for it to not even try. My next generation of code was going to start with the LRU flush and then only move onto the all-scan if there's time left over. The second is that I just started to get useful results here in the last few weeks, and I assumed it was too big a topic to start suggesting major redesigns to the background writer mechanism at this point (from me at least!). I was waiting for 8.3 to freeze before even trying. If you want to push through a redesign there, maybe you can get away with it at this late moment. But I ask that you please don't remove anything from the current design until you have significant test results to back up that change.

Point taken. I need to start testing the LDC patch. Since we're discussing this, let me tell you what I've been thinking about the LRU-cleaning behavior of bgwriter. ISTM that that's more straightforward to tune automatically.
Bgwriter basically needs to ensure that the next X buffers with usage_count=0 in the clock sweep are clean. X is the predicted number of buffers backends will evict until the next bgwriter round. The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of, just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up. We can use that as an estimate of X with some safety margin. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Fix mdsync never-ending loop problem
Heikki Linnakangas wrote: Here's a fix for the problem that on a busy system, mdsync never finishes. See the original problem description on hackers: http://archives.postgresql.org/pgsql-hackers/2007-04/msg00259.php The solution is taken from ITAGAKI Takahiro's Load Distributed Checkpoint patch. At the beginning of mdsync, the pendingOpsTable is copied to a linked list, and that list is then processed until it's empty.

Here's an updated patch; the one I sent earlier is broken, since I ignored the return value of list_delete_cell. We could just review and apply ITAGAKI's patch as it is instead of this snippet of it, but because that can take some time I'd like to see this applied before that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.127
diff -c -r1.127 md.c
*** src/backend/storage/smgr/md.c	17 Jan 2007 16:25:01 -0000	1.127
--- src/backend/storage/smgr/md.c	5 Apr 2007 16:09:31 -0000
***************
*** 863,989 ****
  void
  mdsync(void)
  {
! 	bool		need_retry;

  	if (!pendingOpsTable)
  		elog(ERROR, "cannot sync without a pendingOpsTable");

  	/*
! 	 * The fsync table could contain requests to fsync relations that have
! 	 * been deleted (unlinked) by the time we get to them.  Rather than
! 	 * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
! 	 * what we will do is retry the whole process after absorbing fsync
! 	 * request messages again.  Since mdunlink() queues a "revoke" message
! 	 * before actually unlinking, the fsync request is guaranteed to be gone
! 	 * the second time if it really was this case.  DROP DATABASE likewise
! 	 * has to tell us to forget fsync requests before it starts deletions.
  	 */
! 	do
  	{
! 		HASH_SEQ_STATUS hstat;
! 		PendingOperationEntry *entry;
! 		int			absorb_counter;

! 		need_retry = false;

  		/*
! 		 * If we are in the bgwriter, the sync had better include all fsync
! 		 * requests that were queued by backends before the checkpoint REDO
! 		 * point was determined.  We go that a little better by accepting all
! 		 * requests queued up to the point where we start fsync'ing.
  		 */
  		AbsorbFsyncRequests();

! 		absorb_counter = FSYNCS_PER_ABSORB;
! 		hash_seq_init(&hstat, pendingOpsTable);
! 		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  		{
! 			/*
! 			 * If fsync is off then we don't have to bother opening the file
! 			 * at all.  (We delay checking until this point so that changing
! 			 * fsync on the fly behaves sensibly.)
! 			 */
! 			if (enableFsync)
! 			{
! 				SMgrRelation reln;
! 				MdfdVec    *seg;

! 				/*
! 				 * If in bgwriter, we want to absorb pending requests every so
! 				 * often to prevent overflow of the fsync request queue.  This
! 				 * could result in deleting the current entry out from under
! 				 * our hashtable scan, so the procedure is to fall out of the
! 				 * scan and start over from the top of the function.
! 				 */
! 				if (--absorb_counter <= 0)
! 				{
! 					need_retry = true;
! 					break;
! 				}

! 				/*
! 				 * Find or create an smgr hash entry for this relation.  This
! 				 * may seem a bit unclean -- md calling smgr?  But it's really
! 				 * the best solution.  It ensures that the open file reference
! 				 * isn't permanently leaked if we get an error here.  (You may
! 				 * say "but an unreferenced SMgrRelation is still a leak!"  Not
! 				 * really, because the only case in which a checkpoint is done
! 				 * by a process that isn't about to shut down is in the
! 				 * bgwriter, and it will periodically do smgrcloseall().  This
! 				 * fact justifies our not closing the reln in the success path
! 				 * either, which is a good thing since in non-bgwriter cases
! 				 * we couldn't safely do that.)  Furthermore, in many cases
! 				 * the relation will have been dirtied through this same smgr
! 				 * relation, and so we can save a file open/close cycle.
! 				 */
! 				reln = smgropen(entry->tag.rnode);

! 				/*
! 				 * It is possible that the relation has been dropped or
! 				 * truncated since the fsync request was entered.  Therefore,
! 				 * allow ENOENT, but only if we didn't fail once already on
! 				 * this file.  This applies both during _mdfd_getseg() and
! 				 * during FileSync, since fd.c might have closed the file
! 				 * behind our back.
! 				 */
! 				seg = _mdfd_getseg(reln,
! 								   entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
! 								   false, EXTENSION_RETURN_NULL);
! 				if (seg == NULL ||
! 					FileSync(seg->mdfd_vfd) < 0)
! 				{
! 					/*
! 					 * XXX is there any point in allowing more than one try?
! 					 * Don't see one at the moment, but easy to change the
! 					 * test here if so.
! 					 */
! 					if (!FILE_POSSIBLY_DELETED(errno) ||
! 						++(entry->failures)
Re: [PATCHES] Fix mdsync never-ending loop problem
Alvaro Herrera [EMAIL PROTECTED] writes: I wonder if it wouldn't be better to clean this up by creating a separate typedef for segment numbers, with its own special values? Probably. I remember having thought about it when I put in the FORGET_DATABASE_FSYNC hack. I think I didn't do it because I needed to backpatch and so I wanted a minimal-size patch. Feel free to do it in HEAD ... regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] Load distributed checkpoint V3
Heikki Linnakangas [EMAIL PROTECTED] writes: The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of, just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up. We can use that as an estimate of X with some safety margin. You'd want some kind of moving-average smoothing in there, probably with a lot shorter ramp-up than ramp-down time constant, but this seems reasonable enough to try. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Fix mdsync never-ending loop problem
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Here's a fix for the problem that on a busy system, mdsync never finishes. See the original problem description on hackers: This leaks memory, no? (list_delete_cell only deletes the ListCell.) Oh, I just spotted another problem with it and posted an updated patch, but I missed that. But I dislike copying the table entries anyway, see comment on -hackers. Frankly the cycle id idea sounds more ugly and fragile to me. You'll need to do multiple scans of the hash table that way, starting from the top every time you call AbsorbFsyncRequests (like we do now). But whatever... BTW, it's very hard to see what a patch like this is actually changing. It might be better to submit a version that doesn't reindent the chunks of code you aren't changing, so as to reduce the visual size of the diff. A note to the committer to reindent the whole function is sufficient (or if he forgets, pg_indent will fix it eventually). Ok, will do that. Or would you like to just take over from here? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Fix mdsync never-ending loop problem
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: But I dislike copying the table entries anyway, see comment on -hackers. Frankly the cycle id idea sounds more ugly and fragile to me. You'll need to do multiple scans of the hash table that way, starting from the top every time you call AbsorbFsyncRequests (like we do now). How so? You just ignore entries whose cycleid is too large. You'd have to be careful about wraparound in the comparisons, but that's not hard to deal with. Also, AFAICS you still have the retry problem (and an even bigger memory leak problem) with this coding --- the to-do list doesn't eliminate the issue of correct handling of a failure. Ok, will do that. Or would you like to just take over from here? No, I'm up to my ears in varlena. You're the one in a position to test this, anyway. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Fix mdsync never-ending loop problem
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: But I dislike copying the table entries anyway, see comment on -hackers. Frankly the cycle id idea sounds more ugly and fragile to me. You'll need to do multiple scans of the hash table that way, starting from the top every time you call AbsorbFsyncRequests (like we do now). How so? You just ignore entries whose cycleid is too large. You'd have to be careful about wraparound in the comparisons, but that's not hard to deal with. Also, AFAICS you still have the retry problem (and an even bigger memory leak problem) with this coding --- the to-do list doesn't eliminate the issue of correct handling of a failure.

You have to start the hash_seq_search from scratch after each call to AbsorbFsyncRequests because it can remove entries, including the one the scan is stopped on. I think the failure handling is correct in the to-do list approach: when an entry is read from the list, it's checked that the entry hasn't been removed from the hash table. Actually there was a bug in the original LDC patch in the failure handling: it replaced the per-entry failures counter with a local retry_counter variable, but it wasn't cleared after a successful write, which would lead to bogus ERRORs when multiple relations are dropped during the mdsync. I kept the original per-entry counter, though the local variable approach could be made to work. The memory leak obviously needs to be fixed, but that's just a matter of adding a pfree. Ok, will do that. Or would you like to just take over from here? No, I'm up to my ears in varlena. You're the one in a position to test this, anyway. Ok. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Fix mdsync never-ending loop problem
Heikki Linnakangas [EMAIL PROTECTED] writes: I think the failure handling is correct in the to-do list approach, when an entry is read from the list, it's checked that the entry hasn't been removed from the hash table. Actually there was a bug in the original LDC patch in the failure handling: it replaced the per-entry failures-counter with a local retry_counter variable, but it wasn't cleared after a successful write which would lead to bogus ERRORs when multiple relations are dropped during the mdsync. I kept the original per-entry counter, though the local variable approach could be made to work. Yeah. One of the things that bothered me about the patch was that it would be easy to mess up by updating state in the copied entry instead of the real info in the hashtable. It would be clearer what's happening if the to-do list contains only the lookup keys and not the whole struct. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Fix mdsync never-ending loop problem
On Thu, 2007-04-05 at 17:14 +0100, Heikki Linnakangas wrote: We could just review and apply ITAGAKI's patch as it is instead of this snippet of it, but because that can take some time I'd like to see this applied before that. I think we are just beginning to understand the quality of Itagaki's thinking. We should give him a chance to interact on this, and if there are parts of his patch that we want, then it should be him that does it. I'm not sure that carving the good bits off each other's patches is likely to help teamwork in the long term. At the very least he deserves much credit for his farsighted work. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Fix mdsync never-ending loop problem
Simon Riggs wrote: On Thu, 2007-04-05 at 17:14 +0100, Heikki Linnakangas wrote: We could just review and apply ITAGAKI's patch as it is instead of this snippet of it, but because that can take some time I'd like to see this applied before that. I think we are just beginning to understand the quality of Itagaki's thinking. We should give him a chance to interact on this and if there are parts of his patch that we want, then it should be him that does it. Itagaki, would you like to take a stab at this? I'm not sure that carving the good bits off each others patches is likely to help teamwork in the long term. At very least he deserves much credit for his farsighted work. Oh sure! Thank you for your efforts, Itagaki! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load distributed checkpoint V3
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of, just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up. We can use that as an estimate of X with some safety margin. You'd want some kind of moving-average smoothing in there, probably with a lot shorter ramp-up than ramp-down time constant, but this seems reasonable enough to try. Ironically, I just noticed that we already have a patch in the patch queue that implements exactly that, again by Itagaki. I need to start paying more attention :-). Keep up the good work! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] autovacuum multiworkers, patch 5
Another problem seems to be that I'm not checking anywhere that a regular connection (not autovac) is not using an autovac-reserved PGPROC slot :-( I think I should tweak the logic that deals with ReservedBackends but it doesn't look entirely trivial. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[PATCHES] Reviewers Guide to Deferred Transactions/Transaction Guarantee
transaction_guarantee.v11.patch - keep current, cleanup, more comments and docs

Brief Performance Analysis
--
I've tested 3 scenarios:
1. normal
2. wal_writer_delay = 100ms
3. wal_writer_delay = 100ms and transaction_guarantee = off

On my laptop, with a scale=1 pgbench database with 1 connection, I consistently get around 85 tps in mode (1), with a slight performance drop in mode (2). In mode (3) I get anywhere from 200 tps to 900 tps, depending upon how well cached everything is, with 700 tps being fairly typical. fsync = off gives around 900 tps. Also good speedups with multiple-session tests. make installcheck passes in 120 sec in mode (3), versus 155 sec in mode (1) and 158 sec in mode (2).

Basic Implementation
--
xact.c xact.h
The basic implementation simply records the LSN of the xlog commit record in a shared memory area, the deferred fsync cache.
ipci.c
The cache is protected by an LWLock called DeferredFsyncLock.
lwlock.h
A WALWriter process wakes up regularly to perform a background flush of WAL up to the point of the highest LSN in the deferred fsync cache.
walwriter.c walwriter.h postmaster.c
WALWriter can be enabled only at server start.
(All above same as March 11 version)

Correctness
---
postgres.c
Only certain code paths can execute transaction_guarantee = off transactions, though the main code paths for OLTP allow it.
xlog.c
CreateCheckpoint() must protect against starting a checkpoint when commits are not yet flushed, so an additional flush must occur here.
vacuum.c
VACUUM FULL cannot move tuples until their states are all known, so this command triggers a background flush also.
clog.c clog.h slru.c slru.h
Changes to Clog and SLRU enforce the basic rule of WAL-before-data, which otherwise might allow the record of a commit to reach disk before the flush of the WAL. This is implemented by storing an LSN for each clog page.
transam.c transam.h twophase.c xact.c
The above files have API changes that allow the LSN at transaction commit to be passed through to the Clog.
tqual.c tqual.h multixact.c multixact.h
Visibility hint bits must also not be set before the transaction is flushed, so other changes are required to ensure we store the LSN of each transaction, not just the maximum LSN. Changes to tqual.c appear extensive, though this is just refactoring to allow us to make additional function calls before setting bits - there are no functional changes to any HeapTupleSatisfies... functions.
xact.c
Contains the module for the Deferred Transaction functions and in particular the deferred transaction cache. This could be a separate module, since there is only a slight link with the other xact.c code.

User Interface
--
guc.c postgresql.conf.sample guc_table.h
New parameters have been added, with a new parameter grouping of WAL_COMMITS created to control the various commit parameters.

Performance Tuning
--
The WALWriter wakes up every wal_writer_delay milliseconds. There are two protections against mis-setting this parameter.
pmsignal.h
The WALWriter will also be woken by a signal if the DF cache has nearly filled and flushing would be desirable. The WALWriter will also loop without any delay if the number of transactions committed while it was writing WAL is above a threshold value.

Docs
The fsync parameter has been removed from postgresql.conf.sample and the docs, though it still exists in this patch to allow performance testing during Beta. It is suggested that fsync=off should mean the same thing as transaction_guarantee = off, wal_writer_delay = 100ms, if it is specified in postgresql.conf or on the server command line. A new section in wal.sgml will describe this in more detail, later.

Open Questions
--
1. Should the DFC use a standard hash table? Custom code allows both additional speed and the ability to signal when it fills.
2.
Should tqual.c update the LSN of a heap page with the LSN of the transaction commit that it can read from the DF cache?
3. Should the WALWriter also do the wal_buffers half-full write at the start of XLogInsert()?
4. The recent changes to remove CheckpointStartLock haven't changed the code path for deferred transactions, so a similar solution might be possible there also.
5. Is it correct to do WAL-before-flush for clog only, or should this be multixact also?

All of the above are fairly minor changes. Any other thoughts/comments/tests welcome. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com transaction_guarantee.v11.patch.gz Description: GNU Zip compressed data ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] pgbench transaction timestamps
Tatsuo, would you please comment on this patch too? No problem. I will come up with a comment by the end of this week. Here are comments. 1) The latency log file format extension looks useful (-x option). 2) It seems the cleanup feature (-X option) was withdrawn by the author, but the patches still include the feature, so I'm confused. -- Tatsuo Ishii SRA OSS, Inc. Japan --- Greg Smith wrote: This patch changes the way pgbench outputs its latency log files so that every transaction gets a timestamp and notes which transaction type was executed. It's a one-line change that just dumps some additional information that was already sitting in that area of code. I also made a couple of documentation corrections and clarifications on some of the more confusing features of pgbench. It's straightforward to parse log files in this format to analyze what happened during the test at a higher level than was possible with the original format. You can find some rough sample code to convert this latency format into CSV files and then into graphs at http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm which I'll be expanding on once I get all my little patches sent in here. If you recall the earlier version of this patch I submitted, it added a cleanup feature that did a vacuum and checkpoint after the test was finished and reported two TPS results. The idea was to quantify how much of a hit the eventual table maintenance required to clean up after the test would take. While those things do influence results and cause some of the run-to-run variation in TPS (checkpoints are particularly visible in the graphs), after further testing I concluded running a VACUUM VERBOSE and CHECKPOINT in a script afterwards and analyzing the results was more useful than integrating something into pgbench itself. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD Content-Description: [ Attachment, skipping... 
] ---(end of broadcast)--- TIP 6: explain analyze is your friend -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Optimized pgbench for 8.3
Tatsuo, would you please comment on this patch? Sure. I will come up with a comment by the end of this week. The patches look good to me. BTW, is anybody working on enabling fill factor for the tables used by pgbench? 8.3 will introduce HOT, and I think adding the feature will make it easier to test HOT. -- Tatsuo Ishii SRA OSS, Inc. Japan --- ITAGAKI Takahiro wrote: The attached is a patch to optimize contrib/pgbench using new 8.3 features.
- Use DROP IF EXISTS to suppress errors for initial loadings.
- Use a combination of TRUNCATE and COPY to reduce WAL on creating the accounts table.
Also, there are some cosmetic changes.
- Change the output of the -v option from "starting full vacuum..." to "starting vacuum accounts..." to reflect what it actually does.
- Shape duplicated error checks into executeStatement().
There is a big performance win from the COPY-with-no-WAL feature. Thanks for the efforts! Regards, --- ITAGAKI Takahiro NTT Open Source Software Center [ Attachment, skipping... ] ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load distributed checkpoint V3
Hello, long time no see. I'm sorry to interrupt your discussion. I'm afraid the code is getting more complicated in order to keep using fsync(). Though I don't intend to say the current approach is wrong, could anyone evaluate again the O_SYNC approach that commercial databases use, and tell me if and why PostgreSQL's fsync() approach is better than theirs? This January, I got a good result with O_SYNC, which I haven't reported here. I'll show it briefly. Please forgive me for my abrupt email, because I don't have enough time. # Personally, I want to work in the community, if I'm allowed. And sorry again. I reported that O_SYNC resulted in very bad performance last year, but that was wrong. The PC server I borrowed was configured so that all the disks formed one RAID5 device. So the disks for data and WAL (/dev/sdd and /dev/sde) came from the same RAID5 device, resulting in I/O contention. The only file I modified was md.c. I just added O_SYNC to the open flags in mdopen() and _mdfd_openseg() when am_bgwriter is true. I didn't want backends to use O_SYNC because mdextend() does not have to transfer data to disk. My evaluation environment was: CPU: Intel Xeon 3.2GHz * 2 (HT on) Memory: 4GB Disk: Ultra320 SCSI (perhaps configured as write back) OS: RHEL3.0 Update 6 Kernel: 2.4.21-37.ELsmp PostgreSQL: 8.2.1 The relevant settings of PostgreSQL are: shared_buffers = 2GB wal_buffers = 1MB wal_sync_method = open_sync The checkpoint_* and bgwriter_* parameters are left at their defaults. I used pgbench with data at scaling factor 50. 
[without O_SYNC, original behavior]
- pgbench -c1 -t16000: best response 1ms, worst 6314ms, 10th worst 427ms, tps 318
- pgbench -c32 -t500: best response 1ms, worst 8690ms, 10th worst 8668ms, tps 330
[with O_SYNC]
- pgbench -c1 -t16000: best response 1ms, worst 350ms, 10th worst 91ms, tps 427
- pgbench -c32 -t500: best response 1ms, worst 496ms, 10th worst 435ms, tps 1117
If the write-back cache were disabled, the difference would be smaller. The Windows version showed similar improvements. However, this approach has two big problems.
(1) Slower bulk updates. Updates of large amounts of data get much slower because the bgwriter seeks and writes dirty buffers synchronously, page by page. For example:
- COPY of accounts (5m records) plus a CHECKPOINT command after the COPY: without O_SYNC: 100sec, with O_SYNC: 1046sec
- UPDATE of all records of accounts: without O_SYNC: 139sec, with O_SYNC: 639sec
- CHECKPOINT command flushing 1.6GB of dirty buffers: without O_SYNC: 24sec, with O_SYNC: 126sec
To mitigate this problem, I sorted the dirty buffers by their relfilenode and block numbers and wrote multiple pages that are adjacent both in memory and on disk. The result was:
- COPY of accounts (5m records) plus a CHECKPOINT command after the COPY: 227sec
- UPDATE of all records of accounts: 569sec
- CHECKPOINT command flushing 1.6GB of dirty buffers: 71sec
Still bad...
(2) Can't utilize tablespaces. Though I didn't evaluate it, update activity would be much less efficient with O_SYNC than with fsync() when using multiple tablespaces, because there is only one bgwriter. Can anyone solve these problems? One of my ideas is to use scattered I/O. I hear that readv()/writev() can do real scattered I/O as of kernel 2.6 (RHEL4.0). With kernels before 2.6, readv()/writev() just performed the I/Os sequentially. Windows has provided reliable scattered I/O for years. 
Another idea is to use async I/O, possibly combined with a multiple-bgwriter approach on platforms where async I/O is not available. How about the opportunity Josh-san has brought? ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi, I agree with putting the patch in core and the others (pg_compresslog and pg_decompresslog) in contrib/lesslog. I will prepare separate materials to go to core and to contrib. As for the patches, we have tested against pgbench, DBT-2 and our proprietary benchmarks, and they appear to run correctly. Regards; -- Koichi Suzuki ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] Packed Varlena Update (v21)
stark [EMAIL PROTECTED] writes: [ packed varlena patch ] Applied with revisions. Waiting to see how badly the buildfarm complains ;-) ... but I did test on little- and big-endian machines with different MAXALIGN values. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] Packed Varlena Update (v21)
I wrote: stark [EMAIL PROTECTED] writes: [ packed varlena patch ] Applied with revisions. Forgot to mention: one of the revisions was to not add the sizes.sql test, because the output was platform-dependent and is likely to get more so if any ability to change the toast thresholds gets put in. Can we improve that test to not expose any platform-dependent numbers? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] Load distributed checkpoint V3
On Thu, 5 Apr 2007, Heikki Linnakangas wrote: The purpose of the bgwriter_all_* settings is to shorten the duration of the eventual checkpoint. The reason to shorten the checkpoint duration is to limit the damage it causes to other I/O activity. My thinking is that assuming the LDC patch is effective (agreed, needs more testing) at smoothing the checkpoint, the duration doesn't matter anymore. Do you want to argue there are other reasons to shorten the checkpoint duration? My testing results suggest that LDC doesn't smooth the checkpoint usefully under a high load (30 clients here), because (on Linux at least) the way the OS caches writes clashes badly with how buffers end up being evicted if the buffer pool fills back up before the checkpoint is done. In that context, anything that prolongs the checkpoint is going to make the problem worse rather than better, because it makes it more likely that the tail end of the checkpoint will have to fight with the clients for write bandwidth, at which point they both suffer. If you just get the checkpoint done fast, the clients can't fill the pool as fast as BufferSync is writing it out, and things are as happy as they can be without a major rewrite of all this code. I can get a tiny improvement in some respects by delaying 2-5 seconds between finishing the writes and calling fsync, because that gives Linux a moment to usefully spool some of the data to the disk controller's cache; beyond that, any additional delay is a problem. Since it's only the high-load cases I'm having trouble with, this basically makes it a non-starter for me. The patch's focus on checkpoint_timeout while ignoring checkpoint_segments is also a big issue for me. At the same time, I recognize that the approach taken in LDC probably is a big improvement for many systems; it's just a step backwards for my highest-throughput one. I'd really enjoy hearing some results from someone else. 
The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of: just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up. I see you've already found the other helpful Itagaki patch in this area. I know I would like to see his code for tracking evictions committed, and then I'd like that added as another counter in pg_stat_bgwriter (I mentioned that to Magnus in passing when he was setting the stats up, but didn't press it because of the patch dependency). Ideally, and this idea was also in Itagaki's patch with the writtenByBgWriter/ByBackEnds debug hook, I think it's important that you know how every buffer written to disk got there--was it a background writer, a checkpoint, or an eviction that wrote it out? Track all of those and you can really learn something about your write performance, data that's impossible to collect right now. However, as Itagaki himself points out, doing something useful with bgwriter_lru_maxpages is only one piece of automatically tuning the background writer. I hate to join in on chopping his patches up, but without some additional work I don't think the exact auto-tuning logic he then applies will work in all cases, which could make it more of a problem than the current crude yet predictable method. The whole way bgwriter_lru_maxpages and num_to_clean play off each other in his current code has a number of failure modes I'm concerned about. I'm not sure whether a rewrite using a moving-average approach (as I did in my auto-tuning writer prototype, and as Tom just suggested here) will be sufficient to fix all of them; investigating that further was already on my to-do list. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate