Re: [HACKERS] Brain dump: btree collapsing
Tom Lane wrote: Hm. A single lock that must be grabbed for operations anywhere in the index is a concurrency bottleneck.

I replied a bit later: I don't think it would be that bad. It's not a lock but a mutex that would be held only for a relatively brief duration, tens to a few hundred instructions at most, and only during scans that cross page boundaries. Unless you were talking about a 16- to 64-way SMP box, I don't think there would be much actual contention, since two processes would have to contend for the same index at the same time, and, depending on the number of keys per page, very little of the run time is spent on the lock transition for traversal across pages compared to all the other work going on.

Tom also presciently wrote: But maybe we could avoid that

After sleeping on it a bit, it occurs to me that you are again correct; the mutex is not required to do what I suggested. It is simply a matter of ensuring that, during the page traversal, the lock is either obtained or the process is added to the waiters queue before the current page lock is released. This will ensure that any process trying to obtain an exclusive or super-exclusive lock will either wait or fail an immediate lock acquisition.

Traversal pseudocode:

    if the next page lock can be obtained immediately
        grab it
        release the current page lock
    else (we would have to wait)
        add the current process to the waiters queue
        release the current page lock
        sleep on the lock
    end if

Like the current release-before-next-page-acquisition method, this won't cause deadlocks with other scanners going in the opposite direction or with concurrently inserting processes, since the current page lock is released before sleeping.

If this change is made, then my original suggestion, that the merge process attempt a non-deadlocking acquisition of all the locks, dropping all of them and retrying if any of the attempts would have caused a wait, should work fine. If the traversal code is changed as above, a merging VACUUM process is guaranteed to run into the lock held by the scanning process on either the current page or the next page.

This has the potential side benefit of permitting a simplification of the scan code in the case where the lock is obtained immediately. Since the scanning process would know that no inserting process has added pages to the tree because of a split, it wouldn't have to check for newly inserted pages to the right of the next page. I don't know how much work is involved in this check, but if it is significant it could be eliminated in this case, which is the most likely case in many scenarios.

What do you think?

- Curtis
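A minimal C-style sketch of the traversal rule above. Everything here is illustrative: TryLockPage, EnqueueWaiter, SleepOnLock, and UnlockPage are hypothetical stand-ins, not actual PostgreSQL calls, and the real code would work in terms of buffers and LWLocks:

    /* Hypothetical sketch: be holding the next page's lock, or be
     * visibly queued for it, before the current page's lock goes away,
     * so a merger always collides with one of the two locks. */
    typedef struct Page Page;
    extern int  TryLockPage(Page *next, int mode);    /* non-blocking */
    extern void EnqueueWaiter(Page *next, int proc);
    extern void SleepOnLock(Page *next, int mode);    /* wakes holding lock */
    extern void UnlockPage(Page *cur);

    static void
    step_right(Page *cur, Page *next, int my_proc, int mode)
    {
        if (TryLockPage(next, mode))
            UnlockPage(cur);              /* fast path: plain handoff */
        else
        {
            EnqueueWaiter(next, my_proc); /* become visible to mergers first */
            UnlockPage(cur);              /* then let go of the current page */
            SleepOnLock(next, mode);      /* no deadlock: cur already released */
        }
    }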
Re: [HACKERS] Brain dump: btree collapsing
Tom Lane wrote: How does that help? The left-moving indexscan still has no way to recover. It can't go back to the page it was on before and try to determine which entries you added there, because it has no reliable reference point to do so. The entry it previously returned might not be there anymore, and in a non-unique index key comparisons won't help. And even more to the point, how would it know you've changed the left page? It has no idea what the previous page version on the left page was, because it was never there before.

Another way this could be handled is by merging not onto one of the existing pages but onto a brand-new page, a kind of special-case shadow index page. That way the sibling pointers and the leaf-page pointer in the parent could all be updated atomically to point to the new page. In-process index scans would still reference the merged pages, which would not be deleted but marked as dead, using a mechanism like the one you proposed for marking empty pages dead with the next-transaction-ID counter.

Merging could be done after a VACUUM pass that deletes empty pages, in order to provide a pool of empty pages to use for the merge. This would keep the index from temporarily growing during the merge process.

A similar approach could be used to reorder the index pages in the background. An index reordered to fairly closely reflect a breadth-first traversal of the tree would provide much faster index scans, provided the file is not heavily fragmented.

- Curtis
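A sketch of the shadow-page merge described above, in illustrative C only. Real btree pages link siblings by block number rather than by pointer, the parent downlink update is glossed over, and the dead flag stands in for the proposed next-transaction-ID marking:

    typedef struct LeafPage
    {
        struct LeafPage *left;      /* left sibling link */
        struct LeafPage *right;     /* right sibling link */
        int              dead;      /* proposal: stamped with next XID */
        /* ... items ... */
    } LeafPage;

    /* Merge l and r onto a fresh shadow page; the old pages stay
     * readable for in-flight scans and are reclaimed only once all
     * transactions older than the death stamp have finished. */
    static LeafPage *
    merge_onto_shadow(LeafPage *l, LeafPage *r, LeafPage *shadow)
    {
        /* copy the live items of l, then r, into shadow (omitted) */
        shadow->left = l->left;
        shadow->right = r->right;
        if (shadow->left)
            shadow->left->right = shadow;   /* swing sibling links ... */
        if (shadow->right)
            shadow->right->left = shadow;
        /* ... and repoint the parent's downlink from l/r to shadow */
        l->dead = 1;
        r->dead = 1;
        return shadow;
    }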
Re: [HACKERS] Brain dump: btree collapsing
Tom Lane wrote: Hmmm ... that might be made to work, but it would complicate inserts. By the time an insert navigates to the page it should insert on, it might find the page is dead, and then it would have no easy way to get to the replacement page (unless you want to dedicate another link field in every index page for that one use). I suppose it could recover by starting the search over again.

Inserts could reread just the parent page if they encountered a dead leaf, since the parent would have been correctly updated.

Another problem is that the two dead pages are now outside of the btree structure and so their left and right links won't get updated in the face of subsequent splits and merges of the nearby pages.

That seems like a showstopper that just defers the problem. A split of the left sibling would still screw up a scan that was using the old left leaf page and wanted to move left. Oh well, the idea seemed worth exploring.

- Curtis
Re: [HACKERS] Brain dump: btree collapsing
I previously wrote:

5) A mutex/spinlock stored in the index could be acquired by the scan code like this:

    buf = ReadBuffer(rel, blkno);            /* pin next page */
    SpinLockAcquire(indexSpecificMutex);     /* lock the index reorg mutex */
    LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
    LockBuffer(buf, BT_READ);                /* lock next page */
    SpinLockRelease(indexSpecificMutex);     /* unlock the index reorg mutex */
    ReleaseBuffer(*bufP);                    /* now release pin on previously current page */
    *bufP = buf;                             /* next page becomes the current page */

6) The same index-specific mutex/spinlock could be used for the merge code, surrounding only the acquisition of the four page locks. This would obviate any problems with scans and page merges, since the lock acquisition for the merge could never occur while a scan was between pages.

Further, with the reordering, the spinlock for the scan code doesn't seem particularly onerous, since it would surround only two LWLock calls. To reduce the overhead to an absolute minimum for the scan case, these could be pushed down into a new LWLock-level call (probably necessary anyway, since the LockBuffer and ReleaseBuffer code does some error checking and such that one wouldn't want in code guarded by a mutex).

I forgot to mention that the mutex would have to be released in the event the next page lock could not be immediately acquired, just after the addition of the scan process to the lock waiters list, to avoid blocking all scans and probably causing severe deadlock problems.

- Curtis
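Folding in that correction, the guarded handoff might look like the following sketch. LockBufferNoWait and AddToWaitQueue are hypothetical names for a conditional buffer lock (built on something like LWLockConditionalAcquire) and an enqueue-without-sleeping primitive; neither exists in this form:

    buf = ReadBuffer(rel, blkno);              /* pin next page */
    SpinLockAcquire(indexSpecificMutex);       /* guard the cross-page handoff */
    LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);     /* release lock on current page */
    if (LockBufferNoWait(buf, BT_READ))        /* hypothetical conditional lock */
        SpinLockRelease(indexSpecificMutex);   /* fast path: handoff complete */
    else
    {
        AddToWaitQueue(buf, BT_READ);          /* hypothetical: queue first ... */
        SpinLockRelease(indexSpecificMutex);   /* ... drop the mutex before sleeping */
        WaitOnQueuedLock(buf, BT_READ);        /* hypothetical: sleep until granted */
    }
    ReleaseBuffer(*bufP);                      /* drop pin on the old page */
    *bufP = buf;                               /* next page becomes the current page */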
Re: [HACKERS] Brain dump: btree collapsing
Tom Lane wrote: How does that help? The left-moving indexscan still has no way to recover. It can't go back to the page it was on before and try to determine which entries you added there, because it has no reliable reference point to do so. The entry it previously returned might not be there anymore, and in a non-unique index key comparisons won't help. And even more to the point, how would it know you've changed the left page? It has no idea what the previous page version on the left page was, because it was never there before.

Revisiting the idea I proposed in a previous email after more research, I believe it is definitely possible to implement a mutex-based scheme that will prevent scans from being polluted by merges of index pages, and that does not result in the mutex being held for any significant duration.

1) The current scan code does this:

    bool
    _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
    {
        ... definitions go here ...

        if (ScanDirectionIsForward(dir))
        {
            if (!PageIsEmpty(page) && offnum < maxoff)
                offnum = OffsetNumberNext(offnum);
            else
            {
                /* walk right to the next page with data */
                for (;;)
                {
                    /* if we're at end of scan, release the buffer and return */
                    ... skip code here ...

                    /* step right one page */
                    blkno = opaque->btpo_next;
                    _bt_relbuf(rel, *bufP);
                    *bufP = _bt_getbuf(rel, blkno, BT_READ);

                    ... skip rest of code ...

3) Note that the calls

    _bt_relbuf(rel, *bufP);
    *bufP = _bt_getbuf(rel, blkno, BT_READ);

a) appear adjacent to each other;

b) relbuf calls:

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);

c) getbuf, in the case of an existing buffer rather than a new one, only calls:

    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, access);

4) This could easily be reordered into:

    buf = ReadBuffer(rel, blkno);            /* pin next page */
    LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
    LockBuffer(buf, BT_READ);                /* lock next page */
    ReleaseBuffer(*bufP);                    /* now release pin on previously current page */
    *bufP = buf;                             /* next page becomes the current page */

without affecting the logic of the code or causing any deadlock problems, since the release still occurs before the lock of the next page.

5) A mutex/spinlock stored in the index could be acquired by the scan code like this:

    buf = ReadBuffer(rel, blkno);            /* pin next page */
    SpinLockAcquire(indexSpecificMutex);     /* lock the index reorg mutex */
    LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
    LockBuffer(buf, BT_READ);                /* lock next page */
    SpinLockRelease(indexSpecificMutex);     /* unlock the index reorg mutex */
    ReleaseBuffer(*bufP);                    /* now release pin on previously current page */
    *bufP = buf;                             /* next page becomes the current page */

6) The same index-specific mutex/spinlock could be used for the merge code, surrounding only the acquisition of the four page locks. This would obviate any problems with scans and page merges, since the lock acquisition for the merge could never occur while a scan was between pages.

Further, with the reordering, the spinlock for the scan code doesn't seem particularly onerous, since it would surround only two LWLock calls. To reduce the overhead to an absolute minimum for the scan case, these could be pushed down into a new LWLock-level call (probably necessary anyway, since the LockBuffer and ReleaseBuffer code does some error checking and such that one wouldn't want in code guarded by a mutex).

- Curtis
Re: [HACKERS] Brain dump: btree collapsing
Tom Lane initially wrote: Restructuring the tree during page deletion: We will delete only completely-empty pages. If we were to merge nearly-empty pages by moving data items from one page to an adjacent one, this would imply changing the parent's idea of the bounding key between them --- which is okay if we are just deleting an internal key in the parent, but what if the pages have different parent pages?

and a bit later wrote: My feeling is that what we need to fix now is index bloat during normal operation.

How about doing deletion of partial pages with reorganization among sibling pages only (where the parent pages are the same)? This avoids the messiness of propagating the deletes to differing parent pages but gets most of the value of reorganization.

ISTM that a VACUUM that only reclaims empty pages will be helpful in certain cases, but won't help much at all in many other common normal-operation cases that would be helped by partial reorganization.

- Curtis
Re: [HACKERS] Brain dump: btree collapsing
Tom Lane wrote: Got any evidence to back that up? I was relying on [Johnson89] (Johnson, T. and Shasha, D., Utilization of B-trees with Inserts, Deletes and Modifies, ACM Symp. on PODS, 235-246, 1989), which provides a ton of math and simulations leading up to the conclusion that collapsing btree pages before they are fully empty doesn't really gain anything meaningful in terms of storage utilization, in most scenarios.

Well, I don't have that specific paper handy, but I do have [JS93] (Theodore Johnson and Dennis Shasha, B-trees with Inserts and Deletes: Why Free-at-Empty Is Better than Merge-at-Half), which appears to be their later thinking on the same subject. Note the following:

A merge-at-half B-tree will always have a space utilization of at least 50%. When all operations are modify operations, or when the number of insert operations is the same as the number of delete operations, then the utilization will be about 60%. In contrast, a free-at-empty B-tree has a 0% lower bound on its space utilization, and will have about 39% utilization on a pure-modify instruction mix. However, the space utilization of a free-at-empty B-tree remains high if there are just a few more insert operations than delete operations. Thus, merge-at-half usually buys little in terms of space utilization. In Figure 6, we showed that the restructuring rate of a merge-at-half B-tree is significantly larger than the restructuring rate of a free-at-empty B-tree for all values of q. For many concurrent B-tree algorithms used in practice [4, 13], restructuring causes a serialization bottleneck. Thus, one simple but important way to increase concurrency in B-tree operations is to reduce the probability of restructuring. Since merge-at-half buys little space utilization but is expensive in terms of restructuring, we recommend that B-trees (especially concurrently accessed ones) use free-at-empty.

I don't dispute their conclusions in that context, under the circumstances they outline of a random distribution of deletion and insertion values for the index keys. However, [Jannink95] (Implementing Deletion in B+-trees, SIGMOD Record, v.24, n.1, p.33-38, 1995) points out that this assumption doesn't hold under other, possibly common, circumstances, specifically circumstances where deletes are taking place in significant sections of the index at a much higher rate than inserts. Consider from [Jannink95]:

There has been some research on the acceptability of relaxing the constraint of minimum node size to reduce the number of so-called unsafe tree operations, i.e., those which contain node splitting and merging [ZH89]. The debate has culminated in analysis of a weaker form of the deletion algorithm which we call lazy deletion, that imposes no constraints on the number of entries left in the nodes, allowing them to empty completely before simply removing them. According to [GR93], most database system implementations of B+-trees have adopted this approach. Its most effective use is when it is vital to allow concurrent access to the tree [JS93b], and excessive splitting and merging of nodes would restrict concurrency. [JS89] derives some analytic solutions calculating memory utilization for B+-trees under a mix of insertions and lazy deletions, based on previous research which considered insertions only [BY89]. The simulations in [JS89] support its analysis to show that in typical situations, where deletions don't outnumber insertions in the mix of operations, the tree nodes will contain acceptable percentages of entries.
One of the work's assumptions [JS93a] is that the keys and tree operations are chosen uniformly from a random distribution. This assumption is unreasonable in certain realistic situations such as one described below. Allowing interior nodes with only a single pointer to exist in a B+-tree creates the possibility for arbitrarily unbalanced trees of any height, which are virtually empty, and in which access times have degenerated from the logarithmic bound B+-trees are meant to guarantee to a worst-case unbounded access time. Since nodes are not removed until they are completely empty, the lazy deletion algorithm does not regulate tree height effectively.

Jannink then illustrates an example where an index is created based on a timestamp, and where the basic assumption of Johnson and Shasha does not hold, since the keys are not drawn from a random distribution but are a monotonically increasing value. His example is an extreme one, but I believe there are many instances where a timestamp, sequence, or some other monotonically increasing value is used in an index and where deletes are taking place much more frequently for largely older values. Since sequences are often used as foreign keys, a significant number of indexes fit into this category.

Consider a web site that tracked users and that deleted inactive accounts. There are many real-world scenarios where the number of inactive accounts is very high as a percentage of
Re: [HACKERS] Brain dump: btree collapsing
Tom Lane wrote: Curtis Faith [EMAIL PROTECTED] writes: I don't dispute their conclusions in that context and under the circumstances they outline of random distribution of deletion and insertion values for the index keys. [But the random-distribution assumption doesn't always hold.] That's a fair point. Nonetheless, we have little choice: we cannot move keys around during concurrent operations. If we push keys to the right, we may cause an indexscan moving left to miss them, and vice versa. So we can only eliminate empty pages. We could possibly allow VACUUM FULL to collapse partly-full pages, since in that case we know there are no concurrent scans.

Couldn't we do an exclusive lock call on both leaf pages and the parent using a new LockBuffer-type function, named something like LockBufferNoWait, that uses LWLockConditionalAcquire instead of LWLockAcquire? In the event that all three exclusive locks cannot be obtained, release all three locks, sleep, and try again, up to N retries. (Actually, this would probably be four locks, since the sibling pointer of one of the siblings would have to be updated to point to the new merged page instead of the to-be-deleted page.)

Having exclusive locks on all the pages prior to a merge would ensure that no scans were interrupted during that merge. Releasing all the exclusive locks in the event of failure to obtain any one of them eliminates the possibility of creating deadlocks. After N retries, VACUUM could abort, leaving the merge to be picked up in a future VACUUM run.

I would think that this would be fairly easy to implement, and that, except for very heavily scanned indexes, most of the time the locks could be acquired and the merge would take place without causing any concurrency problems.

- Curtis
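A sketch of that retry loop, with hedges: LockBufferNoWait is the hypothetical conditional lock proposed above (presumably a thin wrapper over LWLockConditionalAcquire), and the retry count and sleep duration are arbitrary:

    #define MAX_MERGE_RETRIES 5

    /* Try to exclusively lock all pages involved in a merge without
     * ever waiting; back out completely on any conflict so no
     * deadlock is possible.  Returns true with all locks held. */
    static bool
    lock_pages_for_merge(Buffer bufs[], int nbufs)
    {
        int retry, i;

        for (retry = 0; retry < MAX_MERGE_RETRIES; retry++)
        {
            for (i = 0; i < nbufs; i++)
                if (!LockBufferNoWait(bufs[i], BUFFER_LOCK_EXCLUSIVE))
                    break;              /* someone else holds bufs[i] */

            if (i == nbufs)
                return true;            /* got all of them */

            while (--i >= 0)            /* release what we did get */
                LockBuffer(bufs[i], BUFFER_LOCK_UNLOCK);

            usleep(10000);              /* brief sleep before retrying */
        }
        return false;                   /* give up; a later VACUUM retries */
    }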
Re: [mail] Re: [HACKERS] Windows Build System
Christopher Browne wrote: [snip] From the MySQL site's page about MySQL vs. PostgreSQL (http://www.mysql.com/doc/en/MySQL-PostgreSQL_features.html): MySQL Server works better on Windows than PostgreSQL does. MySQL Server runs as a native Windows application (a service on NT/2000/XP), while PostgreSQL is run under the Cygwin emulation. That seems pretty straightforward. But it's not /nearly/ that straightforward. If you look at the downloads that MySQL AB provides, they point you to a link that says Windows binaries use the Cygwin library. Which apparently means that this feature is not actually a feature. Unlike PostgreSQL, which is run under the Cygwin emulation, MySQL runs as a native Windows application (with Cygwin emulation). Apparently those are not at all the same thing, even though they are both using Cygwin...

Justin Clift replied: Hmm... wonder if they're meaning that MySQL compiles and executes as a true native Windows application (skipping any Unix compatibility calls), and it's just some of the support utils that use Cygwin, or if they're trying to say that PostgreSQL has to operate entirely in the Cygwin environment, whereas they don't?

I just downloaded the latest production source (3.23.55) and it appears to me that:

1) It uses Cygwin emulation via a DLL.

2) It uses Visual Studio C++ 6.0 for the primary build environment. It compiles out of the box, without having to learn Unix-style build systems, config, make, etc. No warnings, no errors; it just builds. If I did not have a lot of experience building databases, I certainly would have found their support for Windows compelling. This is a big reason why they are #1.

3) The statement by the MySQL folks above, that MySQL Server runs as a native Windows application (a service on NT/2000/XP), is indicative of why MySQL is kicking PostgreSQL's butt in terms of popularity. It is marketing speak at its best. It is technically true: MySQL runs as a service. As Christopher Browne points out, they still use the Cygwin emulation layer. The statement is misleading, however, as it implies that they don't use any emulation, but they do.

The salient points:

a) Running as a service is important, as this is the way NT/2000 administrators manage server tasks. The fact that PostgreSQL's Cygwin emulation doesn't do this is very indicative of inferior Windows support.

b) MySQL recognizes that the important issue is to appear to be a well-supported Windows application, rather than to actually be one.

c) It is probably much easier to add support for running as an NT service than it is to write a true native port with no Cygwin dependency. NT service support is basically a single function wrapper for certain API calls (startup, shutdown, etc.) that enables the Windows administration tools to deal with all servers in a similar manner.

They have worked on that which makes them look better, makes their prospective customers happier, and makes it easier to support. Exactly what any good product development organization that listens to its customers would have done.

flame on

IMHO, PostgreSQL will never have the same level of use in the field as MySQL currently does as long as there is the kind of head-in-the-sand attitude about Windows that I've seen here on the hackers list, especially as evidenced by the recent outright attacks against those who are simply trying to port PostgreSQL to the largest platform out there today.
There have been some very legitimate points about Windows being a new platform, one that will likely see a lot of users, and therefore one that should be more thoroughly tested before release than the typical port to another flavor of *nix. However, the way the conversation started reminds me of some of the chat discussions I've seen between young teens.

I was a Mac developer way, way back, and long ago realized that the best product often loses, and that better marketing beats better engineering every single time.

\flame off

DISCLAIMER: I hate Microsoft, and Windows drives me nuts.

- Curtis
Re: [mail] Re: [HACKERS] Windows Build System
Tom Lane wrote:

flame on

In all honesty, I do not *want* Windows people to think that they're not running on the poor stepchild platform.

We should distinguish between poor stepchild from a client-support perspective and from a production-environment perspective. What is the downside to supporting development of client products better? That is what I am really suggesting.

Tom also wrote: If people are deciding what open-source database server they want to use, Linux or FreeBSD is the obvious choice for the server OS. The kind of people who are inclined to use PostgreSQL or MySQL will mostly NOT be considering Windows servers. I have no objection to there being a Windows port that people can use to do SQL-client development on their laptops. But let us please not confuse this with an industrial-strength solution; nor give any level of support that might lead others to make such confusion.

All we can do is simply make it clear that Windows is not recommended for production server use and outline all the reasons why Windows sucks for that purpose. Beyond that, if people want to shoot themselves in the head, they will do so, and I don't see much point in trying to stop them.

Tom also wrote: The MySQL guys made the right choice here: they don't want to buy into making Windows a grade-A platform, either.

flame off

flame retardant on

How does providing a native Windows executable that doesn't require Cygwin accomplish your objective? It seems to me that you are going to have that problem with any native release, irrespective of the issue at hand (Visual C++ project support). I don't see how making it easier to build adds to this problem. I also don't see how making it harder for Windows client developers to adopt PostgreSQL helps anyone.

flame retardant off

I hate Microsoft and I don't like Windows, but I am forced to use them because the software we need to run our business runs only on Windows. I use Unix whenever possible and whenever reliability is required.

- Curtis

P.S. The lack of a real C++ client library that supports the most common development environment out there is another problem that seriously impedes Windows client developers. I like libpqxx; Jeroen did a fine job. However, one needs to jump through hoops to get it to run on Visual C++ 6.0 at the moment.
Re: [HACKERS] Windows Build System
Curtis Faith wrote: The Visual C++ workspace and project files are actually text files that have a defined format. I don't think the format is published, but it looks pretty easy to figure out.

Hannu Krosing wrote: will probably change between releases

Even if the format changes, the environment always has a converter that updates the project and workspace files to the new format. In other words, Visual C++ 6.0 reads 5.0 projects, 7.0 reads 6.0, etc. The format is mostly a bunch of options specifications (which wouldn't get touched) followed by a set of named groups of source files. Even if the overall format changes, it will be much more likely to change in the options specifications than in the way the lists of source files are specified.

A conversion tool (call it BuildWindowsProjectFile) would only need to:

1) Read in the template file containing all the options specifications and Visual C++-specific stuff: debug and release target options, libraries to link in, etc. This part might change with new versions of the IDE and would be manually created by someone with Visual C++ experience.

2) Read in the PostgreSQL group/directory map, or alternately just mirror the groups with the directories.

3) Output the files from the PostgreSQL directories, in the appropriate grouping according to the project format, into the appropriate space in the template.

An excerpt of the format follows:

    # Begin Group "Access"
    # Begin Group "Common"
    # PROP Default_Filter "cpp;c;cxx"
    # Begin Source File
    SOURCE=.\access\common\heaptuple.c
    # End Source File
    # Begin Source File
    SOURCE=.\access\common\indextuple.c
    # End Source File
    ... other files in access\common go here ...
    # End Group
    # Begin Group "Index"
    # PROP Default_Filter "cpp;c;cxx"
    # Begin Source File
    SOURCE=.\access\index\genam.c
    # End Source File
    # Begin Source File
    SOURCE=.\access\index\indexam.c
    # End Source File
    ... other files in access\index go here ...
    # End Group
    # End Group

As you can see, this is a really easy format, and the folder/group mapping to the directories is pretty natural and probably the way to go.

Using the approach I outline, it should be possible to have the Unix make system automatically run the BuildWindowsProjectFile tool whenever any makefile changes, so the Windows projects would stay up to date without additional work for Unix developers.

Hannu Krosing also wrote: (also I dont think you can easily compile C source on a C# compiler) ;/

I don't think it makes much sense to target a compiler that won't compile the source; therefore, if what you say is true, we shouldn't bother with targeting C#. Why does this matter? I didn't propose targeting Visual C#.NET.

- Curtis
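A toy sketch of the generator step (3) above: emit the source-file stanzas for one group from a list of paths. A real BuildWindowsProjectFile would read the template and pull the file lists from the makefiles; the quoted group names and filter follow the usual .dsp convention, but the rest is illustrative:

    #include <stdio.h>

    /* Emit one .dsp-style group stanza for a list of source files. */
    static void
    emit_group(FILE *out, const char *group, const char **files, int nfiles)
    {
        int i;

        fprintf(out, "# Begin Group \"%s\"\n", group);
        fprintf(out, "# PROP Default_Filter \"cpp;c;cxx\"\n");
        for (i = 0; i < nfiles; i++)
        {
            fprintf(out, "# Begin Source File\n");
            fprintf(out, "SOURCE=%s\n", files[i]);
            fprintf(out, "# End Source File\n");
        }
        fprintf(out, "# End Group\n");
    }

    int
    main(void)
    {
        const char *common[] = {
            ".\\access\\common\\heaptuple.c",
            ".\\access\\common\\indextuple.c",
        };

        emit_group(stdout, "Common", common, 2);
        return 0;
    }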
Re: Windows Build System was: [HACKERS] Win32 port patches submitted
Tom Lane writes: You think we should drive away our existing unix developers in the mere hope of attracting windows developers? Sorry, it isn't going to happen.

Tom brings up a good point: changes to support Windows should not add to the tasks of those who are doing the bulk of the work on Unixen. I don't think, however, that this necessarily means that having Windows developers use Cygwin is the right solution. We need to come up with a way to support Windows Visual C++ projects without adding work for the other developers. I believe this is possible and have outlined some ways at the end, but first some rationale:

One of the biggest benefits of open-source projects is the ability to get in there and debug/fix problems using the source. PostgreSQL will lose out to lesser DBs if there is no easy way to build and DEBUG the source on Windows. This is true whether one admits that Windows sucks or not.

A developer faced with the decision of choosing:

A) a system that has a native Windows Visual C++ project that runs and compiles the release with no work, or

B) a system that requires learning a new way of building, new tools, new who knows what else

will always choose A unless there is a very compelling reason to choose B. There are plenty of reasons a clever (or even not so clever) Windows developer can use to justify using MySQL or another open-source DB instead of PostgreSQL. It is a bit harder for the neophyte to find and believe the compelling reasons to use PostgreSQL. We need to make it easier to choose PostgreSQL, not harder.

Think about it from this perspective: how many of you would even think about working on a project if it required that you stop using your favorite toolset? gmake? EMACS? grep? shell scripts? Professional developers spend years honing their skills, and learning the proper use of the tools involved is a very big part of the process.

IMHO, having a native port without native (read: Visual C++) project support is a huge missed opportunity. Further, the lack of Windows project files implies that PostgreSQL is just a Unix port and that the Windows support is immature, whether the code is well supported or not.

POSSIBLE SOLUTIONS:

The Visual C++ workspace and project files are actually text files that have a defined format. I don't think the format is published, but it looks pretty easy to figure out. The Visual C++ environment does not require dependency specification; it builds dependency trees by keeping track of the #include files used during preprocessing. Because of this, it should be possible to:

A) write a script/tool that reads the input files from the Unix makefiles to build a list of the files in PostgreSQL and place them in appropriate projects, or alternately

B) write a script/tool that recurses the directories and does the same sort of thing, with some sort of mapping between directories and projects in Visual C++.

In short, for most organizations, being able to easily build using the source is a prerequisite for USING an open-source database, not just for being part of the DEVELOPMENT effort.

- Curtis

P.S. I speak from personal experience: I would have been able to help out a lot more if I didn't have to spend 90% of my time working with PostgreSQL learning (or relearning) Unix and GNU tool issues. I don't personally mind so much, because I wanted to learn it better anyway, but it has definitely limited my ability to help so far.
This is especially true since I don't have the opportunity to immerse myself in Unix/PostgreSQL for days at a time, and I suffer from significant switching costs.
Re: [HACKERS] Windows Build System
Hannu Krosing asked: Does anyone know how MySQL and Interbase/Firebird do it?

From the MySQL web site for version 4.0: The Windows binaries use the Cygwin library. Source code for the version of Cygwin we have used is available on this page.

I think this offers a very big opportunity to differentiate. If we had project support, it would make PostgreSQL a more natural choice for Windows developers. Even in a shop that uses Unix servers with Windows front ends, having a database server that runs on your own machine makes it much easier to debug, write code on airplanes, etc.

- Curtis
[PERFORM] [HACKERS] Realtime VACUUM, was: performance of insert/delete/update
Tom Lane wrote: Sure, it's just shuffling the housekeeping work from one place to another. The thing that I like about Postgres' approach is that we put the housekeeping in a background task (VACUUM) rather than in the critical path of foreground transaction commit.

Thinking with my marketing hat on, MVCC would be a much bigger win if VACUUM were not required (or were done automagically). The need for periodic VACUUM just gives ammunition to the PostgreSQL opponents, who can claim we are merely deferring work and that it amounts to the same thing. A fully automatic background VACUUM will significantly reduce, but will not eliminate, this perceived weakness.

However, it has always seemed to me that there should be some way to reuse the space more dynamically and quickly than a background VACUUM, thereby reducing the percentage of tuples that are expired in heavy-update cases. If only a very tiny number of tuples on the disk are expired, this will reduce the aggregate performance/space penalty of MVCC into insignificance for the majority of uses.

Couldn't we reuse tuple and index space as soon as there are no transactions that depend on the old tuple or index values? I had imagined that this was always part of the long-term master plan.

Couldn't we keep a list of dead tuples in shared memory, and look in the list first when deciding where to place new values for inserts or updates, so we don't have to rely on VACUUM (even a background one)? If there are expired tuple slots in the list, these would be used before allocating a new slot from the tuple heap. The only issue is determining the lowest transaction ID among in-process transactions, which seems relatively easy to do (if it's not already done somewhere). In the normal shutdown and startup case, a tuple VACUUM could be performed automatically; this would normally be very fast, since there would not be many tuples in the list.

Index slots would be handled differently, since these cannot be substituted one for another. However, they could be recovered as part of every index page update. Pages would be scanned before being written, and any expired slots with transaction IDs lower than the lowest active one would be removed. This could be done for non-leaf pages as well, and would mean reorganizing only pages that are already going to be written, thereby not adding much to the overall work. I don't think internal pages containing pointers to values that are no longer in the leaf nodes (because of this partial expired-entry elimination) will cause a problem, since searches and scans will still work fine.

Does VACUUM do something that could not be handled in this realtime manner?

- Curtis
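A sketch of the shared-memory dead-slot list idea. Everything here is illustrative: the structure, sizes, and names are not actual PostgreSQL code, and a real version would live in shared memory under a spinlock:

    typedef unsigned int TransactionId;

    typedef struct DeadSlot
    {
        unsigned int   blkno;       /* heap block holding the dead tuple */
        unsigned short offnum;      /* line pointer within that block */
        TransactionId  xmax;        /* transaction that expired it */
    } DeadSlot;

    #define MAX_DEAD_SLOTS 1024

    typedef struct DeadSlotList
    {
        /* a spinlock would guard this structure in shared memory */
        int      nslots;
        DeadSlot slots[MAX_DEAD_SLOTS];
    } DeadSlotList;

    /* Pop a slot that no live transaction can still see; returns 0 if
     * an inserter should fall back to extending the relation as today. */
    static int
    pop_reusable_slot(DeadSlotList *list, TransactionId oldest_running,
                      DeadSlot *out)
    {
        int i;

        for (i = 0; i < list->nslots; i++)
        {
            if (list->slots[i].xmax < oldest_running)   /* safely dead */
            {
                *out = list->slots[i];
                list->slots[i] = list->slots[--list->nslots]; /* swap-remove */
                return 1;
            }
        }
        return 0;
    }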
[HACKERS] 500 tpsQL + WAL log implementation
I have been experimenting with empirical tests of file-system- and device-level writes to determine the actual constraints, in order to speed up the WAL logging code.

Using a raw file partition and a time-based technique for determining the optimal write position, I am able to get 8K writes physically written to disk synchronously in the range of 500 to 650 writes per second, using FreeBSD raw device partitions on IDE disks (with write cache disabled). I will be testing it soon under Linux with 10,000 RPM SCSI, which should be even better. It is my belief that the mechanism used to achieve these speeds could be incorporated into the existing WAL logging code as an abstraction that looks to the WAL code just like the file-level access currently used.

The current speeds are limited by the speed of a single disk rotation. For a 7,200 RPM disk this is 120/second; for a 10,000 RPM disk this is 166.66/second.

The mechanism works by adjusting the seek offset of the write, using gettimeofday to determine approximately where the disk head is in its rotation. The mechanism does not use any AIO calls.

Assume the following:

1) Disk rotation time is 8.333ms, or 8333us (7,200 RPM).
2) A write at offset 1,500K completes at system time 103s 000ms 000us.
3) A new write is requested at system time 103s 004ms 166us.
4) The data is aligned on the disk at 390K per rotation.
5) A write must be sent at least 20K ahead of the current head position to ensure that it is written in less than one rotation.

It can be determined from the above that a write for an offset of something slightly more than 195K past the last write, i.e., offset 1,695K, will be ahead of the current location of the head and will therefore complete in less than a single rotation's time. The disk-specific metrics (rotation speed, bytes per rotation, base write time, etc.) can be derived empirically through a tester program that would take a few minutes to run and which could be run at log setup time.

The obvious problem with the above mechanism is that the WAL system needs to be able to read from the log file in transaction order during recovery. This could be provided for using an abstraction that prepends the logical order to each block written to the disk, and makes sure that the log blocks contain either a valid logical order number or some other marker indicating that the block is not being used. A bitmap of blocks that have already been used would be kept in memory for quickly determining the next set of possible unused blocks, but this bitmap would not need to be written to disk except during normal shutdown, since in the event of a failure the bitmap would be reconstructed by reading all the blocks from the disk. Checkpointing and something akin to log rotation could be handled using this mechanism as well.

So, MY REAL QUESTION is whether or not this is the sort of speed improvement that warrants the work of writing the required abstraction layer and making it very robust. The WAL code should remain essentially unchanged, with perhaps new calls for the five or six routines used to access the log files, plus handling of the equivalent of log rotation for raw device access. These new calls would use either the current file-based implementation or the new logging mechanism, depending on the configuration.
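A standalone illustration of the head-position arithmetic above, using the example's numbers (7,200 RPM, 390K per rotation, a 20K safety lead). Only gettimeofday is real; the constants would come from the empirical tester described above:

    #include <stdio.h>

    #define ROTATION_US   8333L            /* one rotation at 7,200 RPM */
    #define BYTES_PER_ROT (390L * 1024)    /* bytes laid down per rotation */
    #define MIN_LEAD      (20L * 1024)     /* stay safely ahead of the head */

    /* Given when and where the last write landed, and the current time
     * (in practice from gettimeofday), pick the next safe offset. */
    static long
    next_write_offset(long last_offset, long last_us, long now_us)
    {
        long elapsed = (now_us - last_us) % ROTATION_US;
        long head = last_offset +
            (long) ((long long) elapsed * BYTES_PER_ROT / ROTATION_US);

        return head + MIN_LEAD;            /* lands in under one rotation */
    }

    int
    main(void)
    {
        /* The example: write at 1,500K done at t = 0; new request 4166us
         * later, i.e., half a rotation (~195K of media) past it. */
        long target = next_write_offset(1500L * 1024, 0, 4166);

        /* Prints ~1714K: the ~1,695K head position plus the 20K lead. */
        printf("write at %ldK\n", target / 1024);
        return 0;
    }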
I anticipate that the extra work required for a PostgreSQL administrator to use the proposed logging mechanism would be to:

1) Create a raw device partition of the appropriate size.
2) Run the metrics tester for that device partition.
3) Set the appropriate configuration parameters to indicate raw WAL logging.

I anticipate that the additional space requirements for this system would be on the order of 10% to 15% beyond the current file-based implementation's requirements.

So, is this worth doing? Would a robust implementation likely be accepted for 7.4, assuming it can demonstrate speed improvements in the range of 500 tps?

- Curtis
Re: [HACKERS] 500 tpsQL + WAL log implementation
Tom Lane wrote: What can you do *without* using a raw partition? I dislike that idea for two reasons: portability and security. The portability disadvantages are obvious. And in ordinary system setups Postgres would have to run as root in order to write on a raw partition. It occurs to me that the same technique could be used without any raw device access. Preallocate a large WAL file and apply the method within it. You'll have more noise in the measurements due to greater variability in the physical positioning of the blocks --- but it's rather illusory to imagine that you know the disk geometry with any accuracy anyway. Modern drives play a lot of games under the hood.

A write to a raw disk file is immediate and completes with minimal system overhead. I'll try to test a file-based approach using a write followed by an immediate fdatasync and see if that approaches the speed of the raw partition access. I suspect we'll get decent performance, perhaps only 10% to 15% slower. As you mention, there is nothing exact about the technique, so we should be able to get similar improvements with a file-based system.

I've been able to get over 1,500 raw writes confirmed to disk using raw partition writes, each slightly offset ahead of the other, yet only somewhere between 500 and 650 on a sustained basis using the technique I described, because of the noise in the geometry measurements and the variable timing of the writes themselves.

Tom also wrote: This scares me quite a bit too. The reason that the existing implementation maxes out at one WAL write per rotation is that for small transactions it's having to repeatedly write the same disk sector. You could only get around that by writing multiple versions of the same WAL page at different disk locations. Reliably reconstructing what data to use is not something that I'm prepared to accept on a handwave...

I'm pretty sure this could be done very reliably, but at the cost of slightly slower reading after a failure, for redo. I figured that whenever a transaction wrote to the log, it would set the log offset marker for new transactions, forcing the next transaction to use a new block. This would result in wasted space, which could be partially offset by using writes smaller than the 8K block size (along disk block size boundaries, 512 bytes for my disk). This has the advantage of making it fairly easy to ensure that the log can be reconstructed in order, since there would be no partial block writes to worry about.

I believe that 4 to 8 full rotations' worth of usable blocks could be maintained, and blocks would be written to the lowest-offset tracks first unless there were no free blocks of sufficient size. This would probably result in 90% to 95% utilization of the blocks (disregarding waste inside the blocks themselves). When the lowest-offset track filled up sufficiently, another empty track would be added to the usable-blocks list and the lowest-offset track taken off the unused list. This would ensure that a read of 4 to 8 tracks, which needs to be a fixed number for any given installation, could reconstruct the order of the WAL log, since at no time would blocks be out of order beyond that range.

Disk space is much cheaper than CPU and memory, so I think that a logging system that used as much as three or four times the space but is three or four times faster would be a worthwhile improvement for those systems where update or insert volume is very heavy. Obviously, this needs to be an option, not the default configuration.
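A minimal timing harness for the file-based test mentioned above: how many 8K synchronous writes per second a preallocated file sustains with write(2) plus fdatasync(2) (substitute fsync where fdatasync is unavailable). Sequential offsets only; the offset-stepping scheme would replace the simple append:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BLKSZ   8192
    #define NWRITES 500

    int
    main(int argc, char **argv)
    {
        char    block[BLKSZ];
        struct timeval t0, t1;
        double  secs;
        int     i, fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_WRONLY);      /* a preallocated test file */
        if (fd < 0)
            return 1;
        memset(block, 'x', BLKSZ);

        gettimeofday(&t0, NULL);
        for (i = 0; i < NWRITES; i++)
        {
            /* each write is confirmed on stable storage before the next */
            if (write(fd, block, BLKSZ) != BLKSZ || fdatasync(fd) != 0)
                return 1;
        }
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.0f synced writes/sec\n", NWRITES / secs);
        close(fd);
        return 0;
    }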
- Curtis
[HACKERS] Prepare enabled pgbench
Tatsuo, are you or anyone else working on adding PREPARE/EXECUTE support to pgbench? If not, I can do it myself, and if you are interested, I'll send you the patch.

- Curtis
Re: [HACKERS] Postgresql and multithreading
Bruce Momjian wrote: Let me add one more thing on this thread. This is one email in a long list of "Oh, gee, you aren't using that whiz-bang new sync/thread/aio/raid/raw feature" discussions where someone shows up and wants to know why. Does anyone know how to address these, efficiently?

You've brought up good points here. I'm sure that you consider me guilty of this with regard to my aio discussions, so I might offer a few specific suggestions:

1) Someone taking the time to generate a summary of the current thinking with regard to a particular whiz-bang feature. I can do this as a first pass for aio, if you think it's a good idea.

2) Including the pros and cons of the feature/implementation and how close the group is to deciding whether it would be worth doing. I can also do this.

3) A set of criteria that would need to be met or demonstrated before a feature would be considered good enough for inclusion in the main code. A separate section of the Developer's FAQ that handled this would help.

4) A development/feature philosophy section: maybe three or four paragraphs, with one specific example, perhaps multi-threading.

5) I'd also suggest changing some specifics in the FAQ. Section 1.2 currently has:

    * The usual process for source additions is:
    *
    * Review the TODO list.
    * Discuss on hackers the desirability of the fix/feature.
    * How should it behave in complex circumstances?

Add here that a check should be made against the new section in the FAQ described above. Also, section 1.1 has:

    * Discussion on the patch typically happens here. If the patch adds a
    * major feature, it would be a good idea to talk about it first on
    * the HACKERS list, in order to increase the chances of it being
    * accepted, as well as to avoid duplication of effort.

We should perhaps add here a section describing the phenomenon you describe regarding whiz-bang features.

I tried to prepare as best I could before bringing anything forward to HACKERS. In particular, I read the last two years of archives with anything to do with the WAL log, looked at the current code, read the TODOs, read a fair amount of discussion about aio, etc. So I was attempting to comply with my interpretation of the process. Yet, despite these efforts, you no doubt consider me guilty of creating unnecessary work, an outcome neither of us desired.

I'm undeterred in my desire to come up with something meaningful and am working on some interesting tests. I do, however, now know that the level of scepticism and the standards of architectural purity are high. I consider this good all around.

- Curtis
Re: [HACKERS] Analysis of ganged WAL writes
Since in your case all transactions A-E want the same buffer written, the memory (not its content) will also be the same.

But no, it won't: the successive writes will ask to write different snapshots of the same buffer.

Successive writes would write different NON-OVERLAPPING sections of the same log buffer. It wouldn't make sense to send three separate copies of the entire block; that could indeed cause problems. If a separate log-writing process was doing all the writing, it could pre-gang the writes. However, I'm not sure this is necessary; I'll test the simpler way first.

The problem I can see offhand is how the kaio system can tell which transaction can be safely notified of the write...

Yup, exactly. Whose snapshot made it down to (stable) disk storage?

If you do as above, it can inform the transactions when the blocks get written to disk, since there are no inconsistent writes. If transactions A, B, and C had commits in block 1023, and the aio system writes that block to the disk, it can notify all three that their transaction write is complete when that block (or partial block) is written to disk. If transaction C's write didn't make it into the buffer, I've got to assume the aio system or the disk cache logic can handle realizing that it didn't queue that write, and therefore not inform transaction C of a completion.

- Curtis
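A sketch of the completion bookkeeping this implies: each WAL block carries the list of backends whose commit records it holds, and whoever completes the write of that block (an aio callback or a write-confirming process) wakes everyone on the list. Types and names are illustrative only:

    #define MAX_WAITERS 32

    typedef struct WalBlockWaiters
    {
        unsigned int blkno;                  /* WAL block number */
        int          nwaiters;
        int          procs[MAX_WAITERS];     /* backends awaiting durability */
    } WalBlockWaiters;

    /* Called by a committing backend after copying its commit record
     * into its own non-overlapping section of the block. */
    static void
    register_commit_waiter(WalBlockWaiters *w, int proc)
    {
        if (w->nwaiters < MAX_WAITERS)
            w->procs[w->nwaiters++] = proc;
    }

    /* Called from the write-completion path for the block: the commit
     * is durable for every backend registered before the write began. */
    static void
    wal_block_written(WalBlockWaiters *w, void (*wakeup)(int proc))
    {
        int i;

        for (i = 0; i < w->nwaiters; i++)
            wakeup(w->procs[i]);
        w->nwaiters = 0;
    }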
Re: [HACKERS] Analysis of ganged WAL writes
Curtis Faith [EMAIL PROTECTED] writes: Successive writes would write different NON-OVERLAPPING sections of the same log buffer. It wouldn't make sense to send three separate copies of the entire block. That could indeed cause problems.

So you're going to undo the code's present property that all writes are block-sized? Aren't you worried about incurring page-in reads because the kernel can't know that we don't care about data beyond what we've written so far in the block?

Yes, I'll try undoing the current behavior. I'm not really worried about page-in reads, because the disk's internal buffers should contain most of the blocks surrounding the end of the log file. If the successive partial writes exceed a block (which they will under heavy use), then most of the time this won't be a problem anyway, since the disk will gang the full blocks before writing. If the inserts are not coming fast enough to fill the log, then the disk's cache should contain the data from the last time that block (or the block before it) was written. Disks have become pretty good at this sort of thing, since writing sequentially is a very common scenario.

It may not work, but one doesn't make significant progress without trying things that might not work. If it doesn't work, then I'll make certain that commit log records always fill the buffer they are written to, with variable-length commit records and something to identify the size of the padding used to fill the rest of the block.
Re: [HACKERS] Analysis of ganged WAL writes
Curtis Faith [EMAIL PROTECTED] writes: I'm not really worried about doing page-in reads because the disk's internal buffers should contain most of the blocks surrounding the end of the log file. If the successive partial writes exceed a block (which they will in heavy use) then most of the time this won't be a problem anyway since the disk will gang the full blocks before writing.

You seem to be willing to make quite a large number of assumptions about what the disk hardware will do or not do. I trust you're going to test your results on a wide range of hardware before claiming they have any general validity ...

Tom, I'm actually an empiricist. I trust nothing that I haven't tested or read the code for myself; I've found too many instances of bugs or poor implementations in the O/S to believe without testing. On the other hand, one has to make some assumptions in order to devise useful tests.

I'm not necessarily expecting to come up with something that will help everyone all the time. I'm just hoping to come up with something that will help those using modern hardware, most of the time. Even if it works, this will probably become one of those flags that needs to be tested as part of the performance analysis for any given system. Or perhaps, ideally, I'll come up with a LogDiskTester that simulates log output and determines the best settings to use for a given disk and O/S, optimized for size or speed, heavy inserts, etc.
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
So you think if I try to write a 1 gig file, it will write enough to fill up the buffers, then wait while the sync'er writes out a few blocks every second, free up some buffers, then write some more? Take a look at vfs_bio::getnewbuf() on *BSD and you will see that when it can't get a buffer, it will async write a dirty buffer to disk. We've addressed this scenario before; if I recall, the point Greg made earlier is that buffers getting full means writes become synchronous. What I was trying to point out was that it is very likely that the buffers will fill even for large buffers, and that the writes are going to be driven out not by efficient ganging but by something approaching LRU flushing, with an occasional once-a-second slightly more efficient write of 1/32nd of the buffers. Once the buffers get full, all subsequent writes turn into synchronous writes: even if the kernel writes asynchronously (meaning it can do other work), the writing process can't complete; it has to wait until a buffer has been flushed and is free for the copy. So the relatively poor implementation (for database inserts at least) of the syncer mechanism will cost a lot of performance if we get to this synchronous write mode due to a full buffer. It appears this scenario is much more likely than I had thought. Do you not think this is a potential performance problem to be explored? I'm only pursuing this as hard as I am because I feel like it's deja vu all over again. I've done this before and found a huge improvement (12X to 20X for bulk inserts). I'm not necessarily expecting that level of improvement here but my gut tells me there is more here than seems obvious on the surface. As far as this AIO conversation is concerned, I want to see someone come up with some performance improvement that we can only do with AIO. Unless I see it, I am not interested in pursuing this thread. If I come up with something via aio that helps, I'd be more than happy if someone else points out a non-aio way to accomplish the same thing. I'm by no means married to any particular solutions; I care about getting problems solved. And I'll stop trying to sell anyone on aio. - Curtis ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Analysis of ganged WAL writes
Your example of 1 trx/proc/rev will work _only_ if no more and no less than 1/4 of the platter is filled by _other_ log writers. Not really; if 1/2 the platter has been filled we'll still get one more commit in for a given rotation. If more than a rotation's worth of writing has occurred, that means we are bumping into the limit of disk I/O, and that is a limit we can't do anything about without doing interleaved log files. The case of bulk inserts is one where I would expect that for simple tables we should be able to peg the disks given today's hardware and enough inserting processes. Bulk inserts should probably be chunked at a higher level by inserting several records inside a single transaction. Agreed, that's much more efficient. There are plenty of situations where the inserts and updates are ongoing rather than initial; Shridhar's real-world test or TPC benchmarks, for example. - Curtis ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: Parallel Executors [was RE: [HACKERS] Threaded Sorting]
Curtis Faith wrote: The current transaction/user state seems to be stored in process global space. This could be changed to be a pointer to a struct stored in a back-end specific shared memory area which would be accessed by the executor process at execution start. The backend would destroy and recreate the shared memory and restart execution in the case where an executor process dies, much like the postmaster does with backends now. To the extent the executor process might make changes to the state, which I'd try to avoid if possible (don't know if it is), the executors could obtain locks; otherwise, if the executions were constrained to isolated elements (changes to different indexes for example) it seems like it would be possible using an architecture where you have: Jan Wieck replied: Imagine there is a PL/Tcl function. On the first call in a session, the PL/Tcl interpreter gets created (that's during execution, okay?). Now the procedure that's called inside of that interpreter creates a global variable ... a global Tcl variable inside of that interpreter, which is totally unknown to the backend since it doesn't know what Tcl is at all, and that variable is nothing but an entry in a private hash table inside of that interpreter. On a subsequent call to any PL/Tcl function during that session, it might be good if that darn hashtable entry exists. How do you propose to let this happen? And while at it, the Tcl procedure next calls spi_exec, causing the PL/Tcl function handler to call SPI_exec(), so your isolated executor all of a sudden becomes a fully operational backend, doing the parsing, planning and optimizing, or what? You bring up a good point; we couldn't do what I propose for all situations. I had never anticipated that splitting things up would be the rule. For example, the optimizer would have to decide whether it made sense to split up a query from a strictly performance perspective. So now, if we consider the fact that some things could not be done with split backend execution, the logic becomes:

    if (splitting is possible && splitting is faster)
        do the split execution;
    else
        do the normal execution;

Since the design already splits the backend internally into a separate execution phase, it seems like one could keep the current implementation for the typical case where splitting doesn't buy anything or cases where there is complex state information that needs to be maintained. If there are no triggers or functions that will be accessed by a given query then I don't see your concerns applying. If there are triggers or other conditions which preclude multi-process execution, we can keep exactly the same behavior as now. The plan execution entry could easily be a place where it either A) did the same thing it currently does or B) passed execution off to a pool as per the original proposal. I have to believe that most SELECTs won't be affected by your concerns. Additionally, even in the case of an UPDATE, many times there are large portions of the operation's actual work that wouldn't be affected even if there are lots of triggers on the tables being updated. The computation of the inside of the WHERE could often be split out without causing any problems with context or state information. The master executor could always be the original backend as it is now and this would be the place where the UPDATE part would be processed after the WHERE tuples had been identified. As with any optimization, it is more complicated and won't handle all the cases.
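In slightly more concrete terms, the dispatch might look like the sketch below (the helper and cost functions are invented for illustration, not real PostgreSQL functions):

    typedef struct Plan Plan;    /* stands in for the planner's output     */

    extern int    splitting_is_possible(Plan *plan); /* no triggers, PL
                                                      * functions, etc.    */
    extern double serial_cost(Plan *plan);
    extern double split_cost(Plan *plan);            /* includes IPC and
                                                      * merge overhead     */
    extern void   execute_split(Plan *plan);         /* executor pool path */
    extern void   execute_normal(Plan *plan);        /* today's code path  */

    void
    execute_plan(Plan *plan)
    {
        if (splitting_is_possible(plan) &&
            split_cost(plan) < serial_cost(plan))
            execute_split(plan);
        else
            execute_normal(plan);
    }
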
It's just an idea to handle common cases that would otherwise be much slower. That having been said, I'm sure there are much lower hanging fruit on the performance tree and likely will be for a little while. - Curtis ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
On Sun, 2002-10-06 at 11:46, Tom Lane wrote: I can't personally get excited about something that only helps if your server is starved for RAM --- who runs servers that aren't fat on RAM anymore? But give it a shot if you like. Perhaps your analysis is pessimistic. snipped I don't find it far fetched to imagine situations where people may commit large amounts of memory for the database yet marginally starve available memory for file system buffers. Especially so on heavily I/O bound systems or where sporadically other types of non-database file activity may occur. snipped Of course, that opens the door for simply adding more memory and/or slightly reducing the amount of memory available to the database (thus making it available elsewhere). Now, after all that's said and done, having something like aio in use would seemingly allow it to be somewhat more self-tuning from a potential performance perspective. Good points. Now for some surprising news (at least it surprised me). I researched the file system source on my system (FreeBSD 4.6) and found that the behavior was optimized for non-database access to eliminate unnecessary writes when temp files are created and deleted rapidly. It was not optimized to get data to the disk in the most efficient manner. The syncer on FreeBSD appears to place dirtied filesystem buffers into work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer processes one of the queues and increments a counter syncer_delayno. On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd of the writes that were buffered are processed. If the syncer gets behind and the writes for a given second take more than one second to process, the syncer does not wait but begins processing the next queue. AFAICT this means that there is no opportunity to have writes combined by the disk since they are processed in buckets based on the time the writes came in. Also, it seems very likely that many installations won't have enough buffers for 30 seconds' worth of changes and that there would be some level of SYNCHRONOUS writing because of this delay and the syncer process getting backed up. This might happen once per second as the buffers get full and the syncer has not yet started for that second interval. Linux might handle this better. I saw some emails exchanged a year or so ago about starting writes immediately in a low-priority way but I'm not sure if those patches got applied to the linux kernel or not. The source I had access to seems to do something analogous to FreeBSD but using fixed percentages of the dirty blocks or a minimum number of blocks. They appear to be handled in LRU order however. On-disk caches are much much larger these days so it seems that some way of getting the data out sooner would result in better write performance for the cache. My newer drive is a 10K RPM IBM Ultrastar SCSI and it has a 4M cache. I don't see these caches getting smaller over time so not letting the disk see writes will become more and more of a performance drain.
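To make the mechanism concrete, here is a toy model of what I understand the syncer wheel to be doing (simplified from my reading of the source; the names are approximations, not the actual kernel code):

    #define SYNCER_MAXDELAY 32

    typedef struct Buf {
        struct Buf *b_next;
        /* ... page contents, device offset, etc ... */
    } Buf;

    extern void write_buffer(Buf *bp);   /* stand-in for the real flush */

    static Buf *syncer_workitem[SYNCER_MAXDELAY];  /* one bucket/second */
    static int  syncer_delayno;

    /* A freshly dirtied buffer is scheduled 'delay' seconds out. */
    void
    syncer_add(Buf *bp, int delay)
    {
        int slot = (syncer_delayno + delay) % SYNCER_MAXDELAY;

        bp->b_next = syncer_workitem[slot];
        syncer_workitem[slot] = bp;
    }

    /* Called once per second: flush the current bucket and advance.
     * Buffers are written in arrival-time buckets, never re-sorted by
     * disk location, so the disk gets little chance to gang them. */
    void
    syncer_tick(void)
    {
        Buf *bp;

        while ((bp = syncer_workitem[syncer_delayno]) != NULL)
        {
            syncer_workitem[syncer_delayno] = bp->b_next;
            write_buffer(bp);
        }
        syncer_delayno = (syncer_delayno + 1) % SYNCER_MAXDELAY;
    }

- Curtis ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])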
Re: [HACKERS] Analysis of ganged WAL writes
Tom, first of all, excellent job improving the current algorithm. I'm glad you looked at the WALCommitLock code. This must be so because the backends that are released at the end of any given disk revolution will not be able to participate in the next group commit, if there is already at least one backend ready to commit. This is the major reason for my original suggestion about using aio_write. The writes don't block each other and there is no need for a kernel level exclusive locking call like fsync or fdatasync. Even the theoretical limit you mention of one transaction per revolution per committing process seems like a significant bottleneck. Is committing 1 and 4 transactions on every revolution good? It's certainly better than 1 per revolution. However, what if we could have done 3 transactions per process in the time it took for a single revolution? Then we are looking at (1 + 4)/2 = 2.5 transactions per revolution versus the theoretical maximum of (3 * 5) = 15 transactions per revolution if we can figure out a way to do non-blocking writes that we can guarantee are on the disk platter so we can return from commit. Setting aside whether or not aio is viable, do you not agree that eliminating the blocking would result in potentially a 6X improvement for the 5 process case? So this solution isn't perfect; it would still be nice to have a way to delay initiation of the WAL write until just before the disk is ready to accept it. I dunno any good way to do that, though. I still think that it would be much faster to just keep writing the WAL log blocks when they fill up and have a separate process wake the committing process when the write completes. This would eliminate WAL writing as a bottleneck. I have yet to hear anyone say that this can't be done, only that we might not want to do it because the code might not be clean. I'm generally only happy when I can finally remove a bottleneck completely, but speeding one up by 3X like you have done is pretty damn cool for a day or two's work. - Curtis ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
This is the trickle syncer. It prevents bursts of disk activity every 30 seconds. It is for non-fsync writes, of course, and I assume if the kernel buffers get low, it starts to flush faster. AFAICT, the syncer only speeds up when virtual memory paging fills the buffers past a threshold and even in that event it only speeds it up by a factor of two. I can't find any provision for speeding up flushing of the dirty buffers when they fill for normal file system writes, so I don't think that happens. - Curtis ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Analysis of ganged WAL writes
Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-)) I've never disputed this, so if I seem to be suggesting that, I've been unclear. I'm just assuming that the disk can get a confirmation back to the INSERTing process in much less than one rotation. This would allow that process to start working again, perhaps in time to complete another transaction. An application that is willing to have multiple transactions in flight at the same time can open up multiple backend connections to issue those transactions, and thereby perhaps beat the theoretical limit. But for serial transactions, there is not anything we can do to beat that limit. (At least not with the log structure we have now. One could imagine dropping a commit record into the nearest one of multiple buckets that are carefully scattered around the disk. But exploiting that would take near-perfect knowledge about disk head positioning; it's even harder to solve than the problem we're considering now.) Consider the following scenario, with time measured in disk rotations:

Time 1.00 - Process A commits - causing aio_write to log and wait
Time 1.03 - aio completes for Process A's write - wakes Process A
Time 1.05 - Process A starts another transaction
Time 1.08 - Process A commits
etc.

I agree that a process can't proceed from a commit until it receives confirmation of the write, but if the write has hit the disk before a full rotation then the process should be able to continue processing new transactions. You're failing to distinguish total throughput to the WAL drive from response time seen by any one transaction. Yes, a policy of writing each WAL block once when it fills would maximize potential throughput, but it would also mean a potentially very large delay for a transaction waiting to commit. The lower the system load, the worse the performance on that scale. You are assuming fsync or fdatasync behavior, I am not. There would be no delay under the scenario I describe. The transaction would exit commit as soon as the confirmation of the write is received from the aio system. I would hope that with a decent aio implementation this would generally be much less than one rotation. I think that the single transaction response time is very important and that's one of the chief problems I sought to solve when I proposed aio_writes for logging in my original email many moons ago. ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the ganged write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are. All aio_write has to do is pass the write off to the device as soon as it gets it, bypassing the system buffers. The code on the disk's hardware is very good at knowing when the disk head is coming. IMHO, bypassing the kernel's less than enlightened writing system is the main point of using aio_write. - Curtis ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Analysis of ganged WAL writes
I may be missing something obvious, but I don't see a way to get more than 1 trx/process/revolution, as each previous transaction in that process must be written to disk before the next can start, and the only way it can be written to the disk is when the disk heads are on the right place and that happens exactly once per revolution. Okay, consider the following scenario:

1) Process A commits when the platter is at 0 degrees.
2) There are enough XLog writes from other processes to fill a 1/4 platter rotation's worth of log, or 90 degrees. The SCSI drive writes the XLog commit record and keeps writing other log entries as the head rotates.
3) Process A receives a confirmation of the write before the platter rotates 60 degrees.
4) Process A continues and adds another commit before the platter rotates to 90 degrees.

This should be very possible and more and more likely in the future as CPUs get faster and faster relative to disks. I'm not suggesting this would happen all the time, just that it's possible and that an SMP machine with good CPUs and a fast I/O subsystem should be able to keep the log writing at close to I/O bandwidth limits. The case of bulk inserts is one where I would expect that for simple tables we should be able to peg the disks given today's hardware and enough inserting processes.
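To put rough numbers on the timing (my own back-of-the-envelope arithmetic, assuming a 10,000 RPM drive):

    10,000 RPM = 166.7 revolutions/second, so one revolution = 6 ms
    60 degrees of rotation = (60/360) * 6 ms = 1.0 ms
    90 degrees of rotation = (90/360) * 6 ms = 1.5 ms

So Process A gets its confirmation about 1 ms after committing and still has roughly 0.5 ms of CPU time, hundreds of thousands of instructions on a modern CPU, to issue the next commit before the head reaches the 90 degree point.

- Curtis ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly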
Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]
Greg Copeland [EMAIL PROTECTED] writes: Doesn't this also increase the likelihood that people will be running in a buffer-poor environment more frequently than I previously asserted, especially in very heavily I/O bound systems? Unless I'm mistaken, that opens the door for a general case of why an aio implementation should be looked into. Neil Conway replies: Well, at least for *this specific situation*, it doesn't really change anything -- since FreeBSD doesn't implement POSIX AIO as far as I know, we can't use that as an alternative. I haven't tried it yet but there does seem to be an aio implementation that conforms to POSIX in FreeBSD 4.6.2. It's part of the kernel and can be found in: /usr/src/sys/kern/vfs_aio.c However, I'd suspect that the FreeBSD kernel allows for some way to tune the behavior of the syncer. If that's the case, we could do some research into what settings are more appropriate for FreeBSD, and recommend those in the docs. I don't run FreeBSD, however -- would someone like to volunteer to take a look at this? I didn't see anything obvious in the docs but I still believe there's some way to tune it. I'll let everyone know if I find some better settings. BTW Curtis, did you happen to check whether this behavior has been changed in FreeBSD 5.0? I haven't checked but I will. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Parallel Executors [was RE: [HACKERS] Threaded Sorting]
tom lane wrote: Curtis Faith [EMAIL PROTECTED] writes: What about splitting out parsing, optimization and plan generation from execution and having a separate pool of executor processes? As an optimizer finishes with a query plan it would initiate execution by grabbing an executor from a pool and passing it the plan. So different executors would potentially handle the queries from a single transaction? How will you deal with pushing transaction-local state from one to the other? Even if you restrict it to switching at transaction boundaries, you still have session-local state (at minimum user ID and SET settings) to worry about. Hmmm, what transaction boundaries did you mean? Since we are talking about single statement parallelization, there must be some specific internal semantics that you believe need isolation. It seems like we'd be able to get most of the benefit and restrict the parallelization in a way that would preserve this isolation, but I'm curious what you were specifically referring to. The current transaction/user state seems to be stored in process global space. This could be changed to be a pointer to a struct stored in a back-end specific shared memory area which would be accessed by the executor process at execution start. The backend would destroy and recreate the shared memory and restart execution in the case where an executor process dies, much like the postmaster does with backends now. To the extent the executor process might make changes to the state, which I'd try to avoid if possible (don't know if it is), the executors could obtain locks; otherwise, if the executions were constrained to isolated elements (changes to different indexes for example) it seems like it would be possible using an architecture where you have:

Main Executor: Responsible for updating global meta data from each sub-executor and assembling the results of multiple executions. In the case of multiple executor sorts, the main executor would perform a merge sort on its own and its subordinates' pre-sorted sub-sets of the relation.

Subordinate Executor: Executes sub-plans and returns results or meta-data update information into front-end shared memory directly.

To make this optimal, the index code would have to be changed to support the idea of partial scans. In the case of btrees it would be pretty easy using the root page to figure out what index values delineated different 1/2's, 1/3's, 1/4's etc. of the index space. I'm not sure what you'd have to do to support this for table scans as I don't know the PostgreSQL tuple storage mechanism, yet. This does not seem like too much architectural complexity or performance overhead (even for the single executor case) for a big gain for complex query performance. Being able to apply multiple CPUs to a single query is attractive, but I've not yet seen schemes for it that don't look like the extra CPU power would be chewed up in overhead :-(. Do you remember specific overhead problems/issues? - Curtis ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
[HACKERS] Anyone else having list server problems?
I've been getting only about 60% of the emails sent to the list. I see many emails in the archives that I never got via email. Is anyone else having this problem? - Curtis ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Threaded Sorting
tom lane writes: The notion of a sort process pool seems possibly attractive. I'm unconvinced that it's going to be a win though because of the cost of shoving data across address-space boundaries. What about splitting out parsing, optimization and plan generation from execution and having a separate pool of executor processes? As an optimizer finishes with a query plan it would initiate execution by grabbing an executor from a pool and passing it the plan. This would allow the potential for passing partial plans to multiple executors so a given query might be split up into three or four pieces and then executed in parallel with the results passed through a shared memory area owned by each executor process. This would allow for potential optimization of sorts without threads or incurring the overhead problems you mentioned for a separate sorter process. The optimizer could do things like split a scan into 3 or 4 pieces before sending it off for execution and then join the pieces back together. It could also make complex queries much faster if there are idling CPUs if the optimizer was updated to take advantage of this. If we are going to split things apart, then this should be done at a natural communication boundary, right? The code has this logical split right now anyway so the change would be more natural. OTOH, there are much bigger fish to fry at the moment, I suspect. - Curtis ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
You are confusing WALWriteLock with WALInsertLock. A transaction-committing flush operation only holds the former. XLogInsert only needs the latter --- at least as long as it doesn't need to write. Well, that makes things better than I thought. We still end up with a disk write for each transaction though, and I don't see how this can ever get better than (Disk RPM)/60 transactions per second, since commit fsyncs are serialized. Every fsync will have to wait almost a full revolution to reach the end of the log. As a practical matter then, everyone will use commit_delay to improve this. This will pessimize performance except in the case where WAL traffic is very heavy, because it means you don't commit until the block containing your commit record is filled. What if you are the only active backend? We could handle this using a mechanism analogous to the current commit delay. If there are more than commit_siblings other processes running then do the write automatically after the commit_delay interval. This would make things no more pessimistic than the current implementation but provide the additional benefit of allowing the LogWriter to write in optimal sizes if there are many transactions. The commit_delay method won't be as good in many cases. Consider an update scenario where a larger commit delay gives better throughput. A given transaction will flush after the commit_delay interval. The delay is very unlikely to result in a scenario where the dirty log buffers are the optimal size. As a practical matter I think this would tend to make the writes larger than they would otherwise have been and this would unnecessarily delay the commit on the transaction. I do not, however, see any value in forcing all the WAL writes to be done by a single process; which is essentially what you're saying we should do. That just adds extra process-switch overhead that we don't really need. I don't think that an fsync will ever NOT cause the process to get switched out, so I don't see how another process doing the write would result in more overhead. The fsync'ing process will block on the fsync, so there will always be at least one process switch (probably many) while waiting for the fsync to complete, since we are talking many milliseconds for the fsync in every case. The log file would be opened O_DSYNC, O_APPEND every time. Keep in mind that we support platforms without O_DSYNC. I am not sure whether there are any that don't have O_SYNC either, but I am fairly sure that we measured O_SYNC to be slower than fsync()s on some platforms. Well, there is no reason that the LogWriter couldn't be doing fsyncs instead of O_DSYNC writes in those cases. I'd leave this switchable using the current flags. Just change the semantics a bit.
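Spelling out the delay heuristic above (a sketch only; commit_delay and commit_siblings are the existing GUCs, everything else is an invented name):

    extern int CommitDelay;      /* existing GUC: commit_delay (usec)  */
    extern int CommitSiblings;   /* existing GUC: commit_siblings      */

    extern int  count_active_backends(void);
    extern void request_flush_after(long lsn, int delay_usec);
    extern void wait_for_flush(long lsn);   /* sleep until the LogWriter
                                             * has written past lsn     */

    void
    commit_wait(long my_commit_lsn)
    {
        if (count_active_backends() > CommitSiblings)
            /* others are likely to commit soon; give the LogWriter a
             * chance to gang our record with theirs, but bound the wait */
            request_flush_after(my_commit_lsn, CommitDelay);
        else
            /* nobody to gang with: ask for the write right away */
            request_flush_after(my_commit_lsn, 0);

        wait_for_flush(my_commit_lsn);
    }

- Curtis ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html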
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
In particular, it would seriously degrade performance if the WAL file isn't on its own spindle but has to share bandwidth with data file access. If the OS is stupid I could see this happening. But if there are buffers and some sort of elevator algorithm the I/O won't happen at bad times. I agree with you though that writing for every single insert probably does not make sense. There should be some blocking of writes. The optimal size would have to be derived empirically. What we really want, of course, is write on every revolution where there's something worth writing --- either we've filled a WAL block or there is a commit pending. But that just gets us back into the same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. I don't see how an extra process makes that problem any easier. The whole point of the extra process handling all the writes is so that it can write on every revolution, if there is something to write. It doesn't need to care if more commits will arrive soon. BTW, it would seem to me that aio_write() buys nothing over plain write() in terms of ability to gang writes. If we issue the write at time T and it completes at T+X, we really know nothing about exactly when in that interval the data was read out of our WAL buffers. We cannot assume that commit records that were stored into the WAL buffer during that interval got written to disk. Why would we need to make that assumption? The only thing we'd need to know is that a given write succeeded, meaning that commits before that write are done. The advantage to aio_write in this scenario is when writes cross track boundaries or when the head is in the wrong spot. If we write in reasonable blocks with aio_write the write might get to the disk before the head passes the location for the write. Consider a scenario where:

Head is at file offset 10,000.
Log contains blocks 12,000 - 12,500
..time passes..
Head is now at 12,050
Commit occurs writing block 12,501

In the aio_write case the write would already have been done for blocks 12,000 to 12,050 and would be queued up for some additional blocks up to potentially 12,500. So the write for the commit could occur without an additional rotation delay. We are talking up to a full rotation of delay, i.e. several milliseconds, for a single disk. I don't know how often this happens in actual practice but it might occur as often as every other time. - Curtis ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large PerformanceGain in WAL synching
Bruce Momjian wrote: So every backend is going to wait around until its fsync gets done by the backend process? How is that a win? This is just another version of our GUC parameters:

    #commit_delay = 0       # range 0-10, in microseconds
    #commit_siblings = 5    # range 1-1000

which attempt to delay fsync if other backends are nearing commit. Pushing things out to another process isn't a win; figuring out if someone else is coming for commit is. It's not the same at all. My proposal makes two extremely important changes from a performance perspective.

1) WALWriteLocks are never held by processes for lengthy transactions, only for long enough to copy the log entry into the buffer. This means real work can be done by other processes while a transaction is waiting for its commit to finish. I'm sure that blocking on XLogInsert because another transaction is performing an fsync is extremely common with frequent update scenarios.

2) The log is written using optimal write sizes, which is much better than a user-defined guess of the microseconds to delay the fsync. We should be able to get the bottleneck to be the maximum write throughput of the disk with the modifications to Tom Lane's scheme I proposed.

Remember, write() is fast, fsync is slow. Okay, it's clear I missed the point about Unix write earlier :-) However, it's not just saving fsyncs that we need to worry about. It's the unnecessary blocking of other processes that are simply trying to append some log records in the course of whatever updating or inserting they are doing. They may be a long way from commit. fsync being slow is the whole reason for not wanting to have exclusive locks held for the duration of an fsync. On an SMP machine this change alone would probably speed things up by an order of magnitude (assuming there aren't any other similar locks causing the same problem). - Curtis ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
So, you are saying that we may get back aio confirmation quicker than if we issued our own write/fsync because the OS was able to slip our flush to disk in as part of someone else's or a general fsync? I don't buy that because it is possible our write() gets in as part of someone else's fsync and our fsync becomes a no-op, meaning there aren't any dirty buffers for that file. Isn't that also possible? Separate out the two concepts:

1) Writing of incomplete transactions at the block level by a background LogWriter. I think it doesn't matter whether the write is aio_write or write; writing blocks when we get them should provide the benefit I outlined. Waiting till fsync could miss the opportunity to write before the head passes the end of the last durable write because the drive buffers might empty, causing up to a full rotation's delay.

2) aio_write vs. normal write. Since, as you and others have pointed out, aio_write and write are both asynchronous, the issue becomes one of whether or not the copies to the file system buffers happen synchronously or not. This is not a big difference but it seems to me that the OS might be able to avoid some context switches by grouping copying in the case of aio_write. I've heard anecdotal reports that this is significantly faster for some things but I don't know for certain.

Also, remember the kernel doesn't know where the platter rotation is either. Only the SCSI drive can reorder the requests to match this. The OS can group based on head location, but it doesn't know much about the platter location, and it doesn't even know where the head is. The kernel doesn't need to know anything about platter rotation. It just needs to keep the disk write buffers full enough not to cause a rotational latency. It's not so much a matter of reordering as it is of getting the data into the SCSI drive before the head passes the last write's position. If the SCSI drive's buffers are kept full it can continue writing at its full throughput. If the writes stop and the buffers empty it will need to wait up to a full rotation before it gets to the end of the log again. Also, does aio return info when the data is in the kernel buffers or when it is actually on the disk? Simply, aio allows us to do the write and get notification when it is complete. I don't see how that helps us, and I don't see any other advantages to aio. To use aio, we need to find something that _can't_ be solved with more traditional Unix API's, and I haven't seen that yet. This aio thing is getting out of hand. It's like we have a hammer, and everything looks like a nail, or a use for aio. Yes, while I think it's probably worth doing and faster, it won't help as much as just keeping the drive buffers full even if that's by using write calls. I still don't understand the opposition to aio_write. Could we just have the configuration setup determine whether one or the other is used? I don't see why we wouldn't use the faster calls if they were present and reliable on a given system. - Curtis ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
No question about that! The sooner we can get stuff to the WAL buffers, the more likely we will get some other transaction to do our fsync work. Any ideas on how we can do that? More like the sooner we get stuff out of the WAL buffers and into the disk's buffers, whether by write or aio_write. It doesn't do any good to have information in the XLog unless it gets written to the disk buffers before they empty. We hesitate to add code relying on new features unless it is a significant win, and in the aio case, we would have different WAL disk write models for with/without aio, so it clearly could be two code paths, and with two code paths, we can't as easily improve or optimize. If we get a 2% boost out of some feature, but it later discourages us from adding a 5% optimization, it is a loss. And, in most cases, the 2% optimization is for a few platforms, while the 5% optimization is for all. This code is 15+ years old, so we are looking way down the road, not just for today's hot feature. I'll just have to implement it and see if it's as easy and isolated as I think it might be and would allow the same algorithm for aio_write or write. I can't tell you how many aio/mmap/fancy feature discussions we have had, and we obviously discuss them, but in the end, they end up being of questionable value for the risk/complexity; but, we keep talking, hoping we are wrong or some good ideas come out of it. I'm all in favor of keeping clean designs. I'm very pleased with how easy PostgreSQL is to read and understand given how much it does. - Curtis ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I wrote: ... most file systems can't process fsync's simultaneous with other writes, so those writes block because the file system grabs its own internal locks. tom lane replies: Oh? That would be a serious problem, but I've never heard that asserted before. Please provide some evidence. Well I'm basing this on past empirical testing and having read some man pages that describe fsync under this exact scenario. I'll have to write a test to prove this one way or another. I'll also try and look into the linux/BSD source for the common file systems used for PostgreSQL. On a filesystem that does have that kind of problem, can't you avoid it just by using O_DSYNC on the WAL files? Then there's no need to call fsync() at all, except during checkpoints (which actually issue sync() not fsync(), anyway). No, they're not exactly the same thing. Consider:

    Process A                        File System
    ---------                        -----------
    Writes index buffer              .idling...
    Writes entry to log cache        .
    Writes another index buffer      .
    Writes another log entry         .
    Writes tuple buffer              .
    Writes another log entry         .
    Index scan                       .
    Large table sort                 .
    Writes tuple buffer              .
    Writes another log entry         .
    Writes ...                       .
    Writes another index buffer      .
    Writes another log entry         .
    Writes another index buffer      .
    Writes another log entry         .
    Index scan                       .
    Large table sort                 .
    Commit                           .
    File Write Log Entry             .idling...
    .idling...                       Write to cache
    File Write Log Entry             .idling...
    .idling...                       Write to cache
    File Write Log Entry             .idling...
    .idling...                       Write to cache
    File Write Log Entry             .idling...
    .idling...                       Write to cache
    Write Commit Log Entry           .idling...
    .idling...                       Write to cache
    Call fsync                       .idling...
    .idling...                       Write all buffers to device
    .DONE.

In this case, Process A is waiting for all the buffers to write at the end of the transaction. With asynchronous I/O this becomes:

    Process A                        File System
    ---------                        -----------
    Writes index buffer              .idling...
    Writes entry to log cache        Queue up write - move head to cylinder
    Writes another index buffer      Write log entry to media
    Writes another log entry         Immediate write to cylinder since head is still there
    Writes tuple buffer              .
    Writes another log entry         Queue up write - move head to cylinder
    Index scan                       .busy with scan...
    Large table sort                 Write log entry to media
    Writes tuple buffer              .
    Writes another log entry         Queue up write - move head to cylinder
    Writes ...                       .
    Writes another index buffer      Write log entry to media
    Writes another log entry         Queue up write - move head to cylinder
    Writes another index buffer      .
    Writes another log entry         Write log entry to media
    Index scan                       .
    Large table sort                 Write log entry to media
    Commit                           .
    Write Commit Log Entry           Immediate write to cylinder since head is still there
    .DONE.

Effectively the real work of writing the cache is done while the CPU for the process is busy doing index scans, sorts, etc. With the WAL log on another device and SCSI I/O the log writing should almost always be done except for the final commit write. Whether by threads or multiple processes, there is the same contention on the file through multiple writers. The file system can decide to reorder writes before they start but not after. If a write comes after a fsync starts it will have to wait on that fsync. AFAICS we cannot allow the filesystem to reorder writes of WAL blocks, on safety grounds (we want to be sure we have a consistent WAL up to the end of what we've written). Even if we can allow some reordering when a single transaction puts out a large volume of WAL data, I fail to see where any large gain is going to come from.
We're going to be issuing those writes sequentially and that ought to match the disk layout about as well as can be hoped anyway. My comment was about reads and writes of other processes, not the WAL log. In my original email, recall I mentioned using the O_APPEND open flag which will ensure that all log entries are done sequentially. Likewise a given process's writes can NEVER be reordered if they are submitted synchronously, as is done in the calls to flush the log as well as the dirty pages in the buffer in the current code. We do not fsync buffer pages; in fact a transaction commit doesn't write buffer pages at all. I think the above is just a misunderstanding of what's really happening.
[HACKERS] fsync exclusive lock evidence WAS: Potential Large Performance Gain in WAL synching
After some research I still hold that fsync blocks, at least on FreeBSD. Am I missing something? Here's the evidence. Code from /usr/src/sys/kern/vfs_syscalls.c:

    int
    fsync(p, uap)
            struct proc *p;
            struct fsync_args /* {
                    syscallarg(int) fd;
            } */ *uap;
    {
            register struct vnode *vp;
            struct file *fp;
            vm_object_t obj;
            int error;

            if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
                    return (error);
            vp = (struct vnode *)fp->f_data;
            vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
            if (VOP_GETVOBJECT(vp, &obj) == 0)
                    vm_object_page_clean(obj, 0, 0, 0);
            if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
                vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
                bioops.io_fsync)
                    error = (*bioops.io_fsync)(vp);
            VOP_UNLOCK(vp, 0, p);
            return (error);
    }

Notice the calls to:

    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    ...
    VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC. From the man pages for VOP_UNLOCK:

    int VOP_LOCK(struct vnode *vp, int flags, struct proc *p);
    int VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);
    int VOP_ISLOCKED(struct vnode *vp, struct proc *p);
    int vn_lock(struct vnode *vp, int flags, struct proc *p);

    DESCRIPTION
    These calls are used to serialize access to the filesystem, such as
    to prevent two writes to the same file from happening at the same
    time. The arguments are:

    vp      the vnode being locked or unlocked
    flags   One of the lock request types:

            LK_SHARED       Shared lock
            LK_EXCLUSIVE    Exclusive lock
            LK_UPGRADE      Shared-to-exclusive upgrade
            LK_EXCLUPGRADE  First shared-to-exclusive upgrade
            LK_DOWNGRADE    Exclusive-to-shared downgrade
            LK_RELEASE      Release any type of lock
            LK_DRAIN        Wait for all lock activity to end

            The lock type may be or'ed with these lock flags:

            LK_NOWAIT       Do not sleep to wait for lock
            LK_SLEEPFAIL    Sleep, then return failure
            LK_CANRECURSE   Allow recursive exclusive lock
            LK_REENABLE     Lock is to be reenabled after drain
            LK_NOPAUSE      No spinloop

            The lock type may be or'ed with these control flags:

            LK_INTERLOCK    Specify when the caller already has a simple
                            lock (VOP_LOCK will unlock the simple lock
                            after getting the lock)
            LK_RETRY        Retry until locked
            LK_NOOBJ        Don't create object

    p       process context to use for the locks

    Kernel code should use vn_lock() to lock a vnode rather than calling
    VOP_LOCK() directly.

---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
[HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
It appears the fsync problem is pervasive. Here's Linux 2.4.19's version from fs/buffer.c:

    /* lock */
    down(&inode->i_sem);
    ret = filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 1);
    if (err && !ret)
            ret = err;
    err = filemap_fdatawait(inode->i_mapping);
    if (err && !ret)
            ret = err;
    /* unlock */
    up(&inode->i_sem);

But this is probably not a big factor as you outline below because the WALWriteLock is causing the same kind of contention. tom lane wrote: This is kind of ugly in general terms but I'm not sure that it really hurts Postgres. In our present scheme, the only files we ever fsync() are WAL log files, not data files. And in normal operation there is only one WAL writer at a time, and *no* WAL readers. So an exclusive kernel-level lock on a WAL file while we fsync really shouldn't create any problem for us. (Unless this indirectly blocks other operations that I'm missing?) I hope you're right but I see some very similar contention problems in the case of many small transactions because of the WALWriteLock. Assume Transaction A writes a lot of buffers and XLog entries, so the Commit forces a relatively lengthy fsync. Transactions B - E block not on the kernel lock from fsync but on the WALWriteLock. When A finishes the fsync and subsequently releases the WALWriteLock, B unblocks and gets the WALWriteLock for its fsync for the flush. C blocks on the WALWriteLock waiting to write its XLOG_XACT_COMMIT. B releases and now C writes its XLOG_XACT_COMMIT. There now seems to be a lot of contention on the WALWriteLock. This is a shame for a system that has no locking at the logical level and therefore seems like it could be very, very fast and offer incredible concurrency. As I commented before, I think we could do with an extra process to issue WAL writes in places where they're not in the critical path for a foreground process. But that seems to be orthogonal from this issue. It's only orthogonal to the fsync-specific contention issue. We now have to worry about WALWriteLock semantics causing the same contention. Your idea of a separate LogWriter process could very nicely solve this problem and accomplish a few other things at the same time if we make a few enhancements. Back-end servers would not issue fsync calls. They would simply block waiting until the LogWriter had written their record to the disk, i.e. until the sync'd block # was greater than the block that contained the XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends after its log write returns. The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter would issue writes of the optimal size when enough data was present or of smaller chunks if enough time had elapsed since the last write. The nice part is that the WALWriteLock semantics could be changed to allow the LogWriter to write to disk while WALWriteLocks are acquired by back-end servers. WALWriteLocks would only be held for the brief time needed to copy the entries into the log buffer. The LogWriter would only need to grab a lock to determine the current end of the log buffer. Since it would be writing blocks that occur earlier in the cache than the XLogInsert log writers, it won't need to grab a WALWriteLock before writing the cache buffers. Many transactions would commit on the same fsync (now really a write with O_DSYNC) and we would get optimal write throughput for the log system. This would handle all the issues I had and it doesn't sound like a huge change.
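In rough code, the LogWriter's main loop might look something like this (a sketch only: every name here is invented, and real code would need proper locking, buffer wraparound handling and error checks):

    #include <unistd.h>

    #define OPTIMAL_WRITE_SIZE (64 * 1024)  /* to be tuned empirically  */

    extern int   xlog_fd;            /* opened O_DSYNC | O_APPEND       */
    extern char *log_buffer;
    extern long  flushed_upto;       /* everything before this is on
                                      * disk already                    */

    extern long read_insert_point(void);  /* grabs the lock just long
                                           * enough to read the end of
                                           * the in-memory log          */
    extern int  waited_long_enough(void);
    extern void wait_for_new_log_data(void);
    extern void wake_backends_committed_before(long lsn);

    void
    log_writer_main(void)
    {
        for (;;)
        {
            long write_upto = read_insert_point();

            if (write_upto - flushed_upto >= OPTIMAL_WRITE_SIZE ||
                (write_upto > flushed_upto && waited_long_enough()))
            {
                /* O_DSYNC: when write() returns, the data is durable */
                write(xlog_fd, log_buffer + flushed_upto,
                      write_upto - flushed_upto);
                flushed_upto = write_upto;

                /* release every backend whose commit record is now
                 * on the platter */
                wake_backends_committed_before(flushed_upto);
            }
            else
                wait_for_new_log_data();
        }
    }

A backend at commit would then just sleep until flushed_upto passes the position of its XLOG_XACT_COMMIT record.
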
In fact, it ends up being almost semantically identical to the aio_write suggestion I made originally, except the LogWriter is doing the background writing instead of the OS and we don't have to worry about aio implementations and portability. - Curtis ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Bruce Momjian wrote: I am again confused. When we do write(), we don't have to lock anything, do we? (Multiple processes can write() to the same file just fine.) We do block the current process, but we have nothing else to do until we know it is written/fsync'ed. Does aio more easily allow the kernel to order those writes? Is that the issue? Well, certainly the kernel already orders the writes. Just because we write() doesn't mean it goes to disk. Only fsync() or the kernel do that. We don't have to lock anything, but most file systems can't process fsync's simultaneous with other writes, so those writes block because the file system grabs its own internal locks. The fsync call is more contentious than typical writes because its duration is usually longer, so it holds the locks longer over more pages and structures. That is the real issue: the contention caused by fsync'ing very frequently, which blocks other writers and readers. For the buffer manager, the blocking of readers is probably even more problematic when the cache is a small percentage (say 10% to 15%) of the total database size because most leaf node accesses will result in a read. Each of these reads will have to wait on the fsync as well. Again, a very well written file system probably can minimize this but I've not seen any. Further comment on: We do block the current process, but we have nothing else to do until we know it is written/fsync'ed. Writing out a bunch of calls at the end, after having consumed a lot of CPU cycles, and then waiting is not as efficient as writing them out while those CPU cycles are being used. We are currently wasting the time it takes for a given process to write. The thinking probably has been that this is no big deal because other processes, say B, C and D, can use the CPU cycles while process A blocks. This is true UNLESS the other processes are blocking on reads or writes caused by process A doing the final writes and fsync. Yes, but Oracle is threaded, right, so, yes, they clearly could win with it. I read the second URL and it said we could issue separate writes and have them be done in an optimal order. However, we use the file system, not raw devices, so don't we already have that in the kernel with fsync()? Whether by threads or multiple processes, there is the same contention on the file through multiple writers. The file system can decide to reorder writes before they start but not after. If a write comes after a fsync starts it will have to wait on that fsync. Likewise a given process's writes can NEVER be reordered if they are submitted synchronously, as is done in the calls to flush the log as well as the dirty pages in the buffer in the current code. Probably. Having seen the Informix 5/7 debacle, I don't want to fall into the trap where we add stuff that just makes things faster on SMP/threaded systems when it makes our code _slower_ on single CPU systems, which is exactly what Informix did in Informix 7, and we know how that ended (lost customers, bought by IBM). I don't think that's going to happen to us, but I thought I would mention it. Yes, I hate improvements that make things worse for most people. Any changes I'd contemplate would be simply another configuration driven optimization that could be turned off very easily. - Curtis ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I wrote: I'm no Unix filesystem expert but I don't see how the OS can handle multiple writes and fsyncs to the same file descriptors without blocking other processes from writing at the same time. It may be that there are some clever data structures they use but I've not seen huge praise for most of the file systems. A well written file system could minimize this contention but I'll bet it's there with most of the ones that PostgreSQL most commonly runs on. I'll have to write a test and see if there really is a problem. Bruce Momjian wrote: Yes, I can see some contention, but what does aio solve? Well, theoretically, aio lets the file system handle the writes without requiring any locks being held by the processes issuing those writes. The disk i/o scheduler can therefore issue the writes using spinlocks or something very fast since it controls the timing of each of the actual writes. In some systems this is handled by the kernel and can be very fast. I suspect that with large RAID controllers or intelligent disk systems like EMC this is even more important because they should be able to handle a much higher level of concurrent i/o. Now whether or not the common file systems handle this well, I can't say. Take a look at some comments on how Oracle uses asynchronous I/O: http://www.ixora.com.au/notes/redo_write_multiplexing.htm http://www.ixora.com.au/notes/asynchronous_io.htm http://www.ixora.com.au/notes/raw_asynchronous_io.htm It seems that OS support for this will likely increase and that this issue will become more and more important as users contemplate SMP systems or if threading is added to certain PostgreSQL subsystems. It might be easier for me to implement the change I propose and then see what kind of difference it makes. I wanted to run the idea past this group first. We can all postulate whether or not it will work but we won't know unless we try it. My real issue is one of what happens in the event that it does work. I've had very good luck implementing this sort of thing for other systems but I don't yet know the range of i/o requests that PostgreSQL makes. Assuming we can demonstrate no detrimental effects on system reliability and that the change is implemented in such a way that it can be turned on or off easily, will a 50% or better increase in speed for updates justify the sort of change I am proposing? 20%? 10%? - Curtis ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Advice: Where could I be of help?
tom lane wrote: But more globally, I think that our worst problems these days have to do with planner misestimations leading to bad plans. The planner is usually *capable* of generating a good plan, but all too often it picks the wrong one. We need work on improving the cost modeling equations to be closer to reality. If that's at all close to your sphere of interest then I think it should be #1 priority --- it's both localized, which I think is important for a first project, and potentially a considerable win. This seems like a very interesting problem. One of the ways that I thought would be interesting and would solve the problem of trying to figure out the right numbers is to have certain guesses for the actual values based on statistics gathered during vacuum and general running, and then have the planner run the best plan. Then during execution, if the planner turned out to be VERY wrong about certain assumptions, the execution system could update the stats that led to those wrong assumptions. That way the system would seek the correct values automatically. We could also gather the stats that the system produces for certain actual databases and then use those to make smarter initial guesses. I've found that I can never predict costs. I always end up testing empirically and find myself surprised at the results. We should be able to make the executor smart enough to keep count of actual costs (or a statistical approximation) without introducing any significant overhead. tom lane also wrote: There is no cache flushing. We have a shared buffer cache management algorithm that's straight LRU across all buffers. There's been some interest in smarter cache-replacement code; I believe Neil Conway is messing around with an LRU-2 implementation right now. If you've got better ideas we're all ears. Hmmm, this is the area that I think could lead to huge performance gains. Consider a simple system with a table tbl_master that gets read by each process many times but with very infrequent inserts and that contains about 3,000 rows. The single but heavily used index for this table is contained in a btree with a depth of three, with 20 8K pages in the first two levels of the btree. Another table tbl_detail with 10 indices gets very frequent inserts. There are over 300,000 rows. Some queries result in index scans over the approximately 5,000 8K pages in the index. There is a 40M shared cache for this system. Every time a query which requires the index scan runs it will blow out the entire cache since the scan will load more blocks than the cache holds. Only blocks that are accessed while the scan is going will survive. LRU is bad, bad, bad! LRU-2 might be better but it seems like it still won't give enough priority to the most frequently used blocks. I don't see how it would do better for the above case. I once implemented a modified cache algorithm that was based on the clock algorithm for VM page caches. VM paging is similar to databases in that there is definite locality of reference and certain pages are MUCH more likely to be requested. The basic idea was to have a flag in each block that represented the access time in clock intervals. Imagine a clock hand sweeping across a clock; every access is like a tiny movement in the clock hand. Blocks that are not accessed during a sweep are candidates for removal. My modification was to use access counts to increase the durability of the more accessed blocks.
Each time a block is accessed it's flag is shifted left (up to a maximum number of shifts - ShiftN ) and 1 is added to it. Every so many cache accesses (and synchronously when the cache is full) a pass is made over each block, right shifting the flags (a clock sweep). This can also be done one block at a time each access so the clock is directly linked to the cache access rate. Any blocks with 0 are placed into a doubly linked list of candidates for removal. New cache blocks are allocated from the list of candidates. Accesses of blocks in the candidate list just removes them from the list. An index root node page would likely be accessed frequently enough so that all it's bits would be set so it would take ShiftN clock sweeps. This algorithm increased the cache hit ratio from 40% to about 90% for the cases I tested when compared to a simple LRU mechanism. The paging ratio is greatly dependent on the ratio of the actual database size to the cache size. The bottom line that it is very important to keep blocks that are frequently accessed in the cache. The top levels of large btrees are accessed many hundreds (actually a power of the number of keys in each page) of times more frequently than the leaf pages. LRU can be the worst possible algorithm for something like an index or table scan of large tables since it flushes a large number of potentially frequently accessed blocks in favor of ones that are very unlikely to be retrieved again. tom lane also wrote: This is an
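To make the mechanics concrete, the flag handling was roughly as follows. This is a from-memory sketch in C with invented names, not the original code:

    #define SHIFT_N 5                /* cap on left-shifts; bounds durability */

    typedef struct Block {
        unsigned flag;               /* decaying usage counter */
        /* ... page contents, free-list links, etc. ... */
    } Block;

    /* On every cache hit: shift left (bounded) and add 1.  A hot block
       accumulates bits that take several sweeps to decay away. */
    static void note_access(Block *b)
    {
        if (b->flag < (1u << SHIFT_N))
            b->flag <<= 1;
        b->flag |= 1;
    }

    /* One step of the clock sweep: decay the flag.  Returns 1 when the
       block reaches zero and should join the removal-candidate list. */
    static int sweep_one(Block *b)
    {
        b->flag >>= 1;
        return b->flag == 0;
    }

A block touched only once gets a flag of 1 and falls out after a single sweep, which is what makes one big scan cheap to evict, while a saturated index root page survives ShiftN or more sweeps.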
[HACKERS] Potential Large Performance Gain in WAL synching
I've been looking at the TODO lists and caching issues and think there may be a way to greatly improve the performance of the WAL. I've made the following assumptions based on my reading of the manual and the WAL archives since about November 2000:

1) WAL is currently fsync'd before commit succeeds. This is done to ensure that the D in ACID is satisfied.
2) The wait on fsync is the biggest time cost for inserts or updates.
3) fsync itself probably increases contention for file i/o on the same file, since some OS file system cache structures must be locked as part of fsync. Depending on the file system this could be a significant choke on total i/o throughput.

The issue is that there must be a definite record in durable storage for the log before one can be certain that a transaction has succeeded. I'm not familiar with the exact WAL implementation in PostgreSQL, but I am familiar with others, including ARIES II. However, it seems that it comes down to making sure that the write to the WAL log has been positively written to disk.

So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log and then use aio_write for all log writes? A transaction would simply do all the log writing using aio_write and block until the last log aio request has completed, using aio_waitcomplete. The call to aio_waitcomplete won't return until the log record has been written to the disk. Opening with O_DSYNC ensures that when the i/o completes the write has reached the disk, and aio_write with O_APPEND-opened files ensures that writes append in the order they are received; hence when the aio_write for the last log entry of a transaction completes, the transaction can be sure that its log records are in durable storage (IDE problems aside).

It seems to me that this would:

1) Preserve the required D semantics.
2) Allow transactions to complete and do work while other threads are waiting on the completion of the log write.
3) Obviate the need for commit_delay, since there is no blocking, and the file system and the disk controller can put multiple writes to the log together as the drive is waiting for the end of the log file to come under one of the heads.

Here are the relevant TODOs:

Delay fsync() when other backends are about to commit too [fsync]
Determine optimal commit_delay value
Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options
Allow multiple blocks to be written to WAL with one write()

Am I missing something?

Curtis Faith
Principal
Galt Capital, LLP
http://www.galtcapital.com
12 Wimmelskafts Gade, Post Office Box 7549
Charlotte Amalie, St. Thomas, United States Virgin Islands 00801
voice: 340.776.0144 / fax: 340.776.0244 / cell: 340.643.5368
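P.S. In code, the commit path I have in mind looks roughly like the following. This is an untested sketch: I use the portable aio_suspend where FreeBSD would allow aio_waitcomplete, and a real version would need one control block per in-flight write rather than a single static one.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>

    static int wal_fd;
    static struct aiocb last_cb;     /* sketch only; see caveat above */

    /* O_DSYNC: a completed write is known to be in durable storage.
       O_APPEND: aio writes append in the order the calls were made. */
    int wal_open(const char *path)
    {
        wal_fd = open(path, O_WRONLY | O_APPEND | O_DSYNC);
        return wal_fd < 0 ? -1 : 0;
    }

    /* Issue a log record without blocking; remember the request. */
    int wal_append(void *rec, size_t len)
    {
        memset(&last_cb, 0, sizeof(last_cb));
        last_cb.aio_fildes = wal_fd;
        last_cb.aio_buf = rec;
        last_cb.aio_nbytes = len;
        return aio_write(&last_cb);
    }

    /* At commit: block only until the commit record's write is done. */
    int wal_commit_wait(void)
    {
        const struct aiocb *list[1] = { &last_cb };
        while (aio_error(&last_cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        return aio_error(&last_cb) == 0 ? 0 : -1;
    }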
Re: [HACKERS] Potential Large Performance Gain in WAL synching
tom lane replies: Curtis Faith [EMAIL PROTECTED] writes: So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log and then use aio_write for all log writes? We already offer an O_DSYNC option. It's not obvious to me what aio_write brings to the table (aside from loss of portability). You still have to wait for the final write to complete, no?

Well, for starters, by the time the write that includes the commit log entry completes, much of the transaction's log will already be on disk, or in a controller on its way there. I don't see any O_NONBLOCK or O_NDELAY references in the sources, so it looks like the log writes are blocking. If I read correctly, XLogInsert calls XLogWrite, which calls write, which blocks. If these assumptions are correct, there should be some significant gain here, but I won't know how much until I try to change it.

This issue only affects the speed of a given back-end's transaction processing. The REAL issue, and the one that will greatly affect total system throughput, is contention on the file locks. Since fsync needs to obtain a write lock on the file descriptor, as do the write calls that originate from XLogWrite as the writes go to disk, other back-ends will block while another transaction is committing if the log cache fills to the point where their XLogInsert results in an XLogWrite call to flush the log cache. I'd guess this means that one won't gain much by adding back-end processes past three or four if there are a lot of inserts or updates.

The method I propose does not result in any blocking because of writes, other than the final commit's write, and it has the very significant advantage of allowing other transactions (from other back-ends) to continue until they enter commit (then blocking while waiting for their final commit write to complete).

tom lane also replies: 2) Allow transactions to complete and do work while other threads are waiting on the completion of the log write. I'm missing something. There is no useful work that a transaction can do between writing its commit record and reporting completion, is there? It has to wait for that record to hit disk.

The key here is that a thread (I should have said a back-end or process) that has not committed, and therefore is not blocked, can do work while other back-ends are waiting on their commit writes.

- Curtis

P.S. If I am right in my assumptions about the way the current system works, I'll bet the change would speed up inserts in Shridhar's huge database test by at least a factor of two or three, perhaps even an order of magnitude. :-)
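P.P.S. For the pipelining point above, the shape I have in mind is a small ring of aio control blocks, so several log writes can be in flight at once and a committing back-end waits only for the request carrying its own commit record. A sketch with invented names, not tested:

    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    #define RING_SIZE 32

    static struct aiocb ring[RING_SIZE];
    static int issued[RING_SIZE];    /* slot has carried a request */
    static int ring_next = 0;

    /* Queue one log record; waits only in the rare case that the
       oldest slot in the ring still has a write in flight. */
    struct aiocb *log_write_async(int fd, void *rec, size_t len)
    {
        int i = ring_next;
        struct aiocb *cb = &ring[i];
        ring_next = (ring_next + 1) % RING_SIZE;

        if (issued[i]) {
            const struct aiocb *list[1] = { cb };
            while (aio_error(cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
            issued[i] = 0;
        }

        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf = rec;
        cb->aio_nbytes = len;
        if (aio_write(cb) != 0)
            return NULL;
        issued[i] = 1;
        return cb;                   /* the commit code waits on this */
    }

    /* Commit: wait for the one request holding our commit record. */
    int log_wait_for(struct aiocb *cb)
    {
        const struct aiocb *list[1] = { cb };
        while (aio_error(cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        return aio_error(cb) == 0 ? 0 : -1;
    }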
Re: [HACKERS] Advice: Where could I be of help?
I wrote: My modification was to use access counts to increase the durability of the more accessed blocks.

tom lane replies: You could do it that way too, but I'm unsure whether the extra complexity will buy anything. Ultimately, I think an LRU-anything algorithm is equivalent to a clock sweep for those pages that only get touched once per some-long-interval: the single-touch guys get recycled in order of last use, which seems just like a clock sweep around the cache. The guys with some amount of preference get excluded from the once-around sweep. To determine whether LRU-2 is better or worse than some other preference algorithm requires a finer grain of analysis than this. I'm not a fan of more complex must be better, so I'd want to see why it's better before buying into it ...

I'm definitely not a fan of "more complex must be better" either. In fact, it's surprising how often the real performance problems are easy to fix and simple, while many person-years are spent solving the issue everyone knows must be causing the performance problems, only to find little gain. The key here is empirical testing. If the cache hit ratio for LRU-2 is much better, then there may be no need here. OTOH, it took less than 30 lines or so of code to do what I described, so I don't consider it too, too complex :=}

We should run a test which includes scanning indexes (or is indices the PostgreSQL convention?) that are three or more times the size of the cache to see how well LRU-2 works. Is there any cache performance reporting built into pgsql?

tom lane wrote: Shouldn't the OS be responsible for scheduling those writes appropriately? Ye good olde elevator algorithm ought to handle this; and it's at least one layer closer to the actual disk layout than we are, thus more likely to issue the writes in a good order. It's worth experimenting with, perhaps, but I'm pretty dubious about it.

I wasn't proposing anything other than changing the order of the writes, not actually ensuring that they get written that way at the level you describe above (a trivial sketch of what I mean appears at the end of this note). This will help a lot on brain-dead file systems that can't do this ordering, and probably also in cases where the number of blocks in the cache is very large.

On a related note, while looking at the code, it seems to me that we are writing out the buffer cache synchronously, so there won't be any possibility of the file system reordering anyway. This appears to be a huge performance problem. I've read claims in the archives that the buffers are written asynchronously, but my read of the code says otherwise. Can someone point out my error? I only see calls that ultimately call FileWrite or write(2), which will block without an O_NONBLOCK open. I thought one of the main reasons for having a WAL is so that you can write out the buffers asynchronously. What am I missing?

I wrote: Then during execution if the planner turned out to be VERY wrong about certain assumptions the execution system could update the stats that led to those wrong assumptions. That way the system would seek the correct values automatically.

tom lane replied: That has been suggested before, but I'm unsure how to make it work. There are a lot of parameters involved in any planning decision and it's not obvious which ones to tweak, or in which direction, if the plan turns out to be bad. But if you can come up with some ideas, go to it!

I'll have to look at the current planner before I can suggest anything concrete.
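Returning to the write-ordering point, what I mean is nothing fancier than a pass like this before the flush; illustrative only, with made-up structures:

    #include <stdlib.h>

    typedef struct DirtyBuf {
        int   rel_fd;                /* which relation file */
        long  block_no;              /* block within that file */
        char *page;
    } DirtyBuf;

    static int cmp_dirty(const void *a, const void *b)
    {
        const DirtyBuf *x = a, *y = b;
        if (x->rel_fd != y->rel_fd)
            return x->rel_fd < y->rel_fd ? -1 : 1;
        if (x->block_no != y->block_no)
            return x->block_no < y->block_no ? -1 : 1;
        return 0;
    }

    /* Sort dirty buffers so the write() calls reach the kernel in
       file/block order, even on file systems that never reorder. */
    void flush_in_order(DirtyBuf *bufs, size_t n)
    {
        qsort(bufs, n, sizeof(DirtyBuf), cmp_dirty);
        /* ... then issue the writes in this order ... */
    }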
- Curtis
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I wrote: The REAL issue and the one that will greatly affect total system throughput is that of contention on the file locks. Since fsync needs to obtain a write lock on the file descriptor, as do the write calls which originate from XLogWrite as the writes go to disk, other back-ends will block while another transaction is committing if the log cache fills to the point where their XLogInsert results in an XLogWrite call to flush the log cache.

tom lane wrote: But that's exactly *why* we have a log cache: to ensure we can buffer a reasonable amount of log data between XLogFlush calls. If the above scenario is really causing a problem, doesn't that just mean you need to increase wal_buffers?

Well, in cases where there are a lot of small transactions, the contention will come not from XLogWrite calls made because the cache is full, but from XLogWrite calls made at transaction commit, which will happen very frequently. At 1,000 small transactions per second, for example, there will be on the order of 1,000 commit-driven flushes per second no matter how large wal_buffers is. I think this will have a detrimental effect on very high update frequency performance. So while a larger WAL cache will help when flushes happen because the cache is full, I don't think it will make any difference for the potentially more common case of flushes at transaction commit.

- Curtis
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Bruce Momjian wrote: I may be missing something here, but other backends don't block while one writes to WAL. Remember, we are process based, not thread based, so the write() call only blocks the one session. If you had threads, and you did a write() call that blocked other threads, I can see where your idea would be good, and where async i/o becomes an advantage.

I don't think they'll block until they get to the fsync or XLogWrite call while another transaction is fsync'ing. I'm no Unix filesystem expert, but I don't see how the OS can handle multiple writes and fsyncs to the same file descriptors without blocking other processes from writing at the same time. It may be that there are some clever data structures they use, but I've not seen huge praise for most of the file systems. A well-written file system could minimize this contention, but I'll bet it's there with most of the ones that PostgreSQL most commonly runs on. I'll have to write a test and see if there really is a problem.

- Curtis
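P.S. The test I have in mind is along these lines: several processes appending and fsync'ing the same file, timed externally (e.g. with time(1)) as WRITERS is varied. A sketch; the sizes and counts are arbitrary:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define WRITERS 4
    #define ITERS   1000

    int main(void)
    {
        int fd = open("contend.dat", O_WRONLY | O_APPEND | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Each child does the write+fsync pattern of a committing
           back-end.  If per-process throughput falls off sharply as
           WRITERS grows, the contention is real. */
        for (int w = 0; w < WRITERS; w++) {
            if (fork() == 0) {
                char buf[8192];
                memset(buf, w, sizeof(buf));
                for (int i = 0; i < ITERS; i++) {
                    if (write(fd, buf, sizeof(buf)) < 0) perror("write");
                    if (fsync(fd) < 0) perror("fsync");
                }
                _exit(0);
            }
        }
        for (int w = 0; w < WRITERS; w++)
            wait(NULL);
        return 0;
    }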
[HACKERS] Advice: Where could I be of help?
All, I'd like to help work on some 7.4 features; however, since you've not seen my name before, I'm obviously new to the list and the org.

I really like working on speed optimizations and rewrites. I have 15 years' experience with C++-based systems and databases, and have worked on commercial database engines (i.e. indexing and query execution systems), SQL execution and optimization, and various lex- and yacc-based compilers and parsers. I've generally been able to get code to perform as well as or better than competitive systems with similar functionality, and have usually been able to beat other code by 3 to 10X. My Unix experience is reasonable, but I'm not an expert.

Any suggestions for where to start? I don't mind digging into very hairy code or large problems. I'm willing to run the risk of a patch not being accepted (for large changes), since I'll make sure whatever I do is well known to those who will do the accept/deny, and the approach approved of ahead of time. Since I'm new here, I'm thinking a problem that would not otherwise get handled by the experienced group would be the best place to start. Where is the system especially slow? I've read the TODOs and the last five months of the archives for this list, so I have some general ideas.

I've also had a lot of experience marketing to I.T. organizations, so I'd be happy to help out on the "Product Marketing" for PostgreSQL advocacy, i.e. developing a marketing strategy, press releases, etc.

- Curtis

Curtis Faith
Principal
Galt Capital, LLP
http://www.galtcapital.com
12 Wimmelskafts Gade, Post Office Box 7549
Charlotte Amalie, St. Thomas, United States Virgin Islands 00801
voice: 340.776.0144 / fax: 340.776.0244 / cell: 340.643.5368
FW: [HACKERS] Advice: Where could I be of help?
Forgot to cc' the list.

-----Original Message-----
From: Curtis Faith
Sent: Wednesday, October 02, 2002 10:59 PM
To: Tom Lane
Subject: RE: [HACKERS] Advice: Where could I be of help?

Tom, here are the things that I think look interesting:

1) Eliminate unchanged column indices: "Prevent index uniqueness checks when UPDATE does not modify the column" - A small task that will make a noticeable improvement. I've done this before in a b* tree system; it had a huge impact. Should be pretty isolated. (A sketch of the check I have in mind follows at the end of this message.)

2) "Use indexes for min() and max() or convert to SELECT col FROM tab ORDER BY col DESC LIMIT 1 if appropriate index exists and WHERE clause acceptable" - This will probably be a little more involved, but I've done this exact optimization in a SQL system 6 or 7 years ago.

3) General cache and i/o optimization: "Use bitmaps to fetch heap pages in sequential order" - Based on my reading of the emails in [performance], it appears to me that there might be huge potential in the caching system. I've worked on these caches, and there are some very non-intuitive interactions between database-style access and file systems that I believe offer good potential for improvement. I'm basing this assessment on the assumption that the sorts of improvements discussed in the [performance] emails have not been added in subsequent releases. Where does the current code stand? How are we currently doing cache flushing in general, and for indices in particular?

4) General index improvements, including: "Order duplicate index entries by tid for faster heap lookups" and "Add FILLFACTOR to btree index creation" - I've done the first one before, and fill factor is pretty easy as well.

5) Bitmaps: "Implement a bitmap index" and "Use bitmaps to combine existing indexes" - I've done something similar; it looks pretty interesting.

6) "Improve concurrency of hash indexes" (Neil Conway) - Probably more exploration than implementation, and a fairly isolated problem.

Based on past experience, from a bang-for-buck perspective, I'd probably do these in numerical order. What do you think? I know what I like and can do, but I don't really know enough about PostgreSQL's performance weaknesses yet. What are we getting killed on?

- Curtis

-----Original Message-----
From: Tom Lane
Sent: Wednesday, October 02, 2002 6:55 PM
To: Curtis Faith
Subject: Re: [HACKERS] Advice: Where could I be of help?

Bruce Momjian [EMAIL PROTECTED] writes: I would read the developers corner stuff, the developers FAQ, pick a TODO item, and try a patch. It's that simple.

Yup. I'd also suggest starting with something relatively small and localized (the nearby suggestion to fix IN/EXISTS, for example, is probably not a good first project --- and anyway I was going to work on that myself this month ;-)). Neil Conway's thought of working on plpgsql seems a good one to me; and as he says, there's lots of other possibilities. What do you find interesting in the TODO list?

regards, tom lane
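P.S. For item 1, the check I have in mind is roughly the following. Tuple, Index, and datum_equal are illustrative stand-ins, not PostgreSQL's actual structures:

    #include <stdbool.h>

    typedef struct Tuple Tuple;              /* opaque stand-in */
    typedef struct Index {
        int  nkeys;
        int *key_cols;                       /* indexed attribute numbers */
    } Index;

    /* Stand-in for a type-aware comparison of one column's value. */
    extern bool datum_equal(const Tuple *a, const Tuple *b, int col);

    /* On UPDATE: if no column of this index changed between the old
       and new tuples, the uniqueness check can be skipped. */
    bool index_columns_changed(const Index *idx,
                               const Tuple *oldt, const Tuple *newt)
    {
        for (int i = 0; i < idx->nkeys; i++)
            if (!datum_equal(oldt, newt, idx->key_cols[i]))
                return true;                 /* changed: must re-check */
        return false;                        /* unchanged: safe to skip */
    }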