Re: [HACKERS] Vista/IPv6
On Thursday 12 April 2007 00:56, you wrote:
> IIRC, cmake is a replacement for make, not for configure. Or did I miss something?

CMake also has configure functionality.

> And it would require rewriting all the unix makefiles in cmake format, and it was one of the major requirements for this undertaking that this would not be required.

The CMakeLists.txt files can coexist with the current Makefiles. You could just do out-of-tree builds to keep from overwriting the original Makefiles during the generation. Of course, generating MSVC project files wouldn't overwrite the Makefiles in any case.

wt
--
Warren Turkal (w00t)

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Vista/IPv6
On Fri, Apr 13, 2007 at 12:44:41AM -0600, Warren Turkal wrote:
> On Thursday 12 April 2007 00:56, you wrote:
> > IIRC, cmake is a replacement for make, not for configure. Or did I miss something?
>
> CMake also has configure functionality.
>
> > And it would require rewriting all the unix makefiles in cmake format, and it was one of the major requirements for this undertaking that this would not be required.
>
> The CMakeLists.txt files can coexist with the current Makefiles. You could just do out-of-tree builds to keep from overwriting the original Makefiles during the generation. Of course, generating MSVC project files wouldn't overwrite the Makefiles in any case.

That would mean maintaining two sets of build files. That's never going to work. The only option if we were to go with that is, IMHO, to move *everything* over to it. And I don't think you'll get a lot of people supporting that ;)

//Magnus
Re: [HACKERS] Eliminating unnecessary left joins
I have this exact problem a lot. There are actually cases where you can eliminate regular joins, not just left joins. For example:

CREATE TABLE partner (
    id serial,
    name varchar(40) not null,
    primary key (id)
);

CREATE TABLE project (
    id serial,
    name varchar(40) not null,
    partner_id integer not null references project (id)
                                            ^^^ -- I assume typo, should be partner
);

CREATE VIEW project_view AS
    SELECT p.id, p.name, p.partner_id, pp.name AS partner
    FROM project p, partner pp
    WHERE p.partner_id = pp.id;

Same advice to you:

1. add not null to your id's
2. CREATE VIEW project_view AS
       SELECT p.id, p.name, p.partner_id, pp.name AS partner
       FROM project p LEFT OUTER JOIN partner pp ON p.partner_id = pp.id;
3. wait for (or implement :-) the left join optimization in pg

Andreas
Re: [HACKERS] conflicting gettimeofday with MinGW
On Wed, Apr 11, 2007 at 06:17:19PM +0200, Zeugswetter Andreas ADI SD wrote:
> > > That page is ages out of date. The intended sync is apparently broken. The current download area is on sourceforge: http://sf.net/project/showfiles.php?group_id=2435
> >
> > *sigh* And what is in 3.12, which is apparently the current version?
>
> Sorry, that was implied. sys/time.h did not change between 3.10 and 3.12. There is no #define to remove the declaration. In cvs the file is marked as dead and gives a reference to cygwin's w32api. Maybe the answer is to simply note that the mingw-runtime binary from sourceforge above 3.9 has a broken sys/time.h (it is not the file from cygwin but the dead one from cvs).

I think that sounds reasonable. Something for the mingw FAQ. Care to provide Bruce with a patch?

//Magnus
Re: [HACKERS] Vista/IPv6
On Thu, Apr 12, 2007 at 08:46:09PM +0900, Hiroshi Saito wrote:
> > > src/include/pg_config.h.win32
> > > /* Define to 1 if you have support for IPv6. */
> > > // #define HAVE_IPV6 1
> > > What do you think?
> >
> > It's defined in the msvc build script, see http://archives.postgresql.org/pgsql-committers/2007-04/msg00148.php
>
> That define is for win32.mak, which is outside the msvc build tool.

I see. But does it work when built with MSVC6? IIRC, MSVC6 pre-dates Windows 2000 and the Windows IPv6 support. Can you verify that it works if you manually add this #define and build with MSVC6?

//Magnus
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
> > Yup, this is a good summary. You say you need to remove the optimization that avoids the logging of a new tuple because the full page image exists. I think we must already have the info in WAL which tuple inside the full page image is new (the one for which we avoided the WAL entry). How about this: leave current WAL as it is and only add the "not removable" flag to full pages. pg_compresslog then replaces the full page image with a record for the one tuple that is changed. I tend to think it is not worth the increased complexity only to save bytes in the uncompressed WAL though.
>
> It is essentially what my patch proposes. My patch includes a flag on full page writes which can be removed.

Ok, a flag that marks full page images that can be removed is perfect. But you also turn off the optimization that avoids writing regular WAL records when the info is already contained in a full-page image (increasing the uncompressed size of WAL). It was that part I questioned. As already stated, maybe I should not have, because it would be too complex to reconstruct a regular WAL record from the full-page image. But that code would also be needed for WAL based partial replication, so if it were too complicated we would eventually want a switch to turn off the optimization anyway (at least for heap page changes).

> > Another point about pg_decompresslog: why do you need a pg_decompresslog? Imho pg_compresslog should already do the replacing of the full_page with the dummy entry. Then pg_decompresslog could be a simple gunzip, or whatever compression was used, but no logic.
>
> Just removing full page writes does not work. If we shift the rest of the WAL, then LSN becomes inconsistent in the compressed archive logs which pg_compresslog produces. For recovery, we have to restore LSN as in the original WAL. pg_decompresslog restores removed full page writes as dummy records so that recovery redo functions won't be confused.

Ah sorry, I needed some pgsql/src/backend/access/transam/README reading. LSN is the physical position of records in WAL. Thus your dummy record size is equal to what you cut out of the original record. What about disconnecting WAL LSN from physical WAL record position during replay? Add simple short WAL records in pg_compresslog like: advance LSN by 8192 bytes.

Andreas
[HACKERS] where to write small reusable functions ?
Hi,

I'm working to implement a new feature in pg_dump: the ability to dump objects like functions, indexes... And I noticed that there are some useful functions like pg_malloc, pg_calloc... So I've added pg_free to avoid the if-not-null / free / point-to-NULL sequence, and now I'd like to add a function pg_strcat like this:

char *
pg_strcat(char *dest, char *src)
{
    /* pg_realloc is a safer function than realloc */
    dest = pg_realloc(dest, strlen(dest) + strlen(src) + 1);
    strcat(dest, src);
    return dest;
}

But, in that case, those functions are only usable for pg_dump; what about the rest of the code? Don't we have a central location for those small reusable snippets of code?

Regards,
.D.
Re: [HACKERS] where to write small reusable functions ?
Dany DeBontridder wrote:
> I'm working to implement a new feature in pg_dump: the ability to dump objects like functions, indexes...

pg_dump already dumps functions and indexes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] where to write small reusable functions ?
Hi,

> char *
> pg_strcat(char *dest, char *src)
> {
>     /* pg_realloc is a safer function than realloc */
>     dest = pg_realloc(dest, strlen(dest) + strlen(src) + 1);
>     strcat(dest, src);
>     return dest;
> }

Postgres already has something for the above functionality. See makeStringInfo, appendStringInfo.

Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] where to write small reusable functions ?
On Friday, 13 April 2007 14:28, Dany DeBontridder wrote:
> But, in that case, those functions are only usable for pg_dump; what about the rest of the code? Don't we have a central location for those small reusable snippets of code?

The main point of these functions is to catch errors and exit the program. But that behavior is very program-dependent, so I don't think it'd be useful to put them in a central location.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Re: [HACKERS] where to write small reusable functions ?
On 4/13/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:
> Dany DeBontridder wrote:
> > I'm working to implement a new feature in pg_dump: the ability to dump objects like functions, indexes...
>
> pg_dump already dumps functions and indexes.

Right, but you can't dump only one or two functions, or only the functions and nothing else. (The same for indexes, triggers...)

D.
Re: [HACKERS] [PATCHES] Reviewers Guide to Deferred Transactions/Transaction Guarantee
I agree with Tom's reasoning about the suggested simplifications, sorry.

> > 3. Should the WALWriter also do the wal_buffers half-full write at the start of XLogInsert()?
>
> That should go away entirely; to me the main point of the separate wal-writer process is to take over responsibility for not letting too many dirty wal buffers accumulate.

That also sounds a lot simpler, but I think Bruce wanted to be able to give some time guarantee to the transactions that are not waiting for fsync. When a commit only half-fills the page and no more WAL comes in for a long time, there is only the WALWriter to do the IO. The WALWriter would need to flush a half-full page after the timeout only if it contains a commit record.

One more question on autocommit: do we wait for a flush for an autocommitted DML? Seems we generally should not.

Andreas
Re: [HACKERS] conflicting gettimeofday with MinGW
Magnus Hagander wrote:
> > > > That page is ages out of date. The intended sync is apparently broken. The current download area is on sourceforge: http://sf.net/project/showfiles.php?group_id=2435
> > >
> > > *sigh* And what is in 3.12, which is apparently the current version?
> >
> > Sorry, that was implied. sys/time.h did not change between 3.10 and 3.12. There is no #define to remove the declaration. In cvs the file is marked as dead and gives a reference to cygwin's w32api. Maybe the answer is to simply note that the mingw-runtime binary from sourceforge above 3.9 has a broken sys/time.h (it is not the file from cygwin but the dead one from cvs).
>
> I think that sounds reasonable. Something for the mingw FAQ. Care to provide Bruce with a patch?

No, that means that we can't build with currently released Mingw, AIUI. I think the solution is probably to rename our replacement gettimeofday and add a #define to translate the calls. That's pretty much what we've done in other cases, and it has seemed to be a more robust strategy than supplying replacements for standard library routines with routines of the same name.

cheers
andrew
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Zeugswetter Andreas ADI SD [EMAIL PROTECTED] writes:
> But you also turn off the optimization that avoids writing regular WAL records when the info is already contained in a full-page image (increasing the uncompressed size of WAL). It was that part I questioned.

That's what bothers me about this patch, too. It will be increasing the cost of writing WAL (more data means more CRC computation and more I/O, not to mention more contention for the WAL locks), which translates directly to a server slowdown.

The main arguments that I could see against Andreas' alternative are:

1. Some WAL record types are arranged in a way that actually would not permit the reconstruction of the short form from the long form, because they throw away too much data when the full-page image is substituted. An example that's fresh in my mind is that the current format of the btree page split WAL record discards newitemoff in that case, so you couldn't identify the inserted item in the page image. Now this is only saving two bytes in what's usually going to be a darn large record anyway, and it complicates the code to do it, so I wouldn't cry if we changed btree split to include newitemoff always. But there might be some other cases where more data is involved. In any case, someone would have to look through every single WAL record type to determine whether reconstruction is possible and fix it if not.

2. The compresslog utility would have to have specific knowledge about every compressible WAL record type, to know how to convert it to the short format. That means an ongoing maintenance commitment there. I don't think this is unacceptable, simply because we need only teach it about a few common record types, not everything under the sun --- anything it doesn't know how to fix, just leave alone, and if it's an uncommon record type it really doesn't matter. (I guess that means that we don't really have to do #1 for every last record type, either.)

So I don't think either of these is a showstopper. Doing it this way would certainly make the patch more acceptable, since the argument that it might hurt rather than help performance in some cases would go away.

> What about disconnecting WAL LSN from physical WAL record position during replay? Add simple short WAL records in pg_compresslog like: advance LSN by 8192 bytes.

I don't care for that, as it pretty much destroys some of the more important sanity checks that xlog replay does. The page boundaries need to match the records contained in them. So I think we do need to have pg_decompresslog insert dummy WAL entries to fill up the space saved by omitting full pages.

regards, tom lane
Re: [HACKERS] Group Commit
On Tue, 2007-04-10 at 11:40 +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas [EMAIL PROTECTED] writes:
> > > I've been working on the patch to enhance our group commit behavior. The patch is a dirty hack at the moment, but I'm settled on the algorithm I'm going to use and I know the issues involved.
> >
> > One question that just came to mind is whether Simon's no-commit-wait patch doesn't fundamentally alter the context of discussion for this.

I was certainly intending that it would.

> > Aside from the prospect that people won't really care about group commit if they can just use the periodic-WAL-sync approach, ISTM that one way to get group commit is to just make everybody wait for the dedicated WAL writer to write their commit record. With a sufficiently short delay between write/fsync attempts in the background process, won't that net out at about the same place as a complicated group-commit patch?
>
> Possibly. To get efficient group commit there would need to be some kind of signaling between the WAL writer and normal backends. I think there is some in the patch, but I'm not sure if it gives efficient group commit. A constant delay will just give us something similar to commit_delay.

Agreed.

> I've refrained from spending time on group commit until the commit-no-wait patch lands, because it's going to conflict anyway. I'm starting to feel we should not try to rush group commit into 8.3, unless it somehow falls out of the commit-no-wait patch by accident, given that we're past feature freeze and coming up with a proper group commit algorithm would need a lot of research and testing. Better to do it for 8.4 with more time; we've got enough features on our plate for 8.3 anyway.

My feeling was that I couldn't get both done for 8.3, and that including the WAL Writer in 8.3 would make the dev path clearer for a later attempt upon group commit. I think it was worth exploring whether it would be easy, but I think we can see it'll take a lot of work to make it fly right.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [PATCHES] Reviewers Guide to Deferred Transactions/TransactionGuarantee
On Thu, 2007-04-12 at 15:56 -0400, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > transaction_guarantee.v11.patch

Thanks for the review.

> I can't help feeling that this is enormously overcomplicated.

I agree with all but one of your comments, see below.

> The DFC in particular seems to not be worth its overhead. Why wouldn't we simply track the newest commit record at all times, and then whenever the wal writer wakes up, it would write/fsync that far (or write/fsync all completed WAL pages, if there's no new commit record to worry about)? The other interesting issue is not letting hint-bit updates get to disk in advance of the WAL flush, but I don't see a need to track those at a per-transaction level: just advance the page LSN to the latest commit record any time a hint bit is updated. The commit will likely be flushed before we'd be interested in writing the buffer out anyway. Moreover, the way you are doing it creates a conflict in that the DFC has to guarantee to remember every unflushed transaction, whereas it really needs to be just an approximate cache for its performance to be good.

I've spent a few hours thinking on this and I'm happy with it now. The lure of removing that much code is too strong to resist; it's certainly easier to remove code after freeze than it is to add it. Advancing the LSN too far was a worry of mine, but we have the code now to cope if that shows to be a problem in testing. So let's strip that out.

> I see the concern about not letting clog pages go to disk before the corresponding WAL data is flushed, but that could be handled much more simply: just force a flush through the newest commit record before any write of a clog page. Those writes are infrequent enough (every 32K transactions or one checkpoint) that this seems not a serious problem.

This bit I'm not that happy with. You're right that it's fairly infrequent, but the clog pages are typically written when we extend the clog. That happens while holding XidGenLock and ProcArrayLock, so holding those across an additional (and real) I/O is going to make that blockage worse. We've been to great pains in other places to remove logjams, and we know that the follow-on effects of logjams are not swift to clear when the system is running at full load on multiple-CPU systems. The code to implement this is pretty clean: a few extra lines in clog/slru and bubbled-up API changes. I was actually thinking of adding something to the bgwriter to clean the LRU block of the clog, if it was dirty, once per cycle, to further reduce the possibility of I/O at that point.

> AFAIK there is no need to associate any forced flush with multixacts; there is no state saved across crashes for those anyway.

Agreed.

> I don't see a point in allowing the WAL writer to be disabled --- I believe it will be a performance win just like the bgwriter, independently of whether transaction_guarantee is used or not, by helping to keep down the number of dirty WAL buffers. That in turn allows some other simplifications, like not needing an assign hook for transaction_guarantee.

That would be pleasant. The other changes make hint bit setting need an LWLock request, so I wanted to include a way of saying "I never ever want to use transaction_guarantee = off". I see the beauty of your suggestion and agree. So keep the parameter, but let it default to 100ms? Range 10-1000ms?

> I disagree with your desire to remove the fsync parameter. It may have less use than before with this feature, but that doesn't mean it has none.

OK.

> > 3. Should the WALWriter also do the wal_buffers half-full write at the start of XLogInsert()?
>
> That should go away entirely; to me the main point of the separate wal-writer process is to take over responsibility for not letting too many dirty wal buffers accumulate.

Yes.

I'll make the agreed changes by next Wed/Thurs.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Vista/IPv6
Hi.

From: Magnus Hagander [EMAIL PROTECTED]
> I see. But does it work when built with MSVC6? IIRC, MSVC6 pre-dates Windows 2000 and the Windows IPv6 support. Can you verify that it works if you manually add this #define and build with MSVC6?

I don't have an IPv6 test environment at the moment. However, I do have a VC6 compile environment, and the build finishes fine. Also, win32.mak is not only for VC6; it will be used with VC 7.1 and VC8 as well, so it should support IPv6 at least. I confirmed that, with the IPv6 #define enabled and built with VC6, the IPv4 tests still pass. Therefore, you should enable it.

Regards,
Hiroshi Saito
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:
> Zeugswetter Andreas ADI SD [EMAIL PROTECTED] writes:
> > But you also turn off the optimization that avoids writing regular WAL records when the info is already contained in a full-page image (increasing the uncompressed size of WAL). It was that part I questioned.

I think it's right to question it, certainly.

> That's what bothers me about this patch, too. It will be increasing the cost of writing WAL (more data means more CRC computation and more I/O, not to mention more contention for the WAL locks) which translates directly to a server slowdown.

I don't really understand this concern. Koichi-san has included a parameter setting that would prevent any change at all in the way WAL is written. If you don't want this slight increase in WAL, don't enable it. If you do enable it, you'll also presumably be compressing the xlog too, which works much better than gzip while using less CPU. So overall it saves more than it costs, ISTM, and nothing at all if you choose not to use it.

> The main arguments that I could see against Andreas' alternative are:
>
> 1. Some WAL record types are arranged in a way that actually would not permit the reconstruction of the short form from the long form, because they throw away too much data when the full-page image is substituted. An example that's fresh in my mind is that the current format of the btree page split WAL record discards newitemoff in that case, so you couldn't identify the inserted item in the page image. Now this is only saving two bytes in what's usually going to be a darn large record anyway, and it complicates the code to do it, so I wouldn't cry if we changed btree split to include newitemoff always. But there might be some other cases where more data is involved. In any case, someone would have to look through every single WAL record type to determine whether reconstruction is possible and fix it if not.
>
> 2. The compresslog utility would have to have specific knowledge about every compressible WAL record type, to know how to convert it to the short format. That means an ongoing maintenance commitment there. I don't think this is unacceptable, simply because we need only teach it about a few common record types, not everything under the sun --- anything it doesn't know how to fix, just leave alone, and if it's an uncommon record type it really doesn't matter. (I guess that means that we don't really have to do #1 for every last record type, either.)
>
> So I don't think either of these is a showstopper. Doing it this way would certainly make the patch more acceptable, since the argument that it might hurt rather than help performance in some cases would go away.

Yeah, it's additional code paths, but it sounds like Koichi-san and colleagues are going to be trail-blazing any bugs there and will be around to fix any more that emerge.

> > What about disconnecting WAL LSN from physical WAL record position during replay? Add simple short WAL records in pg_compresslog like: advance LSN by 8192 bytes.
>
> I don't care for that, as it pretty much destroys some of the more important sanity checks that xlog replay does. The page boundaries need to match the records contained in them. So I think we do need to have pg_decompresslog insert dummy WAL entries to fill up the space saved by omitting full pages.

Agreed. I don't want to start touching something that works so well.

We've been thinking about doing this for at least 3 years now, so I don't see any reason to baulk at it now. I'm happy with Koichi-san's patch as-is, assuming further extensive testing will be carried out on it during beta.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Simon Riggs [EMAIL PROTECTED] writes:
> On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:
> > That's what bothers me about this patch, too. It will be increasing the cost of writing WAL (more data means more CRC computation and more I/O, not to mention more contention for the WAL locks) which translates directly to a server slowdown.
>
> I don't really understand this concern.

The real objection is that a patch that's alleged to make WAL smaller actually does the exact opposite. Now maybe you can buy that back downstream of the archiver --- after yet more added-on processing --- but it still seems that there's a fundamental misdesign here.

> Koichi-san has included a parameter setting that would prevent any change at all in the way WAL is written.

It bothers me that we'd need to have such a switch. That's just another way to shoot yourself in the foot, either by not enabling it (in which case applying pg_compresslog as it stands would actively break your WAL), or by enabling it when you weren't actually going to use pg_compresslog (because you misunderstood the documentation to imply that it'd make your WAL smaller by itself). What I want to see is a patch that doesn't bloat WAL at all and therefore doesn't need a switch. I think Andreas is correct to complain that it should be done that way.

regards, tom lane
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
On Fri, 2007-04-13 at 11:47 -0400, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:
> > > That's what bothers me about this patch, too. It will be increasing the cost of writing WAL (more data means more CRC computation and more I/O, not to mention more contention for the WAL locks) which translates directly to a server slowdown.
> >
> > I don't really understand this concern.
>
> The real objection is that a patch that's alleged to make WAL smaller actually does the exact opposite. Now maybe you can buy that back downstream of the archiver --- after yet more added-on processing --- but it still seems that there's a fundamental misdesign here.
>
> > Koichi-san has included a parameter setting that would prevent any change at all in the way WAL is written.
>
> It bothers me that we'd need to have such a switch. That's just another way to shoot yourself in the foot, either by not enabling it (in which case applying pg_compresslog as it stands would actively break your WAL), or by enabling it when you weren't actually going to use pg_compresslog (because you misunderstood the documentation to imply that it'd make your WAL smaller by itself). What I want to see is a patch that doesn't bloat WAL at all and therefore doesn't need a switch. I think Andreas is correct to complain that it should be done that way.

I agree with everything you say, because we already had *exactly* this discussion when the patch was originally submitted, with me saying everything you just said. After a few things were renamed to show their correct function and impact, I am now comfortable with this patch.

Writing lots of additional code simply to remove a parameter that *might* be mis-interpreted doesn't sound useful to me, especially when bugs may leak in that way. My take is that this is simple and useful *and* we have it now; other ways don't yet exist, nor will they in time for 8.3.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Simon Riggs [EMAIL PROTECTED] writes:
> Writing lots of additional code simply to remove a parameter that *might* be mis-interpreted doesn't sound useful to me, especially when bugs may leak in that way. My take is that this is simple and useful *and* we have it now; other ways don't yet exist, nor will they in time for 8.3.

The potential for misusing the switch is only one small part of the argument; the larger part is that this has been done in the wrong way and will cost performance unnecessarily. The fact that it's ready now is not something that I think should drive our choices. I believe that it would be possible to make the needed core-server changes in time for 8.3, and then to work on compress/decompress on its own time scale and publish it on pgfoundry, with the hope that it would be merged to contrib or core in 8.4. Frankly, the compress/decompress code needs work anyway before it could be merged (e.g., I noted a distinct lack of I/O error checking).

regards, tom lane
Re: [HACKERS] conflicting gettimeofday with MinGW
Andrew Dunstan wrote:
> > > In cvs the file is marked as dead and gives a reference to cygwin's w32api. Maybe the answer is to simply note that the mingw-runtime binary from sourceforge above 3.9 has a broken sys/time.h (it is not the file from cygwin but the dead one from cvs).
> >
> > I think that sounds reasonable. Something for the mingw FAQ. Care to provide Bruce with a patch?
>
> No, that means that we can't build with currently released Mingw, AIUI. I think the solution is probably to rename our replacement gettimeofday and add a #define to translate the calls. That's pretty much what we've done in other cases, and it has seemed to be a more robust strategy than supplying replacements for standard library routines with routines of the same name.

But if I read you right, they're removing it in the next version? That would indicate that they included it by mistake? (But sure, in general that's not a bad idea.)

//Magnus
Re: [PATCHES] [HACKERS] [Fwd: Index Advisor]
On Tue, 2007-04-10 at 12:18 -0700, Gurjeet Singh wrote:
> Also, although the whole plan-tree is available in get_relation_info(), it wouldn't be the right place to scan other tables, for e.g. generating JOIN-indexes or materializing some intermediate joins (sometime in the future we may support them!).

I like Tom's suggestion. We never thought actually creating the indexes was a very good thing, and I'd be happy to bury that idea for good. Speed is definitely a consideration if we are to re-plan thousands of SQL statements for a real workload.

> If we don't run the planner twice, then the developer will have to run it manually twice, and compare the costs manually (with and without v-indexes); virtually impossible for large applications and an introduction of another human-error possibility.

AFAICS Tom hasn't referred to running twice or not, so I'm not very sure what you're referring to, sorry. If you could answer Tom's suggestions one by one directly underneath them it would be easier to discuss things.

ISTM that you've done a great job; the trick is now to reach agreement and finish this. If there is something still to discuss, it needs to be very clearly tied back to Tom's comments so everyone can follow it, then agree it. If there is a problem in Tom's suggestions that directly affects the operation of the tool then we need to identify what that is. But if those hooks would give us all we need, then let's agree to it and fix up the adviser plug-in later.

We really, really, really need this. Lots.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
[HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)
Steve [EMAIL PROTECTED] writes: [ strange planner misbehavior in 8.2.3 ] After some off-list investigation (thanks, Steve, for letting me poke at your machine), the short answer is that the heuristics used by choose_bitmap_and() suck. The problem query is like select ... from ds where ds.receipt = '1998-12-30 0:0:0' and ds.encounter_id in ( ... 100 distinct values ... ); and the table has a truly remarkable variety of indexes on encounter_id, receipt, and combinations of them with other columns. The receipt condition is actually in effect a no-op, because all receipt dates are later than that, but because ineq_histogram_selectivity doesn't trust histogram data unreservedly we compute a selectivity of about 0.7 for it. That means that the indexes that cover both receipt and encounter_id are given a selectivity score just fractionally better than those involving encounter_id alone, and therefore they sort first in choose_bitmap_and's sort step, and the way that that routine is coded, only combinations of the very first index with other ones will be considered for a bitmap heap scan. So the possibility of using just the index on encounter_id alone is never considered, even though that alternative is vastly cheaper than the alternatives that are considered. (It happens that encounter_id is a low-order column in all the indexes that include receipt, and so these scans end up covering the whole index ... multiple times even. The cost estimation is fine --- the thing knows these are expensive --- what's falling down is the heuristic for which combinations of indexes to consider using in a bitmap scan.) The original coding of choose_bitmap_and involved a fuzzy comparison of selectivities, which would have avoided this problem, but we got rid of that later because it had its own problems. In fact, choose_bitmap_and has caused us enough problems that I'm thinking we need a fundamental rethink of how it works, rather than just marginal tweaks. 
If you haven't looked at this code before, the comments explain the idea well enough:

/*
 * choose_bitmap_and
 *		Given a nonempty list of bitmap paths, AND them into one path.
 *
 * This is a nontrivial decision since we can legally use any subset of the
 * given path set.  We want to choose a good tradeoff between selectivity
 * and cost of computing the bitmap.
 *
 * The result is either a single one of the inputs, or a BitmapAndPath
 * combining multiple inputs.
 */
...
/*
 * In theory we should consider every nonempty subset of the given paths.
 * In practice that seems like overkill, given the crude nature of the
 * estimates, not to mention the possible effects of higher-level AND and
 * OR clauses.  As a compromise, we sort the paths by selectivity.  We
 * always take the first, and sequentially add on paths that result in a
 * lower estimated cost.
 *
 * We also make some effort to detect directly redundant input paths, as
 * can happen if there are multiple possibly usable indexes.  (Another way
 * it can happen is that best_inner_indexscan will find the same OR join
 * clauses that create_or_index_quals has pulled OR restriction clauses
 * out of, and then both versions show up as duplicate paths.)  We
 * consider an index redundant if any of its index conditions were already
 * used by earlier indexes.  (We could use predicate_implied_by to have a
 * more intelligent, but much more expensive, check --- but in most cases
 * simple pointer equality should suffice, since after all the index
 * conditions are all coming from the same RestrictInfo lists.)
 *
 * You might think the condition for redundancy should be all index
 * conditions already used, not any, but this turns out to be wrong.
 * For example, if we use an index on A, and then come to an index with
 * conditions on A and B, the only way that the second index can be later
 * in the selectivity-order sort is if the condition on B is completely
 * non-selective.  In any case, we'd surely be drastically misestimating
 * the selectivity if we count the same condition twice.
 *
 * We include index predicate conditions in the redundancy test.  Because
 * the test is just for pointer equality and not equal(), the effect is
 * that use of the same partial index in two different AND elements is
 * considered redundant.  (XXX is this too strong?)
 *
 * Note: outputting the selected sub-paths in selectivity order is a good
 * thing even if we weren't using that as part of the selection method,
 * because it makes the short-circuit case in MultiExecBitmapAnd() more
 * likely to apply.
 */

One idea I thought about was to sort by index scan cost, using selectivity only as a tiebreaker for cost, rather than the other way around as is currently done. This seems fairly plausible because indexscans that are cheaper than other indexscans likely return fewer rows too, and so selectivity is already accounted for to some extent --- at least you can't have an enormously worse selectivity at lower cost, whereas Steve's example proves it doesn't work the other way. But I'm worried about breaking the reasoning about redundant indexes that's mentioned in the comments.
Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)
Tom Lane wrote: One idea I thought about was to sort by index scan cost, using selectivity only as a tiebreaker for cost, rather than the other way around as is currently done. This seems fairly plausible because indexscans that are cheaper than other indexscans likely return fewer rows too, and so selectivity is already accounted for to some extent --- at least you can't have an enormously worse selectivity at lower cost, whereas Steve's example proves it doesn't work the other way. But I'm worried about breaking the reasoning about redundant indexes that's mentioned in the comments. Another alternative that would respond to the immediate problem is to maintain the current sort order, but as we come to each index, consider using that one alone, and throw away whatever AND we might have built up if that one alone beats the AND-so-far. This seems more conservative, as it's unlikely to break any cases that work well now, but on the other hand it feels like plastering another wart atop a structure that's already rather rickety. Has anyone got any thoughts about the best way to do this? How about doing both: sort the indexes by index scan cost; then pick the first index on the list and start adding indexes when they lower the cost. When adding each index, consider it by itself against the already stacked indexes. If its cost is lower, put this index at the top of the list, and restart the algorithm (after the sorting step, of course). I think the concern about condition redundancy should be attacked separately. How about just comparing whether they have common prefixes of conditions? I admit I don't understand what would happen with indexes defined like (lower(A), B, C) versus (A, B), for example. -- Alvaro Herrera    http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org