Re: [PATCHES] Page at a time index scan
On Fri, 2006-05-05 at 18:04 -0400, Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: On Fri, 5 May 2006, Tom Lane wrote: BTW, I just realized another bug in the patch: btbulkdelete fails to guarantee that it visits every page in the index.

The first solution that occurs to me is to force page splits to choose the target page so that its blkno >= the original page's blkno during vacuum. It would cause the index to become more fragmented more quickly, which is bad but perhaps tolerable.

I have a sketch of a solution that doesn't require any change in page allocation behavior. Can anyone see any holes in this:

Assume that we have some way to recognize whether a page has been split since the current btbulkdelete scan started. (A split would mark both the original page and the new right-hand page as newly split.) When btbulkdelete arrives at a page, it need take no special action unless the page is newly split *and* its right-link points to a lower physical address. If that's true, then after vacuuming the page, follow its right-link and vacuum that page; repeat until arriving at a page that is either not newly split or is above the current location of the outer loop. Then return to the outer, sequential-scan loop.

We should also have btbulkdelete clear the newly-split marker whenever it processes a page; this ensures that no page is processed more than once, which is probably worth the possible extra I/O needed to clear the marker. (The cycles to re-scan a page are maybe not that important, but if we do reprocess pages we'll end up with a bogusly high tuple count. OTOH I'm not sure we can guarantee an accurate tuple count anyway, so maybe it doesn't matter.)

AFAICS, this works correctly even if the test for a newly-split page sometimes yields false positives; thinking a page is newly split when it is not might cost a little extra I/O but won't lead to wrong results. This reduces the requirements for the newly-split marking mechanism.
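The chase-then-clear loop sketched above can be modeled in a few lines of C. This is a standalone toy, not the real code: the page struct, block numbering, and function name below are invented for illustration, and real btree pages and btbulkdelete are far more involved. It does show, though, why clearing the marker as we go bounds the reprocessing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPAGES      6
#define INVALID_BLK (-1)

/* Hypothetical, heavily simplified page model: right_link is the blkno
 * of the right sibling, cycleid is the proposed newly-split marker
 * (the real BTPageOpaqueData differs).  visits is test instrumentation. */
typedef struct
{
    int      right_link;
    uint16_t cycleid;
    int      visits;
} IndexPage;

/* Outer sequential scan plus the proposed right-link chase: when a page
 * was split during this cycle and its right-link points below the outer
 * loop's position, follow the chain so tuples that moved onto an
 * already-passed page are still visited.  Clearing the marker on every
 * processed page keeps a chain from being chased twice. */
static void
bulkdelete_scan(IndexPage *pages, int npages, uint16_t cur_cycleid)
{
    for (int blkno = 0; blkno < npages; blkno++)
    {
        int cur = blkno;

        for (;;)
        {
            bool newly_split;
            int  next;

            pages[cur].visits++;                  /* "vacuum" this page */
            newly_split = (pages[cur].cycleid == cur_cycleid);
            pages[cur].cycleid = 0;               /* clear the marker */

            next = pages[cur].right_link;
            if (newly_split && next != INVALID_BLK && next < blkno)
                cur = next;                       /* chase the right-link */
            else
                break;                            /* back to outer loop */
        }
    }
}
```

With a split that lands its right half behind the scan position, the chase revisits that one page and then stops, because its marker has already been cleared.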
So, how do we do that marking? Noting that we have 16 bits we could use in BTPageOpaqueData without making that struct any bigger (because of alignment considerations), I'm thinking about a 16-bit counter for each index that is incremented at the start of each btbulkdelete operation and copied into the opaque data whenever a page is split. If the value's different from the current counter, the page definitely hasn't been split during btbulkdelete. There's a 1-in-65536 chance of a false positive, if the last split occurred some exact multiple of 65536 vacuums ago, but that's probably low enough to be acceptable. (We could reduce the odds of false positives in various ways, eg by saying that zero isn't a valid counter value and having btbulkdelete reset a page's counter to zero anytime it has to write the page anyway.)

I read your earlier post about needing to lock everything and spent some time thinking about this. The issue of needing to lock everything means that we would never be able to do a partial vacuum of an index, i.e. remove one page without a scan. I'm more concerned about making partial vacuum work than I am about speeding up an all-block vacuum.

I'd been mulling over the locking problems and came up with something that looks identical to your proposal *except* for the value that gets written into the BTPageOpaqueData on the right-hand page. My thinking was to write the blockid of the original left-hand page, so as to record the original ancestor since split. Thus if multiple splits occurred, then the original ancestor blockid would remain on record until VACUUM.

In more detail: when we split a page, if the ancestor blockid is not set, then we set it to be the blockid of the old page (new left-hand page). If the ancestor blockid is already set then we copy that to the new right-hand page. Every split will write a value to BTPageOpaqueData, though the values to use are already available without extra work.
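That marking rule is compact enough to sketch directly. The struct and function below are hypothetical simplifications (among other things, the real BTPageOpaqueData has no spare room for a full block id, as Tom notes later in the thread), but they show how the original ancestor survives a chain of splits until VACUUM clears it:

```c
#include <assert.h>

/* Hypothetical sketch of the ancestor-blockid proposal.  ancestor == 0
 * means "no ancestor recorded"; the field names and sizes are invented
 * for illustration and do not match the real BTPageOpaqueData. */
typedef struct
{
    int blkno;      /* physical block number of this page */
    int ancestor;   /* blkno of the pre-split ancestor, or 0 */
} PageOpaque;

/* At split time the new right-hand page records the original pre-split
 * ancestor: the left page itself on a first split since VACUUM, or the
 * left page's already-recorded ancestor on any later split. */
static void
record_split(const PageOpaque *left, PageOpaque *right)
{
    right->ancestor = (left->ancestor != 0) ? left->ancestor
                                            : left->blkno;
}
```

However many times the chain splits, every descendant points back at the same original block, which is what lets a partial vacuum find the whole group.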
AFAICS this doesn't change the ability to scan the index in physical order, so we still get the benefits of that approach. A *partial* vacuum of an index block can be achieved by visiting the block to be vacuumed, then following the link directly to the pre-split ancestor block (if there is one), then moving right again back to the target block. btbulkdelete always clears the marker when it cleans. This opens the door for the approach of notifying when a page is deletable and then having a background agent remove just that page, as proposed in a recent patch by Junji TERAMOTO. I'm *not* presuming we have an agreed solution for partial vacuuming, but I do want to keep that door open by implementing a locking protocol that could support it.

This still has the same sort of locking issues I complained of in regards to Heikki's idea, namely how do you make sure that everyone is using the new counter value before you start scanning? That seems soluble however. We'd just need to be
Re: [PATCHES] [PATCH] Magic block for modules
On Sun, May 07, 2006 at 08:21:43PM -0400, Tom Lane wrote: changes in any of the following: PG_VERSION_NUM, CATALOG_VERSION_NO, the size of 8 basic C types, BLCKSZ, NAMEDATALEN, HAVE_INT64_TIMESTAMP, INDEX_MAX_KEYS, FUNC_MAX_ARGS, VARHDRSZ, MAXDIM, the compiler used (only brand, not version).

That seems way overkill to me. FUNC_MAX_ARGS is good to check, but most of those other things are noncritical for typical add-on modules.

I was trying to find variables that when changed would make some things corrupt. For example, a changed NAMEDATALEN will make any use of the syscache a source of errors. A change in INDEX_MAX_KEYS will break the GiST interface, etc. I wondered about letting module writers select which parts are relevant to them but that just seems like handing people a footgun.

In particular I strongly object to the check on compiler. Some of us do use systems where gcc and vendor compilers are supposed to interoperate ... and aren't all those Windows compilers supposed to, too?

Maybe that's the case now; it didn't use to be. I seem to remember people having difficulties because they compiled the server with MinGW and the modules with VC++. I'll take it out though, it's not like it costs anything.

AFAIK it's considered the linker's job to prevent loading 32-bit code into a 64-bit executable or vice versa, so I don't think we need to be checking for common assumptions about sizeof(long).

I know ELF headers contain some of this info, and unix in general doesn't try to allow different bit sizes in one binary. Windows used to have (maybe still has) a mechanism to allow 32-bit code to call 16-bit libraries. Do they allow the same for 64-bit libs?

I'm pretty sure we had agreed that magic blocks should be required; otherwise this check will accomplish little.

Sure, I just didn't want to break every module in one weekend. I was thinking of adding it with LOG level now, and sending a message on -announce saying that at the beginning of the 8.2 freeze it will be an ERROR.
Give people time to react. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.
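The mechanism being debated can be sketched as a standalone toy: the module bakes a struct of compile-time constants into its binary, and the loader compares it against the server's own copy before accepting the module. The struct layout, field names, and the MAKE_MAGIC macro below are invented for illustration; the actual patch's Pg_magic_struct differs:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical simplification of a magic block.  Leading with the
 * struct's own size gives a cheap first-line check that both sides
 * even agree on the set of fields being compared. */
typedef struct
{
    int len;            /* sizeof(MagicBlock) */
    int major_version;  /* e.g. 802 for 8.2 */
    int namedatalen;
    int func_max_args;
    int index_max_keys;
} MagicBlock;

/* Each binary builds its block from its own compile-time constants. */
#define MAKE_MAGIC(maj, ndl, fma, imk) \
    { sizeof(MagicBlock), (maj), (ndl), (fma), (imk) }

/* Loader-side check: accept the module only on an exact match. */
static bool
magic_matches(const MagicBlock *server, const MagicBlock *module)
{
    return server->len == module->len &&
           memcmp(server, module, sizeof(MagicBlock)) == 0;
}
```

A module compiled with, say, a different NAMEDATALEN then fails the comparison and is rejected at load time instead of corrupting memory later.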
Re: [PATCHES] [PATCH] Magic block for modules
it's considered the linker's job to prevent loading 32-bit code into a 64-bit executable or vice versa, so I don't think we need to be checking for common assumptions about sizeof(long). I know ELF headers contain some of this info, and unix in general doesn't try to allow different bit sizes in one binary. Windows used to have (maybe still has) a mechanism to allow 32-bit code to call 16-bit libraries. Do they allow the same for 64-bit libs?

Yes, but it's not something that it does automatically - you have to specifically set it up to call the thunking code. It's not something I think we need to support at all. (Performance is also quite horrible - at least for 16 vs 32; I'd assume the same for 32 vs 64.) //Magnus
Re: [PATCHES] Page at a time index scan
Simon Riggs [EMAIL PROTECTED] writes: I read your earlier post about needing to lock everything and spent some time thinking about this. The issue of needing to lock everything means that we would never be able to do a partial vacuum of an index, i.e. remove one page without a scan. I'm more concerned about making partial vacuum work than I am about speeding up an all-block vacuum.

[ shrug... ] That's an illusory goal anyway. Retail tuple removal is just too inefficient. (No, I don't believe in that proposed patch.)

My thinking was to write the blockid of the original left hand page, so as to record the original ancestor since split. Thus if multiple splits occurred, then the original ancestor blockid would remain on record until VACUUM. In more detail: When we split a page if the ancestor blockid is not set, then we set it to be the blockid of the old page (new left hand page). If the ancestor blockid is already set then we copy that to the new right hand page. Every split will write a value to BTPageOpaqueData, though the values to use are already available without extra work.

Doesn't work, at least not for making it possible to vacuum part of the index. The conflicting indexscan could have stopped on a page, and then that page could have split, before your partial vacuum ever started. So tracing back only as far as the data has split since vacuum started is not enough to prevent conflict. (The other little problem is that we'd have to enlarge the BTOpaque overhead, because a block id doesn't fit in the available 16 bits.)

I'm not very happy about an extra lock during page splitting, which adds a performance hit even for tables that never will need regular vacuuming (apart from occasional wrap-around avoidance).

While I'd rather not have done that, I don't believe that it makes for any material performance degradation. Normal splits all take the lock in shared mode and hence suffer no contention.
Your proposal wouldn't make for less locking anyway, since it still assumes that there's a way to tell whether vacuum is active for a given index, which is just about the same amount of overhead as the code-as-committed. regards, tom lane
Re: [PATCHES] [PATCH] Magic block for modules
Martijn van Oosterhout kleptog@svana.org writes: On Sun, May 07, 2006 at 08:21:43PM -0400, Tom Lane wrote: That seems way overkill to me. FUNC_MAX_ARGS is good to check, but most of those other things are noncritical for typical add-on modules.

I was trying to find variables that when changed would make some things corrupt. For example, a changed NAMEDATALEN will make any use of the syscache a source of errors. A change in INDEX_MAX_KEYS will break the GiST interface, etc.

By that rationale you'd have to record just about every #define in the system headers. And it still wouldn't be bulletproof --- what of custom-modified code with, say, extra fields inserted into some widely used struct? But you're missing the larger point, which is that in many cases this would be breaking stuff without any need at all. The majority of catversion bumps, for instance, are for things that don't affect the typical add-on module. So checking for identical catversion won't accomplish much except to force additional recompile churn on people doing development against CVS HEAD. The original proposal was just to check for major PG version match. I can see checking FUNC_MAX_ARGS too, because that has a very direct impact on the ABI that every external function sees, but I think the cost/benefit ratio rises pretty darn steeply after that.

Another problem with an expansive list of stuff-to-check is where does the add-on module find it out from? AFAICS your proposal would make for a large laundry list of random headers that every add-on would now have to #include. If it's not defined by postgres.h or fmgr.h (which are two things that every backend addon is surely including already) then I'm dubious about using it in the magic block.

Sure, I just didn't want to break every module in one weekend. I was thinking of adding it with LOG level now, send a message on -announce saying that at the beginning of the 8.2 freeze it will be an ERROR. Give people time to react.
I think that will just mean that we'll break every module at the start of 8.2 freeze ;-). Unless we forget to change it to error, which IMHO is way too likely. regards, tom lane
Re: [PATCHES] [WIP] The relminxid addition, try 3
Alvaro Herrera [EMAIL PROTECTED] writes: I'm not too sure about the XLOG routines -- I don't understand very well the business about attaching the changes to a buffer; I thought at first that since all the changes go to a tuple, they all belong to the buffer, so I assigned a single XLogRecData struct with all the info and the buffer containing the tuple; but then on replay, I got PANIC: invalid xlog record length 0 Generally you want an xlog record to consist of some fixed overhead plus attached data. The attached data is the part that should link to the buffer (normally it's data that is/will be actually stored into that buffer). The fixed overhead isn't in the buffer and doesn't link. But why do you need your own xlogging at all? Shouldn't these actions be perfectly ordinary updates of the relevant catalog tuples? regards, tom lane
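Tom's point about fixed overhead versus buffer-attached data can be illustrated with a stripped-down model of an XLogRecData chain. Everything below is a hypothetical simplification (real buffers, backup blocks, and XLogInsert are far more involved), but it reproduces the failure mode: when every chain entry is attached to a buffer and full-page images replace them all, the record payload collapses to zero length:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down model of an XLogRecData chain entry.
 * buffer == 0 plays the role of InvalidBuffer: data that is always
 * written with the record.  Entries attached to a buffer may be
 * omitted when a full-page image of that buffer is logged instead. */
typedef struct RecDataSketch
{
    size_t len;                    /* bytes of data in this entry */
    int    buffer;                 /* 0 = not attached to any buffer */
    struct RecDataSketch *next;    /* next entry in the chain */
} RecDataSketch;

/* Length of the record payload as replay would see it.  If every entry
 * rides on a buffer and full-page images replace them all, the record
 * shrinks to zero bytes -- the "invalid xlog record length 0" PANIC. */
static size_t
record_len(const RecDataSketch *rd, bool fpi_taken)
{
    size_t total = 0;

    for (; rd != NULL; rd = rd->next)
        if (rd->buffer == 0 || !fpi_taken)
            total += rd->len;
    return total;
}
```

The cure Tom describes is structural: put the fixed part of the record in an entry with no buffer link, and attach only the in-page data to the buffer.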
Re: [PATCHES] [PATCH] Magic block for modules
On Mon, May 08, 2006 at 10:32:47AM -0400, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: I was trying to find variables that when changed would make some things corrupt. For example, a changed NAMEDATALEN will make any use of the syscache a source of errors. A change in INDEX_MAX_KEYS will break the GiST interface, etc. By that rationale you'd have to record just about every #define in the system headers. And it still wouldn't be bulletproof --- what of custom-modified code with, say, extra fields inserted into some widely used struct? I can see that. That's why I specifically aimed at the ones defined in pg_config_manual.h, ie, the ones marked twiddle me. ... So checking for identical catversion won't accomplish much except to force additional recompile churn on people doing development against CVS HEAD. The original proposal was just to check for major PG version match. Ok, I've taken out CATVERSION and cut PG version to just the major version. I've also dropped the compiler and several others. Another problem with an expansive list of stuff-to-check is where does the add-on module find it out from? All these symbols are defined by including c.h only, which is included by postgres.h, so this is not an issue. I obviously didn't include any symbols that a module would need to add special includes for. The only outlier was CATVERSION but we're dropping that test. I think that will just mean that we'll break every module at the start of 8.2 freeze ;-). Unless we forget to change it to error, which IMHO is way too likely. Ok, one week then. Not everyone follows -patches and will be mighty confused when a CVS update suddenly breaks everything. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.
Re: [PATCHES] Page at a time index scan
On Mon, 2006-05-08 at 10:18 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: I read your earlier post about needing to lock everything and spent some time thinking about this. The issue of needing to lock everything means that we would never be able to do a partial vacuum of an index, i.e. remove one page without a scan. I'm more concerned about making partial vacuum work than I am about speeding up an all-block vacuum.

[ shrug... ] That's an illusory goal anyway. Retail tuple removal is just too inefficient. (No, I don't believe in that proposed patch.)

Current VACUUM optimizes for the case where random updates/deletes occur. If there are hotspots then scanning the whole relation is O(N) or even O(N^2) if we have to scan the indexes multiple times. We think we have a way to improve heap VACUUMs (bitmaps...) but we are still looking for an equivalent for indexes.

My thinking was to write the blockid of the original left hand page, so as to record the original ancestor since split. Thus if multiple splits occurred, then the original ancestor blockid would remain on record until VACUUM. In more detail: When we split a page if the ancestor blockid is not set, then we set it to be the blockid of the old page (new left hand page). If the ancestor blockid is already set then we copy that to the new right hand page. Every split will write a value to BTPageOpaqueData, though the values to use are already available without extra work.

Doesn't work, at least not for making it possible to vacuum part of the index. The conflicting indexscan could have stopped on a page, and then that page could have split, before your partial vacuum ever started. So tracing back only as far as the data has split since vacuum started is not enough to prevent conflict.

That wasn't the proposal. Every split would be marked and stay marked until those blocks were VACUUMed. The data used to mark is readily available and doesn't rely on whether or not VACUUM is running.
IMHO this does work.

(The other little problem is that we'd have to enlarge the BTOpaque overhead, because a block id doesn't fit in the available 16 bits.)

ISTM it is indeed a little problem. After CREATE INDEX, most index pages don't stay completely full, so worrying about alignment wastage might very occasionally save an I/O, but not enough for us to care.

I'm not very happy about an extra lock during page splitting, which adds a performance hit even for tables that never will need regular vacuuming (apart from occasional wrap-around avoidance).

While I'd rather not have done that, I don't believe that it makes for any material performance degradation. Normal splits all take the lock in shared mode and hence suffer no contention.

Your proposal wouldn't make for less locking anyway, since it still assumes that there's a way to tell whether vacuum is active for a given index, which is just about the same amount of overhead as the code-as-committed.

The proposed scheme doesn't rely on knowing whether a VACUUM is active or not. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Page at a time index scan
Simon Riggs [EMAIL PROTECTED] writes: That wasn't the proposal. Every split would be marked and stay marked until those blocks were VACUUMed. The data used to mark is readily available and doesn't rely on whether or not VACUUM is running. IMHO this does work. OK, I misunderstood what you had in mind, but now that I do understand it doesn't seem terribly efficient. What you're suggesting is that we take as a vacuum group all the pages that have been split off from a single original page since that page was last vacuumed, and that this group must be vacuumed as a whole. That works, but it seems that the groups would get awfully large. In particular, this substantially penalizes btbulkdelete in hopes of someday improving matters for what remains an entirely fictional partial vacuum. As it stands today, btbulkdelete only has to worry about page groups formed since it began to run, not since the last vacuum. Changing the data representation like this would force it to retrace much more often and over much larger page groups. (The other little problem is that we'd have to enlarge the BTOpaque overhead, because a block id doesn't fit in the available 16 bits.) ISTM it is indeed a little problem. After CREATE INDEX, most index pages don't stay completely full, so worrying about alignment wastage might very occasionally save an I/O, but not enough for us to care. I don't think an extra 4 or 8 bytes need be a show-stopper, but it does force an initdb and thus make it harder to compare the two techniques. regards, tom lane
Re: [PATCHES] Page at a time index scan
On Mon, 2006-05-08 at 11:26 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: That wasn't the proposal. Every split would be marked and stay marked until those blocks were VACUUMed. The data used to mark is readily available and doesn't rely on whether or not VACUUM is running. IMHO this does work.

OK, I misunderstood what you had in mind, but now that I do understand it, it doesn't seem terribly efficient. What you're suggesting is that we take as a vacuum group all the pages that have been split off from a single original page since that page was last vacuumed, and that this group must be vacuumed as a whole. That works, but it seems that the groups would get awfully large. In particular, this substantially penalizes btbulkdelete in hopes of someday improving matters for what remains an entirely fictional partial vacuum.

OK, so we have the germ of a new mechanism - and I very much agree that the idea of a partial vacuum is at present entirely fictional... but we at least have a starting place.

As it stands today, btbulkdelete only has to worry about page groups formed since it began to run, not since the last vacuum. Changing the data representation like this would force it to retrace much more often and over much larger page groups.

Yes, I saw the potential issue you mention - but for many cases the index grows forwards and so we wouldn't care in either case. Page splits that go to lower blockids are limited by available space, so would be less of a problem. I'm balancing the additional cost of page splits against the additional cost on the vacuum. I would prefer to keep in-line ops faster and pay a little extra on the out-of-line operations, if that's what it takes. I note your point that there is little contention, but there is still a cost, and in many cases this cost is being paid on tables that never will be VACUUMed. For insert-intensive apps, this adds cost, for little benefit.
For update-intensive apps, we're VACUUMing continually anyway, so there's no benefit from doing this only-during-VACUUM. So we just optimised for slowly-but-continually churning tables (i.e. DELETEs match INSERTs, or just UPDATEs), i.e. we just improved VACUUM performance for those that don't need it that often. That might be the common case, but it isn't the one that's hurting most. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
[PATCHES] [PATCH] Round 2: Magic block for modules
Per feedback, here is an updated version. As was pointed out, the prior version was checking stuff that either changed too often to be useful or was never going to change at all. The error reporting is cleaned up too; it now releases the module before throwing the error.

It now only checks four things: Major version number (7.4 or 8.1 for example), NAMEDATALEN, FUNC_MAX_ARGS, INDEX_MAX_KEYS.

The three constants were chosen because: 1. We document them in the config page in the docs 2. We mark them as changeable in pg_config_manual.h 3. Changing any of these will break some of the more popular modules: FUNC_MAX_ARGS changes the fmgr interface, every module uses this; NAMEDATALEN changes the syscache interface, every PL as well as tsearch uses this; INDEX_MAX_KEYS breaks tsearch and anything using GiST.

I considered others but ultimately rejected them. For example, HAVE_INT64_TIMESTAMP, while changing the way timestamps are stored and being configurable by a configure option, doesn't actually break anything important (only the btree_gist example in contrib).

Any more comments? Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.

Index: doc/src/sgml/xfunc.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/xfunc.sgml,v
retrieving revision 1.112
diff -c -r1.112 xfunc.sgml
*** doc/src/sgml/xfunc.sgml	23 Apr 2006 03:39:52 -0000	1.112
--- doc/src/sgml/xfunc.sgml	8 May 2006 17:41:33 -0000
***************
*** 1149,1154 ****
--- 1149,1161 ----
     </para>
  
     <para>
+     After the module has been found, PostgreSQL looks for its magic block.
+     This block contains information about the environment a module was
+     compiled in. The server uses this to verify the module was compiled
+     under the same assumptions and environment as the backend.
+    </para>
+ 
+    <para>
      The user ID the <productname>PostgreSQL</productname> server runs as
      must be able to traverse the path to the file you intend to load.
      Making the file or a higher-level directory not readable
***************
*** 1953,1958 ****
--- 1960,1985 ----
     <listitem>
      <para>
+      To ensure your module is not loaded into an incompatible backend, it
+      is recommended to include a magic block. To do this you must include
+      the header <filename>pgmagic.h</filename> and declare the block as
+      follows:
+     </para>
+ 
+ <programlisting>
+ #include "pgmagic.h"
+ 
+ PG_MODULE_MAGIC;
+ </programlisting>
+ 
+     <para>
+      If the module consists of multiple source files, this only needs to
+      be done in one of them.
+     </para>
+    </listitem>
+ 
+    <listitem>
+     <para>
      Symbol names defined within object files must not conflict with
      each other or with symbols defined in the
      <productname>PostgreSQL</productname> server executable. You
Index: src/backend/utils/fmgr/dfmgr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/fmgr/dfmgr.c,v
retrieving revision 1.82
diff -c -r1.82 dfmgr.c
*** src/backend/utils/fmgr/dfmgr.c	5 Mar 2006 15:58:46 -0000	1.82
--- src/backend/utils/fmgr/dfmgr.c	8 May 2006 17:41:33 -0000
***************
*** 20,26 ****
  #include "dynloader.h"
  #include "miscadmin.h"
  #include "utils/dynamic_loader.h"
! 
  /*
   * List of dynamically loaded files (kept in malloc'd memory).
--- 20,26 ----
  #include "dynloader.h"
  #include "miscadmin.h"
  #include "utils/dynamic_loader.h"
! #include "pgmagic.h"
  
  /*
   * List of dynamically loaded files (kept in malloc'd memory).
***************
*** 60,65 ****
--- 60,68 ----
  static char *expand_dynamic_library_name(const char *name);
  static char *substitute_libpath_macro(const char *name);
  
+ /* Magic structure that module needs to match to be accepted */
+ static Pg_magic_struct magic_data = PG_MODULE_MAGIC_DATA;
+ 
  /*
   * Load the specified dynamic-link library file, and look for a function
   * named funcname in it. (funcname can be NULL to just load the file.)
***************
*** 116,121 ****
--- 119,125 ----
  	if (file_scanner == NULL)
  	{
+ 		PGModuleMagicFunction magic_func;
  		/*
  		 * File not loaded yet.
  		 */
***************
*** 146,151 ****
--- 150,194 ----
  					 fullname, load_error)));
  	}
  
+ 	/* Check the magic function to determine compatibility */
+ 	magic_func = pg_dlsym(file_scanner->handle, PG_MAGIC_FUNCTION_NAME_STRING);
+ 	if (magic_func)
+ 	{
+ 		Pg_magic_struct *module_magic_data = magic_func();
+ 
+ 		if (module_magic_data->len != magic_data.len ||
+ 			memcmp(
Re: [PATCHES] Page at a time index scan
Simon Riggs [EMAIL PROTECTED] writes: So we just optimised for slowly-but-continually churning tables (i.e. DELETEs match INSERTs, or just UPDATEs), i.e. we just improved VACUUM performance for those that don't need it that often. That might be the common case, but it isn't the one that's hurting most.

No, we just improved VACUUM performance for those that need it most. I was just doing some desultory experiments with today's code versus yesterday's (it's easy to make a direct comparison because they're initdb-compatible ... just stop one postmaster executable and start another on the same database). I made a table of 16M rows with an index over a random-data integer column. With a thoroughly disordered index (built on-the-fly as the random data was inserted), the time to VACUUM after deleting a small number of rows was 615 seconds with yesterday's code, 31 seconds today. With a perfectly-ordered index (identical table, but CREATE INDEX after all the data is in place), the times were about 28 and 26 seconds respectively. (I wouldn't put a *whole* lot of faith in these numbers, since they're from a Dell machine with a toy ATA drive, but anyway they do show that sequential access to the index makes a huge difference.) But perfectly-ordered indexes are impossible to maintain unless your data is either static or insert-at-the-end-only.

Anyway, while the extra LWLock and shared-memory access during a split is annoying, I think it's really negligible (and so does pgbench). A page split is going to involve many times that much work --- you've got to acquire lock on at least four different buffers, visit the FSM, possibly ask the kernel for another disk page, do XLogInsert twice, etc. Any one of those things involves significantly more CPU effort and contention than _bt_vacuum_cycleid(). So while I'd be happy to get rid of it, not at the price of slowing down VACUUM again. regards, tom lane
Re: [PATCHES] [WIP] The relminxid addition, try 3
Tom Lane wrote: But why do you need your own xlogging at all? Shouldn't these actions be perfectly ordinary updates of the relevant catalog tuples?

The XLog entry can be smaller, AFAICT (we need to store the whole new tuple in a heap_update, right?). If that's not a worthy goal I'll gladly rewrite this piece of code. Ah, there's another reason, and it's that I'm rewriting the tuple in place, not calling heap_update. I'm not sure if I can reuse log_heap_update for this purpose -- I'll take a look. -- Alvaro Herrera   http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: [PATCHES] [WIP] The relminxid addition, try 3
Alvaro Herrera [EMAIL PROTECTED] writes: Ah, there's another reason, and it's that I'm rewriting the tuple in place, not calling heap_update. Is that really a good idea, as compared to using heap_update? (Now, if you're combining this with the very grotty relpages/reltuples update code, then I'm all for making that xlog properly --- we've gotten away without it so far but it really should xlog the change.) regards, tom lane
Re: [PATCHES] Page at a time index scan
Tom, On 5/8/06 11:46 AM, Tom Lane [EMAIL PROTECTED] wrote: I made a table of 16M rows with an index over a random-data integer column. With a thoroughly disordered index (built on-the-fly as the random data was inserted), the time to VACUUM after deleting a small number of rows was 615 seconds with yesterday's code, 31 seconds today. With a perfectly-ordered index (identical table, but CREATE INDEX after all the data is in place), the times were about 28 and 26 seconds respectively. Very impressive! This corroborates findings we've had with index maintenance in the field - thanks for finding/fixing this. - Luke
Re: [PATCHES] [WIP] The relminxid addition, try 3
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Ah, there's another reason, and it's that I'm rewriting the tuple in place, not calling heap_update. Is that really a good idea, as compared to using heap_update?

Not sure -- we would leave dead tuples after the VACUUM is finished if we used heap_update, no? ISTM it's undesirable.

(Now, if you're combining this with the very grotty relpages/reltuples update code, then I'm all for making that xlog properly --- we've gotten away without it so far but it really should xlog the change.)

I hadn't done that, but I'll see what I can do. Notice however that I'm doing this in both pg_class _and_ pg_database. -- Alvaro Herrera   http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: [PATCHES] [WIP] The relminxid addition, try 3
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: (Now, if you're combining this with the very grotty relpages/reltuples update code, then I'm all for making that xlog properly --- we've gotten away without it so far but it really should xlog the change.) I hadn't done that, but I'll see what I can do. Notice however that I'm doing this in both pg_class _and_ pg_database. BTW, one thing I was looking at last week was whether we couldn't refactor the relpages/reltuples updates to be done in a cleaner place. Right now UpdateStats is called (for indexes) directly from the index AM, which sucks from a modularity point of view, and what's worse it forces two successive updates of the same tuple during CREATE INDEX. We should reorganize things so this is done once at the outer level. It'd require some change of the ambuild() API, but considering we're hacking several other aspects of the AM API in this development cycle, that doesn't bother me. regards, tom lane
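The refactoring Tom sketches (each ambuild() reporting its counts back to the caller, which then performs a single stats update) can be shown in miniature. The struct and function names below are illustrative stand-ins, not the actual patch:

```c
#include <assert.h>

/* Illustrative stand-in for what ambuild() could hand back to its caller,
 * so the outer level updates pg_class once instead of the AM doing it. */
typedef struct IndexBuildResult
{
    double      heap_tuples;    /* number of heap tuples scanned */
    double      index_tuples;   /* number of tuples put into the index */
} IndexBuildResult;

static int  stats_updates = 0;  /* counts pg_class tuple updates */

/* Hypothetical AM build: scans the heap, skips dead rows, reports counts. */
static IndexBuildResult
mock_ambuild(int live_rows, int dead_rows)
{
    IndexBuildResult result;

    result.heap_tuples = live_rows + dead_rows;
    result.index_tuples = live_rows;
    return result;
}

/* Outer level: exactly one UpdateStats-style call per index build. */
static void
update_stats_once(const IndexBuildResult *result)
{
    (void) result;              /* would write relpages/reltuples here */
    stats_updates++;
}
```

With this shape, the double update of the same pg_class tuple disappears, because only the caller ever touches that tuple.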
Re: [PATCHES] [WIP] The relminxid addition, try 3
Tom Lane wrote: BTW, one thing I was looking at last week was whether we couldn't refactor the relpages/reltuples updates to be done in a cleaner place. Right now UpdateStats is called (for indexes) directly from the index AM, which sucks from a modularity point of view, and what's worse it forces two successive updates of the same tuple during CREATE INDEX. We should reorganize things so this is done once at the outer level. It'd require some change of the ambuild() API, but considering we're hacking several other aspects of the AM API in this development cycle, that doesn't bother me. I'm rather stuck in Mammoth Replicator currently :-( so I'm not sure when I could devote some time to this task. If you want to do it, go ahead and I can adapt the relminxid patch to your changes. Or you can take the relminxid patch from where it currently is, if you feel so inclined. If you don't, I'll eventually come back to it, but I'm not sure when that will be. -- Alvaro Herrera  http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: [PATCHES] [WIP] The relminxid addition, try 3
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: We should reorganize things so this is done once at the outer level. It'd require some change of the ambuild() API, but considering we're hacking several other aspects of the AM API in this development cycle, that doesn't bother me. I'm rather stuck in Mammoth Replicator currently :-( so I'm not sure when I could devote some time to this task. OK, I'll try to get to it later this week. regards, tom lane
Re: [PATCHES] Building with Visual C++
Tom Lane wrote: BTW, has anyone looked at the possibility of driving VC from gmake, so that we can continue to use the same Makefiles? Or is that just out of the question? The Coin 3D project had a wrapper program that makes VC look like a Unix compiler. Looking at http://www.coin3d.org/support/coin_faq/#extension_MSVCproject now, it seems they have changed to autogenerating a project file as well. But check there anyway; these guys have been doing this for years. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Re: [PATCHES] Building with Visual C++
BTW, has anyone looked at the possibility of driving VC from gmake, so that we can continue to use the same Makefiles? Or is that just out of the question? The Coin 3D project had a wrapper program that makes VC look like a Unix compiler. Looking at http://www.coin3d.org/support/coin_faq/#extension_MSVCproject now, it seems they have changed to autogenerating a project file as well. But check there anyway; these guys have been doing this for years. Definitely worth looking at - but they still require cygwin to run their generators, from what I can tell. //Magnus
Re: [PATCHES] [WIP] The relminxid addition, try 3
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: (Now, if you're combining this with the very grotty relpages/reltuples update code, then I'm all for making that xlog properly --- we've gotten away without it so far but it really should xlog the change.) I hadn't done that, but I'll see what I can do. Notice however that I'm doing this in both pg_class _and_ pg_database. It strikes me that the cleanest way to deal with this is to invent a single new type of xlog record, something like HEAP_UPDATE_IN_PLACE, which just replaces tuple contents with a new tuple of the same size. This would serve for both stats updates and unfreezing in both pg_class and pg_database, and might have other uses in future for non-transactional updates. It's a little bit more xlog space than your unfreeze-specific operations, but I don't think we need to sweat a few bytes for those; they're not likely to be common. Barring objections, I'll add this when I refactor UpdateStats. Got some other work I need to get back to right now, though. regards, tom lane
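A minimal sketch of the invariant such a record would enforce: the record body carries the complete new tuple contents, and applying it (whether during the original update or during WAL redo) just overwrites the old bytes, refusing any length change. The function name and signature here are illustrative, not an actual WAL record layout:

```c
#include <assert.h>
#include <string.h>

/* Apply a hypothetical HEAP_UPDATE_IN_PLACE record: overwrite the tuple
 * body with new contents of exactly the same size.  The same routine can
 * serve both the original update and redo, since the record carries the
 * full replacement bytes.  Returns 0 on success, -1 if the sizes differ. */
static int
heap_inplace_apply(char *tuple, size_t tuple_len,
                   const char *new_body, size_t new_len)
{
    if (tuple_len != new_len)
        return -1;              /* in-place update may not change length */
    memcpy(tuple, new_body, new_len);
    return 0;
}
```

Because the tuple's length and position on the page never change, no line-pointer or free-space bookkeeping is needed, which is what makes one generic record type sufficient for both the stats updates and the unfreeze operations.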
Re: [PATCHES] Page at a time index scan
On Mon, 2006-05-08 at 14:46 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: So we just optimised for slowly-but-continually churning tables (i.e. DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM performance for those that don't need it that often. That might be the common case, but it isn't the one that's hurting most. No, we just improved VACUUM performance for those that need it most. I was just doing some desultory experiments with today's code versus yesterday's (it's easy to make a direct comparison because they're initdb-compatible ... just stop one postmaster executable and start another on the same database). I made a table of 16M rows with an index over a random-data integer column. With a thoroughly disordered index (built on-the-fly as the random data was inserted), the time to VACUUM after deleting a small number of rows was 615 seconds with yesterday's code, 31 seconds today. With a perfectly-ordered index (identical table, but CREATE INDEX after all the data is in place), the times were about 28 and 26 seconds respectively. (I wouldn't put a *whole* lot of faith in these numbers, since they're from a Dell machine with a toy ATA drive, but anyway they do show that sequential access to the index makes a huge difference.) But perfectly-ordered indexes are impossible to maintain unless your data is either static or insert-at-the-end-only. You and Heikki have achieved a marvelous thing, well done. Anyway, while the extra LWLock and shared-memory access during a split is annoying, I think it's really negligible (and so does pgbench). A page split is going to involve many times that much work --- you've got to acquire lock on at least four different buffers, visit the FSM, possibly ask the kernel for another disk page, do XLogInsert twice, etc. Any one of those things involves significantly more CPU effort and contention than _bt_vacuum_cycleid().
So while I'd be happy to get rid of it, not at the price of slowing down VACUUM again. I'll raise the partial vacuum topic on -hackers. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
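The _bt_vacuum_cycleid() check that Tom calls negligible is the gate for the scheme discussed earlier in the thread: a split that happens while a vacuum scan is active stamps both halves with the vacuum's cycle id, and btbulkdelete detours through the right-link only when a newly-split page points back below the sequential scan's position. A toy model of just that decision, with made-up stub types (the real logic lives in the btree AM):

```c
#include <assert.h>

/* Made-up stub of the per-page state relevant to the detour decision. */
typedef struct BtPageStub
{
    unsigned    cycleid;    /* nonzero: page was split during that cycle */
    int         rightlink;  /* block number of right sibling, -1 if none */
} BtPageStub;

/* A split while vacuum is active stamps both halves with the cycle id. */
static void
split_mark(BtPageStub *left, BtPageStub *right, unsigned cur_cycleid)
{
    left->cycleid = cur_cycleid;
    right->cycleid = cur_cycleid;
}

/* At block 'blkno', bulkdelete must follow the right-link only when the
 * page was split during this scan AND the right sibling sits at a lower
 * block number, i.e. behind the sequential scan's current position;
 * otherwise the outer sequential loop will reach it anyway. */
static int
needs_rightlink_detour(const BtPageStub *page, int blkno, unsigned cur_cycleid)
{
    return page->cycleid == cur_cycleid &&
           page->rightlink >= 0 &&
           page->rightlink < blkno;
}
```

Clearing the cycle id as each page is processed (as proposed upthread) then guarantees no page is visited twice, since a cleared page can never satisfy the detour test again.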
Re: [PATCHES] Building with Visual C++
Hi Magnus. I found that this helped:

#define PGBINDIR /usr/local/pgsql/bin
#define PGSHAREDIR /usr/local/pgsql/share
#define SYSCONFDIR /usr/local/pgsql/etc
#define INCLUDEDIR /usr/local/pgsql/include
#define PKGINCLUDEDIR /usr/local/pgsql/include
#define INCLUDEDIRSERVER /usr/local/pgsql/include/server
#define LIBDIR /usr/local/pgsql/lib
#define PKGLIBDIR /usr/local/pgsql/lib
#define LOCALEDIR
#define DOCDIR /usr/local/pgsql/doc
#define MANDIR /usr/local/pgsql/man

With part of your patch applied, it now builds on VC++6, which makes me very happy. :-) However, would you also apply this additional patch?

Regards, Hiroshi Saito

--- src/include/port/win32.h.orig	Mon May 8 14:45:11 2006
+++ src/include/port/win32.h	Mon May 8 15:15:09 2006
@@ -1,5 +1,8 @@
 /* $PostgreSQL: pgsql/src/include/port/win32.h,v 1.51 2006/03/03 20:52:36 momjian Exp $ */
+#ifndef _PORT_WIN32_H
+#define _PORT_WIN32_H
+
 /* undefine and redefine after #include */
 #undef mkdir
@@ -11,6 +14,7 @@
 #include <errno.h>
 #undef near
+#define near pg_near
 /* Must be here to avoid conflicting with prototype in windows.h */
 #define mkdir(a,b) mkdir(a)
@@ -256,3 +260,5 @@
 /* in backend/port/win32/error.c */
 extern void _dosmaperr(unsigned long);
+
+#endif   /* _PORT_WIN32_H */

--- src/include/getaddrinfo.h.orig	Mon May 8 14:35:41 2006
+++ src/include/getaddrinfo.h	Mon May 8 14:36:54 2006
@@ -43,7 +43,9 @@
 #define EAI_SYSTEM (-11)
 #else /* WIN32 */
 #if defined(WIN32_CLIENT_ONLY)
+#ifndef WSA_NOT_ENOUGH_MEMORY
 #define WSA_NOT_ENOUGH_MEMORY (WSAENOBUFS)
+#endif
 #define WSATYPE_NOT_FOUND (WSABASEERR+109)
 #endif
 #define EAI_AGAIN WSATRY_AGAIN