Re: [PATCHES] Page at a time index scan

2006-05-08 Thread Simon Riggs
On Fri, 2006-05-05 at 18:04 -0400, Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
  On Fri, 5 May 2006, Tom Lane wrote:
  BTW, I just realized another bug in the patch: btbulkdelete fails to
  guarantee that it visits every page in the index.
 
  The first solution that occurs to me is to force page splits to choose the 
  target page so that its blkno > the original page's blkno during vacuum. 
  It would cause the index to become more fragmented more quickly, which is 
  bad but perhaps tolerable.
 
 I have a sketch of a solution that doesn't require any change in page
 allocation behavior.  Can anyone see any holes in this:
 
 Assume that we have some way to recognize whether a page has been split
 since the current btbulkdelete scan started.  (A split would mark both the
 original page and the new right-hand page as newly split.)  When
 btbulkdelete arrives at a page, it need take no special action unless the
 page is newly split *and* its right-link points to a lower physical
 address.  If that's true, then after vacuuming the page, follow its
 right-link and vacuum that page; repeat until arriving at a page that is
 either not newly split or is above the current location of the outer loop.
 Then return to the outer, sequential-scan loop.
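
 The scan rule above can be sketched as a toy simulation (a sketch only:
 array indexes stand in for block numbers, the split is injected at a fixed
 point, and all locking/concurrency issues of the real nbtree code are
 ignored):

```c
#include <string.h>

#define NPAGES 6
#define SCAN_CYCLE 7            /* id of the current btbulkdelete scan */

/* Toy page: the array index is the physical block number; rightlink is
 * the right sibling (-1 = rightmost); cycleid != 0 marks a page as split
 * during the scan whose id it stores. */
typedef struct { int rightlink; int cycleid; } Page;

static int vacuum_page(Page *pages, int blkno, int *done)
{
    if (done[blkno])
        return 0;                /* already processed in this scan */
    done[blkno] = 1;
    pages[blkno].cycleid = 0;    /* clear the newly-split marker */
    return 1;
}

/* One step of the outer sequential loop, including the chase rule:
 * after vacuuming a newly split page whose right-link points below the
 * outer position, follow right-links until reaching a page that is not
 * newly split or that the outer loop will visit anyway. */
static int scan_step(Page *pages, int *done, int outer)
{
    int nvisited = 0, cur = outer, chase = 1;
    while (chase)
    {
        int newly = (pages[cur].cycleid == SCAN_CYCLE);
        int rlink = pages[cur].rightlink;
        nvisited += vacuum_page(pages, cur, done);
        chase = (newly && rlink >= 0 && rlink < outer);
        if (chase)
            cur = rlink;
    }
    return nvisited;
}

/* Scan six pages; after the scan passes block 2, block 4 splits and its
 * new right half lands on the recycled low block 1, which the outer loop
 * has already passed -- the chase is what recovers it. */
int run_demo(void)
{
    Page pages[NPAGES] = {{-1,0},{2,0},{3,0},{4,0},{5,0},{-1,0}};
    int done[NPAGES] = {0};
    int nvisited = 0;
    for (int outer = 0; outer < NPAGES; outer++)
    {
        if (outer == 3)          /* concurrent split of block 4 -> block 1 */
        {
            pages[4] = (Page){1, SCAN_CYCLE};  /* left half links to new page */
            pages[1] = (Page){5, SCAN_CYCLE};  /* new right half, old rightlink */
            done[1] = 0;         /* its contents are new, not yet vacuumed */
        }
        nvisited += scan_step(pages, done, outer);
    }
    return nvisited;             /* 6 pages + block 1 revisited = 7 */
}
```

 The chase from block 4 back to block 1 is what keeps the moved tuples from
 being missed, and clearing the cycleid on each visit is what bounds the
 chase.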
 
 We should also have btbulkdelete clear the newly-split marker whenever
 it processes a page; this ensures that no page is processed more than
 once, which is probably worth the possible extra I/O needed to clear the
 marker.  (The cycles to re-scan a page are maybe not that important, but
 if we do reprocess pages we'll end up with a bogusly high tuple count.
 OTOH I'm not sure we can guarantee an accurate tuple count anyway,
 so maybe it doesn't matter.)
 
 AFAICS, this works correctly even if the test for a newly-split page
 sometimes yields false positives; thinking a page is newly split when
 it is not might cost a little extra I/O but won't lead to wrong results.
 This reduces the requirements for the newly-split marking mechanism.
 
 So, how do we do that marking?  Noting that we have 16 bits we could use
 in BTPageOpaqueData without making that struct any bigger (because of
 alignment considerations), I'm thinking about a 16-bit counter for each
 index that is incremented at the start of each btbulkdelete operation and
 copied into the opaque data whenever a page is split.  If the value's
 different from the current counter, the page definitely hasn't been split
 during btbulkdelete.  There's a 1-in-65536 chance of a false positive,
 if the last split occurred some exact multiple of 65536 vacuums ago, but
 that's probably low enough to be acceptable.  (We could reduce the odds of
 false positives in various ways, eg by saying that zero isn't a valid
 counter value and having btbulkdelete reset a page's counter to zero
 anytime it has to write the page anyway.)
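
 The counter check and its wraparound false positive can be sketched as
 follows (values are illustrative; the point is only that a wrapped 16-bit
 counter compares equal again, and that a false positive merely costs extra
 I/O rather than producing wrong results):

```c
#include <stdint.h>

/* A page "looks newly split" when the 16-bit cycle counter stamped on it
 * at split time equals the index's current counter. */
static int looks_newly_split(uint16_t page_stamp, uint16_t current)
{
    return page_stamp == current;
}

int cycleid_demo(void)
{
    uint16_t vac = 1000;                 /* counter during some vacuum   */
    uint16_t stamp = vac;                /* page splits during it        */
    int hit  = looks_newly_split(stamp, vac);        /* true positive    */
    int miss = looks_newly_split(stamp, (uint16_t)(vac + 1));
                                         /* later vacuum: not newly split */
    /* 65536 vacuums after the split, the counter has wrapped around */
    uint16_t wrapped = (uint16_t)(vac + 65536u);
    int fp = looks_newly_split(stamp, wrapped);      /* false positive   */
    /* The false positive triggers a needless chase but never skips a page. */
    return hit == 1 && miss == 0 && fp == 1;
}
```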

I read your earlier post about needing to lock everything and spent some
time thinking about this. The issue of needing to lock everything means
that we would never be able to do a partial vacuum of an index i.e.
remove one page without a scan. I'm more concerned about making partial
vacuum work than I am about speeding up an all-block vacuum.

I'd been mulling over the locking problems and came up with something
that looks identical to your proposal *except* for the value that gets
written into the BTPageOpaqueData on the right-hand page.

My thinking was to write the blockid of the original left hand page, so
as to record the original ancestor since split. Thus if multiple splits
occurred, then the original ancestor blockid would remain on record
until VACUUM. In more detail: When we split a page if the ancestor
blockid is not set, then we set it to be the blockid of the old page
(new left hand page). If the ancestor blockid is already set then we
copy that to the new right hand page. Every split will write a value to
BTPageOpaqueData, though the values to use are already available without
extra work.
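
The marking rule described above might be sketched like this (a toy model:
`ancestor == 0` stands in for "not set", so a real block 0 would need a
different sentinel, and clearing at VACUUM time is omitted):

```c
#define NBLOCKS 16

/* Per-page marker from the proposal: the block number of the pre-split
 * ancestor; zero-initialized = nothing has split since the last VACUUM. */
static int ancestor[NBLOCKS];

/* On a split of 'left' creating the new right-hand page 'right': record
 * the original ancestor, or propagate it if already set. */
static void mark_split(int left, int right)
{
    if (ancestor[left] == 0)
        ancestor[left] = left;          /* first split since last VACUUM */
    ancestor[right] = ancestor[left];   /* copy to the new right page    */
}

int ancestor_demo(void)
{
    /* block 5 splits into 5 + 9, then 9 splits again into 9 + 12 */
    mark_split(5, 9);
    mark_split(9, 12);
    /* all three pages still point at the original pre-split block 5, so a
     * partial vacuum reaching any of them can find the whole group */
    return ancestor[5] == 5 && ancestor[9] == 5 && ancestor[12] == 5;
}
```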

AFAICS this doesn't change the ability to scan the index in physical
order, so we still get the benefits for that action.

A *partial* vacuum of an index block can be achieved by visiting the
block to be vacuumed, then following the link directly to the pre-split
ancestor block (if there is one), then moving right again back to the
target block. btbulkdelete always clears the marker when it cleans. This
opens the door for the approach of notifying when a page is deletable
and then having a background agent remove just that page, as
patch-proposed recently by Junji TERAMOTO. 

I'm *not* presuming we have an agreed solution for partial vacuuming,
but I do want to keep that door open by implementing a locking protocol
that could support it.

 This still has the same sort of locking issues I complained of in
 regards to Heikki's idea, namely how do you make sure that everyone is
 using the new counter value before you start scanning?  That seems
 soluble however.  We'd just need to be 

Re: [PATCHES] [PATCH] Magic block for modules

2006-05-08 Thread Martijn van Oosterhout
On Sun, May 07, 2006 at 08:21:43PM -0400, Tom Lane wrote:
  changes in any of the following:
 
  PG_VERSION_NUM
  CATALOG_VERSION_NO
  the size of 8 basic C types
  BLCKSZ
  NAMEDATALEN
  HAVE_INT64_TIMESTAMP
  INDEX_MAX_KEYS
  FUNC_MAX_ARGS
  VARHDRSZ
  MAXDIM
  The compiler used (only brand, not version)
 
 That seems way overkill to me.  FUNC_MAX_ARGS is good to check, but
 most of those other things are noncritical for typical add-on modules.

I was trying to find variables that when changed would make some things
corrupt. For example, a changed NAMEDATALEN will make any use of the
syscache a source of errors. A change in INDEX_MAX_KEYS will break the
GiST interface, etc. I wondered about letting module writers select
which parts are relevant to them, but that just seems like handing
people a footgun.

 In particular I strongly object to the check on compiler.  Some of us do
 use systems where gcc and vendor compilers are supposed to interoperate
 ... and aren't all those Windows compilers supposed to, too?  AFAIK

Maybe that's the case now; it didn't use to be. I seem to remember
people having difficulties because they compiled the server with MinGW
and the modules with VC++. I'll take it out though, it's not like it
costs anything.

 it's considered the linker's job to prevent loading 32-bit code into
 a 64-bit executable or vice versa, so I don't think we need to be
 checking for common assumptions about sizeof(long).

I know ELF headers contain some of this info, and unix in general
doesn't try to allow different bit sizes in one binary. Windows used to
(maybe still has) a mechanism to allow 32-bit code to call 16-bit
libraries. Do they allow the same for 64-bit libs?

 I'm pretty sure we had agreed that magic blocks should be required;
 otherwise this check will accomplish little.

Sure, I just didn't want to break every module in one weekend. I was
thinking of adding it with LOG level now, sending a message on -announce
saying that at the beginning of the 8.2 freeze it will be an ERROR.
Give people time to react.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [PATCHES] [PATCH] Magic block for modules

2006-05-08 Thread Magnus Hagander
  it's considered the linker's job to prevent loading 32-bit 
 code into a 
  64-bit executable or vice versa, so I don't think we need to be 
  checking for common assumptions about sizeof(long).
 
 I know ELF headers contain some of this info, and unix in 
 general doesn't try to allow different bit sizes in one 
 binary. Windows used to (maybe still has) a mechanism to 
 allow 32-bit code to call 16-bit libraries. Do they allow the 
 same for 64-bit libs?

Yes, but it's not something that it does automatically - you have to
specifically set it up to call the thunking code. It's not something I
think we need to support at all. (Performance is also quite horrible -
at least on 16 vs 32, I'd assume the same for 32 vs 64)


//Magnus

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] Page at a time index scan

2006-05-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I read your earlier post about needing to lock everything and spent some
 time thinking about this. The issue of needing to lock everything means
 that we would never be able to do a partial vacuum of an index i.e.
 remove one page without a scan. I'm more concerned about making partial
 vacuum work than I am about speeding up an all-block vacuum.

[ shrug... ] That's an illusory goal anyway.  Retail tuple removal is
just too inefficient.  (No, I don't believe in that proposed patch.)

 My thinking was to write the blockid of the original left hand page, so
 as to record the original ancestor since split. Thus if multiple splits
 occurred, then the original ancestor blockid would remain on record
 until VACUUM. In more detail: When we split a page if the ancestor
 blockid is not set, then we set it to be the blockid of the old page
 (new left hand page). If the ancestor blockid is already set then we
 copy that to the new right hand page. Every split will write a value to
 BTPageOpaqueData, though the values to use are already available without
 extra work.

Doesn't work, at least not for making it possible to vacuum part of the
index.  The conflicting indexscan could have stopped on a page, and then
that page could have split, before your partial vacuum ever started.
So tracing back only as far as the data has split since vacuum started
is not enough to prevent conflict.

(The other little problem is that we'd have to enlarge the BTOpaque
overhead, because a block id doesn't fit in the available 16 bits.)

 I'm not very happy about an extra lock during page splitting, which adds
 a performance hit even for tables that never will need regular vacuuming
 (apart from occasional wrap-around avoidance).

While I'd rather not have done that, I don't believe that it makes for
any material performance degradation.  Normal splits all take the lock
in shared mode and hence suffer no contention.  Your proposal wouldn't
make for less locking anyway, since it still assumes that there's a way
to tell whether vacuum is active for a given index, which is just about
the same amount of overhead as the code-as-committed.

regards, tom lane



Re: [PATCHES] [PATCH] Magic block for modules

2006-05-08 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 On Sun, May 07, 2006 at 08:21:43PM -0400, Tom Lane wrote:
 That seems way overkill to me.  FUNC_MAX_ARGS is good to check, but
 most of those other things are noncritical for typical add-on modules.

 I was trying to find variables that when changed would make some things
 corrupt. For example, a changed NAMEDATALEN will make any use of the
 syscache a source of errors. A change in INDEX_MAX_KEYS will break the
 GiST interface, etc.

By that rationale you'd have to record just about every #define in the
system headers.  And it still wouldn't be bulletproof --- what of
custom-modified code with, say, extra fields inserted into some widely
used struct?

But you're missing the larger point, which is that in many cases this
would be breaking stuff without any need at all.  The majority of
catversion bumps, for instance, are for things that don't affect the
typical add-on module.  So checking for identical catversion won't
accomplish much except to force additional recompile churn on people
doing development against CVS HEAD.  The original proposal was just
to check for major PG version match.  I can see checking FUNC_MAX_ARGS
too, because that has a very direct impact on the ABI that every
external function sees, but I think the cost/benefit ratio rises pretty
darn steeply after that.

Another problem with an expansive list of stuff-to-check is where does
the add-on module find it out from?  AFAICS your proposal would make for
a large laundry list of random headers that every add-on would now have
to #include.  If it's not defined by postgres.h or fmgr.h (which are two
things that every backend addon is surely including already) then I'm
dubious about using it in the magic block.

 Sure, I just didn't want to break every module in one weekend. I was
 thinking of adding it with LOG level now, sending a message on -announce
 saying that at the beginning of the 8.2 freeze it will be an ERROR.
 Give people time to react.

I think that will just mean that we'll break every module at the start
of 8.2 freeze ;-).  Unless we forget to change it to error, which IMHO
is way too likely.

regards, tom lane



Re: [PATCHES] [WIP] The relminxid addition, try 3

2006-05-08 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 I'm not too sure about the XLOG routines -- I don't understand very well
 the business about attaching the changes to a buffer; I thought at first
 that since all the changes go to a tuple, they all belong to the buffer,
 so I assigned a single XLogRecData struct with all the info and the
 buffer containing the tuple; but then on replay, I got PANIC: invalid
 xlog record length 0

Generally you want an xlog record to consist of some fixed overhead plus
attached data.  The attached data is the part that should link to the
buffer (normally it's data that is/will be actually stored into that buffer).
The fixed overhead isn't in the buffer and doesn't link.
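
A toy model of why the all-attached layout produced that PANIC (an
assumption about the mechanism, simplified from the real XLogRecData
machinery: entries attached to a buffer are omitted from the record when a
full-page image of that buffer gets logged instead):

```c
#include <stddef.h>

/* Toy XLogRecData chain: each entry is either standalone (buffer 0) or
 * attached to a buffer; names only loosely follow the real struct. */
typedef struct RecData {
    size_t len;
    int    buffer;               /* 0 = not associated with a buffer */
    const struct RecData *next;
} RecData;

/* Record-body length as replay sees it when a full-page image was logged
 * for every touched buffer: buffer-attached entries are dropped, since
 * the page image already carries that data. */
static size_t replay_len(const RecData *rd)
{
    size_t total = 0;
    for (; rd != NULL; rd = rd->next)
        if (rd->buffer == 0)
            total += rd->len;
    return total;
}

int xlog_demo(void)
{
    /* Everything attached to the buffer: the record body vanishes,
     * i.e. "invalid xlog record length 0" at replay. */
    RecData all_attached = { 32, 1, NULL };

    /* Fixed overhead kept standalone, tuple data attached: replay still
     * has the fixed part to work from. */
    RecData tuple = { 32, 1, NULL };
    RecData fixed = { 12, 0, &tuple };

    return replay_len(&all_attached) == 0 && replay_len(&fixed) == 12;
}
```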

But why do you need your own xlogging at all?  Shouldn't these actions
be perfectly ordinary updates of the relevant catalog tuples?

regards, tom lane



Re: [PATCHES] [PATCH] Magic block for modules

2006-05-08 Thread Martijn van Oosterhout
On Mon, May 08, 2006 at 10:32:47AM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  I was trying to find variables that when changed would make some things
  corrupt. For example, a changed NAMEDATALEN will make any use of the
  syscache a source of errors. A change in INDEX_MAX_KEYS will break the
  GiST interface, etc.
 
 By that rationale you'd have to record just about every #define in the
 system headers.  And it still wouldn't be bulletproof --- what of
 custom-modified code with, say, extra fields inserted into some widely
 used struct?

I can see that. That's why I specifically aimed at the ones defined in
pg_config_manual.h, i.e., the ones marked "twiddle me".

 ... So checking for identical catversion won't
 accomplish much except to force additional recompile churn on people
 doing development against CVS HEAD.  The original proposal was just
 to check for major PG version match. 

Ok, I've taken out CATVERSION and cut PG version to just the major
version. I've also dropped the compiler and several others.

 Another problem with an expansive list of stuff-to-check is where does
 the add-on module find it out from? 

All these symbols are defined by including c.h only, which is included
by postgres.h, so this is not an issue. I obviously didn't include any
symbols that a module would need to add special includes for. The only
outlier was CATVERSION but we're dropping that test.

 I think that will just mean that we'll break every module at the start
 of 8.2 freeze ;-).  Unless we forget to change it to error, which IMHO
 is way too likely.

Ok, one week then. Not everyone follows -patches, and people will be
mighty confused when a CVS update suddenly breaks everything.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [PATCHES] Page at a time index scan

2006-05-08 Thread Simon Riggs
On Mon, 2006-05-08 at 10:18 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I read your earlier post about needing to lock everything and spent some
  time thinking about this. The issue of needing to lock everything means
  that we would never be able to do a partial vacuum of an index i.e.
  remove one page without a scan. I'm more concerned about making partial
  vacuum work than I am about speeding up an all-block vacuum.
 
 [ shrug... ] That's an illusory goal anyway.  Retail tuple removal is
 just too inefficient.  (No, I don't believe in that proposed patch.)

Current VACUUM optimizes for the case where random updates/deletes
occur. If there are hotspots then scanning the whole relation is O(N) or
even O(N^2) if we have to scan the indexes multiple times.

We think we have a way to improve heap VACUUMs (bitmaps...) but we are
still looking for an equivalent for indexes.

  My thinking was to write the blockid of the original left hand page, so
  as to record the original ancestor since split. Thus if multiple splits
  occurred, then the original ancestor blockid would remain on record
  until VACUUM. In more detail: When we split a page if the ancestor
  blockid is not set, then we set it to be the blockid of the old page
  (new left hand page). If the ancestor blockid is already set then we
  copy that to the new right hand page. Every split will write a value to
  BTPageOpaqueData, though the values to use are already available without
  extra work.
 
 Doesn't work, at least not for making it possible to vacuum part of the
 index.  The conflicting indexscan could have stopped on a page, and then
 that page could have split, before your partial vacuum ever started.
 So tracing back only as far as the data has split since vacuum started
 is not enough to prevent conflict.

That wasn't the proposal. Every split would be marked and stay marked
until those blocks were VACUUMed. The data used to mark is readily
available and doesn't rely on whether or not VACUUM is running.
IMHO this does work.

 (The other little problem is that we'd have to enlarge the BTOpaque
 overhead, because a block id doesn't fit in the available 16 bits.)

ISTM it is indeed a little problem. After CREATE INDEX, most index pages
don't stay completely full, so worrying about alignment wastage might
very occasionally save an I/O, but not enough for us to care. 

  I'm not very happy about an extra lock during page splitting, which adds
  a performance hit even for tables that never will need regular vacuuming
  (apart from occasional wrap-around avoidance).
 
 While I'd rather not have done that, I don't believe that it makes for
 any material performance degradation.  Normal splits all take the lock
 in shared mode and hence suffer no contention.  Your proposal wouldn't
 make for less locking anyway, since it still assumes that there's a way
 to tell whether vacuum is active for a given index, which is just about
 the same amount of overhead as the code-as-committed.

The proposed scheme doesn't rely on knowing whether a VACUUM is active
or not.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [PATCHES] Page at a time index scan

2006-05-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 That wasn't the proposal. Every split would be marked and stay marked
 until those blocks were VACUUMed. The data used to mark is readily
 available and doesn't rely on whether or not VACUUM is running.
 IMHO this does work.

OK, I misunderstood what you had in mind, but now that I do understand
it doesn't seem terribly efficient.  What you're suggesting is that we
take as a vacuum group all the pages that have been split off from a
single original page since that page was last vacuumed, and that this
group must be vacuumed as a whole.  That works, but it seems that the
groups would get awfully large.  In particular, this substantially
penalizes btbulkdelete in hopes of someday improving matters for what
remains an entirely fictional partial vacuum.  As it stands today,
btbulkdelete only has to worry about page groups formed since it began
to run, not since the last vacuum.  Changing the data representation
like this would force it to retrace much more often and over much larger
page groups.

 (The other little problem is that we'd have to enlarge the BTOpaque
 overhead, because a block id doesn't fit in the available 16 bits.)

 ISTM it is indeed a little problem. After CREATE INDEX, most index pages
 don't stay completely full, so worrying about alignment wastage might
 very occasionally save an I/O, but not enough for us to care. 

I don't think an extra 4 or 8 bytes need be a show-stopper, but it does
force an initdb and thus make it harder to compare the two techniques.

regards, tom lane



Re: [PATCHES] Page at a time index scan

2006-05-08 Thread Simon Riggs
On Mon, 2006-05-08 at 11:26 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  That wasn't the proposal. Every split would be marked and stay marked
  until those blocks were VACUUMed. The data used to mark is readily
  available and doesn't rely on whether or not VACUUM is running.
  IMHO this does work.
 
 OK, I misunderstood what you had in mind, but now that I do understand
 it doesn't seem terribly efficient.  What you're suggesting is that we
 take as a vacuum group all the pages that have been split off from a
 single original page since that page was last vacuumed, and that this
 group must be vacuumed as a whole.  That works, but it seems that the
 groups would get awfully large.  In particular, this substantially
 penalizes btbulkdelete in hopes of someday improving matters for what
 remains an entirely fictional partial vacuum.  

OK, so we have the germ of a new mechanism - and I very much agree that
the idea of a partial vacuum is at present entirely fictional...but we
at least have a starting place.

 As it stands today,
 btbulkdelete only has to worry about page groups formed since it began
 to run, not since the last vacuum.  Changing the data representation
 like this would force it to retrace much more often and over much larger
 page groups.

Yes, I saw the potential issue you mention - but for many cases the
index grows forwards and so we wouldn't care in either case. Page splits
that go to lower blockids are limited by available space, so would be
less of a problem. I'm balancing the additional cost of page splits
against the additional cost on the vacuum. I would prefer to keep
in-line ops faster and pay a little extra on the out-of-line operations,
if that's what it takes. I note your point that there is little
contention, but there is still a cost and in many cases this cost is
being paid on tables that never will be VACUUMed.

For insert-intensive apps, this adds cost, for little benefit.

For update-intensive apps, we're VACUUMing continually anyway so there's
no benefit from doing this only-during-VACUUM.

So we just optimised for slowly-but-continually churning tables (i.e.
DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM
performance for those that don't need it that often. That might be the
common case, but it isn't the one that's hurting most.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




[PATCHES] [PATCH] Round 2: Magic block for modules

2006-05-08 Thread Martijn van Oosterhout
Per feedback, here is an updated version. As was pointed out, the prior
version was checking stuff that either changed too often to be useful
or was never going to change at all. The error reporting is cleaned up
too; it now releases the module before throwing the error. It now only
checks four things:

Major version number (7.4 or 8.1 for example)
NAMEDATALEN
FUNC_MAX_ARGS
INDEX_MAX_KEYS

The three constants were chosen because:

1. We document them in the config page in the docs
2. We mark them as changeable in pg_config_manual.h
3. Changing any of these will break some of the more popular modules:

FUNC_MAX_ARGS changes fmgr interface, every module uses this
NAMEDATALEN changes syscache interface, every PL as well as tsearch uses this
INDEX_MAX_KEYS breaks tsearch and anything using GiST.

I considered others but ultimately rejected them. For example,
HAVE_INT64_TIMESTAMP, while changing the way timestamps are stored and
being configurable by a configure option, doesn't actually break
anything important (only the btree_gist example in contrib).
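
The check the patch implements can be sketched as follows (field values
here are illustrative only -- 802 for a hypothetical 8.2 major version and
the usual defaults for the three constants; the real struct and macros live
in pgmagic.h in the patch):

```c
#include <string.h>

/* Sketch of the magic block: the module is compiled with the same
 * struct-building macro as the server, and the server compares the
 * module's copy against its own. */
typedef struct {
    int len;                /* sizeof(this struct): catches layout drift */
    int version;            /* major version only, e.g. 802 */
    int namedatalen;
    int func_max_args;
    int index_max_keys;
} Pg_magic_struct;

#define PG_MODULE_MAGIC_DATA \
    { sizeof(Pg_magic_struct), 802, 64, 100, 32 }

static const Pg_magic_struct server_magic = PG_MODULE_MAGIC_DATA;

/* The server looks up the module's magic function, calls it, and
 * compares the returned block against its own; returns 1 if loadable. */
static int magic_matches(const Pg_magic_struct *module_magic)
{
    return module_magic->len == (int) sizeof(Pg_magic_struct) &&
           memcmp(module_magic, &server_magic,
                  sizeof(Pg_magic_struct)) == 0;
}

int magic_demo(void)
{
    Pg_magic_struct ok = PG_MODULE_MAGIC_DATA;
    Pg_magic_struct bad = PG_MODULE_MAGIC_DATA;
    bad.namedatalen = 32;    /* module built with a different NAMEDATALEN */
    return magic_matches(&ok) && !magic_matches(&bad);
}
```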

Any more comments?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.
Index: doc/src/sgml/xfunc.sgml
===
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/xfunc.sgml,v
retrieving revision 1.112
diff -c -r1.112 xfunc.sgml
*** doc/src/sgml/xfunc.sgml 23 Apr 2006 03:39:52 -  1.112
--- doc/src/sgml/xfunc.sgml 8 May 2006 17:41:33 -
***
*** 1149,1154 
--- 1149,1161 
 </para>
  
 <para>
+ After the module has been found, PostgreSQL looks for its magic block.
+ This block contains information about the environment a module was
+ compiled in. The server uses this to verify the module was compiled
+ under the same assumptions and environment as the backend.
+</para>
+ 
+<para>
  The user ID the <productname>PostgreSQL</productname> server runs
  as must be able to traverse the path to the file you intend to
  load.  Making the file or a higher-level directory not readable
***
*** 1953,1958 
--- 1960,1985 
  
   <listitem>
    <para>
+ To ensure your module is not loaded into an incompatible backend, it
+ is recommended to include a magic block. To do this you must include
+ the header <filename>pgmagic.h</filename> and declare the block as
+ follows:
+</para>
+ 
+ <programlisting>
+ #include "pgmagic.h"
+ 
+ PG_MODULE_MAGIC;
+ </programlisting>
+ 
+<para>
+ If the module consists of multiple source files, this only needs to
+ be done in one of them.
+</para>
+   </listitem>
+ 
+   <listitem>
+    <para>
  Symbol names defined within object files must not conflict
  with each other or with symbols defined in the
  <productname>PostgreSQL</productname> server executable.  You
Index: src/backend/utils/fmgr/dfmgr.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/utils/fmgr/dfmgr.c,v
retrieving revision 1.82
diff -c -r1.82 dfmgr.c
*** src/backend/utils/fmgr/dfmgr.c  5 Mar 2006 15:58:46 -   1.82
--- src/backend/utils/fmgr/dfmgr.c  8 May 2006 17:41:33 -
***
*** 20,26 
  #include "dynloader.h"
  #include "miscadmin.h"
  #include "utils/dynamic_loader.h"
! 
  
  /*
   * List of dynamically loaded files (kept in malloc'd memory).
--- 20,26 
  #include "dynloader.h"
  #include "miscadmin.h"
  #include "utils/dynamic_loader.h"
! #include "pgmagic.h"
  
  /*
   * List of dynamically loaded files (kept in malloc'd memory).
***
*** 60,65 
--- 60,68 
  static char *expand_dynamic_library_name(const char *name);
  static char *substitute_libpath_macro(const char *name);
  
+ /* Magic structure that module needs to match to be accepted */
+ static Pg_magic_struct magic_data = PG_MODULE_MAGIC_DATA;
+ 
  /*
   * Load the specified dynamic-link library file, and look for a function
   * named funcname in it.  (funcname can be NULL to just load the file.)
***
*** 116,121 
--- 119,125 
  
if (file_scanner == NULL)
{
+   PGModuleMagicFunction magic_func;
/*
 * File not loaded yet.
 */
***
*** 146,151 
--- 150,194 
fullname, load_error)));
}
  
+   /* Check the magic function to determine compatibility */
+   magic_func = pg_dlsym( file_scanner->handle, PG_MAGIC_FUNCTION_NAME_STRING );
+   if( magic_func )
+   {
+   Pg_magic_struct *module_magic_data = magic_func();
+   if( module_magic_data->len != magic_data.len ||
+   memcmp( 

Re: [PATCHES] Page at a time index scan

2006-05-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 So we just optimised for slowly-but-continually churning tables (i.e.
 DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM
 performance for those that don't need it that often. That might be the
 common case, but it isn't the one that's hurting most.

No, we just improved VACUUM performance for those that need it most.
I was just doing some desultory experiments with today's code versus
yesterday's (it's easy to make a direct comparison because they're
initdb-compatible ... just stop one postmaster executable and start
another on the same database).  I made a table of 16M rows with an
index over a random-data integer column.  With a thoroughly disordered
index (built on-the-fly as the random data was inserted), the time to
VACUUM after deleting a small number of rows was 615 seconds with
yesterday's code, 31 seconds today.  With a perfectly-ordered index
(identical table, but CREATE INDEX after all the data is in place), the
times were about 28 and 26 seconds respectively.  (I wouldn't put a
*whole* lot of faith in these numbers, since they're from a Dell machine
with a toy ATA drive, but anyway they do show that sequential access to
the index makes a huge difference.)  But perfectly-ordered indexes are
impossible to maintain unless your data is either static or insert-at-
the-end-only.

Anyway, while the extra LWLock and shared-memory access during a split
is annoying, I think it's really negligible (and so does pgbench).
A page split is going to involve many times that much work --- you've
got to acquire lock on at least four different buffers, visit the FSM,
possibly ask the kernel for another disk page, do XLogInsert twice, etc.
Any one of those things involves significantly more CPU effort and
contention than _bt_vacuum_cycleid().  So while I'd be happy to get rid
of it, not at the price of slowing down VACUUM again.

regards, tom lane



Re: [PATCHES] [WIP] The relminxid addition, try 3

2006-05-08 Thread Alvaro Herrera
Tom Lane wrote:

 But why do you need your own xlogging at all?  Shouldn't these actions
 be perfectly ordinary updates of the relevant catalog tuples?

The XLog entry can be smaller, AFAICT (we need to store the whole new
tuple in a heap_update, right?).  If that's not a worthy goal I'll
gladly rewrite this piece of code.

Ah, there's another reason, and it's that I'm rewriting the tuple in
place, not calling heap_update.  I'm not sure if I can reuse
log_heap_update for this purpose -- I'll take a look.

-- 
Alvaro Herrera        http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PATCHES] [WIP] The relminxid addition, try 3

2006-05-08 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Ah, there's another reason, and it's that I'm rewriting the tuple in
 place, not calling heap_update.

Is that really a good idea, as compared to using heap_update?

(Now, if you're combining this with the very grotty relpages/reltuples
update code, then I'm all for making that xlog properly --- we've gotten
away without it so far but it really should xlog the change.)

regards, tom lane



Re: [PATCHES] Page at a time index scan

2006-05-08 Thread Luke Lonergan
Tom,

On 5/8/06 11:46 AM, Tom Lane [EMAIL PROTECTED] wrote:

 I made a table of 16M rows with an
 index over a random-data integer column.  With a thoroughly disordered
 index (built on-the-fly as the random data was inserted), the time to
 VACUUM after deleting a small number of rows was 615 seconds with
 yesterday's code, 31 seconds today.  With a perfectly-ordered index
 (identical table, but CREATE INDEX after all the data is in place), the
 times were about 28 and 26 seconds respectively.

Very impressive!  This corroborates findings we've had with index
maintenance in the field - thanks for finding/fixing this.

- Luke





Re: [PATCHES] [WIP] The relminxid addition, try 3

2006-05-08 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Ah, there's another reason, and it's that I'm rewriting the tuple in
  place, not calling heap_update.
 
 Is that really a good idea, as compared to using heap_update?

Not sure -- we would leave dead tuples after the VACUUM is finished if
we used heap_update, no?  ISTM it's undesirable.

 (Now, if you're combining this with the very grotty relpages/reltuples
 update code, then I'm all for making that xlog properly --- we've gotten
 away without it so far but it really should xlog the change.)

I hadn't done that, but I'll see what I can do.  Notice however that I'm
doing this in both pg_class _and_ pg_database.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PATCHES] [WIP] The relminxid addition, try 3

2006-05-08 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 (Now, if you're combining this with the very grotty relpages/reltuples
 update code, then I'm all for making that xlog properly --- we've gotten
 away without it so far but it really should xlog the change.)

 I hadn't done that, but I'll see what I can do.  Notice however that I'm
 doing this in both pg_class _and_ pg_database.

BTW, one thing I was looking at last week was whether we couldn't
refactor the relpages/reltuples updates to be done in a cleaner place.
Right now UpdateStats is called (for indexes) directly from the index
AM, which sucks from a modularity point of view, and what's worse it
forces two successive updates of the same tuple during CREATE INDEX.
We should reorganize things so this is done once at the outer level.
It'd require some change of the ambuild() API, but considering we're
hacking several other aspects of the AM API in this development cycle,
that doesn't bother me.
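
As a rough illustration of that refactoring (assumed names only, not the actual PostgreSQL AM API), the idea would be for ambuild to hand its tuple counts back to the caller, so CREATE INDEX can update each pg_class row exactly once instead of the AM doing it internally:

```c
/*
 * Sketch (hypothetical struct/function names, not PostgreSQL's real API)
 * of the refactoring discussed above: ambuild returns the tuple counts
 * instead of calling UpdateStats itself, so the outer CREATE INDEX code
 * can perform a single catalog update per relation.
 */
#include <assert.h>

typedef struct IndexBuildResult
{
    double heap_tuples;   /* heap tuples scanned during the build */
    double index_tuples;  /* tuples actually entered into the index */
} IndexBuildResult;

/* Stand-in for an AM's build function: does the build, reports counts. */
static IndexBuildResult
sample_ambuild(void)
{
    IndexBuildResult res;
    res.heap_tuples = 1000.0;   /* e.g. rows seen in the heap scan */
    res.index_tuples = 990.0;   /* e.g. rows indexed (partial index) */
    return res;
}
```

The caller would then make one UpdateStats-style call for the heap and one for the index, avoiding the two successive updates of the same tuple that the current arrangement forces.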

regards, tom lane



Re: [PATCHES] [WIP] The relminxid addition, try 3

2006-05-08 Thread Alvaro Herrera
Tom Lane wrote:

 BTW, one thing I was looking at last week was whether we couldn't
 refactor the relpages/reltuples updates to be done in a cleaner place.
 Right now UpdateStats is called (for indexes) directly from the index
 AM, which sucks from a modularity point of view, and what's worse it
 forces two successive updates of the same tuple during CREATE INDEX.
 We should reorganize things so this is done once at the outer level.
 It'd require some change of the ambuild() API, but considering we're
 hacking several other aspects of the AM API in this development cycle,
 that doesn't bother me.

I'm rather stuck in Mammoth Replicator currently :-( so I'm not sure
when I could devote some time to this task.  If you want to do it, go
ahead and I can adapt the relminxid patch to your changes.  Or you can
take the relminxid patch from where it currently is, if you feel so
inclined.

If you don't, I'll eventually come back to it, but I'm not sure when
that will be.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PATCHES] [WIP] The relminxid addition, try 3

2006-05-08 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 We should reorganize things so this is done once at the outer level.
 It'd require some change of the ambuild() API, but considering we're
 hacking several other aspects of the AM API in this development cycle,
 that doesn't bother me.

 I'm rather stuck in Mammoth Replicator currently :-( so I'm not sure
 when I could devote some time to this task.

OK, I'll try to get to it later this week.

regards, tom lane



Re: [PATCHES] Building with Visual C++

2006-05-08 Thread Peter Eisentraut
Tom Lane wrote:
 BTW, has anyone looked at the possibility of driving VC from gmake,
 so that we can continue to use the same Makefiles?  Or is that just
 out of the question?

The Coin3D project had a wrapper program that made VC look like a Unix
compiler.  Looking at
http://www.coin3d.org/support/coin_faq/#extension_MSVCproject now, it
seems they have since switched to autogenerating a project file as well.
But check there anyway; these guys have been doing this for years.
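
The core of such a wrapper is translating Unix-style compiler flags into their cl.exe spellings. A toy sketch of that mapping (purely illustrative; this is not the Coin3D tool, and real wrappers handle many more options):

```c
/*
 * Toy flag translator sketching what a "VC looks like cc" wrapper does:
 * map common Unix compiler options onto their cl.exe equivalents.
 * Hypothetical illustration only.
 */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Translate one gcc-style option into buf; returns buf. */
static const char *
translate_flag(const char *flag, char *buf, size_t bufsize)
{
    if (strcmp(flag, "-c") == 0)
        snprintf(buf, bufsize, "/c");              /* compile only */
    else if (strncmp(flag, "-I", 2) == 0)
        snprintf(buf, bufsize, "/I%s", flag + 2);  /* include path */
    else if (strncmp(flag, "-D", 2) == 0)
        snprintf(buf, bufsize, "/D%s", flag + 2);  /* preprocessor define */
    else if (strncmp(flag, "-o", 2) == 0)
        snprintf(buf, bufsize, "/Fo%s", flag + 2); /* output object file */
    else
        snprintf(buf, bufsize, "%s", flag);        /* pass through */
    return buf;
}
```

A complete wrapper would also rewrite library arguments and invoke cl.exe with the translated argument vector, which is why such tools tend to grow into full build-system ports.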

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [PATCHES] Building with Visual C++

2006-05-08 Thread Magnus Hagander
  BTW, has anyone looked at the possibility of driving VC from gmake,
  so that we can continue to use the same Makefiles?  Or is that just
  out of the question?
 
 The Coin 3D project had a wrapper program that makes VC look like a
 Unix compiler.  Looking at
 http://www.coin3d.org/support/coin_faq/#extension_MSVCproject now, it
 seems they have changed to autogenerating a project file as well.  But
 check there anyway; these guys have been doing this for years.

Definitely worth looking at - but they still require Cygwin to run their
generators, from what I can tell.

//Magnus



Re: [PATCHES] [WIP] The relminxid addition, try 3

2006-05-08 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 (Now, if you're combining this with the very grotty relpages/reltuples
 update code, then I'm all for making that xlog properly --- we've gotten
 away without it so far but it really should xlog the change.)

 I hadn't done that, but I'll see what I can do.  Notice however that I'm
 doing this in both pg_class _and_ pg_database.

It strikes me that the cleanest way to deal with this is to invent a
single new type of xlog record, something like HEAP_UPDATE_IN_PLACE,
which just replaces tuple contents with a new tuple of the same size.
This would serve for both stats updates and unfreezing in both pg_class
and pg_database, and might have other uses in future for
non-transactional updates.  It's a little bit more xlog space than your
unfreeze-specific operations, but I don't think we need to sweat a few
bytes for those; they're not likely to be common.
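
As a minimal sketch of what such a record and its redo routine might look like (field names and layout here are hypothetical, not PostgreSQL's actual xlog structs): because the replacement tuple is exactly the same length as the old one, redo is a plain overwrite, with no line-pointer movement and no free-space accounting.

```c
/*
 * Hypothetical HEAP_UPDATE_IN_PLACE WAL record, per the proposal above.
 * Illustrative only -- not PostgreSQL's real xl_heap_* structs.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct InPlaceUpdateRecord
{
    uint32_t rel_node;   /* relation file node being updated */
    uint32_t block;      /* heap block number */
    uint16_t offset;     /* line pointer of the tuple to overwrite */
    uint16_t tuple_len;  /* length of old and new tuple (must match) */
    /* tuple_len bytes of replacement tuple data follow the header */
} InPlaceUpdateRecord;

/* Redo: overwrite the old tuple contents with the logged replacement. */
static void
redo_in_place_update(char *tuple_on_page,
                     const InPlaceUpdateRecord *rec,
                     const char *payload)
{
    memcpy(tuple_on_page, payload, rec->tuple_len);
}
```

Since the tuple length never changes, the same record type can serve pg_class stats updates, unfreezing, and any future non-transactional overwrite.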

Barring objections, I'll add this when I refactor UpdateStats.  Got some
other work I need to get back to right now, though.

regards, tom lane



Re: [PATCHES] Page at a time index scan

2006-05-08 Thread Simon Riggs
On Mon, 2006-05-08 at 14:46 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  So we just optimised for slowly-but-continually churning tables (i.e.
  DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM
  performance for those that don't need it that often. That might be the
 common case, but it isn't the one that's hurting most.
 
 No, we just improved VACUUM performance for those that need it most.
 I was just doing some desultory experiments with today's code versus
 yesterday's (it's easy to make a direct comparison because they're
 initdb-compatible ... just stop one postmaster executable and start
 another on the same database).  I made a table of 16M rows with an
 index over a random-data integer column.  With a thoroughly disordered
 index (built on-the-fly as the random data was inserted), the time to
 VACUUM after deleting a small number of rows was 615 seconds with
 yesterday's code, 31 seconds today.  With a perfectly-ordered index
 (identical table, but CREATE INDEX after all the data is in place), the
 times were about 28 and 26 seconds respectively.  (I wouldn't put a
 *whole* lot of faith in these numbers, since they're from a Dell machine
 with a toy ATA drive, but anyway they do show that sequential access to
 the index makes a huge difference.)  But perfectly-ordered indexes are
 impossible to maintain unless your data is either static or insert-at-
 the-end-only.

You and Heikki have achieved a marvelous thing, well done.

 Anyway, while the extra LWLock and shared-memory access during a split
 is annoying, I think it's really negligible (and so does pgbench).
 A page split is going to involve many times that much work --- you've
 got to acquire lock on at least four different buffers, visit the FSM,
 possibly ask the kernel for another disk page, do XLogInsert twice, etc.
 Any one of those things involves significantly more CPU effort and
 contention than _bt_vacuum_cycleid().  So while I'd be happy to get rid
 of it, not at the price of slowing down VACUUM again.

I'll raise the partial vacuum topic on -hackers. 

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com




Re: [PATCHES] Building with Visual C++

2006-05-08 Thread Hiroshi Saito

Hi Magnus.

I found that adding these definitions helped:

#define PGBINDIR "/usr/local/pgsql/bin"
#define PGSHAREDIR "/usr/local/pgsql/share"
#define SYSCONFDIR "/usr/local/pgsql/etc"
#define INCLUDEDIR "/usr/local/pgsql/include"
#define PKGINCLUDEDIR "/usr/local/pgsql/include"
#define INCLUDEDIRSERVER "/usr/local/pgsql/include/server"
#define LIBDIR "/usr/local/pgsql/lib"
#define PKGLIBDIR "/usr/local/pgsql/lib"
#define LOCALEDIR 
#define DOCDIR "/usr/local/pgsql/doc"
#define MANDIR "/usr/local/pgsql/man"

With these and part of your patch applied, it now builds on VC++6, and
things look good. :-)
However, would you also apply the additional patch below?

Regards,
Hiroshi Saito
--- src/include/port/win32.h.orig   Mon May  8 14:45:11 2006
+++ src/include/port/win32.hMon May  8 15:15:09 2006
@@ -1,5 +1,8 @@
/* $PostgreSQL: pgsql/src/include/port/win32.h,v 1.51 2006/03/03 20:52:36 momjian Exp $ */

+#ifndef _PORT_WIN32_H
+#define _PORT_WIN32_H
+
/* undefine and redefine after #include */
#undef mkdir

@@ -11,6 +14,7 @@
#include <errno.h>

#undef near
+#define near pg_near

/* Must be here to avoid conflicting with prototype in windows.h */
#define mkdir(a,b)  mkdir(a)
@@ -256,3 +260,5 @@

/* in backend/port/win32/error.c */
extern void _dosmaperr(unsigned long);
+
+#endif /* _PORT_WIN32_H */
--- src/include/getaddrinfo.h.orig  Mon May  8 14:35:41 2006
+++ src/include/getaddrinfo.h   Mon May  8 14:36:54 2006
@@ -43,7 +43,9 @@
#define EAI_SYSTEM  (-11)
#else   /* WIN32 */
#if defined(WIN32_CLIENT_ONLY)
+#ifndef WSA_NOT_ENOUGH_MEMORY
#define WSA_NOT_ENOUGH_MEMORY   (WSAENOBUFS)
+#endif
#define WSATYPE_NOT_FOUND   (WSABASEERR+109)
#endif
#define EAI_AGAIN   WSATRY_AGAIN
