Re: [HACKERS] Vista/IPv6

2007-04-13 Thread Warren Turkal
On Thursday 12 April 2007 00:56, you wrote:
 IIRC, cmake is a replacement for make, not for configure. Or did I miss
 something?

CMake also has configure functionality.

 And it would require rewriting all the unix makefiles in cmake format,
 and it was one of the major requirements for this undertaking that this
 would not be required.

The CMakeLists.txt files can coexist with the current Makefiles. You could
just do out-of-tree builds to keep from overwriting the original Makefiles
during generation. Of course, generating MSVC project files wouldn't
overwrite the Makefiles in any case.

wt
-- 
Warren Turkal (w00t)


Re: [HACKERS] Vista/IPv6

2007-04-13 Thread Magnus Hagander
On Fri, Apr 13, 2007 at 12:44:41AM -0600, Warren Turkal wrote:
 On Thursday 12 April 2007 00:56, you wrote:
  IIRC, cmake is a replacement for make, not for configure. Or did I miss
  something?
 
 CMake also has configure functionality.
 
  And it would require rewriting all the unix makefiles in cmake format,
  and it was one of the major requirements for this undertaking that this
  would not be required.
 
 The CMakeLists.txt files can coexist with the current Makefiles. You could
 just do out-of-tree builds to keep from overwriting the original Makefiles
 during generation. Of course, generating MSVC project files wouldn't
 overwrite the Makefiles in any case.

That would mean maintaining two sets of build files. That's never going to
work. The only option if we were to go with that is, IMHO, to move
*everything* over to it. And I don't think you'll get a lot of people
supporting that ;)

//Magnus



Re: [HACKERS] Eliminating unnecessary left joins

2007-04-13 Thread Zeugswetter Andreas ADI SD

 I have this exact problem a lot.  There are actually cases 
 where you can eliminate regular joins, not just left joins.  
 For example:
 
 CREATE TABLE partner (
   id          serial,
   name        varchar(40) not null,
   primary key (id)
 );
 
 CREATE TABLE project (
   id          serial,
   name        varchar(40) not null,
   partner_id  integer not null references project (id)

^^^ -- I assume typo, should be partner
 );
 
 CREATE VIEW project_view AS
 SELECT p.id, p.name, p.partner_id, pp.name AS partner FROM 
 project p, partner pp WHERE p.partner_id = pp.id;

Same advice to you:

1. add not null to your ids
2. CREATE VIEW project_view AS
   SELECT p.id, p.name, p.partner_id, pp.name AS partner FROM
   project p left outer join partner pp ON p.partner_id = pp.id;
   (see the example below)
3. wait for (or implement :-) the left join optimization in pg
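For illustration (assuming the constraints above, so partner.id is unique
and project.partner_id is NOT NULL): a query that touches no partner
columns cannot have its result changed by the left join, so a planner with
this optimization could drop the join entirely:

-- touches only project columns; the left join to partner is removable
SELECT id, name FROM project_view;

-- still needs the join, since it selects the partner name
SELECT id, partner FROM project_view;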

Andreas


Re: [HACKERS] conflicting gettimeofday with MinGW

2007-04-13 Thread Magnus Hagander
On Wed, Apr 11, 2007 at 06:17:19PM +0200, Zeugswetter Andreas ADI SD wrote:
 
   That page is ages out of date. The intended sync is apparently broken.
   The current download area is on sourceforge
   http://sf.net/project/showfiles.php?group_id=2435
  
  *sigh*
  
  And what is in 3.12, which is apparently the current version?
 
 Sorry, that was implied. sys/time.h did not change between 3.10 and 3.12.
 There is no #define to remove the declaration.
 
 In cvs the file is marked as dead and gives a reference to cygwin's w32api.
 Maybe the answer is to simply note that the mingw-runtime binary from
 sourceforge above 3.9 has a broken sys/time.h (it is not the file from
 cygwin but the dead one from cvs).

I think that sounds reasonable. Something for the mingw FAQ. Care to
provide Bruce with a patch?

//Magnus



Re: [HACKERS] Vista/IPv6

2007-04-13 Thread Magnus Hagander
On Thu, Apr 12, 2007 at 08:46:09PM +0900, Hiroshi Saito wrote:
 src/include/pg_config.h.win32
 /* Define to 1 if you have support for IPv6. */
 // #define HAVE_IPV6 1
 
 What do you think?
 
 It's defined in the msvc build script, see
 http://archives.postgresql.org/pgsql-committers/2007-04/msg00148.php
 
 That define is meant for win32.mak, which is outside the msvc tooling.

I see.

But - does it work when built with MSVC6? IIRC, MSVC6 pre-dates Windows
2000 and the Windows IPv6 support.

Can you verify that it works if you manually add this #define and build
with MSVC6?

//Magnus



Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-13 Thread Zeugswetter Andreas ADI SD

  Yup, this is a good summary.
  
  You say you need to remove the optimization that avoids the logging of
  a new tuple because the full page image exists.
  I think we must already have the info in WAL which tuple inside the
  full page image is new (the one for which we avoided the WAL entry).
  
  How about this:
  Leave current WAL as it is and only add the not-removable flag to
  full pages.
  pg_compresslog then replaces the full page image with a record for the
  one tuple that is changed.
  I tend to think it is not worth the increased complexity only to save
  bytes in the uncompressed WAL though.
 
 It is essentially what my patch proposes.  My patch adds a flag to
 full page writes indicating which ones can be removed.

Ok, a flag that marks full page images that can be removed is perfect.

But you also turn off the optimization that avoids writing regular
WAL records when the info is already contained in a full-page image
(increasing the uncompressed size of WAL).
It was that part I questioned. As already stated, maybe I should not
have, because it would be too complex to reconstruct a regular WAL
record from the full-page image.
But that code would also be needed for WAL-based partial replication,
so if it were too complicated we would eventually want a switch to
turn off the optimization anyway (at least for heap page changes).

  Another point about pg_decompresslog:
  
  Why do you need a pg_decompresslog ? Imho pg_compresslog should
  already do the replacing of the full_page with the dummy entry. Then
  pg_decompresslog could be a simple gunzip, or whatever compression
  was used, but no logic.
 
 Just removing full page writes does not work.  If we shift the rest of
 the WAL, then the LSN becomes inconsistent in the compressed archive
 logs which pg_compresslog produces.  For recovery, we have to restore
 the LSN as in the original WAL.  Pg_decompresslog restores removed full
 page writes as dummy records so that recovery redo functions won't be
 confused.

Ah sorry, I needed some pgsql/src/backend/access/transam/README reading.

LSN is the physical position of records in WAL. Thus your dummy record
size is equal to what you cut out of the original record.
What about disconnecting WAL LSN from physical WAL record position
during replay?
Add simple short WAL records in pg_compresslog like: advance LSN by 8192
bytes.

Andreas


[HACKERS] where to write small reusable functions ?

2007-04-13 Thread Dany DeBontridder

Hi,

I'm working to implement a new feature for pg_dump: the ability to dump
objects like functions, indexes... And I noticed that there are some useful
functions like pg_malloc, pg_calloc... So I've added pg_free to avoid the
if-not-null-free-point-to-NULL sequence, and now I'd like to add a function
pg_strcat like this:
char *
pg_strcat(char *dest, const char *src)
{
    /* pg_realloc is a safer wrapper around realloc: it exits on failure.
     * dest must point to memory obtained from pg_malloc/pg_realloc. */
    dest = pg_realloc(dest, strlen(dest) + strlen(src) + 1);
    strcat(dest, src);
    return dest;
}

But, in that case, those functions are only usable by pg_dump; what about
the rest of the code? Don't we have a central location for those small
reusable snippets of code?

Regards,

.D.


Re: [HACKERS] where to write small reusable functions ?

2007-04-13 Thread Heikki Linnakangas

Dany DeBontridder wrote:

I'm working to implement a new feature to pg_dump: the ability to dump
objects like function, indexes... 


pg_dump already dumps functions and indexes.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] where to write small reusable functions ?

2007-04-13 Thread NikhilS

Hi,

char *
pg_strcat(char *dest, const char *src)
{
    /* pg_realloc is a safer wrapper around realloc */
    dest = pg_realloc(dest, strlen(dest) + strlen(src) + 1);
    strcat(dest, src);
    return dest;
}



Postgres already has something for the above functionality.

See makeStringInfo, appendStringInfo.
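For example, a minimal sketch of that API (note StringInfo is backend-side;
frontend code such as pg_dump would reach for PQExpBuffer instead):

#include "postgres.h"
#include "lib/stringinfo.h"

static char *
build_message(const char *name)
{
    StringInfo  buf = makeStringInfo(); /* palloc'd buffer, grows as needed */

    appendStringInfoString(buf, "Hello, ");
    appendStringInfo(buf, "%s!", name); /* printf-style formatting */
    return buf->data;   /* NUL-terminated; buf->len holds the length */
}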

Regards,
Nikhils
--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] where to write small reusable functions ?

2007-04-13 Thread Peter Eisentraut
Am Freitag, 13. April 2007 14:28 schrieb Dany DeBontridder:
 But, in that case, those functions are only usable for pg_dump, what about
 the rest of code ? We don't have a central location for those small
 reusable snippets of code ?

The main point of these functions is to catch errors and exit the program.  
But that behavior is very program-dependent, so I don't think it'd be useful 
to put them in a central location.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: [HACKERS] where to write small reusable functions ?

2007-04-13 Thread Dany DeBontridder

On 4/13/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:

 Dany DeBontridder wrote:
  I'm working to implement a new feature to pg_dump: the ability to dump
  objects like functions, indexes...
 
 pg_dump already dumps functions and indexes.

Right, but you can't dump only one or two functions, or only the functions
and nothing else (the same for indexes, triggers...).

D.


Re: [HACKERS] [PATCHES] Reviewers Guide to Deferred Transactions/Transaction Guarantee

2007-04-13 Thread Zeugswetter Andreas ADI SD

I agree with Tom's reasoning about the suggested simplifications, sorry.

  3. Should the WALWriter also do the wal_buffers half-full write at the
  start of XLogInsert() ?
 
 That should go away entirely; to me the main point of the 
 separate wal-writer process is to take over responsibility 
 for not letting too many dirty wal buffers accumulate.

That also sounds a lot simpler, but I think Bruce wanted to be able to
give some time guarantee to the transactions that are not waiting for
fsync.
When a commit only half-filled the page and no more WAL comes in for
a long time, only the WALWriter is left to do the I/O.
The WALWriter would need to flush a half-full page after the timeout
only if it contains a commit record.
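A standalone sketch of that rule (every helper name here is a hypothetical
stand-in, not actual PostgreSQL code):

#include <stdbool.h>

/* hypothetical state checks for the WAL writer */
extern bool partial_page_pending(void);     /* a half-full WAL page exists */
extern bool pending_page_has_commit(void);  /* it holds a commit record */
extern bool timeout_expired(void);          /* guarantee interval elapsed */
extern void flush_partial_page(void);

void
walwriter_timeout_check(void)
{
    /* Completed pages are written unconditionally elsewhere; a half-full
     * page is flushed on timeout only if a commit record is waiting. */
    if (partial_page_pending() &&
        pending_page_has_commit() &&
        timeout_expired())
        flush_partial_page();
}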

One more question on autocommit:
Do we wait for a flush for autocommitted DML?
It seems we generally should not.

Andreas



Re: [HACKERS] conflicting gettimeofday with MinGW

2007-04-13 Thread Andrew Dunstan

Magnus Hagander wrote:

 On Wed, Apr 11, 2007 at 06:17:19PM +0200, Zeugswetter Andreas ADI SD wrote:
    That page is ages out of date. The intended sync is apparently broken.
    The current download area is on sourceforge
    http://sf.net/project/showfiles.php?group_id=2435
   
   *sigh*
   
   And what is in 3.12, which is apparently the current version?
  
  Sorry, that was implied. sys/time.h did not change between 3.10 and 3.12.
  There is no #define to remove the declaration.
  
  In cvs the file is marked as dead and gives a reference to cygwin's
  w32api.
  Maybe the answer is to simply note that the mingw-runtime binary from
  sourceforge above 3.9 has a broken sys/time.h (it is not the file from
  cygwin but the dead one from cvs).
 
 I think that sounds reasonable. Something for the mingw FAQ. Care to
 provide Bruce with a patch?


No, that means that we can't build with currently released Mingw, AIUI. 
I think the solution is probably to rename our replacement gettimeofday 
and add a #define to translate the calls. That's pretty much what we've 
done in other cases, and has seemed to be a more robust strategy than 
supplying replacements for standard library routines with routines of 
the same name.
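A sketch of that pattern (pgwin32_gettimeofday is a hypothetical name; the
real change would live in the win32 port headers):

/* Declare the replacement under a private name and redirect calls to it,
 * sidestepping the conflicting gettimeofday() declaration in the MinGW
 * runtime's sys/time.h. */
struct timeval;     /* provided by the platform headers */

extern int pgwin32_gettimeofday(struct timeval *tp, void *tzp);

#define gettimeofday(tp, tzp) pgwin32_gettimeofday((tp), (tzp))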


cheers

andrew




Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-13 Thread Tom Lane
Zeugswetter Andreas ADI SD [EMAIL PROTECTED] writes:
 But you also turn off the optimization that avoids writing regular
 WAL records when the info is already contained in a full-page image
 (increasing the uncompressed size of WAL).
 It was that part I questioned.

That's what bothers me about this patch, too.  It will be increasing
the cost of writing WAL (more data -> more CRC computation and more
I/O, not to mention more contention for the WAL locks) which translates
directly to a server slowdown.

The main arguments that I could see against Andreas' alternative are:

1. Some WAL record types are arranged in a way that actually would not
permit the reconstruction of the short form from the long form, because
they throw away too much data when the full-page image is substituted.
An example that's fresh in my mind is that the current format of the
btree page split WAL record discards newitemoff in that case, so you
couldn't identify the inserted item in the page image.  Now this is only
saving two bytes in what's usually going to be a darn large record
anyway, and it complicates the code to do it, so I wouldn't cry if we
changed btree split to include newitemoff always.  But there might be
some other cases where more data is involved.  In any case, someone
would have to look through every single WAL record type to determine
whether reconstruction is possible and fix it if not.

2. The compresslog utility would have to have specific knowledge about
every compressible WAL record type, to know how to convert it to the
short format.  That means an ongoing maintenance commitment there.
I don't think this is unacceptable, simply because we need only teach
it about a few common record types, not everything under the sun ---
anything it doesn't know how to fix, just leave alone, and if it's an
uncommon record type it really doesn't matter.  (I guess that means
that we don't really have to do #1 for every last record type, either.)

So I don't think either of these is a showstopper.  Doing it this way
would certainly make the patch more acceptable, since the argument that
it might hurt rather than help performance in some cases would go away.

 What about disconnecting WAL LSN from physical WAL record position
 during replay ?
 Add simple short WAL records in pg_compresslog like: advance LSN by 8192
 bytes.

I don't care for that, as it pretty much destroys some of the more
important sanity checks that xlog replay does.  The page boundaries
need to match the records contained in them.  So I think we do need
to have pg_decompresslog insert dummy WAL entries to fill up the
space saved by omitting full pages.
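For illustration, a rough sketch of such a filler record, assuming the
existing XLOG_NOOP record type and the 8.2-era XLogRecord layout (header
locations approximate; xl_prev/xl_crc bookkeeping elided):

#include "postgres.h"
#include "access/xlog.h"    /* XLogRecord, SizeOfXLogRecord */
#include "access/rmgr.h"    /* RM_XLOG_ID */

/* Write a do-nothing record whose total length equals the bytes the
 * omitted full-page image occupied, so every later record keeps its
 * original LSN. */
static void
emit_filler_record(char *out, uint32 bytes_removed)
{
    XLogRecord *rec = (XLogRecord *) out;

    MemSet(out, 0, bytes_removed);
    rec->xl_len = bytes_removed - SizeOfXLogRecord; /* zero-filled payload */
    rec->xl_tot_len = bytes_removed;
    rec->xl_rmid = RM_XLOG_ID;
    rec->xl_info = XLOG_NOOP;   /* redo for XLOG_NOOP does nothing */
}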

regards, tom lane


Re: [HACKERS] Group Commit

2007-04-13 Thread Simon Riggs
On Tue, 2007-04-10 at 11:40 +0100, Heikki Linnakangas wrote:
 Tom Lane wrote:
  Heikki Linnakangas [EMAIL PROTECTED] writes:
  I've been working on the patch to enhance our group commit behavior. The 
  patch is a dirty hack at the moment, but I'm settled on the algorithm 
  I'm going to use and I know the issues involved.
  
  One question that just came to mind is whether Simon's no-commit-wait
  patch doesn't fundamentally alter the context of discussion for this.

I was certainly intending that it would.

  Aside from the prospect that people won't really care about group commit
  if they can just use the periodic-WAL-sync approach, ISTM that one way
  to get group commit is to just make everybody wait for the dedicated
  WAL writer to write their commit record.  With a sufficiently short
  delay between write/fsync attempts in the background process, won't
  that net out at about the same place as a complicated group-commit
  patch?
 
 Possibly. To get efficient group commit there would need to be some kind 
 of signaling between the WAL writer and normal backends. I think there 
 is some in the patch, but I'm not sure if it gives efficient group 
 commit. A constant delay will just give us something similar to 
 commit_delay.

Agreed.

 I've refrained from spending time on group commit until the 
 commit-no-wait patch lands, because it's going to conflict anyway. I'm 
 starting to feel we should not try to rush group commit into 8.3, unless 
 it somehow falls out of the commit-no-wait patch by accident, given that 
 we're past feature freeze and coming up with a proper group commit 
 algorithm would need a lot of research and testing. Better to do it for
 8.4 with more time; we've got enough features on our plate for 8.3 anyway.

My feeling was that I couldn't get both done for 8.3, and that including
the WAL Writer in 8.3 would make the dev path clearer for a later
attempt at group commit.

I think it was worth exploring whether it would be easy, but I think we
can see it'll take a lot of work to make it fly right.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] [PATCHES] Reviewers Guide to Deferred Transactions/TransactionGuarantee

2007-04-13 Thread Simon Riggs
On Thu, 2007-04-12 at 15:56 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  transaction_guarantee.v11.patch 

Thanks for the review.

 I can't help feeling that this is enormously overcomplicated.

I agree with all but one of your comments, see below. 

 The DFC in particular seems to not be worth its overhead.  Why wouldn't
 we simply track the newest commit record at all times, and then whenever
 the wal writer wakes up, it would write/fsync that far (or write/fsync
 all completed WAL pages, if there's no new commit record to worry
 about)?

 The other interesting issue is not letting hint-bit updates get to disk
 in advance of the WAL flush, but I don't see a need to track those at
 a per-transaction level: just advance page LSN to latest commit record
 any time a hint bit is updated.  The commit will likely be flushed
 before we'd be interested in writing the buffer out anyway.  Moreover,
 the way you are doing it creates a conflict in that the DFC has to
 guarantee to remember every unflushed transaction, whereas it really
 needs to be just an approximate cache for its performance to be good.

I've spent a few hours thinking on this and I'm happy with it now. The
lure of removing that much code is too strong to resist; it's certainly
easier to remove code after freeze than it is to add it.

Advancing the LSN too far was a worry of mine, but we have the code now
to cope if that shows up as a problem in testing. So let's strip that
out.

 I see the concern about not letting clog pages go to disk before the
 corresponding WAL data is flushed, but that could be handled much more
 simply: just force a flush through the newest commit record before any
 write of a clog page.  Those writes are infrequent enough (every 32K
 transactions or one checkpoint) that this seems not a serious problem.

This bit I'm not that happy with. You're right that it's fairly
infrequent, but the clog pages are typically written when we extend the
clog. That happens while holding XidGenLock and ProcArrayLock, so holding
those across an additional (and real) I/O is going to make that blockage
worse. We've gone to great pains in other places to remove logjams, and
we know that the follow-on effects of logjams are not swift to clear
when the system is running at full load on multiple-CPU systems.

The code to implement this is pretty clean: a few extra lines in
clog/slru and bubbled-up API changes (see the sketch below).
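A sketch of that change (XLogFlush is the real WAL-flush entry point; the
other names are hypothetical stand-ins):

#include "postgres.h"
#include "access/xlog.h"    /* XLogFlush, XLogRecPtr */

extern XLogRecPtr newest_commit_rec;    /* hypothetical: newest commit LSN */
extern void clog_physical_write(int pageno);    /* hypothetical raw I/O */

static void
clog_write_page(int pageno)
{
    /* Force WAL out through the newest commit record first, so a clog
     * page can never reach disk ahead of the commits it records. */
    XLogFlush(newest_commit_rec);
    clog_physical_write(pageno);
}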

I was actually thinking of adding something to the bgwriter to clean the
LRU block of the clog, if it was dirty, once per cycle, to further
reduce the possibility of I/O at that point.

 AFAIK there is no need to associate any forced flush with multixacts;
 there is no state saved across crashes for those anyway.

Agreed. 

 I don't see a point in allowing the WAL writer to be disabled ---
 I believe it will be a performance win just like the bgwriter,
 independently of whether transaction_guarantee is used or not,
 by helping to keep down the number of dirty WAL buffers.  That in
 turn allows some other simplifications, like not needing an assign hook
 for transaction_guarantee.

That would be pleasant. The other changes make hint bit setting need an
LWLock request, so I wanted to include a way of saying "I never, ever
want to use transaction_guarantee = off". I see the beauty of your
suggestion and agree.

So keep the parameter, but let it default to 100ms?
Range 10-1000ms?

 I disagree with your desire to remove the fsync parameter.  It may have
 less use than before with this feature, but that doesn't mean it has
 none.

OK

  3. Should the WALWriter also do the wal_buffers half-full write at the
  start of XLogInsert() ?
 
 That should go away entirely; to me the main point of the separate
 wal-writer process is to take over responsibility for not letting too
 many dirty wal buffers accumulate.

Yes


I'll make the agreed changes by next Wed/Thurs. 

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] Vista/IPv6

2007-04-13 Thread Hiroshi Saito

Hi.

From: Magnus Hagander [EMAIL PROTECTED]


I see.

But - does it work when built with MSVC6? IIRC, MSVC6 pre-dates Windows
2000 and the Windows IPv6 support.

Can you verify that it works if you manually add this #define and build
with MSVC6?


I don't have an IPv6 test environment at the moment. However, I do have
a VC6 compile environment, and the build completes cleanly. Also, win32.mak
is not only for VC6; VC7.1 and VC8 use it as well, so it should support
IPv6 at least. A VC6 build with the IPv6 #define enabled also still passes
the IPv4 tests. Therefore, you should enable it.


Regards,
Hiroshi Saito





Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-13 Thread Simon Riggs
On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:
 Zeugswetter Andreas ADI SD [EMAIL PROTECTED] writes:
  But you also turn off the optimization that avoids writing regular
  WAL records when the info is already contained in a full-page image
  (increasing the uncompressed size of WAL).
  It was that part I questioned.

I think its right to question it, certainly.

 That's what bothers me about this patch, too.  It will be increasing
 the cost of writing WAL (more data -> more CRC computation and more
 I/O, not to mention more contention for the WAL locks) which translates
 directly to a server slowdown.

I don't really understand this concern. Koichi-san has included a
parameter setting that would prevent any change at all in the way WAL is
written. If you don't want this slight increase in WAL, don't enable it.
If you do enable it, you'll also presumably be compressing the xlog too,
which compresses much better than gzip while using less CPU. So overall
it saves more than it costs, ISTM, and costs nothing at all if you choose
not to use it.

 The main arguments that I could see against Andreas' alternative are:
 
 1. Some WAL record types are arranged in a way that actually would not
 permit the reconstruction of the short form from the long form, because
 they throw away too much data when the full-page image is substituted.
 An example that's fresh in my mind is that the current format of the
 btree page split WAL record discards newitemoff in that case, so you
 couldn't identify the inserted item in the page image.  Now this is only
 saving two bytes in what's usually going to be a darn large record
 anyway, and it complicates the code to do it, so I wouldn't cry if we
 changed btree split to include newitemoff always.  But there might be
 some other cases where more data is involved.  In any case, someone
 would have to look through every single WAL record type to determine
 whether reconstruction is possible and fix it if not.
 
 2. The compresslog utility would have to have specific knowledge about
 every compressible WAL record type, to know how to convert it to the
 short format.  That means an ongoing maintenance commitment there.
 I don't think this is unacceptable, simply because we need only teach
 it about a few common record types, not everything under the sun ---
 anything it doesn't know how to fix, just leave alone, and if it's an
 uncommon record type it really doesn't matter.  (I guess that means
 that we don't really have to do #1 for every last record type, either.)
 
 So I don't think either of these is a showstopper.  Doing it this way
 would certainly make the patch more acceptable, since the argument that
 it might hurt rather than help performance in some cases would go away.

Yeah, it's additional code paths, but it sounds like Koichi-san and
colleagues are going to be trail-blazing any bugs there and will be
around to fix any more that emerge.

  What about disconnecting WAL LSN from physical WAL record position
  during replay ?
  Add simple short WAL records in pg_compresslog like: advance LSN by 8192
  bytes.
 
 I don't care for that, as it pretty much destroys some of the more
 important sanity checks that xlog replay does.  The page boundaries
 need to match the records contained in them.  So I think we do need
 to have pg_decompresslog insert dummy WAL entries to fill up the
 space saved by omitting full pages.

Agreed. I don't want to start touching something that works so well.


We've been thinking about doing this for at least 3 years now, so I
don't see any reason to baulk at it now. I'm happy with Koichi-san's
patch as-is, assuming further extensive testing will be carried out on
it during beta.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-13 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:
 That's what bothers me about this patch, too.  It will be increasing
 the cost of writing WAL (more data -> more CRC computation and more
 I/O, not to mention more contention for the WAL locks) which translates
 directly to a server slowdown.

 I don't really understand this concern.

The real objection is that a patch that's alleged to make WAL smaller
actually does the exact opposite.  Now maybe you can buy that back
downstream of the archiver --- after yet more added-on processing ---
but it still seems that there's a fundamental misdesign here.

 Koichi-san has included a parameter setting that would prevent any
 change at all in the way WAL is written.

It bothers me that we'd need to have such a switch.  That's just another
way to shoot yourself in the foot, either by not enabling it (in which
case applying pg_compresslog as it stands would actively break your
WAL), or by enabling it when you weren't actually going to use
pg_compresslog (because you misunderstood the documentation to imply
that it'd make your WAL smaller by itself).  What I want to see is a
patch that doesn't bloat WAL at all and therefore doesn't need a switch.
I think Andreas is correct to complain that it should be done that way.

regards, tom lane


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-13 Thread Simon Riggs
On Fri, 2007-04-13 at 11:47 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:
  That's what bothers me about this patch, too.  It will be increasing
  the cost of writing WAL (more data -> more CRC computation and more
  I/O, not to mention more contention for the WAL locks) which translates
  directly to a server slowdown.
 
  I don't really understand this concern.
 
 The real objection is that a patch that's alleged to make WAL smaller
 actually does the exact opposite.  Now maybe you can buy that back
 downstream of the archiver --- after yet more added-on processing ---
 but it still seems that there's a fundamental misdesign here.
 
  Koichi-san has included a parameter setting that would prevent any
  change at all in the way WAL is written.
 
 It bothers me that we'd need to have such a switch.  That's just another
 way to shoot yourself in the foot, either by not enabling it (in which
 case applying pg_compresslog as it stands would actively break your
 WAL), or by enabling it when you weren't actually going to use
 pg_compresslog (because you misunderstood the documentation to imply
 that it'd make your WAL smaller by itself).  What I want to see is a
 patch that doesn't bloat WAL at all and therefore doesn't need a switch.
 I think Andreas is correct to complain that it should be done that way.

I agree with everything you say because we already had *exactly* this
discussion when the patch was first submitted, with me saying everything
you just said.

After a few things have been renamed to show their correct function and
impact, I am now comfortable with this patch.

Writing lots of additional code simply to remove a parameter that
*might* be mis-interpreted doesn't sound useful to me, especially when
bugs may leak in that way. My take is that this is simple and useful
*and* we have it now; other ways don't yet exist, nor will they in time
for 8.3.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-13 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Writing lots of additional code simply to remove a parameter that
 *might* be mis-interpreted doesn't sound useful to me, especially when
 bugs may leak in that way. My take is that this is simple and useful
 *and* we have it now; other ways don't yet exist, nor will they in time
 for 8.3.

The potential for misusing the switch is only one small part of the
argument; the larger part is that this has been done in the wrong way
and will cost performance unnecessarily.  The fact that it's ready
now is not something that I think should drive our choices.

I believe that it would be possible to make the needed core-server
changes in time for 8.3, and then to work on compress/decompress
on its own time scale and publish it on pgfoundry; with the hope
that it would be merged to contrib or core in 8.4.  Frankly the
compress/decompress code needs work anyway before it could be
merged (eg, I noted a distinct lack of I/O error checking).

regards, tom lane


Re: [HACKERS] conflicting gettimeofday with MinGW

2007-04-13 Thread Magnus Hagander
Andrew Dunstan wrote:
 In cvs the file is marked as dead and gives a reference to cygwin's
 w32api.
 Maybe the answer is to simply note that the mingw-runtime binary from
 sourceforge above 3.9 has a broken sys/time.h (it is not the file from
 cygwin but the dead one from cvs). 

 I think that sounds reasonable. Something for the mingw FAQ. Care to
 provide Bruce with a patch?

   
 
 No, that means that we can't build with currently released Mingw, AIUI.
 I think the solution is probably to rename our replacement gettimeofday
 and add a #define to translate the calls. That's pretty much what we've
 done in other cases, and has seemed to be a more robust strategy than
 supplying replacements for standard library routines with routines of
 the same name.

But if I read you right, they're removing it in the next version? That
would indicate that they included it by mistake?

(But sure, in general that's not a bad idea)

//Magnus


Re: [PATCHES] [HACKERS] [Fwd: Index Advisor]

2007-04-13 Thread Simon Riggs
On Tue, 2007-04-10 at 12:18 -0700, Gurjeet Singh wrote:

 Also, although the whole plan-tree is available in
 get_relation_info(), it wouldn't be the right place to scan other
 tables, e.g. for generating JOIN-INDEXes or materializing some
 intermediate joins (sometime in the future we may support them!).

I like Tom's suggestion. We never thought actually creating the indexes
was a very good thing and I'd be happy to bury that idea for good.

Speed is definitely a consideration if we are to re-plan thousands of
SQL statements for a real workload.

 If we don't run the planner twice, then the developer will have to
 run it manually twice and compare the costs manually (with and
 without v-indexes); virtually impossible for large applications, and an
 introduction of another human-error possibility.

AFAICS Tom hasn't referred to running twice or not, so I'm not very sure
what you're referring to, sorry. If you could answer Tom's suggestions
one by one, directly underneath them, it would be easier to discuss
things.

ISTM that you've done a great job; the trick is now to reach agreement
and finish this. If there is something still to discuss, it needs to be
very clearly tied back to Tom's comments so everyone can follow it, then
agree on it. If there is a problem in Tom's suggestions that directly
affects the operation of the tool then we need to identify what that is.
But if those hooks would give us all we need, then let's agree on it and
fix up the adviser plug-in later.

We really, really, really need this. Lots. 

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




[HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)

2007-04-13 Thread Tom Lane
Steve [EMAIL PROTECTED] writes:
 [ strange planner misbehavior in 8.2.3 ]

After some off-list investigation (thanks, Steve, for letting me poke
at your machine), the short answer is that the heuristics used by
choose_bitmap_and() suck.  The problem query is like 

select ... from ds where
ds.receipt = '1998-12-30 0:0:0' and
ds.encounter_id in ( ... 100 distinct values ... );

and the table has a truly remarkable variety of indexes on encounter_id,
receipt, and combinations of them with other columns.  The receipt
condition is actually in effect a no-op, because all receipt dates are
later than that, but because ineq_histogram_selectivity doesn't trust
histogram data unreservedly we compute a selectivity of about 0.7
for it.  That means that the indexes that cover both receipt and
encounter_id are given a selectivity score just fractionally better than
those involving encounter_id alone, and therefore they sort first in
choose_bitmap_and's sort step, and the way that that routine is coded,
only combinations of the very first index with other ones will be
considered for a bitmap heap scan.  So the possibility of using just the
index on encounter_id alone is never considered, even though that
alternative is vastly cheaper than the alternatives that are considered.
(It happens that encounter_id is a low-order column in all the indexes
that include receipt, and so these scans end up covering the whole index
... multiple times even.  The cost estimation is fine --- the thing
knows these are expensive --- what's falling down is the heuristic for
which combinations of indexes to consider using in a bitmap scan.)

The original coding of choose_bitmap_and involved a fuzzy comparison
of selectivities, which would have avoided this problem, but we got rid
of that later because it had its own problems.  In fact,
choose_bitmap_and has caused us enough problems that I'm thinking we
need a fundamental rethink of how it works, rather than just marginal
tweaks.  If you haven't looked at this code before, the comments explain
the idea well enough:

/*
 * choose_bitmap_and
 *      Given a nonempty list of bitmap paths, AND them into one path.
 *
 * This is a nontrivial decision since we can legally use any subset of the
 * given path set.  We want to choose a good tradeoff between selectivity
 * and cost of computing the bitmap.
 *
 * The result is either a single one of the inputs, or a BitmapAndPath
 * combining multiple inputs.
 */
...
/*
 * In theory we should consider every nonempty subset of the given paths.
 * In practice that seems like overkill, given the crude nature of the
 * estimates, not to mention the possible effects of higher-level AND and
 * OR clauses.  As a compromise, we sort the paths by selectivity.  We
 * always take the first, and sequentially add on paths that result in a
 * lower estimated cost.
 *
 * We also make some effort to detect directly redundant input paths, as
 * can happen if there are multiple possibly usable indexes.  (Another way
 * it can happen is that best_inner_indexscan will find the same OR join
 * clauses that create_or_index_quals has pulled OR restriction clauses
 * out of, and then both versions show up as duplicate paths.)  We
 * consider an index redundant if any of its index conditions were already
 * used by earlier indexes.  (We could use predicate_implied_by to have a
 * more intelligent, but much more expensive, check --- but in most cases
 * simple pointer equality should suffice, since after all the index
 * conditions are all coming from the same RestrictInfo lists.)
 *
 * You might think the condition for redundancy should be all index
 * conditions already used, not any, but this turns out to be wrong.
 * For example, if we use an index on A, and then come to an index with
 * conditions on A and B, the only way that the second index can be later
 * in the selectivity-order sort is if the condition on B is completely
 * non-selective.  In any case, we'd surely be drastically misestimating
 * the selectivity if we count the same condition twice.
 *
 * We include index predicate conditions in the redundancy test.  Because
 * the test is just for pointer equality and not equal(), the effect is
 * that use of the same partial index in two different AND elements is
 * considered redundant.  (XXX is this too strong?)
 *
 * Note: outputting the selected sub-paths in selectivity order is a good
 * thing even if we weren't using that as part of the selection method,
 * because it makes the short-circuit case in MultiExecBitmapAnd() more
 * likely to apply.
 */


One idea I thought about was to sort by index scan cost, using
selectivity only as a tiebreaker for cost, rather than the other way
around as is currently done.  This seems fairly plausible because
indexscans that are cheaper than other indexscans likely return fewer
rows too, and so selectivity is already accounted for to some extent ---
at least you can't have an enormously worse selectivity at lower cost,
whereas Steve's example proves it doesn't work the other way.  But I'm
worried about breaking the reasoning about redundant indexes that's
mentioned in the comments.

Another alternative that would respond to the immediate problem is to
maintain the current sort order, but as we come to each index, consider
using that one alone, and throw away whatever AND we might have built up
if that one alone beats the AND-so-far.  This seems more conservative,
as it's unlikely to break any cases that work well now, but on the other
hand it feels like plastering another wart atop a structure that's
already rather rickety.

Has anyone got any thoughts about the best way to do this?

regards, tom lane

Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)

2007-04-13 Thread Alvaro Herrera
Tom Lane wrote:

 One idea I thought about was to sort by index scan cost, using
 selectivity only as a tiebreaker for cost, rather than the other way
 around as is currently done.  This seems fairly plausible because
 indexscans that are cheaper than other indexscans likely return fewer
 rows too, and so selectivity is already accounted for to some extent ---
 at least you can't have an enormously worse selectivity at lower cost,
 whereas Steve's example proves it doesn't work the other way.  But I'm
 worried about breaking the reasoning about redundant indexes that's
 mentioned in the comments.
 
 Another alternative that would respond to the immediate problem is to
 maintain the current sort order, but as we come to each index, consider
 using that one alone, and throw away whatever AND we might have built up
 if that one alone beats the AND-so-far.  This seems more conservative,
 as it's unlikely to break any cases that work well now, but on the other
 hand it feels like plastering another wart atop a structure that's
 already rather rickety.
 
 Has anyone got any thoughts about the best way to do this?

How about doing both: sort the indexes by index scan cost; then pick the
first index on the list and start adding indexes when they lower the
cost.  When adding each index, consider it by itself against the
already-stacked indexes.  If its cost alone is lower, put this index at
the top of the list and restart the algorithm (after the sorting step,
of course).
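A standalone sketch of that loop (all helpers below are hypothetical
stand-ins, not actual planner code):

/* hypothetical: paths[0..npaths-1] sorted by index scan cost */
extern int    npaths;
extern double cost_solo(int i);     /* cost of paths[i] used alone */
extern double cost_current(void);   /* cost of the AND built so far */
extern double cost_if_added(int i); /* cost of the AND plus paths[i] */
extern void   add_to_and(int i);
extern void   restart_with(int i);  /* move paths[i] to front, re-sort,
                                     * and reset the AND to just it */

void
build_bitmap_and(void)
{
    for (int i = 1; i < npaths; i++)    /* paths[0] is always taken */
    {
        if (cost_solo(i) < cost_current())
        {
            restart_with(i);    /* this index alone beats the AND-so-far */
            i = 0;              /* restart the scan under the new order */
        }
        else if (cost_if_added(i) < cost_current())
            add_to_and(i);      /* keep it: it lowers the estimated cost */
    }
}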

I think the concern about condition redundancy should be attacked
separately.  How about just comparing whether they have common prefixes
of conditions?  I admit I don't understand what would happen with
indexes defined like (lower(A), B, C) versus (A, B), for example.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
