[HACKERS] HOT WIP Patch - version 4.1
Please see the HOT WIP patch, version 4.1, posted on -patches. There are no significant changes since the version 4.0 patch that I posted a week back. This patch includes some optimizations for efficiently looking up LP_DELETEd tuples. I have used the recent changes made by Tom/Heikki which give us a few bits in the page header. I use one bit to track whether there are any LP_DELETEd tuples in the page. The changes to this bit are not WAL-logged and hence the information might not be accurate, but we should be OK with that.

Another non-trivial change is the addition of logic to clean up row-level fragmentation. I have discussed this earlier on the list, but nevertheless will summarize it again here. When we reuse LP_DELETEd tuples for UPDATE, we might waste a few bytes when the original size of the reused tuple is larger than the new tuple; the information about the original length is lost. When we run out of LP_DELETEd tuples of size equal to or greater than the requested size in the UPDATE path, we correct the row-level fragmentation, if any. Please note, we don't move tuples around and hence don't need the VACUUM-strength lock on the page. We use another bit in the page header to track whether there are any fragmented LP_DELETEd tuples in the page. We also need one bit in the tuple header to track that the particular tuple was fragmented while being reused. This information is then used when the tuple is again released and marked LP_DELETE, to update the page-level hint bit.

Comments/suggestions?

Thanks, Pavan

-- EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Auto creation of Partitions
Hi,

Maybe the dummy entry could be extended to contain the bounds (max/min) for each of the other involved partitions, and they could be updated each time a DML happens across the partitions. That way, an update to a particular partition needs to lock out the others, examine the dummy entries in its own index, and follow it up with dummy-entry updates into the other partitions if need be. Of course, as you have mentioned, all of this needs to be done after a careful think on the locking/deadlocking etc. issues.

Regards, Nikhils

On 3/7/07, Alvaro Herrera <[EMAIL PROTECTED]> wrote:

> I am wondering if we can implement unique indexes across several tables
> (inheritance hierarchy) not by using a single, big index covering all the
> tables, but rather by inserting a dummy entry into each partition's unique
> index. This dummy entry would have an expanded CTID which would include the
> tableoid, so it's possible to check it (albeit there is a problem in that we
> may require the opening of another heap to do the actual checking).
>
> These dummy entries could be removed by bulkcleanup as soon as the inserting
> transaction is no longer running, to avoid bloating the index too much. All
> said dummy index entries would be located at either the rightmost or the
> leftmost leaf, or close to it, so another idea is to have future inserters
> reuse the entry for a different key.
>
> The obvious problem with this is, naturally, the excess I/O that extra index
> traversing causes. The not so obvious ones are locking, deadlocking and the
> opening of other heaps and indexes while you do the insertion, which may be
> too expensive. On the other hand, maybe this idea is easier to implement than
> full-fledged cross-table indexes, so we could have richer partitioning
> earlier than when somebody finally bites the bullet and implements
> cross-table indexes.
>
> Or maybe this is just a dumb idea, but I had to let it out anyway :-)
>
> -- Alvaro Herrera http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.
-- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi,

There are two other reasons to favor triggers, though:

1) People (Josh Drake comes to mind) have found that if you get over a tiny number of partitions, the performance of rules is abysmal.

2) I believe it should be possible to construct an update trigger that allows you to perform updates that will place the row in question into a new partition. While I can see cases for simply disallowing updates to the partitioning key, I think there are also times when being able to do that would be very useful.

The consensus seems to be veering towards triggers. I think it'd be great to make adding and removing partitions as simple as ALTER TABLE. I don't think that DELETE should be the mechanism to drop a partition, though. Again, DML statements shouldn't be performing DDL. Since partitioning is inheritance-based, a simple DROP or "NO INHERIT" will do the job to deal with the partition. Do we want to reinvent additional syntax when these are around and are documented?

Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Proposed ProcessUtility() API additions
Tom Lane wrote:
> The point of adding is_top_level is to provide a simpler, more reliable
> means for PreventTransactionChain and related functions to detect whether
> a function is trying to invoke a non-transaction-block-safe command.
> Currently we rely on an ugly test involving seeing if the statement node
> is in the QueryContext, but that's always been a kluge, and I'm not sure
> that it works 100% even today. I'd like to get rid of the QueryContext
> global altogether.

Great! I once played with the idea of using a simpler MemoryContext implementation in the parser to save some CPU cycles, and PreventTransactionChain stopped working. I don't know if what I was doing was a good idea, but PreventTransactionChain is definitely a hack.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
On Wed, 2007-03-07 at 22:32 -0500, Luke Lonergan wrote:
> Andreas,
>
> On 3/7/07 11:45 AM, "Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]> wrote:
>
> > Whoa, do you have anything to back that up ?
>
> Sure - when we start to consider designs that implement advanced data
> management features, we run into problems with the architecture of
> "tables->tables->tables...". Here are some examples:
> 1 - people think of partitions as a logical building block for tables, they
> would like to move partitions around underneath a table without the table
> definition being involved. In the current implementation, there are
> explicit linkages between the table definition and the child tables -
> imagine an "ALTER TABLE foo_parent ADD COLUMN" and how it would need to
> cascade to 1,000 child tables and you get the beginning of it - this
> connection should not exist.

The inheritance can work at multiple levels, so it's up to you how you manage things. If you want to add a column to only the newest tables, you just add a new mid-level table, add the new column only to that, and then make all new partitions inherit from that table rather than the main table. So I don't see the objection here; I see a benefit.

> 2 - INSERT/UPDATE/DELETE processing through the SQL rewrite layer (rules) is
> terribly slow and gets slower as you add more partitions. If done closer to
> the storage layer, this can be done in ways that use access methods shared
> with other storage entities, e.g. indices, and the code path would flow more
> naturally.

That bit is just syntactic sugar and unnecessary for most applications, in my experience - especially because it doesn't work with COPY. People do seem to want it, so as I said upthread, we need a way of speeding up the selection of the appropriate partition, so we can get this to work for individual INSERTs. This needs substantial improvement, but I don't see this challenging the partition == table assumption.
> 3 - Parallel query can be accomplished more easily by separating scans
> across relations split among tablespaces. This is more natural than trying
> to parallelize APPEND nodes within existing plans

Tables -> Partitions -> Tablespaces

Changing Partitions from Tables to SomethingElse won't increase the current capability to define the disk layout for concurrent I/O. Parallel Query needs to be planner-aware, so if we change partitions from being tables, then we'd need to reintroduce them to the planner. APPEND is clearly not the only thing that's needed for parallel query.

Note to Nikhil: Make sure the new syntax doesn't prevent partitions from being placed upon multiple tablespaces in some manner, at CREATE TABLE time.

> > You would need to elaborate what you actually mean, but I think it is
> > moot. Sure, the constraint technique can be further extended (e.g. during
> > runtime), but imho the approach is very good.
>
> Well, it's being used and that's good, but it needs to be better IMO and I
> think that before we go too far down the current path we should consider the
> alternatives more carefully.

I'm happy to reconsider things, but we need to do that with some clear analysis of what doesn't work yet and how best to implement that. None of what's been mentioned requires us to reconsider the Partition == Table assumption. There were other ways considered, but they didn't win out in the analysis, for documented reasons.

If there are benefits to having partitions act like tables, then maybe we can make them behave differently in just those circumstances. Pack animals behave differently in a group, so why not tables, if need be? But when is that exactly? There is a ton of work to make partitioning the elegant beast we'd like it to be, but that seems like extension only, not rip and replace.
-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
On Wed, 2007-03-07 at 21:27 -0700, Jim Nasby wrote:
> On Mar 7, 2007, at 3:26 AM, Simon Riggs wrote:
> > If you know that the constraints on each of the tables is distinct, then
> > building a UNIQUE index on each of the partitions is sufficient to prove
> > that all rows in the combined partitioned table are distinct also.
> >
> > The hard part there is checking that the partition constraints are
> > distinct. If the partition constraints are added one at a time, you can
> > use the predicate testing logic to compare the to-be-added partition's
> > constraint against each of the already added constraints. That becomes
> > an O(N) problem.
> >
> > What is really needed is a data structure that allows range partitions
> > to be accessed more efficiently. This could make adding partitions and
> > deciding in which partition a specific value goes an O(logN) operation.
>
> Directing data to child tables with triggers pretty much necessitates
> having some way to codify what partition a particular row belongs in.
> IE: for partitioning by month, you'll see things like naming the
> partition tables "parent_table_name_$YEAR_$MONTH", so the
> 'partitioning function' takes a date or timestamp and then returns
> what partition it belongs to. Perhaps there is some way to use that
> mapping to drive the selection of what partitions could contain a
> given value?
>
> One possibility would be to require 3 functions for a partitioned
> table: one accepts the partitioning key and tells you what partition
> it's in, one that tells you what the minimum partitioning key for a
> partition would be, and one that tells you what the maximum would be.
> If the user supplied those 3 functions, I think it would be possible
> to automatically generate code for the triggers and check
> constraints. The min/max partition key functions might allow you to
> more efficiently do partition elimination, too.

ISTM this is a good idea.
SQL Server uses partitioning functions and I like that approach. It makes it much easier to do partition-wise joins between tables that share partitioning functions.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
[HACKERS] WSAStartup() in libpq
It has been brought to my attention by Tokuharu Yuzawa that our calling of WSAStartup() in DllMain() (libpqdll.c) is incorrect. Basically we're calling WSAStartup() so that the client application does not have to. However, due to the fact that WSAStartup() can itself load libraries, there is a risk of deadlocking here. See for example:

MSDN docs: http://msdn2.microsoft.com/en-us/library/aa910684.aspx
MIT Kerberos has had the same problem: http://mailman.mit.edu/pipermail/krbdev/2005-March/003244.html
And even MS had the same bug in their own native database library: http://support.microsoft.com/kb/818539
There's also a note about socket issues in DLLs at: http://support.microsoft.com/kb/q237572/

The easy fix for this is to remove the calls, which obviously will break some client apps. A fairly easy fix for the WSAStartup() call is to have a check in the connection functions against a global variable that will then make sure to call WSAStartup() the first time it's called. That would leave us "leaking" the WSAStartup() call, but only one per application. This is not perfect, but I'm thinking we can maybe live with that.

If not, perhaps we can have it call WSAStartup() every time we connect to a server, and then WSACleanup() when we shut down that connection with PQfinish(). We're still going to leak if the user forgets to run PQfinish(), but we're leaking other resources as well in that case. That will break if any network-related functions are called when there is no connection open, but I doubt that's possible?

Of course, if we had a libpq_init() and a libpq_shutdown() function things would be very easy, but we don't. And adding them just for this purpose seems like trying too hard. Yet another option would be to require that the client app deal with the startup/shutdown code itself, but that would seriously break backwards compatibility, so I don't think that's a good idea at all.

Other ideas?
//Magnus
Re: [HACKERS] Auto creation of Partitions
"Jim Nasby" <[EMAIL PROTECTED]> writes:
> One possibility would be to require 3 functions for a partitioned table: one
> accepts the partitioning key and tells you what partition it's in, one that
> tells you what the minimum partitioning key for a partition would be, and one
> that tells you what the maximum would be. If the user supplied those 3
> functions, I think it would be possible to automatically generate code for
> the triggers and check constraints. The min/max partition key functions
> might allow you to more efficiently do partition elimination, too.

But then it would be harder to tell whether a clause implied a given partition. That is, if you have a partition constraint of "col OP const" then we can test whether a query clause of "col OP2 const2" implies that constraint when planning (or actually whether it implies it's false, to exclude the partition). If you have a constraint like "P1(const)" it'll be pretty hard to do much with that.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
> > The hard part there is checking that the partition constraints are
> > distinct. If the partition constraints are added one at a time, you
> > can use the predicate testing logic to compare the to-be-added
> > partition's constraint against each of the already added constraints.
> > That becomes an O(N) problem.

Yes, we could pre-evaluate that check (per index) in the DDL phase and keep the info in a flag. Also interesting is whether there is an order the partitions can be read in to satisfy a particular ORDER BY.

> > What is really needed is a data structure that allows range partitions
> > to be accessed more efficiently. This could make adding partitions and
> > deciding in which partition a specific value goes an O(logN)
> > operation.

I do not really see a problem with O(N), since typical N currently ranges from 10 to 200. N = 1000 is already good for a >= 10 TB table. If a 10 GB partition were too large, we should imho invest more in the advanced indexing methods that are currently being developed.

> Directing data to child tables with triggers pretty much
> necessitates having some way to codify what partition a
> particular row belongs in.
> IE: for partitioning by month, you'll see things like naming
> the partition tables "parent_table_name_$YEAR_$MONTH", so the
> 'partitioning function' takes a date or timestamp and then
> returns what partition it belongs to. Perhaps there is some
> way to use that mapping to drive the selection of what
> partitions could contain a given value?

You put it in the first partition that has matching constraints.

> One possibility would be to require 3 functions for a partitioned
> table: one accepts the partitioning key and tells you what
> partition it's in, one that tells you what the minimum
> partitioning key for a partition would be, and one that tells
> you what the maximum would be.
> If the user supplied those 3 functions, I think it would be
> possible to automatically generate code for the triggers and
> check constraints. The min/max partition key functions might
> allow you to more efficiently do partition elimination, too.

I can see this as a good optional addition, but it can only be optional, else it would pretty much limit the methods that can be used for partitioning; e.g. hash or modulo partitioning does not have a min/max per partition.

Andreas
Re: [HACKERS] WSAStartup() in libpq
Magnus Hagander wrote:
> The easy fix for this is to remove the calls. Which obviously will break
> some client apps. A fairly easy fix for the WSAStartup() call is to have
> a check in the connection functions against a global variable that will
> then make sure to call WSAStartup() the first time it's called.
>
> That would leave us "leaking" the WSAStartup() call, but only one per
> application. This is not perfect, but I'm thinking we can maybe live
> with that.
>
> If not, perhaps we can have it call WSAStartup() every time we connect to
> a server, and then WSACleanup() when we shut down that connection with
> PQfinish().

Taken from the MSDN docs, this seems to be the recommended solution. After the first WSAStartup call, subsequent calls are cheap because they only increment a counter.

Regards, Andreas
Re: [HACKERS] WSAStartup() in libpq
On Thu, Mar 08, 2007 at 12:47:42PM +0100, Andreas Pflug wrote: > Magnus Hagander wrote: > > > > The easy fix for this is to remove the calls. Which obviously will break > > some client apps. A fairly easy fix for the WSAStartup() call is to have > > a check in the connection functions against a global variable that will > > then make sure to call WSAStartup() the first time it's called. > > > > That would leave us "leaking" the WSAStartup() call, but only one per > > application. This is not perfect, but I'm thinking we can maybe live > > with that. > > > > If not, perhaps we can have it call WSAStartup() everytime we connect to > > a server, and then WSACleanup() when we shut down that connection with > > PQfinish(). > > Taken from MSDN docs, this seems the recommended solution. After the > first WSAStartup call subsequent calls are cheap because they only > increment a counter. Now that I look closer at it, we *already* do WSAStartup() in makeEmptyPGconn... And free it in freePGconn(). So I suggest the following simple patch.. Any objections? //Magnus Index: fe-connect.c === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v retrieving revision 1.344 diff -c -r1.344 fe-connect.c *** fe-connect.c20 Feb 2007 15:20:51 - 1.344 --- fe-connect.c8 Mar 2007 12:16:42 - *** *** 1840,1848 #ifdef WIN32 /* !* Make sure socket support is up and running. Even though this is done in !* libpqdll.c, that is only for MSVC and BCC builds and doesn't work for !* static builds at all, so we have to do it in the main code too. */ WSADATA wsaData; --- 1840,1846 #ifdef WIN32 /* !* Make sure socket support is up and running. */ WSADATA wsaData; Index: libpqdll.c === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/libpqdll.c,v retrieving revision 1.10 diff -c -r1.10 libpqdll.c *** libpqdll.c 11 Mar 2006 04:38:40 - 1.10 --- libpqdll.c 8 Mar 2007 12:05:59 - *** *** 15,31 switch (fdwReason) { case DLL_PROCESS_ATTACH: ! if (WSAStartup(MAKEWORD(1, 1), &wsaData)) ! { ! 
/* !* No really good way to do error handling here, since we !* don't know how we were loaded !*/ ! return FALSE; ! } break; case DLL_PROCESS_DETACH: ! WSACleanup(); break; } --- 15,24 switch (fdwReason) { case DLL_PROCESS_ATTACH: ! /* We used to call WSAStartup() here, but this may cause deadlocks */ break; case DLL_PROCESS_DETACH: ! /* We used to call WSACleanup() here, but this may cause deadlocks */ break; }
Re: [HACKERS] Bug in VACUUM FULL ?
Pavan Deolasee wrote:
> Thanks a lot, Tom. It seems to work fine for me. I will do some
> more tests and report if I see any issue.

The problem mentioned before is hard to reproduce with the suggested change, but it's not completely gone away. I have seen it again on CVS HEAD with the patch applied.

I am facing another issue with VACUUM FULL. This problem gets reproduced with HOT very easily, but takes a few attempts to reproduce with CVS HEAD; it certainly exists, though. This time I am using pgbench. All tables but "history" are created with fillfactor=50. Now, I start running pgbench with a scale factor of 10 and 40 clients and 1 txns/client. After a few minutes, I start running VACUUM FULL on tellers and branches, every 10 seconds. After a while, all pgbench clients fail with the following errors:

Client 1 aborted in state 11: ERROR: duplicate key violates unique constraint "branches_pkey"
Client 30 aborted in state 11: ERROR: duplicate key violates unique constraint "branches_pkey"
Client 39 aborted in state 11: ERROR: duplicate key violates unique constraint "branches_pkey"
Client 7 aborted in state 11: ERROR: duplicate key violates unique constraint "branches_pkey"

The next run of VACUUM FULL gives the following error:

WARNING: index "branches_pkey" contains 15 row versions, but table contains 12 row versions
HINT: Rebuild the index with REINDEX.

Has this been reported earlier? IIRC Tom mentioned in one of the emails that Merlin has reported some problem related to "duplicate key violation". Tom, could this be related?

Thanks, Pavan

-- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] WSAStartup() in libpq
Magnus Hagander <[EMAIL PROTECTED]> writes:
> So I suggest the following simple patch.. Any objections?

One wonders if we need DllMain() at all any more. We certainly don't need that switch statement ...

Also, isn't the WSACleanup() in freePGconn in the wrong place? Seems like it shouldn't be done until after we've closed the socket. I'd be inclined to put it at the bottom of the routine.

regards, tom lane
Re: [HACKERS] WSAStartup() in libpq
On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote: > Magnus Hagander <[EMAIL PROTECTED]> writes: > > So I suggest the following simple patch.. Any objections? > > One wonders if we need DllMain() at all any more. We certainly don't > need that switch statement ... Indeed. Looking even more into it (sheesh, I really didn't do my homework here), libpqdll.c isn't even *compiled* on mingw. Or on the new MSVC build. It's only compiled on the old msvc build. Given that, we can probably just delete the file. > Also, isn't the WSACleanup() in freePGconn in the wrong place? Seems > like it shouldn't be done until after we've closed the socket. I'd > be inclined to put it at the bottom of the routine. Certainly looks wrong. It's interesting how this could have worked *before*. That's a clear indication that it really doesn't appear to matter much what we do here :S The patch would then look something like this, and a remove of libpqdll.c. //Magnus Index: bcc32.mak === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/bcc32.mak,v retrieving revision 1.26 diff -c -r1.26 bcc32.mak *** bcc32.mak 11 Jan 2007 02:42:31 - 1.26 --- bcc32.mak 8 Mar 2007 15:23:17 - *** *** 93,99 [EMAIL PROTECTED] "$(INTDIR)\fe-secure.obj" [EMAIL PROTECTED] "$(INTDIR)\pqexpbuffer.obj" [EMAIL PROTECTED] "$(INTDIR)\pqsignal.obj" - [EMAIL PROTECTED] "$(OUTDIR)\libpqdll.obj" [EMAIL PROTECTED] "$(OUTDIR)\win32.obj" [EMAIL PROTECTED] "$(INTDIR)\wchar.obj" [EMAIL PROTECTED] "$(INTDIR)\encnames.obj" --- 93,98 *** *** 155,168 LINK32=ilink32.exe LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v - LINK32_OBJS= "$(INTDIR)\libpqdll.obj" # @<< is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm ! "$(OUTDIR)\blibpq.dll": "$(OUTDIR)\blibpq.lib" $(LINK32_OBJS) "$(INTDIR)\libpq.res" blibpqdll.def $(LINK32) @<< $(LINK32_FLAGS) + ! 
c0d32.obj $(LINK32_OBJS), + $@,, + "$(OUTDIR)\blibpq.lib" import32.lib cw32mt.lib, + blibpqdll.def,"$(INTDIR)\libpq.res" --- 154,166 LINK32=ilink32.exe LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v # @<< is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm ! "$(OUTDIR)\blibpq.dll": "$(OUTDIR)\blibpq.lib" "$(INTDIR)\libpq.res" blibpqdll.def $(LINK32) @<< $(LINK32_FLAGS) + ! c0d32.obj , + $@,, + "$(OUTDIR)\blibpq.lib" import32.lib cw32mt.lib, + blibpqdll.def,"$(INTDIR)\libpq.res" Index: fe-connect.c === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v retrieving revision 1.344 diff -c -r1.344 fe-connect.c *** fe-connect.c20 Feb 2007 15:20:51 - 1.344 --- fe-connect.c8 Mar 2007 15:16:58 - *** *** 1840,1848 #ifdef WIN32 /* !* Make sure socket support is up and running. Even though this is done in !* libpqdll.c, that is only for MSVC and BCC builds and doesn't work for !* static builds at all, so we have to do it in the main code too. */ WSADATA wsaData; --- 1840,1846 #ifdef WIN32 /* !* Make sure socket support is up and running. 
*/ WSADATA wsaData; *** *** 1918,1927 PGnotify *notify; pgParameterStatus *pstatus; - #ifdef WIN32 - WSACleanup(); - #endif - if (!conn) return; --- 1916,1921 *** *** 1986,1991 --- 1980,1989 termPQExpBuffer(&conn->errorMessage); termPQExpBuffer(&conn->workBuffer); free(conn); + + #ifdef WIN32 + WSACleanup(); + #endif } /* Index: win32.mak === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/win32.mak,v retrieving revision 1.43 diff -c -r1.43 win32.mak *** win32.mak 11 Jan 2007 02:42:31 - 1.43 --- win32.mak 8 Mar 2007 15:22:29 - *** *** 63,69 [EMAIL PROTECTED] "$(INTDIR)\fe-secure.obj" [EMAIL PROTECTED] "$(INTDIR)\pqexpbuffer.obj" [EMAIL PROTECTED] "$(INTDIR)\pqsignal.obj" - [EMAIL PROTECTED] "$(OUTDIR)\libpqdll.obj" [EMAIL PROTECTED] "$(OUTDIR)\win32.obj" [EMAIL PROTECTED] "$(INTDIR)\wchar.obj" [EMAIL PROTECTED] "$(INTDIR)\encnames.obj" --- 63,68 *** *** 143,149 /pdb:"$(OUTDIR)\libpqdll.pdb" /machine:I386 /out:"$(OUTDIR)\$(OUTFILENAME).dll"\ /implib:"$(OUTDIR)\$(OUTFILENAME)dll.lib" /def:$(OUTFILENAME)dll.def LINK32_OBJS= \ - "$(INTDIR)\libpqdll.obj" \ "$(OUTDIR)\$(OUTFILENAME).lib" \ "$(OUTDIR)\libpq.res" --- 142,147 *** *** 159,165 $(RSC) $(RSC_PROJ) l
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM its still
> potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting):

- My test was simply to demonstrate that the observed performance difference with VACUUM was caused by whether the size of the user buffer caused L2 thrashing.
- In general, applications should reduce the size of the working set to reduce the penalty of TLB misses and cache misses.
- If the application access pattern meets the NTA trigger condition, the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you guys have been discussing, but I haven't read all the email exchange on the subject.

Thanks, Sherry

-- Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym
Re: [JDBC] [HACKERS] Plan invalidation vs. unnamed prepared statements
I am from pgsql-jdbc, so I may not be fully "in the thread"; please ignore places where my misunderstanding shows. The main two questions, IMHO, are:

1) What is the key to the plan cache? The current option is some statement key (id). Another option would be the statement text (you still need to store it if you want to replan at some point). In this case you can use the same plan for multiple statements coming from different sessions. That's the point Simon was talking about. This should significantly reduce planning, especially with multiple similar clients. Now, as I understand it, every connection prepares the same statements and plans them independently. Such a change would make application servers prepare new connections much faster (given that they prepare a number of identical statements for each connection, which is the case for my engine). This should work for both named and unnamed statements. Note that adding unnamed statements to the cache (and not removing them on statement disposal) may require a much larger cache. BTW: this is used by IBM DB2 UDB.

2) Specific plans when parameters are known. This is the point about using partial indexes (and sometimes even using a full index, i.e. exploiting a frequent value of some index, or one of two tables in a join). I'd say the best would be to have a generic plan and try to replan, starting from the generic plan's result (discarding any possibility that gives values worse than the generic plan). Such a replan should be much faster than the original planning because you have a rather high starting point. Another option is to catch the possibilities at original planning time and select the correct plan when the parameters are known - you check all possible cases with "this will be a frequent value, this will match this partial index, ..." The question is the number of such plans. But since all of them must be better than the generic one (and it is possible to make a tree, i.e. "A and B are not frequent" -> "A is frequent" -> "A is frequent and B meets partial index", where children must be better than their parent), I'd say there won't be many (and you can always limit their number and keep only the best if the limit is exceeded, or even collect usage statistics and keep the plans that are actually used).
[HACKERS] No buffer space available
I have an SQL file with a set of about 3000 insert statements. This file is executed twice from a remote client machine running Windows XP. The first time it is executed against a Windows 2003 Server running PostgreSQL 8.0. The second time it is executed against a Windows 2003 Server running PostgreSQL 8.2. This setup has been in place for about a year now and never had any issues. However, I noticed today that all the file executions fail with a [nativecode=could not send data to server: No buffer space available (0x2747/10055)] error.

I attempted to run the same file using pgAdmin III version 1.6.2, but every time the file is executed the only output in the messages pane is a colon (":"). No data is inserted. I attempted to run the same file using psql.exe and this worked without any problems. I also noticed that if I use fewer inserts, for example only a few hundred, the file executes fine.

What could be the issue? Thank you.

Nik
Re: [HACKERS] WSAStartup() in libpq
Magnus Hagander <[EMAIL PROTECTED]> writes: > On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote: >> Also, isn't the WSACleanup() in freePGconn in the wrong place? Seems >> like it shouldn't be done until after we've closed the socket. I'd >> be inclined to put it at the bottom of the routine. > Certainly looks wrong. It's interesting how this could have worked > *before*. Because the calls in DllMain covered us (ie, the WSA usage count never got to be less than one). If we remove them, we'd better get this pair right. One thing that bothers me a bit is that if we just move the call to the bottom, then freePGconn won't do it at all if passed a NULL pointer. Now (assuming a non-broken app) the only way that can happen is if makeEmptyPGconn runs out of memory. If the client gets back a null pointer from a connection attempt, it's probably a 50-50 proposition whether it will think it ought to do PQfinish with it. So it'd be good if we could keep the usage count straight either way. I propose the invariant "a WSA usage count is associated with a non-null PGconn structure". That would mean that doing WSACleanup only at the bottom of freePGconn is correct, but also that makeEmptyPGconn needs to do WSACleanup in its (two) failure paths. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] Estimating seq_page_fetch and random_page_fetch
Hi,

How can we accurately estimate the "seq_page_fetch" and "random_page_fetch" costs from outside Postgres, using for example a C routine? Using a very simple C program, I have two sets of files. I want to access one set randomly, i.e. pulling data from random locations within the files. The second set of files is accessed sequentially. The goal here is to approximate the disk I/O cost of a "random page fetch" and a "sequential page fetch" respectively. I am using low-level (unbuffered) C routines, i.e. read/write and lseek (for positioning the file pointer); the read/write buffer size is 8k (to match the size of a Postgres page), and Linux is the host OS. We all know that Linux caches heavily; for that very reason I am using sets of files instead of a single file, in the hope that whenever a new file from a set is accessed for the first time, it will NOT be in the OS cache, thus giving accurate results for actually fetching the file pages from the physical disk. The host is also restarted before running the experiment, to force a cold-cache start. I am hoping somebody can point me in the right direction. Thanks -Umar
[HACKERS]
I need to find out if anybody can give me some advice on incremental backups and restores. Reason: backup size. Abrie
Re: [HACKERS] WSAStartup() in libpq
On Thu, Mar 08, 2007 at 10:37:11AM -0500, Tom Lane wrote:
> Magnus Hagander <[EMAIL PROTECTED]> writes:
> > On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote:
> >> Also, isn't the WSACleanup() in freePGconn in the wrong place? Seems
> >> like it shouldn't be done until after we've closed the socket. I'd
> >> be inclined to put it at the bottom of the routine.
>
> > Certainly looks wrong. It's interesting how this could have worked
> > *before*.
>
> Because the calls in DllMain covered us (ie, the WSA usage count never
> got to be less than one). If we remove them, we'd better get this pair
> right.

But those calls weren't even compiled in when building using mingw, which is what the majority of our users have been using lately, I think. (Since that's what we ship in the binary package)

> One thing that bothers me a bit is that if we just move the call to the
> bottom, then freePGconn won't do it at all if passed a NULL pointer.
> Now (assuming a non-broken app) the only way that can happen is if
> makeEmptyPGconn runs out of memory. If the client gets back a null
> pointer from a connection attempt, it's probably a 50-50 proposition
> whether it will think it ought to do PQfinish with it. So it'd be good
> if we could keep the usage count straight either way. I propose the
> invariant "a WSA usage count is associated with a non-null PGconn
> structure". That would mean that doing WSACleanup only at the bottom
> of freePGconn is correct, but also that makeEmptyPGconn needs to do
> WSACleanup in its (two) failure paths.

I'm honestly unsure whether we need to bother with it, but yeah, that will likely be "more correct".
(Except one of the error paths in makeEmptyPGconn is already covered, since it calls freePGconn, which does the WSACleanup)

//Magnus

Index: bcc32.mak
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/bcc32.mak,v
retrieving revision 1.26
diff -c -r1.26 bcc32.mak
*** bcc32.mak	11 Jan 2007 02:42:31 -	1.26
--- bcc32.mak	8 Mar 2007 15:23:17 -
***************
*** 93,99 ****
  [EMAIL PROTECTED] "$(INTDIR)\fe-secure.obj"
  [EMAIL PROTECTED] "$(INTDIR)\pqexpbuffer.obj"
  [EMAIL PROTECTED] "$(INTDIR)\pqsignal.obj"
- [EMAIL PROTECTED] "$(OUTDIR)\libpqdll.obj"
  [EMAIL PROTECTED] "$(OUTDIR)\win32.obj"
  [EMAIL PROTECTED] "$(INTDIR)\wchar.obj"
  [EMAIL PROTECTED] "$(INTDIR)\encnames.obj"
--- 93,98 ----
***************
*** 155,168 ****
  LINK32=ilink32.exe
  LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v
- LINK32_OBJS= "$(INTDIR)\libpqdll.obj"
  # @<< is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm
! "$(OUTDIR)\blibpq.dll": "$(OUTDIR)\blibpq.lib" $(LINK32_OBJS) "$(INTDIR)\libpq.res" blibpqdll.def
  	$(LINK32) @<<
  	$(LINK32_FLAGS) +
! 	c0d32.obj $(LINK32_OBJS), +
  	$@,, +
  	"$(OUTDIR)\blibpq.lib" import32.lib cw32mt.lib, +
  	blibpqdll.def,"$(INTDIR)\libpq.res"
--- 154,166 ----
  LINK32=ilink32.exe
  LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v
  # @<< is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm
! "$(OUTDIR)\blibpq.dll": "$(OUTDIR)\blibpq.lib" "$(INTDIR)\libpq.res" blibpqdll.def
  	$(LINK32) @<<
  	$(LINK32_FLAGS) +
! 	c0d32.obj , +
  	$@,, +
  	"$(OUTDIR)\blibpq.lib" import32.lib cw32mt.lib, +
  	blibpqdll.def,"$(INTDIR)\libpq.res"

Index: fe-connect.c
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.344
diff -c -r1.344 fe-connect.c
*** fe-connect.c	20 Feb 2007 15:20:51 -	1.344
--- fe-connect.c	8 Mar 2007 15:42:59 -
***************
*** 1840,1848 ****
  #ifdef WIN32
  	/*
! 	 * Make sure socket support is up and running. Even though this is done in
! 	 * libpqdll.c, that is only for MSVC and BCC builds and doesn't work for
! 	 * static builds at all, so we have to do it in the main code too.
  	 */
  	WSADATA wsaData;
--- 1840,1846 ----
  #ifdef WIN32
  	/*
! 	 * Make sure socket support is up and running.
  	 */
  	WSADATA wsaData;
***************
*** 1853,1859 ****
--- 1851,1862 ----
  	conn = (PGconn *) malloc(sizeof(PGconn));
  	if (conn == NULL)
+ 	{
+ #ifdef WIN32
+ 		WSACleanup();
+ #endif
  		return conn;
+ 	}

  	/* Zero all pointers and booleans */
  	MemSet(conn, 0, sizeof(PGconn));
***************
*** 1918,1927 ****
  	PGnotify *notify;
  	pgParameterStatus *pstatus;

- #ifdef WIN32
- 	WSACleanup();
- #endif
-
  	if (!conn)
  		return;
--- 1921,1926 ----
***************
*** 1986,1991 ****
--- 1985,1994 ----
  	termPQExpBuffer(&conn->errorMessage);
  	termPQExpBuffer(&conn->
Re: [HACKERS] WSAStartup() in libpq
Magnus Hagander <[EMAIL PROTECTED]> writes: > On Thu, Mar 08, 2007 at 10:37:11AM -0500, Tom Lane wrote: >> Because the calls in DllMain covered us (ie, the WSA usage count never >> got to be less than one). If we remove them, we'd better get this pair >> right. > But those calls weren't even compiled in when building using mingw, Hmm ... does make you wonder, doesn't it? But anyway, if we're bothering to call the functions at all, we should try to meet the defined protocol. So I like this latest version of the patch. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] No buffer space available
Nik wrote:
> have an SQL file with a set of about 3000 insert statements. This file is
> executed twice from a remote client machine running Windows XP. The first
> time it is executed against a Windows 2003 Server running PostgreSQL 8.0.
> The second time it is executed against a Windows 2003 Server running
> PostgreSQL 8.2. This setup has been in place for about a year now and never
> had any issues. However, I noticed today that all the file executions fail
> with [nativecode=could not send data to server: No buffer space available
> (0x2747/10055)] error. I attempted to run the same file using pgAdmin III
> version 1.6.2, but every time the file is executed the only return in
> messages pane is the colon sign ":". No data is inserted. I attempted to
> run the same file using psql.exe and this worked without any problems. I
> also noticed that if I use less inserts, for example only a few hundred,
> the file executes fine. What could be the issue?

First, this is the wrong place to ask this question. pgsql-hackers is about development of postgres itself, not about usage questions. Perhaps you should ask on pgsql-general.

Second, if you can run the file using psql, then it does not appear to be a problem with postgres at all, but rather with the client you are using (you don't say what that client is). Perhaps your client is trying a naive method of batching inserts and thus running out of buffer space - it's hard to tell since you haven't given much information. In the case of pgAdminIII, you might need to ask on its lists.

cheers

andrew

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS]
Abrie wrote: I need to find out if anybody can give me some advice on incremental backups and restores Reason: backup size Abrie - this is the pgsql-hackers mailing list. It's for questions about the development of the PostgreSQL database system. You'll want the pgsql-general or pgsql-admin lists. -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
"Umar Farooq Minhas" <[EMAIL PROTECTED]> writes: > How can we accrately estimate the "seq_page_fetch" and = > "random_page_fetch" costs from outside the postgres using for example a = > C routine. Use a test case larger than memory. Repeat many times to average out noise. IIRC, when I did the experiments that led to the current random_page_cost of 4.0, it took about a week before I had numbers I trusted. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
"Tom Lane" <[EMAIL PROTECTED]> writes: > "Umar Farooq Minhas" <[EMAIL PROTECTED]> writes: >> How can we accrately estimate the "seq_page_fetch" and = >> "random_page_fetch" costs from outside the postgres using for example a = >> C routine. > > Use a test case larger than memory. Repeat many times to average out > noise. IIRC, when I did the experiments that led to the current > random_page_cost of 4.0, it took about a week before I had numbers I > trusted. When I was running tests I did it on a filesystem where nothing else was running. Between tests I unmounted and remounted it. As I understand it Linux associates the cache with the filesystem and not the block device and discards all pages from cache when the filesystem is unmounted. That doesn't contradict anything Tom said, it might be useful as an additional tool though. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > Is everybody OK with not putting a per-tablespace worker limit?
> > Is everybody OK with putting per-database worker limits on a pg_database
> > column?
>
> I don't think we need a new pg_database column. If it's a GUC you can
> do ALTER DATABASE SET, no? Or was that what you meant?

No, that doesn't work unless we save the datconfig column to the pg_database flatfile, because it's the launcher (which is not connected) who needs to read it. Same thing with a hypothetical per-database naptime. The launcher would also need to parse it, which is not ideal (though not a dealbreaker either).

-- 
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] pg_standby
Hi Simon, I would preserve the existing trigger function as little t "-t", and maybe implement a catchup trigger function as big t "-T"? Set it up so that if the first attempt to find the WAL file postgres is currently requesting succeeds, skip over the trigger check. If the first attempt fails, then do your trigger check. That way, in the OCF script, the postmaster can be started, the trigger file set, and connection to the database looped on until it succeeds as an indication for when the database is up and available. I think that's cleaner than comparing a filename from a 'ps' command. Once I've completed the OCF script and done some testing, I'll forward it to you for you to review and see if you want to include it. Thanks, Doug On Thu, 2007-03-08 at 15:37 +, Simon Riggs wrote: > On Thu, 2007-03-08 at 10:33 -0500, Doug Knight wrote: > > Thanks, Simon. I kind of figured that's how pg_standby would work, > > since its invoked by postgres once per WAL file. What I was thinking I > > might do in the OCF script is to grab the pg_standby process line from > > a ps, pull out the "current" WAL file path and filename, then do an > > existence check for the file. If the file exists, then > > pg_standby/postgres is probably processing it. If not, then we're > > probably waiting on it, implying that recovery is complete. Thoughts > > on this process? > > I suppose I might be able to have the option to catch up before it > stops, on the basis that if it can find the file it was looking for > without waiting then that can override the trigger. > > Which way would you like it to work? >
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Tom Lane wrote:
> Gaetano Mendola <[EMAIL PROTECTED]> writes:
>> [ 8.2 evaluates volatile functions in the targetlist of a view ]
>> If I mark the function as STABLE or IMMUTABLE then even with version
>> 8.2 the function is not evaluated. Is this the intended behavior?
>
> Yes; people complained that we needed to be more careful about the
> number of times volatile functions get evaluated.

I suspect that the functions are also evaluated for records discarded due to joins. Is that the case? Like:

SELECT * FROM ta JOIN tb USING (id) where tb.foo = 4;

If ta is a view with some calculated fields, are the functions on ta evaluated only for the records matching the filters, or in some cases (like a full scan on ta) also for the records discarded due to the join?

Regards
Gaetano Mendola

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [PATCHES] pg_standby
On Thu, 2007-03-08 at 13:29 -0500, Doug Knight wrote: > I would preserve the existing trigger function as little t "-t", and > maybe implement a catchup trigger function as big t "-T"? Set it up so > that if the first attempt to find the WAL file postgres is currently > requesting succeeds, skip over the trigger check. If the first attempt > fails, then do your trigger check. That way, in the OCF script, the > postmaster can be started, the trigger file set, and connection to the > database looped on until it succeeds as an indication for when the > database is up and available. I think that's cleaner than comparing a > filename from a 'ps' command. Once I've completed the OCF script and > done some testing, I'll forward it to you for you to review and see if > you want to include it. I'm happy to do this, unless other objections. I'll be doing another version before feature freeze. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Galy Lee wrote:
> Hi,
>
> Alvaro Herrera wrote:
> > I still haven't received the magic bullet to solve the hot table
> > problem, but these at least means we continue doing *something*.
>
> Can I know about what is your plan or idea for autovacuum improvement
> for 8.3 now? And also what is the roadmap of autovacuum improvement for 8.4?

Things I want to do for 8.3:

- Make use of the launcher/worker stuff, that is, allow multiple autovacuum processes in parallel. With luck we'll find out how to deal with hot tables.

Things I'm not sure we'll be able to have in 8.3, in which case I'll get to them for early 8.4:

- The maintenance window stuff, i.e., being able to throttle workers depending on a user-defined schedule.

8.4 material:

- Per-tablespace throttling, coordinating IO from multiple workers.

I don't have anything else as detailed as a "plan". If you have suggestions, I'm all ears.

Now regarding your restartable vacuum work. I think that stopping a vacuum at some point and being able to restart it later is very cool and may get you some hot chicks, but I'm not sure it's really useful. I think it makes more sense to do something like throttling an ongoing vacuum to a reduced IO rate if the maintenance window closes. So if you're in the middle of a heap scan and the maintenance window closes, you immediately stop the scan and go to the index cleanup phase, *at a reduced IO rate*. This way, the user will be able to get the benefits of vacuuming at some not-too-distant future, without requiring the maintenance window to open again, but without the heavy IO impact that was allowed during the maintenance window. Does this make sense?

-- 
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Auto creation of Partitions
On Thu, Mar 08, 2007 at 01:52:27PM +0530, NikhilS wrote: > >I think it'd be great to make adding and removing partitions as > >simple as ALTER TABLE. I don't think that DELETE should be the > >mechanism to drop a partition, though. Again, DML statements > >shouldn't be performing DDL. > > > Since partition is inheritance-based, a simple DROP or "NO INHERIT" will do > the job to deal with the partition. Do we want to reinvent additional syntax > when these are around and are documented? Well, if the syntax for adding a new partition eventually ends up as ALTER TABLE ADD PARTITION, then it would make more sense that you remove a partition via ALTER TABLE DROP PARTITION. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Auto creation of Partitions
On Thu, Mar 08, 2007 at 10:12:30AM +, Gregory Stark wrote: > "Jim Nasby" <[EMAIL PROTECTED]> writes: > > > One possibility would be to require 3 functions for a partitioned table: one > > accepts the partitioning key and tells you what partition it's in, one that > > tells you what the minimum partitioning key for a partition would be, and > > one > > that tells you what the maximum would be. If the user supplied those 3 > > functions, I think it would be possibly to automatically generate code for > > the > > triggers and check constraints. The min/max partition key functions might > > allow you to more efficiently do partition elimination, too. > > But then it would be harder to tell whether a clause implied a given > partition. That is, if you have a partition constraint of "col OP const" then > we can test whether a query clause of "col OP2 const2" implies that constraint > when planning (or actually whether it implies it's false to exclude the > partition). If you have a constraint like "P1(const)" it'll be pretty hard to > do much with that. Well, you could tell what partition 'const' was in; I would think that plus knowledge about OP2 would allow you to decide what partitions you need to look at. There's also nothing to prevent us from also adding the constraints and using constraint exclusion as well. In fact, I think we'd want to have the constraints just so we know that a given partition only contains the data we want it to. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Auto creation of Partitions
One other thought... a lot of this discussion seems to parallel the debate from a while ago about whether SERIAL should act like a macro (ie: it just sets everything up and users are free to monkey under the hood afterwards), or whether it should be its own 'closed-box' construct. Currently, we seem to be leaning towards partition management being a 'macro', with child tables very exposed, etc. I don't know if that's good or bad, but it's probably worth some thought. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
A few things from our side:

1. 'Skyline Of' is a new operator proposed at ICDE 2003, one of the topmost conferences in data engineering. The skyline operation is a hot area of research in query processing. Many people in the database community know about this operator, and it is fast catching attention.

2. The skyline operation is very useful in data analysis. Suppose we have a cricket database and want to find the bowlers who have taken the maximum wickets in the minimum overs. We can issue an easy-to-write query using the 'Skyline Of' syntax as follows:

Select * from Player_Match Skyline Of overs_bowled min, wickets_taken max;

This query gives 25 interesting tuples (the result set) out of 24750 tuples in 0.0509 seconds. The same result is obtained in 0.8228 seconds if the following equivalent nested query is issued:

select * from Player_Match p1
where not exists (
    select * from Player_Match p2
    where p2.overs_bowled <= p1.overs_bowled
      and p2.wickets_taken >= p1.wickets_taken
      and (p2.overs_bowled < p1.overs_bowled
           or p2.wickets_taken > p1.wickets_taken))

Note that the above time is the time elapsed between issuing the query and obtaining the result set. As can be seen, the nested query is cumbersome to write and inefficient too. So, which query will the user prefer? As the number of dimensions increases, writing the nested query becomes a tedious task. Btw, how can such a query be written using aggregate function syntax?

3. As far as optimizing the skyline is concerned, it is still a research problem, since it requires estimating the cardinality of the skyline result set.

4. Until and unless this operator is implemented in a popular database system, how can a user ever get to know about it and hence appreciate its usefulness?

Btw, it was our B.Tech final year project, and not a term project :-)

Regards.
On 3/8/07, Tom Lane <[EMAIL PROTECTED]> wrote: Shane Ambler <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Well, whether it's horrible or not is in the eye of the beholder, but >> this is certainly a non-standard syntax extension. > Being non-standard should not be the only reason to reject a worthwhile > feature. No, but being non-standard is certainly an indicator that the feature may not be of widespread interest --- if it were, the SQL committee would've gotten around to including it; seems they've managed to include everything but the kitchen sink already. Add to that the complete lack of any previous demand for the feature, and you have to wonder where the market is. > The fact that several > different groups have been mentioned to be working on this feature would > indicate that it is worth considering. It looks to me more like someone published a paper that caught the attention of a few profs looking for term projects for their students. Now maybe it really is the best idea since sliced bread and will be seen in the next SQL spec edition, but color me skeptical. It seems to me to be a very narrow-usage extension, as opposed to (eg) multi-input aggregates or WITH/RECURSIVE, which provide general mechanisms applicable to a multitude of problems. Now even so it would be fine if the implementation were similarly narrow in scope, but the published description of the patch mentions a large chunk of additional executor mechanisms. If we're going to be adding as much code as that, I'd like to see a wider scope of usage for it. Basically, this patch isn't sounding like it has a reasonable bang-to-the-buck ratio ... regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend -- Pride sullies the noblest character
Re: [HACKERS] Stream bitmaps
Hi Gavin,

Any progress?

Gavin Sherry wrote:
> Heikki,
>
> On Mon, 5 Mar 2007, Heikki Linnakangas wrote:
>> Hi all,
>>
>> I'd like to see the indexam API changes needed by the bitmap indexam
>> to be committed soon. Has anyone looked at the proposed API in the
>> latest patch? Any thoughts?
>
> Thanks for looking at it!
>
>> I'm quite happy with it myself, with a few reservations:
>>
>> - All the getbitmap implementations except the new bitmap indexam are
>> just boilerplate. How about making the getbitmap function optional, and
>> having a generic implementation that fills in a hash bitmap using the
>> traditional getnext function?
>>
>> - getbitmap is passed an existing bitmap as argument, and the
>> implementation needs to OR the existing bitmap with new tuples. How
>> about AND? An indexam could be smart about ANDing with an existing
>> bitmap, for example skipping to the first set bit in the existing
>> bitmap and starting the scan from there.
>>
>> - I'd like to have support to return candidate matches with both
>> getbitmap and getnext. A simple flag per page of results would be
>> enough for getbitmap, I think.
>>
>> - StreamBitmap and HashBitmap are separate node types, but OpStream is
>> not. The opaque field in the StreamBitmap struct is not really that
>> opaque; it needs to be a StreamNode. I drew a UML sketch of what I
>> think the class hierarchy is
>> (http://community.enterprisedb.com/streambitmaps.png). This is
>> object-oriented programming; we're just implementing classes and
>> inheritance with structs and function pointers. The current patch
>> mixes different techniques, and that needs to be cleaned up.
>
> All good ideas, I'll look at them more closely in the morning.
>
>> I'd like to see a separate patch that contains just the API changes.
>> Gavin, could you extract an API-only patch from the bitmap index patch?
>> I can work on it as well, but I don't want to step on your toes.
>
> Okay, I can do that.
>
> Thanks,
>
> Gavin

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
Adding to this: Ayush recently wrote a C program that emulates PG IO to do this analysis, and we came out with (predictably) a ratio of sequential/random of 20-50 (for a single user). This is predictable because the random component is fixed at the access time of a single hard drive no matter how many disks are in an array, while the sequential scales nearly linearly with the number of drives in the array. So, you can estimate random using 8-12ms per random access, and sequential as 1/(number of disks X 60-130MB/s). Ayush, can you forward your C program? - Luke Msg is shrt cuz m on ma treo -Original Message- From: Gregory Stark [mailto:[EMAIL PROTECTED] Sent: Thursday, March 08, 2007 12:37 PM Eastern Standard Time To: Tom Lane Cc: Umar Farooq Minhas; pgsql-hackers@postgresql.org Subject:Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch "Tom Lane" <[EMAIL PROTECTED]> writes: > "Umar Farooq Minhas" <[EMAIL PROTECTED]> writes: >> How can we accrately estimate the "seq_page_fetch" and = >> "random_page_fetch" costs from outside the postgres using for example a = >> C routine. > > Use a test case larger than memory. Repeat many times to average out > noise. IIRC, when I did the experiments that led to the current > random_page_cost of 4.0, it took about a week before I had numbers I > trusted. When I was running tests I did it on a filesystem where nothing else was running. Between tests I unmounted and remounted it. As I understand it Linux associates the cache with the filesystem and not the block device and discards all pages from cache when the filesystem is unmounted. That doesn't contradict anything Tom said, it might be useful as an additional tool though. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] pg_standby
Excellent. Once you're ready, fire it over and I'll test it on our config. Doug On Thu, 2007-03-08 at 18:34 +, Simon Riggs wrote: > On Thu, 2007-03-08 at 13:29 -0500, Doug Knight wrote: > > > I would preserve the existing trigger function as little t "-t", and > > maybe implement a catchup trigger function as big t "-T"? Set it up so > > that if the first attempt to find the WAL file postgres is currently > > requesting succeeds, skip over the trigger check. If the first attempt > > fails, then do your trigger check. That way, in the OCF script, the > > postmaster can be started, the trigger file set, and connection to the > > database looped on until it succeeds as an indication for when the > > database is up and available. I think that's cleaner than comparing a > > filename from a 'ps' command. Once I've completed the OCF script and > > done some testing, I'll forward it to you for you to review and see if > > you want to include it. > > I'm happy to do this, unless other objections. > > I'll be doing another version before feature freeze. >
Re: [HACKERS] Auto creation of Partitions
Note to Nikhil: Make sure the new syntax doesn't prevent partitions from being placed upon multiple tablespaces in some manner, at CREATE TABLE time.

What if the syntax was something like -

CREATE TABLE tabname (
    ...
    ...
) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now */
[PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
 PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname] ... ];

And (if we use the ALTER TABLE to add partitions):

ALTER TABLE tabname ADD PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname];

Of course

ALTER TABLE childtable SET TABLESPACE tblspcname;

should not cause any probs.

-- 
Shane Ambler [EMAIL PROTECTED]
Get Sheeky @ http://Sheeky.Biz

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
On Thu, Mar 08, 2007 at 05:35:03PM +, Gregory Stark wrote:
> "Tom Lane" <[EMAIL PROTECTED]> writes:
> > "Umar Farooq Minhas" <[EMAIL PROTECTED]> writes:
> >> How can we accrately estimate the "seq_page_fetch" and =
> >> "random_page_fetch" costs from outside the postgres using for example a =
> >> C routine.
> >
> > Use a test case larger than memory. Repeat many times to average out
> > noise. IIRC, when I did the experiments that led to the current
> > random_page_cost of 4.0, it took about a week before I had numbers I
> > trusted.
>
> When I was running tests I did it on a filesystem where nothing else was
> running. Between tests I unmounted and remounted it. As I understand it
> Linux associates the cache with the filesystem and not the block device
> and discards all pages from cache when the filesystem is unmounted.
>
> That doesn't contradict anything Tom said, it might be useful as an
> additional tool though.

Another trick I've used in the past is to just run the machine out of memory, using the following:

/*
 * $Id: clearmem.c,v 1.1 2003/06/29 20:41:33 decibel Exp $
 *
 * Utility to clear out a chunk of memory and zero it. Useful for
 * flushing disk buffers.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	/* Allocate and zero the requested number of megabytes. */
	if (argc < 2 || !calloc(atoi(argv[1]), 1024 * 1024))
	{
		printf("Error allocating memory.\n");
		return 1;
	}
	return 0;
}

I'll monitor top while that's running to ensure that some stuff gets swapped out to disk. I believe this might still leave some cached data in other areas of the kernel, but it's probably not enough to worry about.

-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Accelerating INSERT/UPDATE using UPS
Simon,

> Not checkpointing at all is not a good plan, since this will lead to an
> enormous build up of WAL files and a very long recovery time if the
> system does fail.

I appreciate your detailed comments. Following your comments, I revised the problem. Sigres-0.1.3 does perform checkpoints. In summary, the features of Sigres-0.1.3 are as follows.

0: 10% faster than conventional PostgreSQL under tmpfs.
1: Checkpoints are continually executed.
2: Sigres mode is on by default (the mode can be turned off via postgresql.conf).
3: issue_xlog_sync is called only by the bgwriter (continually, via CreateCheckPoint).
4: The body of XLogWrite (_XLogWrite in my code) is called by both backends and the bgwriter. For each backend, _XLogWrite is called only via AdvanceXLInsertBuffer. For the bgwriter, _XLogWrite is called via CreateCheckPoint.

Please try it if you are interested. http://sourceforge.jp/projects/sigres/

Again, I really appreciate the beneficial comments from this community!

Regards, -- Hideyuki -- Hideyuki Kawashima (Ph.D.) University of Tsukuba, Assistant Professor
Re: [HACKERS] Auto creation of Partitions
Jim C. Nasby wrote:
> On Thu, Mar 08, 2007 at 01:52:27PM +0530, NikhilS wrote:
> >> I think it'd be great to make adding and removing partitions as simple
> >> as ALTER TABLE. I don't think that DELETE should be the mechanism to
> >> drop a partition, though. Again, DML statements shouldn't be
> >> performing DDL.
> >
> > Since partitioning is inheritance-based, a simple DROP or "NO INHERIT"
> > will do the job to deal with the partition. Do we want to invent
> > additional syntax when these are around and are documented?
>
> Well, if the syntax for adding a new partition eventually ends up as
> ALTER TABLE ADD PARTITION, then it would make more sense that you remove
> a partition via ALTER TABLE DROP PARTITION.

This follows on from the suggestion I made, taken along the lines of the subject "auto creation of partitions", where I suggested the syntax of partition check(month of mydatecol) and having a new partition created as data was entered. With this scenario, dropping the partition when it was empty would complement the creation of a new partition as needed.

Given that there seems to be no real support for going with "auto maintenance", where new partitions are added as needed, the auto dropping of empty partitions would also not apply. That leaves us with only specific add partition / drop partition commands, and having the parent table pick up rows not matching any partition check criteria.

-- Shane Ambler [EMAIL PROTECTED] Get Sheeky @ http://Sheeky.Biz
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
On Thu, 2007-03-08 at 17:35 +0000, Gregory Stark wrote:
> When I was running tests I did it on a filesystem where nothing else was
> running. Between tests I unmounted and remounted it. As I understand it,
> Linux associates the cache with the filesystem and not the block device,
> and discards all pages from cache when the filesystem is unmounted.

On recent Linux kernels, /proc/sys/vm/drop_caches can also be useful: http://linux.inet.hr/proc_sys_vm_drop_caches.html

You could also use posix_fadvise() to achieve a similar effect on a per-file basis.

-Neil
Re: [HACKERS] Stream bitmaps
On Thu, 8 Mar 2007, Heikki Linnakangas wrote:
> Hi Gavin,
>
> Any progress?

Really busy at the moment, but it's on my TODO list for today.

Thanks, Gavin
[HACKERS] who gets paid for this
Hi all, I'm a grad student at UC Davis studying the postgres community and I wanted to know if someone on this list could help me out. I'm studying the factors that affect people "graduating" from being a mailing list participant to a developer with write access to the repository. Is it possible to find out who is being employed to work on postgres and who is doing it on their own time? Some of my data points to there being two ways that people make the jump. More specifically, could those who worked on apache as some aspect of their job prior to getting repo access let me know? Or if there are devs who know this information about others, I'd be really appreciative to get it. Thanks a lot.

-- Christian Bird [EMAIL PROTECTED]
Re: [HACKERS] who gets paid for this
Hi all, I'm a grad student at UC Davis studying the apache server community and I wanted to know if someone on this list could help me out. I'm studying the factors that affect people "graduating" from being a mailing list participant to a developer with write access to the repository. Is it possible to find out who is being employed to work on apache and who is doing it on their own time? Some of my data points to there being two ways that people make the jump. More specifically, could those who worked on postgres as some aspect of their job prior to getting cvs access let me know? Or if there are devs who know this information about others, I'd be really appreciative to get it. Thanks a lot.

-- Christian Bird [EMAIL PROTECTED]
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
Thanks a lot for your replies. The suggestions have proved very useful. Ayush, I'm curious to see your C program, thanks.

Here is a related but different issue. I started looking at the Postgres optimizer/planner code a month back to modify it for the purposes of experiments that I need to conduct. The EXPLAIN command prints the total cost, i.e. both CPU + I/O; however, for my purposes I need these two costs separated, i.e. instead of getting one cost displayed, I want the CPU cost and I/O cost displayed separately when I run EXPLAIN on a particular query. Till now I haven't been able to figure out a 'clean' way of doing this. Can anyone tell me how much time I should expect to spend making such a change, and where I should start? costsize.c?

I have another question. Looking at the optimizer code, it looks pretty much insensitive to the memory factor. The only parameters being utilized are "effective_cache_size" (in estimating index cost only) and "work_mem" (for sorts, aggregation, groups, hash/merge joins). Are these the only memory factors that DIRECTLY affect the cost estimates of the planner/optimizer?

Again your help is appreciated. -Umar

----- Original Message ----- From: Luke Lonergan; To: Gregory Stark, Tom Lane, Ayush Parashar; Cc: Umar Farooq Minhas, pgsql-hackers@postgresql.org; Sent: Thursday, March 08, 2007 2:16 PM; Subject: Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

Adding to this: Ayush recently wrote a C program that emulates PG I/O to do this analysis, and we came out with (predictably) a sequential/random ratio of 20-50 (for a single user). This is predictable because the random component is fixed at the access time of a single hard drive no matter how many disks are in an array, while the sequential component scales nearly linearly with the number of drives in the array.
So, you can estimate random using 8-12 ms per random access, and sequential as 1/(number of disks x 60-130 MB/s). Ayush, can you forward your C program?

- Luke

Msg is shrt cuz m on ma treo

-----Original Message----- From: Gregory Stark [mailto:[EMAIL PROTECTED]] Sent: Thursday, March 08, 2007 12:37 PM Eastern Standard Time; To: Tom Lane; Cc: Umar Farooq Minhas, pgsql-hackers@postgresql.org; Subject: Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

"Tom Lane" <[EMAIL PROTECTED]> writes:
> "Umar Farooq Minhas" <[EMAIL PROTECTED]> writes:
>> How can we accurately estimate the "seq_page_fetch" and
>> "random_page_fetch" costs from outside Postgres using, for example,
>> a C routine?
>
> Use a test case larger than memory. Repeat many times to average out
> noise. IIRC, when I did the experiments that led to the current
> random_page_cost of 4.0, it took about a week before I had numbers I
> trusted.

When I was running tests I did it on a filesystem where nothing else was running. Between tests I unmounted and remounted it. As I understand it, Linux associates the cache with the filesystem and not the block device, and discards all pages from cache when the filesystem is unmounted.

That doesn't contradict anything Tom said; it might be useful as an additional tool though.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Alvaro Herrera wrote:
> I don't have anything else as detailed as a "plan". If you have
> suggestions, I'm all ears.

Cool, thanks for the update. :) We also have some new ideas on the improvement of autovacuum now. I will raise them later.

> Now regarding your restartable vacuum work.
> Does this make sense?

I have also reached a similar conclusion now. Thank you.

Regards, Galy
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Jim Nasby <[EMAIL PROTECTED]> wrote:
>> Also, my recommended bgwriter_lru_maxpages is "average number of
>> recycled buffers per cycle", that is hardly able to tune manually.
>
> What do you mean by "number of recycled buffers per cycle"?

There is the following description in the documentation:

| * bgwriter_lru_percent (floating point)
|   To reduce the probability that server processes will need to issue their
|   own writes, the background writer tries to write buffers that are likely
|   to be recycled soon.
| * bgwriter_lru_maxpages (integer)
|   In each round, no more than this many buffers will be written as a
|   result of scanning soon-to-be-recycled buffers.

The number of recycled buffers per round is the same as the number of StrategyGetBuffer() calls per round. I think that number is the right value for bgwriter_lru_maxpages if we want the bgwriter to write almost all of the dirty pages.

I proposed to change the semantics of bgwriter_lru_maxpages. It used to mean "maximum writes per round", but the new meaning is "reserved buffers for recycling". Non-dirty unused buffers will be counted toward it.

I'm measuring the difference in performance between manual and automatic tuning, especially when adding some jobs before writes. I'll let you know when I get the results.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
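For readers following along, the parameters under discussion live in postgresql.conf. The values below are my recollection of the stock defaults in this era (treat them as illustrative, not authoritative), shown to make the interaction of the two LRU knobs concrete:

```ini
# Background writer LRU scan, one "round" every bgwriter_delay
bgwriter_delay = 200ms          # time between rounds
bgwriter_lru_percent = 1.0      # portion of the pool scanned per round
bgwriter_lru_maxpages = 5       # cap on LRU writes per round; this is the
                                # limit whose semantics the proposal changes
```

The proposal above would reinterpret bgwriter_lru_maxpages as a target number of clean, reusable buffers to keep ahead of the clock sweep, rather than a per-round write cap.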
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Greg Smith <[EMAIL PROTECTED]> wrote:
>> Also, my recommended bgwriter_lru_maxpages is "average number of
>> recycled buffers per cycle", that is hardly able to tune manually.
>
> This is completely dependent on what percentage of your buffer cache is
> pinned.

Don't you mean usage_count? In my understanding, each backend pins two or so buffers at once, so the percentage of pinned buffers should be low.

> If your load is something like the standard pgbench, the LRU
> writer will rarely find anything useful to write, so this entire line of
> thinking won't work. The proper behavior for heavily pinned data is to
> turn off the LRU writer altogether so there's more time to run the all
> scan.

I think you are pointing out a problem with the buffer management algorithm itself, not only the bgwriter. Even if you turn off the LRU writer, each backend pays the same cost to find recyclable buffers every time it allocates a buffer.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Now regarding your restartable vacuum work. I think that stopping a
> vacuum at some point and being able to restart it later is very cool and
> may get you some hot chicks, but I'm not sure it's really useful.

Too true :-(

> I think it makes more sense to do something like throttling an ongoing
> vacuum to a reduced IO rate, if the maintenance window closes. So if
> you're in the middle of a heap scan and the maintenance window closes,
> you immediately stop the scan and go to the index cleanup phase, *at a
> reduced IO rate*.

Er, why not just finish out the scan at the reduced I/O rate? Any sort of "abort" behavior is going to create net inefficiency, e.g. doing an index scan to remove only a few tuples. ISTM that the vacuum ought to just continue along its existing path at a slower I/O rate.

regards, tom lane
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:
> In my understanding, each backend pins two or so buffers at once. So
> percentage of pinned buffers should be low.

With the pgbench workload, a substantial percentage of the buffer cache ends up pinned. From staring at the buffer cache using contrib/pg_buffercache, I believe most of that consists of the index blocks for the records being updated in the accounts table.

I just posted a new version of the patch I asked for feedback on at the beginning of this thread; the latest one is at http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt I've been adjusting it to monitor the same data I think you need to refine your patch. I believe the approach you're taking includes some assumptions that seem perfectly reasonable, but that my testing doesn't agree with. There's nothing like measuring something to settle what's really going on, though, so that's what I've been focusing on. I'd love to get some feedback on whether other people can replicate the kind of things I'm seeing.

The new code generates statistics about exactly what the background writer scan found during each round. If there's been substantial write activity, it prints a log line when it recycles back to the beginning of the all scan, to help characterize what the buffer pool looked like during that scan from the perspective of the bgwriter. Here's some sample log output from my underpowered laptop while running pgbench:

bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
...
checkpoint required (wrote checkpoint_segments)
checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms

Here 69% of the buffer cache contained dirty data, and 49% of the cache was both pinned and dirty. During that same time period, the LRU writer also wrote out a fair amount of data, operating on the 20% of the cache that was dirty but not pinned.
On my production server, where the background writer is turned way up to reduce checkpoint times, these numbers are even more extreme; almost everything that's dirty is also pinned during pgbench, and the LRU writer is lucky to find anything it can write as a result.

That patch is against the 8.2 codebase; now that I'm almost done I'm planning to move it to HEAD soon (where it will conflict considerably with your patch). If you have an 8.2 configuration you can test with my patch applied, set log_min_messages = debug2, try it out, and see what you get when running pgbench for a while. I think you'll discover some interesting and unexpected things.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [HACKERS] who gets paid for this
Christian,

> More specifically, could those who worked on apache as some aspect of
> their job prior to getting repo access let me know? Or if there are devs
> who know this information about others, I'd be really appreciative to
> get it.

Hmmm. Wrong project. And I think you're making the (incorrect) assumption that granting commit rights works the same way in all projects. It does not.

How about you call me and we can chat about how the PostgreSQL project actually works? 415-752-2500.

-- Josh Berkus PostgreSQL @ Sun San Francisco
Re: [HACKERS] Auto creation of Partitions
Hi,

On 3/9/07, Shane Ambler <[EMAIL PROTECTED]> wrote:
>> Note to Nikhil: Make sure the new syntax doesn't prevent partitions
>> from being placed upon multiple tablespaces in some manner, at CREATE
>> TABLE time.
>
> What if the syntax was something like -
>
> CREATE TABLE tabname (
>     ...
>     ...
> ) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr)
> [PARTITIONS num_partitions] /* will apply to HASH only for now */
> [PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
>  PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname]
>  ...
> ];
>
> And (if we use ALTER TABLE to add partitions)
>
> ALTER TABLE tabname
> ADD PARTITION partition_name CHECK(...)
> [USING TABLESPACE tblspcname];

We could as well drop the USING part.

Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi,

> This follows on from the suggestion I made, taken along the lines of the
> subject "auto creation of partitions", where I suggested the syntax of
> partition check(month of mydatecol) and having a new partition created
> as data was entered. With this scenario, dropping the partition when it
> was empty would complement the creation of a new partition as needed.
>
> Given that there seems to be no real support for going with "auto
> maintenance", where new partitions are added as needed, the auto
> dropping of empty partitions would also not apply. That leaves us with
> only specific add partition / drop partition commands, and having the
> parent table pick up rows not matching any partition check criteria.

I was thinking along the lines of what Jim had suggested earlier regarding an overflow partition. Instead of dumping unmatched rows into the master table, we could put them into a default "DUMP/DUMB" partition. Given that Simon wants to do away with having the master table APPENDed in the planning phase, this would be better.

Regards, Nikhils

-- Shane Ambler [EMAIL PROTECTED] Get Sheeky @ http://Sheeky.Biz -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] who gets paid for this
Josh Berkus writes:
> Christian,
>> More specifically, could those who worked on apache as some aspect of
>> their job prior to getting repo access let me know? Or if there are
>> devs who know this information about others, I'd be really appreciative
>> to get it.
> Hmmm. Wrong project. And I think you're making the (incorrect)
> assumption that granting commit rights works the same way in all
> projects. It does not.

Even more to the point, "getting paid for" has almost nothing to do with "has commit privileges". At least on this project.

regards, tom lane
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Tom Lane wrote:
> Er, why not just finish out the scan at the reduced I/O rate? Any sort

Sometimes you may need to vacuum a large table in the maintenance window and a hot table during service time. If vacuuming the hot table does not eat too much foreground resource, then you can vacuum the large table at a lower I/O rate outside the maintenance window; but if vacuuming the hot table is eating too much system resource, then the launcher had better stop the long-running vacuum outside the maintenance window.

But I am not insisting on the stop-start feature at this moment. Changing the cost delay dynamically sounds more reasonable. We can use it to balance the total I/O of workers in service time or maintenance time. It is not so difficult to achieve this by leveraging the shared memory of autovacuum.

Best Regards, Galy Lee
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Greg Smith <[EMAIL PROTECTED]> writes:
> With the pgbench workload, a substantial percentage of the buffer cache
> ends up pinned.

[ raised eyebrow... ] Prove that. AFAIK it's impossible for the pgbench queries to pin more than about three or four buffers per backend concurrently.

regards, tom lane
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Greg Smith <[EMAIL PROTECTED]> wrote:
>> In my understanding, each backend pins two or so buffers at once. So
>> percentage of pinned buffers should be low.
>
> With the pgbench workload, a substantial percentage of the buffer cache
> ends up pinned.
> http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt
> bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
> ...
> checkpoint required (wrote checkpoint_segments)
> checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms
>
> Here 69% of the buffer cache contained dirty data, and 49% of the cache
> was both pinned and dirty.

No. "Pinned" means bufHdr->refcount > 0, and you don't distinguish pinned from recently-used (bufHdr->usage_count > 0) buffers in your patch.

!   if (bufHdr->refcount != 0 || bufHdr->usage_count != 0)
!   {
!       if (skip_pinned)
!       {
!           UnlockBufHdr(bufHdr);
!           return BUF_PINNED;
!       }
!       buffer_write_type = BUF_WRITTEN_AND_PINNED;

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Re: [HACKERS] Accelerating INSERT/UPDATE using UPS
Joshua D. Drake wrote:
> From a deployable application perspective, this could be a big deal. We
> are already starting to see very large traction in the Win32 desktop app
> arena.

There seem to be a few overlapping proposals for reducing various guarantees in the name of performance. As more and more options are added that affect integrity (fsync, full page writes, commit nowait, sigres), it might be nice to outline and compare the approaches, and particularly to describe clearly the failure scenarios and how they are significantly different from one another.

One potentially needs to track an increasing number of settings, each of which reduces certain guarantees on data integrity, which is unpleasant. If a setting is wrong on a performance knob, no problem: when there are complaints that things are slow, you can go through and adjust them. The same is not true of data consistency. When the complaint comes, it is usually too late to fiddle with knobs. I'm just thinking some caution should be exercised in adding too many of them in the first place. I happen to love COMMIT NOWAIT though; for many, this replaces fsync=off.

- August