[HACKERS] HOT WIP Patch - version 4.1
Please see the HOT WIP patch, version 4.1, posted on -patches. There are no significant changes since the version 4.0 patch that I posted a week back. This patch includes some optimizations for efficiently looking up LP_DELETEd tuples. I have used the recent changes made by Tom/Heikki which give us a few bits in the page header. I use one bit to track whether there are any LP_DELETEd tuples in the page. The changes to this bit are not WAL-logged and hence the information might not be accurate, but we should be OK with that.

Another non-trivial change is the addition of logic to clean up row-level fragmentation. I have discussed this earlier on the list, but nevertheless will summarize it again here. When we reuse LP_DELETEd tuples for UPDATE, we might waste a few bytes when the original size of the reused tuple is larger than the new tuple; the information about the original length is lost. When we run out of LP_DELETEd tuples of size equal to or greater than the requested size in the UPDATE path, we correct the row-level fragmentation, if any. Please note, we don't move tuples around and hence don't need the VACUUM-strength lock on the page. We use another bit in the page header to track whether there are any fragmented LP_DELETEd tuples in the page. We also need one bit in the tuple header to track that the particular tuple was fragmented while being reused. This information is then used when the tuple is again released and marked LP_DELETE, to update the page-level hint bit.

Comments/suggestions?

Thanks, Pavan

-- EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Auto creation of Partitions
Hi,

Maybe the dummy entry could be extended to contain the bounds (max/min) for each of the other involved partitions, and they could be updated each time a DML happens across the partitions. That way, an update to a particular partition needs to lock out the others, examine the dummy entries in its own index, and follow it up with dummy-entry updates into the other partitions if need be. Of course, as you have mentioned, all of this needs to be done after a careful think on the locking/deadlocking etc. issues.

Regards, Nikhils

On 3/7/07, Alvaro Herrera <[EMAIL PROTECTED]> wrote:

> I am wondering if we can implement unique indexes across several tables
> (inheritance hierarchy) not by using a single, big index covering all the
> tables, but rather by inserting a dummy entry into each partition's unique
> index. This dummy entry would have an expanded CTID which would include the
> tableoid, so it's possible to check it (albeit there is a problem in that we
> may require the opening of another heap to do the actual checking).
>
> These dummy entries could be removed by bulkcleanup as soon as the inserting
> transaction is no longer running, to avoid bloating the index too much. All
> said dummy index entries would be located at either the rightmost or the
> leftmost leaf, or close to it, so another idea is to have future inserters
> reuse the entry for a different key.
>
> The obvious problem with this is, naturally, the excess I/O that extra index
> traversing causes. The not so obvious ones are locking, deadlocking and the
> opening of other heaps and indexes while you do the insertion, which may be
> too expensive. On the other hand, maybe this idea is easier to implement than
> full-fledged cross-table indexes, so we could have richer partitioning
> earlier than when somebody finally bites the bullet and implements
> cross-table indexes.
>
> Or maybe this is just a dumb idea, but I had to let it out anyway :-)
>
> -- Alvaro Herrera http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.
-- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi,

There are two other reasons to favor triggers, though:

1) People (Josh Drake comes to mind) have found that if you get over a tiny number of partitions, the performance of rules is abysmal.

2) I believe it should be possible to construct an update trigger that allows you to perform updates that will place the row in question into a new partition. While I can see cases for simply disallowing updates to the partitioning key, I think there are also times when being able to do that would be very useful.

The consensus seems to be veering towards triggers. I think it'd be great to make adding and removing partitions as simple as ALTER TABLE. I don't think that DELETE should be the mechanism to drop a partition, though. Again, DML statements shouldn't be performing DDL. Since partitioning is inheritance-based, a simple DROP or "NO INHERIT" will do the job to deal with the partition. Do we want to reinvent additional syntax when these are around and are documented?

Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Proposed ProcessUtility() API additions
Tom Lane wrote:
> The point of adding is_top_level is to provide a simpler, more reliable
> means for PreventTransactionChain and related functions to detect whether
> a function is trying to invoke a non-transaction-block-safe command.
> Currently we rely on an ugly test involving seeing if the statement node
> is in the QueryContext, but that's always been a kluge, and I'm not sure
> that it works 100% even today. I'd like to get rid of the QueryContext
> global altogether.

Great! I once played with the idea of using a simpler MemoryContext implementation in the parser to save some CPU cycles, and PreventTransactionChain stopped working. I don't know if what I was doing was a good idea, but PreventTransactionChain is definitely a hack.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
On Wed, 2007-03-07 at 22:32 -0500, Luke Lonergan wrote:
> Andreas,
>
> On 3/7/07 11:45 AM, "Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]> wrote:
>
> > Whoa, do you have anything to back that up ?
>
> Sure - when we start to consider designs that implement advanced data
> management features, we run into problems with the architecture of
> "tables->tables->tables...". Here are some examples:
> 1 - people think of partitions as a logical building block for tables, they
> would like to move partitions around underneath a table without the table
> definition being involved. In the current implementation, there are
> explicit linkages between the table definition and the child tables -
> imagine an "ALTER TABLE foo_parent ADD COLUMN" and how it would need to
> cascade to 1,000 child tables and you get the beginning of it - this
> connection should not exist.

The inheritance can work at multiple levels, so it's up to you how you manage things. If you want to add a column to only the newest tables, you just add a new mid-level table, add the new column only to that, and then make all new partitions inherit from that table rather than the main table. So I don't see the objection here; I see a benefit.

> 2 - INSERT/UPDATE/DELETE processing through the SQL rewrite layer (rules) is
> terribly slow and gets slower as you add more partitions. If done closer to
> the storage layer, this can be done in ways that use access methods shared
> with other storage entities, e.g. indices, and the code path would flow more
> naturally.

That bit is just syntactic sugar and unnecessary for most applications, in my experience - especially because it doesn't work with COPY. People do seem to want it, so as I said upthread, we need a way of speeding up the selection of the appropriate partition, so we can get this to work for individual INSERTs. This needs substantial improvement, but I don't see this challenging the partition == table assumption.
> 3 - Parallel query can be accomplished more easily by separating scans
> across relations split among tablespaces. This is more natural than trying
> to parallelize APPEND nodes within existing plans

Tables -> Partitions -> Tablespaces

Changing Partitions from Tables to SomethingElse won't increase the current capability to define the disk layout for concurrent I/O. Parallel Query needs to be planner-aware, so if we change partitions from being tables, then we'd need to reintroduce them to the planner. APPEND is clearly not the only thing that's needed for parallel query.

Note to Nikhil: Make sure the new syntax doesn't prevent partitions from being placed upon multiple tablespaces in some manner, at CREATE TABLE time.

> > You would need to elaborate what you actually mean, but I think it is
> > moot. Sure, the constraint technique can be further extended (e.g. during
> > runtime), but imho the approach is very good.
>
> Well, it's being used and that's good, but it needs to be better IMO and I
> think that before we go too far down the current path we should consider the
> alternatives more carefully.

I'm happy to reconsider things, but we need to do that with some clear analysis of what doesn't work yet and how best to implement that. None of what's been mentioned requires us to reconsider the Partition == Table assumption. There were other ways considered, but they didn't win out in the analysis, for documented reasons.

If there are benefits to having partitions act like tables, then maybe we can make them behave differently in just those circumstances. Pack animals behave differently in a group, so why not tables, if need be? But when is that exactly? There is a ton of work to make partitioning the elegant beast we'd like it to be, but that seems like extension only, not rip and replace.
-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
On Wed, 2007-03-07 at 21:27 -0700, Jim Nasby wrote:
> On Mar 7, 2007, at 3:26 AM, Simon Riggs wrote:
> > If you know that the constraints on each of the tables is distinct, then
> > building a UNIQUE index on each of the partitions is sufficient to prove
> > that all rows in the combined partitioned table are distinct also.
> >
> > The hard part there is checking that the partition constraints are
> > distinct. If the partition constraints are added one at a time, you can
> > use the predicate testing logic to compare the to-be-added partition's
> > constraint against each of the already added constraints. That becomes
> > an O(N) problem.
> >
> > What is really needed is a data structure that allows range partitions
> > to be accessed more efficiently. This could make adding partitions and
> > deciding in which partition a specific value goes an O(logN) operation.
>
> Directing data to child tables with triggers pretty much necessitates
> having some way to codify what partition a particular row belongs in.
> IE: for partitioning by month, you'll see things like naming the
> partition tables "parent_table_name_$YEAR_$MONTH", so the
> 'partitioning function' takes a date or timestamp and then returns
> what partition it belongs to. Perhaps there is some way to use that
> mapping to drive the selection of what partitions could contain a
> given value?
>
> One possibility would be to require 3 functions for a partitioned
> table: one accepts the partitioning key and tells you what partition
> it's in, one that tells you what the minimum partitioning key for a
> partition would be, and one that tells you what the maximum would be.
> If the user supplied those 3 functions, I think it would be possible
> to automatically generate code for the triggers and check
> constraints. The min/max partition key functions might allow you to
> more efficiently do partition elimination, too.

ISTM this is a good idea.
SQL Server uses partitioning functions and I like that approach. It makes it much easier to do partition-wise joins between tables that share partitioning functions.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
[HACKERS] WSAStartup() in libpq
It has been brought to my attention by Tokuharu Yuzawa that our calling of WSAStartup() in DllMain() (libpqdll.c) is incorrect. Basically we're calling WSAStartup() so that the client application does not have to. However, due to the fact that WSAStartup() can itself load libraries, there is a risk of deadlocking here. See for example:

MSDN docs: http://msdn2.microsoft.com/en-us/library/aa910684.aspx
MIT Kerberos has had the same problem: http://mailman.mit.edu/pipermail/krbdev/2005-March/003244.html
And even MS had the same bug in their own native database library: http://support.microsoft.com/kb/818539
There's also a note about socket issues in DLLs at: http://support.microsoft.com/kb/q237572/

The easy fix for this is to remove the calls, which obviously will break some client apps. A fairly easy fix for the WSAStartup() call is to have a check in the connection functions against a global variable that will then make sure to call WSAStartup() the first time it's called. That would leave us "leaking" the WSAStartup() call, but only one per application. This is not perfect, but I'm thinking we can maybe live with that.

If not, perhaps we can have it call WSAStartup() every time we connect to a server, and then WSACleanup() when we shut down that connection with PQfinish(). We're still going to leak if the user forgets to run PQfinish(), but we're leaking other resources as well in that case. That will break if any network-related functions are called when there is no connection open, but I doubt that's possible?

Of course, if we had a libpq_init() and a libpq_shutdown() function things would be very easy, but we don't. And adding them just for this purpose seems like trying too hard. Yet another option would be to require that the client app deal with the startup/shutdown code itself, but that would seriously break backwards compatibility, so I don't think that's a good idea at all.

Other ideas?
//Magnus
Re: [HACKERS] Auto creation of Partitions
"Jim Nasby" <[EMAIL PROTECTED]> writes:
> One possibility would be to require 3 functions for a partitioned table: one
> accepts the partitioning key and tells you what partition it's in, one that
> tells you what the minimum partitioning key for a partition would be, and one
> that tells you what the maximum would be. If the user supplied those 3
> functions, I think it would be possible to automatically generate code for
> the triggers and check constraints. The min/max partition key functions
> might allow you to more efficiently do partition elimination, too.

But then it would be harder to tell whether a clause implied a given partition. That is, if you have a partition constraint of "col OP const" then we can test whether a query clause of "col OP2 const2" implies that constraint when planning (or actually whether it implies it's false, to exclude the partition). If you have a constraint like "P1(const)" it'll be pretty hard to do much with that.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
> > The hard part there is checking that the partition constraints are
> > distinct. If the partition constraints are added one at a time, you
> > can use the predicate testing logic to compare the to-be-added
> > partition's constraint against each of the already added constraints.
> > That becomes an O(N) problem.

Yes, we could pre-evaluate that check (per index) in the DDL phase and keep the info in a flag. Also interesting is whether there is an order the partitions can be read in to satisfy a particular ORDER BY.

> > What is really needed is a data structure that allows range partitions
> > to be accessed more efficiently. This could make adding partitions and
> > deciding in which partition a specific value goes an O(logN)
> > operation.

I do not really see a problem with O(N), since typical N currently ranges from 10 to 200. N = 1000 is already good for a >= 10 TB table. If a 10 GB partition were too large, we should imho invest more in the advanced indexing methods that are currently being developed.

> Directing data to child tables with triggers pretty much
> necessitates having some way to codify what partition a
> particular row belongs in.
> IE: for partitioning by month, you'll see things like naming
> the partition tables "parent_table_name_$YEAR_$MONTH", so the
> 'partitioning function' takes a date or timestamp and then
> returns what partition it belongs to. Perhaps there is some
> way to use that mapping to drive the selection of what
> partitions could contain a given value?

You put it in the first partition that has matching constraints.

> One possibility would be to require 3 functions for a partitioned
> table: one accepts the partitioning key and tells you what
> partition it's in, one that tells you what the minimum
> partitioning key for a partition would be, and one that tells
> you what the maximum would be.
> If the user supplied those 3 functions, I think it would be
> possible to automatically generate code for the triggers and
> check constraints. The min/max partition key functions might
> allow you to more efficiently do partition elimination, too.

I can see this as a good optional addition, but it can only be optional, else it would pretty much limit the methods that can be used for partitioning; e.g. hash or modulo partitioning does not have a min/max per partition.

Andreas
Re: [HACKERS] WSAStartup() in libpq
Magnus Hagander wrote:
> The easy fix for this is to remove the calls. Which obviously will break
> some client apps. A fairly easy fix for the WSAStartup() call is to have
> a check in the connection functions against a global variable that will
> then make sure to call WSAStartup() the first time it's called.
>
> That would leave us "leaking" the WSAStartup() call, but only one per
> application. This is not perfect, but I'm thinking we can maybe live
> with that.
>
> If not, perhaps we can have it call WSAStartup() every time we connect to
> a server, and then WSACleanup() when we shut down that connection with
> PQfinish().

Taken from the MSDN docs, this seems to be the recommended solution. After the first WSAStartup call, subsequent calls are cheap because they only increment a counter.

Regards, Andreas
Re: [HACKERS] WSAStartup() in libpq
On Thu, Mar 08, 2007 at 12:47:42PM +0100, Andreas Pflug wrote: > Magnus Hagander wrote: > > > > The easy fix for this is to remove the calls. Which obviously will break > > some client apps. A fairly easy fix for the WSAStartup() call is to have > > a check in the connection functions against a global variable that will > > then make sure to call WSAStartup() the first time it's called. > > > > That would leave us "leaking" the WSAStartup() call, but only one per > > application. This is not perfect, but I'm thinking we can maybe live > > with that. > > > > If not, perhaps we can have it call WSAStartup() everytime we connect to > > a server, and then WSACleanup() when we shut down that connection with > > PQfinish(). > > Taken from MSDN docs, this seems the recommended solution. After the > first WSAStartup call subsequent calls are cheap because they only > increment a counter. Now that I look closer at it, we *already* do WSAStartup() in makeEmptyPGconn... And free it in freePGconn(). So I suggest the following simple patch.. Any objections? //Magnus Index: fe-connect.c === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v retrieving revision 1.344 diff -c -r1.344 fe-connect.c *** fe-connect.c20 Feb 2007 15:20:51 - 1.344 --- fe-connect.c8 Mar 2007 12:16:42 - *** *** 1840,1848 #ifdef WIN32 /* !* Make sure socket support is up and running. Even though this is done in !* libpqdll.c, that is only for MSVC and BCC builds and doesn't work for !* static builds at all, so we have to do it in the main code too. */ WSADATA wsaData; --- 1840,1846 #ifdef WIN32 /* !* Make sure socket support is up and running. */ WSADATA wsaData; Index: libpqdll.c === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/libpqdll.c,v retrieving revision 1.10 diff -c -r1.10 libpqdll.c *** libpqdll.c 11 Mar 2006 04:38:40 - 1.10 --- libpqdll.c 8 Mar 2007 12:05:59 - *** *** 15,31 switch (fdwReason) { case DLL_PROCESS_ATTACH: ! if (WSAStartup(MAKEWORD(1, 1), &wsaData)) ! { ! 
/* !* No really good way to do error handling here, since we !* don't know how we were loaded !*/ ! return FALSE; ! } break; case DLL_PROCESS_DETACH: ! WSACleanup(); break; } --- 15,24 switch (fdwReason) { case DLL_PROCESS_ATTACH: ! /* We used to call WSAStartup() here, but this may cause deadlocks */ break; case DLL_PROCESS_DETACH: ! /* We used to call WSACleanup() here, but this may cause deadlocks */ break; }
Re: [HACKERS] Bug in VACUUM FULL ?
Pavan Deolasee wrote:
> Thanks a lot, Tom. It seems to work fine for me. I will do some
> more tests and report if I see any issue.

The problem mentioned before is hard to reproduce with the suggested change, but it's not completely gone away. I have seen it again on CVS HEAD with the patch applied.

I am facing another issue with VACUUM FULL. This problem gets reproduced with HOT very easily, but takes a few attempts to reproduce with CVS HEAD; it certainly exists, though. This time I am using pgbench. All tables but "history" are created with fillfactor=50. Now, I start running pgbench with a scale factor of 10 and 40 clients and 1 txns/client. After a few minutes, I start running VACUUM FULL on tellers and branches, every 10 seconds. After a while, all pgbench clients fail with the following errors:

Client 1 aborted in state 11: ERROR: duplicate key violates unique constraint "branches_pkey"
Client 30 aborted in state 11: ERROR: duplicate key violates unique constraint "branches_pkey"
Client 39 aborted in state 11: ERROR: duplicate key violates unique constraint "branches_pkey"
Client 7 aborted in state 11: ERROR: duplicate key violates unique constraint "branches_pkey"

The next run of VACUUM FULL gives the following error:

WARNING: index "branches_pkey" contains 15 row versions, but table contains 12 row versions
HINT: Rebuild the index with REINDEX.

Has this been reported earlier? IIRC Tom mentioned in one of the emails that Merlin has reported some problem related to "duplicate key violation". Tom, could this be related?

Thanks, Pavan

-- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] WSAStartup() in libpq
Magnus Hagander <[EMAIL PROTECTED]> writes:
> So I suggest the following simple patch.. Any objections?

One wonders if we need DllMain() at all any more. We certainly don't need that switch statement ...

Also, isn't the WSACleanup() in freePGconn in the wrong place? Seems like it shouldn't be done until after we've closed the socket. I'd be inclined to put it at the bottom of the routine.

regards, tom lane
Re: [HACKERS] WSAStartup() in libpq
On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote: > Magnus Hagander <[EMAIL PROTECTED]> writes: > > So I suggest the following simple patch.. Any objections? > > One wonders if we need DllMain() at all any more. We certainly don't > need that switch statement ... Indeed. Looking even more into it (sheesh, I really didn't do my homework here), libpqdll.c isn't even *compiled* on mingw. Or on the new MSVC build. It's only compiled on the old msvc build. Given that, we can probably just delete the file. > Also, isn't the WSACleanup() in freePGconn in the wrong place? Seems > like it shouldn't be done until after we've closed the socket. I'd > be inclined to put it at the bottom of the routine. Certainly looks wrong. It's interesting how this could have worked *before*. That's a clear indication that it really doesn't appear to matter much what we do here :S The patch would then look something like this, and a remove of libpqdll.c. //Magnus Index: bcc32.mak === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/bcc32.mak,v retrieving revision 1.26 diff -c -r1.26 bcc32.mak *** bcc32.mak 11 Jan 2007 02:42:31 - 1.26 --- bcc32.mak 8 Mar 2007 15:23:17 - *** *** 93,99 [EMAIL PROTECTED] "$(INTDIR)\fe-secure.obj" [EMAIL PROTECTED] "$(INTDIR)\pqexpbuffer.obj" [EMAIL PROTECTED] "$(INTDIR)\pqsignal.obj" - [EMAIL PROTECTED] "$(OUTDIR)\libpqdll.obj" [EMAIL PROTECTED] "$(OUTDIR)\win32.obj" [EMAIL PROTECTED] "$(INTDIR)\wchar.obj" [EMAIL PROTECTED] "$(INTDIR)\encnames.obj" --- 93,98 *** *** 155,168 LINK32=ilink32.exe LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v - LINK32_OBJS= "$(INTDIR)\libpqdll.obj" # @<< is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm ! "$(OUTDIR)\blibpq.dll": "$(OUTDIR)\blibpq.lib" $(LINK32_OBJS) "$(INTDIR)\libpq.res" blibpqdll.def $(LINK32) @<< $(LINK32_FLAGS) + ! 
c0d32.obj $(LINK32_OBJS), + $@,, + "$(OUTDIR)\blibpq.lib" import32.lib cw32mt.lib, + blibpqdll.def,"$(INTDIR)\libpq.res" --- 154,166 LINK32=ilink32.exe LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v # @<< is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm ! "$(OUTDIR)\blibpq.dll": "$(OUTDIR)\blibpq.lib" "$(INTDIR)\libpq.res" blibpqdll.def $(LINK32) @<< $(LINK32_FLAGS) + ! c0d32.obj , + $@,, + "$(OUTDIR)\blibpq.lib" import32.lib cw32mt.lib, + blibpqdll.def,"$(INTDIR)\libpq.res" Index: fe-connect.c === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v retrieving revision 1.344 diff -c -r1.344 fe-connect.c *** fe-connect.c20 Feb 2007 15:20:51 - 1.344 --- fe-connect.c8 Mar 2007 15:16:58 - *** *** 1840,1848 #ifdef WIN32 /* !* Make sure socket support is up and running. Even though this is done in !* libpqdll.c, that is only for MSVC and BCC builds and doesn't work for !* static builds at all, so we have to do it in the main code too. */ WSADATA wsaData; --- 1840,1846 #ifdef WIN32 /* !* Make sure socket support is up and running. 
*/ WSADATA wsaData; *** *** 1918,1927 PGnotify *notify; pgParameterStatus *pstatus; - #ifdef WIN32 - WSACleanup(); - #endif - if (!conn) return; --- 1916,1921 *** *** 1986,1991 --- 1980,1989 termPQExpBuffer(&conn->errorMessage); termPQExpBuffer(&conn->workBuffer); free(conn); + + #ifdef WIN32 + WSACleanup(); + #endif } /* Index: win32.mak === RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/win32.mak,v retrieving revision 1.43 diff -c -r1.43 win32.mak *** win32.mak 11 Jan 2007 02:42:31 - 1.43 --- win32.mak 8 Mar 2007 15:22:29 - *** *** 63,69 [EMAIL PROTECTED] "$(INTDIR)\fe-secure.obj" [EMAIL PROTECTED] "$(INTDIR)\pqexpbuffer.obj" [EMAIL PROTECTED] "$(INTDIR)\pqsignal.obj" - [EMAIL PROTECTED] "$(OUTDIR)\libpqdll.obj" [EMAIL PROTECTED] "$(OUTDIR)\win32.obj" [EMAIL PROTECTED] "$(INTDIR)\wchar.obj" [EMAIL PROTECTED] "$(INTDIR)\encnames.obj" --- 63,68 *** *** 143,149 /pdb:"$(OUTDIR)\libpqdll.pdb" /machine:I386 /out:"$(OUTDIR)\$(OUTFILENAME).dll"\ /implib:"$(OUTDIR)\$(OUTFILENAME)dll.lib" /def:$(OUTFILENAME)dll.def LINK32_OBJS= \ - "$(INTDIR)\libpqdll.obj" \ "$(OUTDIR)\$(OUTFILENAME).lib" \ "$(OUTDIR)\libpq.res" --- 142,147 *** *** 159,165 $(RSC) $(RSC_PROJ) l
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM its still
> potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting):

- My test was simply to demonstrate that the observed performance difference with VACUUM was caused by whether the size of the user buffer caused L2 thrashing.
- In general, applications should reduce the size of the working set to reduce the penalty of TLB misses and cache misses.
- If the application access pattern meets the NTA trigger condition, the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you guys have been discussing, but I haven't read all the email exchange on the subject.

Thanks, Sherry

-- Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym
Re: [JDBC] [HACKERS] Plan invalidation vs. unnamed prepared statements
I am from pgsql-jdbc, so I may not be fully "in the thread"; please ignore places where my misunderstanding shows. The main two questions, IMHO, are:

1) What is the key to the plan cache? The current option is some statement key (id). Another option would be the statement text (you still need to store it if you want to replan at some point). In this case you can use the same plan for multiple statements coming from different sessions. That's the point Simon was talking about. This should significantly reduce planning, especially with multiple similar clients. Now, as I understand it, every connection prepares the same statements and plans them independently. Such a change would make application servers prepare new connections much faster (given that they prepare a number of identical statements for each connection, which is the case for my engine). This should work for both named and unnamed statements. Note that adding unnamed statements to the cache (and not removing them on statement disposal) may require a much larger cache. BTW: this is used by IBM DB2 UDB.

2) Specific plans when parameters are known. This is the point about using partial indexes (and sometimes even using a full index, i.e. exploiting a frequent value of some index, or one of two tables in a join). I'd say the best would be to have a generic plan and try to replan, starting from the generic plan's result (discarding any possibility that gives values worse than the generic plan). Such a replan should be much faster than the original planning because you have a rather high starting point. Another option is to catch the possibilities at original planning time and select the correct plan when the parameters are known - you check all possible cases with "this will be a frequent value, this will match this partial index, ..." The question is the number of such plans. But since all of them must be better than the generic one (and it is possible to make a tree, i.e. "A and B are not frequent" -> "A is frequent" -> "A is frequent and B meets partial index", where children must be better than their parent), I'd say there won't be many (and you can always limit their number and keep only the best if the limit is exceeded, or even collect usage statistics and keep the plans that are actually used).
[HACKERS] No buffer space available
I have an SQL file with a set of about 3000 insert statements. This file is executed twice from a remote client machine running Windows XP. The first time it is executed against a Windows 2003 Server running PostgreSQL 8.0. The second time it is executed against a Windows 2003 Server running PostgreSQL 8.2. This setup has been in place for about a year now and never had any issues. However, I noticed today that all the file executions fail with a [nativecode=could not send data to server: No buffer space available (0x2747/10055)] error.

I attempted to run the same file using pgAdmin III version 1.6.2, but every time the file is executed the only output in the messages pane is a colon (":"). No data is inserted. I attempted to run the same file using psql.exe and this worked without any problems. I also noticed that if I use fewer inserts, for example only a few hundred, the file executes fine.

What could be the issue? Thank you.

Nik
Re: [HACKERS] WSAStartup() in libpq
Magnus Hagander <[EMAIL PROTECTED]> writes: > On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote: >> Also, isn't the WSACleanup() in freePGconn in the wrong place? Seems >> like it shouldn't be done until after we've closed the socket. I'd >> be inclined to put it at the bottom of the routine. > Certainly looks wrong. It's interesting how this could have worked > *before*. Because the calls in DllMain covered us (ie, the WSA usage count never got to be less than one). If we remove them, we'd better get this pair right. One thing that bothers me a bit is that if we just move the call to the bottom, then freePGconn won't do it at all if passed a NULL pointer. Now (assuming a non-broken app) the only way that can happen is if makeEmptyPGconn runs out of memory. If the client gets back a null pointer from a connection attempt, it's probably a 50-50 proposition whether it will think it ought to do PQfinish with it. So it'd be good if we could keep the usage count straight either way. I propose the invariant "a WSA usage count is associated with a non-null PGconn structure". That would mean that doing WSACleanup only at the bottom of freePGconn is correct, but also that makeEmptyPGconn needs to do WSACleanup in its (two) failure paths. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] Estimating seq_page_fetch and random_page_fetch
Hi,

How can we accurately estimate the "seq_page_fetch" and "random_page_fetch" costs from outside Postgres, using for example a C routine? Using a very simple C program, I have two sets of files. I want to access one set randomly, i.e. pulling data from random locations within the files. The second set of files is accessed sequentially. The goal here is to approximate the disk I/O cost of a "random page fetch" and a "sequential page fetch" respectively. I am using low-level (unbuffered) C routines, i.e. read/write and lseek (for positioning the file pointer); the read/write buffer size is 8k (to match the size of a Postgres page), and Linux is the host OS. We all know that Linux caches heavily; for that very reason I am using sets of files instead of a single file, in the hope that whenever a new file from a set is accessed for the first time, it will NOT be in the OS cache, thus giving accurate results for actually fetching the file pages from the physical disk. The host is also restarted before running the experiment, to force a cold-cache start. I am hoping somebody can point me in the right direction. Thanks -Umar
[HACKERS]
I need to find out if anybody can give me some advice on incremental backups and restores. Reason: backup size. Abrie
Re: [HACKERS] WSAStartup() in libpq
On Thu, Mar 08, 2007 at 10:37:11AM -0500, Tom Lane wrote:
> Magnus Hagander <[EMAIL PROTECTED]> writes:
> > On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote:
> >> Also, isn't the WSACleanup() in freePGconn in the wrong place? Seems
> >> like it shouldn't be done until after we've closed the socket. I'd
> >> be inclined to put it at the bottom of the routine.
>
> > Certainly looks wrong. It's interesting how this could have worked
> > *before*.
>
> Because the calls in DllMain covered us (ie, the WSA usage count never
> got to be less than one). If we remove them, we'd better get this pair
> right.

But those calls weren't even compiled in when building using mingw, which is what the majority of our users have been using lately, I think. (Since that's what we ship in the binary package)

> One thing that bothers me a bit is that if we just move the call to the
> bottom, then freePGconn won't do it at all if passed a NULL pointer.
> Now (assuming a non-broken app) the only way that can happen is if
> makeEmptyPGconn runs out of memory. If the client gets back a null
> pointer from a connection attempt, it's probably a 50-50 proposition
> whether it will think it ought to do PQfinish with it. So it'd be good
> if we could keep the usage count straight either way. I propose the
> invariant "a WSA usage count is associated with a non-null PGconn
> structure". That would mean that doing WSACleanup only at the bottom
> of freePGconn is correct, but also that makeEmptyPGconn needs to do
> WSACleanup in its (two) failure paths.

I'm honestly unsure whether we need to bother with it, but yeah, that will likely be "more correct".
(Except one of the error paths in makeEmptyPGconn is already covered, since it calls freePGconn, which does the WSACleanup)

//Magnus

Index: bcc32.mak
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/bcc32.mak,v
retrieving revision 1.26
diff -c -r1.26 bcc32.mak
*** bcc32.mak	11 Jan 2007 02:42:31 -	1.26
--- bcc32.mak	8 Mar 2007 15:23:17 -
***************
*** 93,99 ****
  [EMAIL PROTECTED] "$(INTDIR)\fe-secure.obj"
  [EMAIL PROTECTED] "$(INTDIR)\pqexpbuffer.obj"
  [EMAIL PROTECTED] "$(INTDIR)\pqsignal.obj"
- [EMAIL PROTECTED] "$(OUTDIR)\libpqdll.obj"
  [EMAIL PROTECTED] "$(OUTDIR)\win32.obj"
  [EMAIL PROTECTED] "$(INTDIR)\wchar.obj"
  [EMAIL PROTECTED] "$(INTDIR)\encnames.obj"
--- 93,98 ----
***************
*** 155,168 ****
  LINK32=ilink32.exe
  LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v
- LINK32_OBJS= "$(INTDIR)\libpqdll.obj"
  # @<< is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm
! "$(OUTDIR)\blibpq.dll": "$(OUTDIR)\blibpq.lib" $(LINK32_OBJS) "$(INTDIR)\libpq.res" blibpqdll.def
  	$(LINK32) @<<
  	$(LINK32_FLAGS) +
! 	c0d32.obj $(LINK32_OBJS), +
  	$@,, +
  	"$(OUTDIR)\blibpq.lib" import32.lib cw32mt.lib, +
  	blibpqdll.def,"$(INTDIR)\libpq.res"
--- 154,166 ----
  LINK32=ilink32.exe
  LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v
  # @<< is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm
! "$(OUTDIR)\blibpq.dll": "$(OUTDIR)\blibpq.lib" "$(INTDIR)\libpq.res" blibpqdll.def
  	$(LINK32) @<<
  	$(LINK32_FLAGS) +
! 	c0d32.obj , +
  	$@,, +
  	"$(OUTDIR)\blibpq.lib" import32.lib cw32mt.lib, +
  	blibpqdll.def,"$(INTDIR)\libpq.res"

Index: fe-connect.c
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.344
diff -c -r1.344 fe-connect.c
*** fe-connect.c	20 Feb 2007 15:20:51 -	1.344
--- fe-connect.c	8 Mar 2007 15:42:59 -
***************
*** 1840,1848 ****
  #ifdef WIN32
  	/*
! 	 * Make sure socket support is up and running. Even though this is done in
! 	 * libpqdll.c, that is only for MSVC and BCC builds and doesn't work for
! 	 * static builds at all, so we have to do it in the main code too.
  	 */
  	WSADATA wsaData;
--- 1840,1846 ----
  #ifdef WIN32
  	/*
! 	 * Make sure socket support is up and running.
  	 */
  	WSADATA wsaData;
***************
*** 1853,1859 ****
--- 1851,1862 ----
  	conn = (PGconn *) malloc(sizeof(PGconn));
  	if (conn == NULL)
+ 	{
+ #ifdef WIN32
+ 		WSACleanup();
+ #endif
  		return conn;
+ 	}

  	/* Zero all pointers and booleans */
  	MemSet(conn, 0, sizeof(PGconn));
***************
*** 1918,1927 ****
  	PGnotify *notify;
  	pgParameterStatus *pstatus;

- #ifdef WIN32
- 	WSACleanup();
- #endif
-
  	if (!conn)
  		return;
--- 1921,1926 ----
***************
*** 1986,1991 ****
--- 1985,1994 ----
  	termPQExpBuffer(&conn->errorMessage);
  	termPQExpBuffer(&conn->
Re: [HACKERS] WSAStartup() in libpq
Magnus Hagander <[EMAIL PROTECTED]> writes: > On Thu, Mar 08, 2007 at 10:37:11AM -0500, Tom Lane wrote: >> Because the calls in DllMain covered us (ie, the WSA usage count never >> got to be less than one). If we remove them, we'd better get this pair >> right. > But those calls weren't even compiled in when building using mingw, Hmm ... does make you wonder, doesn't it? But anyway, if we're bothering to call the functions at all, we should try to meet the defined protocol. So I like this latest version of the patch. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] No buffer space available
Nik wrote:
> have an SQL file with a set of about 3000 insert statements. This file is
> executed twice from a remote client machine running Windows XP. The first
> time it is executed against a Windows 2003 Server running PostgreSQL 8.0.
> The second time it is executed against a Windows 2003 Server running
> PostgreSQL 8.2. This setup has been in place for about a year now and never
> had any issues. However, I noticed today that all the file executions fail
> with [nativecode=could not send data to server: No buffer space available
> (0x2747/10055)] error. I attempted to run the same file using pgAdmin III
> version 1.6.2, but every time the file is executed the only return in
> messages pane is the colon sign ":". No data is inserted. I attempted to
> run the same file using psql.exe and this worked without any problems. I
> also noticed that if I use less inserts, for example only a few hundred,
> the file executes fine. What could be the issue?

First, this is the wrong place to ask this question. pgsql-hackers is about development of postgres itself, not about usage questions. Perhaps you should ask on pgsql-general.

Second, if you can run the file using psql, then it does not appear to be a problem with postgres at all, but rather with the client you are using (you don't say what that client is). Perhaps your client is trying a naive method of batching inserts and thus running out of buffer space - it's hard to tell since you haven't given much information. In the case of pgAdminIII, you might need to ask on its lists.

cheers

andrew

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS]
Abrie wrote: I need to find out if anybody can give me some advice on incremental backups and restores Reason: backup size Abrie - this is the pgsql-hackers mailing list. It's for questions about the development of the PostgreSQL database system. You'll want the pgsql-general or pgsql-admin lists. -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
"Umar Farooq Minhas" <[EMAIL PROTECTED]> writes: > How can we accrately estimate the "seq_page_fetch" and = > "random_page_fetch" costs from outside the postgres using for example a = > C routine. Use a test case larger than memory. Repeat many times to average out noise. IIRC, when I did the experiments that led to the current random_page_cost of 4.0, it took about a week before I had numbers I trusted. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
"Tom Lane" <[EMAIL PROTECTED]> writes: > "Umar Farooq Minhas" <[EMAIL PROTECTED]> writes: >> How can we accrately estimate the "seq_page_fetch" and = >> "random_page_fetch" costs from outside the postgres using for example a = >> C routine. > > Use a test case larger than memory. Repeat many times to average out > noise. IIRC, when I did the experiments that led to the current > random_page_cost of 4.0, it took about a week before I had numbers I > trusted. When I was running tests I did it on a filesystem where nothing else was running. Between tests I unmounted and remounted it. As I understand it Linux associates the cache with the filesystem and not the block device and discards all pages from cache when the filesystem is unmounted. That doesn't contradict anything Tom said, it might be useful as an additional tool though. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > Is everybody OK with not putting a per-tablespace worker limit?
> > Is everybody OK with putting per-database worker limits on a pg_database
> > column?
>
> I don't think we need a new pg_database column. If it's a GUC you can
> do ALTER DATABASE SET, no? Or was that what you meant?

No, that doesn't work unless we save the datconfig column to the pg_database flatfile, because it's the launcher (which is not connected) who needs to read it. Same thing with a hypothetical per-database naptime. The launcher would also need to parse it, which is not ideal (though not a dealbreaker either).

-- 
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] pg_standby
Hi Simon, I would preserve the existing trigger function as little t "-t", and maybe implement a catchup trigger function as big t "-T"? Set it up so that if the first attempt to find the WAL file postgres is currently requesting succeeds, skip over the trigger check. If the first attempt fails, then do your trigger check. That way, in the OCF script, the postmaster can be started, the trigger file set, and connection to the database looped on until it succeeds as an indication for when the database is up and available. I think that's cleaner than comparing a filename from a 'ps' command. Once I've completed the OCF script and done some testing, I'll forward it to you for you to review and see if you want to include it. Thanks, Doug On Thu, 2007-03-08 at 15:37 +, Simon Riggs wrote: > On Thu, 2007-03-08 at 10:33 -0500, Doug Knight wrote: > > Thanks, Simon. I kind of figured that's how pg_standby would work, > > since its invoked by postgres once per WAL file. What I was thinking I > > might do in the OCF script is to grab the pg_standby process line from > > a ps, pull out the "current" WAL file path and filename, then do an > > existence check for the file. If the file exists, then > > pg_standby/postgres is probably processing it. If not, then we're > > probably waiting on it, implying that recovery is complete. Thoughts > > on this process? > > I suppose I might be able to have the option to catch up before it > stops, on the basis that if it can find the file it was looking for > without waiting then that can override the trigger. > > Which way would you like it to work? >
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Tom Lane wrote:
> Gaetano Mendola <[EMAIL PROTECTED]> writes:
>> [ 8.2 evaluates volatile functions in the targetlist of a view ]
>> If I mark the function as STABLE or IMMUTABLE then even with version
>> 8.2 the function is not evaluated. Is this the intended behavior?
>
> Yes; people complained that we needed to be more careful about the
> number of times volatile functions get evaluated.

I suspect that the functions are also evaluated for records discarded due to joins. Is that the case? Like:

SELECT * FROM ta JOIN tb USING (id) where tb.foo = 4;

If ta is a view with some calculated fields, are the functions on ta evaluated only for the records matching the filters, or in some cases (like a full scan on ta) also for the records discarded due to the join?

Regards
Gaetano Mendola

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [PATCHES] pg_standby
On Thu, 2007-03-08 at 13:29 -0500, Doug Knight wrote: > I would preserve the existing trigger function as little t "-t", and > maybe implement a catchup trigger function as big t "-T"? Set it up so > that if the first attempt to find the WAL file postgres is currently > requesting succeeds, skip over the trigger check. If the first attempt > fails, then do your trigger check. That way, in the OCF script, the > postmaster can be started, the trigger file set, and connection to the > database looped on until it succeeds as an indication for when the > database is up and available. I think that's cleaner than comparing a > filename from a 'ps' command. Once I've completed the OCF script and > done some testing, I'll forward it to you for you to review and see if > you want to include it. I'm happy to do this, unless other objections. I'll be doing another version before feature freeze. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Galy Lee wrote:
> Hi,
>
> Alvaro Herrera wrote:
> > I still haven't received the magic bullet to solve the hot table
> > problem, but these at least means we continue doing *something*.
>
> Can I know about what is your plan or idea for autovacuum improvement
> for 8.3 now? And also what is the roadmap of autovacuum improvement for 8.4?

Things I want to do for 8.3:

- Make use of the launcher/worker stuff, that is, allow multiple autovacuum processes in parallel. With luck we'll find out how to deal with hot tables.

Things I'm not sure we'll be able to have in 8.3, in which case I'll get to them for early 8.4:

- The maintenance window stuff, i.e., being able to throttle workers depending on a user-defined schedule.

8.4 material:

- Per-tablespace throttling, coordinating IO from multiple workers.

I don't have anything else as detailed as a "plan". If you have suggestions, I'm all ears.

Now regarding your restartable vacuum work. I think that stopping a vacuum at some point and being able to restart it later is very cool and may get you some hot chicks, but I'm not sure it's really useful. I think it makes more sense to do something like throttling an ongoing vacuum to a reduced IO rate if the maintenance window closes. So if you're in the middle of a heap scan and the maintenance window closes, you immediately stop the scan and go to the index cleanup phase, *at a reduced IO rate*. This way, the user will be able to get the benefits of vacuuming at some not-too-distant future, without requiring the maintenance window to open again, but without the heavy IO impact that was allowed during the maintenance window. Does this make sense?

-- 
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Auto creation of Partitions
On Thu, Mar 08, 2007 at 01:52:27PM +0530, NikhilS wrote: > >I think it'd be great to make adding and removing partitions as > >simple as ALTER TABLE. I don't think that DELETE should be the > >mechanism to drop a partition, though. Again, DML statements > >shouldn't be performing DDL. > > > Since partition is inheritance-based, a simple DROP or "NO INHERIT" will do > the job to deal with the partition. Do we want to reinvent additional syntax > when these are around and are documented? Well, if the syntax for adding a new partition eventually ends up as ALTER TABLE ADD PARTITION, then it would make more sense that you remove a partition via ALTER TABLE DROP PARTITION. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Auto creation of Partitions
On Thu, Mar 08, 2007 at 10:12:30AM +, Gregory Stark wrote: > "Jim Nasby" <[EMAIL PROTECTED]> writes: > > > One possibility would be to require 3 functions for a partitioned table: one > > accepts the partitioning key and tells you what partition it's in, one that > > tells you what the minimum partitioning key for a partition would be, and > > one > > that tells you what the maximum would be. If the user supplied those 3 > > functions, I think it would be possibly to automatically generate code for > > the > > triggers and check constraints. The min/max partition key functions might > > allow you to more efficiently do partition elimination, too. > > But then it would be harder to tell whether a clause implied a given > partition. That is, if you have a partition constraint of "col OP const" then > we can test whether a query clause of "col OP2 const2" implies that constraint > when planning (or actually whether it implies it's false to exclude the > partition). If you have a constraint like "P1(const)" it'll be pretty hard to > do much with that. Well, you could tell what partition 'const' was in; I would think that plus knowledge about OP2 would allow you to decide what partitions you need to look at. There's also nothing to prevent us from also adding the constraints and using constraint exclusion as well. In fact, I think we'd want to have the constraints just so we know that a given partition only contains the data we want it to. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Auto creation of Partitions
One other thought... a lot of this discussion seems to parallel the debate from a while ago about whether SERIAL should act like a macro (ie: it just sets everything up and users are free to monkey under the hood afterwards), or whether it should be its own 'closed-box' construct. Currently, we seem to be leaning towards partition management being a 'macro', with child tables very exposed, etc. I don't know if that's good or bad, but it's probably worth some thought. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
A few things from our side:

1. 'Skyline Of' is a new operator proposed at ICDE 2003, one of the topmost conferences in data engineering. The skyline operation is a hot area of research in query processing. Many people in the database community know about this operator, and it is fast catching attention.

2. The skyline operation is very useful in data analysis. Suppose we have a cricket database and want to find the bowlers who have taken the maximum wickets in the minimum overs. We can issue an easy-to-write query using the 'Skyline Of' syntax as follows:

Select * from Player_Match Skyline Of overs_bowled min, wickets_taken max;

This query gives 25 interesting tuples (the result set) out of 24750 tuples in 0.0509 seconds. The same result is obtained in 0.8228 seconds if the following equivalent nested query is issued:

select * from Player_Match p1
where not exists (
    select * from Player_Match p2
    where p2.overs_bowled <= p1.overs_bowled
      and p2.wickets_taken >= p1.wickets_taken
      and (p2.overs_bowled < p1.overs_bowled
           or p2.wickets_taken > p1.wickets_taken))

Note that the above time is the time elapsed between issuing the query and obtaining the result set. As can be seen, the nested query is cumbersome to write and inefficient too. So, which query will the user prefer? As the number of dimensions increases, writing the nested query becomes a tedious task. Btw, how can such a query be written using aggregate function syntax?

3. As far as optimizing the skyline is concerned, it is still a research problem, since it requires estimating the cardinality of the skyline result set.

4. Until and unless this operator is implemented in a popular database system, how can a user ever get to know about it and hence appreciate its usefulness?

Btw, it was our B.Tech final year project, and not a term project :-)

Regards.
On 3/8/07, Tom Lane <[EMAIL PROTECTED]> wrote: Shane Ambler <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Well, whether it's horrible or not is in the eye of the beholder, but >> this is certainly a non-standard syntax extension. > Being non-standard should not be the only reason to reject a worthwhile > feature. No, but being non-standard is certainly an indicator that the feature may not be of widespread interest --- if it were, the SQL committee would've gotten around to including it; seems they've managed to include everything but the kitchen sink already. Add to that the complete lack of any previous demand for the feature, and you have to wonder where the market is. > The fact that several > different groups have been mentioned to be working on this feature would > indicate that it is worth considering. It looks to me more like someone published a paper that caught the attention of a few profs looking for term projects for their students. Now maybe it really is the best idea since sliced bread and will be seen in the next SQL spec edition, but color me skeptical. It seems to me to be a very narrow-usage extension, as opposed to (eg) multi-input aggregates or WITH/RECURSIVE, which provide general mechanisms applicable to a multitude of problems. Now even so it would be fine if the implementation were similarly narrow in scope, but the published description of the patch mentions a large chunk of additional executor mechanisms. If we're going to be adding as much code as that, I'd like to see a wider scope of usage for it. Basically, this patch isn't sounding like it has a reasonable bang-to-the-buck ratio ... regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend -- Pride sullies the noblest character
Re: [HACKERS] Stream bitmaps
Hi Gavin,

Any progress?

Gavin Sherry wrote:
> Heikki,
>
> On Mon, 5 Mar 2007, Heikki Linnakangas wrote:
>> Hi all,
>>
>> I'd like to see the indexam API changes needed by the bitmap indexam
>> to be committed soon. Has anyone looked at the proposed API in the
>> latest patch? Any thoughts?
>
> Thanks for looking at it!
>
>> I'm quite happy with it myself, with a few reservations:
>>
>> - All the getbitmap implementations except the new bitmap indexam are
>> just boilerplate. How about making the getbitmap function optional, and
>> having a generic implementation that fills in a hash bitmap using the
>> traditional getnext function?
>>
>> - getbitmap is passed an existing bitmap as argument, and the
>> implementation needs to OR the existing bitmap with new tuples. How
>> about AND? An indexam could be smart about ANDing with an existing
>> bitmap, for example skipping to the first set bit in the existing
>> bitmap and starting the scan from there.
>>
>> - I'd like to have support to return candidate matches with both
>> getbitmap and getnext. A simple flag per page of results would be
>> enough for getbitmap, I think.
>>
>> - StreamBitmap and HashBitmap are separate node types, but OpStream is
>> not. The opaque field in the StreamBitmap struct is not really that
>> opaque; it needs to be a StreamNode. I drew a UML sketch of what I
>> think the class hierarchy is
>> (http://community.enterprisedb.com/streambitmaps.png). This is
>> object-oriented programming; we're just implementing classes and
>> inheritance with structs and function pointers. The current patch
>> mixes different techniques, and that needs to be cleaned up.
>
> All good ideas, I'll look at them more closely in the morning.
>
>> I'd like to see a separate patch that contains just the API changes.
>> Gavin, could you extract an API-only patch from the bitmap index patch?
>> I can work on it as well, but I don't want to step on your toes.
>
> Okay, I can do that.
>
> Thanks,
>
> Gavin

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
Adding to this: Ayush recently wrote a C program that emulates PG IO to do this analysis, and we came out with (predictably) a ratio of sequential/random of 20-50 (for a single user). This is predictable because the random component is fixed at the access time of a single hard drive no matter how many disks are in an array, while the sequential scales nearly linearly with the number of drives in the array. So, you can estimate random using 8-12ms per random access, and sequential as 1/(number of disks X 60-130MB/s). Ayush, can you forward your C program? - Luke Msg is shrt cuz m on ma treo -Original Message- From: Gregory Stark [mailto:[EMAIL PROTECTED] Sent: Thursday, March 08, 2007 12:37 PM Eastern Standard Time To: Tom Lane Cc: Umar Farooq Minhas; pgsql-hackers@postgresql.org Subject:Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch "Tom Lane" <[EMAIL PROTECTED]> writes: > "Umar Farooq Minhas" <[EMAIL PROTECTED]> writes: >> How can we accrately estimate the "seq_page_fetch" and = >> "random_page_fetch" costs from outside the postgres using for example a = >> C routine. > > Use a test case larger than memory. Repeat many times to average out > noise. IIRC, when I did the experiments that led to the current > random_page_cost of 4.0, it took about a week before I had numbers I > trusted. When I was running tests I did it on a filesystem where nothing else was running. Between tests I unmounted and remounted it. As I understand it Linux associates the cache with the filesystem and not the block device and discards all pages from cache when the filesystem is unmounted. That doesn't contradict anything Tom said, it might be useful as an additional tool though. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] pg_standby
Excellent. Once you're ready, fire it over and I'll test it on our config. Doug On Thu, 2007-03-08 at 18:34 +, Simon Riggs wrote: > On Thu, 2007-03-08 at 13:29 -0500, Doug Knight wrote: > > > I would preserve the existing trigger function as little t "-t", and > > maybe implement a catchup trigger function as big t "-T"? Set it up so > > that if the first attempt to find the WAL file postgres is currently > > requesting succeeds, skip over the trigger check. If the first attempt > > fails, then do your trigger check. That way, in the OCF script, the > > postmaster can be started, the trigger file set, and connection to the > > database looped on until it succeeds as an indication for when the > > database is up and available. I think that's cleaner than comparing a > > filename from a 'ps' command. Once I've completed the OCF script and > > done some testing, I'll forward it to you for you to review and see if > > you want to include it. > > I'm happy to do this, unless other objections. > > I'll be doing another version before feature freeze. >
Re: [HACKERS] Auto creation of Partitions
Note to Nikhil: Make sure the new syntax doesn't prevent partitions from being placed upon multiple tablespaces in some manner, at CREATE TABLE time.

What if the syntax was something like -

CREATE TABLE tabname (
    ...
    ...
) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now */
[PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
 PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname] ... ];

And (if we use the ALTER TABLE to add partitions):

ALTER TABLE tabname ADD PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname];

Of course

ALTER TABLE childtable SET TABLESPACE tblspcname;

should not cause any probs.

-- 
Shane Ambler [EMAIL PROTECTED]
Get Sheeky @ http://Sheeky.Biz

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
On Thu, Mar 08, 2007 at 05:35:03PM +, Gregory Stark wrote:
> "Tom Lane" <[EMAIL PROTECTED]> writes:
> > "Umar Farooq Minhas" <[EMAIL PROTECTED]> writes:
> >> How can we accrately estimate the "seq_page_fetch" and =
> >> "random_page_fetch" costs from outside the postgres using for example a =
> >> C routine.
> >
> > Use a test case larger than memory. Repeat many times to average out
> > noise. IIRC, when I did the experiments that led to the current
> > random_page_cost of 4.0, it took about a week before I had numbers I
> > trusted.
>
> When I was running tests I did it on a filesystem where nothing else was
> running. Between tests I unmounted and remounted it. As I understand it
> Linux associates the cache with the filesystem and not the block device
> and discards all pages from cache when the filesystem is unmounted.
>
> That doesn't contradict anything Tom said, it might be useful as an
> additional tool though.

Another trick I've used in the past is to just run the machine out of memory, using the following:

/*
 * $Id: clearmem.c,v 1.1 2003/06/29 20:41:33 decibel Exp $
 *
 * Utility to clear out a chunk of memory and zero it. Useful for
 * flushing disk buffers.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	/* Allocate and zero the requested number of megabytes. */
	if (argc < 2 || !calloc(atoi(argv[1]), 1024 * 1024))
	{
		printf("Error allocating memory.\n");
		return 1;
	}
	return 0;
}

I'll monitor top while that's running to ensure that some stuff gets swapped out to disk. I believe this might still leave some cached data in other areas of the kernel, but it's probably not enough to worry about.

-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Accelerating INSERT/UPDATE using UPS
Simon,

> Not checkpointing at all is not a good plan, since this will lead to an
> enormous build up of WAL files and a very long recovery time if the
> system does fail.

I appreciate your detailed comments. Following your comments, I revised the problem. Sigres-0.1.3 does perform checkpoints. In summary, the features of Sigres-0.1.3 are as follows.

0: 10% faster than conventional PostgreSQL under tmpfs.
1: Checkpoints are continually executed.
2: Sigres mode is on by default (the mode can be turned off via postgresql.conf).
3: issue_xlog_sync is called only by the bgwriter (continually, via CreateCheckPoint).
4: The body of XLogWrite (_XLogWrite in my code) is called by both backends and the bgwriter. For each backend, _XLogWrite is called only via AdvanceXLInsertBuffer. For the bgwriter, _XLogWrite is called via CreateCheckPoint.

Please try it if you are interested. http://sourceforge.jp/projects/sigres/

Again, I really appreciate the beneficial comments from this community!

Regards, -- Hideyuki -- Hideyuki Kawashima (Ph.D.) University of Tsukuba, Assistant Professor
Re: [HACKERS] Auto creation of Partitions
Jim C. Nasby wrote:
> On Thu, Mar 08, 2007 at 01:52:27PM +0530, NikhilS wrote:
> >> I think it'd be great to make adding and removing partitions as simple
> >> as ALTER TABLE. I don't think that DELETE should be the mechanism to
> >> drop a partition, though. Again, DML statements shouldn't be
> >> performing DDL.
> >
> > Since partitioning is inheritance-based, a simple DROP or "NO INHERIT"
> > will do the job to deal with the partition. Do we want to invent
> > additional syntax when these are around and are documented?
>
> Well, if the syntax for adding a new partition eventually ends up as
> ALTER TABLE ADD PARTITION, then it would make more sense that you remove
> a partition via ALTER TABLE DROP PARTITION.

This follows on from the suggestion I made, taken along the lines of the subject "auto creation of partitions", where I suggested the syntax of partition check(month of mydatecol) and having a new partition created as data was entered. With this scenario, dropping the partition when it was empty would complement the creation of a new partition as needed.

Given that there seems to be no real support for going with "auto maintenance", where new partitions are added as needed, the auto dropping of empty partitions would also not apply. That leaves us with only specific add partition / drop partition commands, and having the parent table pick up rows not matching any partition check criteria.

-- Shane Ambler [EMAIL PROTECTED] Get Sheeky @ http://Sheeky.Biz
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
On Thu, 2007-03-08 at 17:35 +0000, Gregory Stark wrote:
> When I was running tests I did it on a filesystem where nothing else was
> running. Between tests I unmounted and remounted it. As I understand it,
> Linux associates the cache with the filesystem and not the block device,
> and discards all pages from cache when the filesystem is unmounted.

On recent Linux kernels, /proc/sys/vm/drop_caches can also be useful: http://linux.inet.hr/proc_sys_vm_drop_caches.html

You could also use posix_fadvise() to achieve a similar effect on a per-file basis.

-Neil
Re: [HACKERS] Stream bitmaps
On Thu, 8 Mar 2007, Heikki Linnakangas wrote:
> Hi Gavin,
>
> Any progress?

Really busy at the moment, but it's on my TODO list for today.

Thanks, Gavin
[HACKERS] who gets paid for this
Hi all, I'm a grad student at UC Davis studying the postgres community and I wanted to know if someone on this list could help me out. I'm studying the factors that affect people "graduating" from being a mailing list participant to a developer with write access to the repository. Is it possible to find out who is being employed to work on postgres and who is doing it on their own time? Some of my data points to there being two ways that people make the jump. More specifically, could those who worked on apache as some aspect of their job prior to getting repo access let me know? Or if there are devs who know this information about others, I'd be really appreciative to get it. Thanks a lot.

-- Christian Bird [EMAIL PROTECTED]
Re: [HACKERS] who gets paid for this
Hi all, I'm a grad student at UC Davis studying the apache server community and I wanted to know if someone on this list could help me out. I'm studying the factors that affect people "graduating" from being a mailing list participant to a developer with write access to the repository. Is it possible to find out who is being employed to work on apache and who is doing it on their own time? Some of my data points to there being two ways that people make the jump. More specifically, could those who worked on postgres as some aspect of their job prior to getting cvs access let me know? Or if there are devs who know this information about others, I'd be really appreciative to get it. Thanks a lot.

-- Christian Bird [EMAIL PROTECTED]
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
Thanks a lot for your replies. The suggestions have proved very useful. Ayush, I'm curious to see your C program, thanks.

Here is a related but different issue. I started looking at the Postgres optimizer/planner code a month back to modify it for the purposes of experiments that I need to conduct. The EXPLAIN command prints the total cost, i.e. both CPU + I/O; however, for my purposes I need these two costs separated, i.e. instead of getting one cost displayed, I want the CPU cost and I/O cost displayed separately when I run EXPLAIN on a particular query. Till now I haven't been able to figure out a 'clean' way of doing this. Can anyone tell me how much time I should expect to spend making such a change, and where I should start? costsize.c?

I have another question. Looking at the optimizer code, it looks pretty much insensitive to the memory factor. The only parameters being utilized are "effective_cache_size" (in estimating index cost only) and "work_mem" (for sorts, aggregation, groups, hash/merge joins). Are these the only memory factors that DIRECTLY affect the cost estimates of the planner/optimizer?

Again your help is appreciated. -Umar

----- Original Message ----- From: Luke Lonergan; To: Gregory Stark, Tom Lane, Ayush Parashar; Cc: Umar Farooq Minhas, pgsql-hackers@postgresql.org; Sent: Thursday, March 08, 2007 2:16 PM; Subject: Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

Adding to this: Ayush recently wrote a C program that emulates PG I/O to do this analysis, and we came out with (predictably) a sequential/random ratio of 20-50 (for a single user). This is predictable because the random component is fixed at the access time of a single hard drive no matter how many disks are in an array, while the sequential component scales nearly linearly with the number of drives in the array.
So, you can estimate random using 8-12 ms per random access, and sequential as 1/(number of disks x 60-130 MB/s). Ayush, can you forward your C program?

- Luke

Msg is shrt cuz m on ma treo

-----Original Message----- From: Gregory Stark [mailto:[EMAIL PROTECTED]] Sent: Thursday, March 08, 2007 12:37 PM Eastern Standard Time; To: Tom Lane; Cc: Umar Farooq Minhas, pgsql-hackers@postgresql.org; Subject: Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

"Tom Lane" <[EMAIL PROTECTED]> writes:
> "Umar Farooq Minhas" <[EMAIL PROTECTED]> writes:
>> How can we accurately estimate the "seq_page_fetch" and
>> "random_page_fetch" costs from outside Postgres using, for example,
>> a C routine?
>
> Use a test case larger than memory. Repeat many times to average out
> noise. IIRC, when I did the experiments that led to the current
> random_page_cost of 4.0, it took about a week before I had numbers I
> trusted.

When I was running tests I did it on a filesystem where nothing else was running. Between tests I unmounted and remounted it. As I understand it, Linux associates the cache with the filesystem and not the block device, and discards all pages from cache when the filesystem is unmounted.

That doesn't contradict anything Tom said; it might be useful as an additional tool though.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Alvaro Herrera wrote:
> I don't have anything else as detailed as a "plan". If you have
> suggestions, I'm all ears.

Cool, thanks for the update. :) We also have some new ideas on the improvement of autovacuum now. I will raise them later.

> Now regarding your restartable vacuum work.
> Does this make sense?

I have also reached a similar conclusion now. Thank you.

Regards, Galy
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Jim Nasby <[EMAIL PROTECTED]> wrote:
>> Also, my recommended bgwriter_lru_maxpages is "average number of
>> recycled buffers per cycle", that is hardly able to tune manually.
>
> What do you mean by "number of recycled buffers per cycle"?

There is the following description in the documentation:

| * bgwriter_lru_percent (floating point)
|   To reduce the probability that server processes will need to issue their
|   own writes, the background writer tries to write buffers that are likely
|   to be recycled soon.
| * bgwriter_lru_maxpages (integer)
|   In each round, no more than this many buffers will be written as a
|   result of scanning soon-to-be-recycled buffers.

The number of recycled buffers per round is the same as the number of StrategyGetBuffer() calls per round. I think that number is the right value for bgwriter_lru_maxpages if we want the bgwriter to write almost all of the dirty pages.

I proposed to change the semantics of bgwriter_lru_maxpages. It used to mean "maximum writes per round", but the new meaning is "reserved buffers for recycling". Non-dirty unused buffers will be counted toward it.

I'm measuring the difference in performance between manual and automatic tuning, especially when adding some jobs before writes. I'll let you know when I get the results.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
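For readers following along, the parameters under discussion live in postgresql.conf. The values below are my recollection of the stock defaults in this era (treat them as illustrative, not authoritative), shown to make the interaction of the two LRU knobs concrete:

```ini
# Background writer LRU scan, one "round" every bgwriter_delay
bgwriter_delay = 200ms          # time between rounds
bgwriter_lru_percent = 1.0      # portion of the pool scanned per round
bgwriter_lru_maxpages = 5       # cap on LRU writes per round; this is the
                                # limit whose semantics the proposal changes
```

The proposal above would reinterpret bgwriter_lru_maxpages as a target number of clean, reusable buffers to keep ahead of the clock sweep, rather than a per-round write cap.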
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Greg Smith <[EMAIL PROTECTED]> wrote:
>> Also, my recommended bgwriter_lru_maxpages is "average number of
>> recycled buffers per cycle", that is hardly able to tune manually.
>
> This is completely dependent on what percentage of your buffer cache is
> pinned.

Don't you mean usage_count? In my understanding, each backend pins two or so buffers at once, so the percentage of pinned buffers should be low.

> If your load is something like the standard pgbench, the LRU
> writer will rarely find anything useful to write, so this entire line of
> thinking won't work. The proper behavior for heavily pinned data is to
> turn off the LRU writer altogether so there's more time to run the all
> scan.

I think you are pointing out a problem with the buffer management algorithm itself, not only the bgwriter. Even if you turn off the LRU writer, each backend pays the same cost to find recyclable buffers every time it allocates a buffer.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Now regarding your restartable vacuum work. I think that stopping a
> vacuum at some point and being able to restart it later is very cool and
> may get you some hot chicks, but I'm not sure it's really useful.

Too true :-(

> I think it makes more sense to do something like throttling an ongoing
> vacuum to a reduced IO rate, if the maintenance window closes. So if
> you're in the middle of a heap scan and the maintenance window closes,
> you immediately stop the scan and go to the index cleanup phase, *at a
> reduced IO rate*.

Er, why not just finish out the scan at the reduced I/O rate? Any sort of "abort" behavior is going to create net inefficiency, e.g. doing an index scan to remove only a few tuples. ISTM that the vacuum ought to just continue along its existing path at a slower I/O rate.

regards, tom lane
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:
> In my understanding, each backend pins two or so buffers at once. So
> percentage of pinned buffers should be low.

With the pgbench workload, a substantial percentage of the buffer cache ends up pinned. From staring at the buffer cache using contrib/pg_buffercache, I believe most of that consists of the index blocks for the records being updated in the accounts table.

I just posted a new version of the patch I asked for feedback on at the beginning of this thread; the latest one is at http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt I've been adjusting it to monitor the same data I think you need to refine your patch. I believe the approach you're taking includes some assumptions that seem perfectly reasonable, but that my testing doesn't agree with. There's nothing like measuring something to settle what's really going on, though, so that's what I've been focusing on. I'd love to get some feedback on whether other people can replicate the kind of things I'm seeing.

The new code generates statistics about exactly what the background writer scan found during each round. If there's been substantial write activity, it prints a log line when it recycles back to the beginning of the all scan, to help characterize what the buffer pool looked like during that scan from the perspective of the bgwriter. Here's some sample log output from my underpowered laptop while running pgbench:

bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
...
checkpoint required (wrote checkpoint_segments)
checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms

Here 69% of the buffer cache contained dirty data, and 49% of the cache was both pinned and dirty. During that same time period, the LRU writer also wrote out a fair amount of data, operating on the 20% of the cache that was dirty but not pinned.
On my production server, where the background writer is turned way up to reduce checkpoint times, these numbers are even more extreme; almost everything that's dirty is also pinned during pgbench, and the LRU writer is lucky to find anything it can write as a result.

That patch is against the 8.2 codebase; now that I'm almost done I'm planning to move it to HEAD soon (where it will conflict considerably with your patch). If you have an 8.2 configuration you can test with my patch applied, set log_min_messages = debug2, try it out, and see what you get when running pgbench for a while. I think you'll discover some interesting and unexpected things.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [HACKERS] who gets paid for this
Christian,

> More specifically, could those who worked on apache as some aspect of
> their job prior to getting repo access let me know? Or if there are devs
> who know this information about others, I'd be really appreciative to
> get it.

Hmmm. Wrong project. And I think you're making the (incorrect) assumption that granting commit rights works the same way in all projects. It does not.

How about you call me and we can chat about how the PostgreSQL project actually works? 415-752-2500.

-- Josh Berkus PostgreSQL @ Sun San Francisco
Re: [HACKERS] Auto creation of Partitions
Hi,

On 3/9/07, Shane Ambler <[EMAIL PROTECTED]> wrote:
>> Note to Nikhil: Make sure the new syntax doesn't prevent partitions
>> from being placed upon multiple tablespaces in some manner, at CREATE
>> TABLE time.
>
> What if the syntax was something like -
>
> CREATE TABLE tabname (
>     ...
>     ...
> ) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr)
> [PARTITIONS num_partitions] /* will apply to HASH only for now */
> [PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
>  PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname]
>  ...
> ];
>
> And (if we use ALTER TABLE to add partitions)
>
> ALTER TABLE tabname
> ADD PARTITION partition_name CHECK(...)
> [USING TABLESPACE tblspcname];

We could as well drop the USING part.

Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi,

> This follows on from the suggestion I made, taken along the lines of the
> subject "auto creation of partitions", where I suggested the syntax of
> partition check(month of mydatecol) and having a new partition created
> as data was entered. With this scenario, dropping the partition when it
> was empty would complement the creation of a new partition as needed.
>
> Given that there seems to be no real support for going with "auto
> maintenance", where new partitions are added as needed, the auto
> dropping of empty partitions would also not apply. That leaves us with
> only specific add partition / drop partition commands, and having the
> parent table pick up rows not matching any partition check criteria.

I was thinking along the lines of what Jim had suggested earlier regarding an overflow partition. Instead of dumping unmatched rows into the master table, we could put them into a default "DUMP/DUMB" partition. Given that Simon wants to do away with having the master table APPENDed in the planning phase, this would be better.

Regards, Nikhils

-- Shane Ambler [EMAIL PROTECTED] Get Sheeky @ http://Sheeky.Biz -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] who gets paid for this
Josh Berkus writes:
> Christian,
>> More specifically, could those who worked on apache as some aspect of
>> their job prior to getting repo access let me know? Or if there are
>> devs who know this information about others, I'd be really appreciative
>> to get it.
> Hmmm. Wrong project. And I think you're making the (incorrect)
> assumption that granting commit rights works the same way in all
> projects. It does not.

Even more to the point, "getting paid for" has almost nothing to do with "has commit privileges". At least on this project.

regards, tom lane
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Tom Lane wrote:
> Er, why not just finish out the scan at the reduced I/O rate? Any sort

Sometimes you may need to vacuum a large table in the maintenance window and a hot table during service time. If vacuuming the hot table does not eat too much foreground resource, then you can vacuum the large table at a lower I/O rate outside the maintenance window; but if vacuuming the hot table is eating too much system resource, then the launcher had better stop the long-running vacuum outside the maintenance window.

But I am not insisting on the stop-start feature at this moment. Changing the cost delay dynamically sounds more reasonable. We can use it to balance the total I/O of workers in service time or maintenance time. It is not so difficult to achieve this by leveraging the shared memory of autovacuum.

Best Regards, Galy Lee
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Greg Smith <[EMAIL PROTECTED]> writes:
> With the pgbench workload, a substantial percentage of the buffer cache
> ends up pinned.

[ raised eyebrow... ] Prove that. AFAIK it's impossible for the pgbench queries to pin more than about three or four buffers per backend concurrently.

regards, tom lane
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Greg Smith <[EMAIL PROTECTED]> wrote:
>> In my understanding, each backend pins two or so buffers at once. So
>> percentage of pinned buffers should be low.
>
> With the pgbench workload, a substantial percentage of the buffer cache
> ends up pinned.
> http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt
> bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
> ...
> checkpoint required (wrote checkpoint_segments)
> checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms
>
> Here 69% of the buffer cache contained dirty data, and 49% of the cache
> was both pinned and dirty.

No. "Pinned" means bufHdr->refcount > 0, and you don't distinguish pinned from recently-used (bufHdr->usage_count > 0) buffers in your patch.

!   if (bufHdr->refcount != 0 || bufHdr->usage_count != 0)
!   {
!       if (skip_pinned)
!       {
!           UnlockBufHdr(bufHdr);
!           return BUF_PINNED;
!       }
!       buffer_write_type = BUF_WRITTEN_AND_PINNED;

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Re: [HACKERS] Accelerating INSERT/UPDATE using UPS
Joshua D. Drake wrote:
> From a deployable application perspective, this could be a big deal. We
> are already starting to see very large traction in the Win32 desktop app
> arena.

There seem to be a few overlapping proposals for reducing various guarantees in the name of performance. As more and more options are added that affect integrity (fsync, full page writes, commit nowait, sigres), it might be nice to outline and compare the approaches, and particularly to describe clearly the failure scenarios and how they are significantly different from one another.

One potentially needs to track an increasing number of settings, each of which reduces certain guarantees on data integrity, which is unpleasant. If a setting is wrong on a performance knob, no problem: when there are complaints that things are slow, you can go through and adjust them. The same is not true of data consistency. When the complaint comes, it is usually too late to fiddle with knobs. I'm just thinking some caution should be exercised in adding too many of them in the first place. I happen to love COMMIT NOWAIT though; for many, this replaces fsync=off.

- August