Re: [HACKERS] Vertical Partitioning with TOAST
Hello all, thank you for your interest.

Jim C. Nasby wrote: Valid point. I do think there's a lot of benefit to being able to set the limit much lower than what it currently defaults to today. We have a client that has a queue-type table that is updated very frequently. One of the fields is text, that is not updated as frequently. Keeping this table vacuumed well enough has proven to be problematic, because any delay to vacuuming quickly results in a very large amount of bloat. Moving that text field into a separate table would most likely be a win.

Yes, our team thinks this patch is effective because it lets a tuple be updated partially. For instance, DBT-2 frequently updates columns of the customer table other than c_data. Because c_data (about 400 bytes, in a tuple of about 500 bytes) is copied on every update, we believe it hurts performance. That matters more to us than the vertical partitioning feature itself. Of course, changing the table's DDL is the right fix where possible; however, I think this approach might be useful when the DDL cannot be changed. As Tom pointed out, this is a patch to verify the idea. I want to know how interested the community is in a partial update, and also whether a vertical partitioning feature is wanted in PostgreSQL. :-) By the way, should I send the patch to -patches again? -- Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Reducing relation locking overhead
On Thu, 2005-12-01 at 21:37 -0500, Tom Lane wrote: In looking at the current pgbench results, I notice that one considerable contribution to LWLock contention is access to the heavyweight-lock manager. 4. The only reason we need to take relation-level locks on indexes at all is to make the world safe for REINDEX being done concurrently with read-only accesses to the table (that don't use the index being reindexed). If we went back to requiring exclusive lock for reindex we could forget all about both #2 and #3. Particularly for updates of relations with lots of indexes, this could be a pretty significant win. However we'd definitely be giving up something that was seen as a feature at one point, so I'm not sold on this idea ... unless someone can see a way to reduce the overhead without giving up concurrent REINDEX.

In a high-txn-rate workload the majority of lock requests will definitely be against indexes. This is a major overhead and contention point and, as you say, exists only to allow REINDEX (and online CREATE INDEX, when it exists) to operate. There is a considerable price to be paid for this flexibility, so although I think that flexibility is important, I do think we need to look at ways of reducing the cost for the 95%+ of the time when we pay the price for no benefit. It seems clear that if we do not, then we are going to have to completely redesign the lock manager, so any way forward from here is a reasonably sized task. Putting extra restrictions on REINDEX-like operations would be acceptable in these circumstances. Now follow various ideas, many of which are quite possibly bogus, but the main point is that we need a way, even if it isn't one of these. One such idea might be to require that they occur outside of a transaction block, just like VACUUM.

Further thoughts:

1. Normally, we do not lock indexes via the LockMgrLock.
2. When a REINDEX-like operation comes along, it first of all sets a MaintIntentLock flag on the index relation, which causes a relcache invalidation. It then waits until all backends have updated their relcache.
3. While the MaintIntentLock is set, all operations acquire and release LockMgrLocks on the index.
4. The REINDEX-like operation proceeds, having accepted a possibly extensive delay in acquiring the MaintIntentLock, in the name of concurrent access.
5. When the REINDEX completes, we unset the MaintIntentLock and other backends return to normal operation.

Best Regards, Simon Riggs ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Reducing contention for the LockMgrLock
On Wed, 2005-12-07 at 22:46 -0500, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Is hashtable overhead all that large? Each table could be made initially size-of-current-table/N entries. One problem is that currently the memory freed from a hashtable is not put back into shmem freespace, is it? Yeah; the problem is mainly that we'd have to allocate extra space to allow for unevenness of usage across the multiple hashtables. It's hard to judge how large the effect would be without testing, but I think that this problem would inhibit us from having dozens or hundreds of separate partitions.

The imbalance across partitions would be a major issue because of the difficulty of selecting a well-distributed partitioning key. If you use the LOCKTAG, then operations on the heaviest-hit tables would go to the same partitions continually for LockRelation requests. The frequency of access per table usually drops off dramatically in rank order: look at TPC-B (pgbench) and TPC-C; my own experience would be that you seldom have even as many as 16 heavily hit tables. My guess would be that doing all of that would do little more than reduce contention to ~50%, and that this would show quickly diminishing returns for N > 4. Also, the more sharply defined your application profile, the worse this effect will be.

Having said that, I think this *is* the best way forward *if* we continue to accept the barrage of lock requests. So I've reopened the debate on the earlier thread: [HACKERS] Reducing relation locking overhead and am reviewing thoughts/requirements on that thread to avoid the necessity of altering the lock manager in this way. pgbench is the right workload to expose this effect and measure worst-case contention, so at least performance gains are easy to measure. A possible response is to try to improve dynahash.c to make its memory management more flexible, but I'd prefer not to get into that unless it becomes really necessary.
A shared freespace pool would create a contention bottleneck of its own... ...but a less frequently accessed one. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Reducing contention for the LockMgrLock
On Wed, 2005-12-07 at 22:53 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: My view would be that the LockMgrLock is not relevant for all workloads, but I want even more to be able to discuss whether it is, or is not, on an accepted basis before discussions begin. Certainly. I showed the evidence ... The output you gave wasn't anything I recognize in the code. Assuming it's not already there, please can you share the code you are using to find the evidence, even if it's just privately in some form? You're looking at the number of spins to acquire each lock? Or some other measure of wait time on a lock? I want to be in a position to run tests and then share the output with the project in an agreed form, then quickly move to action. You're right to put the burden of proof onto test results; I want to agree the measurements before we test. Manfred's earlier patch provides very clear output for observing contention, including full summaries. Could we commit that, updated with the wait info, so we can all use it for analysis? Best Regards, Simon Riggs ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Reducing relation locking overhead
On Thu, 2005-12-08 at 01:08, Jim C. Nasby wrote: On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote: On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote: On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: Tom Lane [EMAIL PROTECTED] writes: What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long. I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read. What about keeping a separate list of new tuples? Obviously we'd only do this when an index was being built on a table. The problem with a separate list is that it can be huge. For example, on a table with 200 inserts/updates per second, an index build lasting 6 hours would accumulate a total of 6*3600*200 = 4,320,000 new tuples. Sure, but it's unlikely that such a table would be very wide, so 4.3M tuples would probably only amount to a few hundred MB of data. It's also possible that this list could be vacuumed by whatever the regular vacuum process is for the table. I think that keeping such a list as part of the table at a well-defined location (like pages from N to M) is the best strategy, as it will automatically make all new tuples available to parallel processes and avoids both duplicate storage and the need for changing insert/update code. --- Hannu ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: Concurrent CREATE INDEX, try 2 (was Re: [HACKERS] Reducing
On Wed, 2005-12-07 at 19:36, Greg Stark wrote: [snip] We periodically ran into problems with load spikes or other performance problems causing things to get very slow and stay slow for a while. Letting things settle out usually worked but occasionally we had to restart the whole system to clear out the queue of requests. Just as a personal opinion: I would love a REINDEX which does not block reads/writes, even if writes will be more expensive while it's running. There's always a period of time I can schedule the REINDEX so there's very low write activity, but it is impossible to find a time slot when there's NO write activity... and I think most real-world applications are like this. I think it's very rare that an application is constantly under high load, but most of them are constantly getting SOME important activity which makes downtime hard to schedule. Now if the slowdown of writes is not more than the acceptable service level, then it is a very workable solution to schedule the REINDEX in a not-so-busy time slot. Cheers, Csaba. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] Inherited Constraints
On Wed, 2005-12-07 at 21:24 +, Simon Riggs wrote: Following patch implements record of whether a constraint is inherited or not, and prevents dropping of inherited constraints. Patch posted to -patches list. What it doesn't do: It doesn't yet prevent dropping the parent constraint, which is wrong, clearly, but what to do about it? 1. make dropping a constraint drop all constraints dependent upon it (without any explicit cascade) 2. add a new clause to ALTER TABLE DROP CONSTRAINT CASCADE I prefer (1), since it is SQL Standard compliant, easier to remember and automatic de-inheritance is the natural opposite of the automatic inheritance process. Comments, please -hackers? Which implementation should I pick (or another)? Further patch will utilise this new knowledge to reduce the number of tests made during constraint_exclusion. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] HOOKS for Synchronous Replication?
I can only add that the patched code did not build on Windows; I contacted the author about that and never got an answer back. Regards paolo Tom Lane [EMAIL PROTECTED] wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Anyone remember this patch? http://gorda.di.uminho.pt/community/pgsqlhooks/ The discussion seems to be pretty minimal: http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php Does anyone see a need to investigate it further? I had hoped to see some comments from the Slony people about it. I'd feel better about the validity of a set of hooks if more than one project agreed that it was useful/appropriate ... regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] forced to restart postgresql service yesterday
Yesterday one of our clients called up and complained about lousy performance and being unable to log in to our postgresql 8.0-backed ERP running on windows 2000 server. The server had been running for several months without being restarted or rebooted. The login was hanging in a simple plpgsql login script which basically did an insert/update on a small table. It would hang when called from within psql, and once hung the query would not respond to cancel requests (such as they are implemented on win32). Investigating further, trying to select from this small table at all would also hang. Unfortunately, this table tells me which pids are safe to kill and which are not, so I had no choice but to do an emergency restart of the postgresql service, which went without complaint, and everything worked normally afterwards, with nothing extraordinary in the server log. Not ruling out an obscure win32 problem here, but this is fairly untraceable. This is just an FYI type of post. This particular client, with about 50 users, has been running over a year on win32/pg and this is the first time I had to restart the service :(. I am really pushing a move to linux, although there is no reason to believe this will prevent this from happening again. Merlin ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Reducing contention for the LockMgrLock
Simon Riggs [EMAIL PROTECTED] writes: The output you gave wasn't anything I recognize in the code. Assuming it's not already there, please can you share the code you are using to find the evidence, even if it's just privately in some form? See below. Also, the message I previously mentioned shows a different tack on the same theme: http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php although in the light of later events I think that keeping the counts in shared memory like that is a bad idea --- too likely to skew the results. You're looking at the number of spins to acquire each lock? Number of semop waits. Manfred's earlier patch provides very clear output for observing contention, including full summaries. Could we commit that, so we can all use this for analysis? Updated with the wait info. What patch would that be? regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig	Tue Nov 22 16:06:33 2005
--- src/backend/storage/ipc/ipc.c	Tue Nov 29 12:27:13 2005
***************
*** 125,130 ****
--- 125,132 ----
  {
  	elog(DEBUG3, "shmem_exit(%d)", code);

+ 	LWLockStats();
+
  	/*
  	 * call all the registered callbacks.
  	 *
*** src/backend/storage/lmgr/lwlock.c.orig	Tue Dec  6 18:08:33 2005
--- src/backend/storage/lmgr/lwlock.c	Tue Dec  6 18:16:05 2005
***************
*** 21,26 ****
--- 21,28 ----
   */
  #include "postgres.h"

+ #include <unistd.h>
+
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
***************
*** 32,37 ****
--- 34,43 ----
  /* We use the ShmemLock spinlock to protect LWLockAssign */
  extern slock_t *ShmemLock;

+ static int	num_counts;
+ static int *sh_acquire_counts;
+ static int *ex_acquire_counts;
+ static int *block_counts;
+
  typedef struct LWLock
  {
***************
*** 209,214 ****
--- 215,226 ----
  	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
  	LWLockCounter[0] = (int) NumFixedLWLocks;
  	LWLockCounter[1] = numLocks;
+
+ 	/* local counter space */
+ 	num_counts = numLocks;
+ 	sh_acquire_counts = calloc(numLocks, sizeof(int));
+ 	ex_acquire_counts = calloc(numLocks, sizeof(int));
+ 	block_counts = calloc(numLocks, sizeof(int));
  }
***************
*** 257,262 ****
--- 269,278 ----
  	int			extraWaits = 0;

  	PRINT_LWDEBUG(LWLockAcquire, lockid, lock);

+ 	if (mode == LW_EXCLUSIVE)
+ 		ex_acquire_counts[lockid]++;
+ 	else
+ 		sh_acquire_counts[lockid]++;
+
  	/*
  	 * We can't wait if we haven't got a PGPROC. This should only occur
***************
*** 328,333 ****
--- 344,351 ----
  		if (!mustwait)
  			break;				/* got the lock */

+ 		block_counts[lockid]++;
+
  		/*
  		 * Add myself to wait queue.
  		 *
***************
*** 598,601 ****
--- 616,640 ----
  		return true;
  	}
  	return false;
+ }
+
+ void
+ LWLockStats(void)
+ {
+ 	int		pid = getpid();
+ 	int		i;
+
+ 	LWLockAcquire(0, LW_EXCLUSIVE);
+
+ 	for (i = 0; i < num_counts; i++)
+ 	{
+ 		if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i])
+ 		{
+ 			fprintf(stderr, "PID %d lwlock %d: shacq %u exacq %u blk %u\n",
+ 					pid, i, sh_acquire_counts[i], ex_acquire_counts[i],
+ 					block_counts[i]);
+ 		}
+ 	}
+
+ 	LWLockRelease(0);
+ }
---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] forced to restart postgresql service yesterday
Merlin Moncure [EMAIL PROTECTED] writes: The login was hanging in a simple plpgsql login script which basically did an insert/update on a small table. It would hang when called from within psql, and once hung the query would not respond to cancel requests (such as they are implemented on win32). Investigating further, trying to select from this small table at all would also hang. Unfortunately, this table tells me which pids are safe to kill and which are not, Did you look at pg_locks or pg_stat_activity? There is pretty much nothing we can do with this report given the lack of detail. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Vertical Partitioning with TOAST
Jim C. Nasby wrote: On Thu, Dec 08, 2005 at 12:59:47AM -0500, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: This seems like a useful feature to add, allowing for easy built-in vertical partitioning. Are there issues with the patch as-is? Other than the ones mentioned by the poster? It seemed to me more like a not-too-successful experiment than something ready for application. If you take the viewpoint that this is just another TOAST storage strategy, I think it's pretty useless. A large field value is going to get toasted anyway with the regular strategy, and if your column happens to contain some values that are not large, forcing them out-of-line anyway is simply silly. (You could make a case for making the threshold size user-controllable, but I don't see the case for setting the threshold to zero, which is what this amounts to.) Valid point. I do think there's a lot of benefit to being able to set the limit much lower than what it currently defaults to today. We have a client that has a queue-type table that is updated very frequently. One of the fields is text, that is not updated as frequently. Keeping this table vacuumed well enough has proven to be problematic, because any delay to vacuuming quickly results in a very large amount of bloat. Moving that text field into a separate table would most likely be a win. Presumably this would need to be settable on at least a per-table basis. Would adding such a variable be a good beginner TODO, or is it too invasive? Well, we have now: ALTER TABLE ALTER [ COLUMN ] column SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN } What else is needed? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
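For reference, the existing syntax Bruce cites can be used like this; the table and column names below are made up for illustration. SET STORAGE EXTERNAL forces out-of-line, uncompressed storage for the column's toastable values, but it does not change the size threshold at which toasting kicks in, which is the per-table knob the thread is really asking for:

CREATE TABLE job_queue (
    id      serial PRIMARY KEY,
    state   integer NOT NULL,
    payload text
);
-- push the rarely-updated text column toward out-of-line storage so that
-- frequent updates of the other columns rewrite a much smaller main tuple
ALTER TABLE job_queue ALTER COLUMN payload SET STORAGE EXTERNAL;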
Re: [HACKERS] Reducing relation locking overhead
Simon Riggs [EMAIL PROTECTED] writes: Further thoughts: 1. Normally, we do not lock indexes via the LockMgrLock 2. When a REINDEX-like operation comes along, it first of all updates an MaintIntentLock flag on the index relation, which causes a relcache invalidation. It then waits until all backends have updated their relcache. There isn't any way for it to do that (ie, be sure everyone else has adjusted to the new state of affairs), short of acquiring some sort of short-term exclusive lock on the table, which is a really bad idea. The pending lock would block other incoming requests on the table until all the current users exited their transactions. I think the idea of not doing locking on indexes was pretty much shot down in this thread, and we have to go looking for other solutions ... thus my other proposal. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Reducing contention for the LockMgrLock
Simon Riggs [EMAIL PROTECTED] writes: The imbalance across partitions would be a major issue because of the difficulty of selecting a well-distributed partitioning key. If you use the LOCKTAG, then operations on the heaviest hit tables would go to the same partitions continually for LockRelation requests. The frequency of access per table usually drops off dramatically in rank order: look at TPC-B (pgbench) and TPC-C; my own experience would be that you seldom have as many even as 16 heavy hit tables. My guess would be that doing all of that would do little more than reduce contention to ~50%, and that this would show quickly diminishing returns for N 4. Also, the more sharply defined your application profile, the worse this effect will be. That's a fair point, and reinforces my instinct that having a large number of partitions would be a losing game. But you are mistaken to think that the number of hot-spot tables is the only limit on the number of usable partitions. It's the number of their indexes that matters most. (The pgbench test is if anything probably understating the problem, because it has only a single index on each table.) In any case, even a factor of 2 or so reduction in direct conflicts should have a useful impact on the number of semop waits, because it's a nonlinear effect... regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Reducing relation locking overhead
On Thu, 2005-12-08 at 16:05, Tom Lane wrote: [SNIP] There isn't any way for it to do that (ie, be sure everyone else has adjusted to the new state of affairs), short of acquiring some sort of short-term exclusive lock on the table, which is a really bad idea. The pending lock would block other incoming requests on the table until all the current users exited their transactions. But it is an acceptable compromise to lock the table until all current transactions are over... the alternative for reindexing a big table is usually to schedule a downtime, which is even worse... REINDEX is usually used to fix a big table's big index bloat, and that won't fly without a downtime, or, with this short-term full-table lock, in a low-traffic time slot. For my usage patterns I would vote for the table lock if it is just a means of blocking new transactions until the running ones finish. I'll just make sure there are none long-running when I issue the REINDEX... Cheers, Csaba. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] HOOKS for Synchronous Replication?
On Wednesday 07 December 2005 20:24, Tom Lane wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Anyone remember this patch? http://gorda.di.uminho.pt/community/pgsqlhooks/ The discussion seems to be pretty minimal: http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php Does anyone see a need to investigate it further? I had hoped to see some comments from the Slony people about it. I'd feel better about the validity of a set of hooks if more than one project agreed that it was useful/appropriate ... I missed seeing it all together the first time through; I'll see what I can do about taking an in-depth look at it over the next few days and provide some feedback. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match -- Darcy Buskermolen Wavefire Technologies Corp. http://www.wavefire.com ph: 250.717.0200 fx: 250.763.1759 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] About my new work at Command Prompt Inc.
On 12/7/2005 9:37 AM, Devrim GUNDUZ wrote: Hi, I'd like to inform the people who do not read Planet PostgreSQL: Command Prompt Inc. has just hired me for the community work I have been doing so far (like PostgreSQL RPM stuff and other PostgreSQL-related RPMs, such as Slony-I, pgpool, PostGIS, etc.) and website things. That means I'll spend more time on these. Congratulations Devrim, and a big thanks for the work you're doing. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Foreign key trigger timing bug?
On 12/7/2005 4:50 PM, Stephan Szabo wrote: On Wed, 7 Dec 2005, Bruce Momjian wrote: I had an open 8.1 item that was: o fix foreign trigger timing issue Would someone supply text for a TODO entry on this, as I don't think we fixed it in 8.1. I'd split this into two separate items now. Fix before delete triggers on cascaded deletes to run after the cascaded delete is done. This is odd, but seems to be what the spec requires. Ugh, that sounds ugly. One problem I see is, what do we do if the BEFORE trigger then returns NULL (to skip the delete). The cascaded operations are already done. Do we have to execute the cascaded deletes in a subtransaction or do we disallow the skip in this case? Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] forced to restart postgresql service yesterday
Subject: Re: [HACKERS] forced to restart postgresql service yesterday Merlin Moncure [EMAIL PROTECTED] writes: The login was hanging in a simple plpgsql login script which basically did an insert/update on a small table. It would hang when called from within psql, and once hung the query would not respond to cancel requests (such as they are implemented on win32). Investigating further, trying to select from this small table at all would also hang. Unfortunately, this table tells me which pids are safe to kill and which are not, Did you look at pg_locks or pg_stat_activity? There is pretty much nothing we can do with this report given the lack of detail. Understood; I was in a big hurry to get the server back up. pg_stat_activity worked ok... there were a lot of hung processes, and it's possible pg was over the connection limit, although pretty much everything logs in as superuser. There was also a ton of stuff in pg_locks, but it was hard to determine anything useful because my app makes a lot of use of userlocks and I don't get the benefit of the revamped 8.1 pg_locks view. In any case, while waiting on a lock, win32 pg will respond to query cancel, but the server wouldn't while trying to do anything with this particular table. It was like a process-sucking black hole. Also, cpu load was 0, as was disk. Merlin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [PATCHES] Inherited Constraints
On Thu, 2005-12-08 at 11:10, Simon Riggs wrote: On Wed, 2005-12-07 at 21:24 +, Simon Riggs wrote: Following patch implements record of whether a constraint is inherited or not, and prevents dropping of inherited constraints. Patch posted to -patches list. What it doesn't do: It doesn't yet prevent dropping the parent constraint, which is wrong, clearly, but what to do about it? 1. make dropping a constraint drop all constraints dependent upon it (without any explicit cascade) 2. add a new clause to ALTER TABLE DROP CONSTRAINT CASCADE I prefer (1), since it is SQL Standard compliant, easier to remember and automatic de-inheritance is the natural opposite of the automatic inheritance process. Comments, please -hackers?

It would be logical to do the same as DROP TABLE does, i.e. (2).

hannu=# create table parent(i int);
CREATE TABLE
hannu=# create table child() inherits(parent);
CREATE TABLE
hannu=# drop table parent;
NOTICE: table "child" depends on table "parent"
ERROR: cannot drop table "parent" because other objects depend on it
HINT: Use DROP ... CASCADE to drop the dependent objects too.
hannu=#

Maybe there should be another option in addition to CASCADE, say DISINHERIT, which leaves all child constraints as heads of new inheritance hierarchies. DROP CASCADE + ADD BACK ALL CHILD CONSTRAINTS may be prohibitively expensive for biggish tables. Another nice (but no doubt more complex) thing would be the ability to add multiple constraints at once, needing only one seqscan to check for compliance with the added constraints, and/or making constraint checks smarter, so that for example ADD CONSTRAINT CHECK (i > 0) could make use of an index on i instead of doing a seqscan. Or if there is a constraint CHECK (i > 0), then adding another like CHECK (i > -1) would not need to check actual data either. Which implementation should I pick (or another)? Further patch will utilise this new knowledge to reduce the number of tests made during constraint_exclusion.
So will hierarchical inheritance be the thing to do to take advantage of it, then?

year
 +- month1
 |   +- day1
 |   +- day2
 |   ...
 |   \- day31
 +- month2
etc.

btw, will your DROP patch support multiple inheritance? --- Hannu ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Reducing contention for the LockMgrLock
Tom Lane [EMAIL PROTECTED] writes: That's a fair point, and reinforces my instinct that having a large number of partitions would be a losing game. But you are mistaken to think that the number of hot-spot tables is the only limit on the number of usable partitions. It's the number of their indexes that matters most. Hm, so hypothetically an insert or update on a table with 4 indexes which have been split into 4 partitions would need to touch each partition? Would that defeat the benefits of the partitioning? Or enhance it? Would it be better to ensure that the indexes of a single table ended up in the same partition? Or to ensure they're spread across partitions? -- greg ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] generalizing the planner knobs
BTW, there's another end to the 'enable_seqscan=false' problem... it sometimes doesn't work! Last I looked, enable_seqscan=false only added a fixed overhead cost to a seqscan (100 IIRC). The problem is, some queries will produce estimates for other methods that are more expensive than a seqscan even with the added burden. If enable_seqscan=false multiplied the cost by some amount instead of adding a fixed one, this would probably become impossible. (And before someone asks, no, I don't remember which query was actually faster...) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 I have often considered that this is an indication that seq scan is actually the better plan... although I have to admit it is a little confusing that enable_seqscan = false actually lets you use a seqscan if the other plans are bad enough. -- regards, Jaime Casanova (DBA: DataBase Aniquilator ;) ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Reducing contention for the LockMgrLock
Greg Stark [EMAIL PROTECTED] writes: Hm, so hypothetically an insert or update on a table with 4 indexes which have been split into 4 partitions would need to touch each partition? That would be the best case, actually, that each heavily-used lock ends up in a different partition. As Simon points out, we have no way to guarantee that. Would that defeat the benefits of the partitioning? Or enhance it? It'd be what you'd want, because it would reduce the odds that two processes doing this concurrently would need to touch the same partition at the same time. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PATCHES] Inherited Constraints
Another nice (but no doubt more complex) thing would be the ability to add multiple constraints at once, needing only one seqscan to check for compliance with the added constraints, and/or making constraint checks smarter, so that for example ADD CONSTRAINT CHECK (i > 0) could make use of an index on i instead of doing a seqscan. Or if there is a constraint CHECK (i > 0), then adding another like CHECK (i > -1) would not need to check the actual data either. Check out the comma in ALTER TABLE: ALTER TABLE tab ADD COLUMN serial NOT NULL UNIQUE, ADD CHECK (foo > 24), ADD CHECK (baz > 18), ADD COLUMN integer NOT NULL DEFAULT 32; Table tab (and each of the tables that inherits from it) is scanned and rewritten once. I believe this was added for 8.0. -- ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] bgwriter leaks resources after errors
I think that bug #2099 http://archives.postgresql.org/pgsql-bugs/2005-12/msg00075.php is probably explained by the observation that the bgwriter fails to call AtEOXact_Files() after an error; therefore, if it gets an error while holding an open DIR pointer, the DIR will never get closed. (There are calls to AllocateDir in the checkpoint code path, eg SimpleLruTruncate.) But while looking at this I realized we have worse problems: there's no mechanism for releasing buffer pins after an error in the bgwriter. So, for example, a write error on a dirty buffer will cause bgwriter to permanently leak a buffer pin. Even if the write problem is later rectified, the buffer will remain pinned indefinitely. Quite aside from the consequences of resource-leakage, this could make it impossible to complete subsequent VACUUMs on the affected relation. That raises it to a must fix in my mind. The bug exists in 8.0 and later. The cleanest solution seems to be to make the bgwriter create and use a ResourceOwner, and probably get rid of the special hack in resowner.c that allows CurrentResourceOwner to be NULL. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
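Tom's proposed fix can be modeled abstractly. The following is a minimal sketch, not PostgreSQL's actual resowner.c: a resource owner records every buffer pin it takes, so that a single cleanup routine on the error path can release them all, which is exactly the guarantee the bgwriter currently lacks. All names here are invented.

```c
/* Minimal model (not PostgreSQL's resowner.c) of pin tracking: the
 * owner remembers each pinned buffer so error recovery can drop all
 * pins instead of leaking them forever. */
#define MAX_PINS 16
#define NBUFFERS 64

typedef struct
{
    int pins[MAX_PINS];
    int npins;
} ResourceOwnerSketch;

int pin_count[NBUFFERS];   /* stand-in for shared buffer headers */

void owner_pin_buffer(ResourceOwnerSketch *owner, int buf)
{
    pin_count[buf]++;
    owner->pins[owner->npins++] = buf;  /* remember it for cleanup */
}

/* Error path: drop every pin this owner still holds. */
void owner_release_all(ResourceOwnerSketch *owner)
{
    while (owner->npins > 0)
        pin_count[owner->pins[--owner->npins]]--;
}
```

Without the owner's record, a write error between pin and unpin leaves `pin_count` elevated with nobody responsible for decrementing it, which is the leak described above.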
Re: [HACKERS] generalizing the planner knobs
Jaime Casanova [EMAIL PROTECTED] writes: BTW, there's another end to the 'enable_seqscan=false' problem... it sometimes doesn't work! I have often considered that this is an indication that seq scan is actually the better plan... There are cases where it is the *only* plan, eg, you have no relevant indexes. I am not sure that applies to Jim's complaint though. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Reducing contention for the LockMgrLock
On Thu, 2005-12-08 at 10:26 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: The imbalance across partitions would be a major issue because of the difficulty of selecting a well-distributed partitioning key. If you use the LOCKTAG, then operations on the heaviest hit tables would go to the same partitions continually for LockRelation requests. The frequency of access per table usually drops off dramatically in rank order: look at TPC-B (pgbench) and TPC-C; my own experience would be that you seldom have as many even as 16 heavy hit tables. My guess would be that doing all of that would do little more than reduce contention to ~50%, and that this would show quickly diminishing returns for N > 4. Also, the more sharply defined your application profile, the worse this effect will be. That's a fair point, and reinforces my instinct that having a large number of partitions would be a losing game. But you are mistaken to think that the number of hot-spot tables is the only limit on the number of usable partitions. It's the number of their indexes that matters most. (The pgbench test is if anything probably understating the problem, because it has only a single index on each table.) True. So either 16 partitions, or maybe 8, does sound about right then. In any case, even a factor of 2 or so reduction in direct conflicts should have a useful impact on the number of semop waits, because it's a nonlinear effect... Nonlinear effects work both ways. Factor of 2 is great, but not enough to prevent this discussion becoming an annual event ;-) (Thinks: Oh, it already is...) Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
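The partitioning scheme under discussion can be illustrated with a toy hash. The mixing constants and names below are invented, not PostgreSQL's real lock-partition code; the point is only that a fixed tag-to-partition mapping means a few hot tables keep hammering the same few partitions no matter how many partitions exist.

```c
/* Toy illustration of partitioning the lock table by LOCKTAG: each
 * (database, relation) tag deterministically maps to one of
 * NUM_LOCK_PARTITIONS partitions.  Hash constants are made up. */
#define NUM_LOCK_PARTITIONS 16

unsigned locktag_partition(unsigned dboid, unsigned reloid)
{
    unsigned h = dboid * 2654435761u ^ reloid * 2246822519u;
    return h % NUM_LOCK_PARTITIONS;
}
```

Because the mapping is fixed, a workload with four heavy-hit tables touches at most four partitions, however many the lock manager provides — which is Simon's diminishing-returns argument in miniature.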
Re: [HACKERS] Reducing relation locking overhead
On Thu, 2005-12-08 at 16:23 +0100, Csaba Nagy wrote: On Thu, 2005-12-08 at 16:05, Tom Lane wrote: [SNIP] There isn't any way for it to do that (ie, be sure everyone else has adjusted to the new state of affairs), short of acquiring some sort of short-term exclusive lock on the table, which is a really bad idea. The pending lock would block other incoming requests on the table until all the current users exited their transactions. It might be possible to do this with some new form of utility lock that can be unlocked before end of transaction, then re-locked again later. That would imply the use of optimistic locking, as already discussed. Only DDL and VACUUMs *need* to be locked out during the REINDEX. I'm hand-waving here, so I'll stop. But we do know there *is* a way, because this is already implemented elsewhere, somehow. But it is an acceptable compromise to lock the table until all current transactions are over... the alternative for reindexing a big table is usually to schedule a down-time, which is even worse... REINDEX is usually used to fix a big table's big index bloat, and that won't fly without a downtime, or, with this short-term full-table lock, in a low-traffic time-slot. For my usage patterns I would vote for the table lock if it is just a means of blocking new transactions until the running ones finish. I'll just make sure there are none long running when I issue the REINDEX... Certainly that is my view. You only schedule these things when you have to, picking as light a traffic period as you can. Most people would accept some pretty hefty restrictions in order to get it to work. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Replication on the backend
Are you sure that there is no way to implement a generic approach in the backend? What does the specification say? Does Oracle 10g have a core implementation of replication (cluster)? Gustavo. 2005/12/7, Andrew Sullivan [EMAIL PROTECTED]: On Tue, Dec 06, 2005 at 12:35:43AM -0500, Jan Wieck wrote: We do not plan to implement replication inside the backend. Replication needs are so diverse that pluggable replication support makes a lot more sense. To me it even makes more sense than keeping transaction support outside of the database itself and adding it via a pluggable storage add-on. And, as I say every single time this comes up, Oracle's and IBM's and MS's and everybody else's replication systems are _also_ add-ons. If you don't believe me, look at the license costs. You can get a system without it enabled, which means (by definition) it's a modular extension. A -- Andrew Sullivan | [EMAIL PROTECTED] In the future this spectacle of the middle classes shocking the avant-garde will probably become the textbook definition of Postmodernism. --Brad Holland ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] Improving free space usage (was: Reducing relation locking overhead)
On Thu, Dec 08, 2005 at 11:58:50AM +0200, Hannu Krosing wrote: On Thu, 2005-12-08 at 01:08, Jim C. Nasby wrote: On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote: On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote: On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: Tom Lane [EMAIL PROTECTED] writes: What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long. I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read. What about keeping a separate list of new tuples? Obviously we'd only do this when an index was being built on a table. The problem with a separate list is that it can be huge. For example, on a table with 200 inserts/updates per second, an index build lasting 6 hours would accumulate a total of 6*3600*200 = 4,320,000 new tuples. Sure, but it's unlikely that such a table would be very wide, so 4.3M tuples would probably only amount to a few hundred MB of data. It's also possible that this list could be vacuumed by whatever the regular vacuum process is for the table. I think that keeping such a list as part of the table at a well-defined location (like pages from N to M) is the best strategy, as it will automatically make all new tuples available to parallel processes and avoids both duplicate storage as well as the need for changing insert/update code. There's one thing I hate about that idea though: good luck trying to move those tuples somewhere else after the index build is done and you now want to shrink the table back down to a more normal size. 
If we had a better way to do that it would be much more palatable, but right now on a heavily updated table this would result in a lot of bloat. Along those lines, I've wondered if it makes sense to add more flexibility in how free space is reclaimed in a table. One obvious possibility is to have a strategy where new tuples will always look to the FSM for space (instead of going into the current page if possible), and the FSM will always hand out the earliest page in the table it has. This mode would have the effect of moving tuples towards the front of the table, allowing for space reclamation. A variation might be that this mode will not affect tuples that are generated as part of an UPDATE and are in the first x% of the table, since it doesn't make sense to move a tuple from page 2 to page 1 in a 1000 page table. Another possibility is to always go to the FSM and to have the FSM hand back the page that is closest to the new tuple according to a certain index. This would allow for ALTER TABLE CLUSTER to be much more self-maintaining. The downside is that you'd have to do a lookup on that index, but presumably if the index is important enough to cluster on then it should be well-cached. There are probably some other tweaks that could be done as well to make this more performant. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
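Jim's first proposed mode has a very simple core. Here is a minimal sketch under invented names (the real FSM code is organized quite differently): the allocator always returns the lowest-numbered page with enough free space, which is what pushes tuples toward the front of the table.

```c
/* Sketch of the proposed "front of the table first" FSM mode: instead
 * of handing out any page with room, always return the lowest-numbered
 * page that can hold the tuple, so new tuples drift toward the front
 * and trailing pages can eventually be truncated. */
#define FSM_NPAGES 8

int fsm_free_bytes[FSM_NPAGES];   /* free space tracked per page */

int fsm_earliest_page(int needed)
{
    for (int page = 0; page < FSM_NPAGES; page++)
        if (fsm_free_bytes[page] >= needed)
            return page;          /* earliest qualifying page wins */
    return -1;                    /* no room anywhere: extend the relation */
}
```

The trade-off Jim notes applies directly: a strict earliest-page policy would happily move a tuple from page 2 to page 1, hence his suggested exemption for tuples already in the first x% of the table.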
Re: [HACKERS] Vertical Partitioning with TOAST
On 12/8/2005 1:42 PM, Jim C. Nasby wrote: On Thu, Dec 08, 2005 at 10:03:43AM -0500, Bruce Momjian wrote: Jim C. Nasby wrote: On Thu, Dec 08, 2005 at 12:59:47AM -0500, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: This seems like a useful feature to add, allowing for easy built-in vertical partitioning. Are there issues with the patch as-is? Other than the ones mentioned by the poster? It seemed to me more like a not-too-successful experiment than something ready for application. If you take the viewpoint that this is just another TOAST storage strategy, I think it's pretty useless. A large field value is going to get toasted anyway with the regular strategy, and if your column happens to contain some values that are not large, forcing them out-of-line anyway is simply silly. (You could make a case for making the threshold size user-controllable, but I don't see the case for setting the threshold to zero, which is what this amounts to.) Valid point. I do think there's a lot of benefit to being able to set the limit much lower than what it currently defaults to today. We have a client that has a queue-type table that is updated very frequently. One of the fields is text, that is not updated as frequently. Keeping this table vacuumed well enough has proven to be problematic, because any delay to vacuuming quickly results in a very large amount of bloat. Moving that text field into a separate table would most likely be a win. Presumably this would need to be settable on at least a per-table basis. Would adding such a variable be a good beginner TODO, or is it too invasive? Well, we have now: ALTER TABLE ALTER [ COLUMN ] column SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN } What else is needed? As Tom suggested, I think it would be best to be able to change the size at which a field gets stored externally. I think it also makes sense to have this reverse the normal order of compress first, then store externally if it still doesn't fit. 
I foresee this typically being useful when you have fields that are between ~100 and 1000 bytes in size, and I'm doubtful that compression would do much good there. But I wouldn't rule out this being useful on fields that can also sometimes contain much larger amounts of data, so I don't think it makes sense to disable compression completely. So, I think this leaves two new options: It's not the size of a field that triggers toasting. It is the size of the entire tuple. As long as that is > BLCKSZ/4, the toaster will pick the currently largest inline value and do something with it. "Something" is either compressing or (if not allowed or already done) moving external. Jan SET STORAGE EXTERNAL [THRESHOLD x] If a field is over x in size, it's stored externally. SET STORAGE EXTENDED [THRESHOLD x] If a field is over x in size, it's stored externally. If it's over BLCKSZ/4 it will also be compressed (I think that's how things work now). Actually, that's rather ugly. I think it would be better to just break external storage and compression out into their own attributes: SET STORAGE [ ALLOW EXTERNAL [THRESHOLD x] ] [ ALLOW COMPRESSION [THRESHOLD x] ] ALLOW EXTERNAL: if a field is larger than x bytes (BLCKSZ/4 by default) then it will be stored externally. May be specified along with ALLOW COMPRESSION. ALLOW COMPRESSION: if a field is larger than x bytes (BLCKSZ/4 by default) then it will be compressed. May be specified along with ALLOW EXTERNAL. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
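Jan's description of the tuple-level trigger can be modeled in a few lines. This is a rough sketch with invented names and a fake 2:1 compression ratio, not the real toaster code: nothing happens per-field; only while the whole tuple exceeds BLCKSZ/4 does the toaster shrink the largest inline value, first by compression, then by moving it out of line.

```c
/* Rough model (not PostgreSQL's tuptoaster) of tuple-size-driven
 * toasting: shrink the largest inline attribute until the tuple fits
 * under BLCKSZ/4. */
#define BLCKSZ 8192
#define TOAST_TARGET (BLCKSZ / 4)
#define NATTS 3

typedef struct
{
    int size;
    int compressed;
    int external;
} ToastAttr;

int toast_tuple_size(ToastAttr *atts)
{
    int total = 0;
    for (int i = 0; i < NATTS; i++)
        if (!atts[i].external)
            total += atts[i].size;
    return total;
}

void toast_tuple(ToastAttr *atts)
{
    while (toast_tuple_size(atts) > TOAST_TARGET)
    {
        int big = -1;
        for (int i = 0; i < NATTS; i++)     /* find largest inline value */
            if (!atts[i].external && (big < 0 || atts[i].size > atts[big].size))
                big = i;
        if (big < 0)
            break;                          /* nothing left inline */
        if (!atts[big].compressed)
        {
            atts[big].size /= 2;            /* pretend 2:1 compression */
            atts[big].compressed = 1;
        }
        else
            atts[big].external = 1;         /* already compressed: move out */
    }
}
```

This also shows why a per-field threshold is a separate knob: a mid-sized field never gets touched at all as long as the tuple as a whole stays under the target.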
Re: [HACKERS] Replication on the backend
What is the point of these questions? If you have a concrete, practical proposal to make, please do so. Otherwise, you have already got the answer from the people who are actually working on replication and understand it far beyond abstract considerations. If you think there is a good reason to do replication directly in the backend code rather than as an addon, possibly using an agreed API, then you need to provide hard evidence, not mere assertion or conjecture. cheers andrew Gustavo Tonini wrote: Are you sure that there is no way to implement a generic approach in the backend? What does the specification say? Does Oracle 10g have a core implementation of replication (cluster)? Gustavo. 2005/12/7, Andrew Sullivan [EMAIL PROTECTED]: On Tue, Dec 06, 2005 at 12:35:43AM -0500, Jan Wieck wrote: We do not plan to implement replication inside the backend. Replication needs are so diverse that pluggable replication support makes a lot more sense. To me it even makes more sense than keeping transaction support outside of the database itself and add it via pluggable storage add-on. And, as I say every single time this comes up, Oracle's and IBM's and MS's and everybody else's replication systems are _also_ add ons. If you don't believe me, look at the license costs. You can get a system without it enabled, which means (by definition) it's a modular extension. A -- Andrew Sullivan | [EMAIL PROTECTED] In the future this spectacle of the middle classes shocking the avant-garde will probably become the textbook definition of Postmodernism. --Brad Holland ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] HOOKS for Synchronous Replication?
On Thu, Dec 08, 2005 at 08:33:59AM -0800, Darcy Buskermolen wrote: On Wednesday 07 December 2005 20:24, Tom Lane wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Anyone remember this patch? http://gorda.di.uminho.pt/community/pgsqlhooks/ The discussion seems to be pretty minimal: http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php Does anyone see a need to investigate it further? I had hoped to see some comments from the Slony people about it. I'd feel better about the validity of a set of hooks if more than one project agreed that it was useful/appropriate ... I missed seeing it all together the first time through, I'll see what I can do about taking an in-depth look at it over the next few days and provide some feedback. While this code might be useful, wouldn't it be much more valuable to provide hooks into xlog so that we could do non-trigger-based replication? (As well as non-trigger-based materialized views...) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] bgwriter leaks resources after errors
Tom Lane [EMAIL PROTECTED] wrote The cleanest solution seems to be to make the bgwriter create and use a ResourceOwner, and probably get rid of the special hack in resowner.c that allows CurrentResourceOwner to be NULL. Is it true that at every moment the bgwriter has just one buffer pinned? Then remembering that buffer would be enough. This could avoid using ResourceOwner but it may need some hack into pin/unpin. Regards, Qingqing ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] generalizing the planner knobs
On Thu, Dec 08, 2005 at 01:07:10PM -0500, Tom Lane wrote: Jaime Casanova [EMAIL PROTECTED] writes: BTW, there's another end to the 'enable_seqscan=false' problem... it sometimes doesn't work! I have often considered that this is an indication that seq scan is actually the better plan... There are cases where it is the *only* plan, eg, you have no relevant indexes. I am not sure that applies to Jim's complaint though. IIRC I ran into this when I was working on generating some numbers about how well a high correlation improves the performance of an index scan (since afaict the cost estimator for index scan is rather broken :( ) In that case, I had defined an index on a ~120M row table on a column with a very low correlation. It's pretty much a given that a seqscan and sort would be faster than the index scan, but it would have still been good to be able to verify that. Because of how enable_seqscan works, I couldn't. BTW, http://archives.postgresql.org/pgsql-performance/2005-04/msg00669.php is where I first mentioned this, including the cost function that I think is broken. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Replication on the backend
On 12/8/2005 1:28 PM, Gustavo Tonini wrote: Are you sure that no way to implement a generic aproach on the backend? What You mean generic as in a replication system that can do asynchronous master-slave, asynchronous multimaster with conflict resolution based on timestamps, system priority or user defined resolution stubs, can do synchronous predicate locking but also does support thousands of asynchronous, partial replica (salesman on the road), and last but not least can run as a synchronous cluster in a LAN segment. All the above can of course be mixed together ... like a central cluster of 8 load balanced, fault tolerant systems, with async multimaster replica in all external branch servers, partial multimaster replica for the road warriors and some slave leaf nodes for reporting. If you can present a prototype for the above, I am sure that we will change our opinion and finally settle for one, builtin replication system. does specification say? Does Oracle 10g have a core implementation of replication (cluster)? What specification? Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] HOOKS for Synchronous Replication?
On 12/7/2005 11:24 PM, Tom Lane wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Anyone remember this patch? http://gorda.di.uminho.pt/community/pgsqlhooks/ The discussion seems to be pretty minimal: http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php Does anyone see a need to investigate it further? I had hoped to see some comments from the Slony people about it. I'd feel better about the validity of a set of hooks if more than one project agreed that it was useful/appropriate ... I am going to look into this in the next couple of days. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] HOOKS for Synchronous Replication?
On 12/8/05, Jim C. Nasby [EMAIL PROTECTED] wrote: While this code might be useful, wouldn't it be much more valuable to provide hooks into xlog so that we could do non-trigger-based replication? (As well as non-trigger-based materialized views...) If we're going to do hooks for replication, I think we should look at including xlog. Trigger-based replication is an OLTP performance killer if you're just looking to do async replication.
Re: [HACKERS] HOOKS for Synchronous Replication?
A general purpose log miner is also useful in many other areas besides replication. From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jonah H. Harris Sent: Thursday, December 08, 2005 11:32 AM To: Jim C. Nasby Cc: Darcy Buskermolen; pgsql-hackers@postgresql.org; Tom Lane; Christopher Kings-Lynne Subject: Re: [HACKERS] HOOKS for Synchronous Replication? On 12/8/05, Jim C. Nasby [EMAIL PROTECTED] wrote: While this code might be useful, whouldn't it be much more valuable to provide hooks into xlog so that we could do non-trigger-based replication? (As well as non-trigger-based materialized views...) If we're going to do hooks for replication, I think we should look at including xlog. Trigger-based replication is an OLTP performance killer if you're just looking to do async replication.
Re: [HACKERS] HOOKS for Synchronous Replication?
On 12/8/2005 2:05 PM, Jim C. Nasby wrote: On Thu, Dec 08, 2005 at 08:33:59AM -0800, Darcy Buskermolen wrote: On Wednesday 07 December 2005 20:24, Tom Lane wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Anyone remember this patch? http://gorda.di.uminho.pt/community/pgsqlhooks/ The discussion seems to be pretty minimal: http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php Does anyone see a need to investigate it further? I had hoped to see some comments from the Slony people about it. I'd feel better about the validity of a set of hooks if more than one project agreed that it was useful/appropriate ... I missed seeing it all together the first time through, I'll see what I can do about taking an in-depth look at it over the next few days and provide some feedback. While this code might be useful, wouldn't it be much more valuable to provide hooks into xlog so that we could do non-trigger-based replication? (As well as non-trigger-based materialized views...) I don't see why these would be mutually exclusive. A generic API needs to have them all. Without having looked at the patch yet, what I expect from an API is that the backend will (after initialization and becoming a member of a database) check if this database is replicated. If so, load the specific shared object that implements the backend part of the replication system and then call an init() function in that. This init() function then will add callbacks to all the hooks where this particular replication system wants to be called. So one replication system might want to be called on commit, just before writing the WAL record, some other system doesn't care about that, but wants to see the WAL record after it was written. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 6: explain analyze is your friend
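What Jan sketches in prose maps naturally onto a callback-table pattern. The code below is purely illustrative: none of these hook names or functions exist in PostgreSQL, and a real API would need locking, error handling, and per-database state. A replication module's init() attaches callbacks only to the hook points it cares about.

```c
#include <stddef.h>

/* Illustrative hook-registration sketch; all names are invented. */
typedef void (*hook_fn)(void *arg);

enum { HOOK_PRE_COMMIT, HOOK_WAL_WRITTEN, NUM_HOOK_POINTS };

#define MAX_CALLBACKS 4
hook_fn hook_table[NUM_HOOK_POINTS][MAX_CALLBACKS];
int hook_counts[NUM_HOOK_POINTS];

void register_hook(int point, hook_fn fn)
{
    hook_table[point][hook_counts[point]++] = fn;
}

/* The backend would call this at each hook point. */
void fire_hooks(int point, void *arg)
{
    for (int i = 0; i < hook_counts[point]; i++)
        hook_table[point][i](arg);
}

/* A trigger-style system might only want commits; a log-shipping one
 * might only want to see WAL records after they are written. */
int commits_seen = 0;
void my_on_commit(void *arg) { (void) arg; commits_seen++; }

void replication_init(void)
{
    register_hook(HOOK_PRE_COMMIT, my_on_commit);
}
```

The key property Jan describes falls out directly: hook points a module never registered for cost it nothing, so trigger-based and xlog-based systems can coexist behind one API.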
Re: [HACKERS] Reducing relation locking overhead
Simon Riggs [EMAIL PROTECTED] writes: I'm hand-waving here, so I'll stop. But we do know there *is* a way, because this is already implemented elsewhere, somehow. That's not really the point --- the question is whether the cure is worse than the disease. It's entirely possible that the tradeoffs needed to support fully concurrent REINDEX would represent a higher price than the feature is worth, or that it's simply too much work to get there from here. For instance, I would imagine that the way Oracle does this relies on their use of rollback segments, which is something we're certainly unlikely to emulate. Given the discussion so far, it seems likely to me that completely concurrent REINDEX is indeed out of reach, and that what we ought to be thinking about is what sort of compromise design (ie, partially concurrent REINDEX) is reasonable. Something that might work is: 1. Take ShareUpdateExclusiveLock (this blocks VACUUM and DDL changes), then run existing CREATE INDEX code. The finished index may be missing some tuples inserted during the run. 2. Commit transaction so that index becomes visible (we assume it's marked so that the planner will know not to rely on it). Continue to hold ShareUpdateExclusiveLock so VACUUM doesn't run. 3. Attempt to acquire ShareLock (possibly a ConditionalLockAcquire/sleep loop instead of just flat-out LockAcquire). Once we have this we know there are no active writer transactions. Release the lock immediately. 4. Make a new scan of the table and insert any rows not already present in the index. (This need not process anything inserted later than step 3, because any new transactions will know to insert rows in the index anyway.) 5. Mark index good and commit, releasing all locks. I don't think that it's worth the effort and complexity to try to avoid a full table scan in step 4. At best you would save much less than 50% of the total work, and the impact on normal operations is not free. 
If what you want is a REINDEX rather than creating an independent new index, then at step 5 you need to do a swap operation which'll require obtaining exclusive lock on the index. This creates another opportunity for deadlock failures, but again a conditional loop might help. There are still some issues about the behavior when the index is UNIQUE. Ideally you would like the CREATE INDEX to fail on a duplicate, not any concurrent writer transaction, but I don't think it's possible to guarantee that. Also, I'm not sure how we get rid of the broken index if there is a failure later than step 2. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
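Step 3's conditional-acquire loop is the one genuinely novel locking maneuver in Tom's outline, so a sketch may help. Here a pthread mutex stands in for the heavyweight ShareLock (the real lock manager API is quite different): rather than queueing for the lock and risking deadlock, we poll until it is free, then release it immediately, since the only point was to learn that no writer transaction predating the new index is still running.

```c
#include <pthread.h>
#include <unistd.h>

/* Sketch of a ConditionalLockAcquire/sleep loop.  Returns 1 once the
 * lock was momentarily acquirable (i.e. no old writers remain), or 0
 * if we gave up after max_tries polls. */
int wait_for_old_writers(pthread_mutex_t *sharelock, int max_tries)
{
    for (int tries = 0; tries < max_tries; tries++)
    {
        if (pthread_mutex_trylock(sharelock) == 0)
        {
            pthread_mutex_unlock(sharelock);  /* release at once */
            return 1;                         /* no old writers remain */
        }
        usleep(10000);                        /* back off, then retry */
    }
    return 0;                                 /* still blocked: give up */
}
```

The trylock-then-unlock shape is what makes this deadlock-safe: the caller never holds the lock while waiting for anything else.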
Re: [HACKERS] HOOKS for Synchronous Replication?
True, I think we need hooks for both methods. On 12/8/05, Jan Wieck [EMAIL PROTECTED] wrote: On 12/8/2005 2:05 PM, Jim C. Nasby wrote: On Thu, Dec 08, 2005 at 08:33:59AM -0800, Darcy Buskermolen wrote: On Wednesday 07 December 2005 20:24, Tom Lane wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Anyone remember this patch? http://gorda.di.uminho.pt/community/pgsqlhooks/ The discussion seems to be pretty minimal: http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php Does anyone see a need to investigate it further? I had hoped to see some comments from the Slony people about it. I'd feel better about the validity of a set of hooks if more than one project agreed that it was useful/appropriate ... I missed seeing it all together the first time through, I'll see what I can do about taking an in-depth look at it over the next few days and provide some feedback. While this code might be useful, wouldn't it be much more valuable to provide hooks into xlog so that we could do non-trigger-based replication? (As well as non-trigger-based materialized views...) I don't see why these would be mutually exclusive. A generic API needs to have them all. Without having looked at the patch yet, what I expect from an API is that the backend will (after initialization and becoming a member of a database) check if this database is replicated. If so, load the specific shared object that implements the backend part of the replication system and then call an init() function in that. This init() function then will add callbacks to all the hooks where this particular replication system wants to be called. So one replication system might want to be called on commit, just before writing the WAL record, some other system doesn't care about that, but wants to see the WAL record after it was written. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] generic builtin functions
Andrew Dunstan [EMAIL PROTECTED] writes: Still thinking a bit more about this ... how about we have output functions take an optional second argument, which is the type oid? No. We just undid that for good and sufficient security reasons. If an output function depends on anything more than the contents of the object it's handed, it's vulnerable to being lied to. http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php I realize that being told the wrong type ID might be less than catastrophic for enum_out, but I'm going to fiercely resist putting back any extra arguments for output functions. The temptation to use them unsafely is just too strong --- we've learned that the hard way more than once already, and I don't want to repeat the same mistake yet again. Input funcs get a typioparam and typmod argument in addition to the data value, Entirely different situation because the only thing an input function assumes is that it's been handed some text ... which it can validate. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] generic builtin functions
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Still thinking a bit more about this ... how about we have output functions take an optional second argument, which is the type oid? No. We just undid that for good and sufficient security reasons. If an output function depends on anything more than the contents of the object it's handed, it's vulnerable to being lied to. http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php I realize that being told the wrong type ID might be less than catastrophic for enum_out, but I'm going to fiercely resist putting back any extra arguments for output functions. The temptation to use them unsafely is just too strong --- we've learned that the hard way more than once already, and I don't want to repeat the same mistake yet again. Input funcs get a typioparam and typmod argument in addition to the data value, Entirely different situation because the only thing an input function assumes is that it's been handed some text ... which it can validate. OK, If you resist fiercely I'm sure it won't happen, so I'll put away the asbestos underpants ;-) I see in that discussion you say: Every rowtype Datum will carry its own concrete type. Are we storing that on disk in every composite object? If not, where do we get it from before calling the output function? cheers andrew regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] 8.1.1 stamped
I have stamped the 8.1.1 CVS tree and it is ready for testing/packaging. The release notes are here: http://candle.pha.pa.us/main/writings/pgsql/sgml/release.html#RELEASE-8-1-1 -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Upcoming PG re-releases
On Tue, 6 Dec 2005, Bruce Momjian wrote: Exactly what does vim do that iconv does not? Fuzzy encoding sounds scary to me. Right. It actually makes assumptions about the source encoding. People who care about their data need, unfortunately, to spend a bit of time on this problem. I've been discussing the same issue on the slony1 mailing list, because the issue can affect people's ability to upgrade using slony1. http://gborg.postgresql.org/pipermail/slony1-general/2005-December/003430.html It would be good if we had the script I suggest in the email: A script which identifies non-utf-8 characters and provides some context, line numbers, etc, will greatly speed up the process of remedying the situation. Thoughts? Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Upcoming PG re-releases
Gavin Sherry wrote: On Tue, 6 Dec 2005, Bruce Momjian wrote: Exactly what does vim do that iconv does not? Fuzzy encoding sounds scary to me. Right. It actually makes assumptions about the source encoding. People who care about their data need, unfortunately, to spend a bit of time on this problem. I've been discussing the same issue on the slony1 mailing list, because the issue can affect people's ability to upgrade using slony1. http://gborg.postgresql.org/pipermail/slony1-general/2005-December/003430.html It would be good if we had the script I suggest in the email: A script which identifies non-utf-8 characters and provides some context, line numbers, etc, will greatly speed up the process of remedying the situation. I think the best we can do is the iconv -c with the diff idea, which is already in the release notes. I suppose we could merge the iconv and diff into a single command, but I don't see a portable way to output the iconv output to stdout, /dev/stdin not being portable. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [PATCHES] Inherited Constraints
On Thu, 2005-12-08 at 11:10 +0000, Simon Riggs wrote: On Wed, 2005-12-07 at 21:24 +0000, Simon Riggs wrote: The following patch implements a record of whether a constraint is inherited or not, and prevents dropping of inherited constraints. Patch posted to -patches list. What it doesn't do: it doesn't yet prevent dropping the parent constraint, which is wrong, clearly, but what to do about it? 1. make dropping a constraint drop all constraints dependent upon it (without any explicit cascade) 2. add a new clause to ALTER TABLE DROP CONSTRAINT CASCADE I prefer (1), since it is SQL Standard compliant, easier to remember, and automatic de-inheritance is the natural opposite of the automatic inheritance process. Comments, please, -hackers? Late night hacking again: ALTER TABLE DROP CONSTRAINT CASCADE does of course already exist, so the following should cause dependency violation ERRORs: - omitting the CASCADE when attempting to delete the parent constraint - attempting to drop the child constraint Best Regards, Simon Riggs ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Upcoming PG re-releases
On 12/8/05, Bruce Momjian pgman@candle.pha.pa.us wrote: A script which identifies non-utf-8 characters and provides some context, line numbers, etc, will greatly speed up the process of remedying the situation. I think the best we can do is the iconv -c with the diff idea, which is already in the release notes. I suppose we could merge the iconv and diff into a single command, but I don't see a portable way to output the iconv output to stdout, /dev/stdin not being portable. No, what is needed for people who care about fixing their data is a loadable strip_invalid_utf8() that works in older versions... then just select * from bar where foo != strip_invalid_utf8(foo); The function would be useful in general; for example, if you have an application which doesn't already have much utf8 logic, you want to use a text field, and stripping is the behaviour you want. For example, lots of simple web applications. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] Another encoding issue
Hi all, Here's another interesting encoding issue. I cannot recall having seen it on the lists.
---
[EMAIL PROTECTED] build7]$ bin/createdb -E LATIN1 test
CREATE DATABASE
[EMAIL PROTECTED] build7]$ cat break.sh
dat=`echo -en "\245\241"`
echo "create table test (d text);"
echo "insert into test values('$dat');"
[EMAIL PROTECTED] build7]$ sh break.sh | bin/psql test
CREATE TABLE
INSERT 0 1
[EMAIL PROTECTED] build7]$ bin/createdb -T test test2
CREATE DATABASE
[EMAIL PROTECTED] build7]$ bin/createdb -T test -E UTF-8 test2
CREATE DATABASE
[EMAIL PROTECTED] build7]$ bin/pg_dump -C test2 > test2.dmp
[EMAIL PROTECTED] build7]$ bin/dropdb test2
DROP DATABASE
[EMAIL PROTECTED] build7]$ bin/psql template1 -f test2.dmp
SET
SET
SET
CREATE DATABASE
ALTER DATABASE
You are now connected to database test2.
[...]
CREATE TABLE
ALTER TABLE
psql:test2.dmp:345: ERROR: invalid UTF-8 byte sequence detected near byte 0xa5
CONTEXT: COPY test, line 1, column d: [...]
---
Until createdb() is a lot more sophisticated, we cannot translate characters between encodings. I don't think this is a huge issue though, as most people are only going to be creating empty databases anyway. Still, it probably requires documentation. Thoughts? Thanks, Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Foreign key trigger timing bug?
On Thu, 8 Dec 2005, Jan Wieck wrote: On 12/7/2005 4:50 PM, Stephan Szabo wrote: On Wed, 7 Dec 2005, Bruce Momjian wrote: I had an open 8.1 item that was: o fix foreign trigger timing issue Would someone supply text for a TODO entry on this, as I don't think we fixed it in 8.1. I'd split this into two separate items now. Fix before delete triggers on cascaded deletes to run after the cascaded delete is done. This is odd, but seems to be what the spec requires. Ugh, that sounds ugly. Yeah. I really don't understand it, but it appears to me to be explicitly different in the spec for on delete cascade even compared to the rest of the referential actions. One problem I see is, what do we do if the BEFORE trigger then returns NULL (to skip the delete)? The cascaded operations are already done. Do we have to execute the cascaded deletes in a subtransaction or do we disallow the skip in this case? I think we'd have to disallow skipping, especially since skipping would probably end up with a violated constraint. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] generic builtin functions
Andrew Dunstan [EMAIL PROTECTED] writes: I see in that discussion you say: Every rowtype Datum will carry its own concrete type. Are we storing that on disk in every composite object? Yup. For a rowtype datum it's not a serious overhead. I realize that the same is not true of enums :-( regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] Inherited Constraints
On Thursday 2005-12-08 15:47, Simon Riggs wrote: does of course already exist, so the following should cause dependency violation ERRORs: - omitting the CASCADE when attempting to delete parent constraint - attempting to drop the child constraint Why should dropping the child constraint fail? Child tables are supposed to be able to over-ride parent constraints. Dropping a parent's constraint sounds like just a way to over-ride a constraint with no constraint at all. (Making the column unconstrained.) ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Another encoding issue
If we're bringing up odd encoding issues, why not talk about the mystery encoding of the shared catalogs? :) Basically, which database you're logged into when you alter a catalog affects what encoding the new object appears as in the shared catalog. This, for one, makes it impossible for us in phpPgAdmin to display a list of databases, where some database names are in EUC and some are in UTF-8 and some are in LATIN5... I bring it up as I notice that in MySQL 5 at least, all system object names (in our case that'd be all strings in the shared catalogs) are stored in UTF-8, always. Chris

Gavin Sherry wrote: [...]

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Foreign key trigger timing bug?
Stephan Szabo [EMAIL PROTECTED] writes: Yeah. I really don't understand it, but it appears to me to be explicitly different in the spec for on delete cascade even compared to the rest of the referential actions. One problem I see is, what do we do if the BEFORE trigger then returns NULL (to skip the delete)? The cascaded operations are already done. Do we have to execute the cascaded deletes in a subtransaction or do we disallow the skip in this case? I think we'd have to disallow skipping, especially since skipping would probably end up with a violated constraint. That seems to me to be a sufficient reason to not follow the spec in this respect. A BEFORE trigger should be run BEFORE anything happens, full stop. I can't think of any good reason why the spec's semantics are better. (It's not like our triggers are exactly spec-compatible anyway.) regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Improving free space usage (was: Reducing relation locking overhead)
On Fri, Dec 09, 2005 at 12:00:14AM +0200, Hannu Krosing wrote: Along those lines, I've wondered if it makes sense to add more flexibility in how free space is reclaimed in a table. One obvious possibility is to have a strategy where new tuples will always look to the FSM for space (instead of going into the current page if possible), and the FSM will always hand out the earliest page in the table it has. This mode would have the effect of moving tuples towards the front of the table, allowing for space reclamation. A variation might be that this mode will not affect tuples that are generated as part of an UPDATE and are in the first x% of the table, since it doesn't make sense to move a tuple from page 2 to page 1 in a 1000-page table. This % could depend on some fill factor of the table, aiming not to move tuples that would end up in the final volume of a balanced table, which, in the case of a heavily updated table, would probably be 2 to 3 times the volume of a densely populated table. Another possibility is to always go to the FSM and to have the FSM hand back the page that is closest to the new tuple according to a certain index. This would allow for ALTER TABLE CLUSTER to be much more self-maintaining. The downside is that you'd have to do a lookup on that index, but presumably if the index is important enough to cluster on then it should be well-cached. There's probably some other tweaks that could be done as well to make this more performant. Yes, I agree on all your points about better placement of new tuples; they would all be useful indeed. Sounds like a TODO, barring objections... -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] Warm-cache prefetching
I found an interesting paper on improving index speed by prefetching memory data to the L1/L2 cache (there was a discussion about prefetching disk data to memory several days ago in the ice-breaker thread): http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf There is also a related technique used to speed up memcpy: http://people.redhat.com/arjanv/pIII.c I wonder if we could use it to speed up in-memory scan operations for heap or index. Tom's patch has made scans handle a page (vs. a row) at a time, which is a basis for this optimization. I wrote a program to try to simulate it, but I am not good at micro optimization, and I just get a very weak but kind-of-stable improvement. I wonder if any people here are interested to take a look. Regards, Qingqing

-- Test results --
Cache line size: 64
CPU: P4 2.4G
$#./prefetch 10 16
Sum: -951304192: with prefetch on - duration: 42.163 ms
Sum: -951304192: with prefetch off - duration: 42.838 ms
Sum: -951304192: with prefetch on - duration: 44.044 ms
Sum: -951304192: with prefetch off - duration: 42.792 ms
Sum: -951304192: with prefetch on - duration: 42.324 ms
Sum: -951304192: with prefetch off - duration: 42.803 ms
Sum: -951304192: with prefetch on - duration: 42.189 ms
Sum: -951304192: with prefetch off - duration: 42.801 ms
Sum: -951304192: with prefetch on - duration: 42.155 ms
Sum: -951304192: with prefetch off - duration: 42.827 ms
Sum: -951304192: with prefetch on - duration: 42.179 ms
Sum: -951304192: with prefetch off - duration: 42.798 ms
Sum: -951304192: with prefetch on - duration: 42.180 ms
Sum: -951304192: with prefetch off - duration: 42.804 ms
Sum: -951304192: with prefetch on - duration: 42.193 ms
Sum: -951304192: with prefetch off - duration: 42.827 ms
Sum: -951304192: with prefetch on - duration: 42.164 ms
Sum: -951304192: with prefetch off - duration: 42.810 ms
Sum: -951304192: with prefetch on - duration: 42.182 ms
Sum: -951304192: with prefetch off - duration: 42.826 ms

-- Test program --

/*
 * prefetch.c
 *		PostgreSQL warm-cache sequential scan simulator with prefetch
 */
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <errno.h>
#include <sys/time.h>

typedef char bool;
#define true	((bool) 1)
#define false	((bool) 0)

#define BLCKSZ	8192
#define CACHESZ	64
#define NBLCKS	5000

int		sum;

int
main(int argc, char *argv[])
{
	int		i, rounds;
	char   *blocks;
	int		cpu_cost;

	if (argc != 3)
	{
		fprintf(stderr, "usage: prefetch rounds cpu_cost [1, 16]\n");
		exit(-1);
	}

	rounds = atoi(argv[1]);
	cpu_cost = atoi(argv[2]);
	if (cpu_cost > 16)
		exit(-1);

	for (i = 0; i < 2*rounds; i++)
	{
		int		j, k;
		struct timeval start_t, stop_t;
		bool	enable = i%2 ? false : true;
		char   *blck;

		blocks = (char *) malloc(BLCKSZ*NBLCKS);
		memset(blocks, 'a', BLCKSZ*NBLCKS);
		sum = 0;

		gettimeofday(&start_t, NULL);
		for (j = 0; j < NBLCKS; j++)
		{
			blck = blocks + j*BLCKSZ;
			for (k = 0; k < BLCKSZ; k += CACHESZ)
			{
				int	   *s = (int *) (blck + k);
				int		u = cpu_cost;

				if (enable)
				{
					/* prefetch ahead */
					__asm__ __volatile__ (
						"1: prefetchnta 128(%0)\n"
						: : "r" (s) : "memory");
				}

				/* pretend to process current tuple */
				while (u--)
					sum += (*(s+u))*(*(s+u));
			}
		}
		gettimeofday(&stop_t, NULL);
		free(blocks);

		/* measure the time */
		if (stop_t.tv_usec < start_t.tv_usec)
		{
			stop_t.tv_sec--;
			stop_t.tv_usec += 1000000;
		}
		fprintf(stdout, "Sum: %d: with prefetch %s - duration: %ld.%03ld ms\n",
				sum, enable ? "on" : "off",
				(long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
						(stop_t.tv_usec - start_t.tv_usec) / 1000),
				(long) (stop_t.tv_usec - start_t.tv_usec) % 1000);
	}
	exit(0);
}

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Warm-cache prefetching
Qingqing, On 12/8/05 8:07 PM, Qingqing Zhou [EMAIL PROTECTED] wrote:

/* prefetch ahead */
__asm__ __volatile__ (
	"1: prefetchnta 128(%0)\n"
	: : "r" (s) : "memory");

I think this kind / grain of prefetch is handled as a compiler optimization in the latest GNU compilers, and further there are some memory streaming operations for the Pentium 4 ISA that are now part of the standard compiler optimizations done by gcc. What I think would be tremendously beneficial is to implement L2 cache blocking in certain key code paths like sort. What I mean by cache blocking is performing as many operations on a block of memory (maybe 128 pages' worth for a 1MB cache) as possible, then moving to the next batch of memory and performing all of the work on that batch, etc. The other thing to consider in conjunction with this would be maximizing use of the instruction cache, increasing use of parallel functional units and minimizing pipeline stalls. The best way to do this would be to group operations into tighter groups and separating out branching. So instead of structures like this:

function_row_at_a_time(row)
	if conditions
		do some work
	else if other
		do different work
	else if error
		print error_log

You'd have:

function_buffer_at_a_time(buffer_of_rows)
	loop on sizeof(buffer_of_rows) / sizeof_L2cache
		do a lot of work on each row
	loop on sizeof(buffer_of_rows) / sizeof_L2cache
		if error
			exit

The ideas in the above optimizations:
- Delay work until a buffer can be gathered
- Increase the computational intensity of the loops by putting more instructions together
- While in loops doing lots of work, avoid branches / jumps

- Luke ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Warm-cache prefetching
Luke Lonergan [EMAIL PROTECTED] wrote:

/* prefetch ahead */
__asm__ __volatile__ (
	"1: prefetchnta 128(%0)\n"
	: : "r" (s) : "memory");

I think this kind / grain of prefetch is handled as a compiler optimization in the latest GNU compilers, and further there are some memory streaming operations for the Pentium 4 ISA that are now part of the standard compiler optimizations done by gcc.

Is there any special kind of optimization flag gcc needs to support this? I just tried both 2.96 and 4.0.1 with -O2. Unfortunately, sse_clear_page() encounters a core dump under 4.0.1 at this line:

__asm__ __volatile__ ("movntdq %%xmm0, %0" : : "m" (sse_save[0]));

So I removed this test (sorry ...). There is no non-trivial difference AFAICS. The results are attached. I will look into the other parts of your thread tomorrow. Regards, Qingqing

---
*#ll prefp3-*
-rwx--1 zhouqq jmgrp 38k Dec 9 00:49 prefp3-296
-rwx--1 zhouqq jmgrp 16k Dec 9 00:49 prefp3-401
*#./prefp3-296
2392.975 MHz
clear_page function 'gcc clear_page()' took 27142 cycles per page (172.7 MB/s)
clear_page function 'normal clear_page()' took 27161 cycles per page (172.6 MB/s)
clear_page function 'mmx clear_page()' took 17293 cycles per page (271.1 MB/s)
clear_page function 'gcc clear_page()' took 27174 cycles per page (172.5 MB/s)
clear_page function 'normal clear_page()' took 27142 cycles per page (172.7 MB/s)
clear_page function 'mmx clear_page()' took 17291 cycles per page (271.1 MB/s)
copy_page function 'normal copy_page()' took 18552 cycles per page (252.7 MB/s)
copy_page function 'mmx copy_page()' took 12511 cycles per page (374.6 MB/s)
copy_page function 'sse copy_page()' took 12318 cycles per page (380.5 MB/s)
*#./prefp3-401
2392.970 MHz
clear_page function 'gcc clear_page()' took 27120 cycles per page (172.8 MB/s)
clear_page function 'normal clear_page()' took 27151 cycles per page (172.6 MB/s)
clear_page function 'mmx clear_page()' took 17295 cycles per page (271.0 MB/s)
clear_page function 'gcc clear_page()' took 27152 cycles per page (172.6 MB/s)
clear_page function 'normal clear_page()' took 27114 cycles per page (172.9 MB/s)
clear_page function 'mmx clear_page()' took 17296 cycles per page (271.0 MB/s)
copy_page function 'normal copy_page()' took 18586 cycles per page (252.2 MB/s)
copy_page function 'mmx copy_page()' took 12620 cycles per page (371.4 MB/s)
copy_page function 'sse copy_page()' took 12698 cycles per page (369.1 MB/s)

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Warm-cache prefetching
Perhaps because P4 is already doing H/W prefetching? http://www.tomshardware.com/2000/11/20/intel/page5.html

I ran the test program on an opteron 2.2G:
% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 81.166 ms
Sum: -951304192: with prefetch off - duration: 79.769 ms
Sum: -951304192: with prefetch on - duration: 81.173 ms
Sum: -951304192: with prefetch off - duration: 79.718 ms
Sum: -951304192: with prefetch on - duration: 83.293 ms
Sum: -951304192: with prefetch off - duration: 79.731 ms
Sum: -951304192: with prefetch on - duration: 81.227 ms
Sum: -951304192: with prefetch off - duration: 79.851 ms
Sum: -951304192: with prefetch on - duration: 81.003 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.084 ms
Sum: -951304192: with prefetch off - duration: 79.728 ms
Sum: -951304192: with prefetch on - duration: 81.009 ms
Sum: -951304192: with prefetch off - duration: 79.723 ms
Sum: -951304192: with prefetch on - duration: 81.074 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms
Sum: -951304192: with prefetch on - duration: 81.188 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.075 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms
Got slowdown.
I ran it on a PIII 650M:
% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 284.952 ms
Sum: -951304192: with prefetch off - duration: 291.439 ms
Sum: -951304192: with prefetch on - duration: 290.690 ms
Sum: -951304192: with prefetch off - duration: 299.692 ms
Sum: -951304192: with prefetch on - duration: 295.287 ms
Sum: -951304192: with prefetch off - duration: 290.992 ms
Sum: -951304192: with prefetch on - duration: 285.116 ms
Sum: -951304192: with prefetch off - duration: 294.127 ms
Sum: -951304192: with prefetch on - duration: 286.986 ms
Sum: -951304192: with prefetch off - duration: 291.001 ms
Sum: -951304192: with prefetch on - duration: 283.233 ms
Sum: -951304192: with prefetch off - duration: 401.910 ms
Sum: -951304192: with prefetch on - duration: 297.021 ms
Sum: -951304192: with prefetch off - duration: 307.814 ms
Sum: -951304192: with prefetch on - duration: 287.201 ms
Sum: -951304192: with prefetch off - duration: 303.870 ms
Sum: -951304192: with prefetch on - duration: 286.962 ms
Sum: -951304192: with prefetch off - duration: 352.779 ms
Sum: -951304192: with prefetch on - duration: 283.245 ms
Sum: -951304192: with prefetch off - duration: 294.422 ms
Looks like some speedup to me.

On Thu, 08 Dec 2005, Qingqing Zhou wrote: I found an interesting paper on improving index speed by prefetching memory data to the L1/L2 cache (there was a discussion about prefetching disk data to memory several days ago in the ice-breaker thread): http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf There is also a related technique used to speed up memcpy: http://people.redhat.com/arjanv/pIII.c I wonder if we could use it to speed up in-memory scan operations for heap or index. Tom's patch has made scans handle a page (vs. a row) at a time, which is a basis for this optimization. I wrote a program to try to simulate it, but I am not good at micro optimization, and I just get a very weak but kind-of-stable improvement. I wonder if any people here are interested to take a look.