Re: [HACKERS] Replication on the backend
On Wed, 2005-12-07 at 01:04 -0800, J. Andrew Rogers wrote:
> Opteron boards get pretty damn close to Big Iron SMP fabric performance in a cheap package. Given how many companies have announced plans to produce Opteron server boards with Infiniband fabrics directly integrated into HyperTransport, I would say that this is the future of server boards.

InfiniBand on-board? Wow, seems very interesting. Thank you for your hints and numbers, very helpful!

> And if postgres could actually use an infiniband fabric for clustering a single database instance across Opteron servers, that would be very impressive...

Full ACK.

Regards

Markus

---(end of broadcast)---
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
Re: [HACKERS] Replication on the backend
On Dec 6, 2005, at 9:09 PM, Gregory Maxwell wrote:
> Eh, why would light limited delay be any slower than a disk on FC the same distance away? :) In any case, performance of PG on iscsi is just fine. You can't blame the network... Doing multimaster replication is hard because the locking primitives that are fine on a simple multiprocessor system (with a VERY high bandwidth, very low latency interconnect between processors) just don't work across a network, so you're left finding other methods and making them work...

Speed of light latency shows up pretty damn often in real networks, even relatively local ones. The number of people who wonder why a transcontinental SLA of 10ms is not possible is astonishing. The silicon fabrics are sufficiently fast that most well-designed networks are limited by how fast one can push photons through a fiber, which is significantly slower than photons through a vacuum. Silicon switch fabrics add latency measured in nanoseconds, which is effectively zero for many networks that leave the system board.

Compared to a single simple SMP system, a local cluster built on a first-rate fabric will have about an order of magnitude higher latency but very similar bandwidth. On the other hand, at those latencies you can increase the number of addressable processors with that kind of bandwidth by an order of magnitude, so it is a bit of a trade. However, latency matters a lot, such that one would have to be much smarter about partitioning synchronization across that fabric, even though one would lose nothing in the bandwidth department.

> But again, multimaster isn't hard because of some inherently slow property of networks.

Eh? As far as I know, the difficulty of multi-master is almost entirely a product of the latency of real networks, such that they are too slow for scalable distributed locks. SMP is little more than a distributed lock manager implemented in silicon.
Therefore, multi-master is hard in practice because we cannot drive networks fast enough. That said, current state-of-the-art network fabrics are within an order of magnitude of SMP fabrics, such that they could be real contenders, particularly once you get north of 8-16 processors.

The really sweet potential is in Opteron system boards with Infiniband directly attached to HyperTransport. At that level of bandwidth and latency, both per node and per switch fabric, the architecture possibilities start to become intriguing.

J. Andrew Rogers

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
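To put numbers on the speed-of-light point above, here is a back-of-the-envelope check (my own illustrative figures, not from the original posts): light in fiber propagates at roughly two-thirds of its vacuum speed, so a transcontinental path cannot meet a 10ms round-trip SLA on propagation delay alone.

```python
# Rough speed-of-light latency check (illustrative figures, not measurements
# from this thread): light in fiber travels at about 2/3 the vacuum speed.
C_VACUUM_KM_S = 300_000                 # speed of light in vacuum, km/s
C_FIBER_KM_S = C_VACUUM_KM_S * 2 / 3    # ~200,000 km/s in fiber

def fiber_rtt_ms(distance_km: float) -> float:
    """Best-case round-trip time over fiber, ignoring switching and queueing."""
    return 2 * distance_km / C_FIBER_KM_S * 1000

# A ~4,000 km transcontinental path: propagation alone already blows a 10 ms SLA.
print(fiber_rtt_ms(4000))   # 40.0 ms round trip (20 ms one way)

# A 100 km metropolitan ring is a different story:
print(fiber_rtt_ms(100))    # 1.0 ms round trip
```

This ignores serialization, switching, and queueing entirely; real paths are also longer than the great-circle distance, so the real numbers are worse.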
Re: [HACKERS] Oddity with extract microseconds?
In article [EMAIL PROTECTED], Christopher Kings-Lynne [EMAIL PROTECTED] writes:

> mysql> SELECT EXTRACT(MICROSECOND FROM '2003-01-02 10:30:01.00123');
> +--------------------------------------------------------+
> | EXTRACT(MICROSECOND FROM '2003-01-02 10:30:01.00123')  |
> +--------------------------------------------------------+
> |                                                   1230 |
> +--------------------------------------------------------+
> 1 row in set (0.00 sec)
>
>> Does contrary behavior from MySQL count as evidence that PostgreSQL's behavior is correct? :-)
>
> No... I happen to think that their way is more consistent though. Pity it's not in the spec.

I'd say the comparison with MySQL is useless because MySQL is unable to store microseconds in a DATETIME or TIMESTAMP column, although you can extract microseconds from a date/time literal.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Replication on the backend
On Dec 6, 2005, at 11:42 PM, Markus Schiltknecht wrote:
> Does anybody have latency / roundtrip measurements for current hardware? I'm interested in: 1Gb Ethernet, 10 Gb Ethernet, InfiniBand, probably even p2p usb2 or firewire links?

In another secret life, I know a bit about supercomputing fabrics. The latency metrics have to be thoroughly qualified. First, most of the RTT latency numbers for network fabrics are for 0-byte packet sizes, which really does not apply to anyone shuffling real data around. For small packets, high-performance fabrics (HTX Infiniband, Quadrics, etc) have approximately an order of magnitude less latency than vanilla Ethernet, though the performance specifics depend greatly on the actual usage. For large packet sizes, the differences in latency become far less obvious.

However, for real packets a performant fabric will still look very good compared to disk systems. Switched fiber fabrics have enough relatively inexpensive throughput now to saturate most disk systems and CPU I/O busses; only platforms like HyperTransport can really keep up. It is worth pointing out that the latency of high-end network fabrics is similar to large NUMA fabrics, which exposes some of the limits of SMP scalability.

As a point of reference, an organization that knows what they are doing should have no problem getting 500 microsecond RTT on a vanilla metropolitan area GigE fiber network -- a few network operators actually do deliver this on a regional scale. For a local cluster, a competent design can best this by orders of magnitude. There are a number of silicon limitations, but a system that connects the fabric directly to HyperTransport can drive several GB/s with very respectable microsecond latencies if the rest of the system is up to it. There are Opteron system boards now that will drive Infiniband directly from HyperTransport.
I know Arima/Rioworks makes some (great server boards generally), and several other companies are either making them or have announced them in the pipeline. These Opteron boards get pretty damn close to Big Iron SMP fabric performance in a cheap package. Given how many companies have announced plans to produce Opteron server boards with Infiniband fabrics directly integrated into HyperTransport, I would say that this is the future of server boards. And if postgres could actually use an infiniband fabric for clustering a single database instance across Opteron servers, that would be very impressive... J. Andrew Rogers ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
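One way to see why 0-byte RTT figures can mislead, as described above: model one-way latency as a fixed per-message cost plus serialization time (payload size divided by line rate). The numbers below are my own illustrative assumptions, not measurements from this thread.

```python
# Toy latency model: fixed fabric cost plus serialization time.
# All base-latency and line-rate figures here are illustrative assumptions.

def one_way_latency_us(payload_bytes: int, base_us: float, gbit_per_s: float) -> float:
    """Fixed per-message cost plus time to clock the bytes onto the wire."""
    serialization_us = payload_bytes * 8 / (gbit_per_s * 1000)  # bits / (bits per us)
    return base_us + serialization_us

# 0-byte messages: the fixed cost dominates, so the fabrics look very different
# (assumed ~10 us for commodity 10G Ethernet vs ~1.3 us for a high-end fabric).
print(one_way_latency_us(0, base_us=10.0, gbit_per_s=10))      # 10.0 us -- ~8x gap
print(one_way_latency_us(0, base_us=1.3, gbit_per_s=10))       # 1.3 us

# 64 KB messages at the same line rate: serialization swamps the fixed cost
# and the gap nearly disappears.
print(one_way_latency_us(65536, base_us=10.0, gbit_per_s=10))  # ~62.4 us
print(one_way_latency_us(65536, base_us=1.3, gbit_per_s=10))   # ~53.7 us
```

Under these assumptions the small-message advantage is nearly an order of magnitude, while for large messages it shrinks to under 20%, which matches the qualitative claim above.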
Re: [HACKERS] Replication on the backend
Andrew,

> And if postgres could actually use an infiniband fabric for clustering a single database instance across Opteron servers, that would be very impressive...

That's what we do with Bizgres MPP. We've implemented an interconnect to do the data shuffling underneath the optimizer/executor; we currently use TCP/IP, though we could haul out SDP over Infiniband should we need it. However, our optimizer effectively minimizes traffic over the interconnect now and that works well for all of the plans we've run so far. It would be nice to characterize the improvements we could get from moving to 3x infiniband.

Regarding a direct Hypertransport-to-Infiniband bridge, have you looked at Pathscale? http://www.pathscale.com/ I know the fellow behind the scenes who designed it, and I think it's probably well thought out. We were gunning for less than 1us RTT through the adapter and switch once, and I bet they are close.

- Luke

---(end of broadcast)---
TIP 6: explain analyze is your friend
Re: [HACKERS] Feature Request: Multi-octet raw
On Tue, Dec 06, 2005 at 08:54:42PM -0700, Trent Shipley wrote:
> It would be nice if Postgresql supported multi-octet raw data. Certainly a lot of what you would do with it would be similar to bytea, but the basic string functions would be overloaded so that the unit of work would be a multi-octet word.

Well, PostgreSQL allows you to create your own types, so you could just make one that does what you want. The only tricky part would be the syntax on creation, because user-defined types can't take parameters. OTOH, maybe by the time someone writes the code for such a type the details will have been worked out.

Have a nice day,

--
Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.

pgp3gVrY3egve.pgp
Description: PGP signature
Re: [HACKERS] SERIAL type feature request
Hi,

Zoltan Boszormenyi wrote:
> Jan Wieck wrote:
>> On 12/5/2005 1:03 PM, Zoltan Boszormenyi wrote:
>>> Jan Wieck wrote:
>>>> On 12/4/2005 5:10 PM, Zoltan Boszormenyi wrote:
>>>>> I found this in the SQL2003 draft:
>>>>> 4.14.7 Identity columns
>>>>> ...
>>>>> An identity column has a start value, an increment, a maximum value, a minimum value, and a cycle option.
>>>>> ...
>>>>> The exact properties of a sequence. It would be a good idea to be able to provide all these the same way PostgreSQL provides CREATE SEQUENCE.
>>>> I think nobody would object to implementing support for the SQL2003 syntax. Most of that would be providing all the values that will get forwarded into the internal sequence generation during CREATE TABLE. The other thing needed is an extension to the default value mechanism that overrides any given value to implement GENERATED ALWAYS. Not too hard either.
>>> Where can I find this syntax? (PDF file name, page#) Thanks. I think I'll modify my feature request for the standard behaviour.
>> It's all in the Foundation paper inside this zip:
>> http://www.wiscorp.com/sql/sql_2003_standard.zip
> Thanks, I found it. It's GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY, isn't it?

If I interpret it correctly, GENERATED ALWAYS AS IDENTITY means that no matter what I give in INSERT INTO MYTABLE (serial_id, ...) VALUES (N, ...), the sequence next value will be inserted into the database. I am all for it; it's much stronger than just watching for the 0 value and would fit my needs. The other behaviour is GENERATED BY DEFAULT AS IDENTITY, which is what PostgreSQL currently provides.
Best regards,
Zoltán Böszörményi

To reiterate, I would like the following added to the PostgreSQL 8.2 TODO (I may have got the optional parameters wrong...):

- Extend the SERIAL type declaration and functionality with the SQL2003-compliant sequence generation options:

  SERIAL [ GENERATED { ALWAYS | BY DEFAULT }
    [ AS IDENTITY (
        [ START WITH startvalue ]
        [ INCREMENT BY incrementvalue ]
        [ MAXVALUE maxvalue ]
        [ MINVALUE minvalue ]
        [ CYCLE | NO CYCLE ] ) ] ]

---(end of broadcast)---
TIP 6: explain analyze is your friend
[HACKERS] About my new work at Command Prompt Inc.
Hi,

I'd like to inform the people who do not read Planet PostgreSQL: Command Prompt Inc. has just hired me for the community work I have been doing so far (PostgreSQL RPM stuff and other PostgreSQL-related RPMs, such as Slony-I, pgpool, PostGIS, etc.) and website things. That means I'll spend more time on these. Thanks to Joshua. I'm very happy to be a member of CMD. Let's see what else I can do for CMD.

Regards

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Oddity with extract microseconds?
Christopher Kings-Lynne wrote:
>>>>> Why aren't 'minutes' considered too?
>>>> Because they aren't 'seconds'.
>>> Well, seconds aren't microseconds either.
>> Yeah, they are: it's just one field. The other way of looking at it (that everything is seconds) is served by extract(epoch).
> Well, it's different in MySQL unfortunately - what does the standard say? Out of interest, can someone try this for me in MySQL 5:
>
> SELECT EXTRACT (MICROSECOND FROM '2003-01-02 10:30:00.00123');
> SELECT EXTRACT (MICROSECOND FROM '2003-01-02 10:30:10.00123');

mysql 4.1.5 gives back 123 in both cases. I assume they haven't changed that, although anything is possible.

cheers

andrew

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
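The disagreement in this thread is between two readings of EXTRACT(MICROSECOND): the "seconds and microseconds are one field" view (scale the whole seconds value, fraction included, to microseconds) versus the "only the sub-second fraction counts" view. A small Python sketch of the two readings, using my own illustrative timestamp rather than output from either database:

```python
from datetime import datetime

# Illustrative timestamp (10 whole seconds plus 1230 microseconds).
ts = datetime.fromisoformat("2003-01-02 10:30:10.001230")

# Reading 1: "it's just one field" -- the whole seconds value, including the
# fraction, expressed in microseconds.
whole_field = ts.second * 1_000_000 + ts.microsecond
print(whole_field)    # 10001230

# Reading 2: only the sub-second fraction counts.
fraction_only = ts.microsecond
print(fraction_only)  # 1230
```

Note that the two readings agree exactly when the seconds field is zero, which is why Andrew's pair of test queries (one at :00, one at :10) is the discriminating experiment.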
[HACKERS] Foreign key trigger timing bug?
I had an open 8.1 item that was: o fix foreign trigger timing issue Would someone supply text for a TODO entry on this, as I don't think we fixed it in 8.1. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] Foreign key trigger timing bug?
I think this is the foreign key trigger timing issue.

---

Darcy Buskermolen wrote:
> On Friday 09 September 2005 08:46, Stephan Szabo wrote:
>> On Fri, 9 Sep 2005, Tom Lane wrote:
>>> Stephan Szabo [EMAIL PROTECTED] writes:
>>>> Is there a case other than a before trigger updating a row we will want to act upon later in the statement where we'll get a row with xmax of our transaction and cmax greater than the current command?
>>> The greater-cmax case could occur via any kind of function, not only a trigger, ie update tab set x = foo(x) where ... where foo() is a volatile function that internally updates the tab table.
>> I *thought* I was missing a case, I just couldn't figure out what.
>>> I suppose you could say that this is horrible programming practice and anyone who tries it deserves whatever weird behavior ensues ... but it's not the case that every such situation involves a trigger.
>> Well, the change I was thinking of would have made it an error if foo(x) updated a row that was then later selected by the update rather than the current behavior which I think would have ignored the already updated row, so that's probably not going to work.
>
> I see that this still is not addressed fully in beta 3. Can anybody give a quick overview of where this is sitting, and if it's likely to make its way into 8.1 gold?
>
> --
> Darcy Buskermolen
> Wavefire Technologies Corp.
> http://www.wavefire.com
> ph: 250.717.0200 fx: 250.763.1759

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: Concurrent CREATE INDEX, try 2 (was Re: [HACKERS] Reducing
On 12/6/05, Hannu Krosing wrote:
> 1) run a transaction repeatedly, trying to hit a point of no concurrent transactions; enhance the odds by locking out starting other transactions for a few (tenths or hundredths of) seconds. If it succeeds, record SNAP1, commit and continue; else rollback, then sleep a little and retry.

Which locks can be released by committing here?

> 2) build index on all rows inserted before SNAP1
>
> 3) run a transaction repeatedly, trying to hit a point of no concurrent transactions by locking out other transactions for a few (tenths or hundredths of) seconds. If it succeeds, record SNAP2, mark index as visible for inserts, commit. Now all new transactions see the index and use it when inserting new tuples.
>
> 4) scan over table, add all tuples between SNAP1 and SNAP2 to index

You cannot guarantee that every tuple inserted in the table will be visible to SNAP2 if you take SNAP2 before the commit of the insert-only index has dropped below the global XMIN horizon.

> 5) mark index as usable for query plans

How about:
- begin transaction X1
- insert all visible tuples in an index
- mark index incomplete
- commit
- wait for X1 to become visible to all running transactions (X1 is known from the XMIN in pg_class / pg_index)
- begin transaction X2
- insert all missing tuples in index
- mark index complete
- commit

Jochem

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
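Jochem's two-transaction scheme can be sketched with a toy model (hypothetical Python, nothing like the real backend code): rows record the transaction id that inserted them, pass one indexes everything the first transaction's snapshot can see, and pass two picks up rows that landed in between.

```python
# Toy model of a two-pass concurrent index build (hypothetical sketch, not
# PostgreSQL internals). Each row is (xid, key), where xid is the id of the
# transaction that inserted it; smaller xid means inserted earlier.

def build_index_concurrently(rows, x1_snapshot_xid, x2_snapshot_xid):
    """Return (set of indexed keys, index_complete flag)."""
    index = set()

    # Transaction X1: index every row visible to its snapshot, then leave
    # the index marked "incomplete" so planners will not use it yet.
    for xid, key in rows:
        if xid < x1_snapshot_xid:
            index.add(key)
    index_complete = False

    # ... wait here until X1 is visible to all running transactions ...

    # Transaction X2: insert the stragglers that committed between the two
    # snapshots, then mark the index complete.
    for xid, key in rows:
        if x1_snapshot_xid <= xid < x2_snapshot_xid:
            index.add(key)
    index_complete = True
    return index, index_complete

# Rows inserted by xids 1, 5, 9; X1's snapshot is 4, X2's is 10:
# pass one catches 'a', pass two catches 'b' and 'c'.
print(build_index_concurrently([(1, "a"), (5, "b"), (9, "c")], 4, 10))
```

The waiting step between the passes is the crux of the real proposal: it is what guarantees no insert can slip between the two scans unseen.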
Re: Concurrent CREATE INDEX, try 2 (was Re: [HACKERS] Reducing
Hannu Krosing [EMAIL PROTECTED] writes:
>> But that said, realistically *any* solution has to obtain a lock at some time to make the schema change. I would say pretty much any O(1) (constant time) outage is at least somewhat acceptable as contrasted with the normal index build which locks out other writers for at least O(n lg n) time. Anything on the order of 100ms is probably as good as it gets here.
>
> For me any delay less than the client timeout is acceptable and anything more than that is not. N sec is ok, N+1 is not. It's as simple as that.

I don't think the client timeout is directly relevant here. If your client timeout is 20s and you take 19s, how many requests have queued up behind you? If you normally process requests in under 200ms and receive 10 requests per second (handling at least 2 simultaneously) then you now have 190 requests queued up. Those requests take resources and will slow down your server. If they slow things down too much then you will start failing to meet your 200ms deadline.

It's more likely that your system is engineered to use queueing and simultaneous dispatch to deal with spikes in load up to a certain margin. Say you know it can deal with spikes in load of up to 2x the regular rate. Then you can deal with a service outage of up to the 200ms deadline. If you can deal with spikes of up to 4x the regular rate then you can deal with an outage of up to 600ms.

Moreover, even if you had the extra resources available to handle a 19s backlog of requests, how long would it take you to clear that backlog? If you have narrow headroom on meeting the deadline in the first place, and now you have even less headroom because of the resources dedicated to the queue, it'll take you a long time to clear the backlog.

We periodically ran into problems with load spikes or other performance problems causing things to get very slow and stay slow for a while.
Letting things settle out usually worked but occasionally we had to restart the whole system to clear out the queue of requests. -- greg ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
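Greg's backlog arithmetic can be made concrete (an illustrative sketch using the numbers from his post): an outage well under the client timeout still leaves a queue, and without headroom over the arrival rate that queue never drains.

```python
# Backlog arithmetic for a service outage (sketch of the numbers in the post).

def backlog_after_outage(arrival_per_s: float, outage_s: float) -> float:
    """Requests queued up while nothing is being served."""
    return arrival_per_s * outage_s

def drain_time_s(backlog: float, capacity_per_s: float, arrival_per_s: float) -> float:
    """Time to clear the queue once service resumes; infinite without headroom."""
    headroom = capacity_per_s - arrival_per_s
    return backlog / headroom if headroom > 0 else float("inf")

# 10 req/s arriving during a 19 s outage -> 190 queued requests.
q = backlog_after_outage(10, 19)
print(q)  # 190

# 2 workers at 200 ms per request is exactly 10 req/s of capacity:
# zero headroom, so the backlog never drains.
print(drain_time_s(q, capacity_per_s=2 / 0.2, arrival_per_s=10))  # inf

# With 2x headroom (20 req/s of capacity) it still takes a full 19 s to clear.
print(drain_time_s(q, capacity_per_s=20, arrival_per_s=10))  # 19.0
```

This is the quantitative version of the point about spike margins: the tolerable outage scales with how much spare capacity the system was engineered with.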
Re: [HACKERS] [pgsql-www] About my new work at Command Prompt Inc.
Devrim Gunduz wrote:
> Command Prompt Inc. has just hired me for my community work I have been doing so far, like PostgreSQL RPM stuff and other PostgreSQL related RPMs, such as Slony-I, pgpool, PostGIS, etc., and website things. That means I'll spend more time on these.

Congratulations, Devrim. You're doing a fine job on all those fronts. I believe the decision we made last year to pass the RPM maintenance to you was a wise one indeed; you have stepped up to the plate nicely, and have furthered the RPMs considerably from where my efforts had taken them. I just wanted to mention that publicly to all those on these lists.

--
Lamar Owen
Director of Information Technology
Pisgah Astronomical Research Institute
1 PARI Drive
Rosman, NC 28772
(828)862-5554
www.pari.edu

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: Concurrent CREATE INDEX, try 2 (was Re: [HACKERS] Reducing
On Wed, 2005-12-07 at 13:36, Greg Stark wrote:
> Hannu Krosing [EMAIL PROTECTED] writes:
>>> But that said, realistically *any* solution has to obtain a lock at some time to make the schema change. I would say pretty much any O(1) (constant time) outage is at least somewhat acceptable as contrasted with the normal index build which locks out other writers for at least O(n lg n) time. Anything on the order of 100ms is probably as good as it gets here.
>> For me any delay less than the client timeout is acceptable and anything more than that is not. N sec is ok, N+1 is not. It's as simple as that.
> I don't think the client timeout is directly relevant here.

It is relevant. It is the ultimate check of success or failure :)

> If your client timeout is 20s and you take 19s, how many requests have queued up behind you? If you normally process requests in under 200ms and receive 10 requests per second (handling at least 2 simultaneously) then you now have 190 requests queued up.

Again, I'm handling 20 to 200 simultaneously quite nicely.

> Those requests take resources and will slow down your server. If they slow things down too much then you will start failing to meet your 200ms deadline.

If I can't meet the deadline, I've got a problem. The rest is implementation detail.

> It's more likely that your system is engineered to use queueing and simultaneous dispatch to deal with spikes in load up to a certain margin. Say you know it can deal with spikes in load of up to 2x the regular rate.

I know it can, just that the 3x spike lasts for 6 hours :P

> Then you can deal with service outage of up to the 200ms deadline. If you can deal with spikes of up to 4x the regular rate then you can deal with an outage of up to 600ms.

Small local fluctuations happen all the time. As a rule of thumb I want to stay below 50% of resource usage on average for any noticeable period and will start looking for code optimisations or additional hardware if this is crossed.
> Moreover even if you had the extra resources available to handle a 19s backlog of requests, how long would it take you to clear that backlog? If you have a narrow headroom on meeting the deadline in the first place, and now you have even less headroom because of the resources dedicated to the queue, it'll take you a long time to clear the backlog.

While it feels heroic to run at 90% capacity, it is not usually a good policy. All kinds of unforeseen stuff happens all the time - checkpoints, backups, vacuums, unexpected growth, system cronjobs, ... With too little headroom you are screwed anyway.

What I am aiming at with this CONCURRENT CREATE INDEX proposal is being no more disruptive than other stuff that keeps happening anyway. That would be the baseline. Anything better is definitely desirable, but should not be a stopper for implementing the baseline functionality.

- Hannu

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Replication on the backend
On Tue, Dec 06, 2005 at 12:35:43AM -0500, Jan Wieck wrote:
> We do not plan to implement replication inside the backend. Replication needs are so diverse that pluggable replication support makes a lot more sense. To me it even makes more sense than keeping transaction support outside of the database itself and adding it via a pluggable storage add-on.

And, as I say every single time this comes up, Oracle's and IBM's and MS's and everybody else's replication systems are _also_ add-ons. If you don't believe me, look at the license costs. You can get a system without it enabled, which means (by definition) it's a modular extension.

A

--
Andrew Sullivan | [EMAIL PROTECTED]
In the future this spectacle of the middle classes shocking the avant-garde will probably become the textbook definition of Postmodernism. --Brad Holland

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Foreign key trigger timing bug?
On Wednesday 07 December 2005 09:33, Bruce Momjian wrote:
> I had an open 8.1 item that was:
> o fix foreign trigger timing issue

Stephan Szabo had this to say to me when I was asking him about his progress on this issue a while back:

"There are some fundamental issues right now between before triggers and foreign keys based on how we act upon rows for the same statement that have been modified in the before trigger (which is to say that the outer statement does not act upon them)."

> Would someone supply text for a TODO entry on this, as I don't think we fixed it in 8.1.

No, it's not yet resolved.

--
Darcy Buskermolen
Wavefire Technologies Corp.
http://www.wavefire.com
ph: 250.717.0200 fx: 250.763.1759

---(end of broadcast)---
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
Re: [HACKERS] Foreign key trigger timing bug?
On Wed, 7 Dec 2005, Bruce Momjian wrote:
> I had an open 8.1 item that was:
> o fix foreign trigger timing issue
> Would someone supply text for a TODO entry on this, as I don't think we fixed it in 8.1.

I'd split this into two separate items now:

o Fix before delete triggers on cascaded deletes to run after the cascaded delete is done. This is odd, but seems to be what the spec requires.

o Fix problems with referential actions caused by before triggers that modify rows that would also be modified by the referential action. Right now this has a few symptoms: either you can get spurious-seeming errors from the constraint, or you can end up with invalid data in the referencing table.

As far as I can see, the spec doesn't have much to say about this because the spec doesn't seem to allow before triggers to modify tables.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Reducing contention for the LockMgrLock
We've suspected for awhile that once we'd fixed the buffer manager's use of a single global BufMgrLock, the next contention hotspot would be the lock manager's LockMgrLock. I've now seen actual evidence of that in profiling pgbench: using a modified backend that counts LWLock-related wait operations, the LockMgrLock is responsible for an order of magnitude more blockages than the next highest LWLock:

PID 12971 lwlock LockMgrLock: shacq 0 exacq 50630 blk 3354
PID 12979 lwlock LockMgrLock: shacq 0 exacq 49706 blk 3323
PID 12976 lwlock LockMgrLock: shacq 0 exacq 50567 blk 3304
PID 12962 lwlock LockMgrLock: shacq 0 exacq 50635 blk 3278
PID 12974 lwlock LockMgrLock: shacq 0 exacq 50599 blk 3251
PID 12972 lwlock LockMgrLock: shacq 0 exacq 50204 blk 3243
PID 12973 lwlock LockMgrLock: shacq 0 exacq 50321 blk 3200
PID 12978 lwlock LockMgrLock: shacq 0 exacq 50266 blk 3177
PID 12977 lwlock LockMgrLock: shacq 0 exacq 50379 blk 3148
PID 12975 lwlock LockMgrLock: shacq 0 exacq 49790 blk 3124
PID 12971 lwlock WALInsertLock: shacq 0 exacq 24022 blk 408
PID 12972 lwlock WALInsertLock: shacq 0 exacq 24021 blk 393
PID 12976 lwlock WALInsertLock: shacq 0 exacq 24017 blk 390
PID 12977 lwlock WALInsertLock: shacq 0 exacq 24021 blk 388
PID 12973 lwlock WALInsertLock: shacq 0 exacq 24018 blk 379
PID 12962 lwlock WALInsertLock: shacq 0 exacq 24024 blk 377
PID 12974 lwlock WALInsertLock: shacq 0 exacq 24016 blk 367
PID 12975 lwlock WALInsertLock: shacq 0 exacq 24021 blk 366
PID 12978 lwlock WALInsertLock: shacq 0 exacq 24023 blk 354
PID 12979 lwlock WALInsertLock: shacq 0 exacq 24033 blk 321
PID 12973 lwlock ProcArrayLock: shacq 45214 exacq 6003 blk 241
PID 12971 lwlock ProcArrayLock: shacq 45355 exacq 6003 blk 225
(etc)

We had also seen evidence to this effect from OSDL:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00365.php

So it seems it's time to start thinking about how to reduce contention for the LockMgrLock.
There are no interesting read-only operations on the shared lock table, so there doesn't seem to be any traction to be gained by making some operations take just shared access to the LockMgrLock.

The best idea I've come up with after a bit of thought is to replace the shared lock table with N independent tables representing partitions of the lock space. Each lock would be assigned to one of these partitions based on, say, a hash of its LOCKTAG. I'm envisioning N of 16 or so to achieve (hopefully) about an order-of-magnitude reduction of contention. There would be a separate LWLock guarding each partition; the LWLock for a given partition would be considered to protect the LOCK objects assigned to that partition, all the PROCLOCK objects associated with each such LOCK, and the shared-memory hash tables holding these objects (each partition would need its own hash tables). A PGPROC's lock-related fields are only interesting when it is waiting for a lock, so we could say that the LWLock for the partition containing the lock it is waiting for must be held to examine/change these fields.

The per-PGPROC list of all PROCLOCKs belonging to that PGPROC is a bit tricky to handle since it necessarily spans across partitions. We might be able to deal with this with suitable rules about when the list can be touched, but I've not worked this out in detail. Another possibility is to break this list apart into N lists, one per partition, but that would bloat the PGPROC array a bit, especially if we wanted larger N.

The basic LockAcquire and LockRelease operations would only need to acquire the LWLock for the partition containing the lock they are interested in; this is what gives us the contention reduction. LockReleaseAll is also interesting from a performance point of view, since it executes at every transaction exit.
If we divide PGPROC's PROCLOCK list into N lists then it will be very easy for LockReleaseAll to take only the partition locks it actually needs to release these locks; if not, we might have to resort to scanning the list N times, once while we hold the LWLock for each partition.

I think that CheckDeadLock will probably require taking all the partition LWLocks (as long as it does this in a predetermined order there is no risk of deadlock on the partition LWLocks). But one hopes this is not a performance-critical operation. Ditto for GetLockStatusData.

One objection I can see to this idea is that having N lock hash tables instead of one will eat a larger amount of shared memory in hashtable overhead. But the lock hashtables are fairly small relative to the shared buffer array (given typical configuration parameters) so this doesn't seem like a major problem.

Another objection is that LockReleaseAll will get slower (since it will certainly call LWLockAcquire/Release more times) and in situations that aren't heavily concurrent there won't be any compensating gain. I think this won't be a significant effect, but there's probably no way to tell for sure without actually writing the code and
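The partitioning scheme described above can be sketched in miniature (hypothetical Python, not the backend's C code): hash the lock tag to pick one of N independent tables, each guarded by its own lock, so acquisitions of unrelated locks rarely contend on the same guard.

```python
import threading

# Miniature sketch of a partitioned lock-manager table (hypothetical Python,
# not PostgreSQL internals): N independent hash tables, each with its own
# guard, selected by a hash of the lock tag.

N_PARTITIONS = 16

class PartitionedLockTable:
    def __init__(self):
        self._guards = [threading.Lock() for _ in range(N_PARTITIONS)]
        self._tables = [{} for _ in range(N_PARTITIONS)]  # locktag -> hold count

    def _partition(self, locktag) -> int:
        return hash(locktag) % N_PARTITIONS

    def acquire(self, locktag) -> None:
        p = self._partition(locktag)
        with self._guards[p]:  # only this partition is serialized
            self._tables[p][locktag] = self._tables[p].get(locktag, 0) + 1

    def release(self, locktag) -> None:
        p = self._partition(locktag)
        with self._guards[p]:
            self._tables[p][locktag] -= 1
            if self._tables[p][locktag] == 0:
                del self._tables[p][locktag]

    def release_all(self, locktags) -> None:
        # Group by partition so each guard is taken once -- the cheap case
        # described above for LockReleaseAll with per-partition lists.
        by_part = {}
        for tag in locktags:
            by_part.setdefault(self._partition(tag), []).append(tag)
        for p, tags in by_part.items():
            with self._guards[p]:
                for tag in tags:
                    self._tables[p][tag] -= 1
                    if self._tables[p][tag] == 0:
                        del self._tables[p][tag]
```

An operation like deadlock checking that must see the whole lock space would take all N guards, always in index order to avoid deadlocking on the guards themselves, mirroring the CheckDeadLock point above.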
Re: [HACKERS] Reducing contention for the LockMgrLock
Tom,

This would also explain some things we've seen during benchmarking here at EnterpriseDB. I like your idea and, as I'm on my way out, will think about it a bit tonight.

Similarly, I don't see any forward-looking reason for keeping the separate hash tables used for the LockMethodIds. Or, it may just be that I haven't looked closely enough at what the differences are.

-Jonah

On 12/7/05, Tom Lane [EMAIL PROTECTED] wrote:
> We've suspected for awhile that once we'd fixed the buffer manager's use of a single global BufMgrLock, the next contention hotspot would be the lock manager's LockMgrLock. [...]
what gives us the contention reduction. LockReleaseAll is also interesting from a performance point of view,since it executes at every transaction exit.If we divide PGPROC'sPROCLOCK list into N lists then it will be very easy for LockReleaseAll to take only the partition locks it actually needs to release these locks;if not, we might have to resort to scanning the list N times, once whilewe hold the LWLock for each partition.I think that CheckDeadLock will probably require taking all the partition LWLocks (as long as it does this in a predetermined order there is no riskof deadlock on the partition LWLocks).But one hopes this is not aperformance-critical operation.Ditto for GetLockStatusData. One objection I can see to this idea is that having N lock hash tablesinstead of one will eat a larger amount of shared memory in hashtableoverhead.But the lock hashtables are fairly small relative to theshared buffer array (given typical configuration parameters) so this doesn't seem like a
Re: [HACKERS] Reducing contention for the LockMgrLock
On Wed, 2005-12-07 at 16:59 -0500, Tom Lane wrote: I've now seen actual evidence of that in profiling pgbench: using a modified backend that counts LWLock-related wait operations, So it seems it's time to start thinking about how to reduce contention for the LockMgrLock You're right to be following up this thought. My concern, longer term, is our ability to determine contention issues in an agreed way. I've long been thinking about wait-time measurement - I think it's the only way to proceed. There's always a next bottleneck, so I'd like to first agree the diagnostic probes so we can decide how to determine that. That way we can all work on solutions for various workloads, and prove that they work in those cases. My view would be that the LockMgrLock is not relevant for all workloads, but I want even more to be able to discuss whether it is, or is not, on an accepted basis before discussions begin. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Reducing contention for the LockMgrLock
Tom Lane wrote: Interesting proposal.

LockReleaseAll is also interesting from a performance point of view, since it executes at every transaction exit. If we divide PGPROC's PROCLOCK list into N lists then it will be very easy for LockReleaseAll to take only the partition locks it actually needs to release these locks; if not, we might have to resort to scanning the list N times, once while we hold the LWLock for each partition.

On the other hand, each scan would be shorter than the previous one; and it's not necessary to hold each and every partition's LWLock, only the one found in the first entry of the list on each scan, until the list is empty. So it's N scans only in the worst case of a PGPROC holding locks on all partitions.

One objection I can see to this idea is that having N lock hash tables instead of one will eat a larger amount of shared memory in hashtable overhead. But the lock hashtables are fairly small relative to the shared buffer array (given typical configuration parameters) so this doesn't seem like a major problem.

Is hashtable overhead all that large? Each table could be made initially size-of-current-table/N entries. One problem is that currently the memory freed from a hashtable is not put back into shmem freespace, is it?

While at it, I'm inclined to get rid of the current assumption that there are logically separate hash tables for different LockMethodIds. AFAICS all that's doing for us is creating a level of confusion; there's nothing on the horizon suggesting we'd ever actually make use of the flexibility.

Yeah, please.

-- Alvaro Herrera    http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] row is too big: size 8916, maximum size 8136
On 12/6/2005 9:03 PM, Euler Taveira de Oliveira wrote: Hi, I'm doing some tests with a 700-column table. But when I try to load some data with INSERT or COPY I get that message. I verified that BLCKSZ is limiting the tuple size, but I don't have a clue why it's not using TOAST. I'm using PostgreSQL 8.0.3 on a Slackware 10.1 box. Let me know if you want a test case or other useful information. The external reference of a toasted attribute is 20 bytes in size. I might be wrong, but at 700 columns you have to ask your developers some serious questions about their qualification. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] HOOKS for Synchronous Replication?
Anyone remember this patch? http://gorda.di.uminho.pt/community/pgsqlhooks/ The discussion seems to be pretty minimal: http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php Does anyone see a need to investigate it further? Chris ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Reducing contention for the LockMgrLock
Alvaro Herrera [EMAIL PROTECTED] writes: Is hashtable overhead all that large? Each table could be made initially size-of-current-table/N entries. One problem is that currently the memory freed from a hashtable is not put back into shmem freespace, is it? Yeah; the problem is mainly that we'd have to allocate extra space to allow for unevenness of usage across the multiple hashtables. It's hard to judge how large the effect would be without testing, but I think that this problem would inhibit us from having dozens or hundreds of separate partitions. A possible response is to try to improve dynahash.c to make its memory management more flexible, but I'd prefer not to get into that unless it becomes really necessary. A shared freespace pool would create a contention bottleneck of its own... regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Reducing contention for the LockMgrLock
Simon Riggs [EMAIL PROTECTED] writes: My view would be that the LockMgrLock is not relevant for all workloads, but I want even more to be able to discuss whether it is, or is not, on an accepted basis before discussions begin. Certainly. I showed the evidence that it is currently a significant problem for pgbench-like workloads, but pgbench is of course not representative of everything. My feeling about it is that different workloads are going to expose different weak spots, and so as long as a given test case isn't obviously artificial, whatever bottleneck it exposes is fair game to work on. pgbench seems reasonably representative of a class of applications with relatively short transactions, so I don't doubt that if pgbench has a problem with LockMgrLock contention, there are real-world cases out there that do too. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] HOOKS for Synchronous Replication?
Christopher Kings-Lynne [EMAIL PROTECTED] writes: Anyone remember this patch? http://gorda.di.uminho.pt/community/pgsqlhooks/ The discussion seems to be pretty minimal: http://archives.postgresql.org/pgsql-hackers/2005-06/msg00859.php Does anyone see a need to investigate it further? I had hoped to see some comments from the Slony people about it. I'd feel better about the validity of a set of hooks if more than one project agreed that it was useful/appropriate ... regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Vertical Partitioning with TOAST
This seems like a useful feature to add, allowing for easy built-in vertical partitioning. Are there issues with the patch as-is? (Other than it probably should have gone to -patches...)

On Thu, Dec 01, 2005 at 05:59:08PM +0900, Junji TERAMOTO wrote: Hi all, I wrote an experimental patch for a vertical partitioning function. I decided to use the code of TOAST to create the function easily. In a word, the columns that the user specifies are forcibly driven out with TOAST. A performance gain of 10% was seen by driving out c_data of the customer table in the DBT-2 benchmark in our environment. The mechanism of TOAST is overdesigned for use as vertical partitioning; because the processing overhead is large, performance might drop depending on the environment. There are seriously a lot of things that should be considered if vertical partitioning is implemented. For instance, the TOAST index is omitted, and ctid is used for the link. Your comments are welcome. Thanks.

--- How To Use ---
Use the ALTER TABLE command. http://www.postgresql.org/docs/8.1/static/sql-altertable.html

ALTER TABLE name ALTER COLUMN column SET STORAGE FORCEEXTERNAL;

I do not understand whether FORCEEXTERNAL is an appropriate word. Please teach when there is a better word...

-- Junji Teramoto

diff -purN postgresql-8.1.0.org/src/backend/access/heap/heapam.c postgresql-8.1.0/src/backend/access/heap/heapam.c
--- postgresql-8.1.0.org/src/backend/access/heap/heapam.c	2005-10-15 11:49:08.0 +0900
+++ postgresql-8.1.0/src/backend/access/heap/heapam.c	2005-12-01 15:31:38.307713257 +0900
@@ -1070,6 +1070,36 @@ heap_get_latest_tid(Relation relation,
 	}							/* end of loop */
 }
 
+// Add by junji from here
+/*
+ * has_rel_forceexternal - Is there SET STORAGE FORCEEXTERNALed rows?
+ */
+bool
+has_rel_forceexternal(Relation relation)
+{
+	TupleDesc	tupleDesc;
+	Form_pg_attribute *att;
+	int			numAttrs;
+	int			i;
+
+	/*
+	 * Get the tuple descriptor and break down the tuple(s) into fields.
+	 */
+	tupleDesc = relation->rd_att;
+	att = tupleDesc->attrs;
+	numAttrs = tupleDesc->natts;
+
+	for (i = 0; i < numAttrs; i++)
+	{
+		if (att[i]->attstorage == 'f')
+			return true;
+	}
+
+	return false;
+}
+// Add by junji to here
+
+
 /*
  * heap_insert - insert tuple into a heap
  *
@@ -1130,6 +1160,9 @@ heap_insert(Relation relation, HeapTuple
 	 * out-of-line attributes from some other relation, invoke the toaster.
 	 */
 	if (HeapTupleHasExternal(tup) ||
+// Add by junji from here
+		(has_rel_forceexternal(relation)) ||
+// Add by junji to here
 		(MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD))
 		heap_tuple_toast_attrs(relation, tup, NULL);
 
@@ -1762,6 +1795,9 @@ l2:
 	 */
 	need_toast = (HeapTupleHasExternal(oldtup) ||
 				  HeapTupleHasExternal(newtup) ||
+// Add by junji from here
+				  (has_rel_forceexternal(relation)) ||
+// Add by junji to here
 				  (MAXALIGN(newtup->t_len) > TOAST_TUPLE_THRESHOLD));
 
 	newtupsize = MAXALIGN(newtup->t_len);
diff -purN postgresql-8.1.0.org/src/backend/access/heap/tuptoaster.c postgresql-8.1.0/src/backend/access/heap/tuptoaster.c
--- postgresql-8.1.0.org/src/backend/access/heap/tuptoaster.c	2005-10-15 11:49:09.0 +0900
+++ postgresql-8.1.0/src/backend/access/heap/tuptoaster.c	2005-12-01 15:29:29.722579466 +0900
@@ -512,6 +512,46 @@ toast_insert_or_update(Relation rel, Hea
 	}
 }
 
+// Add by junji from here
+	/*
+	 * We look for attributes of attstorage 'f'.
+	 */
+	if (rel->rd_rel->reltoastrelid != InvalidOid)
+	{
+		Datum		old_value;
+
+		/*--
+		 * Search for the biggest yet inlined attribute with
+		 * attstorage equals 'x' or 'e'
+		 *--
+		 */
+		for (i = 0; i < numAttrs; i++)
+		{
+			if (toast_action[i] == 'p')
+				continue;
+			if (VARATT_IS_EXTERNAL(toast_values[i]))
+				continue;
+			if (att[i]->attstorage != 'f')
+				continue;
+
+			/*
+			 * Store this external
+			 */
+			old_value = toast_values[i];
+			toast_action[i] = 'p';
+			toast_values[i] = toast_save_datum(rel, toast_values[i]);
+			if (toast_free[i])
+				pfree(DatumGetPointer(old_value));
+
Re: [HACKERS] generalizing the planner knobs
On Thu, Dec 01, 2005 at 12:32:12PM -0500, Qingqing Zhou wrote: Neil Conway [EMAIL PROTECTED] wrote This would also be useful when diagnosing bad query plans: for example, setting enable_seqscan=false often causes the planner to disregard the use of *any* sequential scan, anywhere in the plan. The ability to slightly bump up the cost of particular operations would allow more alternative plans to be examined. This method also has the problem of enable_seqscan=false in some situations. I would vote we implement the final general solution, like query plan hints, directly. BTW, there's another end to the 'enable_seqscan=false' problem... it sometimes doesn't work! Last I looked, enable_seqscan=false only added a fixed overhead cost to a seqscan (100 IIRC). The problem is, some queries will produce estimates for other methods that are more expensive than a seqscan even with the added burden. If enable_seqscan=false multiplied the seqscan's cost by some factor instead of adding a fixed amount, this situation would probably be impossible. (And before someone asks, no, I don't remember which query was actually faster...) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com    work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Vertical Partitioning with TOAST
Jim C. Nasby [EMAIL PROTECTED] writes: This seems like a useful feature to add, allowing for easy built-in verticle partitioning. Are there issues with the patch as-is? Other than the ones mentioned by the poster? It seemed to me more like a not-too-successful experiment than something ready for application. If you take the viewpoint that this is just another TOAST storage strategy, I think it's pretty useless. A large field value is going to get toasted anyway with the regular strategy, and if your column happens to contain some values that are not large, forcing them out-of-line anyway is simply silly. (You could make a case for making the threshold size user-controllable, but I don't see the case for setting the threshold to zero, which is what this amounts to.) The poster was not actually suggesting applying it in the form of a force-external TOAST strategy; he was using this as a prototype to try to interest people in the idea of out-of-line storage mechanisms with lower overhead than TOAST. But that part is all speculation not code. Personally, I'd rather look into whether we couldn't speed up TOAST without changing any of its basic assumptions. The current implementation isn't awful, but it was built to allow the existing table and index mechanisms to be re-used for TOAST data. Now that we know for certain TOAST is a good idea, it would be reasonable to take a second look at whether we could improve the performance with another round of implementation effort. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Reducing relation locking overhead
On Fri, Dec 02, 2005 at 03:25:58PM -0500, Greg Stark wrote: Postgres would have no trouble building an index of the existing data using only shared locks. The problem is that any newly inserted (or updated) records could be missing from such an index. To do it you would then have to gather up all those newly inserted records. And of course while you're doing that new records could be inserted. And so on. There's no guarantee it would ever finish, though I suppose you could detect the situation if the size of the new batch wasn't converging to 0 and throw an error. Why throw an error? Just grab a lock that would prevent any new inserts from occurring. Or at least make that an option. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com    work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Reducing relation locking overhead
On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: Tom Lane [EMAIL PROTECTED] writes: What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long. I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read. What about keeping a separate list of new tuples? Obviously we'd only do this when an index was being built on a table. Since it would probably be problematic and expensive to check for this every time you accessed a table, it would make sense to check only at the start of a transaction and have an index build wait until all running transactions knew that an index build was going to happen. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com    work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Reducing relation locking overhead
On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote: On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: Tom Lane [EMAIL PROTECTED] writes: What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long. I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read. What about keeping a separate list of new tuples? Obviously we'd only do this when an index was being built on a table. The problem with a separate list is that it can be huge. For example, on a table with 200 inserts/updates per second, an index build lasting 6 hours would accumulate a total of 6*3600*200 = 4,320,000 new tuples. Hannu ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Reducing relation locking overhead
On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote: On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote: On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: Tom Lane [EMAIL PROTECTED] writes: What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long. I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read. What about keeping a separate list of new tuples? Obviously we'd only do this when an index was being built on a table. The problem with a separate list is that it can be huge. For example, on a table with 200 inserts/updates per second, an index build lasting 6 hours would accumulate a total of 6*3600*200 = 4,320,000 new tuples. Sure, but it's unlikely that such a table would be very wide, so 4.3M tuples would probably only amount to a few hundred MB of data. It's also possible that this list could be vacuumed by whatever the regular vacuum process is for the table. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com    work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [GENERAL] 8.1, OID's and plpgsql
On Wed, Dec 07, 2005 at 12:06:23AM -0500, Tom Lane wrote: Greg Stark [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] writes: Rather than hard-wiring a special case for any of these things, I'd much rather see us implement INSERT...RETURNING and UPDATE...RETURNING as per previous suggestions. I wonder whether the UI tools need anything more low-level than that. In general sticking their grubby fingers in the query the user entered seems wrong and they would have to tack on a RETURNING clause. That was mentioned before as a possible objection, but I'm not sure that I buy it. The argument seems to be that a client-side driver would understand the query and table structure well enough to know what to do with a returned pkey value, but not well enough to understand how to tack on a RETURNING clause to request that value. This seems a bit bogus. There may be some point in implementing a protocol-level equivalent of RETURNING just to reduce the overhead on both sides, but I think we ought to get the RETURNING functionality in place first and then worry about that... Along those lines, I don't see anything on the TODO list about this... -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com    work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Vertical Partitioning with TOAST
On Thu, Dec 08, 2005 at 12:59:47AM -0500, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: This seems like a useful feature to add, allowing for easy built-in vertical partitioning. Are there issues with the patch as-is? Other than the ones mentioned by the poster? It seemed to me more like a not-too-successful experiment than something ready for application. If you take the viewpoint that this is just another TOAST storage strategy, I think it's pretty useless. A large field value is going to get toasted anyway with the regular strategy, and if your column happens to contain some values that are not large, forcing them out-of-line anyway is simply silly. (You could make a case for making the threshold size user-controllable, but I don't see the case for setting the threshold to zero, which is what this amounts to.) Valid point. I do think there's a lot of benefit to being able to set the limit much lower than what it currently defaults to today. We have a client that has a queue-type table that is updated very frequently. One of the fields is text that is not updated as frequently. Keeping this table vacuumed well enough has proven to be problematic, because any delay to vacuuming quickly results in a very large amount of bloat. Moving that text field into a separate table would most likely be a win. Presumably this would need to be settable on at least a per-table basis. Would adding such a variable be a good beginner TODO, or is it too invasive? Personally, I'd rather look into whether we couldn't speed up TOAST without changing any of its basic assumptions. The current implementation isn't awful, but it was built to allow the existing table and index mechanisms to be re-used for TOAST data. Now that we know for certain TOAST is a good idea, it would be reasonable to take a second look at whether we could improve the performance with another round of implementation effort.
I've often wondered about all the overhead of storing toast data in what amounts to a regular table. Sounds like another TODO... -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com    work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match