Re: [HACKERS] Transient plans versus the SPI API
On Wed, Aug 3, 2011 at 8:33 PM, Tom Lane wrote:
> Simon Riggs writes:
>> I think its possible to tell automatically whether we need to replan
>> always or not based upon the path we take through selectivity
>> functions.
>
> I don't really believe that, or at least I think it would only detect a
> few cases. Examples of parameter-value-sensitive decisions that are
> made nowhere near the selectivity functions are constraint exclusion and
> LIKE pattern to index-qual conversion. And in none of these cases do we
> really know at the bottom level whether a different parameter value will
> lead to a significant change in the finished plan. For instance, if
> there's no index for column foo, it is a waste of time to force
> replanning just because we have varying selectivity estimates for
> "WHERE foo > $1".
>
> I think we'll be a lot better off with the framework discussed last
> year: build a generic plan, as well as custom plans for the first few
> sets of parameter values, and then observe whether there's a significant
> reduction in estimated costs for the custom plans.

The problem there is deciding which executions we build custom plans for. That turns the problem into a sampling issue, and you'll only fix the problems that occur with a frequency that matches your sampling pattern and rate.

Examples of situations where it won't help:

* Plans that vary by table size will be about the same in the first 5 executions; after a large number of executions, things go bad.

* Text search where the parameter is provided by user input: sensible requests have low selectivities, but some users put in something very common, such as "e", and then we try to retrieve the whole table by index scan. It is almost impossible to prevent all potentially high-selectivity inputs from users. We could add LIMIT, but frequently ORM-generated queries do not do that.

This isn't my-way-or-your-way. I think we need to look at some form of "safety barriers", so that we generate a plan but also know when the plan has outlived its usefulness, and force a re-plan.
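The framework Tom describes can be sketched as a toy model (this is purely illustrative, not the actual plancache code; the class name, trial count, and savings threshold are all invented here):

```python
class PlanCache:
    """Toy model of the generic-vs-custom plan decision for one
    prepared statement. Names and thresholds are invented."""

    CUSTOM_TRIALS = 5          # executions that get a custom plan first
    SAVINGS_THRESHOLD = 0.9    # keep replanning if custom is >10% cheaper

    def __init__(self, generic_cost):
        self.generic_cost = generic_cost
        self.custom_costs = []

    def choose_plan(self, custom_cost_for_params):
        # First few executions: always build a custom plan and record
        # its estimated cost.
        if len(self.custom_costs) < self.CUSTOM_TRIALS:
            self.custom_costs.append(custom_cost_for_params)
            return "custom"
        # Afterwards: replan only if the sampled custom plans were
        # significantly cheaper than the generic plan.
        avg_custom = sum(self.custom_costs) / len(self.custom_costs)
        if avg_custom < self.generic_cost * self.SAVINGS_THRESHOLD:
            return "custom"
        return "generic"
```

Simon's objection is visible directly in this sketch: once the first CUSTOM_TRIALS executions happen to look alike, the cache settles on the generic plan, and parameter values (or table sizes) that arrive later never trigger a replan.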
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
Peter Geoghegan writes:
> On 3 August 2011 21:03, Tom Lane wrote:
>> I mean that it's unclear what you'll get if status has a bitpattern
>> equivalent to a negative integer. If the compiler implements the
>> comparison as signed, the test will yield TRUE; if unsigned, it's FALSE.
>
> On compilers on which the enum value is represented as an unsigned
> int, passing -1 is just the same as doing that with any function with
> an unsigned int argument for that argument - it will convert to a
> large unsigned value. So sure, that isolated part of the test's
> outcome will vary depending on whether or not the compiler opts to
> represent the enum as signed when it can, but I don't look at it as
> you do. I simply consider that to be a violation of the enum's
> contract, or the caller's failure to consider that the enum couldn't
> represent -1 -- they got what they asked for.

This argument is completely missing the point of the test, which is to verify whether or not the caller adhered to the enum's contract. You can *not* assume that he did while arguing about whether the test is valid or accomplishes its goals.

regards, tom lane
Re: [HACKERS] Further news on Clang - spurious warnings
On 04.08.2011 04:21, David Fetter wrote:
> On Wed, Aug 03, 2011 at 01:40:42PM +0300, Heikki Linnakangas wrote:
>> On 03.08.2011 13:05, Peter Geoghegan wrote:
>>> I don't believe that the standard allows for an implementation of
>>> enums as unsigned integers - after all, individual enum literals can
>>> be given corresponding negative integer values.
>>
>> C99 says:
>>
>>> Each enumerated type shall be compatible with char, a signed integer
>>> type, or an unsigned integer type. The choice of type is
>>> implementation-defined, but shall be capable of representing the
>>> values of all the members of the enumeration.
>
> Are we moving to C99? C89 says:
>
> Each enumerated type shall be compatible with an integer type; the
> choice of type is implementation-defined.

Well, that's the same thing, just in less explicit words.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 6:21 PM, Jim Nasby wrote: > On Aug 3, 2011, at 1:21 PM, Robert Haas wrote: >> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache >> because PostgreSQL protects its free-list with a single lock and thus >> scales poorly with smaller caches." This is a complaint about >> BufFreeList lock which, in fact, I've seen as a huge point of >> contention on some workloads. In fact, on read-only workloads, with >> my lazy vxid lock patch applied, this is, I believe, the only >> remaining unpartitioned LWLock that is ever taken in exclusive mode; >> or at least the only one that's taken anywhere near often enough to >> matter. I think we're going to do something about this, although I >> don't have a specific idea in mind at the moment. > > This has been discussed before: > http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which > itself references 2 other threads). > > The basic idea is: have a background process that proactively moves buffers > onto the free list so that backends should normally never have to run the > clock sweep (which is rather expensive). The challenge there is figuring out > how to get stuff onto the free list with minimal locking impact. I think one > possible option would be to put the freelist under it's own lock (IIRC we > currently use it to protect the clock sweep as well). Of course, that still > means the free list lock could be a point of contention, but presumably it's > far faster to add or remove something from the list than it is to run the > clock sweep. Based on recent benchmarking, I'm going to say "no". It doesn't seem to matter how short you make the critical section: a single program-wide mutex is a loser. Furthermore, the "free list" is a joke, because it's nearly always going to be completely empty. We could probably just rip that out and use the clock sweep and never miss it, but I doubt it would improve performance much. 
I think what we probably need to do is have multiple clock sweeps in progress at the same time. So, for example, if you have 8GB of shared_buffers, you might have 8 mutexes, one for each GB. When a process wants a buffer, it locks one of the mutexes and sweeps through that 1GB partition. If it finds a buffer before returning to the point at which it started the scan, it's done. Otherwise, it releases its mutex, grabs the next one, and continues on until it finds a free buffer.

The trick with any modification in this area is that pretty much any degree of increased parallelism is potentially going to reduce the quality of buffer replacement to some degree. So the trick will be to try to squeeze out as much concurrency as possible while minimizing degradation in the quality of buffer replacements.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
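The partitioned-sweep idea above can be sketched as follows (a toy simulation only: the real clock sweep lives in PostgreSQL's buffer manager and is written in C; the class, partition layout, and single-decrement usage counts here are simplifications invented for illustration):

```python
import threading

class PartitionedClockSweep:
    """Toy partitioned clock sweep: one lock and one sweep hand per
    partition, so concurrent backends usually contend on different locks."""

    def __init__(self, buffers_per_partition, npartitions):
        self.n = npartitions
        self.size = buffers_per_partition
        self.locks = [threading.Lock() for _ in range(npartitions)]
        self.hands = [0] * npartitions
        # usage_count > 0 means "recently used"; 0 means evictable
        self.usage = [[1] * buffers_per_partition for _ in range(npartitions)]

    def get_buffer(self, start_partition=0):
        for i in range(self.n):
            p = (start_partition + i) % self.n
            with self.locks[p]:
                for _ in range(self.size):
                    h = self.hands[p]
                    self.hands[p] = (h + 1) % self.size
                    if self.usage[p][h] == 0:
                        return (p, h)        # found a victim here
                    self.usage[p][h] -= 1    # decrement and keep sweeping
            # partition exhausted without a hit: drop its lock, try the next
        return None  # one full pass found nothing (a real sweep would loop)
```

The quality-of-replacement worry is also visible here: each partition ages its buffers independently, so a heavily-hit partition can evict pages a global sweep would have kept.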
Re: [HACKERS] Compressing the AFTER TRIGGER queue
Excerpts from Jim Nasby's message of mié ago 03 18:05:21 -0400 2011: > Not sure how much this relates to this discussion, but I have often wished we > had AFTER FOR EACH STATEMENT triggers that provided OLD and NEW recordsets > you could make use of. Sometimes it's very valuably to be able to look at > *all* the rows that changed in a transaction in one shot. I have a mind to implement this sometime. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Further news on Clang - spurious warnings
On Wed, Aug 03, 2011 at 01:40:42PM +0300, Heikki Linnakangas wrote:
> On 03.08.2011 13:05, Peter Geoghegan wrote:
>> I don't believe that the standard allows for an implementation of
>> enums as unsigned integers - after all, individual enum literals can
>> be given corresponding negative integer values.
>
> C99 says:
>
>> Each enumerated type shall be compatible with char, a signed integer
>> type, or an unsigned integer type. The choice of type is
>> implementation-defined, but shall be capable of representing the
>> values of all the members of the enumeration.

Are we moving to C99? C89 says:

Each enumerated type shall be compatible with an integer type; the choice of type is implementation-defined.

Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Compressing the AFTER TRIGGER queue
On Wed, Aug 3, 2011 at 6:05 PM, Jim Nasby wrote: > Not sure how much this relates to this discussion, but I have often wished we > had AFTER FOR EACH STATEMENT triggers that provided OLD and NEW recordsets > you could make use of. Sometimes it's very valuably to be able to look at > *all* the rows that changed in a transaction in one shot. Yeah, that would be awesome. I think some of our competitors provide exactly that feature... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 5:35 PM, Tom Lane wrote: > That still seems utterly astonishing to me. We're touching each of > those files once per query cycle; a cycle that contains two message > sends, who knows how many internal spinlock/lwlock/heavyweightlock > acquisitions inside Postgres (some of which *do* contend with each > other), and a not insignificant amount of plain old computing. > Meanwhile, this particular spinlock inside the kernel is protecting > what, a single doubleword fetch? How is that the bottleneck? Spinlocks seem to have a very ugly "tipping point". When I tested pgbench -S on a 64-core system with the lazy vxid patch applied and a patch to use random_r() in lieu of random, the amount of system time used per SELECT-only transaction at 48 clients was 3.59 times as much as it was at 4 clients. And the amount used per transaction at 52 clients was 3.63 times the amount used per transaction at 48 clients. And the amount used at 56 clients was 3.25 times the amount used at 52 clients. You can see the throughput graph starting to flatten out in the 32-44 client range, but it's not particularly alarming. However, once you pass that point things rapidly get totally out of control in a real hurry. A few more clients and the machine is basically doing nothing but spin. > I am wondering whether kernel spinlocks are broken. I don't think so. Stefan Kaltenbrunner had one profile where he showed something like sixty or eighty percent of the usermode CPU time in s_lock. I didn't have access to that particular hardware, but the testing I've done strongly suggests that most of that was the SInvalReadLock spinlock. And before I patched pgbench to avoid calling random(), that was doing the same thing - literally flattening a 64-core box fighting over a single futex that normally costs almost nothing. (That one wasn't quite as bad because the futex actually deschedules the waiters, but it was still bad.) 
I'm actually not really sure why it shakes out this way (birthday paradox?) but having seen the effect several times now, I'm disinclined to believe it's an artifact. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Incremental checkpoints
I did not explain well what I had in mind in my first message. The main goal is for more buffers to stay dirty in memory for a longer time, so checkpoint_segments has to be 2, 3... times higher than in the current approach. And a separate parameter can control how many buffers to write at once. The DBA can then tune:

- checkpoint_segments - a higher number for fewer writes but more time for crash recovery;
- (how many buffers to write at once) - more for throughput, less for latency.

> I think what we'd need to track is the LSN that first dirtied the page
> (as opposed to the current field, which tracks the LSN that most
> recently wrote the page). If we write and flush all pages whose
> first-dirtied LSN precedes some cutoff point, then we ought to be able
> to advance the redo pointer to that point.

Also, if the page is written out by a backend, LSN_first_dirtied has to be cleared.

I believe (though I cannot prove it) that this is worth testing, and that the advantage will be noticeable.

Jordan Ivanov
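The first-dirtied-LSN bookkeeping being discussed can be sketched like this (a toy model; the field and function names are invented, not PostgreSQL's actual buffer-header fields):

```python
class Buffer:
    """Toy buffer header carrying the proposed first-dirtied LSN."""
    def __init__(self):
        self.dirty = False
        self.first_dirty_lsn = None

def mark_dirty(buf, lsn):
    # Only the *first* write after the page becomes dirty sets the mark;
    # later writes to an already-dirty page leave it alone.
    if not buf.dirty:
        buf.dirty = True
        buf.first_dirty_lsn = lsn

def write_out(buf):
    # Whether the checkpointer or a backend writes the page out,
    # the mark has to be cleared.
    buf.dirty = False
    buf.first_dirty_lsn = None

def safe_redo_pointer(buffers, current_lsn):
    """The redo pointer can advance up to the oldest first-dirtied LSN
    among the pages that are still dirty."""
    pending = [b.first_dirty_lsn for b in buffers if b.dirty]
    return min(pending) if pending else current_lsn
```

Writing out the page with the oldest mark is exactly what lets the redo pointer move forward incrementally, instead of only at checkpoint completion.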
Re: [HACKERS] Further news on Clang - spurious warnings
On 3 August 2011 21:03, Tom Lane wrote:
> I mean that it's unclear what you'll get if status has a bitpattern
> equivalent to a negative integer. If the compiler implements the
> comparison as signed, the test will yield TRUE; if unsigned, it's FALSE.

On compilers on which the enum value is represented as an unsigned int, passing -1 is just the same as doing so with any function that has an unsigned int argument in that position - it will convert to a large unsigned value. So sure, that isolated part of the test's outcome will vary depending on whether or not the compiler opts to represent the enum as signed when it can, but I don't look at it as you do. I simply consider that to be a violation of the enum's contract, or the caller's failure to consider that the enum couldn't represent -1 -- they got what they asked for.

To put it another way, you could say that the first part of the test only exists for the benefit of compilers that treat all enums as signed. Clang recognised that redundancy for its unsigned case, and complained. It doesn't make sense to me to look at that one part of the test in isolation, though. With my patch, the test as a whole does its job (in an identical fashion to before, in fact).

Come to think of it, you could argue that the warning is invalid on the basis that the enum being unsigned is implementation-defined, and therefore the first part of the test is conceivably not redundant on some platforms; perhaps I will argue that on the Clang list. Probably not, though, because it was hard enough with the warning bug that I did manage to get them to fix, and I thought that one was open-and-shut.

>> I'm not convinced that that is an improvement to rely on the
>> conversion doing so, but it's not as if I feel very strongly about it.
> The C standard specifies that signed-to-unsigned conversions must work
> like that; and even if the standard didn't, it would surely work like
> that on any machine with two's-complement representation, which is to
> say every computer built in the last forty years or so. So I don't find
> it a questionable assumption.

I wasn't questioning whether it would work. I just think that relying on the behaviour of the conversion like that doesn't make your intent very clear. If I thought that it would or could be plain broken under certain circumstances, I would naturally feel strongly about it; as I've said, I don't.

-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
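The conversion rule Tom is citing (C99 6.3.1.3: a negative value converted to an unsigned type is brought into range by repeatedly adding one more than the maximum value) amounts to reduction modulo 2**32 for a 32-bit unsigned int, which is easy to demonstrate outside C:

```python
UINT32_MOD = 2 ** 32

def to_unsigned32(x):
    # C99 6.3.1.3: add 2**32 until the value is in range,
    # i.e. reduce modulo 2**32 (Python's % is always non-negative)
    return x % UINT32_MOD
```

So `(unsigned) -1` is UINT_MAX, which is why a caller passing -1 to a function taking an unsigned enum "gets what they asked for": a perfectly well-defined, very large value.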
Re: [HACKERS] Postgres / plpgsql equivalent to python's getattr() ?
James Robinson writes:
> Python's getattr() allows for dynamic lookup of attributes on an
> object, as in:
> inst = MyClass(x=12, y=24)
> v = getattr(inst, 'x')
> assert v == 12
> Oftentimes in writing data validating trigger functions, it'd be real
> handy to be able to do a similar thing in plpgsql
> Is there something in the internals which inherently prevent this?

plpgsql is strongly typed (much more so than python, anyway).

regards, tom lane
Re: [HACKERS] Further news on Clang - spurious warnings
On Wed, Aug 03, 2011 at 04:03:39PM -0400, Tom Lane wrote:
> The C standard specifies that signed-to-unsigned conversions must work
> like that; and even if the standard didn't, it would surely work like
> that on any machine with two's-complement representation, which is to
> say every computer built in the last forty years or so. So I don't find
> it a questionable assumption.

I had the "pleasure" of working on a Univac 1108 in about 1978, and it was very definitely ones complement. I'm somewhat amazed to find that the Univac 1100 series architecture and instruction set lives on to this day. The last pure 1100 seems to be the Unisys 2200/3800, released in 1997. Even later U1100/Exec 8 descendants appear to still exist and are still actively supported: http://en.wikipedia.org/wiki/Unisys_OS_2200_operating_system

So there are still ones complement machines out there. However, I suggest we pretend otherwise and continue to ignore them.

-dg -- David Gould da...@sonic.net 510 536 1443 / 510 282 0869 If simplicity worked, the world would be overrun with insects.
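For anyone who has never met a ones-complement machine: the two encodings of -1 differ in exactly one bit, which is why code assuming an all-ones pattern for -1 is, strictly speaking, non-portable. A small demonstration using 16-bit patterns (width chosen arbitrarily for illustration):

```python
BITS = 16
MASK = (1 << BITS) - 1

def twos_complement(x):
    # Two's complement: -x is encoded as 2**BITS - x, so -1 is all ones
    return x % (1 << BITS)

def ones_complement(x):
    # Ones' complement: -x is encoded as the bitwise NOT of x,
    # so -1 is all ones except the low bit (and there are two zeros!)
    if x < 0:
        return (~(-x)) & MASK
    return x
```

On a Univac-style machine the `status == -1` bit pattern would therefore not even be all ones, another reason to treat the enum's contract, not the bit pattern, as the thing being tested.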
Re: [HACKERS] mosbench revisited
On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
> because PostgreSQL protects its free-list with a single lock and thus
> scales poorly with smaller caches." This is a complaint about
> BufFreeList lock which, in fact, I've seen as a huge point of
> contention on some workloads. In fact, on read-only workloads, with
> my lazy vxid lock patch applied, this is, I believe, the only
> remaining unpartitioned LWLock that is ever taken in exclusive mode;
> or at least the only one that's taken anywhere near often enough to
> matter. I think we're going to do something about this, although I
> don't have a specific idea in mind at the moment.

This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads).

The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive). The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.

-- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Compressing the AFTER TRIGGER queue
On Aug 2, 2011, at 7:09 AM, Simon Riggs wrote:
>> The best compression and flexibility in
>>> that case is to store a bitmap since that will average out at about 1
>>> bit per row, with variable length bitmaps. Which is about 8 times
>>> better compression ratio than originally suggested, without any loss
>>> of metadata.
>>
>> Yeah that's probably possible in specific cases, but I'm still not
>> sure how to make it meet the full requirements of the after trigger
>> queue.
>
> I think you'd better explain what use case you are trying to optimise
> for. It seems unlikely that you will come up with a compression scheme
> that will fit all cases.
>
> The only cases that seem a problem to me are
> * bulk RI checks
> * large writes on tables using trigger based replication
> maybe you have others?

Not sure how much this relates to this discussion, but I have often wished we had AFTER FOR EACH STATEMENT triggers that provided OLD and NEW recordsets you could make use of. Sometimes it's very valuable to be able to look at *all* the rows that changed in a transaction in one shot.

-- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Locking end of indexes during VACUUM
Simon Riggs writes: > What seems strange is that we make no attempt to check whether we have > already identified all tuples being removed by the VACUUM. We have the > number of dead tuples we are looking for and we track the number of > tuples we have deleted from the index, so we could easily make this > check early and avoid waiting. > Can we avoid scanning all pages once we have proven we have all dead tuples? That seems pretty dangerous to me, as it makes correctness of tuple removal totally dependent on there not being any duplicates, etc. I don't think there's a sufficiently compelling performance argument here to justify making VACUUM more fragile. (In any case, since VACUUM is visiting the leaf pages in physical not logical order, it's hard to argue that there would be any clear win for specific application access patterns such as "hitting the largest keys a lot".) regards, tom lane
[HACKERS] Postgres / plpgsql equivalent to python's getattr() ?
Hackers,

Python's getattr() allows for dynamic lookup of attributes on an object, as in:

inst = MyClass(x=12, y=24)
v = getattr(inst, 'x')
assert v == 12

Oftentimes in writing data-validating trigger functions, it'd be real handy to be able to do a similar thing in plpgsql against column values in a row or record type, such as making use of a trigger argument as a hint for which column to consider in this table's case. Oh, to be able to do something like (toy example known to be equivalent to a check):

CREATE OR REPLACE FUNCTION must_be_positive() RETURNS TRIGGER AS $$
begin
    if getattr(NEW, TG_ARGV[0]) <= 0 then
        raise exception '% must be positive', TG_ARGV[0];
    end if;
    -- after trigger
    return null;
end;
$$ LANGUAGE PLPGSQL;

A function which takes a row + a text column name, and / or a peer function taking row + index within row, would really open up plpgsql's expressivity in cases where you're writing mainly SQL stuff and don't really want to go over to plpythonu or whatnot (which describe rows as dicts).

Is there something in the internals which inherently prevents this? Or am I a fool and it already exists? Not having to defer to EXECUTE would be attractive.

James Robinson Socialserve.com
Re: [HACKERS] mosbench revisited
Robert Haas writes: > On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane wrote: >> ... We could possibly accept stale values for the >> planner estimates, but I think heapam's number had better be accurate. > I think the exact requirement is that, if the relation turns out to be > larger than the size we read, the extra blocks had better not contain > any tuples our snapshot can see. There's actually no interlock > between smgrnblocks() and smgrextend() right now, so presumably we > don't need to add one. No interlock in userspace, you mean. We're relying on the kernel to do it, ie, give us a number that is not older than the time of our (already taken at this point) snapshot. > I don't really think there's any sensible way to implement a > per-backend cache, because that would require invalidation events of > some kind to be sent on relation extension, and that seems utterly > insane from a performance standpoint, even if we invented something > less expensive than sinval. Yeah, that's the issue. But "relation extension" is not actually a cheap operation, since it requires a minimum of one kernel call that is presumably doing something nontrivial in the filesystem. I'm not entirely convinced that we couldn't make this work --- especially since we could certainly derate the duty cycle by a factor of ten or more without giving up anything remotely meaningful in planning accuracy. (I'd be inclined to make it send an inval only once the relation size had changed at least, say, 10%.) > A shared cache seems like it could work, but the locking is tricky. > Normally we'd just use a hash table protected by an LWLock, one one > LWLock per partition, but here that's clearly not going to work. The > kernel is using a spinlock per file, and that's still too > heavy-weight. That still seems utterly astonishing to me. 
We're touching each of those files once per query cycle; a cycle that contains two message sends, who knows how many internal spinlock/lwlock/heavyweightlock acquisitions inside Postgres (some of which *do* contend with each other), and a not insignificant amount of plain old computing. Meanwhile, this particular spinlock inside the kernel is protecting what, a single doubleword fetch? How is that the bottleneck? I am wondering whether kernel spinlocks are broken. regards, tom lane
Re: [HACKERS] cataloguing NOT NULL constraints
Excerpts from Dean Rasheed's message of sáb jul 23 04:37:06 -0400 2011:
> On 22 July 2011 22:28, Robert Haas wrote:
> >> mine was that we need a command such as
> >>
> >> ALTER TABLE foo ALTER COLUMN bar SET NOT NULL name_of_notnull_constr
> >>
> >> where the last bit is what's new.
> >
> > Well, if you don't have that, I don't see how you have any chance of
> > pg_dump working correctly.
>
> Ah yes, pg_dump's dumpConstraint() needs a clause to alter a table
> adding a named NOT NULL constraint (and the DOMAIN case should be
> preserving the constraint's name too). So it looks like some new
> syntax for ALTER TABLE to add named NOT NULL constraints is probably
> needed before this can be committed.

So after writing the code to handle named NOT NULL constraints for tables, I'm thinking that dumpConstraints needs to be fixed thusly:

@@ -12888,6 +12968,27 @@ dumpConstraint(Archive *fout, ConstraintInfo *coninfo)
 					 NULL, NULL);
 		}
 	}
+	else if (coninfo->contype == 'n' && tbinfo)
+	{
+		/* NOT NULL constraint on a table */
+		if (coninfo->separate)
+		{
+			write_msg(NULL, "NOT NULL constraints cannot be dumped separately from their owning table\n");
+			exit_nicely();
+		}
+	}
+	else if (coninfo->contype == 'n' && tbinfo == NULL)
+	{
+		/* NOT NULL constraint on a domain */
+		TypeInfo   *tyinfo = coninfo->condomain;
+
+		/* Ignore if not to be dumped separately */
+		if (coninfo->separate)
+		{
+			write_msg(NULL, "NOT NULL constraints cannot be dumped separately from their owning domain\n");
+			exit_nicely();
+		}
+	}
 	else
 	{
 		write_msg(NULL, "unrecognized constraint type: %c\n",
 				 coninfo->contype);

There doesn't seem to be any point in giving pg_dump the ability to dump such constraints separately; I don't think there's any situation in which dependencies between the table/domain and NOT NULL constraints would get circular (which is what causes them to acquire the "separate" flag).
I don't want to write code that is going to be always unused, particularly not in a beast as hairy as pg_dump. Note that the code dumps NOT NULL constraints along with the table definition, which includes the constraint name, so it works perfectly fine already.

-- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
[HACKERS] Re: [GENERAL] Odd VACUUM behavior when it is expected to truncate last empty pages
On Wed, Aug 3, 2011 at 12:33 PM, Pavan Deolasee wrote:
>
> The only problem, other than a surprising behavior that you noted,
> that I see with this approach is that we might repeatedly try to
> truncate a relation which in fact does not have anything to truncate.
> The worst thing is we might unnecessarily take an exclusive lock on
> the table.

So it seems we tried to fix this issue some time back: http://archives.postgresql.org/pgsql-hackers/2008-12/msg01994.php

But I don't quite understand how the fix would really work. nonempty_pages would most likely be set at a value lower than relpages if the last page in the relation is all-visible according to the visibility map. Did we mean to test (nonempty_pages > 0) there? But even that may not work, except for the case when there are no dead tuples in the relation.

Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane wrote: > Robert Haas writes: >> On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane wrote: >>> (If the query ended up being a seqscan, I'd expect a second >>> lseek(SEEK_END) when the executor starts up, but I gather from the other >>> complaints that the mosbench people were only testing simple indexscan >>> queries.) > >> Yeah, it seems that for a sequential scan we lseek the heap, then the >> index, then the heap again; but for index scans we just hit the heap >> and the index. > > Sure. The first two come from the planner getting the table and index > sizes for estimation purposes (look in plancat.c). The last is done in > heapam.c's initscan(). We could possibly accept stale values for the > planner estimates, but I think heapam's number had better be accurate. I think the exact requirement is that, if the relation turns out to be larger than the size we read, the extra blocks had better not contain any tuples our snapshot can see. There's actually no interlock between smgrnblocks() and smgrextend() right now, so presumably we don't need to add one. However, a value cached from a few seconds ago is clearly not going to cut it. I don't really think there's any sensible way to implement a per-backend cache, because that would require invalidation events of some kind to be sent on relation extension, and that seems utterly insane from a performance standpoint, even if we invented something less expensive than sinval. I guess it might work for planning purposes if you only sent out invalidation events on every N'th extension or something, but penalizing the accuracy of planning to work around a Linux kernel bug that only manifests itself on machines with >32 cores doesn't seem very appealing. A shared cache seems like it could work, but the locking is tricky. Normally we'd just use a hash table protected by an LWLock, one LWLock per partition, but here that's clearly not going to work.
The kernel is using a spinlock per file, and that's still too heavy-weight. I think that if we could prepopulate the cache with all the keys (i.e. relfilenodes) we care about and then never add or evict any, we could run it completely unlocked. I believe that the existing memory barriers in things like LWLockAcquire() would be sufficient to prevent us from reading a too-old value (i.e. block count). In particular, you couldn't read a value that predated your snapshot, if you got your snapshot by holding ProcArrayLock. But the races involved with adding and removing items from the cache are hard to deal with without using locks, especially because the keys are 12 bytes or more and therefore can't be read or written atomically. I've been mulling over how we might deal with this and actually coded up an implementation, but it turns out (surprise, surprise) to have problems with insufficient locking. So I'm thinking it over some more. And hoping that the Linux guys decide to do something about it. This isn't really our bug - lseek is quite cheap in the uncontended case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Locking end of indexes during VACUUM
During btvacuumscan(), we lock the index for extension and then wait to acquire a cleanup lock on the last page. Then loop until we find a point where the index has not expanded again during our wait for lock on that last page. On a busy index this can take some time, especially when people regularly access data with the highest values in the index. The comments there say "It is critical that we visit all leaf pages, including ones added after we start the scan, else we might fail to delete some deletable tuples." What seems strange is that we make no attempt to check whether we have already identified all tuples being removed by the VACUUM. We have the number of dead tuples we are looking for and we track the number of tuples we have deleted from the index, so we could easily make this check early and avoid waiting. Can we avoid scanning all pages once we have proven we have all dead tuples? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mosbench revisited
Robert Haas writes: > On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane wrote: >> (If the query ended up being a seqscan, I'd expect a second >> lseek(SEEK_END) when the executor starts up, but I gather from the other >> complaints that the mosbench people were only testing simple indexscan >> queries.) > Yeah, it seems that for a sequential scan we lseek the heap, then the > index, then the heap again; but for index scans we just hit the heap > and the index. Sure. The first two come from the planner getting the table and index sizes for estimation purposes (look in plancat.c). The last is done in heapam.c's initscan(). We could possibly accept stale values for the planner estimates, but I think heapam's number had better be accurate. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane wrote: > Robert Haas writes: >> On a straight pgbench -S test, you get four system calls per query: >> recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates >> the two lseeks. > > [ scratches head... ] Two? Yep. > Is that one for the table and one for its > lone index, or are we being redundant there? The former. Specifically, it appears we're smart enough to only test the last segment (in this case, the table is large enough that there is a .1 file, and that's what we're lseeking). > (If the query ended up being a seqscan, I'd expect a second > lseek(SEEK_END) when the executor starts up, but I gather from the other > complaints that the mosbench people were only testing simple indexscan > queries.) Yeah, it seems that for a sequential scan we lseek the heap, then the index, then the heap again; but for index scans we just hit the heap and the index. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
Peter Geoghegan writes: > On 3 August 2011 15:29, Tom Lane wrote: >> There is another point here, though, which is that if we're not sure >> whether the compiler considers ExecStatusType to be signed or unsigned, >> then we have no idea what the test "status < PGRES_EMPTY_QUERY" even >> means. > I'm sorry, but I don't know what you mean by this. I mean that it's unclear what you'll get if status has a bitpattern equivalent to a negative integer. If the compiler implements the comparison as signed, the test will yield TRUE; if unsigned, it's FALSE. >> So I think the most reasonable fix is probably >> >>if ((unsigned int) status >= sizeof pgresStatus / sizeof >> pgresStatus[0]) >> >> which is sufficient to cover both directions, since if status is passed >> as -1 then it will convert to a large unsigned value. It's also a >> natural expression of what we really want, ie, that the integer >> equivalent of the enum value is in range. > I'm not convinced that that is an improvement to rely on the > conversion doing so, but it's not as if I feel very strongly about it. The C standard specifies that signed-to-unsigned conversions must work like that; and even if the standard didn't, it would surely work like that on any machine with two's-complement representation, which is to say every computer built in the last forty years or so. So I don't find it a questionable assumption. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mosbench revisited
Robert Haas writes: > On a straight pgbench -S test, you get four system calls per query: > recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates > the two lseeks. [ scratches head... ] Two? Is that one for the table and one for its lone index, or are we being redundant there? (If the query ended up being a seqscan, I'd expect a second lseek(SEEK_END) when the executor starts up, but I gather from the other complaints that the mosbench people were only testing simple indexscan queries.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transient plans versus the SPI API
Simon Riggs writes: > I think its possible to tell automatically whether we need to replan > always or not based upon the path we take through selectivity > functions. I don't really believe that, or at least I think it would only detect a few cases. Examples of parameter-value-sensitive decisions that are made nowhere near the selectivity functions are constraint exclusion and LIKE pattern to index-qual conversion. And in none of these cases do we really know at the bottom level whether a different parameter value will lead to a significant change in the finished plan. For instance, if there's no index for column foo, it is a waste of time to force replanning just because we have varying selectivity estimates for "WHERE foo > $1". I think we'll be a lot better off with the framework discussed last year: build a generic plan, as well as custom plans for the first few sets of parameter values, and then observe whether there's a significant reduction in estimated costs for the custom plans. But in any case, it's way premature to be debating this until we have the infrastructure in which we can experiment with different policies. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transient plans versus the SPI API
On Wed, Aug 3, 2011 at 3:19 PM, Tom Lane wrote: > Robert Haas writes: >> This seems like a good design. Now what would be really cool is if >> you could observe a stream of queries like this: > >> SELECT a, b FROM foo WHERE c = 123 >> SELECT a, b FROM foo WHERE c = 97 >> SELECT a, b FROM foo WHERE c = 236 > >> ...and say, hey, I could just make a generic plan and use it every >> time I see one of these. It's not too clear to me how you'd make >> recognition of such queries cheap enough to be practical, but maybe >> someone will think of a way... > > Hm, you mean reverse-engineering the parameterization of the query? > Interesting thought, but I really don't see a way to make it practical. > > In any case, it would amount to making up for a bad decision on the > application side, ie, not transmitting the query in the parameterized > form that presumably exists somewhere in the application. I think > we'd be better served all around by encouraging app developers to rely > more heavily on parameterized queries ... but first we have to fix the > performance risks there. Fair enough. I have to admit I'm afraid of them right now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
On Aug 3, 2011, at 11:28 AM, Grzegorz Jaskiewicz wrote: > export CC=clang > ./configure > ... Thanks, I'll give that a try the next time I build (RC1 I guess). David
Re: [HACKERS] Transient plans versus the SPI API
Robert Haas writes: > This seems like a good design. Now what would be really cool is if > you could observe a stream of queries like this: > SELECT a, b FROM foo WHERE c = 123 > SELECT a, b FROM foo WHERE c = 97 > SELECT a, b FROM foo WHERE c = 236 > ...and say, hey, I could just make a generic plan and use it every > time I see one of these. It's not too clear to me how you'd make > recognition of such queries cheap enough to be practical, but maybe > someone will think of a way... Hm, you mean reverse-engineering the parameterization of the query? Interesting thought, but I really don't see a way to make it practical. In any case, it would amount to making up for a bad decision on the application side, ie, not transmitting the query in the parameterized form that presumably exists somewhere in the application. I think we'd be better served all around by encouraging app developers to rely more heavily on parameterized queries ... but first we have to fix the performance risks there. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 2:49 PM, Tom Lane wrote: > Martijn van Oosterhout writes: >> On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote: >>> It would be nice if the Linux guys would fix this problem for us, but >>> I'm not sure whether they will. For those who may be curious, the >>> problem is in generic_file_llseek() in fs/read-write.c. On a platform >>> with 8-byte atomic reads, it seems like it ought to be very possible >>> to read inode->i_size without taking a spinlock. > >> Interesting. There's this thread from 2003 suggesting the use of pread >> instead, it was rejected on the argument that lseek is cheap so not a >> problem. > >> http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php > > That seems rather unrelated. The point here is our use of lseek to find > out the current file size --- or at least, I would hope they're not > trying to read the inode's file size in a SEEK_CUR call. Correct. > The reason "-M prepared" helps is presumably that it eliminates most of > the RelationGetNumberOfBlocks calls the planner does to check current > table size. While we could certainly consider using a cheaper (possibly > more stale) value there, it's a bit astonishing to think that that's the > main cost in a parse/plan/execute cycle. Perhaps there are more hotspot > calls than that one? Nope. On a straight pgbench -S test, you get four system calls per query: recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates the two lseeks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
On 3 Aug 2011, at 19:20, David E. Wheeler wrote: > > Yeah, not sure whether or not to tweak the Makefile to use Clang. I guess not? > export CC=clang ./configure ...
Re: [HACKERS] WAL logging volume and CREATE TABLE
Alvaro Herrera wrote: > Excerpts from Bruce Momjian's message of mar ago 02 22:46:55 -0400 2011: > > > I have created a documentation patch to clarify this, and to mention > > CREATE TABLE AS which also has this optimization. > > It doesn't seem particularly better to me. How about something like > > In minimal level, WAL-logging of some operations can be safely skipped, > which can make those operations much faster (see ). Operations on > which this optimization can be applied include: > > CREATE INDEX > CLUSTER > CREATE TABLE AS > COPY, when tables that were created or truncated in the same > transaction > > > Minimal WAL does not contain enough information to reconstruct the data > from a base backup and the WAL logs, so either archive or > hot_standby level must be used to enable ... Good idea --- updated patch attached. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml new file mode 100644 index 4fadca9..aac6c3b *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** SET ENABLE_SEQSCAN TO OFF; *** 1451,1461 This parameter can only be set at server start. ! In minimal level, WAL-logging of some bulk operations, like ! CREATE INDEX, CLUSTER and COPY on ! a table that was created or truncated in the same transaction can be ! safely skipped, which can make those operations much faster (see ! ). But minimal WAL does not contain enough information to reconstruct the data from a base backup and the WAL logs, so either archive or hot_standby level must be used to enable --- 1451,1468 This parameter can only be set at server start. ! In minimal level, WAL-logging of some bulk ! operations can be safely skipped, which can make those ! operations much faster (see ). ! Operations in which this optimization can be applied include: ! ! CREATE INDEX ! CLUSTER ! CREATE TABLE AS ! 
COPY into tables that were created or truncated in the same ! transaction ! ! But minimal WAL does not contain enough information to reconstruct the data from a base backup and the WAL logs, so either archive or hot_standby level must be used to enable -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mosbench revisited
Martijn van Oosterhout writes: > On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote: >> It would be nice if the Linux guys would fix this problem for us, but >> I'm not sure whether they will. For those who may be curious, the >> problem is in generic_file_llseek() in fs/read-write.c. On a platform >> with 8-byte atomic reads, it seems like it ought to be very possible >> to read inode->i_size without taking a spinlock. > Interesting. There's this thread from 2003 suggesting the use of pread > instead, it was rejected on the argument that lseek is cheap so not a > problem. > http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php That seems rather unrelated. The point here is our use of lseek to find out the current file size --- or at least, I would hope they're not trying to read the inode's file size in a SEEK_CUR call. The reason "-M prepared" helps is presumably that it eliminates most of the RelationGetNumberOfBlocks calls the planner does to check current table size. While we could certainly consider using a cheaper (possibly more stale) value there, it's a bit astonishing to think that that's the main cost in a parse/plan/execute cycle. Perhaps there are more hotspot calls than that one? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transient plans versus the SPI API
Tom Lane writes: > So yes, it'd get a little worse for that use-case. But you have to > weigh that against the likelihood that other use-cases will get better. > If our requirement for a transient-plan mechanism is that no individual > case can ever be worse than before, then we might as well abandon the > entire project right now, because the only way to meet that requirement > is to change nothing. That is not where I wanted to drift. It's just that I don't have as much time as I would like these days, and so it helps me a lot to see a worked-out example rather than having to make sure I parse your proposal correctly. Thanks a lot for your answer; I have a very clear confirmation of how I read your previous email. I will have to do some testing, but it could well be that this application will benefit from the locking reductions enough to buy this effect back. > Of course we could address the worst cases by providing some mechanism > to tell the plancache code "always use a generic plan for this query" > or "always use a custom plan". I'm not entirely thrilled with that, > because it's effectively a planner hint and has got the same problems > as all planner hints, namely that users are likely to get it wrong. Yeah. > But it would be relatively painless to supply such a hint at the SPI > level, which is why I asked whether we should. It'd be much harder to > do something equivalent at higher levels, which is why I'm not that > eager to do it for SPI. Given the SLA of those prepared queries in my case, I think I could accept having to switch from SQL statements to C-coded SRFs to guarantee the planning behavior. It will not make the upgrade cheaper, but I realize it's a very narrow and specific use case. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] mosbench revisited
On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote: > It would be nice if the Linux guys would fix this problem for us, but > I'm not sure whether they will. For those who may be curious, the > problem is in generic_file_llseek() in fs/read-write.c. On a platform > with 8-byte atomic reads, it seems like it ought to be very possible > to read inode->i_size without taking a spinlock. Interesting. There's this thread from 2003 suggesting the use of pread instead, it was rejected on the argument that lseek is cheap so not a problem. http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php Perhaps we now have a benchmark where the effect can be measured. There's the issue about whether it screws up the readahead mechanism... Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
[HACKERS] mosbench revisited
About nine months ago, we had a discussion of some benchmarking that was done by the mosbench folks at MIT: http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php Although the authors used PostgreSQL as a test harness for driving load, it's pretty clear from reading the paper that their primary goal was to stress the Linux kernel, so the applicability of the paper to real-world PostgreSQL performance improvement is less than it might be. Still, having now actually investigated in some detail many of the same performance issues that they were struggling with, I have a much clearer understanding of what's really going on here. In PostgreSQL terms, here are the bottlenecks they ran into: 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache because PostgreSQL protects its free-list with a single lock and thus scales poorly with smaller caches." This is a complaint about BufFreeList lock which, in fact, I've seen as a huge point of contention on some workloads. In fact, on read-only workloads, with my lazy vxid lock patch applied, this is, I believe, the only remaining unpartitioned LWLock that is ever taken in exclusive mode; or at least the only one that's taken anywhere near often enough to matter. I think we're going to do something about this, although I don't have a specific idea in mind at the moment. 2. "PostgreSQL implements row- and table-level locks atop user-level mutexes; as a result, even a non-conflicting row- or table-level lock acquisition requires exclusively locking one of only 16 global mutexes." I think that the reference to row-level locks here is a red herring; or at least, I haven't seen any evidence that row-level locking is a meaningful source of contention on any workload I've tested. Table-level locks clearly are, and this is the problem that the now-committed fastlock patch addressed. So, fixed! 3. 
"Our workload creates one PostgreSQL connection per server core and sends queries (selects or updates) in batches of 256, aggregating successive read-only transactions into single transactions. This workload is intended to minimize application-level contention within PostgreSQL in order to maximize the stress PostgreSQL places on the kernel." I had no idea what this was talking about at the time, but it's now obvious in retrospect that they were working around the overhead imposed by acquiring and releasing relation and virtualxid locks. My pending "lazy vxids" patch will address the remaining issue here. 4. "With modified PostgreSQL on stock Linux, throughput for both workloads collapses at 36 cores .. The main reason is the kernel's lseek implementation." With the fastlock, sinval-hasmessages, and lazy-vxid patches applied (the first two are committed now), it's now much easier to run headlong into this bottleneck. Prior to those patches, for this to be an issue, you would need to batch your queries together in big groups to avoid getting whacked by the lock manager and/or sinval overhead first. With those problems and the recently discovered bottleneck in glibc's random() implementation fixed, good old pgbench -S is enough to hit this problem if you have enough clients and enough cores. And it turns out that the word "collapse" is not an exaggeration. On a 64-core Intel box running RHEL 6.1, performance ramped up from 24k TPS at 4 clients to 175k TPS at 32 clients and then to 207k TPS at 44 clients. After that it fell off a cliff, dropping to 93k TPS at 52 clients and 26k TPS at 64 clients, consuming truly horrifying amounts of system time in the process. A somewhat tedious investigation revealed that the problem is, in fact, contention on the inode mutex caused by lseek(). Results are much better with -M prepared (310k TPS at 48 clients, 294k TPS at 64 clients). 
All one-minute tests with scale factor 100, fitting inside 8GB of shared_buffers (clearly not enough for serious benchmarking, but enough to demonstrate this issue). It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read-write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock. A little Googling around suggests that some patches along these lines have been proposed and - for reasons that I don't fully understand - rejected. That now seems unfortunate. Barring a kernel-level fix, we could try to implement our own cache to work around this problem. However, any such cache would need to be darn cheap to check and update (since we can't assume that relation extension is an infrequent event) and must somehow avoid having the same sort of mutex contention that's killing the kernel in this workload. 5. With all of the above problems fixed or worked around, the authors write, "PostgreSQL's overall scalability is primarily limited by contention for the spinlock protecting t
Re: [HACKERS] Further news on Clang - spurious warnings
On Aug 3, 2011, at 11:17 AM, Peter Eisentraut wrote: > Note that what you are using there is a GCC frontend with a LLVM > backend. (So I suppose you would get mostly GCC warnings.) Clang is a > new/different compiler frontend using the LLVM backend. Yeah, not sure whether or not to tweak the Makefile to use Clang. I guess not? David
Re: [HACKERS] cataloguing NOT NULL constraints
Excerpts from Alvaro Herrera's message of mié ago 03 13:12:38 -0400 2011: > ... ah, maybe what we could do is have gram.y create a ColumnDef in the > new production, and stick that in cmd->def instead of the list of > constraints. So parse_utilcmd would have to know that if that node > IsA(ColumnDef) then it needs to deal with column constraints. It seems > a bit cleaner overall, though it's still wart-ish. Yes, this works, thanks for the suggestion. Attached. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support alter_add-3.patch
Re: [HACKERS] Further news on Clang - spurious warnings
On ons, 2011-08-03 at 10:33 -0700, David E. Wheeler wrote: > I haven't been paying attention to warnings, but FWIW 9.1beta3 has > been building with LLVM by default with Xcode 4. Both on Snow Leopard > and on Lion I saw something like this: > > try=# select version(); > > version > > - > PostgreSQL 9.1beta3 on x86_64-apple-darwin11.0.0, compiled by > i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. > build 5658) (LLVM build 2335.15.00), 64-bit > (1 row) > > Which is just awesome. Things seem to work great. Note that what you are using there is a GCC frontend with a LLVM backend. (So I suppose you would get mostly GCC warnings.) Clang is a new/different compiler frontend using the LLVM backend. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
On ons, 2011-08-03 at 10:25 +0100, Peter Geoghegan wrote: > Attached patch removes the tautological part of an evaluated > expression, fixing the problem flagged by this quite valid warning. I think these warnings are completely bogus and should not be worked around. Even in the most trivial case of { unsigned int foo; ... if (foo < 0 && ...) ... } I would not want to remove the check, because as the code gets moved around, refactored, reused, whatever, the unsigned int might change into a signed int, and then you're hosed. It's fine for -Wextra, but not for the -Wall level.
Re: [HACKERS] Further news on Clang - spurious warnings
On Aug 3, 2011, at 1:47 AM, Peter Geoghegan wrote: > So, there are 4 remaining warnings. I haven't been paying attention to warnings, but FWIW 9.1beta3 has been building with LLVM by default with Xcode 4. Both on Snow Leopard and on Lion I saw something like this: try=# select version(); version - PostgreSQL 9.1beta3 on x86_64-apple-darwin11.0.0, compiled by i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00), 64-bit (1 row) Which is just awesome. Things seem to work great. Best, David
Re: [HACKERS] WIP fix proposal for bug #6123
On Wed, Aug 3, 2011 at 12:27 PM, Florian Pflug wrote: > Going down that road opens the door to a *lot* of subtle semantic > differences between currently equivalent queries. For example, > > UPDATE T SET f=f, a=1 > > would behave differently than > > UPDATE T SET a=1 > > because in the first case the new row would depend on the old row's > value of "f", while in the second case it doesn't. True. But I'm not really bothered by that. If the user gets an error message that says: ERROR: updated column "f" has already been modified by a BEFORE trigger ...the user will realize the error of their ways. >> There's no way to get the >> same result as if you'd done either one of them first, because they >> are inextricably intertwined. >> >> In practice, my hand-wavy reference to "reconciling the updates" is a >> problem because of the way the trigger interface works. It feels >> pretty impossible to decide that we're going to do the update, but >> with some other random values we dredged up from some other update >> replacing some of the ones the user explicitly handed to us. But if >> the trigger could return an indication of which column values it >> wished to override, then it seems to me that we could piece together a >> reasonable set of semantics. It's not exactly clear how to make that >> work, though. > > I dunno, that all still feels awfully complex. As you said yourself, > this case is quite similar to a serialization anomaly. Taking that > correspondence further, that reconciliation of updates is pretty much > what the EPQ machinery does in READ COMMITTED mode. Now, we ourselves > have warned users in the past to *not* use READ COMMITTED mode if they > do complex updates (e.g., use UPDATE ... FROM ...), because the behaviour > of that reconciliation machinery in the presence of concurrent updates > is extremely hard to predict. 
I thus don't believe that it's a good idea > to introduce similarly complex behaviour in other parts of the system - > and particularly not if you cannot disable it by switching to another > isolation level. > > Simply throwing an error, on the other hand, makes the behaviour simple > to understand and explain. True. I'm coming around to liking that behavior better than I did on first hearing, but I'm still not crazy about it, because as an app developer I would really like to have at least the unproblematic cases actually work. Throwing an error at least makes it clear that you've done something which isn't supported, and that's probably an improvement over the current, somewhat-baffling behavior. However, it's not even 25% as nice as having it actually work as intended. That's why, even if we can't make all the cases work sanely, I'd be a lot more enthusiastic about it if we could find a way to make at least some of them work sanely. The mind-bending cases are unlikely to be the ones people do on purpose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cataloguing NOT NULL constraints
Excerpts from Dean Rasheed's message of Wed Aug 03 03:11:32 -0400 2011: > On 3 August 2011 04:40, Alvaro Herrera wrote: > > Excerpts from Dean Rasheed's message of Sat Jul 30 18:40:46 -0400 2011: > > > >> Looks pretty good to me (not too dirty). I suppose given that the > >> parser transforms AT_ColumnConstraint into one of the existing command > >> subtypes, you could just have gram.y emit an AT_AddConstraint with the > >> ColumnDef attached, to save adding a new subtype, but there's probably > >> not much difference. > > > > Thanks. I've done the other changes you suggested, but I don't see that > > it's desirable to have gram.y emit AT_AddConstraint directly. It seems > > cleaner to be able to turn a NOT NULL constraint into AT_SetNotNull > > in parse_utilcmd instead. > > I wasn't proposing getting rid of that bit in parse_utilcmd. I just > meant move the block that calls transformColumnConstraints() to the > AT_AddConstraint case in transformAlterTableStmt(). So it would test > if it has a ColumnDef attached, and either process a table constraint > or a set of column constraints. That would avoid the need for a new > command subtype, and so the changes to tablecmds.c would not be > needed. I think it would also avoid the need to add colname to the > Constraint struct, so none of the parsenodes.h changes would be needed > either. Oh, I see. Well, the problem is precisely that we don't have a ColumnDef at that point. ... ah, maybe what we could do is have gram.y create a ColumnDef in the new production, and stick that in cmd->def instead of the list of constraints. So parse_utilcmd would have to know that if that node IsA(ColumnDef) then it needs to deal with column constraints. It seems a bit cleaner overall, though it's still wart-ish. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. 
PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP fix proposal for bug #6123
Florian Pflug wrote: > To me, it still seems conceptionally cleaner to just decree that a > row must not be modified while BEFORE triggers are running, > period. > > This, BTW, also matches what Oracle does, only on a per-row > instead of per-table basis. Oracle AFAIR simply forbids touching > of the table a BEFORE trigger is attached to from within that > trigger. (They even forbid SELECTS, which is presumably because > they don't have an equivalent of our per-row command id, i.e. > cannot ensure that such a SELECT sees the state the table was in > at the beginning of the statement) It appears this was in flight while I was composing my last post. This seems pretty strict, but given the trade-offs, perhaps it is worth it. I can live with either solution, myself. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transient plans versus the SPI API
On Wed, Aug 3, 2011 at 12:19 PM, Tom Lane wrote: > Dimitri Fontaine writes: >> Tom Lane writes: >>> Anyone have an opinion about that? > >> I still have this application where PREPARE takes between 50ms and 300ms >> and EXECUTE 5ms to 10ms, and I can handle 1 PREPARE for 10000 EXECUTE >> quite easily. (Yes the database fits in RAM, and yes when that's no >> longer the case we just upgrade the hardware) > >> What does your proposal mean for such a use case? > > Well, the policy for when to replan or not remains to be worked out in > detail, but what is likely to happen for such cases is that we'll waste > a few planning cycles before determining that there's no benefit in a > custom plan. So, using the worst-case ends of your ranges above and > assuming that "a few" means "10", we'd go from 300 + 5 * 10000 = 50300 > ms to execute the query 10000 times, to 10 * 300 + 5 * 10000 = 53000 ms. A little OT here, but (as I think Simon said elsewhere) I think we really ought to be considering the table statistics when deciding whether or not to replan. It seems to me that the overwhelmingly common case where this is going to come up is when (some subset of) the MCVs require a different plan than run-of-the-mill values. It would be nice to somehow work that out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FOR KEY LOCK foreign keys
Excerpts from Robert Haas's message of Wed Aug 03 12:14:15 -0400 2011: > On Wed, Jul 27, 2011 at 7:16 PM, Alvaro Herrera > wrote: > > One thing I have not addressed is Noah's idea about creating a new lock > > mode, KEY UPDATE, that would let us solve the initial problem that this > > patch set out to resolve in the first place. I am not clear on exactly how > > that is to be implemented, because currently heap_update and heap_delete > > do not grab any kind of lock but instead do their own ad-hoc waiting. I > > think that might need to be reshuffled a bit, to which I haven't gotten > > yet, and is a radical enough idea that I would like it to be discussed > > by the hackers community at large before setting sail on developing it. > > In the meantime, this patch does improve the current situation quite a > > lot. > > I haven't looked at the patch yet, but do you have a pointer to Noah's > proposal? And/or a description of how it differs from what you > implemented here? Yes, see his review email here: http://archives.postgresql.org/message-id/20110211071322.gb26...@tornado.leadboat.com It's long, but search for the part where he talks about "KEY UPDATE". The way my patch works is explained by Noah there. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP fix proposal for bug #6123
Robert Haas wrote: > Florian Pflug wrote: >> Here's a step-by-step description of what happens in the case of >> two UPDATES, assuming that we don't ignore any updates and throw >> no error. >> >> 1) UPDATE T SET ... WHERE ID=1 >> 2) Row with ID=1 is found & locked >> 3) BEFORE UPDATE triggers are fired, with OLD containing the >> original row's contents and NEW the values specified in (1) >> 4) UPDATE T SET ... WHERE ID=1, executed from within one of the >> triggers >> 5) BEFORE UPDATE triggers are fired again, with OLD containing >> the original row's contents and NEW the value specified in >> (4). >> 6) The row is updated with the values specified in (4), possibly >> changed by the triggers in (5) >> 7) The row is updated with the values specified in (2), possibly >> changed by the triggers in (3). >> The difference between Kevin's original patch and my suggestion >> is, BTW, that he made step (7) throw an error while I suggested >> the error to be thrown already at (4). > > I think Kevin's proposal is better, because AFAICS if the BEFORE > UPDATE trigger returns NULL then we haven't got a problem; and we > can't know that until we get to step 7. Robert makes a good point -- at step 4 we can't know whether the BEFORE trigger will return an unmodified record, a modified record, or NULL. Without some way to deal with that, throwing the error earlier seems like a non-starter. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP fix proposal for bug #6123
On Aug 3, 2011, at 17:57 , Robert Haas wrote: > On Wed, Aug 3, 2011 at 4:50 AM, Florian Pflug wrote: >> The difference between Kevin's original patch and my suggestion is, BTW, >> that he made step (7) throw an error while I suggested the error to >> be thrown already at (4). > > I think Kevin's proposal is better, because AFAICS if the BEFORE > UPDATE trigger returns NULL then we haven't got a problem; and we > can't know that until we get to step 7. Hm, not sure if I agree. A BEFORE trigger might decide to block an update, and then take some action based on the assumption that the row is actually unmodified (i.e., identical to the OLD value observed by the trigger). To me, it still seems conceptionally cleaner to just decree that a row must not be modified while BEFORE triggers are running, period. This, BTW, also matches what Oracle does, only on a per-row instead of per-table basis. Oracle AFAIR simply forbids touching of the table a BEFORE trigger is attached to from within that trigger. (They even forbid SELECTS, which is presumably because they don't have an equivalent of our per-row command id, i.e. cannot ensure that such a SELECT sees the state the table was in at the beginning of the statement) best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [GENERAL] Odd VACUUM behavior when it is expected to truncate last empty pages
(moving this to hackers since I suspect we got an issue to fix here) On Wed, Aug 3, 2011 at 6:35 AM, Sergey Konoplev wrote: > Hi all, > > I have PostgreSQL 9.0.3 installed on my Gentoo Linux box. The > configuration is default. There is no activity in the database other than > that described below. > > What I am trying to achieve is the effect described in this article > http://blog.endpoint.com/2010/09/reducing-bloat-without-locking.html. > In short I am updating the last pages of a table to move the tuples to the > earlier pages to make VACUUM able to truncate the empty tail. However > I faced a strange VACUUM behavior. So the situation is: > > 1. do some UPDATEs on the table so it has several last pages free, > 2. perform VACUUM of this table the 1st time, no tail pages will be > truncated (why?), > 3. perform VACUUM the 2nd time straight after the 1st one and it will > truncate the tail pages (why this time?). > There is a check to truncate only if a minimum of 1000 or relpages/16 pages (compile-time constants) can be truncated, and we attempt truncation only if that's true; otherwise we assume that the cost of attempting truncation is not worth the benefit of the pages salvaged. With that logic, we should have never truncated just 3 pages in a relation with 3000+ pages. But I can see that happening because of visibility maps. So if we stop scanning beyond a certain percentage of pages, we might be fooled into believing that the rest of the pages can possibly be truncated and then do a hard check to find that out. The first vacuum probably scans even the last few pages because you just updated the tuples of those pages, and then updates the visibility map for those pages. The second vacuum then stops much earlier because the visibility map tells us that all remaining pages are visible (and thus sets nonempty_pages to a lower number) and so the check I mentioned earlier succeeds and we attempt the truncation. Now, this is just a theory and a reproducible case will help to confirm this. 
The only problem, other than the surprising behavior you noted, that I see with this approach is that we might repeatedly try to truncate a relation which in fact does not have anything to truncate. The worst thing is we might unnecessarily take an exclusive lock on the table. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Fast GiST index build
On 03.08.2011 11:18, Alexander Korotkov wrote: I found that in the previous version of the patch I missed PageSetLSN and PageSetTLI, but the huge amount of WAL is still here. Also I found that the huge amount of WAL appears only with -O2. With -O0 the amount of WAL is OK, but messages "FATAL: xlog flush request BFF11148/809A600 is not satisfied --- flushed only to 44/9C518750" appear. It seems that there is some totally wrong use of WAL if even the optimization level matters... Try this:

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 5099330..5a441e0 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -478,7 +478,7 @@ bufferingbuildinsert(GISTInsertState *state,
 		/* Write the WAL record */
 		if (RelationNeedsWAL(state->r))
 		{
-			gistXLogUpdate(state->r->rd_node, buffer, oldoffnum, noldoffnum,
+			recptr = gistXLogUpdate(state->r->rd_node, buffer, oldoffnum, noldoffnum,
						   itup, ntup, InvalidBuffer);
 			PageSetLSN(page, recptr);
 			PageSetTLI(page, ThisTimeLineID);

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP fix proposal for bug #6123
On Aug 3, 2011, at 17:55 , Robert Haas wrote: > On that note, I think in some ways the problems we're hitting here are > very much like serialization anomalies. Yeah, I had the same feeling of familiarity ;-) > If the user updates a tuple > based on its PK and sets some non-PK field to a constant, and while > we're doing that, a BEFORE trigger updates any field in the tuple > other than the PK, then in theory it seems we ought to be able to > reconcile the updates. It feels like the result ought to be the same > as if we'd simply run the BEFORE-trigger update to completion, and > then run the main update. However, if the BEFORE-trigger modifies any > columns that the main update examined while constructing its value for > NEW, then the updates can't be serialized. Going down that road opens the door to a *lot* of subtle semantic differences between currently equivalent queries. For example, UPDATE T SET f=f, a=1 would behave differently than UPDATE T SET a=1 because in the first case the new row would depend on the old row's value of "f", while in the second case it doesn't. > There's no way to get the > same result as if you'd done either one of them first, because they > are inextricably intertwined. > > In practice, my hand-wavy reference to "reconciling the updates" is a > problem because of the way the trigger interface works. It feels > pretty impossible to decide that we're going to do the update, but > with some other random values we dredged up from some other update > replacing some of the ones the user explicitly handed to us. But if > the trigger could return an indication of which column values it > wished to override, then it seems to me that we could piece together a > reasonable set of semantics. It's not exactly clear how to make that > work, though. I dunno, that all still feels awfully complex. As you said yourself, this case is quite similar to a serialization anomaly. 
Taking that correspondence further, that reconciliation of updates is pretty much what the EPQ machinery does in READ COMMITTED mode. Now, we ourselves have warned users in the past to *not* use READ COMMITTED mode if they do complex updates (e.g., use UPDATE ... FROM ...), because the behaviour of that reconciliation machinery in the presence of concurrent updates is extremely hard to predict. I thus don't believe that it's a good idea to introduce similarly complex behaviour in other parts of the system - and particularly not if you cannot disable it by switching to another isolation level. Simply throwing an error, on the other hand, makes the behaviour simple to understand and explain. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transient plans versus the SPI API
Dimitri Fontaine writes: > Tom Lane writes: >> Anyone have an opinion about that? > I still have this application where PREPARE takes between 50ms and 300ms > and EXECUTE 5ms to 10ms, and I can handle 1 PREPARE for 10000 EXECUTE > quite easily. (Yes the database fits in RAM, and yes when that's no > longer the case we just upgrade the hardware) > What does your proposal mean for such a use case? Well, the policy for when to replan or not remains to be worked out in detail, but what is likely to happen for such cases is that we'll waste a few planning cycles before determining that there's no benefit in a custom plan. So, using the worst-case ends of your ranges above and assuming that "a few" means "10", we'd go from 300 + 5 * 10000 = 50300 ms to execute the query 10000 times, to 10 * 300 + 5 * 10000 = 53000 ms. So yes, it'd get a little worse for that use-case. But you have to weigh that against the likelihood that other use-cases will get better. If our requirement for a transient-plan mechanism is that no individual case can ever be worse than before, then we might as well abandon the entire project right now, because the only way to meet that requirement is to change nothing. Of course we could address the worst cases by providing some mechanism to tell the plancache code "always use a generic plan for this query" or "always use a custom plan". I'm not entirely thrilled with that, because it's effectively a planner hint and has got the same problems as all planner hints, namely that users are likely to get it wrong. But it would be relatively painless to supply such a hint at the SPI level, which is why I asked whether we should. It'd be much harder to do something equivalent at higher levels, which is why I'm not that eager to do it for SPI. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FOR KEY LOCK foreign keys
On Wed, Jul 27, 2011 at 7:16 PM, Alvaro Herrera wrote: > One thing I have not addressed is Noah's idea about creating a new lock > mode, KEY UPDATE, that would let us solve the initial problem that this > patch set out to resolve in the first place. I am not clear on exactly how > that is to be implemented, because currently heap_update and heap_delete > do not grab any kind of lock but instead do their own ad-hoc waiting. I > think that might need to be reshuffled a bit, to which I haven't gotten > yet, and is a radical enough idea that I would like it to be discussed > by the hackers community at large before setting sail on developing it. > In the meantime, this patch does improve the current situation quite a > lot. I haven't looked at the patch yet, but do you have a pointer to Noah's proposal? And/or a description of how it differs from what you implemented here? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incremental checkopints
2011/7/29 Greg Smith : > 1) Postponing writes as long as possible always improves the resulting > throughput of those writes. Any incremental checkpoint approach will detune > throughput by some amount. If you make writes go out more often, they will > be less efficient; that's just how things work if you benchmark anything > that allows write combining. Any incremental checkpoint approach is likely > to improve latency in some cases if it works well, while decreasing > throughput in most cases. Agreed. I came to the same conclusion a while back and then got depressed. That might mean we need a parameter to control the behavior, unless we can find a change where the throughput drop is sufficiently small that we don't really care, or make the optimization apply only in cases where we determine that the latency problem will be so severe that we'll certainly be willing to accept a drop in throughput to avoid it. > 2) The incremental checkpoint approach used by other databases, such as the > MySQL implementation, works by tracking what transaction IDs were associated > with a buffer update. The current way PostgreSQL saves buffer sync > information for the checkpoint to process things doesn't store enough > information to do that. As you say, the main price there is some additional > memory. I think what we'd need to track is the LSN that first dirtied the page (as opposed to the current field, which tracks the LSN that most recently wrote the page). If we write and flush all pages whose first-dirtied LSN precedes some cutoff point, then we ought to be able to advance the redo pointer to that point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP fix proposal for bug #6123
On Wed, Aug 3, 2011 at 4:50 AM, Florian Pflug wrote: > On Aug 3, 2011, at 04:54 , Robert Haas wrote: >> On Tue, Aug 2, 2011 at 2:32 PM, Kevin Grittner >> wrote: > Would you feel comfortable with a patch which threw an error on > the DELETE case, as it does on the UPDATE case? Yes, though with a little additional twist. The twist being that I'd like the error to be thrown earlier, at the time of the offending UPDATE or DELETE, not later, when the original UPDATE or DELETE which caused the BEFORE trigger invocation stumbles over the updated row. It not only seems more logical that way, but it also makes it possible for the trigger to catch the error, and react accordingly. >>> >>> I hadn't thought of that. It does seem better in every way except >>> for how much work it takes to do it and how much testing it needs to >>> have confidence in it. Definitely not 9.1 material. >> >> I'm not sure I understand the justification for throwing an error. >> Updating a row twice is not an error under any other circumstances; >> nor is deleting a row you've updated. Why should it be here? > > Because updating or deleting a row from within a BEFORE trigger fired > upon updating or deleting that very row causes two intermingled > UPDATE/DELETE executions, not one UPDATE/DELETE followed by another. > > Here's a step-by-step description of what happens in the case of two > UPDATES, assuming that we don't ignore any updates and throw no error. > > 1) UPDATE T SET ... WHERE ID=1 > 2) Row with ID=1 is found & locked > 3) BEFORE UPDATE triggers are fired, with OLD containing the original > row's contents and NEW the values specified in (1) > 4) UPDATE T SET ... WHERE ID=1, executed from within one of the triggers > 5) BEFORE UPDATE triggers are fired again, with OLD containing the > original row's contents and NEW the value specified in (4). 
> 6) The row is updated with the values specified in (4), possibly changed > by the triggers in (5) > 7) The row is updated with the values specified in (2), possibly changed > by the triggers in (3). > > Now, in step (7), we're overwriting the UPDATE from steps 4-6 without > even looking at the row that UPDATE produced. If, for example, both UPDATES > do "counter = counter + 1", we'd end up with the counter incremented by > 1, *not* by 2, even though there were two distinct UPDATEs. Or even worse, > the inner UPDATE at (4) might have modified the ID. Then, we'll apply the > outer UPDATE in (7), even though the original WHERE condition from (1) > no longer matches the row. > > We could attempt to fix that by using the EvalPlanQual machinery to > re-check the WHERE clause and re-compute the new row in (7). However, > EvalPlanQual introduces a lot of conceptional complexity, and can lead > to very surprising results for more complex UPDATES. (There's that whole > self-join problem with EvalPlanQual, for example) > > Also, even if we did use EvalPlanQual, we'd still have executed the outer > BEFORE trigger (step 3) with values for OLD and NEW which *don't* match > the row that we actually update in (7). Doing that seems rather hazardous > too - who knows what the BEFORE trigger has used the values for. > > The difference between Kevin's original patch and my suggestion is, BTW, > that he made step (7) throw an error while I suggested the error to > be thrown already at (4). I think Kevin's proposal is better, because AFAICS if the BEFORE UPDATE trigger returns NULL then we haven't got a problem; and we can't know that until we get to step 7. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP fix proposal for bug #6123
On Wed, Aug 3, 2011 at 10:48 AM, Kevin Grittner wrote: > As have many other data mangling bugs which we fix in minor > releases. Our point here is that it's never been like this in any > product that the Wisconsin Courts has used for triggers, and never > will be. I don't believe this is very similar at all to the sorts of bugs we fix in minor releases, but let's leave that aside for the moment since it seems that we're in violent agreement about which release we're targeting. With respect to the second part of your comment, it might be useful to describe how this works in some of those other products. >> I'm not sure I understand the justification for throwing an error. > > Nobody has suggested sane semantics for doing otherwise. There is > the pipe dream of what application programmers here would like, > described earlier in the thread. I just don't see that happening, > and I'm not sure it addresses all of Florian's concerns anyway. You > had a suggestion for how to salvage the update situation, but it was > pretty hand-wavy, and I find it very hard to reconcile to some of > the issues raised by Florian. Maybe someone can propose sane > semantics which covers all the objections, and maybe that would even > be possible to implement; but right now Florian's approach seems > like the most solid proposal yet put forward. It probably is, but I'm not sure we've done enough thinking about it to be certain that it's the right way forward. On that note, I think in some ways the problems we're hitting here are very much like serialization anomalies. If the user updates a tuple based on its PK and sets some non-PK field to a constant, and while we're doing that, a BEFORE trigger updates any field in the tuple other than the PK, then in theory it seems we ought to be able to reconcile the updates. It feels like the result ought to be the same as if we'd simply run the BEFORE-trigger update to completion, and then run the main update. 
However, if the BEFORE-trigger modifies any columns that the main update examined while constructing its value for NEW, then the updates can't be serialized. There's no way to get the same result as if you'd done either one of them first, because they are inextricably intertwined. In practice, my hand-wavy reference to "reconciling the updates" is a problem because of the way the trigger interface works. It feels pretty impossible to decide that we're going to do the update, but with some other random values we dredged up from some other update replacing some of the ones the user explicitly handed to us. But if the trigger could return an indication of which column values it wished to override, then it seems to me that we could piece together a reasonable set of semantics. It's not exactly clear how to make that work, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
On 3 August 2011 15:29, Tom Lane wrote: > No, this is not an improvement at all. The point of the code is that we > are about to use the enum value as an integer array subscript, and we > want to make sure it is within the array bounds. Tying that logic to > some member of the enum is not a readability or safety improvement. > We aren't trusting the caller to pass a valid enum value, and likewise > this code doesn't particularly want to trust the enum definition to be > anything in particular. I would think that if someone were to have a reason to change the explicit value set for the identifier PGRES_EMPTY_QUERY, they would carefully consider the ramifications of doing so. It's far more likely they'd just append new values to the end of the enum though. This is why I don't consider that what I've proposed exposes us to any additional risk. > There is another point here, though, which is that if we're not sure > whether the compiler considers ExecStatusType to be signed or unsigned, > then we have no idea what the test "status < PGRES_EMPTY_QUERY" even > means. I'm sorry, but I don't know what you mean by this. > So I think the most reasonable fix is probably > > if ((unsigned int) status >= sizeof pgresStatus / sizeof > pgresStatus[0]) > > which is sufficient to cover both directions, since if status is passed > as -1 then it will convert to a large unsigned value. It's also a > natural expression of what we really want, ie, that the integer > equivalent of the enum value is in range. I'm not convinced that relying on the conversion that way is an improvement, but it's not as if I feel very strongly about it. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] error: could not find pg_class tuple for index 2662
daveg writes: > We have installed the patch and have encountered the error as usual. > However there is no additional output from the patch. I'm speculating > that the pg_class scan in ScanPgRelationDetailed() fails to return > tuples somehow. Evidently not, if it's not logging anything, but now the question is why. One possibility is that for some reason RelationGetNumberOfBlocks is persistently lying about the file size. (We've seen kernel bugs before that resulted in transiently wrong values, so this isn't totally beyond the realm of possibility.) Please try the attached patch, which extends the previous one to add a summary line including the number of blocks physically scanned by the seqscan. regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 81cea8b60406dffa7b5d278697ab5ad6cef6a3d8..589513a2d35987b2202ce1162b326db2d358b6fa 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 32,37 ****
--- 32,38 ----
  #include "access/genam.h"
  #include "access/reloptions.h"
+ #include "access/relscan.h"
  #include "access/sysattr.h"
  #include "access/transam.h"
  #include "access/xact.h"
***************
*** 64,69 ****
--- 65,71 ----
  #include "optimizer/var.h"
  #include "rewrite/rewriteDefine.h"
  #include "storage/fd.h"
+ #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
  #include "storage/smgr.h"
  #include "utils/array.h"
*************** ScanPgRelation(Oid targetRelId, bool ind
*** 310,315 ****
--- 312,388 ----
  }
  
  /*
+  * ScanPgRelationDetailed
+  *
+  * Try to figure out why we failed to locate row for relation.
+  */
+ static HeapTuple
+ ScanPgRelationDetailed(Oid targetRelId)
+ {
+ 	HeapTuple	pg_class_tuple;
+ 	Relation	pg_class_desc;
+ 	SysScanDesc	pg_class_scan;
+ 	ScanKeyData	key[1];
+ 	int			count = 0;
+ 
+ 	/*
+ 	 * form a scan key
+ 	 */
+ 	ScanKeyInit(&key[0],
+ 				ObjectIdAttributeNumber,
+ 				BTEqualStrategyNumber, F_OIDEQ,
+ 				ObjectIdGetDatum(targetRelId));
+ 
+ 	/*
+ 	 * Open pg_class and fetch tuples, forcing heap scan and disabling
+ 	 * visibility checks.
+ 	 */
+ 	pg_class_desc = heap_open(RelationRelationId, AccessShareLock);
+ 	pg_class_scan = systable_beginscan(pg_class_desc, ClassOidIndexId,
+ 									   false,
+ 									   SnapshotAny,
+ 									   1, key);
+ 
+ 	while (HeapTupleIsValid((pg_class_tuple = systable_getnext(pg_class_scan))))
+ 	{
+ 		Buffer		buf = pg_class_scan->scan->rs_cbuf;
+ 		bool		valid;
+ 
+ 		count++;
+ 
+ 		/* need buffer lock to call HeapTupleSatisfiesVisibility */
+ 		LockBuffer(buf, BUFFER_LOCK_SHARE);
+ 		valid = HeapTupleSatisfiesVisibility(pg_class_tuple,
+ 											 SnapshotNow,
+ 											 buf);
+ 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ 
+ 		elog(LOG, "searching %u for pg_class tuple for index %u: found ctid (%u,%u), xmin %u, xmax %u, flags 0x%4x 0x%4x, valid %d",
+ 			 pg_class_desc->rd_node.relNode,
+ 			 targetRelId,
+ 			 ItemPointerGetBlockNumber(&(pg_class_tuple->t_self)),
+ 			 ItemPointerGetOffsetNumber(&(pg_class_tuple->t_self)),
+ 			 HeapTupleHeaderGetXmin(pg_class_tuple->t_data),
+ 			 HeapTupleHeaderGetXmax(pg_class_tuple->t_data),
+ 			 pg_class_tuple->t_data->t_infomask,
+ 			 pg_class_tuple->t_data->t_infomask2,
+ 			 valid);
+ 	}
+ 
+ 	elog(LOG, "ScanPgRelationDetailed: found %d tuples with OID %u in %u blocks of filenode %u",
+ 		 count,
+ 		 targetRelId,
+ 		 pg_class_scan->scan->rs_nblocks,
+ 		 pg_class_desc->rd_node.relNode);
+ 
+ 	/* all done */
+ 	systable_endscan(pg_class_scan);
+ 	heap_close(pg_class_desc, AccessShareLock);
+ 
+ 	return NULL;
+ }
+ 
+ /*
   * AllocateRelationDesc
   *
   * This is used to allocate memory for a new relation descriptor
*************** RelationReloadIndexInfo(Relation relatio
*** 1737,1744 ****
  	indexOK = (RelationGetRelid(relation) != ClassOidIndexId);
  	pg_class_tuple = ScanPgRelation(RelationGetRelid(relation), indexOK);
  	if (!HeapTupleIsValid(pg_class_tuple))
! 		elog(ERROR, "could not find pg_class tuple for index %u",
! 			 RelationGetRelid(relation));
  	relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
  	memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
  	/* Reload reloptions in case they changed */
--- 1810,1824 ----
  	indexOK = (RelationGetRelid(relation) != ClassOidIndexId);
  	pg_class_tuple = ScanPgRelation(RelationGetRelid(relation), indexOK);
  	if (!HeapTupleIsValid(pg_class_tuple))
! 	{
! 		pg_class_tuple = ScanPgRelationDetailed(RelationGetRelid(relation));
! 		if (!HeapTupleIsValid(pg_class_tuple))
! 			elog(ERROR, "could not find pg_class tuple for index %u",
! 				 RelationGetRelid(relation));
! 		else
! 			elog(LOG, "could not find pg_class tuple for index %u, but succeeded on second try",
! 				 RelationGetRelid(relation));
! 	}
  	relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
  	memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
  	/* Reload reloptions in case they changed */

-- Sent via
Re: [HACKERS] WIP fix proposal for bug #6123
Robert Haas wrote: > Kevin Grittner wrote: Would you feel comfortable with a patch which threw an error on the DELETE case, as it does on the UPDATE case? >>> >>> Yes, though with a little additional twist. The twist being that >>> I'd like the error to be thrown earlier, at the time of the >>> offending UPDATE or DELETE, not later, when the original UPDATE >>> or DELETE which caused the BEFORE trigger invocation stumbles >>> over the updated row. It not only seems more logical that way, >>> but it also makes it possible for the trigger to catch the >>> error, and react accordingly. >> >> I hadn't thought of that. It does seem better in every way >> except for how much work it takes to do it and how much testing >> it needs to have confidence in it. Definitely not 9.1 material. > > IMHO, none of this is 9.1 material. Nobody is currently arguing for including anything to fix this in 9.1. Perhaps I should have been more explicit that in agreeing with Florian, I'm giving up on the idea of any PostgreSQL release attempting to do anything about this before 9.2. That said, this is a data corrupting bug in the view of management here, based on reports from testers and my explanation of why they saw that behavior. They are aware that it can be worked around, but that's not acceptable if there's no reliable way to find all occurrences of patterns that hit this bug. To keep this effort alive here, I am using my quick hack for 9.0 and 9.1. Deletes will behave as they expect, and updates won't quietly discard part of the work of a transaction -- an error will be thrown in that situation. > It's been like this forever As have many other data mangling bugs which we fix in minor releases. Our point here is that it's never been like this in any product that the Wisconsin Courts has used for triggers, and never will be. > People who would otherwise be tempted to write their applications > this way will end up not doing so precisely because it currently > does not work. 
I can attest to that. > So I think we should focus on getting the right semantics here, > rather than trying to minimize the scope of the change. Agreed. I thought that's what we were talking about. > I'm not sure I understand the justification for throwing an error. Nobody has suggested sane semantics for doing otherwise. There is the pipe dream of what application programmers here would like, described earlier in the thread. I just don't see that happening, and I'm not sure it addresses all of Florian's concerns anyway. You had a suggestion for how to salvage the update situation, but it was pretty hand-wavy, and I find it very hard to reconcile to some of the issues raised by Florian. Maybe someone can propose sane semantics which covers all the objections, and maybe that would even be possible to implement; but right now Florian's approach seems like the most solid proposal yet put forward. > Updating a row twice is not an error under any other > circumstances; nor is deleting a row you've updated. Why should > it be here? As Florian pointed out, it's not exactly a matter of doing those things. If you have a proposal which addresses the issues raised earlier in the thread, please share. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
Peter Geoghegan writes: > On 3 August 2011 12:19, Heikki Linnakangas > wrote: >> Right, but the purpose of that check is to defend from programmer error. If >> the programmer screws up and calls "PQresStatus(-1)", we want to give an >> error, not crash. If you assume that the programmer will only pass a valid >> enum constant as parameter, then you might as well remove the if-statement >> altogether. I don't think that would be an improvement. > Ahh. I failed to consider the intent of the code. > Attached patch has a better solution than casting though - we simply > use an enum literal. The fact that PGRES_EMPTY_QUERY is the first > value in the enum can be safely assumed to be stable, not least > because we've even already explicitly given it a corresponding value > of 0. No, this is not an improvement at all. The point of the code is that we are about to use the enum value as an integer array subscript, and we want to make sure it is within the array bounds. Tying that logic to some member of the enum is not a readability or safety improvement. We aren't trusting the caller to pass a valid enum value, and likewise this code doesn't particularly want to trust the enum definition to be anything in particular. There is another point here, though, which is that if we're not sure whether the compiler considers ExecStatusType to be signed or unsigned, then we have no idea what the test "status < PGRES_EMPTY_QUERY" even means. So I think the most reasonable fix is probably if ((unsigned int) status >= sizeof pgresStatus / sizeof pgresStatus[0]) which is sufficient to cover both directions, since if status is passed as -1 then it will convert to a large unsigned value. It's also a natural expression of what we really want, ie, that the integer equivalent of the enum value is in range. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
It occurred to me that you may be a little surprised that the patch actually prevents the warning from occurring. If you have any doubts, I can assure you that it does. Clang differentiates between 0 as an rvalue, 0 from an enum literal and 0 resulting from evaluating a macro expression (including a function-like macro with deep nesting). -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench internal contention
Robert Haas writes: > On Tue, Aug 2, 2011 at 8:44 PM, Tom Lane wrote: >> Maybe. But if that's the approach we want to use, let's just call it >> pg_erand48 in the code, and dispense with the alias macros as well as >> all vestiges of configure support. > Works for me. Just to confirm, that means we'd also change GEQO to > use pg_erand48(). Right, that's what I was thinking. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Fast GiST index build
On Wed, Aug 3, 2011 at 4:18 AM, Alexander Korotkov wrote: > I found that in previous version of patch I missed PageSetLSN > and PageSetTLI, but huge amount of WAL is still here. Also I found that huge > amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but > messages "FATAL: xlog flush request BFF11148/809A600 is not satisfied --- > flushed only to 44/9C518750" appears. Seems that there is some totally wrong > use of WAL if even optimization level does matter... Try setting wal_debug=true to see what records are getting emitted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transient plans versus the SPI API
Tom Lane writes: > Anyone have an opinion about that? I still have this application where PREPARE takes between 50ms and 300ms and EXECUTE 5ms to 10ms, and I can handle 1 PREPARE for 1 EXECUTE quite easily. (Yes the database fits in RAM, and yes when that's no longer the case we just upgrade the hardware) What does your proposal mean for such a use case? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] error: could not find pg_class tuple for index 2662
On Mon, Aug 01, 2011 at 01:23:49PM -0400, Tom Lane wrote:
> daveg writes:
> > On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
> >> I think we need to start adding some instrumentation so we can get a
> >> better handle on what's going on in your database. If I were to send
> >> you a source-code patch for the server that adds some more logging
> >> printout when this happens, would you be willing/able to run a patched
> >> build on your machine?
>
> > Yes we can run an instrumented server so long as the instrumentation does
> > not interfere with normal operation. However, scheduling downtime to switch
> > binaries is difficult, and generally needs to happen on a weekend, but
> > sometimes can be expedited. I'll look into that.
>
> OK, attached is a patch against 9.0 branch that will re-scan pg_class
> after a failure of this sort occurs, and log what it sees in the tuple
> header fields for each tuple for the target index. This should give us
> some useful information. It might be worthwhile for you to also log the
> results of
>
> select relname,pg_relation_filenode(oid) from pg_class
> where relname like 'pg_class%';
>
> in your script that does VACUUM FULL, just before and after each time it
> vacuums pg_class. That will help in interpreting the relfilenodes in
> the log output.

We have installed the patch and have encountered the error as usual. However there is no additional output from the patch. I'm speculating that the pg_class scan in ScanPgRelationDetailed() fails to return tuples somehow. I have also been trying to trace it further by reading the code, but have not got any solid hypothesis yet. In the absence of any debugging output I've been trying to deduce the call tree leading to the original failure.
So far it looks like this:

RelationReloadIndexInfo(Relation)                    // Relation is 2662 and !rd_isvalid
  pg_class_tuple = ScanPgRelation(2662, indexOK=false)  // returns NULL
    pg_class_desc = heap_open(1259, ACC_SHARE)
      r = relation_open(1259, ACC_SHARE)             // locks oid, ensures RelationIsValid(r)
        r = RelationIdGetRelation(1259)
          r = RelationIdCacheLookup(1259)            // assume success
          if !rd_isvalid:
            RelationClearRelation(r, true)
              RelationInitPhysicalAddr(r)            // r is pg_class relcache

-dg

-- 
David Gould       da...@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
Re: [HACKERS] Further news on Clang - spurious warnings
On 3 August 2011 12:19, Heikki Linnakangas wrote:
> Right, but the purpose of that check is to defend from programmer error. If
> the programmer screws up and calls "PQresStatus(-1)", we want to give an
> error, not crash. If you assume that the programmer will only pass a valid
> enum constant as parameter, then you might as well remove the if-statement
> altogether. I don't think that would be an improvement.

Ahh. I failed to consider the intent of the code.

Attached patch has a better solution than casting though - we simply use an enum literal. The fact that PGRES_EMPTY_QUERY is the first value in the enum can be safely assumed to be stable, not least because we've even already explicitly given it a corresponding value of 0.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 605d242..b6e6363 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2386,7 +2386,7 @@ PQresultStatus(const PGresult *res)
 char *
 PQresStatus(ExecStatusType status)
 {
-	if (status < 0 || status >= sizeof pgresStatus / sizeof pgresStatus[0])
+	if (status < PGRES_EMPTY_QUERY || status >= sizeof pgresStatus / sizeof pgresStatus[0])
 		return libpq_gettext("invalid ExecStatusType code");
 	return pgresStatus[status];
 }
Re: [HACKERS] Further news on Clang - spurious warnings
On 03.08.2011 14:13, Peter Geoghegan wrote:
> On 3 August 2011 11:05, Peter Geoghegan wrote:
>> I don't believe that the standard allows for an implementation of
>> enums as unsigned integers - after all, individual enum literals can
>> be given corresponding negative integer values.
>
> It actually gives leeway to implement the enum as unsigned int when the
> compiler knows that it won't matter, because there are no negative
> integer values that correspond to some enum literal. The hint was in my
> original warning. :-)
>
>> This warning is only seen because the first enum literal in the enum
>> is explicitly given the value 0, thus precluding the possibility of
>> the value being < 0, barring some abuse of the enum.
>
> It's also seen if no explicit values are given, and the compiler opts
> to make the representation unsigned. It is not seen if the value is -1,
> for example. Despite the fact that whether or not the value is unsigned
> is implementation defined, I think that the patch is still valid - the
> expression is at least logically tautological, even if it isn't
> necessarily bitwise tautological, because, as I said, barring some
> violation of the enum's contract, it should not be < 0. That's
> precisely why the compiler has opted to make it unsigned.

Right, but the purpose of that check is to defend from programmer error. If the programmer screws up and calls "PQresStatus(-1)", we want to give an error, not crash. If you assume that the programmer will only pass a valid enum constant as parameter, then you might as well remove the if-statement altogether. I don't think that would be an improvement.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Further news on Clang - spurious warnings
On 3 August 2011 11:05, Peter Geoghegan wrote:
> I don't believe that the standard allows for an implementation of
> enums as unsigned integers - after all, individual enum literals can
> be given corresponding negative integer values.

It actually gives leeway to implement the enum as unsigned int when the compiler knows that it won't matter, because there are no negative integer values that correspond to some enum literal. The hint was in my original warning. :-)

> This warning is only seen because the first enum literal in the enum
> is explicitly given the value 0, thus precluding the possibility of
> the value being < 0, barring some abuse of the enum.

It's also seen if no explicit values are given, and the compiler opts to make the representation unsigned. It is not seen if the value is -1, for example. Despite the fact that whether or not the value is unsigned is implementation defined, I think that the patch is still valid - the expression is at least logically tautological, even if it isn't necessarily bitwise tautological, because, as I said, barring some violation of the enum's contract, it should not be < 0. That's precisely why the compiler has opted to make it unsigned.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Further news on Clang - spurious warnings
On 03.08.2011 13:05, Peter Geoghegan wrote:
> I don't believe that the standard allows for an implementation of
> enums as unsigned integers - after all, individual enum literals can
> be given corresponding negative integer values.

C99 says:

    Each enumerated type shall be compatible with char, a signed
    integer type, or an unsigned integer type. The choice of type is
    implementation-defined,110) but shall be capable of representing
    the values of all the members of the enumeration.

See also: http://stackoverflow.com/questions/2579230/signedness-of-enum-in-c-c99-c-cx-gnu-c-gnu-c99

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Further news on Clang - spurious warnings
On 3 August 2011 10:34, Heikki Linnakangas wrote: > The check is only tautological if the compiler implements enums as unsigned > integers. Whether enums are signed or not is implementation-dependent. > Perhaps cast status to unsigned or signed explicitly before the checks? I don't believe that the standard allows for an implementation of enums as unsigned integers - after all, individual enum literals can be given corresponding negative integer values. This warning is only seen because the first enum literal in the enum is explicitly given the value 0, thus precluding the possibility of the value being < 0, barring some abuse of the enum. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
On 03.08.2011 12:25, Peter Geoghegan wrote:
> Attached patch removes the tautological part of an evaluated
> expression, fixing the problem flagged by this quite valid warning.

The check is only tautological if the compiler implements enums as unsigned integers. Whether enums are signed or not is implementation-dependent. Perhaps cast status to unsigned or signed explicitly before the checks?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Further news on Clang - spurious warnings
Attached patch removes the tautological part of an evaluated expression, fixing the problem flagged by this quite valid warning.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 605d242..51af0d8 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2386,7 +2386,7 @@ PQresultStatus(const PGresult *res)
 char *
 PQresStatus(ExecStatusType status)
 {
-	if (status < 0 || status >= sizeof pgresStatus / sizeof pgresStatus[0])
+	if (status >= sizeof pgresStatus / sizeof pgresStatus[0])
 		return libpq_gettext("invalid ExecStatusType code");
 	return pgresStatus[status];
 }
Re: [HACKERS] WIP fix proposal for bug #6123
On Aug 3, 2011, at 04:54, Robert Haas wrote:
> On Tue, Aug 2, 2011 at 2:32 PM, Kevin Grittner wrote:
>>>> Would you feel comfortable with a patch which threw an error on
>>>> the DELETE case, as it does on the UPDATE case?
>>>
>>> Yes, though with a little additional twist. The twist being that
>>> I'd like the error to be thrown earlier, at the time of the
>>> offending UPDATE or DELETE, not later, when the original UPDATE or
>>> DELETE which caused the BEFORE trigger invocation stumbles over
>>> the updated row. It not only seems more logical that way, but it
>>> also makes it possible for the trigger to catch the error, and
>>> react accordingly.
>>
>> I hadn't thought of that. It does seem better in every way except
>> for how much work it takes to do it and how much testing it needs to
>> have confidence in it. Definitely not 9.1 material.
>
> I'm not sure I understand the justification for throwing an error.
> Updating a row twice is not an error under any other circumstances;
> nor is deleting a row you've updated. Why should it be here?

Because updating or deleting a row from within a BEFORE trigger fired upon updating or deleting that very row causes two intermingled UPDATE/DELETE executions, not one UPDATE/DELETE followed by another.

Here's a step-by-step description of what happens in the case of two UPDATEs, assuming that we don't ignore any updates and throw no error.

1) UPDATE T SET ... WHERE ID=1
2) Row with ID=1 is found & locked
3) BEFORE UPDATE triggers are fired, with OLD containing the original row's contents and NEW the values specified in (1)
4) UPDATE T SET ... WHERE ID=1, executed from within one of the triggers
5) BEFORE UPDATE triggers are fired again, with OLD containing the original row's contents and NEW the values specified in (4).
6) The row is updated with the values specified in (4), possibly changed by the triggers in (5)
7) The row is updated with the values specified in (2), possibly changed by the triggers in (3).
Now, in step (7), we're overwriting the UPDATE from steps 4-6 without even looking at the row that UPDATE produced. If, for example, both UPDATEs do "counter = counter + 1", we'd end up with the counter incremented by 1, *not* by 2, even though there were two distinct UPDATEs. Or even worse, the inner UPDATE at (4) might have modified the ID. Then, we'll apply the outer UPDATE in (7), even though the original WHERE condition from (1) no longer matches the row.

We could attempt to fix that by using the EvalPlanQual machinery to re-check the WHERE clause and re-compute the new row in (7). However, EvalPlanQual introduces a lot of conceptual complexity, and can lead to very surprising results for more complex UPDATEs. (There's that whole self-join problem with EvalPlanQual, for example.) Also, even if we did use EvalPlanQual, we'd still have executed the outer BEFORE trigger (step 3) with values for OLD and NEW which *don't* match the row we actually update in (7). Doing that seems rather hazardous too - who knows what the BEFORE trigger has used the values for.

The difference between Kevin's original patch and my suggestion is, BTW, that he made step (7) throw an error while I suggested the error to be thrown already at (4).

best regards,
Florian Pflug
Re: [HACKERS] Further news on Clang - spurious warnings
On 3 August 2011 00:52, Peter Geoghegan wrote: > Now, the only warning that remains is that same Correction - there are actually 3 additional warnings like this in repl_gram.c: /home/peter/build/Release/bin/clang -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -I. -I../../../src/include -D_GNU_SOURCE -c -o repl_gram.o repl_gram.c repl_gram.y:106:30: warning: implicit conversion from enumeration type 'enum ReplNodeTag' to different enumeration type 'NodeTag' (aka 'enum NodeTag') [-Wconversion] (yyval.node) = (Node *) makeNode(IdentifySystemCmd); ^~~ ../../../src/include/nodes/nodes.h:475:64: note: expanded from: #define makeNode(_type_)((_type_ *) newNode(sizeof(_type_),T_##_type_)) ^ :180:1: note: expanded from: T_IdentifySystemCmd ^ ../../../src/include/nodes/nodes.h:452:19: note: expanded from: _result->type = (tag); \ ~ ^~~ I'll take a look into them later. This tautology warning sounds completely valid: /home/peter/build/Release/bin/clang -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -pthread -D_REENTRANT -D_THREAD_SAFE -D_POSIX_PTHREAD_SEMANTICS -fpic -DFRONTEND -DUNSAFE_STAT_OK -I. -I../../../src/include -D_GNU_SOURCE -I../../../src/port -I../../../src/port -DSO_MAJOR_VERSION=5 -c -o fe-exec.o fe-exec.c fe-exec.c:2408:13: warning: comparison of unsigned enum expression < 0 is always false [-Wtautological-compare] if (status < 0 || status >= sizeof pgresStatus / sizeof pgresStatus[0]) ~~ ^ ~ 1 warning generated. So, there are 4 remaining warnings. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Fast GiST index build
I found that in previous version of patch I missed PageSetLSN and PageSetTLI, but huge amount of WAL is still here. Also I found that huge amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but messages "FATAL: xlog flush request BFF11148/809A600 is not satisfied --- flushed only to 44/9C518750" appears. Seems that there is some totally wrong use of WAL if even optimization level does matter... -- With best regards, Alexander Korotkov. gist_fast_build-0.9.1.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cataloguing NOT NULL constraints
On 3 August 2011 04:40, Alvaro Herrera wrote:
> Excerpts from Dean Rasheed's message of sáb jul 30 18:40:46 -0400 2011:
>
>> Looks pretty good to me (not too dirty). I suppose given that the
>> parser transforms AT_ColumnConstraint into one of the existing command
>> subtypes, you could just have gram.y emit an AT_AddConstraint with the
>> ColumnDef attached, to save adding a new subtype, but there's probably
>> not much difference.
>
> Thanks. I've done the other changes you suggested, but I don't see that
> it's desirable to have gram.y emit AT_AddConstraint directly. It seems
> cleaner to be able to turn a NOT NULL constraint into AT_SetNotNull
> in parse_utilcmd instead.

I wasn't proposing getting rid of that bit in parse_utilcmd. I just meant move the block that calls transformColumnConstraints() to the AT_AddConstraint case in transformAlterTableStmt(). So it would test if it has a ColumnDef attached, and either process a table constraint or a set of column constraints. That would avoid the need for a new command subtype, and so the changes to tablecmds.c would not be needed. I think it would also avoid the need to add colname to the Constraint struct, so none of the parsenodes.h changes would be needed either.

> (Maybe I'll have to bite the bullet and make AT_AddConstraint work for
> not null constraints as well, as part of the larger patch. Not sure.)
> Currently, the table constraint syntax only lets you do a single
> constraint at a time, but you can do multiple constraints with the
> column constraint syntax. I am not sure how hard it is to rework the
> grammar so that only a single constraint is allowed, but I'm not sure
> that it's worth the trouble either.
>
> Attached is an updated version, touching the docs and adding a new
> simple regression test.
>
> But ... I just noticed that I need to touch ALTER DOMAIN in a similar
> way as well.

Oh I keep forgetting about domains. But after a quick glance at the docs, doesn't it already have syntax for this?
Regards, Dean