Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 08:35, Robert Haas wrote:
>> I expect it would be fine to have a tool that pulls LCRs out of WAL to
>> prepare that to be sent to remote locations. Is that what you have in
>> mind?
>
> Yes. I think it should be possible to generate LCRs from WAL, but I
> think that the on-the-wire format for LCRs should be different from
> the WAL format. Trying to use the same format for both things seems
> like an unpleasant straightjacket.

You're confusing things here. Yes, we can use a different on-the-wire format. No problem.

As I've already said, the information needs to be in WAL first before we can put it into LCRs, so the "don't use same format" argument is not relevant as to why the info must be on the WAL record in the first place.

And as said elsewhere, doing that does not cause problems in any of these areas: WAL bloat, performance, long term restrictions on numbers of node ids, as has so far been claimed on this thread.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 01:27, Kevin Grittner wrote:
> Andres Freund wrote:
>> Yes, that's definitely a valid use-case. But that doesn't preclude the
>> other - also not uncommon - use-case where you want to have different
>> masters which all contain up2date data.
>
> I agree. I was just saying that while one requires an origin_id, the
> other doesn't. And those not doing MM replication definitely don't need it.

I think it would be helpful to list down a few concrete examples of this. The stereotypical multi-master scenario is that you have a single table that's replicated to two servers, and you can insert/update/delete on either server. Conflict resolution strategies vary.

The reason we need an origin id in this scenario is that otherwise this will happen:

1. A row is updated on node A.
2. Node B receives the WAL record from A, and updates the corresponding row in B. This generates a new WAL record.
3. Node A receives the WAL record from B, and updates the row again. This again generates a new WAL record, which is replicated to B, and you loop indefinitely.

If each WAL record carries an origin id, node A can use it to refrain from applying the WAL record it receives from B, which breaks the loop.

However, note that in this simple scenario, if the logical log replay / conflict resolution is smart enough to recognize that the row has already been updated, because the old and new rows are identical, the loop is broken at step 3 even without the origin id. That works for newest-update-wins and similar strategies. So the origin id is not absolutely necessary in this case.

Another interesting scenario is that you maintain a global counter, like in an inventory system, and conflicts are resolved by accumulating the updates. For example, if you do "UPDATE SET counter = counter + 1" simultaneously on two nodes, the result is that the counter is incremented by two. The avoid-update-if-already-identical optimization doesn't help in this case; the origin id is necessary.
Now, let's take the inventory system example further. There are actually two ways to update a counter. One is when an item is checked in or out of the warehouse, i.e. "UPDATE counter = counter + 1". Those updates should accumulate. But another operation resets the counter to a specific value, "UPDATE counter = 10", like when taking an inventory. That should not accumulate with other changes, but should be newest-update-wins. The origin id is not enough for that, because by looking at the WAL record and the origin id, you don't know which type of an update it was.

So, I don't like the idea of adding the origin id to the record header. It's only required on some occasions, and on some record types. And I'm worried it might not even be enough in more complicated scenarios.

Perhaps we need a more generic WAL record annotation system, where a plugin can tack arbitrary information onto WAL records. The extra information could be stored in the WAL record after the rmgr payload, similar to how backup blocks are stored. WAL replay could just ignore the annotations, but a replication system could use them to store the origin id or whatever extra information it needs.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
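The two conflict-resolution behaviours Heikki describes - accumulating increments versus newest-update-wins resets - plus the origin-id loop filter can be sketched in a toy model (this is illustrative Python, not PostgreSQL code; the record fields and node names are assumptions for the sketch):

```python
# Toy model of the inventory-counter scenarios: an accumulating
# increment ("counter = counter + 1") and an absolute reset
# ("counter = 10"). Each change record carries its origin node, so a
# receiver can skip changes it generated itself, breaking the loop.

def apply_change(local, change, local_node):
    """Apply one replicated change record to a node's local counter."""
    if change["origin"] == local_node:
        return local                        # our own change echoed back: ignore
    if change["kind"] == "increment":
        return local + change["delta"]      # increments accumulate
    elif change["kind"] == "set":
        return change["value"]              # resets are newest-update-wins
    raise ValueError(change["kind"])

# Node A and node B both start at 5 and each apply one local increment.
a = b = 5
a += 1                                      # local increment on A
b += 1                                      # local increment on B
# Each node then applies the other's replicated change:
a = apply_change(a, {"origin": "B", "kind": "increment", "delta": 1}, "A")
b = apply_change(b, {"origin": "A", "kind": "increment", "delta": 1}, "B")
assert a == b == 7                          # both increments accumulated
# A's own change looping back via B is filtered out by the origin id:
a = apply_change(a, {"origin": "A", "kind": "increment", "delta": 1}, "A")
assert a == 7
```

Note that without the origin field on the record, the final call could not distinguish A's echoed increment from a genuinely new one - which is exactly why avoid-update-if-identical is insufficient for counters.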
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 11:26, Tom Lane wrote:
> "Kevin Grittner" writes:
>> Simon Riggs wrote:
>>> The proposal is to use WAL to generate the logical change stream.
>>> That has been shown in testing to be around x4 faster than having
>>> a separate change stream, which must also be WAL logged (as Jan
>>> noted).
>
>> Sure, that's why I want it.
>
> I think this argument is basically circular. The reason it's 4x faster
> is that the WAL stream doesn't actually contain all the information
> needed to generate LCRs (thus all the angst about maintaining catalogs
> in sync, what to do about unfriendly datatypes, etc). By the time the
> dust has settled and you have a workable system, you will have bloated
> WAL and given back a large chunk of that multiple, thereby invalidating
> the design premise. Or at least that's my prediction.

The tests were conducted with the additional field added, so your prediction is not verified. The additional fields do not bloat WAL records - they take up exactly the same space as before.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
>> I've got a problem with the assumption that, when pg_control is trash,
>> megabytes or gigabytes of WAL can still be relied on completely.
>>
>> I'm almost inclined to suggest that we not get next-LSN from WAL, but
>> by scanning all the pages in the main data store and computing the max
>> observed LSN. This is clearly not very attractive from a performance
>> standpoint, but it would avoid the obvious failure mode where you lost
>> some recent WAL segments along with pg_control.

> I think it could be useful to have a tool that scans all the blocks
> and computes that value, but I'd want it to just print the value out
> and let me decide what to do about it. There are cases where you
> don't necessarily want to clobber pg_control, but you do have future
> LSNs in your data file pages. This can be either because the disk ate
> your WAL, or because you didn't create recovery.conf, or because your
> disk corrupted the LSNs on the data file pages. I'd want a tool that
> could be either run on an individual file, or recursively on a
> directory.

The whole point is that we need to find a valid next-LSN (the redo replay location, as I understand it). If we let the user decide about it, I think it can lead to an inconsistent database. As per my understanding, the database can reach a consistent point after a crash only if it has both the data files and the WAL. So if it has lost WAL, I am not able to see how we can make the database consistent.

> If these values seem acceptable, use -f to force reset.
> [rhaas pgsql]$ pg_resetxlog -f ~/pgdata
> pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
> Transaction log reset
> [rhaas pgsql]$ postgres
> LOG: database system was shut down at 2012-06-19 15:25:28 EDT
> LOG: database system is ready to accept connections
> LOG: autovacuum launcher started
>
> So I still don't understand what problem we're solving here.

1. The values (like nextoid, nextxid, etc.) are currently guessed values, which can be improved by taking them from the last checkpoint record in the WAL files.

2. The value for next-LSN (ControlFile.checkPointCopy.redo) will be a guessed value which, if used directly for recovery after pg_resetxlog, will lead to an inconsistent database. So I want to improve the logic to produce either an appropriate or a more reliable value for next-LSN. The documentation mentions that starting the database after using pg_resetxlog can leave it containing inconsistent data; the exact wording is quoted below.

My proposal for this Todo item is to improve the values generated for pg_control, so that it becomes easier for users to recover from database corruption scenarios. I don't think that even after this feature users will be able to recover the database in all corruption scenarios; however, it can improve on the current situation.

pg_resetxlog documentation related excerpt:

"After running this command, it should be possible to start the server, but bear in mind that the database might contain inconsistent data due to partially-committed transactions. You should immediately dump your data, run initdb, and reload. After reload, check for inconsistencies and repair as needed."

With Regards,
Amit Kapila.
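Tom's suggested fallback - scanning every data page for the maximum observed LSN - and Robert's per-file/per-directory tool could look roughly like this (a sketch, assuming 8 kB pages and a little-endian build; `pd_lsn` occupies the first 8 bytes of each page header as two 32-bit words):

```python
# Sketch of a max-LSN scanner over data files. Assumptions: BLCKSZ is
# 8192 and pages start with pd_lsn stored as (xlogid, xrecoff), two
# little-endian uint32s - verify both against your build before use.
import os
import struct

BLCKSZ = 8192

def max_lsn_in_file(path):
    """Return the highest (xlogid, xrecoff) pair found in one file."""
    best = (0, 0)
    with open(path, "rb") as f:
        while True:
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:          # ignore a torn trailing page
                break
            xlogid, xrecoff = struct.unpack_from("<II", page, 0)
            best = max(best, (xlogid, xrecoff))
    return best

def max_lsn_under(directory):
    """Recursively scan a directory tree, as Robert suggests."""
    best = (0, 0)
    for root, _, files in os.walk(directory):
        for name in files:
            best = max(best, max_lsn_in_file(os.path.join(root, name)))
    return best
```

As Robert says, such a tool should only print the value; deciding whether to feed it into pg_resetxlog stays with the administrator.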
Re: [HACKERS] return values of backend sub-main functions
On Tue, Jun 19, 2012 at 4:31 AM, Peter Eisentraut wrote:
> On ons, 2012-01-18 at 21:21 +0200, Peter Eisentraut wrote:
>> On lör, 2012-01-07 at 16:41 -0500, Tom Lane wrote:
>>> Peter Eisentraut writes:
>>>> I suggest that we change PostgresMain(), PostmasterMain(), BackendRun(),
>>>> WalSenderMain(), and WalSndLoop() to return void as well.
>>>
>>> I agree this code is not very consistent or useful, but one question:
>>> what should the callers do if one of these functions *does* return?
>>
>> I was thinking of a two-pronged approach: First, add
>> __attribute__((noreturn)) to the functions. This will cause a suitable
>> compiler to verify on a source-code level that nothing actually returns
>> from the function. And second, at the call site, put an abort(); /*
>> not reached */. Together, this will make the code cleaner and more
>> consistent, and will also help the compiler out a bit about the control
>> flow.
>
> Patch for 9.3 attached.

+1.

Should this call around line 4114 of postmaster.c not bother with proc_exit() anymore:

    /* And run the backend */
    proc_exit(BackendRun(&port));

I was hoping that some of the clang static analyzer complaints would go away with these changes, though it looks like only one[1] did. I would be interested to see the similar elog/ereport patch you mentioned previously; perhaps that will eliminate a bunch of warnings.

Josh

[1] this one goes away with the patch:
http://kupershmidt.org/pg/scan-build-2012-06-19-master/report-E2cUqJ.html#EndPath
Re: [HACKERS] Backport of fsync queue compaction
On 06/19/2012 08:41 PM, Robert Haas wrote:
> On Tue, Jun 19, 2012 at 7:33 PM, Josh Berkus wrote:
>> To what version will we backport it?
>
> All supported branches, I would think.

Yeah, I would feel a lot better if this was improved in 8.3 and up. I didn't bother trying to push it before now because I wanted both more bad examples and some miles on 9.1 without the new code triggering any problems from the field. Seems we're at that point now.

I fear a lot of people are going to stay on 8.3 for a long time even after it's officially unsupported. It's been hard enough to move some people off of 8.1 and 8.2 even with the carrot of "8.3 will be much faster at everything". I fear 8.3 is going to end up like 7.4, where a 7 year long support window passes and people are still using it years later.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
>> I believe if WAL files are proper, as mentioned in Alvaro's mail, the proposed logic should
>> generate correct values.
>> Do you see any problem in the logic proposed in my original mail?
>> Can I resume my work on this feature?

> Maybe I'm missing your point, but... why don't you just use PITR to
> recover from the corruption of pg_control?

AFAIK PITR can be used in a scenario where there is a base backup and we have archived the WAL files after that; those WAL files can then be applied to the base backup. In the scenario here we don't know the point from which to start the next replay, so I believe it will be difficult to use PITR.

-----Original Message-----
From: Fujii Masao [mailto:masao.fu...@gmail.com]
Sent: Tuesday, June 19, 2012 7:44 PM
To: Amit Kapila
Cc: Tom Lane; Alvaro Herrera; Cédric Villemain; Pg Hackers; Robert Haas
Subject: Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

On Tue, Jun 19, 2012 at 2:44 AM, Amit Kapila wrote:
>> AFAIR you can create pg_control from scratch already with pg_resetxlog.
>> The hard part is coming up with values for the counters, such as the
>> next WAL location. Some of them such as next OID are pretty harmless
>> if you don't guess right, but I'm worried that wrong next WAL could
>> make things worse not better.
>
> I believe if WAL files are proper, as mentioned in Alvaro's mail, the proposed logic should
> generate correct values.
> Do you see any problem in the logic proposed in my original mail?
> Can I resume my work on this feature?

Maybe I'm missing your point, but... why don't you just use PITR to
recover from the corruption of pg_control?

Regards,

--
Fujii Masao
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
"Kevin Grittner" writes:
> Simon Riggs wrote:
>> The proposal is to use WAL to generate the logical change stream.
>> That has been shown in testing to be around x4 faster than having
>> a separate change stream, which must also be WAL logged (as Jan
>> noted).

> Sure, that's why I want it.

I think this argument is basically circular. The reason it's 4x faster is that the WAL stream doesn't actually contain all the information needed to generate LCRs (thus all the angst about maintaining catalogs in sync, what to do about unfriendly datatypes, etc). By the time the dust has settled and you have a workable system, you will have bloated WAL and given back a large chunk of that multiple, thereby invalidating the design premise. Or at least that's my prediction.

			regards, tom lane
Re: [HACKERS] performance regression in 9.2 when loading lots of small tables
On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes wrote:
> But in the 9.2 branch, the slow phenotype was re-introduced in
> 1575fbcb795fc331f4, although perhaps the details of who is locking
> what differs. I haven't yet sorted that out.

It very much does. That commit prevents people from creating a relation in - or renaming a relation into - a schema that is being concurrently dropped, which in previous releases would have resulted in inconsistent catalog contents.

I admit that it harms your test case, but how likely is it that someone is going to put every single table into its own schema? And have shared_buffers low enough for this to be the dominant cost? I think in real-world scenarios this isn't going to be a problem - although, of course, making the lock manager faster would be nifty if we can do it, and this might be a good test case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Tue, Jun 19, 2012 at 4:22 PM, Andres Freund wrote:
>>> 1. dllist.h has double the element overhead by having an inline value
>>> pointer (which is not needed when embedding) and a pointer to the list
>>> (which I have a hard time seeing as being useful)
>>> 2. only a doubly linked list; mine provided singly and doubly linked ones
>>> 3. missing macros to use when embedded in a larger struct (containerof()
>>> wrappers and for(...) support, basically)
>>> 4. most things are external function calls...
>>> 5. way more branches/complexity in most of the functions. My
>>> implementation doesn't use any branches for the typical easy
>>> modifications (push, pop, remove element somewhere) and only one for the
>>> typical tests (empty, has-next, ...)
>>>
>>> The performance and memory aspects were crucial for the aforementioned
>>> toy project (slab allocator for postgres). It's not that crucial for the
>>> applycache, where the lists currently are mostly used, although it's also
>>> relatively performance sensitive and obviously does a lot of list
>>> manipulation/iteration.
>>>
>>> If I had to decide I would add the missing api from dllist.h to my
>>> implementation and then remove it. It's barely used - and only in an
>>> embedded fashion - as far as I can see.
>>> I can understand though if that argument is met with doubt by others ;).
>>> If that's the way it has to go I would add some more convenience support
>>> for embedding data to dllist.h and settle for that.
>>
>> I think it might be simpler to leave the name as Dllist and just
>> overhaul the implementation along the lines you suggest, rather than
>> replacing it with something completely different. Mostly, I don't
>> want to add a third thing if we can avoid it, given that Dllist as it
>> exists today is used only lightly.
>
> Well, if it's the name, I have no problem with changing it, but I don't see how
> you can keep the api as it currently is and address my points.
>
> If there is some buyin I can try to go either way (keeping the existing name,
> changing the api, adjusting the callers; or just adjusting the callers and
> throwing away the old implementation). I just don't want to get into that just
> to see somebody isn't agreeing with the fundamental idea.

My guess is that it wouldn't be too hard to remove some of the extra pointers. Anyone who is using Dllist as a non-inline list could be converted to List * instead. Also, the performance-critical things could be reimplemented as macros.

I question, though, whether we really need both singly and doubly linked lists. That seems like it's almost certainly a micro-optimization that we are better off not doing.

> The most contentious point is probably relying on USE_INLINE being available
> anywhere. Which I believe to be the point now that we have gotten rid of some
> platforms.

I would be hesitant to chuck that even though I realize it's unlikely that we really need !USE_INLINE. But see sortsupport for an example of how we've handled this in the recent past.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
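The "embedded" (intrusive) list design under discussion - where the link fields live inside the element itself rather than in separately allocated cells, so push/pop/remove are branch-free O(1) operations - can be sketched outside C as well. This is a toy Python model with illustrative names; the real implementation being proposed is C macros and inline functions with containerof()-style wrappers:

```python
# Toy intrusive doubly linked list: prev/next are fields of the element
# itself, so no separate list cells are allocated, and a circular
# sentinel head removes the branches for the empty/end cases.
class DListNode:
    __slots__ = ("prev", "next")
    def __init__(self):
        self.prev = self.next = self    # self-linked until inserted

class DListHead(DListNode):
    """Circular sentinel head: the list is empty when it points at itself."""
    def push_head(self, node):
        node.next = self.next
        node.prev = self
        self.next.prev = node
        self.next = node

    @staticmethod
    def remove(node):
        # No reference to the head needed - this is the branch-free
        # O(1) removal the embedded design is about.
        node.prev.next = node.next
        node.next.prev = node.prev

    def __iter__(self):                 # yields the embedded elements directly
        cur = self.next
        while cur is not self:
            yield cur
            cur = cur.next

    def empty(self):
        return self.next is self

# In C you'd recover the container via containerof(); in Python the
# element can simply inherit the node fields.
class Task(DListNode):
    def __init__(self, name):
        super().__init__()
        self.name = name

queue = DListHead()
for n in ("c", "b", "a"):
    queue.push_head(Task(n))
assert [t.name for t in queue] == ["a", "b", "c"]
```

The point Andres makes about overhead is visible here: iteration yields the containers themselves, with no per-element cell and no value pointer.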
Re: [HACKERS] performance regression in 9.2 when loading lots of small tables
On Tue, Jun 19, 2012 at 2:38 PM, Robert Haas wrote:
> On Tue, Jun 19, 2012 at 4:33 PM, Robert Haas wrote:
>> On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes wrote:
>>> There was a regression introduced in 9.2 that affects the creation and
>>> loading of lots of small tables in a single transaction.
>>>
>>> It affects the loading of a pg_dump file which has a large number of
>>> small tables (10,000 schemas, one table per schema, 10 rows per
>>> table). I did not test other schema configurations, so these
>>> specifics might not be needed to invoke the problem.
>>>
>>> It causes the loading of a dump with "psql -1 -f " to run at half the
>>> previous speed. Speed of loading without the -1 is not changed.
>>
>> I tried to figure out why this was happening. I tried it out on the
>> IBM POWER7 box after building from
>> f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
>> minutes. 'time' reports that psql used 38 seconds of user time and 11
>> seconds of system time. The remaining time was presumably spent
>> waiting for the backend and/or the kernel.
>> ...
>> I did a separate run using perf and it had this to say:
>>
>> Events: 1M cycles
>> + 13.18% postgres postgres [.] FlushRelationBuffers
>> +  9.65% postgres postgres [.] comparetup_index_btree
>> ...
>
> I built REL9_1_STABLE from commit
> 1643031e5fe02d2de9ae6b8f86ef9ffd09fe7d3f and it took 19m45.250s, even
> slower than master. So something is different in my environment
> versus yours.

Yes, I forgot perhaps the most important part: FlushRelationBuffers is a real pig under "-1 -f -" if you are running with large shared_buffers, and might swamp the other issue. I run with shared_buffers=8MB to get around that problem, and with fsync=off if you don't have a BBU or something. The unlocked test in FlushRelationBuffers recently added to HEAD is probably what makes HEAD faster if you are using large shared_buffers.

With small shared_buffers and fsync off, I get 1 minute to load 10,000 tables/schemas in the fast phenotype, or 2 minutes in the slow phenotype.

It looks like the phenotype has actually come and gone more than once (for different but related reasons), which is why I couldn't easily find the place in REL9_2_STABLE where it happened with git bisect.

The fundamental problem seems to be that the slow phenotype takes twice as many locks, which makes the local hash table twice as big, which in turn makes the reassignment of resource owners twice as slow. The original regression seems to be because each table had both an AccessExclusiveLock and a ShareUpdateExclusiveLock taken on it when it was checked to see if a toast table was needed. I now see that that was reverted in both branches. But in the 9.2 branch, the slow phenotype was re-introduced in 1575fbcb795fc331f4, although perhaps the details of who is locking what differ. I haven't yet sorted that out.

Since the fundamental problem will hopefully be fixed in 9.3, can 9.2 live with this regression in what is probably a rather rare case (loading a huge number of very small tables, with shared_buffers small enough to avoid the other limitation)?

Cheers,

Jeff
Re: [HACKERS] psql tab completion for GRANT role
On Tue, Jun 19, 2012 at 3:16 AM, Peter Eisentraut wrote:
> On tor, 2012-06-14 at 13:38 -0400, Robert Haas wrote:
>> On Sun, Jan 8, 2012 at 3:48 PM, Peter Eisentraut wrote:
>>> psql tab completion currently only supports the form GRANT privilege ON
>>> something TO someone (and the analogous REVOKE), but not the form GRANT
>>> role TO someone. Here is a patch that attempts to implement the latter.
>>
>> This seems to have fallen through the cracks.
>
> No, it was committed in January.

Oops, I missed that. Sorry for the noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Libxml2 load error on Windows
On Tue, Jun 19, 2012 at 6:36 AM, Alex Shulgin wrote:
> Dave Page writes:
>> On Tue, Jun 19, 2012 at 11:04 AM, Alex Shulgin wrote:
>>>
>>> In a real bug-tracking system we would create a new bug/ticket and set
>>> its target version to 'candidate for next minor release' or something
>>> like that. This way, if we don't release unless all targeted bugs are
>>> resolved, this would be taken care of (hopefully.)
>>
>> Well yes, but the point is that that is not how the project works. I'm
>> asking how we do handle this problem at the moment, because I realised
>> I haven't seen this happen in the past (largely because I haven't been
>> paying attention).
>
> It only works as long as there is some mechanism to ensure that the bugs
> which were not submitted to a commitfest are looked at. If there is none,
> then it doesn't actually work.
>
> So is there a real and reliable mechanism for that?

No. There probably should be, but in the meantime adding it to the CommitFest is better than a poke in the eye with a sharp stick.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture
On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund wrote:
>> Well, the words are fuzzy, but I would define logical replication to
>> be something which is independent of the binary format in which stuff
>> gets stored on disk. If it's not independent of the disk format, then
>> you can't do heterogeneous replication (between versions, or between
>> products). That precise limitation is the main thing that drives
>> people to use anything other than SR in the first place, IME.
>
> Not in mine. The main limitation I see is that you cannot write anything on
> the standby. Which sucks majorly for many things. It's pretty much impossible
> to "fix" that for SR outside of very limited cases.
> While many scenarios don't need multimaster, *many* need to write outside of
> the standby's replication set.

Well, that's certainly a common problem, even if it's not IME the most common, but I don't think we need to argue about which one is more common, because I'm not arguing against it. The point, though, is that if the logical format is independent of the on-disk format, the things we can do are a strict superset of the things we can do if it isn't. I don't want to insist that catalogs be the same (or else you get garbage when you decode tuples). I want to tolerate the fact that they may very well be different. That will in no way preclude writing outside the standby's replication set, nor will it prevent multi-master replication. It will, however, enable heterogeneous replication, which is a very important use case. It will also mean that innocent mistakes (like somehow ending up with a column that is text on one server and numeric on another server) produce comprehensible error messages, rather than garbage.

> It's not only the logging side which is a limitation in today's replication
> scenarios. The apply side scales even worse because it's *very* hard to
> distribute it between multiple backends.

I don't think that making LCR format = on-disk format is going to solve that problem. To solve that problem, we need to track dependencies between transactions, so that if tuple A is modified by T1 and T2, in that order, we apply T1 before T2. But if T3 - which committed after both T1 and T2 - touches none of the same data as T1 or T2, then we can apply it in parallel, so long as we don't commit until T1 and T2 have committed (because allowing T3 to commit early would produce a serialization anomaly from the point of view of a concurrent reader).

>> Because the routines that decode tuples don't include enough sanity
>> checks to prevent running off the end of the block, or even the end of
>> memory completely. Consider a corrupt TOAST pointer that indicates
>> that there is a gigabyte of data stored in an 8kB block. One of the
>> common symptoms of corruption IME is TOAST requests for -3 bytes of
>> memory.
>
> Yes, but we need to put safeguards against that sort of thing anyway. So sure,
> we can have bugs, but this is not a fundamental limitation.

There's a reason we haven't done that already, though: it's probably going to stink for performance. If it turns out that it doesn't stink for performance, great. But if it causes a 5% slowdown on common use cases, I suspect we're not gonna do it, and I bet I can construct a case where it's worse than that (think: a 400-column table with lots of varlenas, sorting by column 400 to return column 399). I think it's treading on dangerous ground to assume we're going to be able to "just go fix" this.

> Postgis uses one information table in a few more complex functions but not in
> anything low-level. Evidenced by the fact that it was totally normal for that
> to go out of sync before < 2.0.
>
> But even if such a thing would be needed, it wouldn't be problematic to make
> extension configuration tables be replicated as well.

Ugh. That's a hack on top of a hack. Now it all works great if type X is installed as an extension, but if it isn't installed as an extension then the world blows up.
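The parallel-apply constraint described above - a transaction may be applied alongside earlier ones only if it touches none of the same data, and even then must not commit before them - can be sketched as a toy batch planner (illustrative Python only; the function name and tuple-set representation are assumptions, not anything from the patch set):

```python
# Toy model of dependency-aware apply ordering: transactions arrive in
# origin commit order; those in the same batch touch disjoint data and
# can be applied in parallel, while batches commit strictly in order
# (so an independent T3 never commits before an earlier T1/T2).
def plan_apply_batches(txns):
    """txns: list of (txid, set_of_touched_tuples) in commit order.
    Returns a list of batches; txns within a batch are independent."""
    batches = []
    for txid, touched in txns:
        placed = False
        # Scan from the newest batch backwards: a txn may not hop over
        # any batch that touches overlapping data.
        for i in range(len(batches) - 1, -1, -1):
            if any(touched & t for _, t in batches[i]):
                if i + 1 == len(batches):
                    batches.append([(txid, touched)])
                else:
                    batches[i + 1].append((txid, touched))
                placed = True
                break
        if not placed:
            # Independent of everything so far: join the earliest batch.
            if batches:
                batches[0].append((txid, touched))
            else:
                batches.append([(txid, touched)])
    return batches

# T1 and T2 both touch tuple "a" (must serialize); T3 touches only "c"
# and can run in parallel with T1 even though it committed last.
txns = [("T1", {"a", "b"}), ("T2", {"a"}), ("T3", {"c"})]
assert plan_apply_batches(txns) == [
    [("T1", {"a", "b"}), ("T3", {"c"})],
    [("T2", {"a"})],
]
```

A real implementation would also need the commit-ordering rule Robert notes: even a txn placed in an early batch must delay its commit until every origin-earlier txn has committed, to avoid the serialization anomaly.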
> I am pretty sure it's not badly behaved. But how should the code know that? Do you
> want each type to explicitly say that it's unsafe if it is?

Yes, exactly. Or maybe there are varying degrees of non-safety, allowing varying degrees of optimization. Like: wire format = binary format is super-safe. Then having to call an I/O function that promises not to look at any catalogs is a bit less safe. And then there's really unsafe.

> I have played with several ideas:
>
> 1.)
> keep the decoding catalog in sync with command/event triggers, correctly
> replicating oids. If those log into some internal event table it's easy to keep
> the catalog in a correct transactional state, because the events from that
> table get decoded in the transaction and replayed at exactly the right spot
> *after* it has been reassembled. The locking on the generating side takes
> care of the concurrency aspects.

I am not following this one completely.

> 2.)
> Keep the decoding site up2date by replicating the
Re: [HACKERS] Backport of fsync queue compaction
On Tue, Jun 19, 2012 at 7:33 PM, Josh Berkus wrote:
> To what version will we backport it?

All supported branches, I would think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 6:14 PM, Andres Freund wrote:
> I definitely agree that low-level apply is possible as a module. Sure, change
> extraction needs core support, but I was talking about what you need to
> implement it reusing the "plain" logical support...
>
> What I do not understand is how you want to prevent loops in a simple manner
> without in-core support:
>
> A generates a HEAP_INSERT record. Gets decoded into the lcr stream as an
> INSERT action.
> B reads the lcr stream from A and applies the changes. A new HEAP_INSERT
> record. Gets decoded into the lcr stream as an INSERT action.
> A reads the lcr stream from B and ???
>
> At this point you need to prevent a loop. If you have the information where a
> change originally happened (xl_origin_id = A in this case) you can have a
> simple filter on A which ignores change records if lcr_origin_id ==
> local_replication_origin_id.

See my email to Chris Browne, which I think covers this. It needs a bit in WAL (per txn, or, heck, if it's one bit, maybe per record) but not a whole node ID.

>> You need a backend-local hash table inside the wal reader process, and
>> that hash table needs to map XIDs to node IDs. And you occasionally
>> need to prune it, so that it doesn't eat too much memory. None of
>> that sounds very hard.
>
> It's not very hard. It's just more complex than what I propose(d).

True, but not a whole lot more complex, and a moderate amount of complexity to save bit-space is a good trade, especially when Tom has come down against eating up the bit space. And I agree with him. If we've only got 16 bits of padding to work with, we should surely be judicious in burning them when that can be avoided for the expense of a few hundred lines of code.

>>> Btw, what do you mean with "conflating" the stream? I don't really see
>>> that being proposed.
>>
>> It seems to me that you are intent on using the WAL stream as the
>> logical change stream. I think that's a bad design. Instead, you
>> should extract changes from WAL and then ship them around in a format
>> that is specific to logical replication.
>
> No, I don't want that. I think we will need some different format once we have
> agreed how changeset extraction works.

I think you are saying that you agree with me that the formats should be different, but that the LCR format is undecided as yet. If that is in fact what you are saying, great. We'll need to decide that, of course, but I think there is a lot of cool stuff that can be done that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne wrote: > On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote: >>> Btw, what do you mean with "conflating" the stream? I don't really see that >>> being proposed. >> >> It seems to me that you are intent on using the WAL stream as the >> logical change stream. I think that's a bad design. Instead, you >> should extract changes from WAL and then ship them around in a format >> that is specific to logical replication. > > Yeah, that seems worth elaborating on. > > What has been said several times is that it's pretty necessary to > capture the logical changes into WAL. That seems pretty needful, in > order that the replication data gets fsync()ed avidly, and so that we > don't add in the race condition of needing to fsync() something *else* > almost exactly as avidly as is the case for WAL today. Check. > But it's undesirable to pull *all* the bulk of contents of WAL around > if it's only part of the data that is going to get applied. On a > "physical streaming" replica, any logical data that gets captured will > be useless. And on a "logical replica," the "physical" bits of WAL > will be useless. > > What I *want* you to mean is that there would be: > a) WAL readers that pull the "physical bits", and > b) WAL readers that just pull "logical bits." > > I expect it would be fine to have a tool that pulls LCRs out of WAL to > prepare that to be sent to remote locations. Is that what you have in > mind? Yes. I think it should be possible to generate LCRs from WAL, but I think that the on-the-wire format for LCRs should be different from the WAL format. Trying to use the same format for both things seems like an unpleasant straightjacket. This discussion illustrates why: we're talking about consuming scarce bit-space in WAL records for a feature that only a tiny minority of users will use, and it's still not really enough bit space. That stinks. 
If LCR transmission is a separate protocol, this problem can be engineered away at a higher level. Suppose we have three servers, A, B, and C, that are doing multi-master replication in a loop. A sends LCRs to B, B sends them to C, and C sends them back to A. Obviously, we need to make sure that each server applies each set of changes just once, but it suffices to have enough information in WAL to distinguish between replication transactions and non-replication transactions - that is, one bit. So suppose a change is made on server A. A generates LCRs from WAL, and tags each LCR with node_id = A. It then sends those LCRs to B. B applies them, flagging the apply transaction in WAL as a replication transaction, AND ALSO sends the LCRs to C. The LCR generator on B sees the WAL from apply, but because it's flagged as a replication transaction, it does not generate LCRs. So C receives LCRs from B just once, without any need for the node_id to be known in WAL. C can now also apply those LCRs (again flagging the apply transaction as replication) and it can also skip sending them to A, because it sees that they originated at A. Now suppose we have a more complex topology. Suppose we have a cluster of four servers A .. D which, for improved tolerance against network outages, are all connected pairwise. Normally all the links are up, so each server sends all the LCRs it generates directly to all other servers. But how do we prevent cycles? A generates a change and sends it to B, C, and D. B then sees that the change came from A so it sends it to C and D. C, receiving that change, sees that it came from A via B, so it sends it to D again, whereupon D, which got it from C and knows that the origin is A, sends it to B, who will then send it right back over to D. Obviously, we've got an infinite loop here, so this topology will not work. However, there are several obvious ways to patch it by changing the LCR protocol. Most obviously, and somewhat stupidly, you could add a TTL. 
A bit smarter, you could have each LCR carry a LIST of node_ids that it had already visited, refusing to send it to any node it had already been to, instead of a single node_id. Smarter still, you could send handshaking messages around the cluster so that each node can build up a spanning tree and prefix each LCR it sends with the list of additional nodes to which the recipient must deliver it. So, normally, A would send a message to each of B, C, and D destined only for that node; but if the A-C link went down, A would choose either B or D and send each LCR to that node destined for that node *and C*; then, that node would forward the message on to C. Or perhaps you think this is too complex and not worth supporting anyway, and that might be true, but the point is that if you insist that all of the identifying information must be carried in WAL, you've pretty much ruled it out, because we are not going to put TTL fields, or lists of node IDs, or lists of destinations, in WAL. But there is no reason they can't be attached to LCRs.
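The "list of visited nodes" fix can be modeled in a few lines of Python. Note that, because a change can still arrive over two different paths (e.g. B→D and B→C→D), each receiver also has to drop duplicates by LCR id; that extra bookkeeping is an assumption added here beyond what the mail spells out. This is a toy model, not PostgreSQL code:

```python
# Toy model of the "visited list" idea: each LCR carries the set of
# nodes it has already reached, and a node forwards it only to peers
# not yet in that set.  Because a change can still arrive over two
# different paths, each node also drops duplicates by LCR id -- an
# extra bit of bookkeeping the visited list alone doesn't give you.

links = {("A", "B"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")}  # A-C link is down

def connected(x, y):
    return (x, y) in links or (y, x) in links

nodes = ["A", "B", "C", "D"]
applied = {n: 0 for n in nodes}      # how often each node applied the change
seen = {n: set() for n in nodes}     # LCR ids already applied, per node

def deliver(node, lcr_id, visited):
    if lcr_id in seen[node]:
        return                       # duplicate via another path: drop
    seen[node].add(lcr_id)
    applied[node] += 1
    visited = visited | {node}
    for peer in nodes:
        if peer not in visited and connected(node, peer):
            deliver(peer, lcr_id, visited)

deliver("A", "lcr-1", set())         # change originates on A
print(applied)                       # every node applied it exactly once
```

Even with the A-C link down, the change reaches C via B, and the visited list keeps it from being bounced back around the mesh indefinitely.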
Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)
Dean Rasheed writes: > On 12 February 2012 02:06, Vik Reykja wrote: >> I decided to take a crack at the todo item created from the following post: >> http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php > Here's my review of this patch. I've marked this patch committed, although in the end there was nothing left of it ;-). After teaching the trigger skip logic that old PK nulls or new FK nulls mean the constraint needn't be checked, there is no case where ri_KeysEqual is called on data that is not known to be all-non-null on one side or the other. So it doesn't matter whether we use plain equality or is-not-distinct logic. We could have applied these changes anyway but I didn't see much value, since as previously noted there would be some cases where the comparison got microscopically slower. I did add a comment about the point in case anybody revisits the code again in future. > There's also a separate question about whether the RI trigger > functions need to be doing these key comparisons, given that they are > done earlier when the triggers are queued > (http://archives.postgresql.org/pgsql-performance/2005-10/msg00459.php), > but the savings to be made there are likely to be smaller than the > savings this patch makes by not queuing the triggers at all. I haven't looked into this question. It would only matter for PK updates (since the FK-side triggers make no such comparisons), so it's probably not going to make much difference in typical workloads. Still, if anybody wants to investigate ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Backport of fsync queue compaction
To what version will we backport it? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign keys in pgbench
Jeff Janes writes: > I think that pgbench should make it easy to assess the impact of > foreign key constraints. > The attached adds a --foreign-keys option to initialization mode which > creates all the relevant constraints between the default tables. I had need of this for testing what I'm doing with foreign keys, so I went ahead and gave it a quick review (just cosmetic changes) and committed it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format changes
Hi, On Wednesday, June 20, 2012 12:24:54 AM Heikki Linnakangas wrote: > On 19.06.2012 18:46, Andres Freund wrote: > > On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote: > >> Well, that was easier than I thought. Attached is a patch to make > >> XLogRecPtr a uint64, on top of my other WAL format patches. I think we > >> should go ahead with this. > > > > Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not > > sure if having two representations which just have a constant factor > > in between really makes sense. > I wasn't planning to, it didn't even occur to me that we might be able > to get rid of XLogSegNo to be honest. There's places that deal with whole > segments, rather than with specific byte positions in the WAL, so I > think XLogSegNo makes more sense in that context. Take > XLogArchiveNotifySeg(), for example. It notifies the archiver that a > given segment is ready for archiving, so we pass an XLogSegNo to > identify that segment as an argument. I suppose we could pass an > XLogRecPtr that points to the beginning of the segment instead, but it > doesn't really feel like an improvement to me. I am not sure it's a win either, was just wondering because they now are that similar. > >> Should we keep the old representation in the replication protocol > >> messages? That would make it simpler to write a client that works with > >> different server versions (like pg_receivexlog). Or, while we're at it, > >> perhaps we should mandate network-byte order for all the integer and > >> XLogRecPtr fields in the replication protocol. > > > > The replication protocol uses pq_sendint for integers which should take > > care of converting to big endian already. I don't think anything but the > > wal itself is otherwise transported in a binary fashion? So I don't > > think there is any such architecture dependency in the protocol > > currently? 
> We only use pq_sendint() for the few values exchanged in the handshake > before we start replicating, but once we begin, we just send structs > > around. For example, in ProcessStandbyReplyMessage(): > > static void > > ProcessStandbyReplyMessage(void) > > { > > > > StandbyReplyMessage reply; > > > > pq_copymsgbytes(&reply_message, (char *) &reply, > > sizeof(StandbyReplyMessage)); > > > > ... > > After that, we just use the fields in the reply struct like in any other > struct. Yes, forgot that, true. I guess the best fix would be to actually send normal messages instead of CopyData ones? Much more to type though... Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund wrote: > Yes, that's definitely a valid use-case. But that doesn't preclude > the other - also not uncommon - use-case where you want to have > different masters which all contain up-to-date data. I agree. I was just saying that while one requires an origin_id, the other doesn't. And those not doing MM replication definitely don't need it. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format changes
On 19.06.2012 18:46, Andres Freund wrote: On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote: Well, that was easier than I thought. Attached is a patch to make XLogRecPtr a uint64, on top of my other WAL format patches. I think we should go ahead with this. Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure if having two representations which just have a constant factor in between really makes sense. I wasn't planning to, it didn't even occur to me that we might be able to get rid of XLogSegNo to be honest. There's places that deal with whole segments, rather than with specific byte positions in the WAL, so I think XLogSegNo makes more sense in that context. Take XLogArchiveNotifySeg(), for example. It notifies the archiver that a given segment is ready for archiving, so we pass an XLogSegNo to identify that segment as an argument. I suppose we could pass an XLogRecPtr that points to the beginning of the segment instead, but it doesn't really feel like an improvement to me. Should we keep the old representation in the replication protocol messages? That would make it simpler to write a client that works with different server versions (like pg_receivexlog). Or, while we're at it, perhaps we should mandate network-byte order for all the integer and XLogRecPtr fields in the replication protocol. The replication protocol uses pq_sendint for integers which should take care of converting to big endian already. I don't think anything but the wal itself is otherwise transported in a binary fashion? So I don't think there is any such architecture dependency in the protocol currently? We only use pq_sendint() for the few values exchanged in the handshake before we start replicating, but once we begin, we just send structs around. For example, in ProcessStandbyReplyMessage(): static void ProcessStandbyReplyMessage(void) { StandbyReplyMessage reply; pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage)); > ... 
After that, we just use the fields in the reply struct like in any other struct. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
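The portability concern here (a raw struct on the wire is in the sender's host byte order, while a mandated network byte order would be unambiguous) can be illustrated with Python's struct module. The three-field layout below is invented for illustration; it is not the real StandbyReplyMessage definition:

```python
import struct

# Hypothetical three-field reply message, standing in for something
# like StandbyReplyMessage (the real struct's layout differs).
write_pos, flush_pos, apply_pos = 0x01000000, 0x02000000, 0x03000000

# Sending the raw C struct puts host byte order (and padding) on the wire:
native = struct.pack("@QQQ", write_pos, flush_pos, apply_pos)

# Mandating network byte order gives one unambiguous wire format that
# any receiver can decode, whatever its architecture:
wire = struct.pack("!QQQ", write_pos, flush_pos, apply_pos)
decoded = struct.unpack("!QQQ", wire)

print(decoded == (write_pos, flush_pos, apply_pos))  # True
print(len(wire))  # 24: three 8-byte fields, no padding
```

On a little-endian sender, `native` and `wire` differ byte for byte, which is exactly why exchanging raw structs between mixed-architecture peers is fragile.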
Re: [HACKERS] Backport of fsync queue compaction
On Tuesday, June 19, 2012 11:39:46 PM Robert Haas wrote: > On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith wrote: > > In January of 2011 Robert committed > > 7f242d880b5b5d9642675517466d31373961cf98 to try and compact the fsync > > queue when clients find it full. There's no visible behavior change, > > just a substantial performance boost possible in the rare but extremely > > bad situations where the background writer stops doing fsync absorption. > > I've been running that in production at multiple locations since > > practically the day it hit this mailing list, with backports all the way > > to 8.3 being common (and straightforward to construct). I've never seen > > a hint of a problem with this new code. > > I've been in favor of back-porting this for a while, so you'll get no > argument from me. > > Anyone disagree? Not me, I have seen several sites having problems as well. +1 Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 12:15:03 AM Kevin Grittner wrote: > Simon Riggs wrote: > > If we use WAL in this way, multi-master implies that the data will > > *always* be in a loop. So in any configuration we must be able to > > tell the difference between changes made by one node and another. > > Only if you assume that multi-master means identical databases all > replicating the same data to all the others. If I have 72 masters > replicating non-conflicting data to one consolidated database, I > consider that to be multi-master, too. > ... > Of course, none of these databases have the same OID for any given > object, and there are numerous different schemas among the > replicating databases, so I need to get to table and column names > before the data is of any use to me. Yes, that's definitely a valid use-case. But that doesn't preclude the other - also not uncommon - use-case where you want to have different masters which all contain up-to-date data. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Simon Riggs wrote: > The proposal is to use WAL to generate the logical change stream. > That has been shown in testing to be around x4 faster than having > a separate change stream, which must also be WAL logged (as Jan > noted). Sure, that's why I want it. > If we use WAL in this way, multi-master implies that the data will > *always* be in a loop. So in any configuration we must be able to > tell the difference between changes made by one node and another. Only if you assume that multi-master means identical databases all replicating the same data to all the others. If I have 72 masters replicating non-conflicting data to one consolidated database, I consider that to be multi-master, too. Especially if I have other types of databases replicating disjoint data to the same consolidated database, and the 72 sources have several databases replicating disjoint sets of data to them. We have about 1000 replication paths, none of which create a loop which can send data back to the originator or cause conflicts. Of course, none of these databases have the same OID for any given object, and there are numerous different schemas among the replicating databases, so I need to get to table and column names before the data is of any use to me. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 11:46:56 PM Robert Haas wrote: > On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund wrote: > > More seriously: Even if we don't put MM in core I think putting the basis > > for it in core so that somebody can build such a solution reusing the > > existing infrastructure is a sensible idea. Imo the only thing that > > requires explicit support which is hard to add outside of core is > > prevention of loops (aka this patch). Everything else should be doable > > reusing the hopefully modular pieces. > I don't think prevention of loops is hard to do outside of core > either, unless you insist on tying "logical" replication so closely to > WAL that anyone doing MMR is necessarily getting the change stream > from WAL. In fact, I'd go so far as to say that the ONLY part of this > that's hard to do outside of core is change extraction. Even > low-level apply can be implemented as a loadable module. I definitely agree that low-level apply is possible as a module. Sure, change extraction needs core support but I was talking about what you need to implement it reusing the "plain" logical support... What I do not understand is how you want to prevent loops in a simple manner without in core support: A generates a HEAP_INSERT record. Gets decoded into the lcr stream as an INSERT action. B reads the lcr stream from A and applies the changes. A new HEAP_INSERT record. Gets decoded into the lcr stream as an INSERT action. A reads the lcr stream from B and ??? At this point you need to prevent a loop. If you have the information where a change originally happened (xl_origin_id = A in this case) you can have a simple filter on A which ignores change records if lcr_origin_id == local_replication_origin_id. > >> Right. If we decide we need this, and if we did decide to conflate > >> the WAL stream, both of which I disagree with as noted above, then we > >> still don't need it on every record. 
It would probably be sufficient > >> for local transactions to do nothing at all (and we can implicitly > >> assume that they have master node ID = local node ID) and transactions > >> which are replaying remote changes to emit one record per XID per > >> checkpoint cycle containing the remote node ID. > > > > You've gone from a pretty trivial 150 line patch without any runtime/space > > overhead to something *considerably* more complex in that case though. > > > > You suddenly need to have relatively complex logic to remember which > > information you got for a certain xid (and forget that information > > afterwards) and whether you already logged that xid and you need to have > > to decide about logging that information at multiple places. > You need a backend-local hash table inside the wal reader process, and > that hash table needs to map XIDs to node IDs. And you occasionally > need to prune it, so that it doesn't eat too much memory. None of > that sounds very hard. It's not very hard. It's just more complex than what I propose(d). > > Btw, what do you mean with "conflating" the stream? I don't really see > > that being proposed. > It seems to me that you are intent on using the WAL stream as the > logical change stream. I think that's a bad design. Instead, you > should extract changes from WAL and then ship them around in a format > that is specific to logical replication. No, I don't want that. I think we will need some different format once we have agreed how changeset extraction works. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
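The bookkeeping Robert describes above (a WAL-reader-side hash table mapping XIDs to origin node IDs, pruned occasionally) could look roughly like the following toy Python sketch. All names here are invented for illustration; this is not PostgreSQL code:

```python
# Sketch of the WAL-reader-side bookkeeping: a table mapping XIDs to
# origin node ids, filled in as per-XID "remote txn" records are
# decoded, consulted while decoding the rest of the stream, and pruned
# occasionally so it doesn't eat too much memory.

class OriginTracker:
    def __init__(self):
        self.xid_to_origin = {}

    def note_remote_xid(self, xid, node_id):
        # Called when the one-per-XID "origin" record is seen in WAL.
        self.xid_to_origin[xid] = node_id

    def origin_of(self, xid, local_node_id):
        # Local transactions emit no origin record, so an unknown XID
        # is implicitly a local one.
        return self.xid_to_origin.get(xid, local_node_id)

    def prune(self, oldest_running_xid):
        # Drop entries for XIDs that can no longer appear in the stream.
        self.xid_to_origin = {x: n for x, n in self.xid_to_origin.items()
                              if x >= oldest_running_xid}

t = OriginTracker()
t.note_remote_xid(1001, "B")
print(t.origin_of(1001, "A"))   # B: this XID replayed changes from node B
print(t.origin_of(1002, "A"))   # A: an ordinary local transaction
t.prune(oldest_running_xid=1002)
print(t.origin_of(1001, "A"))   # A: the entry has been pruned
```

The trade-off both sides describe is visible here: the map and its pruning are extra moving parts, but WAL itself only needs the small per-XID record rather than an origin id on every record.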
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 00:11, Tom Lane wrote: > Andres Freund writes: >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: ... (If you are thinking of something sufficiently high-level that merging could possibly work, then it's not WAL, and we shouldn't be trying to make the WAL representation cater for it.) > >> Do you really see this as such a big problem? > > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. The evidence from prototypes shown at PgCon was that using the WAL in this way was very efficient and that this level of efficiency is necessary to make it work in a practical manner. Postgres would not be the first database to follow this broad design. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. Agreed. In this case, the origin node id information needs to be available on the WAL record so we can generate the logical changes (LCRs). > Saving a couple bytes in each such record > is penny-wise and pound-foolish, I'm afraid; especially when you're > nailing down hard, unexpansible limits at the very beginning of the > development process in order to save those bytes. Restricting the number of node ids is being done so that there is no impact on anybody not using this feature. In later implementations, it should be possible to support greater numbers of node ids by having a variable length header. But that is an unnecessary complication for a first release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 05:59, Christopher Browne wrote: > But it's undesirable to pull *all* the bulk of contents of WAL around > if it's only part of the data that is going to get applied. On a > "physical streaming" replica, any logical data that gets captured will > be useless. And on a "logical replica," they "physical" bits of WAL > will be useless. Completely agree. What we're discussing here is the basic information content of a WAL record to enable the construction of suitable LCRs. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions over pathological TCP connections
I just realized this is essentially an instance of the Two Generals' Problem, which is something I feel should have been more obvious to me. On Tue, Jun 19, 2012 at 5:50 PM, Leon Smith wrote: > On Tue, Jun 19, 2012 at 11:59 AM, Robert Haas wrote: > >> On Tue, Jun 19, 2012 at 1:56 AM, Tom Lane wrote: >> > The transaction would be committed before a command success report is >> > delivered to the client, so I don't think delivered-and-not-marked is >> > possible. >> >> ...unless you have configured synchronous_commit=off, or fsync=off. >> >> Or unless your disk melts into a heap of slag and you have to restore >> from backup. You can protect against that last case using synchronous >> replication. >> > > > But hard disk failure isn't in the failure model I was concerned about. > =) To be perfectly honest, I'm not too concerned with either hard drive > failure or network failure, as we are deploying on a RAID 1+0 database > server talking to the client over a redundant LAN, and using asynchronous > (Slony) replication to an identical database server just in case. No > single point of failure is a key philosophy of this app from top to bottom. > Like I said, this is mostly idle curiosity. > > But I'm also accustomed to trying to get work done on shockingly > unreliable internet connections. As a result, network failure is > something I think about quite a lot when writing networked applications. > So this is not entirely idle curiosity either. > > And thinking about this a bit more, it's clear that the database has to > commit before the result is sent, on the off chance that the transaction > fails and needs to be retried. And that an explicit transaction block > isn't really a solution either, because a "BEGIN; SELECT dequeue_row()" > would get the row to the client without marking it as taken, but the > pathological TCP disconnect could then attack the following "COMMIT;", > leaving the client to think that the row has not been actually taken when > it in fact has. 
> > It's not clear to me that this is even a solvable problem without > modifying the schema to include both a "taken" and a "finished processing" > state, and then letting elements be re-delivered after a period of time. > But this would then allow a pathological demon with the power to cause > TCP disconnects to have a single element delivered and processed multiple times. > > In any case, thanks for the responses... > > Best, > Leon >
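The taken/finished scheme Leon sketches amounts to a visibility timeout, which yields at-least-once rather than exactly-once delivery — exactly the redelivery hazard he notes. A toy in-memory Python model (an illustration, not his actual schema):

```python
# Toy model of the "taken" + "finished processing" scheme: dequeue
# marks a row taken; if the worker never acknowledges it (say the TCP
# session died mid-COMMIT), the row becomes visible again after a
# timeout and is redelivered.  That is at-least-once delivery -- the
# "processed multiple times" hazard -- not exactly-once.

class Queue:
    def __init__(self, timeout):
        self.timeout = timeout
        self.rows = {}                  # row id -> (state, taken_at)

    def enqueue(self, row_id):
        self.rows[row_id] = ("ready", None)

    def dequeue(self, now):
        for row_id, (state, taken_at) in self.rows.items():
            expired = state == "taken" and now - taken_at > self.timeout
            if state == "ready" or expired:
                self.rows[row_id] = ("taken", now)
                return row_id
        return None

    def finish(self, row_id):
        # The worker's acknowledgement that processing completed.
        self.rows[row_id] = ("finished", None)

q = Queue(timeout=30)
q.enqueue(1)
print(q.dequeue(now=0))    # 1: delivered to a worker
print(q.dequeue(now=10))   # None: taken, still within the timeout
print(q.dequeue(now=60))   # 1: never finished, so it is redelivered
q.finish(1)
print(q.dequeue(now=90))   # None: acknowledged, never delivered again
```

If the demon kills the connection between delivery and `finish`, the element comes back after the timeout and is processed again, which is why consumers in such schemes are usually made idempotent.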
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 04:31, Kevin Grittner wrote: > I've done a lot of MM replication, > and so far have not had to use a topology which allowed loops. The proposal is to use WAL to generate the logical change stream. That has been shown in testing to be around x4 faster than having a separate change stream, which must also be WAL logged (as Jan noted). If we use WAL in this way, multi-master implies that the data will *always* be in a loop. So in any configuration we must be able to tell the difference between changes made by one node and another. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
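The loop Simon describes, and the origin-id filter that breaks it (each node ignores incoming changes whose origin id is its own), can be modeled as a toy three-node ring in Python. This is an illustration of the filtering idea only, not PostgreSQL code:

```python
# Toy model of origin-id filtering in a replication loop A -> B -> C -> A.
# Every change record carries the node id where it originated; each node
# applies and forwards records, but drops any record whose origin id is
# its own -- which is what breaks the loop.

def run_ring(nodes, origin, payload):
    """Send one change around the ring; count how often each node applies it."""
    applied = {n: 0 for n in nodes}
    applied[origin] += 1               # the originating node applied it locally
    record = {"origin": origin, "payload": payload}
    idx = (nodes.index(origin) + 1) % len(nodes)
    while True:
        node = nodes[idx]
        if record["origin"] == node:
            break                      # change came back home: filtered out
        applied[node] += 1             # apply, then forward downstream
        idx = (idx + 1) % len(nodes)
    return applied

print(run_ring(["A", "B", "C"], origin="A", payload="row 1 updated"))
# {'A': 1, 'B': 1, 'C': 1}: applied once everywhere, no infinite loop
```

Without the origin check, the `while` loop here would never terminate — the in-WAL origin id is what lets each node recognize its own changes coming back around.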
Re: [HACKERS] use of int4/int32 in C code
"Kevin Grittner" writes: > Tom Lane wrote: >> I have not looked to see how many places do that. If it's a reasonably >> small number of places, I'm OK with getting rid of int4 at the C level. >> (int2/int8 the same of course.) > $ find -name '*.h' -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat > | egrep -c '\bint4\b' > 570 That one looks like a lot, but a quick eyeball check suggests that many of them are in comments and/or embedded SQL fragments where they actually represent references to the SQL datatype, and so should be left alone. The impression I get is that the actual typedef uses are concentrated in a couple of contrib modules and some of the tsearch stuff, where they'd probably not be that painful to fix. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 10:58:44 PM Marko Kreen wrote: > On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: > > On 13 June 2012 19:28, Andres Freund wrote: > >> This adds a new configuration parameter multimaster_node_id which > >> determines the id used for wal originating in one cluster. > > > > Looks good and it seems this aspect at least is committable in this CF. > > > > Design decisions I think we need to review are > > > > * Naming of field. I think origin is the right term, borrowing from > > Slony. > > I have not read too deeply here, so maybe I am missing > some important detail, but the idea that users need > to coordinate an integer config parameter globally does not > sound too attractive to me. > > Why not limit integers to local storage only and map > them to string idents on config, UI and transport? That should be possible. I don't want to go there till we have the base stuff in but it shouldn't be too hard. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 05:46, Robert Haas wrote: > It seems to me that you are intent on using the WAL stream as the > logical change stream. I think that's a bad design. Instead, you > should extract changes from WAL and then ship them around in a format > that is specific to logical replication. The proposal is to read the WAL in order to generate LCRs. The information needs to be in the original data in order to allow it to be passed onwards. In a multi-master config there will be WAL records with many origin node ids at any time, so this information must be held at WAL record level not page level. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote: >> Btw, what do you mean with "conflating" the stream? I don't really see that >> being proposed. > > It seems to me that you are intent on using the WAL stream as the > logical change stream. I think that's a bad design. Instead, you > should extract changes from WAL and then ship them around in a format > that is specific to logical replication. Yeah, that seems worth elaborating on. What has been said several times is that it's pretty necessary to capture the logical changes into WAL. That seems pretty needful, in order that the replication data gets fsync()ed avidly, and so that we don't add in the race condition of needing to fsync() something *else* almost exactly as avidly as is the case for WAL today. But it's undesirable to pull *all* the bulk of contents of WAL around if it's only part of the data that is going to get applied. On a "physical streaming" replica, any logical data that gets captured will be useless. And on a "logical replica," the "physical" bits of WAL will be useless. What I *want* you to mean is that there would be: a) WAL readers that pull the "physical bits", and b) WAL readers that just pull "logical bits." I expect it would be fine to have a tool that pulls LCRs out of WAL to prepare that to be sent to remote locations. Is that what you have in mind? Or are you feeling that the "logical bits" shouldn't get captured in WAL altogether, so we need to fsync() them into a different stream of files? -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_restore logging inconsistency
Excerpts from Josh Kupershmidt's message of mié may 30 14:55:12 -0400 2012: > Hi all, > > Bosco Rama recently complained[1] about not seeing a message printed > by pg_restore for each LO to be restored. The culprit seems to be the > different level passed to ahlog() for this status message: > > pg_backup_archiver.c:ahlog(AH, 2, "restoring large object with OID %u\n", > oid); > pg_backup_tar.c:ahlog(AH, 1, "restoring large object OID > %u\n", oid); > > depending on whether one is restoring a tar-format or custom-format > dump. I think these messages should be logged at the same level, to > avoid this inconsistency. The attached patch logs them both with > level=1, and makes the message texts identical. Note, as of 9.0 there > is already a line like this printed for each LO: > > pg_restore: executing BLOB 135004 > > so I could see the argument for instead wanting to hide the "restoring > large object" messages. However, the OP was interested in seeing > something like a status indicator for the lo_write() calls which may > take a long time, and the above message isn't really helpful for that > purpose as it is printed earlier in the restore process. Plus it seems > reasonable to make verbose mode, well, verbose. I applied this patch all the way back to 8.3. Thanks. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] use of int4/int32 in C code
Tom Lane wrote: > I have not looked to see how many places do that. If it's a reasonably > small number of places, I'm OK with getting rid of int4 at the C level. > (int2/int8 the same of course.)

$ find -name '*.h' -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat \
    | egrep -c '\bint2\b'
178
    | egrep -c '\bint4\b'
570
    | egrep -c '\bint8\b'
212

-Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 19 June 2012 14:03, Tom Lane wrote: > "Every WAL record"? Why in heck would you attach it to every record? > Surely putting it in WAL page headers would be sufficient. We could > easily afford to burn a page switch (if not a whole segment switch) > when changing masters. This does appear to be a reasonable idea at first glance, since it seems that each node has just a single node id, but that is not the case. As we pass changes around we maintain the same origin id for a change, so there is a mix of origin node ids at the WAL record level, not the page level. The concept of originating node id is essentially the same as that used in Slony. > I'm against the idea of eating any spare space we have in WAL record > headers for this purpose, anyway; there are likely to be more pressing > needs in future. Not sure what those pressing needs might be; I can't see any. What we are doing here is fairly important, just not as important as crash recovery. But then that has worked pretty much unchanged for some time now. I raised the possibility of having variable-length headers, but there is no requirement for that yet. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
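To make the filtering role of a per-record origin id concrete, here is a minimal sketch (names such as `should_apply` and the struct layout are invented for illustration, not taken from the patch): a node applies and re-forwards a record only when the record's origin differs from the local node id, which is what breaks the A -> B -> A loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical WAL record carrying the id of the node where the change
 * originated; in the patch this would live in the record header. */
typedef struct WalRecord
{
    uint16_t origin_node_id;    /* node that first made the change */
    int      payload;           /* stand-in for the actual change  */
} WalRecord;

/* A node applies (and would re-forward) a record only if it did not
 * originate locally; this is what prevents the replication loop. */
static bool
should_apply(uint16_t local_node_id, const WalRecord *rec)
{
    return rec->origin_node_id != local_node_id;
}

/* Simulate the round trip: node A (id 1) makes a change; node B (id 2)
 * applies it and forwards it back; node A filters it out. */
static int
round_trips_applied(void)
{
    WalRecord rec = { .origin_node_id = 1, .payload = 42 };
    int applied = 0;

    if (should_apply(2, &rec))  /* node B receives it: applied   */
        applied++;
    if (should_apply(1, &rec))  /* node A gets it back: filtered */
        applied++;
    return applied;
}
```

Note that the origin id travels unchanged with the record, which is exactly why it has to be kept per record rather than per page once changes from several origins are interleaved in one WAL stream.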
Re: [HACKERS] Transactions over pathological TCP connections
On Tue, Jun 19, 2012 at 11:59 AM, Robert Haas wrote: > On Tue, Jun 19, 2012 at 1:56 AM, Tom Lane wrote: > > The transaction would be committed before a command success report is > > delivered to the client, so I don't think delivered-and-not-marked is > > possible. > > ...unless you have configured synchronous_commit=off, or fsync=off. > > Or unless your disk melts into a heap of slag and you have to restore > from backup. You can protect against that last case using synchronous > replication. > But hard disk failure isn't in the failure model I was concerned about. =) To be perfectly honest, I'm not too concerned with either hard drive failure or network failure, as we are deploying on a RAID 1+0 database server talking to the client over a redundant LAN, and using asynchronous (Slony) replication to an identical database server just in case. No single point of failure is a key philosophy of this app from top to bottom. Like I said, this is mostly idle curiosity. But I'm also accustomed to trying to get work done on shockingly unreliable internet connections. As a result, network failure is something I think about quite a lot when writing networked applications. So this is not entirely idle curiosity either. And thinking about this a bit more, it's clear that the database has to commit before the result is sent, on the off chance that the transaction fails and needs to be retried. And that an explicit transaction block isn't really a solution either, because a "BEGIN; SELECT dequeue_row()" would get the row to the client without marking it as taken, but the pathological TCP disconnect could then attack the following "COMMIT;", leaving the client to think that the row has not been actually taken when it in fact has. It's not clear to me that this is even a solvable problem without modifying the schema to include both a "taken" and a "finished processing" state, and then letting elements be re-delivered after a period of time. 
But this would then allow a pathological demon with the power to cause TCP disconnects to have a single element delivered and processed multiple times. In any case, thanks for the responses... Best, Leon
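The "taken" / "finished processing" scheme proposed above can be sketched as a tiny state machine (a toy model; the state names and the timeout value are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue-element states: "taken" records delivery,
 * "finished" records processing, and a taken-but-unfinished element
 * becomes eligible for redelivery after a timeout. */
typedef enum { QE_READY, QE_TAKEN, QE_FINISHED } QueueState;

typedef struct QueueElem
{
    QueueState state;
    long       taken_at;        /* time the element was handed out */
} QueueElem;

#define REDELIVERY_TIMEOUT 30   /* seconds; arbitrary for the sketch */

/* An element may be (re)delivered if it was never taken, or if it was
 * taken long enough ago that the client is presumed dead. */
static bool
deliverable(const QueueElem *e, long now)
{
    if (e->state == QE_READY)
        return true;
    if (e->state == QE_TAKEN && now - e->taken_at > REDELIVERY_TIMEOUT)
        return true;
    return false;
}

static void
take(QueueElem *e, long now)
{
    e->state = QE_TAKEN;
    e->taken_at = now;
}
```

As the mail itself concludes, a client that dies after processing but before marking the element finished will trigger redelivery, so consumers still have to tolerate duplicate processing.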
Re: [HACKERS] Backport of fsync queue compaction
Excerpts from Robert Haas's message of mar jun 19 17:39:46 -0400 2012: > On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith wrote: > > In January of 2011 Robert committed 7f242d880b5b5d9642675517466d31373961cf98 > > to try and compact the fsync queue when clients find it full. There's no > > visible behavior change, just a substantial performance boost possible in > > the rare but extremely bad situations where the background writer stops > > doing fsync absorption. I've been running that in production at multiple > > locations since practically the day it hit this mailing list, with backports > > all the way to 8.3 being common (and straightforward to construct). I've > > never seen a hint of a problem with this new code. > > I've been in favor of back-porting this for a while, so you'll get no > argument from me. +1. I even thought we had already backported it and was surprised to discover we hadn't, when we had this problem at a customer, not long ago. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Backport of fsync queue compaction
On Tue, Jun 19, 2012 at 5:39 PM, Robert Haas wrote: > On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith wrote: >> In January of 2011 Robert committed 7f242d880b5b5d9642675517466d31373961cf98 >> to try and compact the fsync queue when clients find it full. There's no >> visible behavior change, just a substantial performance boost possible in >> the rare but extremely bad situations where the background writer stops >> doing fsync absorption. I've been running that in production at multiple >> locations since practically the day it hit this mailing list, with backports >> all the way to 8.3 being common (and straightforward to construct). I've >> never seen a hint of a problem with this new code. > > I've been in favor of back-porting this for a while, so you'll get no > argument from me. > > Anyone disagree? I recall reviewing that; it seemed like quite a good change. Me likes. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund wrote: > More seriously: Even if we don't put MM in core I think putting the basis for > it in core so that somebody can build such a solution reusing the existing > infrastructure is a sensible idea. Imo the only thing that requires explicit > support which is hard to add outside of core is prevention of loops (aka this > patch). Everything else should be doable reusing the hopefully modular pieces. I don't think prevention of loops is hard to do outside of core either, unless you insist on tying "logical" replication so closely to WAL that anyone doing MMR is necessarily getting the change stream from WAL. In fact, I'd go so far as to say that the ONLY part of this that's hard to do outside of core is change extraction. Even low-level apply can be implemented as a loadable module. >> Right. If we decide we need this, and if we did decide to conflate >> the WAL stream, both of which I disagree with as noted above, then we >> still don't need it on every record. It would probably be sufficient >> for local transactions to do nothing at all (and we can implicitly >> assume that they have master node ID = local node ID) and transactions >> which are replaying remote changes to emit one record per XID per >> checkpoint cycle containing the remote node ID. > You've gone from a pretty trivial 150 line patch without any runtime/space > overhead to something *considerably* more complex in that case though. > > You suddenly need to have relatively complex logic to remember which > information you got for a certain xid (and forget that information afterwards) > and whether you already logged that xid, and you have to decide about > logging that information at multiple places. You need a backend-local hash table inside the wal reader process, and that hash table needs to map XIDs to node IDs. And you occasionally need to prune it, so that it doesn't eat too much memory. None of that sounds very hard. 
> Btw, what do you mean with "conflating" the stream? I don't really see that > being proposed. It seems to me that you are intent on using the WAL stream as the logical change stream. I think that's a bad design. Instead, you should extract changes from WAL and then ship them around in a format that is specific to logical replication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
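The backend-local map Robert describes above (XID to origin node id, pruned so it cannot grow without bound) might look roughly like this. This is an illustrative sketch only: a real implementation would use a proper hash table, whereas a linear array keeps the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Toy XID -> origin-node map: remembers which origin each remotely
 * originated transaction came from; entries are pruned at commit or
 * abort so the map cannot grow without bound. */
#define MAX_TRACKED 64
#define NO_ORIGIN   (-1)

typedef struct XidOriginMap
{
    uint32_t xids[MAX_TRACKED];
    int      origins[MAX_TRACKED];
    int      n;
} XidOriginMap;

static void
remember_origin(XidOriginMap *map, uint32_t xid, int origin)
{
    assert(map->n < MAX_TRACKED);
    map->xids[map->n] = xid;
    map->origins[map->n] = origin;
    map->n++;
}

/* Local transactions have no entry, implicitly meaning
 * "origin = local node", as Robert suggests. */
static int
lookup_origin(const XidOriginMap *map, uint32_t xid)
{
    for (int i = 0; i < map->n; i++)
        if (map->xids[i] == xid)
            return map->origins[i];
    return NO_ORIGIN;
}

/* Prune when the transaction is known to be over. */
static void
forget_xid(XidOriginMap *map, uint32_t xid)
{
    for (int i = 0; i < map->n; i++)
        if (map->xids[i] == xid)
        {
            map->xids[i] = map->xids[map->n - 1];
            map->origins[i] = map->origins[map->n - 1];
            map->n--;
            return;
        }
}
```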
Re: [HACKERS] use of int4/int32 in C code
Robert Haas writes: > On Tue, Jun 19, 2012 at 9:47 AM, Tom Lane wrote: >> I thought the general idea was to use int32 most places, but int4 in >> catalog declarations. I don't think it's tremendously important if >> somebody uses the other though. > I concur with Peter that TMTOWTDI is not the right way to do this. I > think we ought to get rid of int4 in code and use int32 everywhere. I have not looked to see how many places do that. If it's a reasonably small number of places, I'm OK with getting rid of int4 at the C level. (int2/int8 the same of course.) If we are going to do that, though, we need to actually remove those typedefs. Leaving them around on the grounds that third-party code might be using them will just allow cases to creep back in. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Backport of fsync queue compaction
On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith wrote: > In January of 2011 Robert committed 7f242d880b5b5d9642675517466d31373961cf98 > to try and compact the fsync queue when clients find it full. There's no > visible behavior change, just a substantial performance boost possible in > the rare but extremely bad situations where the background writer stops > doing fsync absorption. I've been running that in production at multiple > locations since practically the day it hit this mailing list, with backports > all the way to 8.3 being common (and straightforward to construct). I've > never seen a hint of a problem with this new code. I've been in favor of back-porting this for a while, so you'll get no argument from me. Anyone disagree? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
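For readers unfamiliar with the commit under discussion, its core idea can be sketched in miniature (a toy model: requests are plain integers standing in for per-segment fsync keys, and the real code is considerably more involved): when the queue fills, duplicate requests for the same segment are collapsed, since a single fsync covers them all.

```c
#include <assert.h>
#include <stdbool.h>

/* Collapse duplicate fsync requests in place, preserving the order of
 * first occurrence, and return the new queue length.  Keys are assumed
 * to fall in 0..255 purely to keep the sketch short. */
static int
compact_queue(int *queue, int n)
{
    bool seen[256] = { false };
    int out = 0;

    for (int i = 0; i < n; i++)
        if (!seen[queue[i]])
        {
            seen[queue[i]] = true;
            queue[out++] = queue[i];
        }
    return out;
}
```

The payoff is exactly the situation Greg describes: when fsync absorption stalls and the queue fills with repeats, compaction frees space instead of forcing backends to perform the fsyncs themselves.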
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 04:58, Marko Kreen wrote: > On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: >> On 13 June 2012 19:28, Andres Freund wrote: >>> This adds a new configuration parameter multimaster_node_id which determines >>> the id used for wal originating in one cluster. >> >> Looks good and it seems this aspect at least is committable in this CF. >> >> Design decisions I think we need to review are >> >> * Naming of field. I think origin is the right term, borrowing from Slony. > > I have not read too deeply here, so maybe I am missing > some important detail here, but the idea that users need > to coordinate an integer config parameter globally does not > sound too attractive to me. > > Why not limit integers to local storage only and map > them to string idents on config, UI and transport? Yes, that can be done. This is simply how the numbers are used internally, similar to oids. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] performance regression in 9.2 when loading lots of small tables
On Tue, Jun 19, 2012 at 4:33 PM, Robert Haas wrote: > On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes wrote: >> There was a regression introduced in 9.2 that affects the creation and >> loading of lots of small tables in a single transaction. >> >> It affects the loading of a pg_dump file which has a large number of >> small tables (10,000 schemas, one table per schema, 10 rows per >> table). I did not test other schema configurations, so these >> specifics might not be needed to invoke the problem. >> >> It causes the loading of a dump with "psql -1 -f " to run at half the >> previous speed. Speed of loading without the -1 is not changed. > > I tried to figure out why this was happening. I tried it out on the > IBM POWER7 box after building from > f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14 > minutes. 'time' reports that psql used 38 seconds of user time and 11 > seconds of system time. The remaining time was presumably spent > waiting for the backend and/or the kernel. > > I ran the backend under strace -cf and it had this to say:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ---------
>  45.31   39.888639          10   3836379           write
>  18.80   16.553878           3   4929697           lseek
>  13.51   11.890826           6   1906179     12819 read
>  10.24    9.011690           7   1372354      5995 recv
>   6.27    5.517929           5   1107766           brk
>   2.07    1.818345          91     20068           fsync
>   1.46    1.281999          15     87260           send
>   1.15    1.015621          24     42527     11334 open
>
> I did a separate run using perf and it had this to say:
>
> Events: 1M cycles
> +  13.18%  postgres  postgres           [.] FlushRelationBuffers
> +   9.65%  postgres  postgres           [.] comparetup_index_btree
> +   9.34%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
> +   4.41%  postgres  postgres           [.] CopyReadLine
> +   4.23%  postgres  postgres           [.] tuplesort_heap_siftup
> +   4.17%  postgres  postgres           [.] LockReassignCurrentOwner
> +   3.88%  postgres  postgres           [.] pg_mbstrlen_with_len
> +   2.74%  postgres  postgres           [.] pg_verify_mbstr_len
> +   2.46%  postgres  postgres           [.] NextCopyFromRawFields
> +   2.44%  postgres  postgres           [.] btint4cmp
> +   2.08%  postgres  libc-2.14.90.so    [.] memcpy
> +   2.06%  postgres  postgres           [.] hash_seq_search
> +   1.55%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user
> +   1.29%  postgres  libc-2.14.90.so    [.] __GI_strtoll_l_internal
> +   1.16%  postgres  postgres           [.] pg_utf_mblen
>
> There are certainly opportunities for optimization there but it's sure > not obvious to me what it has to do with either of the commits you > mention. I haven't actually verified that there's a regression on > this box as result of either of those commits, though: I just profiled > master. Maybe I better go check that.

I built REL9_1_STABLE from commit 1643031e5fe02d2de9ae6b8f86ef9ffd09fe7d3f and it took 19m45.250s, even slower than master. So something is different in my environment versus yours. I ran this build with perf also, and got this:

 38.77%  postgres  postgres           [.] FlushRelationBuffers
  6.80%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
  4.84%  postgres  postgres           [.] comparetup_index_btree
  3.23%  postgres  postgres           [.] CopyReadLine
  2.24%  postgres  postgres           [.] tuplesort_heap_siftup
  2.23%  postgres  postgres           [.] pg_mbstrlen_with_len
  2.08%  postgres  postgres           [.] LockReassignCurrentOwner
  2.01%  postgres  postgres           [.] pg_verify_mbstr_len
  1.90%  postgres  postgres           [.] NextCopyFromRawFields
  1.62%  postgres  postgres           [.] btint4cmp
  1.41%  postgres  libc-2.14.90.so    [.] memcpy
  1.39%  postgres  postgres           [.] LWLockAcquire
  1.22%  postgres  postgres           [.] LWLockRelease
  1.19%  postgres  postgres           [.] pg_utf_mblen
  1.05%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture
Andres Freund wrote: > The problem is just that to support basically arbitrary decoding > requirements you need to provide at least those pieces of > information in a transactionally consistent manner: > * the data > * table names > * column names > * type information > * replication configuration I'm not sure that the last one needs to be in scope for the WAL stream, but the others I definitely agree eventually need to be available to a logical transaction stream consumer. You lay out the alternative ways to get all of this pretty clearly, and I don't know what the best answer is; it seems likely that there is not one best answer. In the long run, more than one of those options might need to be supported, to support different environments. As an initial implementation, I'm leaning toward the position that requiring a hot standby or a catalog-only proxy is acceptable. I think that should allow an application to be written which emits everything except the replication configuration. That will allow us to hook up everything we need at our shop. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
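For illustration, the pieces Andres lists can be imagined bundled into a single flattened change record on the consumer side. All names here are invented for the sketch, not taken from the prototype:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flattened logical change record: the transactionally
 * consistent metadata listed above (table name, column names, type
 * information) carried alongside the data itself. */
#define MAX_ATTS 4

typedef struct LogicalChange
{
    char         relname[64];            /* table name             */
    int          natts;                  /* number of columns      */
    char         attnames[MAX_ATTS][64]; /* column names           */
    unsigned int atttypids[MAX_ATTS];    /* type information       */
    const char  *newtup;                 /* the data, e.g. as text */
} LogicalChange;

static void
lcr_add_column(LogicalChange *lcr, const char *name, unsigned int typid)
{
    assert(lcr->natts < MAX_ATTS);
    strncpy(lcr->attnames[lcr->natts], name, 63);
    lcr->attnames[lcr->natts][63] = '\0';
    lcr->atttypids[lcr->natts] = typid;
    lcr->natts++;
}
```

The hard part, as the mail says, is not the record shape but producing this metadata consistently; hence the hot standby or catalog-only proxy approaches under discussion.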
Re: [HACKERS] Testing 9.2 in ~production environment
On tis, 2012-06-19 at 09:33 -0400, Tom Lane wrote: > Come to think of it, another possible factor is that LIKE can't use > ordinary indexes on text if the locale isn't C. But he reported that the plans are the same. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Near-duplicate RI NO ACTION and RESTRICT triggers
On 19 June 2012 17:48, Tom Lane wrote: > I think that the argument for having the RESTRICT triggers behave > like this is that the SQL spec envisions the RESTRICT check occurring > immediately when the individual PK row is updated/deleted, and so there > would be no opportunity for another PK row to be updated into its place. > (Or, in plainer English, RESTRICT should mean "you can't modify this > row's keys at all if it has dependents".) Because we implement RESTRICT > through an AFTER trigger that can't run earlier than end-of-statement, > we can't exactly match the spec's semantics, but we can get fairly > close so long as you don't think about what would be seen by > e.g. user-written triggers executing during the statement. > > I'm happy with continuing to have this behavioral difference between > the two sets of triggers, but wanted to throw it up for discussion: > does anyone think it'd be better to apply ri_Check_Pk_Match in the > RESTRICT triggers too? > In SQL:2008 they've re-worded the descriptions of these actions and added an explicit note to clarify the intended difference: """ — ON UPDATE RESTRICT: any change to a referenced column in the referenced table is prohibited if there is a matching row. — ON UPDATE NO ACTION (the default): there is no referential update action; the referential constraint only specifies a constraint check. NOTE 38 — Even if constraint checking is not deferred, ON UPDATE RESTRICT is a stricter condition than ON UPDATE NO ACTION. ON UPDATE RESTRICT prohibits an update to a particular row if there are any matching rows; ON UPDATE NO ACTION does not perform its constraint check until the entire set of rows to be updated has been processed. """ and there's a similar note in the DELETE case. So I think the current behaviour is correct, and there are probably genuine use cases for both types of check. 
Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
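The key-swap case behind that SQL:2008 note can be modeled in a few lines: an UPDATE that swaps two PK keys leaves the constraint satisfied at end of statement (so NO ACTION passes) while still modifying a referenced key (so RESTRICT fails). This is a toy model of the semantics only, not the actual ri_triggers.c logic:

```c
#include <assert.h>
#include <stdbool.h>

static bool
key_exists(const int *keys, int n, int key)
{
    for (int i = 0; i < n; i++)
        if (keys[i] == key)
            return true;
    return false;
}

/* RESTRICT: changing any referenced key at all is prohibited. */
static bool
restrict_ok(const int *old_pk, const int *new_pk, int npk,
            const int *fk, int nfk)
{
    for (int i = 0; i < npk; i++)
        if (old_pk[i] != new_pk[i] && key_exists(fk, nfk, old_pk[i]))
            return false;
    return true;
}

/* NO ACTION: only the final state, after the whole statement has been
 * processed, must satisfy the constraint. */
static bool
no_action_ok(const int *new_pk, int npk, const int *fk, int nfk)
{
    for (int i = 0; i < nfk; i++)
        if (!key_exists(new_pk, npk, fk[i]))
            return false;
    return true;
}
```

With PK keys {1, 2}, an FK row referencing 1, and a statement swapping the keys to {2, 1}, NO ACTION's end-of-statement check passes while RESTRICT's per-row check fails, matching the SQL:2008 note quoted above.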
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: > On 13 June 2012 19:28, Andres Freund wrote: >> This adds a new configuration parameter multimaster_node_id which determines >> the id used for wal originating in one cluster. > > Looks good and it seems this aspect at least is committable in this CF. > > Design decisions I think we need to review are > > * Naming of field. I think origin is the right term, borrowing from Slony. I have not read too deeply here, so maybe I am missing some important detail here, but the idea that users need to coordinate an integer config parameter globally does not sound too attractive to me. Why not limit integers to local storage only and map them to string idents on config, UI and transport? -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] performance regression in 9.2 when loading lots of small tables
On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes wrote: > There was a regression introduced in 9.2 that affects the creation and > loading of lots of small tables in a single transaction. > > It affects the loading of a pg_dump file which has a large number of > small tables (10,000 schemas, one table per schema, 10 rows per > table). I did not test other schema configurations, so these > specifics might not be needed to invoke the problem. > > It causes the loading of a dump with "psql -1 -f " to run at half the > previous speed. Speed of loading without the -1 is not changed. I tried to figure out why this was happening. I tried it out on the IBM POWER7 box after building from f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14 minutes. 'time' reports that psql used 38 seconds of user time and 11 seconds of system time. The remaining time was presumably spent waiting for the backend and/or the kernel. I ran the backend under strace -cf and it had this to say:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ---------
 45.31   39.888639          10   3836379           write
 18.80   16.553878           3   4929697           lseek
 13.51   11.890826           6   1906179     12819 read
 10.24    9.011690           7   1372354      5995 recv
  6.27    5.517929           5   1107766           brk
  2.07    1.818345          91     20068           fsync
  1.46    1.281999          15     87260           send
  1.15    1.015621          24     42527     11334 open

I did a separate run using perf and it had this to say:

Events: 1M cycles
+  13.18%  postgres  postgres           [.] FlushRelationBuffers
+   9.65%  postgres  postgres           [.] comparetup_index_btree
+   9.34%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   4.41%  postgres  postgres           [.] CopyReadLine
+   4.23%  postgres  postgres           [.] tuplesort_heap_siftup
+   4.17%  postgres  postgres           [.] LockReassignCurrentOwner
+   3.88%  postgres  postgres           [.] pg_mbstrlen_with_len
+   2.74%  postgres  postgres           [.] pg_verify_mbstr_len
+   2.46%  postgres  postgres           [.] NextCopyFromRawFields
+   2.44%  postgres  postgres           [.] btint4cmp
+   2.08%  postgres  libc-2.14.90.so    [.] memcpy
+   2.06%  postgres  postgres           [.] hash_seq_search
+   1.55%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user
+   1.29%  postgres  libc-2.14.90.so    [.] __GI_strtoll_l_internal
+   1.16%  postgres  postgres           [.] pg_utf_mblen

There are certainly opportunities for optimization there but it's sure not obvious to me what it has to do with either of the commits you mention. I haven't actually verified that there's a regression on this box as result of either of those commits, though: I just profiled master. Maybe I better go check that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund wrote: > Robert Haas wrote: >> Tom Lane wrote: >>> However, if we're dead set on doing it that way, let us put >>> information that is only relevant to logical replication records >>> into only the logical replication records. >> Right. If we decide we need this, and if we did decide to >> conflate the WAL stream, both of which I disagree with as noted >> above, then we still don't need it on every record. It would >> probably be sufficient for local transactions to do nothing at >> all (and we can implicitly assume that they have master node ID = >> local node ID) and transactions which are replaying remote >> changes to emit one record per XID per checkpoint cycle >> containing the remote node ID. > You've gone from a pretty trivial 150 line patch without any > runtime/space overhead to something *considerably* more complex in > that case though. I think it might be worth it. I've done a lot of MM replication, and so far have not had to use a topology which allowed loops. Not only would you be reserving space in the WAL stream which was not useful for those not using MM replication, you would be reserving it when even many MM configurations would not need it. Now you could argue that the 16 bits you want to use are already there and are not yet used for anything; but there are two counter-arguments to that: you lose the opportunity to use them for something else, and you might want more than 16 bits. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Tuesday, June 19, 2012 09:58:43 PM Robert Haas wrote: > On Tue, Jun 19, 2012 at 3:48 PM, Andres Freund wrote: > > Why is that code not used more widely? Quite a bit of our list usage > > should be replaced by embedding list elements in larger structs imo. There > > are also open-coded inline list manipulations around (check aset.c for > > example). > Because we've got a million+ lines of code and nobody knows where all > the bodies are buried. Heh, yes ;). I've hit several of them, as you uncovered at least twice ;) > > 1. dllist.h has double the element overhead by having an inline value > > pointer (which is not needed when embedding) and a pointer to the list > > (which I have a hard time seeing as being useful) > > 2. only double linked list, mine provided single and double linked ones > > 3. missing macros to use when embedded in a larger struct (containerof() > > wrappers and for(...) support basically) > > 4. most things are external function calls... > > 5. way more branches/complexity in most of the functions. My > > implementation doesn't use any branches for the typical easy > > modifications (push, pop, remove element somewhere) and only one for the > > typical tests (empty, has-next, ...) > > > > The performance and memory aspects were crucial for the aforementioned > > toy project (slab allocator for postgres). It's not that crucial for the > > applycache where the lists currently are mostly used although it's also > > relatively performance sensitive and obviously does a lot of list > > manipulation/iteration. > > > > If I had to decide I would add the missing api in dllist.h to my > > implementation and then remove it. It's barely used - and only in an > > embedded fashion - as far as I can see. > > I can understand though if that argument is met with doubt by others ;). > > If that's the way it has to go I would add some more convenience support > > for embedding data to dllist.h and settle for that. 
> > I think it might be simpler to leave the name as Dllist and just > overhaul the implementation along the lines you suggest, rather than > replacing it with something completely different. Mostly, I don't > want to add a third thing if we can avoid it, given that Dllist as it > exists today is used only lightly. Well, if it's the name, I have no problem with changing it, but I don't see how you can keep the api as it currently is and address my points. If there is some buy-in I can try to go either way (keeping the existing name, changing the api, and adjusting the callers; or just adjusting the callers and throwing away the old implementation). I just don't want to get into that only to find that somebody disagrees with the fundamental idea. The most contentious point is probably relying on USE_INLINE being available anywhere. Which I believe to be the point now that we have gotten rid of some platforms. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
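For readers following along, the embedded-node pattern plus the containerof()-style macro the mail refers to looks roughly like this (a simplified sketch, not the proposed ilist API): the list node lives inside the element, so no separate cell allocation is needed, and the macro recovers the element from the node.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal singly linked list with the node embedded in the element. */
typedef struct list_node
{
    struct list_node *next;
} list_node;

/* Recover the containing struct from a pointer to its embedded node. */
#define containerof(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

typedef struct Item
{
    int       value;
    list_node node;     /* embedded: no separate list cell needed */
} Item;

static void
push(list_node *head, list_node *node)
{
    node->next = head->next;
    head->next = node;
}

static int
first_value(list_node *head)
{
    return containerof(head->next, Item, node)->value;
}
```

This is the same pattern as the Linux kernel's list.h, and it is why the element overhead is lower than dllist.h's separate value pointer: the link and the payload share one allocation.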
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Tue, Jun 19, 2012 at 11:02 PM, Andres Freund wrote: > On Tuesday, June 19, 2012 09:59:48 PM Marko Kreen wrote: >> On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund wrote:
>> > +/*
>> > + * removes a node from a list
>> > + * Attention: O(n)
>> > + */
>> > +static inline void ilist_s_remove(ilist_s_head *head,
>> > +                                  ilist_s_node *node)
>> > +{
>> > +    ilist_s_node *last = &head->head;
>> > +    ilist_s_node *cur;
>> > +#ifndef NDEBUG
>> > +    bool found = false;
>> > +#endif
>> > +    while ((cur = last->next))
>> > +    {
>> > +        if (cur == node)
>> > +        {
>> > +            last->next = cur->next;
>> > +#ifndef NDEBUG
>> > +            found = true;
>> > +#endif
>> > +            break;
>> > +        }
>> > +        last = cur;
>> > +    }
>> > +    assert(found);
>> > +}
>>
>> This looks weird.
>>
>> In a cyclic list, removal is:
>>
>> node->prev->next = node->next;
>> node->next->prev = node->prev;
>>
>> And that's it.
> That's the singly linked list, not the doubly linked one. That's why it has an
> O(n) warning tacked on...
Oh, you have several list implementations there. Sorry, I was just browsing and it caught my eye. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 01/16] Overhaul walsender wakeup handling
On Tuesday, June 19, 2012 09:55:30 PM Robert Haas wrote: > On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund wrote: > > From: Andres Freund > > > > The previous coding could miss xlog writeouts at several places. E.g. > > when wal was written out by the background writer or even after a commit > > if synchronous_commit=off. > > This could lead to delays in sending data to the standby of up to 7 > > seconds. > > > > To fix this, move the responsibility of notification to the layer where > > the necessary information is actually present. We take some care not to > > do the notification while we hold contended locks like WALInsertLock or > > WALWriteLock. > > I am not convinced that it's a good idea to wake up every walsender > every time we do XLogInsert(). XLogInsert() is a super-hot code path, > and adding more overhead there doesn't seem warranted. We need to > replicate commit, commit prepared, etc. quickly, by why do we need to > worry about a short delay in replicating heap_insert/update/delete, > for example? They don't really matter until the commit arrives. 7 > seconds might be a bit long, but that could be fixed by decreasing the > polling interval for walsender to, say, a second. It's not woken up on every XLogInsert() call. It's only woken up if there was an actual disk write + fsync in there. That's exactly the point of the patch. The wakeup rate is actually lower for synchronous_commit=on than before, because previously it unconditionally did a wakeup for every commit (and similar) and now it only does that if something has been written + fsynced. > Parenthetically, I find it difficult to extract inline patches. No > matter whether I try to use it using Gmail + show original or the web > site, something always seems to get garbled. Will use git send-email --attach next time... Btw, git am should be able to extract the patches for you. 
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
Hi, On Tuesday, June 19, 2012 09:59:48 PM Marko Kreen wrote: > On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund wrote: > > +/* > > + * removes a node from a list > > + * Attention: O(n) > > + */ > > +static inline void ilist_s_remove(ilist_s_head *head, > > + ilist_s_node *node) > > +{ > > + ilist_s_node *last = &head->head; > > + ilist_s_node *cur; > > +#ifndef NDEBUG > > + bool found = false; > > +#endif > > + while ((cur = last->next)) > > + { > > + if (cur == node) > > + { > > + last->next = cur->next; > > +#ifndef NDEBUG > > + found = true; > > +#endif > > + break; > > + } > > + last = cur; > > + } > > + assert(found); > > +} > > This looks weird. > > In cyclic list removal is: > > node->prev->next = node->next; > node->next->prev = node->prev; > > And thats it. That's the singly linked list, not the doubly linked one. That's why it has an O(n) warning tacked on... The doubly linked one is just as you said: /* * removes a node from a list */ static inline void ilist_d_remove(unused_attr ilist_d_head *head, ilist_d_node *node) { ilist_d_check(head); node->prev->next = node->next; node->next->prev = node->prev; ilist_d_check(head); } Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
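[Editorial note: the two removal strategies being contrasted in this exchange can be boiled down to the following self-contained sketch. The names are simplified stand-ins, not the patch's actual ilist API.]

```c
#include <assert.h>
#include <stddef.h>

/* List nodes meant to be embedded in a larger struct, as in the patch. */
typedef struct slist_node { struct slist_node *next; } slist_node;
typedef struct slist_head { slist_node head; } slist_head;

typedef struct dlist_node { struct dlist_node *prev, *next; } dlist_node;

/* Singly linked removal must walk from the head to find the node's
 * predecessor, hence the O(n) warning in the patch. */
static inline void
slist_remove(slist_head *head, slist_node *node)
{
    slist_node *last = &head->head;
    slist_node *cur;

    while ((cur = last->next) != NULL)
    {
        if (cur == node)
        {
            last->next = cur->next;
            return;
        }
        last = cur;
    }
    assert(0);              /* node was not on the list */
}

/* Doubly linked (cyclic) removal needs no traversal at all: O(1).
 * This is exactly the two-line form Marko quoted. */
static inline void
dlist_remove(dlist_node *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
}
```

Both operate on embedded nodes, so no separate allocation is needed; the trade-off under discussion is purely one pointer of overhead per node versus O(n) removal.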
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund wrote: > +/* > + * removes a node from a list > + * Attention: O(n) > + */ > +static inline void ilist_s_remove(ilist_s_head *head, > + ilist_s_node *node) > +{ > + ilist_s_node *last = &head->head; > + ilist_s_node *cur; > +#ifndef NDEBUG > + bool found = false; > +#endif > + while ((cur = last->next)) > + { > + if (cur == node) > + { > + last->next = cur->next; > +#ifndef NDEBUG > + found = true; > +#endif > + break; > + } > + last = cur; > + } > + assert(found); > +} This looks weird. In a cyclic list, removal is: node->prev->next = node->next; node->next->prev = node->prev; And that's it. -- marko
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Tue, Jun 19, 2012 at 3:48 PM, Andres Freund wrote: > Why is that code not used more widely? Quite a bit of our list usage should be > replaced embedding list element in larger structs imo. There are also open- > coded inline list manipulations around (check aset.c for example). Because we've got a million+ lines of code and nobody knows where all the bodies are buried. > 1. dllist.h has double the element overhead by having an inline value pointer > (which is not needed when embedding) and a pointer to the list (which I have a > hard time seing as being useful) > 2. only double linked list, mine provided single and double linked ones > 3. missing macros to use when embedded in a larger struct (containerof() > wrappers and for(...) support basically) > 4. most things are external function calls... > 5. way much more branches/complexity in most of the functions. My > implementation doesn't use any branches for the typical easy modifications > (push, pop, remove element somewhere) and only one for the typical tests > (empty, has-next, ...) > > The performance and memory aspects were crucial for the aforementioned toy > project (slab allocator for postgres). Its not that crucial for the applycache > where the lists currently are mostly used although its also relatively > performance sensitive and obviously does a lot of list manipulation/iteration. > > If I had to decide I would add the missing api in dllist.h to my > implementation and then remove it. Its barely used - and only in an embedded > fashion - as far as I can see. > I can understand though if that argument is met with doubt by others ;). If > thats the way it has to go I would add some more convenience support for > embedding data to dllist.h and settle for that. I think it might be simpler to leave the name as Dllist and just overhaul the implementation along the lines you suggest, rather than replacing it with something completely different. 
Mostly, I don't want to add a third thing if we can avoid it, given that Dllist as it exists today is used only lightly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH 01/16] Overhaul walsender wakeup handling
On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund wrote: > From: Andres Freund > > The previous coding could miss xlog writeouts at several places. E.g. when wal > was written out by the background writer or even after a commit if > synchronous_commit=off. > This could lead to delays in sending data to the standby of up to 7 seconds. > > To fix this move the responsibility of notification to the layer where the > neccessary information is actually present. We take some care not to do the > notification while we hold conteded locks like WALInsertLock or WalWriteLock > locks. I am not convinced that it's a good idea to wake up every walsender every time we do XLogInsert(). XLogInsert() is a super-hot code path, and adding more overhead there doesn't seem warranted. We need to replicate commit, commit prepared, etc. quickly, but why do we need to worry about a short delay in replicating heap_insert/update/delete, for example? They don't really matter until the commit arrives. 7 seconds might be a bit long, but that could be fixed by decreasing the polling interval for walsender to, say, a second. When I was doing some testing recently, the case that was sort of confusing was the replay of AccessExclusiveLocks. The lock didn't show up on the standby until many seconds after it had shown up on the master. But that's more a feature than a bug when you really think about it - postponing the lock on the slave for as long as possible just reduces the user impact of having to take them there at all. Parenthetically, I find it difficult to extract inline patches. No matter whether I try using Gmail + show original or the web site, something always seems to get garbled. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Tuesday, June 19, 2012 09:16:41 PM Robert Haas wrote: > On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund wrote: > > Adds a single and a double linked list which can easily embedded into > > other datastructures and can be used without any additional allocations. > > dllist.h advertises that it's embeddable. Can you use that instead, > or enhance it slightly to support what you want to do? Oh, wow. Didn't know that existed. I had $subject lying around from a play-around project, so I didn't look too hard. Why is that code not used more widely? Quite a bit of our list usage should be replaced by embedding list elements in larger structs imo. There are also open-coded inline list manipulations around (check aset.c for example). *looks* Not really that happy with it though: 1. dllist.h has double the element overhead by having an inline value pointer (which is not needed when embedding) and a pointer to the list (which I have a hard time seeing as being useful) 2. only a doubly linked list, mine provided singly and doubly linked ones 3. missing macros to use when embedded in a larger struct (containerof() wrappers and for(...) support basically) 4. most things are external function calls... 5. way more branches/complexity in most of the functions. My implementation doesn't use any branches for the typical easy modifications (push, pop, remove element somewhere) and only one for the typical tests (empty, has-next, ...) The performance and memory aspects were crucial for the aforementioned toy project (slab allocator for postgres). It's not that crucial for the applycache where the lists currently are mostly used, although it's also relatively performance sensitive and obviously does a lot of list manipulation/iteration. If I had to decide I would add the missing API from dllist.h to my implementation and then remove dllist.h. It's barely used - and only in an embedded fashion - as far as I can see. I can understand though if that argument is met with doubt by others ;).
If that's the way it has to go I would add some more convenience support for embedding data to dllist.h and settle for that. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
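[Editorial note: the containerof() wrappers Andres lists under point 3 are the standard offsetof() trick for recovering the enclosing struct from an embedded node. A minimal sketch with illustrative names, not the patch's actual API:]

```c
#include <assert.h>
#include <stddef.h>

/* Recover a pointer to the enclosing struct from a pointer to one of
 * its members, by subtracting the member's offset within the struct. */
#define containerof(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

typedef struct list_node { struct list_node *next; } list_node;

/* A larger struct with the list node embedded: list membership costs
 * no separate allocation, which is the whole point of the design. */
typedef struct item
{
    int       value;
    list_node node;
} item;
```

Iteration macros then walk the embedded nodes and apply containerof() to hand the caller back full item pointers.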
Re: [HACKERS] return values of backend sub-main functions
On Tue, Jun 19, 2012 at 7:31 AM, Peter Eisentraut wrote: > On ons, 2012-01-18 at 21:21 +0200, Peter Eisentraut wrote: >> On lör, 2012-01-07 at 16:41 -0500, Tom Lane wrote: >> > Peter Eisentraut writes: >> > > I suggest that we change PostgresMain(), PostmasterMain(), BackendRun(), >> > > WalSenderMain(), and WalSndLoop() to return void as well. >> > >> > I agree this code is not very consistent or useful, but one question: >> > what should the callers do if one of these functions *does* return? >> >> I was thinking of a two-pronged approach: First, add >> __attribute__((noreturn)) to the functions. This will cause a suitable >> compiler to verify on a source-code level that nothing actually returns >> from the function. And second, at the call site, put an abort(); /* >> not reached */. Together, this will make the code cleaner and more >> consistent, and will also help the compiler out a bit about the control >> flow. > > Patch for 9.3 attached. Seems reasonable on a quick read-through. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
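[Editorial note: Peter's two-pronged approach - mark the sub-main noreturn, and put an unreachable abort() at the call site - could look roughly like the sketch below. The function names are hypothetical, and the sketch escapes via longjmp() rather than exit() purely so it can be exercised in-process.]

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

static jmp_buf escape;

/* Prong one: the attribute lets a suitable compiler verify, at the
 * source level, that no code path actually returns from the function. */
static void sub_main(int code) __attribute__((noreturn));

static void
sub_main(int code)
{
    /* A real backend sub-main would loop and eventually exit the
     * process; here we longjmp out so the caller can observe it. */
    longjmp(escape, code);
}

static int
run(void)
{
    switch (setjmp(escape))
    {
        case 0:
            sub_main(7);
            abort();        /* prong two: not reached */
        case 7:
            return 7;
        default:
            return -1;
    }
}
```

Together the two prongs make the control flow explicit both to readers and to the compiler, which is the cleanup being proposed for PostgresMain() and friends.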
Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases
On Tue, Jun 19, 2012 at 3:03 PM, Alvaro Herrera wrote: > That's what I thought. I will commit to both branches soon, then. I think there might be three branches involved. > Mind you, this should have been an "open item", not a commitfest item. > (Actually not even an open item. We should have committed it right > away.) True, but it's not a perfect world, and I didn't feel qualified to commit it myself. Adding it to the CommitFest at least prevented it from getting lost forever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
On Tue, Jun 19, 2012 at 1:43 AM, Tom Lane wrote: > Amit Kapila writes: >>> AFAIR you can create pg_control from scratch already with pg_resetxlog. >>> The hard part is coming up with values for the counters, such as the >>> next WAL location. Some of them such as next OID are pretty harmless >>> if you don't guess right, but I'm worried that wrong next WAL could >>> make things worse not better. > >> I believe if WAL files are proper as mentioned in Alvaro's mail, the >> purposed logic should generate correct values. > > I've got a problem with the assumption that, when pg_control is trash, > megabytes or gigabytes of WAL can still be relied on completely. > > I'm almost inclined to suggest that we not get next-LSN from WAL, but > by scanning all the pages in the main data store and computing the max > observed LSN. This is clearly not very attractive from a performance > standpoint, but it would avoid the obvious failure mode where you lost > some recent WAL segments along with pg_control. I think it could be useful to have a tool that scans all the blocks and computes that value, but I'd want it to just print the value out and let me decide what to do about it. There are cases where you don't necessarily want to clobber pg_control, but you do have future LSNs in your data file pages. This can be either because the disk ate your WAL, or because you didn't create recovery.conf, or because your disk corrupted the LSNs on the data file pages. I'd want a tool that could be either run on an individual file, or recursively on a directory. In terms of the TODO item, I haven't yet heard anyone clearly state "I wanted to use pg_controldata but it couldn't because X so therefore we need this patch". 
Alvaro mentioned the case where pg_control is missing altogether, but:

[rhaas pgsql]$ rm ~/pgdata/global/pg_control
[rhaas pgsql]$ postgres
postgres: could not find the database system
Expected to find it in the directory "/Users/rhaas/pgdata",
but could not open file "/Users/rhaas/pgdata/global/pg_control": No such file or directory
[rhaas pgsql]$ pg_resetxlog ~/pgdata
pg_resetxlog: could not open file "global/pg_control" for reading: No such file or directory
If you are sure the data directory path is correct, execute
  touch global/pg_control
and try again.
[rhaas pgsql]$ touch ~/pgdata/global/pg_control
[rhaas pgsql]$ pg_resetxlog ~/pgdata
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it

Guessed pg_control values:

First log file ID after reset:            0
First log file segment after reset:       69
pg_control version number:                922
Catalog version number:                   201206141
Database system identifier:               5755831325641078488
Latest checkpoint's TimeLineID:           1
Latest checkpoint's full_page_writes:     off
Latest checkpoint's NextXID:              0/3
Latest checkpoint's NextOID:              1
Latest checkpoint's NextMultiXactId:      1
Latest checkpoint's NextMultiOffset:      0
Latest checkpoint's oldestXID:            3
Latest checkpoint's oldestXID's DB:       0
Latest checkpoint's oldestActiveXID:      0
Maximum data alignment:                   8
Database block size:                      8192
Blocks per segment of large relation:     131072
WAL block size:                           8192
Bytes per WAL segment:                    16777216
Maximum length of identifiers:            64
Maximum columns in an index:              32
Maximum size of a TOAST chunk:            1996
Date/time type storage:                   64-bit integers
Float4 argument passing:                  by value
Float8 argument passing:                  by value

If these values seem acceptable, use -f to force reset.
[rhaas pgsql]$ pg_resetxlog -f ~/pgdata
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Transaction log reset
[rhaas pgsql]$ postgres
LOG: database system was shut down at 2012-06-19 15:25:28 EDT
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

So I still don't understand what problem we're solving here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
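[Editorial note: the scan-and-report tool Robert describes could have a core loop like the following sketch. It assumes 8 kB pages whose header begins with the LSN, stored as a 32-bit log file id plus a 32-bit offset as in 9.2-era PostgreSQL, and for illustration it works over an in-memory buffer rather than walking data files. All names are hypothetical.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

/* In-memory stand-in for the LSN at the start of a page header. */
typedef struct
{
    uint32_t xlogid;    /* log file id */
    uint32_t xrecoff;   /* byte offset within that log file */
} PageLSN;

/* Scan a buffer of consecutive pages and report the largest LSN seen;
 * a real tool would print this and let the operator decide what to do. */
static PageLSN
max_page_lsn(const unsigned char *buf, size_t nbytes)
{
    PageLSN max = {0, 0};
    size_t  off;

    for (off = 0; off + PAGE_SIZE <= nbytes; off += PAGE_SIZE)
    {
        PageLSN lsn;

        memcpy(&lsn, buf + off, sizeof(lsn));
        if (lsn.xlogid > max.xlogid ||
            (lsn.xlogid == max.xlogid && lsn.xrecoff > max.xrecoff))
            max = lsn;
    }
    return max;
}
```

Running this recursively over a data directory would yield the "max observed LSN" Tom mentions, without touching pg_control at all.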
Re: [HACKERS] use of int4/int32 in C code
On 19 June 2012 20:11, Robert Haas wrote: > On Tue, Jun 19, 2012 at 9:47 AM, Tom Lane wrote: >> Peter Eisentraut writes: >>> What is the latest theory on using int4 vs. int32 in C code? >>> (equivalently int2, int16) >> >> I thought the general idea was to use int32 most places, but int4 in >> catalog declarations. I don't think it's tremendously important if >> somebody uses the other though. > > I concur with Peter that TMTOWTDI is not the right way to do this. I > think we ought to get rid of int4 in code and use int32 everywhere. > >>> While we're at it, how do we feel about using C standard types like >>> int32_t instead of (or initially in addition to) our own definitions? >> >> Can't get very excited about this either. The most likely outcome of >> a campaign to substitute the standard types is that back-patching would >> become a truly painful activity. IMO, anything that is going to result >> in tens of thousands of diffs had better have a more-than-cosmetic >> reason. (That wouldn't apply if we only used int32_t in new code ... >> but then, instead of two approved ways to do it, there would be three. >> Which doesn't seem like it improves matters.) > > On this one, I agree with you. Yeah. I find pgindent changes annoying when doing a git blame myself. Now, granted, you can mostly take care of that by having the tool ignore whitespace changes, but that doesn't always work perfectly, and I haven't been able to figure out a better way of managing that. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 07:24:13 PM Robert Haas wrote: > On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane wrote: > > Andres Freund writes: > >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: > ... (If you are thinking > of something sufficiently high-level that merging could possibly work, > then it's not WAL, and we shouldn't be trying to make the WAL > representation cater for it.) > >> > >> Do you really see this as such a big problem? > > > > It looks suspiciously like "I have a hammer, therefore every problem > > must be a nail". I don't like the design concept of cramming logical > > replication records into WAL in the first place. > > Me, neither. I think it's necessary to try to find a way of > generating logical replication records from WAL. But once generated, > I think those records should form their own stream, independent of > WAL. If you take the contrary position that they should be included > in WAL, then when you filter the WAL stream down to just the records > of interest to logical replication, you end up with a WAL stream with > holes in it, which is one of the things that Andres listed as an > unresolved design problem in his original email. Yes. > Moreover, this isn't necessary at all for single-master replication, > or even multi-source replication where each table has a single master. > It's only necessary for full multi-master replication, which we have > no consensus to include in core, and even if we did have a consensus > to include it in core, it certainly shouldn't be the first feature we > design. Well, you can't blame a patch/prototype trying to implement what it claims to implement ;) More seriously: Even if we don't put MM in core I think putting the basis for it in core so that somebody can build such a solution reusing the existing infrastructure is a sensible idea. Imo the only thing that requires explicit support which is hard to add outside of core is prevention of loops (aka this patch). 
Everything else should be doable reusing the hopefully modular pieces. > > However, if we're dead set on doing it that way, let us put information > > that is only relevant to logical replication records into only the > > logical replication records. > Right. If we decide we need this, and if we did decide to conflate > the WAL stream, both of which I disagree with as noted above, then we > still don't need it on every record. It would probably be sufficient > for local transactions to do nothing at all (and we can implicitly > assume that they have master node ID = local node ID) and transactions > which are replaying remote changes to emit one record per XID per > checkpoint cycle containing the remote node ID. You've gone from a pretty trivial 150-line patch without any runtime/space overhead to something *considerably* more complex in that case though. You suddenly need relatively complex logic to remember which information you got for a certain xid (and forget that information afterwards), to track whether you already logged that xid, and to decide about logging that information at multiple places. Btw, what do you mean by "conflating" the stream? I don't really see that being proposed. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund wrote: > Adds a single and a double linked list which can easily embedded into other > datastructures and can be used without any additional allocations. dllist.h advertises that it's embeddable. Can you use that instead, or enhance it slightly to support what you want to do? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] use of int4/int32 in C code
On Tue, Jun 19, 2012 at 9:47 AM, Tom Lane wrote: > Peter Eisentraut writes: >> What is the latest theory on using int4 vs. int32 in C code? >> (equivalently int2, int16) > > I thought the general idea was to use int32 most places, but int4 in > catalog declarations. I don't think it's tremendously important if > somebody uses the other though. I concur with Peter that TMTOWTDI is not the right way to do this. I think we ought to get rid of int4 in code and use int32 everywhere. >> While we're at it, how do we feel about using C standard types like >> int32_t instead of (or initially in addition to) our own definitions? > > Can't get very excited about this either. The most likely outcome of > a campaign to substitute the standard types is that back-patching would > become a truly painful activity. IMO, anything that is going to result > in tens of thousands of diffs had better have a more-than-cosmetic > reason. (That wouldn't apply if we only used int32_t in new code ... > but then, instead of two approved ways to do it, there would be three. > Which doesn't seem like it improves matters.) On this one, I agree with you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases
Excerpts from Robert Haas's message of mar jun 19 11:36:41 -0400 2012: > > On Mon, Jun 18, 2012 at 3:30 PM, Alvaro Herrera > wrote: > > Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012: > > > >> Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr > >> and SvPVUTF8() when turning a perl string into a cstring. > > > > Hmm, this patch belongs into back branches too, right? Not just the > > current development tree? > > It seems like a bug fix to me. That's what I thought. I will commit to both branches soon, then. Mind you, this should have been an "open item", not a commitfest item. (Actually not even an open item. We should have committed it right away.) -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sortsupport for text
On 19 June 2012 19:44, Peter Geoghegan wrote: > You could do that, and some people do use custom collations for > various reasons. That's obviously very much of minority interest > though. Most people will just use citext or something. However, since > citext is itself a client of varstr_cmp(), this won't help you. I spoke too soon - that would work fine in the U.S., since the text is converted to lower case on the fly in a locale- and collation-aware fashion. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] sortsupport for text
"Kevin Grittner" writes: > I wasn't aware that en_US.UTF-8 doesn't have equivalence without > equality. I guess that surprising result in my last post is just > plain inevitable with that collation then. Bummer. Is there > actually anyone who finds that to be a useful behavior? For a > collation which considered upper-case and lower-case to be > equivalent, would PostgreSQL sort as I wanted, or is it doing some > tie-break per column within equivalent values? Well, there are two different questions there. As far as the overall structure of an ORDER BY or btree index is concerned, all it knows is what the datatype comparator functions tell it. There is no such thing as a second pass to reconsider values found to be "equal"; that's all the info there is. As far as the behavior of the comparator function for text is concerned, we choose to break ties reported by strcoll via strcmp. But that's not visible from outside the comparator, so there's no notion of an "equivalence" behavior different from "equality". I'm not exactly convinced that we could have a second-pass tiebreak mechanism without creating serious problems for btree indexes; in particular it seems like ordering considering only some leading keys might not match the actual index ordering. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers reduced, v1
On Fri, Jun 15, 2012 at 4:27 PM, Dimitri Fontaine wrote: > Allow me to open the new season of the DML trigger series, named > pg_event_trigger. This first episode is all about setting up the drama, > so that next ones make perfect sense. Comments: 1. I still think we ought to get rid of the notion of BEFORE or AFTER (i.e. pg_event_trigger.evttype) and just make that detail part of the event name (e.g. pg_event_trigger.evtevent). Many easily foreseeable event types will be more like "during" rather than "before" or "after", and for those that do have a notion of before and after, we can have two different event names and include the word "before" or "after" there. I am otherwise satisfied with the schema you've chosen. 2. I think it's important to be able to add new types of event triggers without creating excessive parser bloat. To that end, we should use some kind of generic syntax here which will be able to apply to all types of triggers we may want to add, now or in the future. The easiest way to do that is to use literal syntax for the list of command tags, rather than writing them out as key words: e.g. 'ALTER TABLE' rather than ALTER TABLE. It's not quite as pretty, but the savings in parser bloat and future code change seem well worth it. Or, alternatively, we could use identifier style, e.g. alter_table, as I previously suggested. 3. The event trigger cache seems to be a few bricks shy of a load. First, event_trigger_cache_is_stalled is mis-named; I think you mean "stale", not "stalled". Second, instead of setting that flag and then rebuilding the cache when you see the flag set, how about just blowing away the cache contents whenever you would have set the flag? That seems a whole lot simpler and cleaner, and removes the need for a force_rebuild flag on BuildEventTriggerCache(). Third, ISTM that this isn't going to work correctly if backend A performs an event after backend B has built its cache.
To fix this, I think you need to rip out all the places where you force a rebuild locally and instead use something like CacheRegisterSyscacheCallback() to blow away the cache whenever something changes; you might find it helpful to look at attoptcache.c. 4. The documentation doesn't build:

openjade:reference.sgml:44:4:W: cannot generate system identifier for general entity "alterEventTrigger"
openjade:reference.sgml:44:4:E: general entity "alterEventTrigger" not defined and no default entity
openjade:reference.sgml:44:21:E: reference to entity "alterEventTrigger" for which no system identifier could be generated
openjade:reference.sgml:44:3: entity was defined here
openjade:reference.sgml:86:4:W: cannot generate system identifier for general entity "createEventTrigger"
openjade:reference.sgml:86:4:E: general entity "createEventTrigger" not defined and no default entity
openjade:reference.sgml:86:22:E: reference to entity "createEventTrigger" for which no system identifier could be generated
openjade:reference.sgml:86:3: entity was defined here
openjade:reference.sgml:125:4:W: cannot generate system identifier for general entity "dropEventTrigger"
openjade:reference.sgml:125:4:E: general entity "dropEventTrigger" not defined and no default entity
openjade:reference.sgml:125:20:E: reference to entity "dropEventTrigger" for which no system identifier could be generated
openjade:reference.sgml:125:3: entity was defined here
openjade:catalogs.sgml:1868:35:E: character "_" is not allowed in the value of attribute "ZONE"
openjade:catalogs.sgml:1868:19:X: reference to non-existent ID "CATALOG-PG-EVENT_TRIGGER"
openjade:trigger.sgml:43:47:X: reference to non-existent ID "SQL-CREATECOMMANDTRIGGER"

5. In terms of a psql command, I think that \dev is both not very mnemonic and, as you mentioned in the comment, possibly confusable with SQL/MED.
If we're going to have a \d command for this, I suggest something like \dy, which is not very mnemonic either but at least seems unlikely to be confused with anything else. Other things that are unused include \dh, \dj, \dk, \dq, \dw, and \dz, if any of those grab you, or a somewhat broader range of things (but still nothing very memorable) if we include capital letters. Or we could branch out into punctuation, like \d& -- or things that don't begin with the letter d, but that doesn't seem like a particularly good idea. 6. In objectaddress.c, I think that get_object_address_event_trigger should be eliminated in favor of an additional case in get_object_address_unqualified. 7. There are many references to command triggers that still need to be cleaned up. trigger.sgml still makes reference to the name command triggers. plpgsql.sgml also contains vestiges of the command trigger notation, and references to some TG_* variables that I don't think exist in this version of the patch. event_trigger.c is identified (twice) as cmdtrigger.c in the file header comment. The header comment for InsertEventTriggerTuple makes reference to pg_cmdtrigger, as does a l
Re: [HACKERS] sortsupport for text
On 19 June 2012 18:57, Kevin Grittner wrote: > We weren't using en_US.UTF-8 collation (or any other "proper" > collation) on Sybase -- I'm not sure whether they even supported > proper collation sequences on the versions we used. I'm thinking of > when we were using their "case insensitive" sorting. I don't know > the implementation details, but the behavior was consistent with > including each character-based column twice: once in the requested > position in the ORDER BY clause but folded to a consistent case, and > again after all the columns in the ORDER BY clause in original form, > with C collation. > > I wasn't aware that en_US.UTF-8 doesn't have equivalence without > equality. Not that many do. The underlying cause of the problem back in 2005 was the tacit assumption that none do, which presumably we got away with for a while. I mentioned Hungarian, but it happens a bit in Swedish too. PostgreSQL supported Unicode before 2005, when the tie-breaker was introduced. I know at least one Swede who used Postgres95. I just took a look at the REL6_4 branch, and it looks much the same in 1999 as it did in 2005, in that there is no tie-breaker after the strcoll(). Now, that being the case, and Hungarian in particular having a whole bunch of these equivalencies, I have to wonder if the original complainant's problem really was diagnosed correctly. It could have had something to do with the fact that texteq() was confused about whether it reported equality or equivalency - it may have taken that long for the (len1 != len2) fastpath (which only holds for equality, not equivalence, despite the fact that the 2005-era strcoll() call checks equivalence within texteq()) to trip someone up, because texteq() would thereby have given inconsistent answers in a very subtle way, answers that were correct neither according to the Hungarian locale nor according to simple bitwise equality. That's mostly speculation, but the question must be asked.
> I guess that surprising result in my last post is just > plain inevitable with that collation then. Bummer. Is there > actually anyone who finds that to be a useful behavior? Looking at the details again and assuming a US locale, yeah, it is. The substance of your complaint holds though. > For a collation which considered upper-case and lower-case to be > equivalent, would PostgreSQL sort as I wanted, or is it doing some > tie-break per column within equivalent values? You could do that, and some people do use custom collations for various reasons. That's obviously very much of minority interest though. Most people will just use citext or something. However, since citext is itself a client of varstr_cmp(), this won't help you. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
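The texteq() inconsistency described above can be sketched in isolation. This is a hypothetical reconstruction, not the actual 2005-era backend code: the length fastpath reasons in terms of bitwise equality, while the strcoll() path reasons in terms of collation equivalence, so in a locale where two different-length strings are equivalent (as with some Hungarian digraphs) the two paths can disagree.

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical sketch of the 2005-era texteq() inconsistency discussed
 * above: the length fastpath tests bitwise equality, but the slow path
 * delegated to strcoll(), which tests collation equivalence.  A locale
 * where different-length strings can be equivalent makes the two paths
 * give contradictory answers.
 */
static int
texteq_2005_sketch(const char *a, const char *b)
{
    size_t len1 = strlen(a);
    size_t len2 = strlen(b);

    if (len1 != len2)
        return 0;               /* bitwise-equality reasoning */
    return strcoll(a, b) == 0;  /* equivalence reasoning */
}

/* A self-consistent definition: plain bitwise equality. */
static int
texteq_bitwise(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

In the C locale the two functions agree; the subtle divergence Peter speculates about only appears under a locale with equivalent-but-not-identical strings.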
Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture
Hi, The most important part, even for people not following my discussion with Robert, is at the bottom where the possible wal decoding strategies are laid out. On Tuesday, June 19, 2012 03:20:58 AM Robert Haas wrote: > On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund wrote: > >> > Hm. Yes, you could do that. But I have to say I don't really see a > >> > point. Maybe the fact that I do envision multimaster systems at some > >> > point is clouding my judgement though as it's far less easy in that > >> > case. > >> > >> Why? I don't think that particularly changes anything. > > > > Because it makes conflict detection very hard. I also don't think it's a > > feature worth supporting. What's the use-case of updating records you > > cannot properly identify? > Don't ask me; I just work here. I think it's something that some > people want, though. I mean, if you don't support replicating a table > without a primary key, then you can't even run pgbench in a > replication environment. Well, I have no problem with INSERT-only tables not having a PK. And pgbench_history is the only pgbench table that doesn't have a pkey? And that's only truncated... > Is that an important workload? Well, > objectively, no. But I guarantee you that other people with more > realistic workloads than that will complain if we don't have it. > Absolutely required on day one? Probably not. Completely useless > appendage that no one wants? Not that, either. Maybe. And I don't really care, so if others see that as important I am happy to appease them ;). > >> In my view, a logical replication solution is precisely one in which > >> the catalogs don't need to be in sync. If the catalogs have to be in > >> sync, it's not logical replication. ISTM that what you're talking > >> about is sort of a hybrid between physical replication (pages) and > >> logical replication (tuples) - you want to ship around raw binary > >> tuple data, but not entire pages. > > Ok, that's a valid point. 
Simon argued at the cluster summit that > > everything that's not physical is logical. Which has some appeal because > > it seems hard to agree what exactly logical rep is. So definition by > > exclusion makes kind of sense ;) > Well, the words are fuzzy, but I would define logical replication to > be something which is independent of the binary format in which stuff > gets stored on disk. If it's not independent of the disk format, then > you can't do heterogeneous replication (between versions, or between > products). That precise limitation is the main thing that drives > people to use anything other than SR in the first place, IME. Not in mine. The main limitation I see is that you cannot write anything on the standby. Which sucks majorly for many things. It's pretty much impossible to "fix" that for SR outside of very limited cases. While many scenarios don't need multimaster, *many* need to write outside of the standby's replication set. > > I think what you categorized as "hybrid logical/physical" rep solves an > > important use-case that's very hard to solve at the moment. Before my > > 2ndquadrant days I had several clients who had huge problems using the > > trigger-based solutions because their overhead was simply too big a > > burden on the master. They couldn't use SR either because every > > consuming database kept loads of local data. > > I think such scenarios are getting more and more common. > I think this is to some extent true, but I also think you're > conflating two different things. Change extraction via triggers > introduces overhead that can be eliminated by reconstructing tuples > from WAL in the background rather than forcing them to be inserted > into a shadow table (and re-WAL-logged!) in the foreground. I will > grant that shipping the tuple as a binary blob rather than as text > eliminates additional overhead on both ends, but it also closes off a > lot of important use cases. 
As I noted in my previous email, I think > that ought to be a performance optimization that we do, if at all, > when it's been proven safe, not a baked-in part of the design. Even a > solution that decodes WAL to text tuples and ships those around and > reinserts them via pure SQL should be significantly faster than the > replication solutions we have today; if it isn't, something's wrong. It's not only the logging side which is a limitation in today's replication scenarios. The apply side scales even worse because it's *very* hard to distribute it between multiple backends. > >> The problem with that is it's going to be tough to make robust. Users > >> could easily end up with answers that are total nonsense, or probably > >> even crash the server. > > Why? > Because the routines that decode tuples don't include enough sanity > checks to prevent running off the end of the block, or even the end of > memory completely. Consider a corrupt TOAST pointer that indicates > that there is a gigabyte of data stored in an 8kB block. One of the > common s
Re: [HACKERS] sortsupport for text
Peter Geoghegan wrote: > Kevin Grittner wrote: >> I'm pretty sure that when I was using Sybase ASE the order for >> non-equal values was always predictable, and it behaved in the >> manner I describe below. I'm less sure about any other product. > > Maybe it used a physical row identifier as a tie-breaker? Note > that we use ItemPointers as a tie-breaker for sorting index > tuples. > > I imagine that it was at least predictable among columns being > sorted, if only because en_US.UTF-8 doesn't have any notion of > equivalence (that is, it just so happens that there are no two > strings that are equivalent but not bitwise equal). It would > surely be impractical to do a comparison for the entire row, as > that could be really expensive. We weren't using en_US.UTF-8 collation (or any other "proper" collation) on Sybase -- I'm not sure whether they even supported proper collation sequences on the versions we used. I'm thinking of when we were using their "case insensitive" sorting. I don't know the implementation details, but the behavior was consistent with including each character-based column twice: once in the requested position in the ORDER BY clause but folded to a consistent case, and again after all the columns in the ORDER BY clause in original form, with C collation. I wasn't aware that en_US.UTF-8 doesn't have equivalence without equality. I guess that surprising result in my last post is just plain inevitable with that collation then. Bummer. Is there actually anyone who finds that to be a useful behavior? For a collation which considered upper-case and lower-case to be equivalent, would PostgreSQL sort as I wanted, or is it doing some tie-break per column within equivalent values? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Do we want a xmalloc or similar function in the Backend?
On Tuesday, June 19, 2012 07:35:53 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 10:17 AM, Andres Freund wrote:
> > There are 70+ calls of malloc in the backend in the form of
> >
> >     type* foo = malloc(sizeof(...));
> >     if (!foo)
> >         elog(ERROR, "could not allocate memory");
> >
> > which is a bit annoying to write at times. Would somebody argue against
> > introducing a function that does the above named xmalloc() or
> > malloc_or_die()?
>
> I can't even find 70 malloc calls in the entire backend, let alone 70
> with that pattern. Still, I don't think malloc_or_error (not die)
> would be a bad idea.

$ ack '\bmalloc\s*\(' src/backend/ | wc -l
70

10-15 or so of those are comments. The 70+ came from me running on some development branch with commitfest patches applied...

> But the error should definitely be written as:
>
>     ereport(ERROR,
>             (errcode(ERRCODE_OUT_OF_MEMORY),
>              errmsg("out of memory")));
>
> ...not elog.

Yes, definitely. Currently some of those locations (e.g. in xlog.c) are only protected by Asserts... Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Do we want a xmalloc or similar function in the Backend?
On Tue, Jun 19, 2012 at 10:17 AM, Andres Freund wrote:
> There are 70+ calls of malloc in the backend in the form of
>
>     type* foo = malloc(sizeof(...));
>     if (!foo)
>         elog(ERROR, "could not allocate memory");
>
> which is a bit annoying to write at times. Would somebody argue against
> introducing a function that does the above named xmalloc() or malloc_or_die()?

I can't even find 70 malloc calls in the entire backend, let alone 70 with that pattern. Still, I don't think malloc_or_error (not die) would be a bad idea. But the error should definitely be written as:

    ereport(ERROR,
            (errcode(ERRCODE_OUT_OF_MEMORY),
             errmsg("out of memory")));

...not elog. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
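A minimal sketch of the proposed wrapper might look like the following. It is hypothetical: in the backend the failure path would be the ereport() call quoted above, but here fprintf()/exit() stand in so the sketch compiles on its own.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical sketch of the proposed malloc_or_error().  In the backend
 * the failure path would be
 *     ereport(ERROR,
 *             (errcode(ERRCODE_OUT_OF_MEMORY),
 *              errmsg("out of memory")));
 * here fprintf()/exit() stand in so the sketch is self-contained.
 */
static void *
malloc_or_error(size_t size)
{
    void *ptr = malloc(size);

    if (ptr == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    return ptr;
}
```

With such a helper the repeated pattern collapses to a single line: `type *foo = malloc_or_error(sizeof(...));`.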
Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases
On Mon, Jun 18, 2012 at 1:30 PM, Alvaro Herrera wrote: > > Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012: > >> Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr >> and SvPVUTF8() when turning a perl string into a cstring. > > Hmm, this patch belongs into back branches too, right? Not just the > current development tree? (Have been out of town, sorry for the late reply) Yeah, back to 9.1. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb and fsync
On Tuesday, June 19, 2012 07:22:02 PM Jeff Davis wrote: > On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote: > > It calls pg_flush_data inside of copy_file which does the > > posix_fadvise... So maybe just put the sync_file_range in pg_flush_data? > > The functions in fd.c aren't linked to initdb, so it's a challenge to > share that code (I remember now: that's why I didn't use pg_flush_data > originally). I don't see a clean way to resolve that, do you? > > Also, it seems that sync_file_range() doesn't help with creating a > database much (on my machine), so it doesn't seem important to use it > there. > > Right now I'm inclined to leave the patch as-is. Fine with that, I wanted to bring it up and see it documented. I have marked it with ready for committer. That committer needs to decide on - N in the regression tests or not, but that shouldn't be much of a problem ;) Cool! Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane wrote: > Andres Freund writes: >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: ... (If you are thinking of something sufficiently high-level that merging could possibly work, then it's not WAL, and we shouldn't be trying to make the WAL representation cater for it.) > >> Do you really see this as such a big problem? > > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. Me, neither. I think it's necessary to try to find a way of generating logical replication records from WAL. But once generated, I think those records should form their own stream, independent of WAL. If you take the contrary position that they should be included in WAL, then when you filter the WAL stream down to just the records of interest to logical replication, you end up with a WAL stream with holes in it, which is one of the things that Andres listed as an unresolved design problem in his original email. Moreover, this isn't necessary at all for single-master replication, or even multi-source replication where each table has a single master. It's only necessary for full multi-master replication, which we have no consensus to include in core, and even if we did have a consensus to include it in core, it certainly shouldn't be the first feature we design. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. Right. If we decide we need this, and if we did decide to conflate the WAL stream, both of which I disagree with as noted above, then we still don't need it on every record. 
It would probably be sufficient for local transactions to do nothing at all (and we can implicitly assume that they have master node ID = local node ID) and transactions which are replaying remote changes to emit one record per XID per checkpoint cycle containing the remote node ID. > Saving a couple bytes in each such record > is penny-wise and pound-foolish, I'm afraid; especially when you're > nailing down hard, unexpansible limits at the very beginning of the > development process in order to save those bytes. I completely agree. I think that, as Dan said upthread, having a 64 or 128 bit ID so that it can be generated automatically rather than configured by an administrator who must be careful not to duplicate node IDs in any pair of systems that could ever end up talking to each other would be a vast usability improvement. Perhaps systems A, B, and C are replicating to each other today, as are systems D and E. But now suppose that someone decides they want to replicate one table between A and D. Suddenly the node IDs have to be distinct where they didn't before, and now there's potentially a problem to hassle with that wouldn't have been an issue if the node IDs had been wide enough to begin with. It is not unusual for people to decide after-the-fact to begin replicating between machines where this wasn't originally anticipated and which may even be under different administrative control. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
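The loop-prevention role of the origin id being debated here can be sketched as an apply-side filter. The struct and field names below are purely illustrative, not the patch's actual WAL record layout: a node refuses to re-apply a change that it originated itself, which breaks the A-to-B-to-A replication loop.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch (not the patch's actual format): each replicated
 * change carries the id of the node where it first happened.
 */
typedef struct LogicalRecord
{
    uint16_t origin_node_id;    /* node where the change originated */
    /* ... payload: relation, old/new tuple, etc. ... */
} LogicalRecord;

/*
 * Apply-side filter: skip records that this node generated itself,
 * so a change forwarded around a multi-master ring is applied exactly
 * once per node and never loops.
 */
static bool
should_apply(const LogicalRecord *rec, uint16_t local_node_id)
{
    return rec->origin_node_id != local_node_id;
}
```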
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote: > It calls pg_flush_data inside of copy_file which does the posix_fadvise... So > maybe just put the sync_file_range in pg_flush_data? The functions in fd.c aren't linked to initdb, so it's a challenge to share that code (I remember now: that's why I didn't use pg_flush_data originally). I don't see a clean way to resolve that, do you? Also, it seems that sync_file_range() doesn't help with creating a database much (on my machine), so it doesn't seem important to use it there. Right now I'm inclined to leave the patch as-is. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
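Since initdb cannot link the backend's fd.c, a frontend program would need to carry a small local flush hint of its own. The sketch below is hypothetical (the function name is invented): it uses POSIX posix_fadvise(), and on Linux sync_file_range() could be substituted, though as noted above that made little difference for this workload.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a pg_flush_data-style hint that a frontend
 * program such as initdb could carry locally.  posix_fadvise() asks the
 * kernel to begin writeback and drop the cached pages; it is only a
 * hint, and whether to treat failure as an error is the caller's
 * policy decision.
 */
static int
flush_data_hint(int fd, off_t offset, off_t nbytes)
{
#if defined(POSIX_FADV_DONTNEED)
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#else
    return 0;                   /* no-op where the hint is unavailable */
#endif
}
```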
Re: [HACKERS] sortsupport for text
On 19 June 2012 17:45, Kevin Grittner wrote: > Peter Geoghegan wrote: >> Are you sure that they actually have a tie-breaker, and don't just >> make the distinction between equality and equivalence (if only >> internally)? > > I'm pretty sure that when I was using Sybase ASE the order for > non-equal values was always predictable, and it behaved in the > manner I describe below. I'm less sure about any other product. Maybe it used a physical row identifier as a tie-breaker? Note that we use ItemPointers as a tie-breaker for sorting index tuples. I imagine that it was at least predictable among columns being sorted, if only because en_US.UTF-8 doesn't have any notion of equivalence (that is, it just so happens that there are no two strings that are equivalent but not bitwise equal). It would surely be impractical to do a comparison for the entire row, as that could be really expensive. >> I don't see why you'd want a tie-breaker across multiple keys. I >> mean, you could, I just don't see any reason to. > > So that you can have entirely repeatable results across multiple > runs? I suppose that isn't an unreasonable use-case, but it would be unreasonable to require that everyone pay for it, particularly if it implied another strcmp(). You don't need me to tell you that we generally discourage the idea that the order that tuples are returned in is defined in any way beyond that asked for by the user. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
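The per-key comparison pattern under discussion can be sketched as follows. This is a hedged illustration of the general strcoll-then-tie-break shape, not the real varstr_cmp() (which, among other differences, works on non-NUL-terminated text): the trailing strcmp() is the extra comparison cost Peter refers to, and it is what makes the comparator's notion of equality bitwise equality rather than mere collation equivalence.

```c
#include <locale.h>
#include <string.h>

/*
 * Illustrative sketch of the per-key comparison pattern discussed above:
 * let the locale order the strings via strcoll(), then break ties between
 * strings that are equivalent but not identical with a bitwise strcmp().
 * This guarantees that only bitwise-equal strings ever compare as equal.
 */
static int
text_cmp_sketch(const char *a, const char *b)
{
    int result = strcoll(a, b);

    if (result == 0)
        result = strcmp(a, b);  /* tie-breaker for equivalent strings */
    return result;
}
```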
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
Excerpts from Boszormenyi Zoltan's message of mar jun 19 12:44:04 -0400 2012: > OK, all 4 Check* functions are now moved back into proc.c, > nothing outside of timeout.c touches anything in it. New patches > are attached. Yeah, I like this one better, thanks. It seems to me that the "check" functions are not really "check" functions anymore, right? I mean, they don't check whether the time has expired. It can be argued that CheckDeadLock is well named, because it does check whether there is a deadlock before doing anything else; but CheckStandbyTimeout is no longer a check -- it just sends a signal. Do we need to rename these functions? Why are you using the deadlock timeout for authentication? Wouldn't it make more sense to have a separate TimeoutName, just to keep things clean? The "NB:" comment here doesn't seem to be useful anymore:

+ /*
+  * Init, Destroy and Check functions for different timeouts.
+  *
+  * NB: all Check* functions are run inside a signal handler, so be very wary
+  * about what is done in them or in called routines.
+  */

In base_timeouts you don't initialize fin_time for any of the members. This is probably unimportant but then why initialize start_time? -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Near-duplicate RI NO ACTION and RESTRICT triggers
Our current interpretation of the difference between foreign keys with ON UPDATE/DELETE NO ACTION and those with ON UPDATE/DELETE RESTRICT is that they mean the same thing but RESTRICT checks are not deferrable. It follows from this that the trigger code ought to be the same for NO ACTION and RESTRICT cases, and so it occurred to me that we could get rid of a few hundred lines in ri_triggers.c if we removed the duplicated code. Comparing the actual code in the different functions, though, there is a difference: the NO ACTION triggers call ri_Check_Pk_Match to see if another PK row has been inserted/modified to provide the same key values that the trigger subject row no longer does. (ri_Check_Pk_Match also makes some checks for NULL-key cases, but these are redundant with tests made in the main trigger functions, so they ought to go away.) The RESTRICT triggers do not do this. It's fairly easy to construct a case where it makes a difference:

    create temp table pp (f1 int primary key);
    create temp table cc (f1 int references pp on update no action);
    insert into pp values(12);
    insert into pp values(11);
    update pp set f1=f1+1;   -- now we have 13 and 12
    insert into cc values(13);  -- ok
    update pp set f1=f1+1;   -- now we have 14 and 13, FK is still OK
    update pp set f1=f1+1;   -- would result in 15 and 14, so FK fails

If you change the foreign key to be ON UPDATE RESTRICT, the second UPDATE fails because the RESTRICT trigger doesn't notice that another PK row has been modified to provide the key required by the FK row. I think that the argument for having the RESTRICT triggers behave like this is that the SQL spec envisions the RESTRICT check occurring immediately when the individual PK row is updated/deleted, and so there would be no opportunity for another PK row to be updated into its place. (Or, in plainer English, RESTRICT should mean "you can't modify this row's keys at all if it has dependents".) 
Because we implement RESTRICT through an AFTER trigger that can't run earlier than end-of-statement, we can't exactly match the spec's semantics, but we can get fairly close so long as you don't think about what would be seen by e.g. user-written triggers executing during the statement. I'm happy with continuing to have this behavioral difference between the two sets of triggers, but wanted to throw it up for discussion: does anyone think it'd be better to apply ri_Check_Pk_Match in the RESTRICT triggers too? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sortsupport for text
Peter Geoghegan wrote: > Kevin Grittner wrote: >> Since we appear to be questioning everything in this area, I'll >> raise something which has been bugging me for a while: in some >> other systems I've used, the "tie-breaker" comparison for >> equivalent values comes after equivalence sorting on *all* sort >> keys, rather than *each* sort key. > > Are you sure that they actually have a tie-breaker, and don't just > make the distinction between equality and equivalence (if only > internally)? I'm pretty sure that when I was using Sybase ASE the order for non-equal values was always predictable, and it behaved in the manner I describe below. I'm less sure about any other product. > I don't see why you'd want a tie-breaker across multiple keys. I > mean, you could, I just don't see any reason to. So that you can have entirely repeatable results across multiple runs?

>> test=# select * from c order by 2;
>>  last_name | first_name
>> -----------+------------
>>  smith     | bob
>>  SMITH     | EDWARD
>>  smith     | peter
>> (3 rows)
>>
>> This seems completely wrong:
>>
>> test=# select * from c order by 1,2;
>>  last_name | first_name
>> -----------+------------
>>  smith     | bob
>>  smith     | peter
>>  SMITH     | EDWARD
>> (3 rows)

> > Agreed. Definitely a POLA violation. > >> I'm sure the latter is harder to do and slower to execute; but >> the former just doesn't seem defensible as correct. > > This same gripe is held by the author of that sorting document I > linked to from the Unicode consortium, with a very similar > example. So it seems like this could be a win from several > perspectives, as it would enable the strxfrm() optimisation. I'm > pretty sure that pg_upgrade wouldn't be very happy about this, so > we'd have to have a legacy compatibility mode. At a minimum, it could require rebuilding indexes with multiple columns where any indexed value before the last index column can be equivalent but not equal. 
-Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi, On Tuesday, June 19, 2012 06:11:20 PM Tom Lane wrote: > Andres Freund writes: > > On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: > >>> ... (If you are thinking > >>> of something sufficiently high-level that merging could possibly work, > >>> then it's not WAL, and we shouldn't be trying to make the WAL > >>> representation cater for it.) > > > > Do you really see this as such a big problem? > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. There are - so far - no specific "logical replication records". Its a relatively minor amount of additional data under wal_level=logical for existing records. HEAP_UPDATE gets the old primary key on updates changing the pkey, and HEAP_DELETE always has the pkey. HEAP_INSERT|UPDATE|DELETE and HEAP2_MULTI_INSERT put their information in another XLogRecData block than the page to handle full page writes. That's it. I can definitely understand hesitation about that, but I simply see no realistic way to solve the issues of existing replication solutions otherwise. Do you have a better idea to solve those than the above? Without significant complications of the backend code and without loads of additional writes going on? I *really* would like to hear them if you do. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. I found, and still do, the idea of having the origin_id in there rather elegant. If people prefer adding the same block to all of the above xlog records: I can live with that and will then do so. It makes some things more complicated, but it's not too bad. 
Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables
Hi Fujita-san, Could you rebase this patch towards the latest tree? I was unable to apply the patch cleanly. I looked over the patch and noticed a few points. At ProcessCopyOptions(), defel->arg can be NIL, can't it? If so, cstate->convert_binary is not a suitable flag to check for a redundant option. It seems to me cstate->convert_selectively is a more suitable flag to check it.

+ else if (strcmp(defel->defname, "convert_binary") == 0)
+ {
+     if (cstate->convert_binary)
+         ereport(ERROR,
+                 (errcode(ERRCODE_SYNTAX_ERROR),
+                  errmsg("conflicting or redundant options")));
+     cstate->convert_selectively = true;
+     if (defel->arg == NULL || IsA(defel->arg, List))
+         cstate->convert_binary = (List *) defel->arg;
+     else
+         ereport(ERROR,
+                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                  errmsg("argument to option \"%s\" must be a list of column names",
+                         defel->defname)));
+ }

At NextCopyFrom(), this routine computes default values if configured. In cases where these values are not referenced, it might be possible to skip the unnecessary calculations. Would it be possible to add logic at BeginCopyFrom() to avoid constructing cstate->defmap for unreferenced columns? Thanks, 2012/5/11 Etsuro Fujita : >> -Original Message- >> From: Robert Haas [mailto:robertmh...@gmail.com] >> Sent: Friday, May 11, 2012 1:36 AM >> To: Etsuro Fujita >> Cc: pgsql-hackers@postgresql.org >> Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file >> foreign tables >> >> On Tue, May 8, 2012 at 7:26 AM, Etsuro Fujita > >> wrote: >> > I would like to propose to improve parsing efficiency of >> > contrib/file_fdw by selective parsing proposed by Alagiannis et >> > al.[1], which means that for a CSV/TEXT file foreign table, file_fdw >> > performs binary conversion only for the columns needed for query >> > processing. Attached is a WIP patch implementing the feature. >> >> Can you add this to the next CommitFest? Looks interesting. > > Done. 
> > Best regards, > Etsuro Fujita > >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL >> Company >> > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- KaiGai Kohei -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sortsupport for text
On 19 June 2012 16:17, Kevin Grittner wrote: > Peter Geoghegan wrote: > >> So, just to give a bit more weight to my argument that we should >> recognise that equivalent strings ought to be treated identically > > Since we appear to be questioning everything in this area, I'll > raise something which has been bugging me for a while: in some other > systems I've used, the "tie-breaker" comparison for equivalent > values comes after equivalence sorting on *all* sort keys, rather > than *each* sort key. Are you sure that they actually have a tie-breaker, and don't just make the distinction between equality and equivalence (if only internally)? I would have checked that myself already, but I don't have access to any other RDBMS that I'd expect to care about these kinds of distinctions. They make sense for ensuring that the text comparator's notion of equality is consistent with text's general notion (if that's bitwise equality, which I suspect it is in these other products too for the same reasons it is for us). I don't see why you'd want a tie-breaker across multiple keys. I mean, you could, I just don't see any reason to.

> test=# select * from c order by 2;
>  last_name | first_name
> -----------+------------
>  smith     | bob
>  SMITH     | EDWARD
>  smith     | peter
> (3 rows)
>
> This seems completely wrong:
>
> test=# select * from c order by 1,2;
>  last_name | first_name
> -----------+------------
>  smith     | bob
>  smith     | peter
>  SMITH     | EDWARD
> (3 rows)

Agreed. Definitely a POLA violation. > I'm sure the latter is harder to do and slower to execute; but the > former just doesn't seem defensible as correct. This same gripe is held by the author of that sorting document I linked to from the Unicode consortium, with a very similar example. So it seems like this could be a win from several perspectives, as it would enable the strxfrm() optimisation. I'm pretty sure that pg_upgrade wouldn't be very happy about this, so we'd have to have a legacy compatibility mode. 
-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql_fdw in contrib
On Tue, Jun 19, 2012 at 10:15 AM, Kohei KaiGai wrote: > Let me push the pgsql_fdw in core from a different perspective. > > Right now, FDW is a feature that will take many enhancements in > the near future, like join-pushdown, writable APIs and so on. > If we did not have an FDW extension in core that communicates > with an actual RDBMS, instead of flat files, it would make our development > efforts more complex. Consider a scenario in which people always have > to submit two patches: one for the core, and the other for > a particular out-of-tree extension. yeah. plus, it's nice to have a very high quality fdw implementation in core for others to crib from. I have no objection -- I just took a convenient opportunity to hijack the thread for some hand-wavy ideas about an extension repository. :-). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes:
> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
>>> ... (If you are thinking of something sufficiently high-level that
>>> merging could possibly work, then it's not WAL, and we shouldn't be
>>> trying to make the WAL representation cater for it.)
>
> Do you really see this as such a big problem?

It looks suspiciously like "I have a hammer, therefore every problem
must be a nail". I don't like the design concept of cramming logical
replication records into WAL in the first place. However, if we're
dead set on doing it that way, let us put information that is only
relevant to logical replication into only the logical replication
records. Saving a couple of bytes in each such record is penny-wise
and pound-foolish, I'm afraid; especially when you're nailing down
hard, unexpandable limits at the very beginning of the development
process in order to save those bytes.

regards, tom lane
Re: [HACKERS] Transactions over pathological TCP connections
On Tue, Jun 19, 2012 at 1:56 AM, Tom Lane wrote:
> The transaction would be committed before a command success report is
> delivered to the client, so I don't think delivered-and-not-marked is
> possible.

...unless you have configured synchronous_commit=off, or fsync=off. Or
unless your disk melts into a heap of slag and you have to restore from
backup. You can protect against that last case using synchronous
replication.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Skip checkpoint on promoting from streaming replication
On Tue, Jun 19, 2012 at 5:30 PM, Kyotaro HORIGUCHI wrote:
> Thank you.
>
>> What happens if the server skips an end-of-recovery checkpoint, is
>> promoted to the master, runs some write transactions, crashes and
>> restarts automatically before it completes a checkpoint? In this
>> case, the server needs to do crash recovery from the last checkpoint
>> record with the old timeline ID to the latest WAL record with the
>> new timeline ID. How does crash recovery do recovery beyond a
>> timeline?
>
> Basically the same as archive recovery, as far as I saw. It is
> already implemented to work in that way.
>
> After this patch is applied, StartupXLOG() gets its
> recoveryTargetTLI from the new field latestTLI in the control file
> instead of from the latest checkpoint. The latest checkpoint record
> still carries its TLI and WAL location as before, but its TLI no
> longer has a significant meaning in the recovery sequence.
>
> Consider the following case,
>
>         | seg 1  | seg 2  |
> TLI 1   |...C....|....P   |
> TLI 2            |.......X|
>
> * C - checkpoint, P - promotion, X - crash just after here
>
> This shows the situation where the latest checkpoint (restartpoint)
> was taken at TLI=1/SEG=1/OFF=4, promotion happened at
> TLI=1/SEG=2/OFF=5, and then the server crashed just after
> TLI=2/SEG=2/OFF=8. Promotion itself inserts no WAL records but
> creates a copy of the current segment for the new TLI. The file for
> TLI=2/SEG=1 should not exist. (Who would create it?)
>
> The control file will look as follows:
>
>   latest checkpoint : TLI=1/SEG=1/OFF=4
>   latest TLI        : 2
>
> So the crash recovery sequence starts from SEG=1/LOC=4. expectedTLIs
> will be (2, 1), so 1 will naturally be selected as the TLI for SEG1
> and 2 for SEG2 in XLogFileReadAnyTLI().
>
> Looking more closely, startup constructs expectedTLIs by reading the
> timeline history file corresponding to the recoveryTargetTLI. Then
> it runs the recovery sequence from the redo point of the latest
> checkpoint, using the WALs with the largest TLI - which is
> distinguished by the file name, not the header - within expectedTLIs
> in XLogPageRead(). The only difference from archive recovery is that
> XLogFileReadAnyTLI() reads only the WAL files already sitting in the
> pg_xlog directory, and does not reach into the archive. The pages
> with the new TLI will naturally be picked up in this sequence, as
> mentioned above, and recovery will then stop at the last readable
> record.
>
> The latestTLI field in the control file is updated just after the
> TLI is incremented and the new WAL files with the new TLI are
> created. So the crash recovery sequence won't stop before reaching
> the WAL with the new TLI designated in the control file.

Is it guaranteed that all the files (e.g., the latest timeline history
file) required for such crash recovery exist in pg_xlog? If yes, your
approach might work well.

Regards,

--
Fujii Masao
Re: [HACKERS] WAL format changes
On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas wrote:
> Well, that was easier than I thought. Attached is a patch to make
> XLogRecPtr a uint64, on top of my other WAL format patches. I think we
> should go ahead with this.

+1.

> The LSNs on pages are still stored in the old format, to avoid
> changing the on-disk format and breaking pg_upgrade. The XLogRecPtrs
> stored in the control file and WAL are changed, however, so an initdb
> (or at least pg_resetxlog) is required.

Seems fine.

> Should we keep the old representation in the replication protocol
> messages? That would make it simpler to write a client that works with
> different server versions (like pg_receivexlog). Or, while we're at
> it, perhaps we should mandate network byte order for all the integer
> and XLogRecPtr fields in the replication protocol. That would make it
> easier to write a client that works across different architectures, in
> >= 9.3. The contents of the WAL would of course be
> architecture-dependent, but it would be nice if pg_receivexlog and
> similar tools could nevertheless be architecture-independent.

I share Andres' question about how we're doing this already. I think if
we're going to break this, I'd rather do it in 9.3 than 5 years from
now. At this point it's just a minor annoyance, but it'll probably get
worse as people write more tools that understand WAL.

> I kept the %X/%X representation in error messages etc. I'm quite used
> to that output, so I'm reluctant to change it, although it's a bit
> silly now that it represents just one 64-bit value. Using
> UINT64_FORMAT would also make the messages harder to translate.

I could go either way on this one, but I have no problem with the way
you did it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pg default's verbosity?
On Tue, Jun 19, 2012 at 2:15 AM, Tom Lane wrote:
> Robert Haas writes:
>> There might be something to the idea of demoting a few of the things
>> we've traditionally had as NOTICEs, though. IME, the following two
>> messages account for a huge percentage of the chatter:
>>
>> NOTICE: CREATE TABLE will create implicit sequence "foo_a_seq" for
>> serial column "foo.a"
>> NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
>> "foo_pkey" for table "foo"
>
> Personally, I'd have no problem with flat-out dropping (not demoting)
> both of those two specific messages. I seem to recall that Bruce has
> lobbied for them heavily in the past, though.

My vote would be to make 'em DEBUG1, and similarly with the UNIQUE
message that Fabien mentions downthread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] WAL format changes
On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:
> On 18.06.2012 21:08, Heikki Linnakangas wrote:
>> On 18.06.2012 21:00, Robert Haas wrote:
>>> On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund wrote:
>>>>> 1. Use a 64-bit segment number, instead of the log/seg
>>>>> combination. And don't waste the last segment on each logical
>>>>> 4 GB log file. The concept of a "logical log file" is now
>>>>> completely gone. XLogRecPtr is unchanged, but it should now be
>>>>> understood as a plain 64-bit value, just split into two 32-bit
>>>>> integers for historical reasons. On disk, this means that there
>>>>> will be log files ending in FF; those were skipped before.
>>>>
>>>> What's the reason for keeping that awkward split now? There aren't
>>>> that many users of xlogid/xrecoff, and many of those would be
>>>> better served by using helper macros.
>>>
>>> I wondered that, too. There may be a good reason for keeping it
>>> split up that way, but we at least oughta think about it a bit.
>>
>> The page header contains an XLogRecPtr (LSN), so if we change it
>> we'll have to deal with pg_upgrade. I guess we could still keep
>> XLogRecPtr around as the on-disk representation, and convert between
>> the 64-bit integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can
>> try that out - many xlog calculations would admittedly be simpler if
>> it was a uint64.
>
> Well, that was easier than I thought. Attached is a patch to make
> XLogRecPtr a uint64, on top of my other WAL format patches. I think
> we should go ahead with this.

Cool. Do you plan to merge XLogSegNo with XLogRecPtr in that case? I am
not sure that having two representations which just have a constant
factor in between really makes sense.

> The LSNs on pages are still stored in the old format, to avoid
> changing the on-disk format and breaking pg_upgrade. The XLogRecPtrs
> stored in the control file and WAL are changed, however, so an initdb
> (or at least pg_resetxlog) is required.

Sounds sensible.

> Should we keep the old representation in the replication protocol
> messages? That would make it simpler to write a client that works
> with different server versions (like pg_receivexlog). Or, while we're
> at it, perhaps we should mandate network byte order for all the
> integer and XLogRecPtr fields in the replication protocol.

The replication protocol uses pq_sendint for integers, which should
take care of converting to big-endian already. I don't think anything
but the WAL itself is otherwise transported in binary form, so I don't
think there is any such architecture dependency in the protocol
currently.

I don't really see a point in keeping around a backward-compatible
representation just for the sake of running such tools on multiple
versions. I might not be pragmatic enough, but: why would you want to
do that *at the moment*? Many of the other tools are already version
specific, so... When the protocol starts to be used by more tools,
maybe, but imo we're not there yet. And even then it's not hard to
convert to the old representation.

> I kept the %X/%X representation in error messages etc. I'm quite used
> to that output, so I'm reluctant to change it, although it's a bit
> silly now that it represents just one 64-bit value. Using
> UINT64_FORMAT would also make the messages harder to translate.

No opinion on that. It's easier for me to see whether two values are
exactly the same or very similar with the 64-bit representation, but
it's harder to gauge bigger differences. So...

Greetings,

Andres
--
Andres Freund
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases
On Mon, Jun 18, 2012 at 3:30 PM, Alvaro Herrera wrote:
> Excerpts from Alex Hunsaker's message of Fri Feb 10 16:53:05 -0300 2012:
>
>> Seems like we missed the fact that we still did SvUTF8_on() in
>> sv2cstr and SvPVUTF8() when turning a perl string into a cstring.
>
> Hmm, this patch belongs in the back branches too, right? Not just the
> current development tree?

It seems like a bug fix to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] libpq compression
On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout wrote:
> On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
>> The fly in the ointment with any of these ideas is that the
>> "configure list" is not a list of exact cipher names, as per Magnus'
>> comment that the current default includes tests like "!aNULL". I am
>> not sure that we know how to evaluate such conditions if we are
>> applying an after-the-fact check on the selected cipher. Does
>> OpenSSL expose any API for evaluating whether a selected cipher
>> meets such a test?
>
> I'm not sure whether there's an API for it, but you can certainly
> check manually with "openssl ciphers -v", for example:
>
> $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
> NULL-SHA      SSLv3 Kx=RSA Au=RSA Enc=None Mac=SHA1
> NULL-MD5      SSLv3 Kx=RSA Au=RSA Enc=None Mac=MD5
>
> ...etc...
>
> So unless openssl includes the code twice, there must be a way to
> extract the list from the library.

There doubtless is, but I'd be willing to wager that you won't be able
to figure out the exact method without reading the source code for
'openssl ciphers' to see how it was done there, and most likely you'll
find that at least one of the functions they use has no man page.
Documentation isn't their strong point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] initdb and fsync
On Mon, Jun 18, 2012 at 09:34:30PM +0300, Peter Eisentraut wrote:
> On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
>> - defaulting to initdb -N in the regression suite is not a good idea
>>   imo, because that way the buildfarm won't catch problems in that
>>   area...
>
> The regression test suite also starts postgres with fsync off. This
> is good, and I don't want to have more overhead in the tests. It's
> not like we already have awesome test coverage for OS interaction.

What sorts of coverage would we want?

Cheers,
David.
--
David Fetter http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] sortsupport for text
Peter Geoghegan wrote:

> So, just to give a bit more weight to my argument that we should
> recognise that equivalent strings ought to be treated identically

Since we appear to be questioning everything in this area, I'll raise
something which has been bugging me for a while: in some other systems
I've used, the "tie-breaker" comparison for equivalent values comes
after equivalence sorting on *all* sort keys, rather than *each* sort
key. For example, this much makes sense with lc_collate =
'en_US.UTF-8':

test=# create table c (last_name text not null, first_name text);
CREATE TABLE
test=# insert into c values ('smith', 'bob'), ('smith', 'peter'),
('SMITH', 'EDWARD');
INSERT 0 3
test=# select * from c order by 2;
 last_name | first_name
-----------+------------
 smith     | bob
 SMITH     | EDWARD
 smith     | peter
(3 rows)

This seems completely wrong:

test=# select * from c order by 1,2;
 last_name | first_name
-----------+------------
 smith     | bob
 smith     | peter
 SMITH     | EDWARD
(3 rows)

I have seen other databases which get it in the order I would expect
-- where the C compare only matters within groups of equivalent rows.
It seems that PostgreSQL effectively orders by:

  last_name using collation 'en_US.UTF-8'
  last_name using collation 'C'
  first_name using collation 'en_US.UTF-8'
  first_name using collation 'C'

while some other products order by:

  last_name using collation 'en_US.UTF-8'
  first_name using collation 'en_US.UTF-8'
  last_name using collation 'C'
  first_name using collation 'C'

I'm sure the latter is harder to do and slower to execute; but the
former just doesn't seem defensible as correct.

-Kevin
Re: [HACKERS] pgsql_fdw in contrib
2012/6/19 Merlin Moncure :
> On Mon, Jun 18, 2012 at 11:01 PM, Robert Haas wrote:
>> On Mon, Jun 18, 2012 at 12:24 PM, Merlin Moncure wrote:
>>> I can't help but wonder (having been down the
>>> contrib/core/extension road myself) if it isn't better to improve
>>> the facilities to register and search for qualified extensions
>>> (like Perl CPAN) so that people looking for code to improve their
>>> backends can find it. That way, you're free to release/do
>>> xyz/abandon your project as you see fit without having to go
>>> through -hackers. This should also remove a lot of the stigma with
>>> not being in core, since if stock postgres installations can access
>>> the necessary modules via CREATE EXTENSION, I think it will make it
>>> easier for projects like this to get used, with the additional
>>> benefit of decentralizing project management.
>>
>> Well, if you're the type that likes to install everything via RPMs
>> (and I am) then you wouldn't want this behavior, especially not
>> automagically. It seems to open up a host of security risks, too: I
>> believe Tom has previously stated that Red Hat (and other distro)
>> packaging guidelines forbid packages from installing software in
>> places where they can then turn around and run it. I suppose CPAN
>> must have some sort of exception to this policy, though, so maybe
>> it's not ironclad.
>
> Huh? CPAN is just one example -- PyPM for python, npm for node, etc.
> There is some distinction to that rule that is being missed so that
> it doesn't apply to cases like this -- probably that it is transacted
> by the user and/or requires su.
>
> I think your points are supporting mine: the vast majority of
> postgres administrators deal only with o/s packaged rpms and are
> therefore, with the status quo, unable to utilize any extension that
> is not packaged with contrib. This means that there are two possible
> outcomes if you want to write an extension:
>
> 1) get accepted into core
> 2) never get used
>
> Given that running the gauntlet to #1 is not a particularly
> attractive option (or even possible) in many cases for various
> reasons, it tends to discourage module development by third parties.
> There are several very high quality extensions that could get a lot
> more exposure... for example, pl/sh.
>
> But anyway, if you're happy with pgsql_fdw (aside: I looked at it,
> and it's great!) in core, then by all means...

Let me push the pgsql_fdw in core from a different perspective.

Right now, FDW is a feature that will take many enhancements in the
near future, like join-pushdown, writable APIs and so on. If we did
not have an FDW extension in core that communicates with an actual
RDBMS, instead of flat files, it would make our development efforts
more complex. Imagine a scenario where people always have to submit
two patches: one for the core, and the other for a particular
out-of-tree extension.

I believe it is a reasonable choice to have a FDW for PostgreSQL in
core, rather than for MySQL or Oracle. We can use this extension to
show how a new feature works whenever we try to extend the FDW APIs.

BTW, Hanada-san, it seems to me the expected results of the regression
tests were not included in this patch. Please submit it again.

Thanks,
--
KaiGai Kohei
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
Excerpts from Boszormenyi Zoltan's message of Tue Jun 19 04:44:35 -0400 2012:
> On 2012-06-18 19:46, Alvaro Herrera wrote:
>> Excerpts from Boszormenyi Zoltan's message of Fri May 11 03:54:13 -0400 2012:
>>> Hi,
>>>
>>> another rebasing, and applied the GIT changes in
>>> ada8fa08fc6cf5f199b6df935b4d0a730aaa4fec to the Windows
>>> implementation of PGSemaphoreTimedLock.
>>
>> Hi,
>>
>> I gave the framework patch a look. One thing I'm not sure about is
>> the way you've defined the API. It seems a bit strange to have a
>> nice and clean, generic interface in timeout.h, and then have the
>> internal implementation of the API cluttered with details such as
>> what to do when the deadlock timeout expires. Wouldn't it be much
>> better if we could keep the details of CheckDeadLock itself within
>> proc.c, for example?
>
> Do you mean adding a callback function argument to enable_timeout()
> would be better?

Well, that's not exactly what I was thinking, but maybe your idea is
better than what I had in mind.

>> Same for the other specific Check*Timeout functions. It seems to me
>> that the only timeout.c-specific requirement is to be able to poke
>> base_timeouts[].indicator and fin_time. Maybe that can get done in
>> timeout.c, and then have the generic checker call the
>> module-specific checker.
>
> Or, instead of static functions, the Check* functions can be external
> to timeout.c. It seemed to be a good idea to move all the
> timeout-related functions to timeout.c.

Yeah, except that they play with members of the timeout_params struct,
which hopefully does not need to be exported. So if you can just move
(that is, put back in their original places) the portions that do not
touch that struct, it'd be great.

>> Also, I see you're calling GetCurrentTimestamp() in the checkers;
>> maybe more than once per signal if you have multiple timeouts
>> enabled.
>
> Actually, GetCurrentTimestamp() is not called multiple times, because
> only the first timeout source in the queue can get triggered.

I see.

>> BTW it doesn't seem that datatype/timestamp.h is really necessary in
>> timeout.h.
>
> You are wrong. get_timeout_start() returns TimestampTz and it's
> defined in datatype/timestamp.h.

Oh, right.

--
Álvaro Herrera
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support