Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On 5/3/12 2:54 PM, Josh Berkus wrote:
>> (2) If logical transactions had been implemented as additions to
>> the WAL stream, and Slony was using that, do you think they would
>> still have been usable for this recovery?
>
> Quite possibly not.

The key advantage that I see in londiste/Slony replication is that your data stream has absolutely nothing to do with anything binary or internal to Postgres. That means that the only way corruption will travel from a master to a slave is if the corruption is in the actual fields being updated, and even that's not a given (i.e., updating a field to a completely new value would not propagate corruption even if the old value of the field was corrupted).

So, embedding a logical stream into WAL is not inherently bad... what would be bad is if that logical stream were susceptible to corruption due to something like full page writes. Simply embedding the exact same info Slony or londiste captures into the WAL should be fine (though it likely defeats the purpose). Translating binary WAL data into DML statements would very likely allow corruption to travel from master to slave.

--
Jim C. Nasby, Database Architect
j...@nasby.net
512.569.9461 (cell)   http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
> One thing I wanted to mention is that non-binary replication has an
> added advantage over binary from a DR standpoint: with binary
> replication, corruption on the master is more likely to make it into
> your replicas thanks to full page writes. You might want to consider
> that depending on how sensitive your data is.

Yeah, we've seen this a few times. We just recently had to rescue a client from HS-wide corruption using Slony.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Josh Berkus <j...@agliodbs.com> wrote:
>> One thing I wanted to mention is that non-binary replication has
>> an added advantage over binary from a DR standpoint: with binary
>> replication, corruption on the master is more likely to make it
>> into your replicas thanks to full page writes. You might want to
>> consider that depending on how sensitive your data is.
>
> Yeah, we've seen this a few times. We just recently had to rescue
> a client from HS-wide corruption using Slony.

That's an interesting point. Out of curiosity, how did the corruption originate?

It suggests a couple questions:

(1) Was Slony running before the corruption occurred? If not, how was Slony helpful? I know that in our environment, where we have both going through separate streams, with a repository of the logical transactions, we would use PITR recovery to get to the latest known good state, which we could easily identify, and then replay the logical transactions to top it off and get current. If necessary we could skip logical transactions which were problematic results of the corruption.

(2) If logical transactions had been implemented as additions to the WAL stream, and Slony was using that, do you think they would still have been usable for this recovery?

Perhaps sending both physical and logical transaction streams over the WAN isn't such a bad thing, if it gives us more independent recovery mechanisms. That's fewer copies than we're sending with current trigger-based techniques. It would be particularly attractive if we could omit (filter out) certain tables before going across the WAN. I would be willing to risk sending the big raster-scanned documents through just the physical channel so long as I had a nightly compare of md5sum values on both sides so we can resend any corrupted data (or tell people to rescan).

-Kevin
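Kevin's nightly md5sum compare can be sketched as a small script: each side digests its own rows, only the (id, digest) pairs cross the WAN, and the diff tells you which documents to resend. A minimal illustration in Python (the table layout and the `scanned_documents` name are hypothetical, not from the thread):

```python
import hashlib

def digest_rows(rows):
    """Map each document id to the md5 of its content bytes."""
    return {doc_id: hashlib.md5(content).hexdigest()
            for doc_id, content in rows}

def find_mismatches(master_rows, replica_rows):
    """Return ids whose content differs from, or is missing on, the replica."""
    master = digest_rows(master_rows)
    replica = digest_rows(replica_rows)
    return sorted(doc_id for doc_id in master
                  if replica.get(doc_id) != master[doc_id])

# In production each side would run something like
#   SELECT doc_id, md5(content) FROM scanned_documents
# and only the small (id, digest) pairs would cross the WAN.
master = [(1, b"page one"), (2, b"page two"), (3, b"page three")]
replica = [(1, b"page one"), (2, b"page twX"), (3, b"page three")]
print(find_mismatches(master, replica))  # -> [2]
```

The point of hashing on each side is that the big document payloads never traverse the expensive WAN link during the nightly check.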
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
> That's an interesting point. Out of curiosity, how did the
> corruption originate?

We're still not sure. It appears to be in the system catalogs, though. Note that the original master developed memory issues.

> It suggests a couple questions:
>
> (1) Was Slony running before the corruption occurred?

No.

> If not, how was Slony helpful?

Install, replicate DB logically, new DB works fine.

> (2) If logical transactions had been implemented as additions to
> the WAL stream, and Slony was using that, do you think they would
> still have been usable for this recovery?

Quite possibly not.

> Perhaps sending both physical and logical transaction streams over
> the WAN isn't such a bad thing, if it gives us more independent
> recovery mechanisms. That's fewer copies than we're sending with
> current trigger-based techniques.

Frankly, there's nothing wrong with the Slony model for replication except for the overhead of:

1. triggers
2. queues
3. running DDL

However, the three above are really big issues.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On 4/29/12 9:27 AM, Kevin Grittner wrote:
> Maybe I can help with that by describing what the Wisconsin court
> system does for circuit court data.

Thanks for the write-up, it was insightful.

One thing I wanted to mention is that non-binary replication has an added advantage over binary from a DR standpoint: with binary replication, corruption on the master is more likely to make it into your replicas thanks to full page writes. You might want to consider that depending on how sensitive your data is.

--
Jim C. Nasby, Database Architect
j...@nasby.net
512.569.9461 (cell)   http://jim.nasby.net
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Tom Lane wrote:
> Kevin Grittner writes:
>> Well, the xReader approach (Aakash's GSoC project) is to serve as
>> a proxy for a WAL stream going to a hot standby, to interpret each
>> incoming WAL record by cheating and querying the HS before passing
>> the WAL along, and then using sync rep to the HS regardless of
>> whether the incoming stream is sync or async. What reliability
>> problem do you see with this approach?
>
> Well, first off, it requires a replica that *is* a physical replica
> (else it won't have the same table OIDs, for instance). I'm not
> sure I get the point of logical replication that requires a
> physical replica as a prerequisite.

Maybe I can help with that by describing what the Wisconsin court system does for circuit court data.

Due to state law, state supreme court rules, and the general suckiness of the WAN available to us, each of Wisconsin's 72 counties has a database server in the county seat which is the official data of record. We have two parallel forms of backup -- WAL-based and logical.

For WAL-based replication we maintain a copy of the last two weekly base backups and all WAL files needed to recover from either of them to any point in time. One copy of this is required to be in the county, on a separate system with separate drives. The other copy is maintained centrally. One rather large server has a hot standby of all county databases, currently fed through copying WAL files (although we plan to move to streaming this year). These physical backups are the primary source for recovering from catastrophic failure of a county server. We have software to top off the recovered database with transactions from the logical transaction stream as needed, since that is more up-to-date.

We have also found this combination of physical and logical transactions useful when someone trying to do a data fix with direct SQL statements against a production database mangled things badly.
We use PITR recovery up to just before the damage, then apply logical transactions from that point forward, skipping the problem transactions.

Until recently our logical transaction stream was generated from our Java middle tier running on the database servers. This has recently been converted to use a generalized PostgreSQL trigger, written in C, which is attached to all tables for which we want logical replication (about 90% of them). The trigger captures data into transaction tables in the database as part of the same transaction they are representing, so that the generation of the logical transaction data is atomic with the represented data. We didn't switch to one of the existing solutions because of all the code downstream which uses the logical transaction streams -- we didn't want to change the format or delivery method of the logical stream because it's simply not feasible to change everything at once.

The way the logical stream is currently delivered is still through the aforementioned Java middle tier. It reads the logical transaction log tables and sends the same data in non-persistent JMS messages through queues to the TRAN client at the central location, just as we did for years before we ever heard of PostgreSQL. Normally, the TRAN client reads each logical transaction once, and feeds many replication targets at the central site. These include:

- four Central Information Repository (CIR) databases, each of which contains data from all 72 counties (two machines each at two different sites)

- four Database Transaction Repository (DTR) databases on the same servers as the CIR databases (but separate RAIDs); this is used as a supplementary source for recovery (as mentioned above), as well as for auditing what happened on a case when there is any question, and certain activity reports
- a filter to identify transactions of possible interest to the publishers we use for interfaces to other agencies (district attorneys, public defenders, department of corrections, police agencies, social service agencies, etc.); transactions of interest are queued for closer review by the publisher software, which actually posts messages if the transaction is found to actually be of interest

- other databases for the state court's central services, for such things as court interpreters used for non-English speakers in the various counties

The number and varied nature of the targets is significant, because the fan-out and timing differences need to be considered. The TRAN client normally receives one copy of the logical data for each transaction, and about 99% of the time the transaction's data is replicated to all databases (and queued for review by publishers) within a second of someone pressing Enter 300 miles away. It's that other 1% that gets really interesting. To deal with the possibility that one or more central targets were down or running very slowly, the TRAN client will give up on keeping them at the
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Sun, Apr 29, 2012 at 3:27 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
>> It would be enormously more performant for the master to be
>> emitting logical replication records to start with, since it
>> already has all the right names etc at hand at basically no cost.
>
> Not when the consumers are across a WAN, and that WAN is the
> biggest performance bottleneck and the most expensive resource
> involved.

I agree that the WAN is important, for both bandwidth and response time. Though it isn't a given that logical change records (LCRs) will require more bandwidth than physical WAL. WAL contains full page images, index changes and other information that would be absent from the LCR stream.

It also depends upon the specification of the LCRs -- what metadata is included and whether the LCRs use text or binary. Those choices have other impacts as well, so measurements and detailed analysis are required to justify how to proceed. Which is what is in progress now.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Simon Riggs wrote:
> Though it isn't a given that logical change records (LCRs) will
> require more bandwidth than physical WAL.

It would help if you could share a little more of what you have in mind. It's not yet clear to me whether you're talking about adding logical records to the WAL stream, replacing page-oriented records in the WAL stream with logical records, or leaving the WAL stream alone and generating a second, new stream of logical data.

> WAL contains full page images, index changes and other information
> that would be absent from the LCR stream.

Other than full page images, which could be eliminated in other ways before streaming the WAL, what is in the WAL file that would not be needed to maintain an exact replica of the cluster, suitable for quick deployment in place of a lost cluster for disaster recovery? Or do you foresee eliminating some of the current page-image oriented WAL records in favor of logical WAL records, with an expectation that the logical records will be smaller?

> measurements and detailed analysis is required to justify how to
> proceed. Which is what is in progress now.

Could you clarify? Are you referring to this discussion or some effort at 2Q that hasn't yet been shared with the community?

-Kevin
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Sun, Apr 29, 2012 at 7:36 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> Simon Riggs wrote:
>> Though it isn't a given that logical change records (LCRs) will
>> require more bandwidth than physical WAL.
>
> It would help if you could share a little more of what you have in
> mind. It's not yet clear to me whether you're talking about adding
> logical records to the WAL stream, replacing page-oriented records
> in the WAL stream with logical records, or leaving the WAL stream
> alone and generating a second, new stream of logical data.
>
>> WAL contains full page images, index changes and other information
>> that would be absent from the LCR stream.
>
> Other than full page images, which could be eliminated in other
> ways before streaming the WAL, what is in the WAL file that would
> not be needed to maintain an exact replica of the cluster, suitable
> for quick deployment in place of a lost cluster for disaster
> recovery? Or do you foresee eliminating some of the current
> page-image oriented WAL records in favor of logical WAL records,
> with an expectation that the logical records will be smaller?
>
>> measurements and detailed analysis is required to justify how to
>> proceed. Which is what is in progress now.
>
> Could you clarify? Are you referring to this discussion or some
> effort at 2Q that hasn't yet been shared with the community?

These are early days yet and I've been deliberately vague on the format and content of the LCRs. That is to allow the idea that we have choices to bed in a little, so we can discuss them. The one thing that does stand clear in my mind at this stage is that the physical transport of LCRs should be the current streaming infrastructure. So the different aspects of the design are at differing stages of certainty (in my mind).

Much of today has been spent on a presentation explaining the thought processes and options available to us.
Deciding between those options needs to be fact based rather than just a matter of opinion, and I want to ensure we make the right choices, whatever they are. I've mentioned a few times already that we're working on prototypes, all of which will be shared with the community. R&D, in that order.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Sat, Apr 28, 2012 at 11:06 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Simon Riggs <si...@2ndquadrant.com> writes:
>> Translating WAL is a very hard task.
>
> No kidding. I would think it's impossible on its face. Just for
> starters, where will you get table and column names from? (Looking
> at the system catalogs is cheating, and will not work reliably
> anyway.) IMO, if we want non-physical replication, we're going to
> need to build it in at a higher level than after-the-fact
> processing of WAL. I foresee wasting quite a lot of effort on the
> currently proposed approaches before we admit that they're
> unworkable.

I think the question we should be asking ourselves is not whether WAL as it currently exists is adequate for logical replication, but rather or not it could be made adequate. For example, suppose that we were to arrange things so that the first insert, update, or delete record for a given relfilenode after each checkpoint emits a special WAL record that contains the relation name, schema OID, attribute names, and attribute type OIDs. Well, now we are much closer to being able to do some meaningful decoding of the tuple data, and it really doesn't cost us that much.

Handling DDL (and manual system catalog modifications) seems pretty tricky, but I'd be very reluctant to give up on it without banging my head against the wall pretty hard. The trouble with giving up on WAL completely and moving to a separate replication log is that it means a whole lot of additional I/O, which is bound to have a negative effect on performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
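Robert's proposal amounts to letting a decoder rebuild a mini-catalog from the stream itself, without ever consulting the server's system catalogs. A toy Python model of that idea (the record shapes are invented purely for illustration and bear no relation to the actual WAL format):

```python
# Toy model of decoding a WAL-like stream in which, after each
# checkpoint, a metadata record for a relfilenode precedes any DML
# record for that relfilenode. Record shapes are invented.

def decode_stream(records):
    catalog = {}   # relfilenode -> (relation name, attribute names)
    changes = []
    for rec in records:
        kind = rec["type"]
        if kind == "checkpoint":
            catalog.clear()            # metadata must be re-emitted
        elif kind == "metadata":
            catalog[rec["relfilenode"]] = (rec["relname"], rec["attnames"])
        else:                          # insert / update / delete
            relname, attnames = catalog[rec["relfilenode"]]
            changes.append((kind, relname,
                            dict(zip(attnames, rec["values"]))))
    return changes

stream = [
    {"type": "checkpoint"},
    {"type": "metadata", "relfilenode": 16384,
     "relname": "accounts", "attnames": ["id", "balance"]},
    {"type": "insert", "relfilenode": 16384, "values": [1, 100]},
    {"type": "update", "relfilenode": 16384, "values": [1, 250]},
]
print(decode_stream(stream))
# -> [('insert', 'accounts', {'id': 1, 'balance': 100}),
#     ('update', 'accounts', {'id': 1, 'balance': 250})]
```

Clearing the cache at each checkpoint is what makes the scheme restartable: a reader can attach at any checkpoint boundary and still see every metadata record it needs.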
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Sun, 2012-04-29 at 16:33 -0400, Robert Haas wrote:
> On Sat, Apr 28, 2012 at 11:06 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Simon Riggs <si...@2ndquadrant.com> writes:
>>> Translating WAL is a very hard task.
>>
>> No kidding. I would think it's impossible on its face. Just for
>> starters, where will you get table and column names from? (Looking
>> at the system catalogs is cheating, and will not work reliably
>> anyway.) IMO, if we want non-physical replication, we're going to
>> need to build it in at a higher level than after-the-fact
>> processing of WAL. I foresee wasting quite a lot of effort on the
>> currently proposed approaches before we admit that they're
>> unworkable.
>
> I think the question we should be asking ourselves is not whether
> WAL as it currently exists is adequate for logical replication, but
> rather or not it could be made adequate.

Agreed.

> For example, suppose that we were to arrange things so that the
> first insert, update, or delete record for a given relfilenode
> after each checkpoint emits a special WAL record that contains the
> relation name, schema OID, attribute names, and attribute type
> OIDs.

Not just the first after a checkpoint, but also the first after a schema change. Even though this will duplicate the WAL entries for changes to the system catalog, it is likely much cheaper overall to always have a fresh structure in the WAL stream.

And if we really want to do the WAL -> logical -> SQL-text conversion on a host separate from the master, we also need to insert there the type definitions of user-defined types, together with at least the types' output functions in some form. So you basically need a large part of postgres for reliably making sense of WAL.

> Well, now we are much closer to being able to do some meaningful
> decoding of the tuple data, and it really doesn't cost us that
> much. Handling DDL (and manual system catalog modifications) seems
> pretty tricky, but I'd be very reluctant to give up on it without
> banging my head against the wall pretty hard.
Most straightforward way is to have a more or less full copy of pg_catalog also on the WAL-filtering / WAL-conversion node, and to use it in 1:1 replicas of the transactions recreated from the WAL. This way we can avoid recreating any alternate views of the master's schema.

Then again, we could do it all on the master, inside the WAL-writing transaction, and thus avoid a large chunk of the problems.

If the receiving side is also PostgreSQL with the same catalog structure (i.e. the same major version) then we don't actually need to handle DDL in any complicated way; it would be enough to just carry over the changes to the system tables. The main reason we don't do it currently for trigger-based logical replication is the restriction of not being able to have triggers on system tables. I hope it is much easier to have the triggerless record generation also work on system tables.

> The trouble with giving up on WAL completely and moving to a
> separate replication log is that it means a whole lot of additional
> I/O, which is bound to have a negative effect on performance.

Why would you give up WAL? Or do you mean that the new logical-wal needs to have the same commit-time behaviour as WAL to be reliable? I'd envision a scenario where the logi-wal is sent to the slave or distribution hub directly and not written at the local host at all. An optional sync mode similar to current sync WAL replication could be configured. I hope this would run mostly in parallel with local WAL generation, so not much extra wall-clock time would be wasted.

---
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing <ha...@2ndquadrant.com> wrote:
>> I think the question we should be asking ourselves is not whether
>> WAL as it currently exists is adequate for logical replication,
>> but rather or not it could be made adequate.
>
> Agreed.

And of course I meant "but rather whether or not it could be made adequate", but I dropped a word.

>> For example, suppose that we were to arrange things so that the
>> first insert, update, or delete record for a given relfilenode
>> after each checkpoint emits a special WAL record that contains the
>> relation name, schema OID, attribute names, and attribute type
>> OIDs.
>
> Not just the first after a checkpoint, but also the first after a
> schema change. Even though this will duplicate the WAL entries for
> changes to the system catalog, it is likely much cheaper overall to
> always have a fresh structure in the WAL stream.

Yes.

> And if we really want to do the WAL -> logical -> SQL-text
> conversion on a host separate from the master, we also need to
> insert there the type definitions of user-defined types, together
> with at least the types' output functions in some form.

Yes.

> So you basically need a large part of postgres for reliably making
> sense of WAL.

Agreed, but I think that's a problem we need to fix and not a tolerable situation at all. If a user can create a type-output function that goes and looks at the state of the database to determine what to output, then we are completely screwed, because that basically means you would need to have a whole Hot Standby instance up and running just to make it possible to run type output functions. Now you might be able to build a mechanism around that that is useful to some people in some situations, but wow does that sound painful. What I want is for the master to be able to cheaply rattle off the tuples that got inserted, updated, or deleted as those things happen; needing a whole second copy of the database just to do that does not meet my definition of cheap.
Furthermore, it's not really clear that it's sufficient anyway, since there are problems with what happens before the HS instance reaches consistency, what happens when it crashes and restarts, and how we handle the case when the system catalog we need to examine to generate the logical replication records is access-exclusive-locked. Seems like a house of cards.

Some of this might be possible to mitigate contractually, by putting limits on what type input/output functions are allowed to do. Or we could invent a new analog of type input/output functions that is explicitly limited in this way, and support only types that provide it. But I think the real key is that we can't rely on catalog access: the WAL stream has to have enough information to allow the reader to construct some set of in-memory hash tables with sufficient detail to reliably decode WAL. Or at least that's what I'm thinking.

> Most straightforward way is to have a more or less full copy of
> pg_catalog also on the WAL-filtering / WAL-conversion node, and to
> use it in 1:1 replicas of the transactions recreated from the WAL.
> This way we can avoid recreating any alternate views of the
> master's schema.

See above; I have serious doubts that this can ever be made to work robustly.

> Then again, we could do it all on the master, inside the
> WAL-writing transaction, and thus avoid a large chunk of the
> problems. If the receiving side is also PostgreSQL with the same
> catalog structure (i.e. the same major version) then we don't
> actually need to handle DDL in any complicated way; it would be
> enough to just carry over the changes to the system tables.

I agree it'd be preferable to handle DDL in terms of system catalog updates, rather than saying, well, this is an ALTER TABLE .. RENAME. But you need to be able to decode tuples using the right tuple descriptor, even while that's changing under you.

> Why would you give up WAL?

For lack of ability to make it work. Don't underestimate how hard it's going to be to nail this down.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
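Robert's requirement -- that the stream itself carry enough metadata to decode tuples even while the tuple descriptor is changing underneath the reader -- can be modeled the same way: log a fresh descriptor before any tuple written with it, and decode each DML record against the newest descriptor seen so far. A toy sketch (record shapes invented for illustration, not the actual WAL format):

```python
# Toy model of decoding tuples across a schema change: each metadata
# record replaces the cached attribute list for its relfilenode, so a
# tuple is always read with the descriptor in effect when it was logged.

def decode(records):
    descriptors = {}   # relfilenode -> current list of attribute names
    out = []
    for rec in records:
        if rec["type"] == "metadata":
            descriptors[rec["relfilenode"]] = rec["attnames"]
        else:  # insert / update / delete payload
            attnames = descriptors[rec["relfilenode"]]
            out.append(dict(zip(attnames, rec["values"])))
    return out

stream = [
    {"type": "metadata", "relfilenode": 16384, "attnames": ["id", "name"]},
    {"type": "insert", "relfilenode": 16384, "values": [1, "alice"]},
    # ALTER TABLE adds a column; the new descriptor is logged before
    # any tuple written in the new format
    {"type": "metadata", "relfilenode": 16384,
     "attnames": ["id", "name", "email"]},
    {"type": "insert", "relfilenode": 16384, "values": [2, "bob", "b@x"]},
]
print(decode(stream))
# -> [{'id': 1, 'name': 'alice'},
#     {'id': 2, 'name': 'bob', 'email': 'b@x'}]
```

The ordering invariant (descriptor record strictly before the first tuple in the new format) is what removes the need for any catalog lookup at decode time.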
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Robert Haas <robertmh...@gmail.com> writes:
> On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing <ha...@2ndquadrant.com> wrote:
>> So you basically need a large part of postgres for reliably making
>> sense of WAL.
>
> Agreed, but I think that's a problem we need to fix and not a
> tolerable situation at all. If a user can create a type-output
> function that goes and looks at the state of the database to
> determine what to output, then we are completely screwed, because
> that basically means you would need to have a whole Hot Standby
> instance up and running just to make it possible to run type output
> functions.

You mean like enum_out? Or for that matter array_out, record_out, range_out?

			regards, tom lane
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Sun, Apr 29, 2012 at 11:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> ... If a user can create a type-output function that goes and
>> looks at the state of the database to determine what to output,
>> then we are completely screwed ...
>
> You mean like enum_out? Or for that matter array_out, record_out,
> range_out?

Yeah, exactly. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
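The reason enum_out is the problem case: on disk an enum value is just an OID, and producing its label normally requires a pg_enum catalog lookup. A decoder without catalog access would have to be handed the OID-to-label map some other way, e.g. in stream metadata. A toy illustration (all names and OIDs below are invented):

```python
# On-disk tuple: an enum column stores only the OID of the label.
# Without the catalog, the decoder needs the OID -> label map shipped
# to it; here it is just a dict standing in for pg_enum contents.

pg_enum = {16400: "open", 16401: "closed", 16402: "reopened"}

def enum_out(oid, enum_labels):
    """Mimic enum output: stored OID -> textual label."""
    return enum_labels[oid]

raw_tuple = (42, 16401)              # (case_id, status enum OID)
decoded = (raw_tuple[0], enum_out(raw_tuple[1], pg_enum))
print(decoded)  # -> (42, 'closed')
```

array_out, record_out and range_out have the same shape of problem: their output depends on element/field type information that lives in the catalogs, not in the stored datum.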
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Fri, Apr 27, 2012 at 11:18 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> Andres Freund <and...@2ndquadrant.com> wrote:
>
> I'm not convinced that I would rather see the logic fixed inside
> the master as opposed to being deployable on the master's machine,
> the slave machine, or even on its own machine in between.

There are use cases where the translation from WAL to logical takes place on the master, the standby or other locations. It's becoming clear that filtering records on the source is important in high-bandwidth systems, so the initial work focuses on putting that on the master, i.e. the source. Which was not my first thought either. If you use cascading, this would still allow you to have master -> standby -> logical.

Translating WAL is a very hard task. Some time ago, I did also think an external tool would help (my initial design was called xfilter), but I no longer think that is likely to work very well apart from very simple cases.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
> Aakash, when you get a chance, could you fill in the inch-stones
> from the GSoC proposal page onto the Wiki page?

Sure, http://wiki.postgresql.org/wiki/XReader updated.

On Sat, Apr 28, 2012 at 3:48 AM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> Andres Freund <and...@2ndquadrant.com> wrote:
>> In the current, prototypal, state there is one component thats
>> integrated into the server (because it needs information thats
>> only available there).
>
> The xReader design was based on the idea that it would be nice not
> to cause load on the master machine, and that by proxying the WAL
> stream to the HS, using synchronous replication style to write from
> xReader to the HS, you could use the HS for a source for that data
> with it being at exactly the right point in time to query it. I'm
> not convinced that I would rather see the logic fixed inside the
> master as opposed to being deployable on the master's machine, the
> slave machine, or even on its own machine in between.
>
>> That component is layered ontop of a totally generic xlog
>> reading/parsing library that doesn't care at all where its
>> running.
>
> That's cool.
>
>> Its also used in another cluster to read the received (filtered)
>> stream.
>
> I don't quite follow what you're saying there.
>
>> I plan to submit the XLogReader (thats what its called atm) before
>> everything else, so everybody can take a look as soon as possible.
>
> Great! That will allow more discussion and planning.
>
>> I took a *very* short glance over the current wiki description of
>> xReader and from that it seems to me it would benefit from trying
>> to make it architecturally more similar to the rest of pg.
>
> We're planning on using existing protocol to talk between pieces.
> Other than breaking it out so that it can run somewhere other than
> inside the server, and allowing clients to connect to xReader to
> listen to WAL events of interest, are you referring to anything
> else?
>
>> I also would suggest reviewing how the current walreceiver/sender,
>> and their protocol, work.
>
> Of course!
The first inch-stone in the GSoC project plan basically consists of creating an executable that functions as a walreceiver and a walsender to just pass things through from the master to the slave. We build from there by allowing clients to connect (again, over existing protocol) and register for events of interest, and then recognizing different WAL records to generate events. The project was just going to create a simple client to dump the information to disk, but with the time saved by adopting what you've already done, that might leave more time for generating a useful client. Aakash, when you get a chance, could you fill in the inch-stones from the GSoC proposal page onto the Wiki page? I think the descriptions of those interim steps would help people understand your proposal better. Obviously, some of the particulars of tasks and the dates may need adjustment based on the new work which is expected to appear before you start, but what's there now would be a helpful reference. -Kevin
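The pass-through stage Kevin describes -- an executable that acts as walreceiver toward the master and walsender toward the slave, while letting clients register for WAL events of interest -- can be sketched in a few lines. This is a hypothetical illustration only (the class and record-type names are invented, and a real implementation would speak the PostgreSQL streaming replication protocol rather than pass Python dicts):

```python
# Hypothetical sketch of the xReader pass-through stage: relay WAL
# records from master to standby unchanged, while firing events for
# clients that registered interest in particular record types.
# Names and record formats here are illustrative, not PostgreSQL APIs.

class WalProxy:
    def __init__(self):
        self.listeners = {}  # record type -> list of callbacks

    def register(self, record_type, callback):
        """A client registers for WAL events of a given type."""
        self.listeners.setdefault(record_type, []).append(callback)

    def relay(self, wal_stream, standby):
        """Forward each record verbatim; notify interested listeners."""
        for record in wal_stream:
            for cb in self.listeners.get(record["type"], []):
                cb(record)
            standby.append(record)  # pass through to the standby

# Usage: dump heap-insert events to a list while relaying everything.
events = []
standby = []
proxy = WalProxy()
proxy.register("HEAP_INSERT", events.append)
stream = [{"type": "HEAP_INSERT", "lsn": 1}, {"type": "COMMIT", "lsn": 2}]
proxy.relay(stream, standby)
```

The simple "dump to disk" client from the project plan would just be one more registered callback.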
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Hello Andres, The xReader design was based on the idea that it would be nice not to cause load on the master machine, and that by proxying the WAL stream to the HS, using synchronous replication style to write from xReader to the HS, you could use the HS as a source for that data with it being at exactly the right point in time to query it. Yes, that does make sense for some workloads. I don't think it's viable for everything though, that's why we're not aiming for that ourselves atm. Regarding the above, what would be a case where querying the HS will not suffice? On Sat, Apr 28, 2012 at 4:02 AM, Andres Freund and...@2ndquadrant.com wrote: Hi Kevin, Hi Aakash, On Saturday, April 28, 2012 12:18:38 AM Kevin Grittner wrote: Andres Freund and...@2ndquadrant.com wrote: In the current, prototypal, state there is one component that's integrated into the server (because it needs information that's only available there). The xReader design was based on the idea that it would be nice not to cause load on the master machine, and that by proxying the WAL stream to the HS, using synchronous replication style to write from xReader to the HS, you could use the HS as a source for that data with it being at exactly the right point in time to query it. Yes, that does make sense for some workloads. I don't think it's viable for everything though, that's why we're not aiming for that ourselves atm. I'm not convinced that I would rather see the logic fixed inside the master as opposed to being deployable on the master's machine, the slave machine, or even on its own machine in between. I don't think that you can do everything apart from the master. We currently need shared memory for coordination between the moving parts, that's why we have it inside the master. It also has the advantage of being easier to set up. That component is layered on top of a totally generic xlog reading/parsing library that doesn't care at all where it's running. That's cool.
It's also used in another cluster to read the received (filtered) stream. I don't quite follow what you're saying there. To interpret the xlog back into something that can be used for replication you need to read it again. After filtering we again write valid WAL, so we can use the same library on the sending|filtering side and on the receiving side. But that's actually off topic for this thread ;) I took a *very* short glance over the current wiki description of xReader and from that it seems to me it would benefit from trying to make it architecturally more similar to the rest of pg. We're planning on using existing protocol to talk between pieces. Other than breaking it out so that it can run somewhere other than inside the server, and allowing clients to connect to xReader to listen to WAL events of interest, are you referring to anything else? It sounds like the xReader is designed to be one multiplexing process. While this definitely has some advantages resource-usage-wise it doesn't seem to fit the rest of the design that well. The advantages might outweigh everything else, but I am not sure about that. Something like registering/deregistering also doesn't fit that well with the way walsender works as far as I understand it. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Simon Riggs si...@2ndquadrant.com writes: Translating WAL is a very hard task. No kidding. I would think it's impossible on its face. Just for starters, where will you get table and column names from? (Looking at the system catalogs is cheating, and will not work reliably anyway.) IMO, if we want non-physical replication, we're going to need to build it in at a higher level than after-the-fact processing of WAL. I foresee wasting quite a lot of effort on the currently proposed approaches before we admit that they're unworkable. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Tom Lane wrote: Simon Riggs writes: Translating WAL is a very hard task. No kidding. I would think it's impossible on its face. Just for starters, where will you get table and column names from? (Looking at the system catalogs is cheating, and will not work reliably anyway.) Well, the xReader approach (Aakash's GSoC project) is to serve as a proxy for a WAL stream going to a hot standby, to interpret each incoming WAL record by cheating and querying the HS before passing the WAL along, and then using sync rep to the HS regardless of whether the incoming stream is sync or async. What reliability problem do you see with this approach? -Kevin
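The per-record loop Kevin describes -- interpret each incoming record by querying the hot standby's catalogs, then forward it with sync-rep semantics so the standby is at exactly the right point in time for the next lookup -- can be modeled roughly as follows. This is an illustrative sketch only; the data structures, relfilenode values, and function names are hypothetical, not xReader code:

```python
# Illustrative model of xReader's per-record flow: "cheat" by resolving
# physical identifiers via the standby's catalogs, emit a logical event,
# then forward the raw record and wait (sync rep) for the standby to
# apply it before touching the next record. All names are hypothetical.

def relay_one(record, hs, standby_wal):
    """Translate one WAL record via HS metadata, then forward it synchronously."""
    # 1. Query the hot standby to map the relfilenode to a table name.
    table = hs["catalog"][record["relfilenode"]]
    logical = {"op": record["op"], "table": table, "values": record["values"]}
    # 2. Forward the raw record; sync rep means we block until the HS has
    #    applied it, keeping the HS consistent for the next catalog query.
    standby_wal.append(record)
    hs["applied_lsn"] = record["lsn"]
    return logical

hs = {"catalog": {16384: "accounts"}, "applied_lsn": 0}
standby_wal = []
rec = {"lsn": 101, "op": "INSERT", "relfilenode": 16384, "values": (1, "alice")}
event = relay_one(rec, hs, standby_wal)
```

The synchronous wait in step 2 is the part Tom later flags as a performance concern: every record translation is gated on standby apply.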
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Kevin Grittner kevin.gritt...@wicourts.gov writes: Tom Lane wrote: Simon Riggs writes: Translating WAL is a very hard task. No kidding. I would think it's impossible on its face. Well, the xReader approach (Aakash's GSoC project) is to serve as a proxy for a WAL stream going to a hot standby, to interpret each incoming WAL record by cheating and querying the HS before passing the WAL along, and then using sync rep to the HS regardless of whether the incoming stream is sync or async. What reliability problem do you see with this approach? Well, first off, it requires a replica that *is* a physical replica (else it won't have the same table OIDs, for instance). I'm not sure I get the point of logical replication that requires a physical replica as a prerequisite. Next, it breaks immediately in the face of DDL: CREATE TABLE foo (f1 int); BEGIN; ALTER TABLE foo RENAME COLUMN f1 TO f2; INSERT INTO foo (f2) VALUES (1); COMMIT; The standby is not going to think that the ALTER is committed, so it will not report the right column name when it comes time to translate the INSERT. Actually, you can break it even more easily than that: CREATE TABLE bar AS SELECT ... What will you do with the insertions executed by this CREATE? They have to be executed before the creation of table bar is committed. Also, you'd need a query connection per database (at least, maybe one per session if you were trying to track DDL effects), which seems rather a lot of load on the HS slave. That together with the requirement for synchronous operation seems absolutely catastrophic from a performance standpoint. If you want to spend the summer building a toy, fine, but I don't see this going anywhere for production purposes. It would be enormously more performant for the master to be emitting logical replication records to start with, since it already has all the right names etc at hand at basically no cost.
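Tom's first DDL example can be condensed into a toy model (plain Python, nothing PostgreSQL-specific; all names here are illustrative): the standby's catalog exposes only committed state, so the lookup made while translating the INSERT that follows the still-uncommitted ALTER returns the old column name.

```python
# Toy model of the DDL hazard Tom describes: the hot standby's catalog
# reflects only *committed* state, so a metadata lookup made while
# translating an INSERT inside a not-yet-committed transaction that
# renamed the column returns the pre-ALTER name.

committed_catalog = {"foo": ["f1"]}  # RENAME TO f2 not yet committed

def translate_insert(table, values, catalog):
    """Label a decoded INSERT's values with column names from the catalog."""
    return dict(zip(catalog[table], values))

# WAL order: ALTER (uncommitted) ... INSERT ... COMMIT.  Querying the
# standby at INSERT-translation time still sees the old name:
row = translate_insert("foo", [1], committed_catalog)
# row is {"f1": 1}, but the statement actually inserted into column f2
```

The CREATE TABLE ... AS SELECT case is worse still: the insertions must be translated before the table's creation is committed, so the lookup cannot find the table at all.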
regards, tom lane
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Hello Tom, I'm not sure I get the point of logical replication that requires a physical replica as a prerequisite. It would be enormously more performant for the master to be emitting logical replication records to start with, since it already has all the right names etc at hand at basically no cost. I want to emphasize that any system built on the logical reader, a replication system included, puts a heavy query load on the database it interrogates so frequently. In most environments the source database is the main production database, so any performance degradation there is very costly. If instead we offload almost all of that work to the physical replica, the source database keeps running at its full throughput. Thus, at the cost of the replication system as a whole running a little slower than it could against the main database directly, the performance-critical source database is not affected at all.
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Sat, Apr 28, 2012 at 4:43 PM, Tom Lane t...@sss.pgh.pa.us wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: Tom Lane wrote: Simon Riggs writes: Translating WAL is a very hard task. No kidding. I would think it's impossible on its face. Well, the xReader approach (Aakash's GSoC project) is to serve as a proxy for a WAL stream going to a hot standby, to interpret each incoming WAL record by cheating and querying the HS before passing the WAL along, and then using sync rep to the HS regardless of whether the incoming stream is sync or async. What reliability problem do you see with this approach? Well, first off, it requires a replica that *is* a physical replica (else it won't have the same table OIDs, for instance). It's even harder than that because the physical replica needs to be able to be shut down and restarted. We currently restart WAL from the last checkpoint, but if changes have already been made and committed then the catalog will be incorrect when we replay the WAL again. So, I think saying very hard is justified. We basically have a choice of shipping full metadata with every change or finding some way to avoid doing that. Investigating the latter seems like a worthwhile task but there are limits, as you say. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
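Simon's restart scenario is the mirror image of Tom's DDL objection: after a crash, replay restarts at the last checkpoint, but the catalog may already contain DDL committed after the older records being re-read were written. A toy model (hypothetical names, nothing PostgreSQL-specific) of why re-decoding then mislabels them:

```python
# Toy model of the restart hazard: WAL replay restarts from the last
# checkpoint, but the catalog already reflects a rename committed after
# the older records were generated, so re-decoding them uses metadata
# from the "future".

catalog_after_restart = {"foo": ["f2"]}  # rename already committed/applied

def decode(record, catalog):
    """Label a decoded record's values using the current catalog metadata."""
    return dict(zip(catalog[record["table"]], record["values"]))

# An INSERT written while the column was still named f1 is re-read
# during replay from the checkpoint:
old_insert = {"table": "foo", "values": [1]}
row = decode(old_insert, catalog_after_restart)
# row is {"f2": 1}, but at write time the column was f1
```

Either direction -- catalog too old (Tom's case) or too new (this one) -- leads to the same conclusion: correct decoding needs metadata as of each record's LSN, which is why the alternative is shipping full metadata with every change.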
[HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
[replaced bad email address for Josh (which was my fault)] Aakash Goel aakash.b...@gmail.com wrote: All, the wiki page is now up at http://wiki.postgresql.org/wiki/XReader. Note that the approach Aakash is taking doesn't involve changes to the backend code; it is strictly a standalone executable which functions as a proxy to a hot standby and to which clients like replication systems connect. There is a possible additional configuration which wouldn't require a hot standby, if time permits. I am not clear on whether 2nd Quadrant's code takes this approach or builds it into the server. I think we need to know that much before we can get very far in discussion. -Kevin
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
On Friday, April 27, 2012 11:04:04 PM Kevin Grittner wrote: [replaced bad email address for Josh (which was my fault)] Aakash Goel aakash.b...@gmail.com wrote: All, the wiki page is now up at http://wiki.postgresql.org/wiki/XReader. Note that the approach Aakash is taking doesn't involve changes to the backend code; it is strictly a standalone executable which functions as a proxy to a hot standby and to which clients like replication systems connect. There is a possible additional configuration which wouldn't require a hot standby, if time permits. I am not clear on whether 2nd Quadrant's code takes this approach or builds it into the server. I think we need to know that much before we can get very far in discussion. In the current, prototypal, state there is one component that's integrated into the server (because it needs information that's only available there). That component is layered on top of a totally generic xlog reading/parsing library that doesn't care at all where it's running. It's also used in another cluster to read the received (filtered) stream. I plan to submit the XLogReader (that's what it's called atm) before everything else, so everybody can take a look as soon as possible. I took a *very* short glance over the current wiki description of xReader and from that it seems to me it would benefit from trying to make it architecturally more similar to the rest of pg. I also would suggest reviewing how the current walreceiver/sender, and their protocol, work. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Andres Freund and...@2ndquadrant.com wrote: In the current, prototypal, state there is one component that's integrated into the server (because it needs information that's only available there). The xReader design was based on the idea that it would be nice not to cause load on the master machine, and that by proxying the WAL stream to the HS, using synchronous replication style to write from xReader to the HS, you could use the HS as a source for that data with it being at exactly the right point in time to query it. I'm not convinced that I would rather see the logic fixed inside the master as opposed to being deployable on the master's machine, the slave machine, or even on its own machine in between. That component is layered on top of a totally generic xlog reading/parsing library that doesn't care at all where it's running. That's cool. It's also used in another cluster to read the received (filtered) stream. I don't quite follow what you're saying there. I plan to submit the XLogReader (that's what it's called atm) before everything else, so everybody can take a look as soon as possible. Great! That will allow more discussion and planning. I took a *very* short glance over the current wiki description of xReader and from that it seems to me it would benefit from trying to make it architecturally more similar to the rest of pg. We're planning on using existing protocol to talk between pieces. Other than breaking it out so that it can run somewhere other than inside the server, and allowing clients to connect to xReader to listen to WAL events of interest, are you referring to anything else? I also would suggest reviewing how the current walreceiver/sender, and their protocol, work. Of course! The first inch-stone in the GSoC project plan basically consists of creating an executable that functions as a walreceiver and a walsender to just pass things through from the master to the slave.
We build from there by allowing clients to connect (again, over existing protocol) and register for events of interest, and then recognizing different WAL records to generate events. The project was just going to create a simple client to dump the information to disk, but with the time saved by adopting what you've already done, that might leave more time for generating a useful client. Aakash, when you get a chance, could you fill in the inch-stones from the GSoC proposal page onto the Wiki page? I think the descriptions of those interim steps would help people understand your proposal better. Obviously, some of the particulars of tasks and the dates may need adjustment based on the new work which is expected to appear before you start, but what's there now would be a helpful reference. -Kevin
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Hi Kevin, Hi Aakash, On Saturday, April 28, 2012 12:18:38 AM Kevin Grittner wrote: Andres Freund and...@2ndquadrant.com wrote: In the current, prototypal, state there is one component that's integrated into the server (because it needs information that's only available there). The xReader design was based on the idea that it would be nice not to cause load on the master machine, and that by proxying the WAL stream to the HS, using synchronous replication style to write from xReader to the HS, you could use the HS as a source for that data with it being at exactly the right point in time to query it. Yes, that does make sense for some workloads. I don't think it's viable for everything though, that's why we're not aiming for that ourselves atm. I'm not convinced that I would rather see the logic fixed inside the master as opposed to being deployable on the master's machine, the slave machine, or even on its own machine in between. I don't think that you can do everything apart from the master. We currently need shared memory for coordination between the moving parts, that's why we have it inside the master. It also has the advantage of being easier to set up. That component is layered on top of a totally generic xlog reading/parsing library that doesn't care at all where it's running. That's cool. It's also used in another cluster to read the received (filtered) stream. I don't quite follow what you're saying there. To interpret the xlog back into something that can be used for replication you need to read it again. After filtering we again write valid WAL, so we can use the same library on the sending|filtering side and on the receiving side. But that's actually off topic for this thread ;) I took a *very* short glance over the current wiki description of xReader and from that it seems to me it would benefit from trying to make it architecturally more similar to the rest of pg. We're planning on using existing protocol to talk between pieces.
Other than breaking it out so that it can run somewhere other than inside the server, and allowing clients to connect to xReader to listen to WAL events of interest, are you referring to anything else? It sounds like the xReader is designed to be one multiplexing process. While this definitely has some advantages resource-usage-wise it doesn't seem to fit the rest of the design that well. The advantages might outweigh everything else, but I am not sure about that. Something like registering/deregistering also doesn't fit that well with the way walsender works as far as I understand it. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)
Andres Freund and...@2ndquadrant.com wrote: Something like registering/deregistering also doesn't fit that well with the way walsender works as far as I understand it. If you look at the diagrams on the xReader Wiki page, the lines labeled XLOG stream are the ones using walsender/walreceiver. The green arrows represent normal connections to the database, to run queries to retrieve metadata needed to interpret the WAL records, and the lines labeled Listener n are expected to use the pg protocol to connect, but won't be talking page-oriented WAL -- they will be dealing with logical interpretation of the WAL: the sort of data which could be fed to a database which doesn't have the same page images, like Slony et al do. Perhaps, given other points you made, the library for interpreting the WAL records could be shared, and hopefully a protocol for the clients, although that seems a lot more muddy to me at this point. If we can share enough code, there may be room for both approaches with minimal code duplication. -Kevin