Re: [HACKERS] Perf regression in 2.6.32 (Ubuntu 10.04 LTS)
Robert Haas wrote:
> Greg, have you run into any other evidence suggesting a problem with 2.6.32?

I haven't actually checked myself yet. Right now the only distribution shipping 2.6.32 usefully is Ubuntu 10.04, which I can't recommend anyone use on a server because their release schedules are way too aggressive to ever deliver stable versions anymore. So until either RHEL6 or Debian Squeeze ships, very late this year or early next, the performance of 2.6.32 is irrelevant to me. And by then I'm hoping that the early adopters will have squashed more of the obvious bugs here. 2.6.32 is 11 months old at this point, which makes it still a bleeding-edge kernel in my book.

-- 
Greg Smith, 2ndQuadrant US
g...@2ndquadrant.com
Baltimore, MD
PostgreSQL Training, Services and Support
www.2ndQuadrant.us

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] todo point: plpgsql - scrollable cursors are supported
2010/10/7 Robert Haas :
> On Mon, Oct 4, 2010 at 2:52 AM, Pavel Stehule wrote:
>> I am thinking you can remove "scrollable cursor support" from the
>> plpgsql ToDo. Scrollable cursors are supported, and the supported
>> syntax is the same as in the core SQL language.
>
> I agree, removed. I also removed WITH HOLD cursors, which we seem to
> have as well.

I don't think we support the WITH HOLD cursor syntax yet. Maybe we have similar functionality; I don't know.

Pavel
Re: [HACKERS] leaky views, yet again
On 07.10.2010 06:39, Robert Haas wrote:
> On Tue, Oct 5, 2010 at 3:42 PM, Tom Lane wrote:
>> Right, *column* filtering seems easy and entirely secure. The angst
>> here is about row filtering. Can we have a view in which users can see
>> the values of a column for some rows, with perfect security that they
>> can't identify values for the hidden rows? The stronger form is that
>> they shouldn't even be able to tell that hidden rows exist, which is
>> something your view doesn't try to do; but there are at least some
>> applications where that would be desirable.
>
> I took a crack at documenting the current behavior; see attached.

Looks good. It gives the impression that you need to be able to create a custom function to exploit it, though. It would be good to mention that internal functions can be used too, so revoking access to CREATE FUNCTION does not make you safe.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] host name support in pg_hba.conf
(2010/10/06 10:21), KaiGai Kohei wrote:
> I'll check the patch for more details, please wait for a few days.

I found some issues in this patch, independent of the discussion about localhost. (About pg_hba.conf.sample, I'm sorry for the bad suggestion. Please fix it up according to Tom's comment.)

* The logic is still unclear to me.

check_hostname() immediately returns false if the reverse-resolved remote host name does NOT match the host name given in pg_hba.conf:

| +static bool
| +check_hostname(hbaPort *port, const char *hostname)
| +{
| +	struct addrinfo *gai_result, *gai;
| +	int		ret;
| +	bool	found;
| +
| +	/* Lookup remote host name if not already done */
| +	if (!port->remote_hostname)
| +	{
| +		char	remote_hostname[NI_MAXHOST];
| +
| +		if (pg_getnameinfo_all(&port->raddr.addr, port->raddr.salen,
| +							   remote_hostname, sizeof(remote_hostname),
| +							   NULL, 0,
| +							   0))
| +			return false;
| +
| +		port->remote_hostname = strdup(remote_hostname);
| +	}
| +
| +	if (strcmp(port->remote_hostname, hostname) != 0)
| +		return false;
| +
| +	/* Lookup IP from host name and check against original IP */

However, it seems to me you expected the opposite behavior. If the resolved host name matches the host name given in pg_hba.conf, we can consider this HbaLine a suitable configuration without any fallbacks, right? If so, it should be:

    if (strcmp(port->remote_hostname, hostname) == 0)
        return true;

In addition, we should run the rest of the fallback code only in the mismatch case, shouldn't we?

* Why getnameinfo() in the fallback loop?

The fallback code that runs when the host name matched (I believe this code is intended to handle the case when the host name did NOT match) calls getnameinfo() for each candidate remote address, but its result is not referenced by anybody. Is it really necessary?
| +	found = false;
| +	for (gai = gai_result; gai; gai = gai->ai_next)
| +	{
| +		char	hostinfo[NI_MAXHOST];
| +
| +		getnameinfo(gai->ai_addr, gai->ai_addrlen,
| +					hostinfo, sizeof(hostinfo),
| +					NULL, 0,
| +					NI_NUMERICHOST);
| +
| +		if (gai->ai_addr->sa_family == port->raddr.addr.ss_family)
| +		{
| +			if (gai->ai_addr->sa_family == AF_INET)
| +			{
| +				if (ipv4eq((struct sockaddr_in *) gai->ai_addr,
| +						   (struct sockaddr_in *) &port->raddr.addr))
| +				{
| +					found = true;
| +					break;
| +				}
| +			}
| +			else if (gai->ai_addr->sa_family == AF_INET6)
| +			{
| +				if (ipv6eq((struct sockaddr_in6 *) gai->ai_addr,
| +						   (struct sockaddr_in6 *) &port->raddr.addr))
| +				{
| +					found = true;
| +					break;
| +				}
| +			}
| +		}
| +	}
| +
| +	if (gai_result)
| +		freeaddrinfo(gai_result);
| +
| +	return found;
| +}

* Slash ('/') after the host name

In parse_hba_line(), the parsed token containing either a host name or a CIDR address is sliced into two parts at the first '/' character, if present. But even when cidr_slash is not NULL, it is silently ignored when the first half of the token is a host name rather than a numeric address:

| 	else
| 	{
| 		/* IP and netmask are specified */
| 		parsedline->ip_cmp_method = ipCmpMask;
|
| 		/* need a modifiable copy of token */
| 		token = pstrdup(token);
|
| 		/* Check if it has a CIDR suffix and if so isolate it */
| 		cidr_slash = strchr(token, '/');
| 		if (cidr_slash)
| 			*cidr_slash = '\0';
| 		:
| 		ret = pg_getaddrinfo_all(token, NULL, &hints, &gai_result);
| -		if (ret || !gai_result)
| +		if (ret == 0 && gai_result)
| +			memcpy(&parsedline->addr, gai_result->ai_addr,
| +				   gai_result->ai_addrlen);
| +		else if (ret == EAI_NONAME)
| +			parsedline->hostname = token;
| +		else
| +		{

This allows the following configuration to work without any errors (in fact, it works for me):

    # IPv4/6 local connections:
    host    all    all    kaigai.myhome.cx/today_is_sunny    trust

It seems to me we should raise an error if both cidr_slash and parsedline->hostname are non-NULL.
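For clarity, the control flow the review asks for can be sketched as a small self-contained program. The function and data below are illustrative stand-ins, not the patch's actual code: real DNS lookups are replaced by pre-resolved strings, and whether a reverse-only match is acceptable security-wise is exactly what is under discussion here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for check_hostname(): decide whether a client
 * matches a pg_hba.conf host-name entry, with lookups pre-resolved.
 */
static bool
check_hostname_sketch(const char *remote_hostname,  /* reverse lookup of client */
                      const char *hba_hostname,     /* name in pg_hba.conf */
                      const char *remote_addr,      /* client address as text */
                      const char **forward_addrs,   /* forward lookup of hba name */
                      int n_forward)
{
    /* Suggested fix: a reverse-lookup match alone decides positively. */
    if (strcmp(remote_hostname, hba_hostname) == 0)
        return true;

    /*
     * Fallback, reached only in the mismatch case: forward-resolve the
     * configured name and compare each candidate address with the
     * client's address.
     */
    for (int i = 0; i < n_forward; i++)
    {
        if (strcmp(forward_addrs[i], remote_addr) == 0)
            return true;
    }

    return false;
}
```

The point is simply the early `return true` on a reverse-lookup match, with the forward-resolution loop reached only on mismatch, inverting the patch's early `return false`.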
Thanks,
-- 
KaiGai Kohei
Re: [HACKERS] leaky views, yet again
On Tue, Oct 5, 2010 at 3:42 PM, Tom Lane wrote:
> Right, *column* filtering seems easy and entirely secure. The angst
> here is about row filtering. Can we have a view in which users can see
> the values of a column for some rows, with perfect security that they
> can't identify values for the hidden rows? The stronger form is that
> they shouldn't even be able to tell that hidden rows exist, which is
> something your view doesn't try to do; but there are at least some
> applications where that would be desirable.

I took a crack at documenting the current behavior; see attached. It turns out that a view which only uses boolean operators in the WHERE clause is not obviously subvertable, because we judge those operations to have no cost. (It seems unwise to rely on this for security, though.) Anything more complicated - that does row filtering - is easily hacked. See within for details.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment: document-leaky-views.patch (binary data)
Re: [HACKERS] Review: Fix snapshot taking inconsistencies
On Mon, 4 Oct 2010, Marko Tiikkaja wrote:
> ... the patch does, modules start behaving weirdly. So what I'm
> suggesting is:
>
> - Deprecate pg_parse_and_rewrite(). I have no idea how the project has
>   done this in the past, but grepping the source code for "deprecated"
>   suggests that we just remove the function.
> - Introduce a new function, specifically designed for SQL functions.
>   Both callers of pg_parse_and_rewrite (init_sql_fcache and
>   fmgr_sql_validator) call check_sql_fn_retval after
>   pg_parse_and_rewrite, so I think we could merge those into the new
>   function.
>
> Does anyone have any concerns about this? Better ideas?
>
> Regards,
> Marko Tiikkaja

The only comment I've seen on this was from Tom, and his only concern was that old code wouldn't be able to compile against a new version of the function. What you propose (deleting pg_parse_and_rewrite and replacing it with a function of the new name) sounds fine. Since no one else has proposed a better idea and the commitfest is ticking away, I think you should go ahead and do that.

Steve
Re: [HACKERS] todo point: plpgsql - scrollable cursors are supported
On Mon, Oct 4, 2010 at 2:52 AM, Pavel Stehule wrote:
> I am thinking you can remove "scrollable cursor support" from the
> plpgsql ToDo. Scrollable cursors are supported, and the supported
> syntax is the same as in the core SQL language.

I agree, removed. I also removed WITH HOLD cursors, which we seem to have as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Bug / shortcoming in has_*_privilege
(2010/10/07 2:05), Alvaro Herrera wrote:
>>> Another thing that could raise eyebrows is that I chose to remove the
>>> "missing_ok" argument from get_role_oid_or_public, so it's not a perfect
>>> mirror of it. None of the current callers need it, but perhaps people
>>> would like these functions to be consistent.
>>
>> Tom Lane suggested adding a missing_ok argument, although it is not a
>> hard requirement.
>
> Actually I think he suggested the opposite.

Ahh, I misread his suggestion. Well, I'd like to mark this patch as 'ready for committer'.

Thanks,
-- 
KaiGai Kohei
Re: [HACKERS] security hook on table creation
(2010/10/07 6:21), Alvaro Herrera wrote:
> Excerpts from Robert Haas's message of mié oct 06 17:02:22 -0400 2010:
>> 2010/10/5 KaiGai Kohei:
>>> However, we also have a few headache cases.
>>> DefineType() creates a new type object and its array type, but it does
>>> not call CommandCounterIncrement() by the end of the function, so the
>>> new type entries are not visible to the plugin modules, even if we put
>>> a security hook at the tail of DefineType().
>>> DefineFunction() has the same problem. It creates a new procedure
>>> object, but it also does not call CommandCounterIncrement() by the end
>>> of the function, except for the case when ProcedureCreate() invokes the
>>> language validator function.
>>
>> So I guess the first question here is why it's important to be able to
>> see the new entry. I am thinking that you want it so that, for example,
>> you can fetch the namespace OID to perform an SE-Linux type transition.
>> Is that right?
>
> I'm not sure that there's any point trying to optimize these to the
> point of avoiding CommandCounterIncrement. Surely DefineType et al are
> not performance-sensitive operations.

Performance optimization is not the point here. If we need to call CommandCounterIncrement() before invoking the security hooks, we also need to analyze its side effects and confirm that it is harmless. Even if it is harmless, I think it gives us a greater code-maintenance burden than the idea of two hooks on creation.

>> Maybe we need a variant of InvokeObjectAccessHook that does a CCI only
>> if a hook is present.
>
> The problem I see with this idea is that it becomes a lot harder to
> track down whether it occurred or not for any given operation.

Yes, it is a burden of code maintenance.

Thanks,
-- 
KaiGai Kohei
Re: [HACKERS] security hook on table creation
(2010/10/07 6:02), Robert Haas wrote:
> 2010/10/5 KaiGai Kohei:
>> I began to revise the security hooks, but I found a few cases that do
>> not work with the approach of post-creation security hooks.
>> If we try to fix up the core PG routines to suit the post-creation
>> security hooks, we will need to put more CommandCounterIncrement()
>> calls in various places, which made me doubtful whether this approach
>> really has minimal impact on the core PG routines.
>
> We definitely don't want to add CCIs without a good reason.
>
>> Some object classes have CommandCounterIncrement() just after the
>> routine that creates a new object. For example, DefineRelation() calls
>> it just after heap_create_with_catalog(), so the new relation entry is
>> visible to plugin modules using SearchSysCache(), as long as we put the
>> post-creation security hook after the CommandCounterIncrement().
>>
>> However, we also have a few headache cases.
>> DefineType() creates a new type object and its array type, but it does
>> not call CommandCounterIncrement() by the end of the function, so the
>> new type entries are not visible to the plugin modules, even if we put
>> a security hook at the tail of DefineType().
>> DefineFunction() has the same problem. It creates a new procedure
>> object, but it also does not call CommandCounterIncrement() by the end
>> of the function, except for the case when ProcedureCreate() invokes the
>> language validator function.
>
> So I guess the first question here is why it's important to be able to
> see the new entry. I am thinking that you want it so that, for example,
> you can fetch the namespace OID to perform an SE-Linux type transition.
> Is that right?
We assumed that the namespace OID can be fetched from the new pg_class entry, so we thought the common InvokeObjectAccessHook() does not need to take many arguments: we can pull out the relevant properties of the new object (such as its namespace OID) from the syscache using the OID of the new object. So InvokeObjectAccessHook() would instead have to deliver the OID of the namespace, rather than the OID of the new object, if the new object is not visible.

> Maybe we need a variant of InvokeObjectAccessHook that does a CCI only
> if a hook is present. I can't see that we're going to want to pay for
> that unconditionally, but it shouldn't surprise anyone that loading an
> enhanced security provider comes with some overhead.

The reason we tried to move the object-creation hooks to after object creation was to reduce the number of security hooks and the burden of code maintenance. However, it seems to me that paying attention to object visibility gives us more burden than having multiple creation hooks.

>> E.g, it may be possible to design creation of a relation as follows:
>>
>> DefineRelation(...)
>> {
>>     /* DAC permission checks here */
>>     :
>>     /* MAC permission checks; it returns its private data */
>>     opaque = check_relation_create(..);
>>     :
>>     /* insertion into pg_class catalog */
>>     relationId = heap_create_with_catalog(...);
>>     :
>>     /* assign security labels on the new object */
>>     InvokeObjectAccessHook(OBJECT_TABLE, OAT_POST_CREATE,
>>                            relationId, 0, opaque);
>> }
>
> I'd like to try to avoid that, if we can. That's going to make this
> code far more complex to understand and maintain.

Contrary to that feeling, I think the idea of two hooks helps us understand and maintain the security hooks in the future. The place to put the pre-creation hook is just after the default privilege checks, which is quite obvious. If we had only a post-creation hook, would it be obvious where we should put the security hook on function creation, for example?
In the case where we have two hooks, the pre-creation hook obviously goes after the DAC checks, and the post-creation hook after the insert/update of the system catalogs. That seems quite easy to understand and maintain.

Thanks,
-- 
KaiGai Kohei
Re: [HACKERS] patch: tsearch - some memory diet
Robert Haas writes:
> Nice. What was the overall effect on memory consumption?

Before:
cspell: 31352608 total in 3814 blocks; 37432 free (6 chunks); 31315176 used

After:
cspell: 16214816 total in 1951 blocks; 13744 free (12 chunks); 16201072 used

This is on a 32-bit machine that uses MAXALIGN 8, and I also had enable_cassert on (hence extra palloc chunk overhead) so it may be overstating the amount of savings you'd see in production. But it's in the same ballpark as what Pavel reported originally. AFAICT practically all of the useful savings comes from the one place he targeted originally, and the other changes are just marginal gravy.

Before I throw it away, here's some data about the allocations that go through that code on the Czech dictionary. First column is number of calls of the given size, second is requested size in bytes:

      1    1
    695    2
   1310    3
   1965    4
   2565    5
   1856    6
    578    7
    301    8
      7    9
      2   10
 707733   12
    520   16
 107035   20
     16   24
  22606   28
      3   32
   8814   36
     59   40
   4305   44
      2   48
   2238   52
      2   56
   1236   60
     20   64
    816   68
    599   76
      1   80
    408   84
      9   88
    334   92
      2   96
    246  100
      1  104
    164  108
     13  112
    132  116
    110  124
      1  128
    107  132
      3  136
     81  140
      1  144
     77  148
     40  156
     46  164
     29  172
     39  180
      2  184
     35  188
     31  196
     19  204
     16  212
     12  220
     10  228
      3  244
      1  304
      1  400
      1 1120

The spikes at 12/20/28 bytes correspond to SPNodes with 1/2/3 data items.

BTW, on a 64-bit machine we're really paying through the nose for the pointers in SPNodeData --- the pointers are bad enough, and their alignment effects are worse. If we were to try to change this over to a pointer-free representation, we could probably replace those pointers by 32-bit offsets, which would save a full factor of 2 on 64-bit.

regards, tom lane
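Tom's closing point about 32-bit offsets can be illustrated with a toy structure (hypothetical field names; the real SPNodeData is more involved than this):

```c
#include <stdint.h>

/*
 * Toy stand-in for a trie node that links children by pointer.  On a
 * 64-bit build the pointer is 8 bytes and forces 8-byte alignment,
 * padding the struct out to 16 bytes.
 */
struct node_ptr {
    struct node_ptr *child;
    uint32_t         val;
};

/*
 * Pointer-free variant: all nodes live in one contiguous array and
 * children are referenced by 32-bit index, so the struct stays 8 bytes
 * on both 32- and 64-bit builds.
 */
struct node_off {
    uint32_t child_idx;
    uint32_t val;
};

/* Hypothetical arena; child lookup becomes base + index. */
static struct node_off arena[16];

static struct node_off *
node_child(const struct node_off *n)
{
    return &arena[n->child_idx];
}
```

On a typical LP64 machine sizeof(struct node_ptr) is 16 versus 8 for the offset form, matching the "full factor of 2" estimate above.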
Re: [HACKERS] patch: tsearch - some memory diet
On Wed, Oct 6, 2010 at 7:36 PM, Tom Lane wrote:
> Teodor Sigaev writes:
>>> on 32bit from 27MB (3399 blocks) to 13MB (1564 blocks)
>>> on 64bit from 55MB to cca 27MB.
>
>> Good results. But, I think, there are more places in ispell to use
>> hold_memory():
>> - affixes and affix tree
>> - regis (REGex for ISpell, regis.c)
>
> I fixed the affix stuff as much as possible (some of the structures are
> re-palloc'd so they can't easily be included). It appears that hacking
> up regis, or any of the remaining allocations, wouldn't be worth the
> trouble. Using the Czech dictionary on a 32-bit machine, I see about
> 16MB going through the compacted-alloc code and only about 375K going
> through regular small palloc's.

Nice. What was the overall effect on memory consumption?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] patch: tsearch - some memory diet
Teodor Sigaev writes:
>> on 32bit from 27MB (3399 blocks) to 13MB (1564 blocks)
>> on 64bit from 55MB to cca 27MB.

> Good results. But, I think, there are more places in ispell to use
> hold_memory():
> - affixes and affix tree
> - regis (REGex for ISpell, regis.c)

I fixed the affix stuff as much as possible (some of the structures are re-palloc'd so they can't easily be included). It appears that hacking up regis, or any of the remaining allocations, wouldn't be worth the trouble. Using the Czech dictionary on a 32-bit machine, I see about 16MB going through the compacted-alloc code and only about 375K going through regular small palloc's.

regards, tom lane
Re: [HACKERS] patch: tsearch - some memory diet
Pavel Stehule writes:
> this simple patch reduces the persistent allocated memory for tsearch
> ispell dictionaries:
> on 32bit from 27MB (3399 blocks) to 13MB (1564 blocks)
> on 64bit from 55MB to cca 27MB.

Applied with revisions --- I got rid of the risky static state as per discussion, and extended the hackery to strings and Aff nodes as suggested by Teodor.

regards, tom lane
Re: [HACKERS] security hook on table creation
Excerpts from Robert Haas's message of mié oct 06 17:02:22 -0400 2010:
> 2010/10/5 KaiGai Kohei :
>> However, we also have a few headache cases.
>> DefineType() creates a new type object and its array type, but it does
>> not call CommandCounterIncrement() by the end of the function, so the
>> new type entries are not visible to the plugin modules, even if we put
>> a security hook at the tail of DefineType().
>> DefineFunction() has the same problem. It creates a new procedure
>> object, but it also does not call CommandCounterIncrement() by the end
>> of the function, except for the case when ProcedureCreate() invokes the
>> language validator function.
>
> So I guess the first question here is why it's important to be able to
> see the new entry. I am thinking that you want it so that, for
> example, you can fetch the namespace OID to perform an SE-Linux type
> transition. Is that right?

I'm not sure that there's any point trying to optimize these to the point of avoiding CommandCounterIncrement. Surely DefineType et al are not performance-sensitive operations.

> Maybe we need a variant of InvokeObjectAccessHook that does a CCI only
> if a hook is present.

The problem I see with this idea is that it becomes a lot harder to track down whether it occurred or not for any given operation.

-- 
Álvaro Herrera
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] security hook on table creation
2010/10/5 KaiGai Kohei :
> I began to revise the security hooks, but I found a few cases that do
> not work with the approach of post-creation security hooks.
> If we try to fix up the core PG routines to suit the post-creation
> security hooks, we will need to put more CommandCounterIncrement()
> calls in various places, which made me doubtful whether this approach
> really has minimal impact on the core PG routines.

We definitely don't want to add CCIs without a good reason.

> Some object classes have CommandCounterIncrement() just after the
> routine that creates a new object. For example, DefineRelation() calls
> it just after heap_create_with_catalog(), so the new relation entry is
> visible to plugin modules using SearchSysCache(), as long as we put the
> post-creation security hook after the CommandCounterIncrement().
>
> However, we also have a few headache cases.
> DefineType() creates a new type object and its array type, but it does
> not call CommandCounterIncrement() by the end of the function, so the
> new type entries are not visible to the plugin modules, even if we put
> a security hook at the tail of DefineType().
> DefineFunction() has the same problem. It creates a new procedure
> object, but it also does not call CommandCounterIncrement() by the end
> of the function, except for the case when ProcedureCreate() invokes the
> language validator function.

So I guess the first question here is why it's important to be able to see the new entry. I am thinking that you want it so that, for example, you can fetch the namespace OID to perform an SE-Linux type transition. Is that right?

Maybe we need a variant of InvokeObjectAccessHook that does a CCI only if a hook is present. I can't see that we're going to want to pay for that unconditionally, but it shouldn't surprise anyone that loading an enhanced security provider comes with some overhead.
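A minimal sketch of the "CCI only if a hook is present" variant floated above, with stub names standing in for the backend's CommandCounterIncrement() and object_access_hook (these definitions are illustrative assumptions, not actual PostgreSQL code):

```c
#include <stddef.h>

/* Stub counting CommandCounterIncrement() calls, so the overhead is observable. */
static int cci_count;
static void
CommandCounterIncrement_stub(void)
{
    cci_count++;
}

/* Hook slot, NULL unless an enhanced security provider is loaded. */
typedef void (*object_access_hook_type) (int objtype, unsigned objid);
static object_access_hook_type object_access_hook = NULL;

/* Sample provider hook, counting its invocations. */
static int hook_calls;
static void
sample_security_hook(int objtype, unsigned objid)
{
    (void) objtype;
    (void) objid;
    hook_calls++;
}

/*
 * The proposed variant: pay for the CCI only when a provider has
 * actually installed a hook, so unmodified installations see no
 * extra overhead.
 */
static void
InvokeObjectAccessHookWithCCI(int objtype, unsigned objid)
{
    if (object_access_hook != NULL)
    {
        CommandCounterIncrement_stub(); /* make the new catalog row visible */
        (*object_access_hook) (objtype, objid);
    }
}
```

This encodes the trade-off in the thread: with no hook loaded nothing happens, while a loaded provider gets a post-CCI view of the new object, at the cost that whether a CCI happened now depends on runtime state.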
> E.g, it may be possible to design creation of a relation as follows:
>
> DefineRelation(...)
> {
>     /* DAC permission checks here */
>     :
>     /* MAC permission checks; it returns its private data */
>     opaque = check_relation_create(..);
>     :
>     /* insertion into pg_class catalog */
>     relationId = heap_create_with_catalog(...);
>     :
>     /* assign security labels on the new object */
>     InvokeObjectAccessHook(OBJECT_TABLE, OAT_POST_CREATE,
>                            relationId, 0, opaque);
> }

I'd like to try to avoid that, if we can. That's going to make this code far more complex to understand and maintain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] host name support in pg_hba.conf
On Wed, Oct 6, 2010 at 6:21 AM, Magnus Hagander wrote:
> It's not common, but I've certainly come across a number of virtual
> machines where localhost resolves (through /etc/hosts) to the machine's
> "real" IP rather than 127.0.0.1, because 127.0.0.1 simply doesn't
> exist.

It's perfectly fine for localhost to resolve to the machine's external IP address. It would be weird for it to resolve to some other host's IP address, like the VM's host machine. But having 127.0.0.1 not exist would be positively broken. All kinds of things wouldn't work. Are you sure about that part?

-- 
greg
Re: [HACKERS] Issues with Quorum Commit
On Wed, 2010-10-06 at 18:04 +0300, Heikki Linnakangas wrote:
> The key is whether you are guaranteed to have zero data loss or not.

We agree that is an important question. You seem willing to trade anything for that guarantee. I seek a more pragmatic approach that balances availability and risk. Those views are different, but not inconsistent. Oracle manages to offer multiple options and so can we.

If you desire that, go for it. But don't try to stop others having a simple, pragmatic approach. The code to implement your desired option is more complex and really should come later. I don't in any way wish to block that option in this release, or any other, but please don't try to persuade people it's the only sensible option 'cos it damn well isn't.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Issues with Quorum Commit
> Seems reasonable, but what is a CAP database?

A database built around the CAP theorem[1]: Cassandra, Dynamo, Hypertable, etc.

For us, the equation is CAD, as in Consistency, Availability, Durability. Pick any two, at best. But it's a very similar bag of issues to the ones CAP addresses.

[1] http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 09:04 PM, Dimitri Fontaine wrote:
> Ok so I think we're agreeing here: what I said amounts to proposing that
> the code work this way when the quorum is set up that way, and/or is
> able to reject any non-read-only transaction (those that need a real
> XID) until your standby is fully in sync.
>
> I'm just saying that this should be an option, not the only choice.

I'm sorry, I just don't see the use case for a mode that drops guarantees when they are most needed. People who don't need those guarantees should definitely go for async replication instead. What does a synchronous replication mode that falls back to async upon failure give you, except for a severe degradation in performance during normal operation? Why not use async right away in such a case?

> Depends, lots of things out there work quite well in best-effort mode,
> even if some projects need more careful thinking. That's again the idea
> of waiting forever or just continuing; there's a middle ground, which is
> starting the system before reaching the durability requirements, or
> downgrading it to read-only, or even off, until you get them.

In such cases the admin should be free to reconfigure the quorum. And yes, a read-only mode might be feasible. Please just don't fool the admin with a "best effort" thing that guarantees nothing (but trouble).

> If you ask for a quorum larger than what the current standbys are able
> to deliver, and you're set to wait forever until the quorum is reached,
> you just blocked the master.

Correct. That's the intended behavior.

> Good news is that the quorum is a per-transaction setting

I definitely like the per-transaction thing.

> so opening a
> superuser connection to act on the currently waiting transaction is
> still possible (pass/fail, but what does fail mean at this point?
> shutdown to wait some more offline?).

Not sure I'm following here.
The admin will be busy re-establishing (connections to) standbys; killing transactions on the master doesn't help anything - whether or not the master waits forever.

Regards

Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 20:57, Josh Berkus wrote:
> While it's nice to dismiss case (1) as an edge case, consider the
> likelihood of someone running PostgreSQL with fsync=off on cloud
> hosting. In that case, having k = N = 5 does not seem like an
> unreasonable arrangement if you want to ensure durability via
> replication. It's what the CAP databases do.

Seems reasonable, but what is a CAP database?

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] gist picksplit iteration
Marios Vodas writes:
> I would expect it to start from 0, since C arrays are 0-based.
> So my question is why does this happen?

Well, I don't have any good answer other than "it's the API". Time to have a look at some contrib code, and maybe some other code like ip4r or prefix (the former is fixed-size, the latter a varlena struct; pick the example that fits your case).

-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner writes:
> There's no point in time I ever mind whether a standby is a "candidate"
> or not. Either I want to synchronously replicate to X standbys, or not.

Ok so I think we're agreeing here: what I said amounts to proposing that the code work this way when the quorum is set up that way, and/or is able to reject any non-read-only transaction (those that need a real XID) until your standby is fully in sync.

I'm just saying that this should be an option, not the only choice. And that by having a clear view of the system's state, it's possible to set out a clear error-response policy.

> This is an admin decision. Whether or not your standbys are up and
> running, existing or just about to be bought, that doesn't have any
> impact on your durability requirements.

Depends; lots of things out there work quite well in best-effort mode, even if some projects need more careful thinking. That's again the idea of waiting forever or just continuing; there's a middle ground, which is starting the system before reaching the durability requirements, or downgrading it to read-only, or even off, until you get them. You can read my proposal as a way to let our users choose between those two incompatible behaviours.

> Of course, it doesn't make sense to wait-forever on *every* standby that
> ever gets added. Quorum commit is required, yes (and that's what this
> thread is about, IIRC). But with quorum commit, adding a standby only
> improves availability, but certainly doesn't block the master in any
> way. (Quite the opposite: it can allow the master to continue, if it has
> been blocked before because the quorum hasn't been reached.)

If you ask for a quorum larger than what the current standbys are able to deliver, and you're set to wait forever until the quorum is reached, you just blocked the master.
Good news is that the quorum is a per-transaction setting, so opening a superuser connection to act on the currently waiting transaction is still possible (pass/fail, but what does fail mean at this point? shut down to wait some more offline?). Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
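The interplay between quorum size and the number of connected standbys that Dimitri describes can be sketched as a tiny model. This is purely an illustration of the semantics under discussion, not PostgreSQL code; all names are hypothetical:

```python
# Minimal model of quorum commit: a transaction's commit is released once
# at least `quorum` standbys have acknowledged it. If the quorum is larger
# than the number of standbys able to acknowledge at all, a wait-forever
# master is blocked -- the situation described in the message above.

def commit_released(acks_received: int, quorum: int) -> bool:
    """True once enough standby acknowledgements have arrived."""
    return acks_received >= quorum

def can_ever_release(connected_standbys: int, quorum: int) -> bool:
    """False when the quorum exceeds the number of connected standbys,
    i.e. the master would wait forever."""
    return connected_standbys >= quorum
```

For example, with a quorum of 2 and a single connected standby, `can_ever_release(1, 2)` is false: the master blocks until another standby joins or an operator intervenes, which is exactly the failure mode being debated.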
Re: [HACKERS] Issues with Quorum Commit
Hello Dimitri, On 10/06/2010 05:41 PM, Dimitri Fontaine wrote: > - when do you start considering the standby as a candidate to your sync >rep requirements? That question doesn't make much sense to me. There's no point in time I ever mind if a standby is a "candidate" or not. Either I want to synchronously replicate to X standbys, or not. > Much of the discussion we're having takes as implicit that the > answer is "as soon as you know about its existence, that must be at the > pg_start_backup() point". This is an admin decision. Whether your standbys are up and running or not, existing or just about to be bought, that doesn't have any impact on your durability requirements. If you want your banking accounts data to be saved in at least two different locations, I think that's your requirement. You'd be quite unhappy if your bank lost your last month's salary, but stated: "hey, at least we didn't have any downtime". > And you can offer an option to guarantee the wait-forever behavior only > when it makes sense, rather than trying to catch your own tail as soon > as a standby is added in the mix Of course, it doesn't make sense to wait-forever on *every* standby that ever gets added. Quorum commit is required, yes (and that's what this thread is about, IIRC). But with quorum commit, adding a standby only improves availability, but certainly doesn't block the master in any way. (Quite the opposite: it can allow the master to continue, if it has been blocked before because the quorum hasn't been reached). Regards Markus Wanner
Re: [HACKERS] patch: tsearch - some memory diet
Robert Haas writes: > On Wed, Oct 6, 2010 at 1:40 PM, Tom Lane wrote: >> Robert Haas writes: >>> ...but I don't really see why that has to be done as part of this patch. >> >> Because patches that reduce maintainability seldom get cleaned up after. > I don't think you've made a convincing argument that this patch does > that, but if you're feeling motivated to go clean this up, I'm more > than happy to get out of the way. Yeah, I'll take it. regards, tom lane
Re: [HACKERS] Issues with Quorum Commit
All, Let me clarify and consolidate this discussion. Again, it's my goal that this thread specifically identify only the problems and desired behaviors for synch rep with more than one sync standby. There are several issues with even one sync standby which still remain unresolved, but I believe that we should discuss those on a separate thread, for clarity. I also strongly believe that we should get single-standby functionality committed and tested *first*, before working further on multi-standby. So, to summarize earlier discussion on this thread: There are 2 reasons to have more than one sync standby: 1) To increase durability above the level of a single synch standby, even at the cost of availability. 2) To increase availability without decreasing durability below the level offered by a single sync standby. The "pure" setup for each of these options, where N is the number of standbys and k is the number of acks required from standbys, is: 1) k = N, N > 1, apply 2) k = 1, N > 1, recv (Timeouts are a specific compromise of durability for availability on *one* server, and as such will not be discussed here. BTW, I was the one who suggested a timeout, rather than Simon, so if you don't like the idea, harass me about it.) Any configuration (3) other than the two above is a specific compromise between durability and availability, for example: 3a) k = 2, N = 3, fsync 3b) k = 3, N = 10, recv ... should give you better durability than case 2) and better availability than case 1). While it's nice to dismiss case (1) as an edge-case, consider the likelihood of someone running PostgreSQL with fsync=off on cloud hosting. In that case, having k = N = 5 does not seem like an unreasonable arrangement if you want to ensure durability via replication. It's what the CAP databases do. After eliminating some of my issues as non-issues, here's what we're left with for problems on the above: (1), (3) Accounting/Registration. 
Implementing any of these cases would seem to require some form of accounting and/or registration on the master in terms of, at a minimum, the number of acks for each data send. More likely we will need, as proposed on other threads, a register of standbys and the sync state of each. Not only will this accounting/registration code be hard to write, it will have at least *some* performance overhead. Whether that overhead is minor or substantial can only be determined through testing. Further, there's the issue of whether, and how, we transmit this register to the standbys so that they can be promoted. (2), (3) Degradation: (Jeff) these two cases make sense only if we give DBAs the tools they need to monitor which standbys are falling behind, and to drop and replace those standbys. Otherwise we risk giving DBAs false confidence that they have better-than-1-standby reliability when actually they don't. Current tools are not really adequate for this. (1), (3) Dynamic Re-configuration: we need the ability to add and remove standbys at runtime. We also need to have a verdict on how to handle the case where a transaction is pending, per Heikki. (2), (3) Promotion: all multi-standby high-availability cases only make sense if we provide tools to promote the most current standby to be the new master. Otherwise the whole cluster still goes down whenever we have to replace the master. We also should provide some mechanism for promoting an async standby to sync; this has already been discussed. (1) Consistency: this is another DBA-false-confidence issue. DBAs who implement (1) are liable to do so thinking that they are not only guaranteeing the consistency of every standby with the master, but the consistency of every standby with every other standby -- a kind of dummy multi-master. They are not, so it will take multiple reminders and workarounds in the docs to explain this. And we'll get complaints anyway. 
(1), (2), (3) Initialization: (Dimitri) we need a process whereby a standby can go from cloned to synched to being a sync rep standby, and possibly from degraded to synced again and back. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
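The accounting/registration requirement in Josh's summary can be illustrated with a toy master-side register. This is a hypothetical sketch of the concept only; the actual design was still under discussion, and LSNs are modeled as plain integers:

```python
# Toy standby register: the master tracks, per standby, the last WAL
# position (LSN) acknowledged. A commit waiting at `lsn` is released
# once at least k standbys have acknowledged that far.

class StandbyRegister:
    def __init__(self):
        self.acked = {}  # standby name -> last acknowledged LSN

    def report_ack(self, standby: str, lsn: int) -> None:
        # Acknowledgements only move forward, never backward.
        self.acked[standby] = max(lsn, self.acked.get(standby, 0))

    def acks_for(self, lsn: int) -> int:
        # How many registered standbys have acked at least this LSN.
        return sum(1 for pos in self.acked.values() if pos >= lsn)

    def commit_ok(self, lsn: int, k: int) -> bool:
        return self.acks_for(lsn) >= k
```

Even this toy version shows where the overhead comes from: every acknowledgement updates shared state, and every waiting commit re-checks it.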
Re: [HACKERS] patch: tsearch - some memory diet
On Wed, Oct 6, 2010 at 1:40 PM, Tom Lane wrote: > Robert Haas writes: >> I think it would be cleaner to get rid of checkTmpCtx() and instead >> have dispell_init() set up and tear down the temporary context, > > What I was thinking of doing was getting rid of the static variable > altogether: we should do what you say above, but in the form of a > state struct that's created and destroyed by additional calls from > dispell_init(). Then that state struct could also carry the > infrastructure for this additional hack. It's a little more notation to > pass an additional parameter through all these routines, but from the > standpoint of understandability and maintainability it's clearly worth > it. > >> void NISetupForDictionaryLoad(); >> void NICleanupAfterDictionaryLoad(); > > More like > > NISpellState *NISpellInit(); > NISpellTerm(NISpellState *stat); > >> ...but I don't really see why that has to be done as part of this patch. > > Because patches that reduce maintainability seldom get cleaned up after. I don't think you've made a convincing argument that this patch does that, but if you're feeling motivated to go clean this up, I'm more than happy to get out of the way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] patch: tsearch - some memory diet
Robert Haas writes: > I think it would be cleaner to get rid of checkTmpCtx() and instead > have dispell_init() set up and tear down the temporary context, What I was thinking of doing was getting rid of the static variable altogether: we should do what you say above, but in the form of a state struct that's created and destroyed by additional calls from dispell_init(). Then that state struct could also carry the infrastructure for this additional hack. It's a little more notation to pass an additional parameter through all these routines, but from the standpoint of understandability and maintainability it's clearly worth it. > void NISetupForDictionaryLoad(); > void NICleanupAfterDictionaryLoad(); More like NISpellState *NISpellInit(); NISpellTerm(NISpellState *stat); > ...but I don't really see why that has to be done as part of this patch. Because patches that reduce maintainability seldom get cleaned up after. regards, tom lane
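The refactoring Tom sketches, replacing a static variable with a state struct created and destroyed around the dictionary load, is a general pattern that can be shown in a language-neutral way. The sketch below is a loose Python analogue of the NISpellInit/NISpellTerm proposal; all names are hypothetical and the real code is C inside spell.c:

```python
# Instead of a module-level (static) variable holding temporary
# dictionary-load state, the caller creates an explicit state object,
# threads it through the load routines, and destroys it when done.
# The lifetime of the temporary state is then visible in the code
# rather than implicit in a global.

class SpellState:
    """Per-load temporary state (analogous to the proposed NISpellState)."""
    def __init__(self):
        self.tmp = []  # transient allocations made during the load

def spell_init() -> SpellState:
    # Analogue of NISpellInit(): create fresh per-load state.
    return SpellState()

def load_entry(state: SpellState, word: str) -> None:
    # All transient work goes through the passed-in state, never a global.
    state.tmp.append(word)

def spell_term(state: SpellState) -> int:
    # Analogue of NISpellTerm(): tear down the temporary state.
    n = len(state.tmp)
    state.tmp.clear()
    return n
```

The payoff Tom describes is exactly this: once the state is a parameter, there is no unstated assumption about when a static variable gets reset per dictionary.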
Re: [HACKERS] Bug / shortcoming in has_*_privilege
Excerpts from KaiGai Kohei's message of mar oct 05 00:06:05 -0400 2010: > (2010/09/07 6:16), Alvaro Herrera wrote: > > Excerpts from Jim Nasby's message of jue jun 10 17:54:43 -0400 2010: > >> test...@workbook=# select has_table_privilege( 'public', 'test', 'SELECT' > >> ); > >> ERROR: role "public" does not exist > > > > Here's a patch implementing this idea. > > > I checked this patch. Thanks. > It seems to me it replaces the whole of get_role_oid() in the has_*_privilege > functions with the new get_role_oid_or_public(), so this patch allows > accepting the pseudo "public" user in a consistent way. Yes. > The pg_has_role_*() functions are an exception. They will raise an error > with the message "role "public" does not exist". > That is the expected behavior, isn't it? Yes. You cannot grant "public" to roles; according to the definition of public, this doesn't make sense. Accordingly, I chose to reject "public" as an input for pg_has_role and friends. > > Another thing that could raise eyebrows is that I chose to remove the > > "missing_ok" argument from get_role_oid_or_public, so it's not a perfect > > mirror of it. None of the current callers need it, but perhaps people > > would like these functions to be consistent. > > > Tom Lane suggested to add missing_ok argument, although it is not a must- > requirement. Actually I think he suggested the opposite. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] standby registration (was: is sync rep stalled?)
On Wed, Oct 6, 2010 at 12:26 PM, Greg Smith wrote: > Now, the more relevant question, what I actually need in order for a Sync > Rep feature in 9.1 to be useful to the people who want it most I talk to. > That would be a simple to configure setup where I list a subset of > "important" nodes, and the appropriate acknowledgement level I want to hear > from one of them. And when one of those nodes gives that acknowledgement, > commit on the master happens too. That's it. For use cases like the > commonly discussed "two local/two remote" situation, the two remote ones > would be listed as the important ones. That sounds fine to me. How do the details work? Each slave publishes a name to the master via a recovery.conf parameter, and the master has a GUC listing the names of the important slaves? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] patch: tsearch - some memory diet
On Mon, Oct 4, 2010 at 2:05 AM, Pavel Stehule wrote: > 2010/10/4 Robert Haas : >> On Oct 3, 2010, at 7:02 PM, Tom Lane wrote: >>> It's not at all apparent that the code is even >>> safe as-is, because it's depending on the unstated assumption that that >>> static variable will get reset once per dictionary. The documentation >>> is horrible: it doesn't really explain what the patch is doing, and what >>> it does say is wrong. >> >> Yep. We certainly would need to convince ourselves that this is correct >> before applying it, and that is all kinds of non-obvious. >> > > What is good documentation? > > This patch doesn't do more than remove palloc overhead on just > one structure of the TSearch2 ispell dictionary. It isn't related to some > static variable; the most important fact is that this memory is > deallocated by dropping the memory context. After looking at this a bit more, I don't think it's too hard to fix up the comments so that they reflect what's actually going on here. For this patch to be correct, the only thing we really need to believe is that no one is going to try to pfree an SPNode, which seems like something we ought to be able to convince ourselves of. I don't see how the fact that some of the memory may get allocated out of a palloc'd chunk from context X rather than from context X directly can really cause any problems otherwise. The existing code already depends on the unstated assumption that the static variable will get reset once per dictionary; we're not making that any worse. I think it would be cleaner to get rid of checkTmpCtx() and instead have dispell_init() set up and tear down the temporary context, leaving NULL behind in the global variable after it's torn down, perhaps by having spell.c publish an API like this: void NISetupForDictionaryLoad(); void NICleanupAfterDictionaryLoad(); ...but I don't really see why that has to be done as part of this patch. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] standby registration (was: is sync rep stalled?)
Josh Berkus wrote: However, I think we're getting way the heck away from how far we really want to go for 9.1. Can I point out to people that synch rep is going to involve a fair bit of testing and debugging, and that maybe we don't want to try to implement The World's Most Configurable Standby Spec as a first step? I came up with the following initial spec for Most Configurable Standby Setup Ever recently: -The state of all available standby systems is exposed via a table-like interface, probably an SRF. -As each standby reports back a result, its entry in the table is updated with what level of commit it has accomplished (recv, fsync, etc.) -The table-like list of standby states is then passed to a function, that you could write in SQL or whatever else makes you happy. The function returns a boolean for whether sufficient commit guarantees have been met yet. You can make the conditions required as complicated as you like. -Once that function returns true, commit on the master. Otherwise return to waiting for standby responses. So that's what I actually want here, because all subsets of it proposed so far are way too boring. If you cannot express every possible standby situation that anyone will ever think of via an arbitrary function hook, obviously it's not worth building at all. Now, the more relevant question: what I actually need in order for a Sync Rep feature in 9.1 to be useful to the people I talk to who want it most. That would be a simple to configure setup where I list a subset of "important" nodes, and the appropriate acknowledgement level I want to hear from one of them. And when one of those nodes gives that acknowledgement, commit on the master happens too. That's it. For use cases like the commonly discussed "two local/two remote" situation, the two remote ones would be listed as the important ones. 
Until something that simple is committed, tested, debugged, and has had some run-ins with the real world, I have minimal faith that an attempt at anything more complicated has sufficient information to succeed. And complete faith that even trying will fail to deliver something for 9.1. The scope creep that seems to be happening here in the name of "this will be hard to change so it must be right in the first version" boggles my mind. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
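Both of Greg's proposals, the maximal "arbitrary function over a table of standby states" and the minimal "one ack from any important node", reduce to a policy function evaluated over the standby-state table. The sketch below is a hypothetical model of that idea, not anything in PostgreSQL; the state representation and level names are assumptions for illustration:

```python
# Model each standby's state as a dict, e.g.
#   {"name": "remote1", "level": "fsync"}
# where "level" records how far this standby has acknowledged the
# pending commit (None, "recv", "fsync", "apply"). A policy function
# over the list of states decides whether the master may commit.
# Greg's minimal proposal is just one particular policy.

LEVELS = {None: 0, "recv": 1, "fsync": 2, "apply": 3}

def one_important_acked(standbys, important, required_level):
    """Release the commit once any 'important' node has acknowledged
    at required_level or stronger."""
    need = LEVELS[required_level]
    return any(s["name"] in important and LEVELS[s["level"]] >= need
               for s in standbys)
```

The master's commit loop would re-evaluate the policy as acknowledgements arrive; the maximal design simply lets the user supply their own function in place of `one_important_acked`.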
Re: [HACKERS] Issues with Quorum Commit
Heikki Linnakangas writes: > I'm sorry, but I still don't understand the use case you're envisioning. How > many standbys are there? What are you trying to achieve with synchronous > replication over what asynchronous offers? Sorry if I've been unclear; I read loads of messages then tried to pick up the right one to answer, and obviously missed to spell out some context. My concern starts with only 1 standby, and is in fact 2 questions: - Why oh why wouldn't you be able to fix your sync setup in the master as soon as there's a standby doing a base backup? - when do you start considering the standby as a candidate to your sync rep requirements? Much of the discussion we're having takes as implicit that the answer is "as soon as you know about its existence, that must be at the pg_start_backup() point". I claim that's incorrect, and you can't ask the master to wait forever until the standby is in sync. All the more so because there's a window with wal_keep_segments here too, so the sync might never happen. To solve that problem, I propose managing the current state of the standby. That means auto registration of any standby, a feedback loop at more stages, and some protocol arbitrage for the standby to be able to say "I'm this far actually" so that the master can know how to consider it, rather than just demote it while live. Once you have a clear list of possible states for a standby, and can decide on what errors mean in terms of transitions in the state machine, you're able to decide when wait forever is an option and when you should ignore it or refuse any side-effect transaction commit. And you can offer an option to guarantee the wait-forever behavior only when it makes sense, rather than trying to catch your own tail as soon as a standby is added in the mix, with the proposals I've read on how you can't even restart the master at this point. 
Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 04:20 PM, Simon Riggs wrote: > Ending the wait state does not cause data loss. It puts you at *risk* of > data loss, which is a different thing entirely. These kinds of risk scenarios are what sync replication is all about. A minimum guarantee that doesn't hold in face of the first few failures (see Jeff's argument) isn't worth a dime. Keep in mind that upon failure, the other nodes presumably get more load. As has been seen with RAID, that easily leads to subsequent failures. Sync rep needs to be able to protect against that *as well*. > If you want to avoid data loss you use N+k redundancy and get on with > life, rather than sitting around waiting. With that notion, I'd argue that quorum_commit needs to be set to exactly k, because any higher value would only cost performance without any useful benefit. But if I want at least k ACKs and if I think it's worth the performance penalty that brings during normal operation, I want that guarantee to hold true *especially* in case of an emergency. If availability is more important, you need to increase N and make sure enough of these (asynchronously) replicated nodes stay up. Increase k (thus quorum commit) for a stronger durability guarantee. > Putting in a feature for people that choose k=0 seems wasteful to me, > since they knowingly put themselves at risk in the first place. Given the above logic, k=0 equals completely async replication. Not sure what's wrong about that. Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 18:02, Dimitri Fontaine wrote: Heikki Linnakangas writes: 1. base-backup — self explaining 2. catch-up — getting the WAL to catch up after base backup 3. wanna-sync — don't yet have all the WAL to get in sync 4. do-sync — all WALs are there, coming soon 5. ok (async | recv | fsync | reply — feedback loop engaged) So you only consider that a standby is a candidate for sync rep when it's reached the ok state, and that's when it's able to fill the feedback loop we've been talking about. Standby state != ok, no waiting no nothing, it's *not* a standby as far as the master is concerned. You're not going to get zero data loss that way. Can you elaborate what the use case for that mode is? You can't pretend to sync with zero data loss until the standby is ready for it, or you need to take the site down while you add your standby. I can see some user willing to take the site down while doing the base backup dance then waiting for initial sync, then only accepting traffic and being secure against data loss, but I'd much rather that be an option and you could watch for your standby's state in a system view. Meanwhile, I can't understand any reason for the master to pretend it can safely manage any sync-rep transaction while there's no standby around. Either you wait for the quorum and don't have it, or you have to track standby states with precision and maybe actively reject writes. I'm sorry, but I still don't understand the use case you're envisioning. How many standbys are there? What are you trying to achieve with synchronous replication over what asynchronous offers? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 17:20, Simon Riggs wrote: On Wed, 2010-10-06 at 15:26 +0300, Heikki Linnakangas wrote: You're not going to get zero data loss that way. Ending the wait state does not cause data loss. It puts you at *risk* of data loss, which is a different thing entirely. Looking at it that way, asynchronous replication just puts you at risk of data loss too, it doesn't necessarily mean you get data loss. The key is whether you are guaranteed to have zero data loss or not. If you don't wait forever, you're not guaranteed zero data loss. It's just best effort, like asynchronous replication. The situation you want to avoid is that the master dies, and you don't know if you have suffered data loss or not. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Issues with Quorum Commit
Heikki Linnakangas writes: >> 1. base-backup — self explaining >> 2. catch-up — getting the WAL to catch up after base backup >> 3. wanna-sync — don't yet have all the WAL to get in sync >> 4. do-sync — all WALs are there, coming soon >> 5. ok (async | recv | fsync | reply — feedback loop engaged) >> >> So you only consider that a standby is a candidate for sync rep when >> it's reached the ok state, and that's when it's able to fill the >> feedback loop we've been talking about. Standby state != ok, no waiting >> no nothing, it's *not* a standby as far as the master is concerned. > > You're not going to get zero data loss that way. Can you elaborate what the > use case for that mode is? You can't pretend to sync with zero data loss until the standby is ready for it, or you need to take the site down while you add your standby. I can see some user willing to take the site down while doing the base backup dance then waiting for initial sync, then only accepting traffic and being secure against data loss, but I'd much rather that be an option and you could watch for your standby's state in a system view. Meanwhile, I can't understand any reason for the master to pretend it can safely manage any sync-rep transaction while there's no standby around. Either you wait for the quorum and don't have it, or you have to track standby states with precision and maybe actively reject writes. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
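Dimitri's five standby states form a simple linear progression, with the rule that only a standby in state 5 ("ok") participates in the sync-rep feedback loop. A sketch of that lifecycle, using the state names from his list (the transition rules are an assumption for illustration, not implemented code):

```python
# Standby lifecycle from Dimitri's list: a standby advances through
# these states in order, and only a standby in "ok" counts toward the
# master's sync-rep quorum -- any other state means the master treats
# it as "not a standby" for waiting purposes.

STATES = ["base-backup", "catch-up", "wanna-sync", "do-sync", "ok"]

def advance(state: str) -> str:
    """Move to the next lifecycle state ('ok' is terminal here)."""
    i = STATES.index(state)
    return STATES[min(i + 1, len(STATES) - 1)]

def counts_for_quorum(state: str) -> bool:
    """Only a fully synced standby participates in sync rep."""
    return state == "ok"
```

A real implementation would also need backward transitions (a standby falling behind drops out of "ok"), which is exactly the demotion/error-policy question being debated in this thread.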
Re: [HACKERS] Issues with Quorum Commit
On Wed, 2010-10-06 at 15:26 +0300, Heikki Linnakangas wrote: > You're not going to get zero data loss that way. Ending the wait state does not cause data loss. It puts you at *risk* of data loss, which is a different thing entirely. If you want to avoid data loss you use N+k redundancy and get on with life, rather than sitting around waiting. Putting in a feature for people that choose k=0 seems wasteful to me, since they knowingly put themselves at risk in the first place. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] host name support in pg_hba.conf
On 10/06/2010 09:49 AM, Stephen Frost wrote: * Tom Lane (t...@sss.pgh.pa.us) wrote: That appears to me to be a broken (non RFC compliant) VM setup. However, maybe what this is telling us is we need to expose the setting? Or perhaps better, try 127.0.0.1, ::1, localhost, in that order. Yeah, I'd be happier if we exposed it, to be honest. Either that, or figure out a way to get rid of it entirely by using a different method, but that's a much bigger issue. Please don't expose it. It will be a source of yet more confusion. People already get confused by the difference between listening addresses and pg_hba.conf addresses. It's one of the most frequent points of confusion seen on IRC. Adding another address to configure will just compound the confusion badly. I much prefer Tom's last suggestion. cheers andrew
Re: [HACKERS] host name support in pg_hba.conf
* Tom Lane (t...@sss.pgh.pa.us) wrote: > That appears to me to be a broken (non RFC compliant) VM setup. > However, maybe what this is telling us is we need to expose the setting? > Or perhaps better, try 127.0.0.1, ::1, localhost, in that order. Yeah, I'd be happier if we exposed it, to be honest. Either that, or figure out a way to get rid of it entirely by using a different method, but that's a much bigger issue. Thanks, Stephen
Re: [HACKERS] host name support in pg_hba.conf
On Wed, Oct 6, 2010 at 15:34, Tom Lane wrote: > Magnus Hagander writes: >> On Wed, Oct 6, 2010 at 15:16, Tom Lane wrote: >>> However, the usage in pgstat.c is hard-wired, meaning that if you >>> have a configuration where "localhost" doesn't resolve correctly >>> for whatever reason, there's no simple recourse to get the stats >>> collector working. So ISTM there is an argument for changing that. > >> Well, hardcoding it will break the (unusual) case when localhost isn't >> 127.0.0.1 / ::1. (You'd obviously have to have it try both ipv4 and >> ipv6). > > You didn't read what I wrote before. Those numeric addresses define the > loopback address, *not* "localhost". When localhost fails to resolve > as those address(es), it's localhost that is wrong. We have actually > seen this in the field with bogus DNS providers. > >> It's not common, but i've certainly come across a number of virtual >> machines where localhost resolves (through /etc/hosts) to the machine's >> "real" IP rather than 127.0.0.1, because 127.0.0.1 simply doesn't >> exist. > > That appears to me to be a broken (non RFC compliant) VM setup. Can't argue with that. But it exists. > However, maybe what this is telling us is we need to expose the setting? > Or perhaps better, try 127.0.0.1, ::1, localhost, in that order. That was kind of my point, that yes, we probably need to do one of those at least. Today it is "kind of exposed", because you can edit /etc/hosts - you don't need to rely on DNS for it. I just don't want to lose that ability. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] host name support in pg_hba.conf
Magnus Hagander writes: > On Wed, Oct 6, 2010 at 15:16, Tom Lane wrote: >> However, the usage in pgstat.c is hard-wired, meaning that if you >> have a configuration where "localhost" doesn't resolve correctly >> for whatever reason, there's no simple recourse to get the stats >> collector working. So ISTM there is an argument for changing that. > Well, hardcoding it will break the (unusual) case when localhost isn't > 127.0.0.1 / ::1. (You'd obviously have to have it try both ipv4 and > ipv6). You didn't read what I wrote before. Those numeric addresses define the loopback address, *not* "localhost". When localhost fails to resolve as those address(es), it's localhost that is wrong. We have actually seen this in the field with bogus DNS providers. > It's not common, but i've certainly come across a number of virtual > machines where localhost resolves (through /etc/hosts) to the machine's > "real" IP rather than 127.0.0.1, because 127.0.0.1 simply doesn't > exist. That appears to me to be a broken (non RFC compliant) VM setup. However, maybe what this is telling us is we need to expose the setting? Or perhaps better, try 127.0.0.1, ::1, localhost, in that order. regards, tom lane
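Tom's suggested fallback order, 127.0.0.1 first, then ::1, then "localhost", can be sketched with the standard resolver. This is an illustrative model only; pgstat.c is C and its actual connection logic differs, and the function name and port here are hypothetical:

```python
import socket

def resolve_loopback(port: int = 5432):
    """Try candidate loopback targets in the suggested order and
    return (candidate, sockaddr) for the first one the resolver
    accepts. The numeric forms resolve without consulting DNS, so a
    broken "localhost" entry no longer blocks the fallback."""
    for candidate in ("127.0.0.1", "::1", "localhost"):
        try:
            infos = socket.getaddrinfo(candidate, port,
                                       type=socket.SOCK_DGRAM)
        except socket.gaierror:
            continue  # this candidate doesn't resolve; try the next
        if infos:
            return candidate, infos[0][4]
    return None
```

On a normal host the first candidate wins immediately; "localhost" only gets consulted if both numeric loopback addresses are unusable, which inverts the failure mode Tom describes (bogus DNS breaking the stats collector).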
Re: [HACKERS] host name support in pg_hba.conf
On Wed, Oct 6, 2010 at 15:16, Tom Lane wrote: > Andrew Dunstan writes: >> On 10/06/2010 04:05 AM, Peter Eisentraut wrote: >>> On tis, 2010-10-05 at 22:17 -0400, Tom Lane wrote: So far as I can find, there is *no* standard mandating that localhost means the loopback address. > >>> Should we then change pgstat.c to use IP addresses instead of hardcoding >>> "localhost"? > >> I understood Tom to be saying we should not rely on "localhost" for >> authentication, not that we shouldn't use it at all. > > I think it's all right to use it as the default value for > listen_addresses, because (1) it's an understandable default, > and (2) users can change the setting if it doesn't work. > > However, the usage in pgstat.c is hard-wired, meaning that if you > have a configuration where "localhost" doesn't resolve correctly > for whatever reason, there's no simple recourse to get the stats > collector working. So ISTM there is an argument for changing that. Well, hardcoding it will break the (unusual) case when localhost isn't 127.0.0.1 / ::1. (You'd obviously have to have it try both ipv4 and ipv6). It's not common, but i've certainly come across a number of virtual machines where localhost resolves (through /etc/hosts) to the machine's "real" IP rather than 127.0.0.1, because 127.0.0.1 simply doesn't exist. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] host name support in pg_hba.conf
Andrew Dunstan writes: > On 10/06/2010 04:05 AM, Peter Eisentraut wrote: >> On tis, 2010-10-05 at 22:17 -0400, Tom Lane wrote: >>> So far as I can find, there is *no* standard >>> mandating that localhost means the loopback address. >> Should we then change pgstat.c to use IP addresses instead of hardcoding >> "localhost"? > I understood Tom to be saying we should not rely on "localhost" for > authentication, not that we shouldn't use it at all. I think it's all right to use it as the default value for listen_addresses, because (1) it's an understandable default, and (2) users can change the setting if it doesn't work. However, the usage in pgstat.c is hard-wired, meaning that if you have a configuration where "localhost" doesn't resolve correctly for whatever reason, there's no simple recourse to get the stats collector working. So ISTM there is an argument for changing that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] host name support in pg_hba.conf
Peter Eisentraut writes: > On tis, 2010-10-05 at 22:17 -0400, Tom Lane wrote: >> So far as I can find, there is *no* standard >> mandating that localhost means the loopback address. > Should we then change pgstat.c to use IP addresses instead of hardcoding > "localhost"? Hm, perhaps so. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 15:22, Dimitri Fontaine wrote: What is necessary here is a clear view on the possible states that a standby can be in at any time, and we must stop trying to apply to some non-ready standby the behavior we want when it's already in-sync. From my experience operating Londiste, those states would be: 1. base-backup — self-explanatory 2. catch-up — getting the WAL to catch up after base backup 3. wanna-sync — don't yet have all the WAL to get in sync 4. do-sync — all WALs are there, coming soon 5. ok (async | recv | fsync | reply — feedback loop engaged) So you only consider that a standby is a candidate for sync rep when it's reached the ok state, and that's when it's able to fill the feedback loop we've been talking about. Standby state != ok, no waiting no nothing, it's *not* a standby as far as the master is concerned. You're not going to get zero data loss that way. Can you elaborate on what the use case for that mode is? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner writes: > On 10/06/2010 04:31 AM, Simon Riggs wrote: >> That situation would require two things >> * First, you have set up async replication and you're not monitoring it >> properly. Shame on you. > > The way I read it, Jeff is complaining about the timeout you propose > that effectively turns sync into async replication in case of a failure. > > With a master that waits forever, the standby that's newly required for > quorum certainly still needs its time to catch up. But it wouldn't live > in danger of being "optimized away" for availability in case it cannot > catch up within the given timeout. It's a tradeoff between availability > and durability. What is necessary here is a clear view on the possible states that a standby can be in at any time, and we must stop trying to apply to some non-ready standby the behavior we want when it's already in-sync. From my experience operating Londiste, those states would be: 1. base-backup — self-explanatory 2. catch-up — getting the WAL to catch up after base backup 3. wanna-sync — don't yet have all the WAL to get in sync 4. do-sync — all WALs are there, coming soon 5. ok (async | recv | fsync | reply — feedback loop engaged) So you only consider that a standby is a candidate for sync rep when it's reached the ok state, and that's when it's able to fill the feedback loop we've been talking about. Standby state != ok, no waiting no nothing, it's *not* a standby as far as the master is concerned. The other states allow us to manage accepting a new standby into an existing setup, and to handle failures. When we stop receiving the feedback loop events, the master knows the slave ain't in the "ok" state any more and can demote it to "wanna-sync", because it has to keep WALs until the slave comes back. If the standby is not back online and wal_keep_segments makes it so that we can't keep its WAL anymore, the state gets back to "base-backup". 
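A minimal sketch of this lifecycle, for concreteness. The state names follow Dimitri's list, but the transition logic shown (demotion on feedback loss, falling back to base-backup when the WAL runs out) is only one reading of the message, not an actual design:

```python
from enum import Enum, auto

# Sketch of the standby lifecycle described above; state names follow the
# message, the transition helper is an illustrative interpretation only.
class StandbyState(Enum):
    BASE_BACKUP = auto()   # taking the base backup
    CATCH_UP = auto()      # replaying WAL to catch up after base backup
    WANNA_SYNC = auto()    # missing some WAL needed to get in sync
    DO_SYNC = auto()       # all WAL available, about to be in sync
    OK = auto()            # feedback loop engaged (async/recv/fsync/reply)

def on_feedback_lost(state, wal_still_available):
    # Losing the feedback loop demotes an OK standby to wanna-sync; if
    # wal_keep_segments can no longer cover it, it is back to base-backup.
    if state is StandbyState.OK:
        return (StandbyState.WANNA_SYNC if wal_still_available
                else StandbyState.BASE_BACKUP)
    return state
```

Only a standby in the OK state would count toward any sync-rep decision; every other state is invisible to the master's commit path.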
Not going into every detail here (for example, we might need some protocol arbitration for the standby to be able to explain to the master that it's ok even if the master thinks it's not), but my point is that without a clear list of standby states, we're going to hinder the master in situations where it makes no sense to do so. > Or do you envision any use case that requires a quorum of X standbies > for normal operation but is just fine with only none to (X-1) standbies > in case of failures? IMO that's when sync replication is most needed and > when it absolutely should hold to its promises - even if it means to > stop the system. > > There's no point in continuing operation if you cannot guarantee the > minimum requirements for durability. If you happen to want such a thing, > you had better rethink your minimum requirement (as performance for > normal operations might benefit from a lower minimum as well). This part of the discussion made me think of yet another refinement on the Quorum Commit idea, even if I'm beginning to think that can be material for later. Basic Quorum Commit is having each transaction on the master wait for a total number of votes before accepting the transaction as synced. Each standby has a weight, meaning 1 or more votes. The problem is the flexibility isn't there, some cases are impossible to set up. Also people want to be able to specify their favorite standby and that quickly gets awkward. Idea: segment the votes into "colors" or any categories you like. Have each standby be a member of a category list, and require per-category quorums to be reached. This is the same as attributing roles to standbys and saying that they're all equivalent as soon as they're part of the given role, with the added flexibility that you can sometimes want more than one standby of a given role to take part in the quorum. 
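The per-category quorum idea can be sketched in a few lines. Everything here (the function name, the category labels) is illustrative only, not proposed syntax or configuration:

```python
# Illustrative-only sketch of the "colored" quorum: each standby carries a
# set of category labels, and a commit is acknowledged only once every
# category has gathered its own required number of votes.  None of these
# names correspond to real or proposed PostgreSQL settings.
def quorum_satisfied(acks, standby_categories, required):
    """acks: names of standbys that acknowledged this commit.
    standby_categories: standby name -> set of category labels.
    required: category label -> votes needed in that category."""
    counts = {cat: 0 for cat in required}
    for name in acks:
        for cat in standby_categories.get(name, ()):
            if cat in counts:
                counts[cat] += 1
    return all(counts[cat] >= need for cat, need in required.items())
```

A plain quorum is the degenerate case with one category covering every standby; naming a "favorite" standby is a category containing only it, with a required count of 1.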
Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] host name support in pg_hba.conf
On 10/06/2010 04:05 AM, Peter Eisentraut wrote: On tis, 2010-10-05 at 22:17 -0400, Tom Lane wrote: So far as I can find, there is *no* standard mandating that localhost means the loopback address. Should we then change pgstat.c to use IP addresses instead of hardcoding "localhost"? I understood Tom to be saying we should not rely on "localhost" for authentication, not that we shouldn't use it at all. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 13:41, Magnus Hagander wrote: That's only for a narrow definition of availability. For a lot of people, having access to your data isn't considered availability unless you can trust the data... Ok, fair enough. For that, synchronous replication in the "wait forever" mode is the only alternative. That on its own doesn't give you any boost in availability, on the contrary, but coupled with suitable clustering tools to handle failover and deciding when the standby is dead, you can achieve that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 6, 2010 at 10:17, Heikki Linnakangas wrote: > On 06.10.2010 11:09, Fujii Masao wrote: >> >> On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas >> wrote: >>> >>> No. Synchronous replication does not help with availability. It allows >>> you >>> to achieve zero data loss, ie. if the master dies, you are guaranteed >>> that >>> any transaction that was acknowledged as committed, is still committed. >> >> Hmm.. but we can increase availability without any data loss by using >> synchronous >> replication. Many people have already been using synchronous >> replication software >> such as DRBD for that purpose. > > Sure, but it's not the synchronous aspect that increases availability. It's > the replication aspect, and we already have that. Making the replication > synchronous allows zero data loss in case the master suddenly dies, but it > comes at the cost of availability. That's only for a narrow definition of availability. For a lot of people, having access to your data isn't considered availability unless you can trust the data... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gincostestimate
On Wed, Sep 8, 2010 at 1:02 AM, Teodor Sigaev wrote: > Fixed, and slightly reworked to be more clear. > Attached patch is based on your patch. The patch will improve the accuracy of plans using GIN indexes. It only adds block-level statistics information into the meta pages in GIN indexes. Data-level statistics are not collected by the patch, and there are no changes in pg_statistic. The stats page is updated only in VACUUM. ANALYZE doesn't update the information at all. In addition, REINDEX, VACUUM FULL, and CLUSTER reset the information to zero, which is not desirable. Is it possible to fill the statistic fields during bulk index build? No one wants to run VACUUM after VACUUM FULL to update the GIN stats. We don't have any methods to dump the meta information at all. It might be internal information, but some developers and debuggers might want such kinds of tools. Contrib/pageinspect might be a good location to have such a function; it has bt_metap(). The patch can be applied cleanly, no compiler warnings, and it passed all existing regression tests. There is no additional documentation and there are no regression tests -- I'm not sure whether we should have them. If the patch is an internal improvement, docs are not needed. -- Itagaki Takahiro -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] is sync rep stalled?
Tom Lane writes: > I think the point here is that it's possible to have sync-rep > configurations in which it's impossible to take a base backup. Sorry to be slow. I still don't understand that problem. I can understand why people want "wait forever", but I can't understand when the following strange idea applies: consider my non-ready standby there as a full member of the distributed setup already. I've been making plenty of noise about this topic in the past, at the beginning of plans for SR in 9.0 IIRC, pushing Heikki into having a worked-out state machine to figure out what are the known states of a standby and what we can do with each. We've cancelled that and said it would maybe be necessary for Synchronous Replication. Here we go, right? So, first things first: when is it a good idea to treat a standby that hasn't yet had its base backup (let alone validated that, after taking it, the master still has enough WAL for the backup to be valid as far as initialising the slave goes) as someone we wait forever on? I say a standby is registered when it's currently "attached" and already able to keep up in async. That's a time when you can slow down the master until this new member catches up to full sync or whatever you've set up. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support Lack of Google and archives-fu today means no link to those mails. Yet… -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep at Oct 5
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote: > Also on the topic of failover how do we want to deal with the master > failing over. Say M->{S1,S2} and M fails and we promote S1 to M1. Can > M1->S2? What if S2 was further along in processing than S1 when M > failed? I don't think we want to take on this complexity for 9.1 but > this means that after M fails you won't have a synchronous replica until > you rebuild or somehow reset S2. Those are problems that can be resolved, but that is the current state. The trick, I guess, is to promote the correct standby. Those are generic issues, not related to any specific patch. Thanks for keeping those issues in the limelight. > > == Path Minimization == > > > > We want to be able to minimize and control the path of data transfer, > > * so that the current master doesn't have initiate transfer to all > > dependent nodes, thereby reducing overhead on master > > * so that if the path from current master to descendent is expensive we > > would minimize network costs. > > > > This requirement is commonly known as "relaying". > > > > In its most simply stated form, we want one standby to be able to get > > WAL data from another standby. e.g. M -> S -> S. Stating the problem in > > that way misses out on the actual requirement, since people would like > > the arrangement to be robust in case of failures of M or any S. If we > > specify the exact arrangement of paths then we need to respecify the > > arrangement of paths if a server goes down. > > Are we going to allow these paths to be reconfigured on a live cluster? > If we have M->S1->S2 and we want to reconfigure S2 to read from M then > S2 needs to get the data that has already been committed on S1 from > somewhere (either S1 or M). This has solutions but it adds to the > complexity. Maybe not for 9.1 If you switch from M -> S1 -> S2 to M -> (S1, S2) it should work fine. 
At the moment that needs a shutdown/restart, but that could easily be done with a disconnect/reconnect following a file reload. The problem is how much WAL is stored on (any) node. Currently that is wal_keep_segments, which doesn't work very well, but I've seen no better ideas that cover all important cases. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 10:53 AM, Heikki Linnakangas wrote: > Wow, that is really short. Are you sure? I have no first-hand experience > with DRBD, Neither do I. > and reading that man page, I get the impression that the > timeout is just for deciding that the TCP connection is dead. There is > also the ko-count parameter, which defaults to zero. I would guess that > ko-count=0 is "wait forever", while ko-count=1 is what you described, > but I'm not sure. Yeah, sounds more likely. Then I'm surprised that I didn't find any warning that Protocol C definitely reduces availability (with the ko-count=0 default, that is). Instead, they only state that it's the most used replication mode, which really makes me wonder. [1] Sorry for adding confusion by not researching properly. Regards Markus Wanner [1] DRBD Replication Modes http://www.drbd.org/users-guide-emb/s-replication-protocols.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 11:49, Fujii Masao wrote: On Wed, Oct 6, 2010 at 5:17 PM, Heikki Linnakangas wrote: Sure, but it's not the synchronous aspect that increases availability. It's the replication aspect, and we already have that. Making the replication synchronous allows zero data loss in case the master suddenly dies, but it comes at the cost of availability. Yep. But I mean that the synchronous aspect is helpful to increase the availability of the system which requires no data loss. In asynchronous replication, when the master goes down, we have to salvage the missing WAL for the standby from the failed master to avoid data loss. This would take a very long time and decrease the availability of the system which doesn't accept any data loss. Since the synchronous approach doesn't require such salvaging, it can increase the availability of such a system. In general, salvaging the WAL that was not sent to the standby yet is outright impossible. You can't achieve zero data loss with asynchronous replication at all. If we want only no data loss, we only have to implement the wait-forever option. But if we take the above-mentioned availability into consideration, the return-immediately option would also be required. In some (many, I think) cases, I think that we need to consider availability and no data loss together, and consider the balance of them. If you need both, you need three servers as Simon pointed out earlier. There is no way around that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 11:39, Markus Wanner wrote: On 10/06/2010 10:17 AM, Heikki Linnakangas wrote: On 06.10.2010 11:09, Fujii Masao wrote: Hmm.. but we can increase availability without any data loss by using synchronous replication. Many people have already been using synchronous replication software such as DRBD for that purpose. Sure, but it's not the synchronous aspect that increases availability. It's the replication aspect, and we already have that. ..the *asynchronous* replication aspect, yes. The drbd.conf man page [1] describes parameters of DRBD. It's worth noting that even in "Protocol C" (synchronous mode), they sport a timeout of only 6 seconds (by default). Wow, that is really short. Are you sure? I have no first-hand experience with DRBD, and reading that man page, I get the impression that the timeout is just for deciding that the TCP connection is dead. There is also the ko-count parameter, which defaults to zero. I would guess that ko-count=0 is "wait forever", while ko-count=1 is what you described, but I'm not sure. It's not hard to imagine the master failing in a way that first causes the connection to the standby to drop, and the disk failing 6 seconds later. A fire that destroys the network cable first and then spreads to the disk array for example. [1]: drbd.conf man page: http://www.drbd.org/users-guide/re-drbdconf.html -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 6, 2010 at 5:17 PM, Heikki Linnakangas wrote: > On 06.10.2010 11:09, Fujii Masao wrote: >> >> On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas >> wrote: >>> >>> No. Synchronous replication does not help with availability. It allows >>> you >>> to achieve zero data loss, ie. if the master dies, you are guaranteed >>> that >>> any transaction that was acknowledged as committed, is still committed. >> >> Hmm.. but we can increase availability without any data loss by using >> synchronous >> replication. Many people have already been using synchronous >> replication software >> such as DRBD for that purpose. > > Sure, but it's not the synchronous aspect that increases availability. It's > the replication aspect, and we already have that. Making the replication > synchronous allows zero data loss in case the master suddenly dies, but it > comes at the cost of availability. Yep. But I mean that the synchronous aspect is helpful to increase the availability of the system which requires no data loss. In asynchronous replication, when the master goes down, we have to salvage the missing WAL for the standby from the failed master to avoid data loss. This would take a very long time and decrease the availability of the system which doesn't accept any data loss. Since the synchronous approach doesn't require such salvaging, it can increase the availability of such a system. If we want only no data loss, we only have to implement the wait-forever option. But if we take the above-mentioned availability into consideration, the return-immediately option would also be required. In some (many, I think) cases, I think that we need to consider availability and no data loss together, and consider the balance of them. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 10:17 AM, Heikki Linnakangas wrote: > On 06.10.2010 11:09, Fujii Masao wrote: >> Hmm.. but we can increase availability without any data loss by using >> synchronous >> replication. Many people have already been using synchronous >> replication software >> such as DRBD for that purpose. > > Sure, but it's not the synchronous aspect that increases availability. > It's the replication aspect, and we already have that. ..the *asynchronous* replication aspect, yes. The drbd.conf man page [1] describes parameters of DRBD. It's worth noting that even in "Protocol C" (synchronous mode), they sport a timeout of only 6 seconds (by default). After that, the primary node proceeds without any kind of guarantee (which can be thought of as switching to async replication). Just as Simon proposes for Postgres as well. Maybe that really is enough for now. Everybody who needs stricter durability guarantees needs to wait for Postgres-R ;-) Regards Markus Wanner [1]: drbd.conf man page: http://www.drbd.org/users-guide/re-drbdconf.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Triggers on VIEWs
On 5 October 2010 21:17, Bernd Helmle wrote: > Basic summary of this patch: > Thanks for the review. > * The patch includes a fairly complete discussion about INSTEAD OF triggers > and their usage on views. There are also additional enhancements to the RULE > documentation, which seems, given that this might supersede the usage of > RULES for updatable views, reasonable. > > * The patch passes regression tests and comes with a bunch of its own > regression tests. I think they are complete, they cover statement and row > level triggers and show the usage for JOINed views for example. > > * I've checked against a draft of the SQL standard, the behavior of the > patch seems to match the spec (my copy might be out of date, however). > > * The code looks pretty good to me, there are some low level error messages > exposing some implementation details, which could be changed (e.g. "wholerow > is NULL"), but given that this is most of the time unexpected and is used in > some older code as well, this doesn't seem very important. > Hopefully that error should never happen, since it would indicate a bug in the code rather than a user error. > * The implementation introduces the notion of "wholerow". This is a junk > target list entry which allows the executor to carry the view information to > an INSTEAD OF trigger. In case of DELETE/UPDATE, the rewriter is responsible > for injecting the new "wholerow" TLE and making the original query point to a > new range table entry (correct me if I'm wrong), which is based on the > view's query. I'm not sure I'm happy with the notion of "wholerow" here, > maybe "viewrow" or "viewtarget" is more descriptive? > That's a good description of how it works. I chose "wholerow" because that matched similar terminology used already, for example in preptlist.c when doing FOR UPDATE/SHARE on a view. I don't feel strongly about it, but my inclination is to stick with "wholerow" unless someone feels strongly otherwise. 
> * I'm inclined to say that INSTEAD OF triggers have less overhead than > RULES, but this is not proven yet with a reasonable benchmark. > It's difficult to come up with a general statement on performance because there are so many variables that might affect it. In a few simple tests, I found that for repeated small updates RULEs and TRIGGERs perform roughly the same, but for bulk updates (one query updating 1000s of rows) a RULE is best. > I would like to do some more tests/review, but I'm going to mark this patch as > "Ready for Committer", so that someone more qualified on the executor part > can have a look at it during this commitfest, if that's okay for us? > Thanks for looking at it. I hope this is useful functionality to make it easier to write updatable views, and perhaps it will help with implementing auto-updatable views too. Cheers, Dean > -- > Thanks > > Bernd > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 11:09, Fujii Masao wrote: On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas wrote: No. Synchronous replication does not help with availability. It allows you to achieve zero data loss, ie. if the master dies, you are guaranteed that any transaction that was acknowledged as committed, is still committed. Hmm.. but we can increase availability without any data loss by using synchronous replication. Many people have already been using synchronous replication software such as DRBD for that purpose. Sure, but it's not the synchronous aspect that increases availability. It's the replication aspect, and we already have that. Making the replication synchronous allows zero data loss in case the master suddenly dies, but it comes at the cost of availability. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas wrote: > No. Synchronous replication does not help with availability. It allows you > to achieve zero data loss, ie. if the master dies, you are guaranteed that > any transaction that was acknowledged as committed, is still committed. Hmm.. but we can increase availability without any data loss by using synchronous replication. Many people have already been using synchronous replication software such as DRBD for that purpose. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 08:31 AM, Heikki Linnakangas wrote: > On 06.10.2010 01:14, Josh Berkus wrote: >> Last I checked, our goal with synch standby was to increase availability, >> not decrease it. > > No. Synchronous replication does not help with availability. It allows > you to achieve zero data loss, ie. if the master dies, you are > guaranteed that any transaction that was acknowledged as committed, is > still committed. Strictly speaking, it even reduces availability. Which is why nobody actually wants *only* synchronous replication. Instead they use quorum commit or semi-synchronous (shudder) replication, which only requires *some* nodes to be in sync, but effectively replicates asynchronously to the others. From that point of view, the requirement of having one synch and two async standbies is pretty much the same as having three synch standbies with a quorum commit of 1. (Except for additional availability of the latter variant, because in case of a failure of the one sync standby, any of the others can take over without admin intervention). Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] host name support in pg_hba.conf
On tis, 2010-10-05 at 22:17 -0400, Tom Lane wrote: > So far as I can find, there is *no* standard > mandating that localhost means the loopback address. Should we then change pgstat.c to use IP addresses instead of hardcoding "localhost"? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 6, 2010 at 10:52 AM, Jeff Davis wrote: > I'm not sure I entirely understand. I was concerned about the case of a > standby server being allowed to lag behind the rest by a large number of > WAL records. That can't happen in the "wait for all servers to apply" > case, because the system would become unavailable rather than allow a > significant difference in the amount of WAL applied. > > I'm not saying that an unavailable system is good, but I don't see how > my particular complaint applies to the "wait for all servers to apply" > case. > > The case I was worried about is: > * 1 master and 2 standby > * The rule is "wait for at least one standby to apply the WAL" > > In your notation, I believe that's M -> { S1, S2 } > > In that case, if one S1 is just a little faster than S2, then S2 might > build up a significant queue of unapplied WAL. Then, when S1 goes down, > there's no way for the slower one to acknowledge a new transaction > without playing through all of the unapplied WAL. > > Intuitively, the administrator would think that he was getting both HA > and redundancy, but in reality the availability is no better than if > there were only two servers (M -> S1), except that it might be faster to > replay the WAL then to set up a new standby (but that's not guaranteed). Agreed. This is similar to my previous complaint. http://archives.postgresql.org/pgsql-hackers/2010-09/msg00946.php This problem would happen even if we fix the quorum to 1 as Josh proposes. To avoid this, the master must wait for ACK from all the connected synchronous standbys. I think that this is likely to happen especially when we choose the 'apply' replication level, because that level can easily make a synchronous standby lag due to conflicts between recovery and read-only queries. 
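The failure mode Jeff describes can be illustrated with a toy simulation; the record counts and apply rates below are invented for illustration, and the model tracks nothing but apply progress:

```python
# Toy simulation of the complaint: with M -> {S1, S2} and "ack as soon as at
# least one standby has applied", a slightly slower S2 quietly accumulates
# unapplied WAL.  All numbers are made up for illustration.
def run(records, s1_per_tick=10, s2_per_tick=9):
    s1_applied = s2_applied = committed = 0
    while committed < records:
        s1_applied = min(records, s1_applied + s1_per_tick)
        s2_applied = min(records, s2_applied + s2_per_tick)
        # the master acknowledges once the *fastest* standby has applied
        committed = max(s1_applied, s2_applied)
    # S2's backlog of acknowledged-but-unapplied WAL if S1 dies now
    return committed - s2_applied

backlog = run(1000)  # S2 is 100 records behind when S1 fails
```

The backlog grows linearly with the amount of committed WAL, so a standby that is only 10% slower is arbitrarily far behind by the time the fast one fails, which is exactly why "wait for ACK from all connected synchronous standbys" is needed to bound the gap.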
Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 04:31 AM, Simon Riggs wrote: > That situation would require two things > * First, you have set up async replication and you're not monitoring it > properly. Shame on you. The way I read it, Jeff is complaining about the timeout you propose that effectively turns sync into async replication in case of a failure. With a master that waits forever, the standby that's newly required for quorum certainly still needs its time to catch up. But it wouldn't live in danger of being "optimized away" for availability in case it cannot catch up within the given timeout. It's a tradeoff between availability and durability. > So it can occur in both cases, though it now looks to me that it's less > important an issue in either case. So I think this doesn't rate the term > dangerous to describe it any longer. The proposed timeout certainly still sounds dangerous to me. I'd rather recommend setting it to an incredibly huge value to minimize its dangers and get sync replication when that is what has been asked for. Use async replication for increased availability. Or do you envision any use case that requires a quorum of X standbies for normal operation but is just fine with only none to (X-1) standbies in case of failures? IMO that's when sync replication is most needed and when it absolutely should hold to its promises - even if it means to stop the system. There's no point in continuing operation if you cannot guarantee the minimum requirements for durability. If you happen to want such a thing, you had better rethink your minimum requirement (as performance for normal operations might benefit from a lower minimum as well). Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 01:14, Josh Berkus wrote: You start a new one from the latest base backup and let it catch up? Possibly modifying the config file in the master to let it know about the new standby, if we go down that path. This part doesn't seem particularly hard to me. Agreed, not sure of the issue there. See previous post. The critical phrase is *without restarting the master*. AFAICT, no patch has addressed the need to change the master's synch configuration without restarting it. It's possible that I'm not following something, in which case I'd love to have it pointed out. Fair enough. I agree it's important that the configuration can be changed on the fly. It's orthogonal to the other things discussed, so let's just assume for now that we'll have that. If not in the first version, it can be added afterwards. "pg_ctl reload" is probably how it will be done. There are some interesting behavioral questions there on what happens when the configuration is changed. Like if you first define that 3 out of 5 servers must acknowledge, and you have an in-progress commit that has received 2 acks already. If you then change the config to "2 out of 4" servers must acknowledge, is the in-progress commit now satisfied? From the admin point of view, the server that was removed from the system might've been one that had acknowledged already, and logically in the new configuration the transaction has only received 1 acknowledgment from those servers that are still part of the system. Explicitly naming the standbys in the config file would solve that particular corner case, but it would no doubt introduce other similar ones. But it's an orthogonal issue, we'll figure it out when we get there. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
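That reconfiguration corner case can be made concrete with a small sketch; the helper and its counting rule (only acks from servers still in the configuration count) are one possible interpretation, not a proposed design:

```python
# Sketch of the corner case: a commit needed 3 of 5 acks and has received 2;
# the config then changes to 2 of 4, dropping a server that already acked.
# Whether the commit is now "satisfied" depends on whether acks from removed
# servers still count.  Names and the counting rule are illustrative only.
def satisfied(acks_from, current_standbys, quorum):
    # count only acks from standbys that are still part of the configuration
    return len(acks_from & current_standbys) >= quorum

acks = {"s1", "s2"}                       # acks received before the reload
old_ok = satisfied(acks, {"s1", "s2", "s3", "s4", "s5"}, 3)   # 2 < 3
naive_ok = len(acks) >= 2                 # raw ack count meets the new quorum
strict_ok = satisfied(acks, {"s1", "s3", "s4", "s5"}, 2)      # only s1 counts
```

Under the strict rule the in-progress commit keeps waiting after the reload even though its raw ack count meets the new quorum, which matches the admin's intuition Heikki describes.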