Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi,

On 09/15/2010 08:54 PM, Robert Haas wrote:
> I think that the bar for committing to another in-core replication
> solution right now is probably fairly high.

I'm not trying to convince you to accept the Postgres-R patch... at least not now.

<showing-off>
BTW, that'd be what I call a huge patch:

bgworkers, excluding dynshmem and imessages:
  34 files changed, 2910 insertions(+), 1421 deletions(-)

from there to Postgres-R:
  98 files changed, 14856 insertions(+), 230 deletions(-)
</showing-off>

> I am pretty doubtful that our current architecture is going to get us
> to the full feature set we'd eventually like to have - multi-master,
> partial replication, etc.

Would be hard to do, due to the (physical) format of WAL, yes. That's why Postgres-R uses its own (logical) wire format.

> But we're not ever going to have ten replication solutions in core,
> so we need to think pretty carefully about what we accept.

That's very understandable.

> That conversation probably needs to start from the other end - is the
> overall architecture correct for us? - before we get down to specific
> patches. On the other hand, I'm very interested in laying the
> groundwork for parallel query

Cool. Maybe we should take another look at bgworkers as soon as a parallel querying feature gets planned?

> and I think there are probably a number of bits of architecture, both
> from this project and Postgres-XC, that could be valuable
> contributions to PostgreSQL;

(...note that Postgres-R is license compatible, as opposed to the GPL'ed Postgres-XC project...)

> however, in neither case do I expect them to be accepted without
> significant modification.

Sure, that's understandable as well. I've published this part of the infrastructure to get some feedback as early as possible on that part of Postgres-R.
As you can certainly imagine, it's important for me that any modification to such a patch from Postgres-R remains compatible with what I use it for in Postgres-R and doesn't cripple any functionality there, because that'd probably create more work for me than not getting the patch accepted upstream at all.

> I'm saying it's hard to think about committing any of them because
> they aren't really independent of each other or of other parts of
> Postgres-R.

As long as you don't consider imessages and dynshmem a part of Postgres-R, they are independent of the rest of Postgres-R in the technical sense. And for any kind of parallel querying feature, imessages and dynshmem might be of help as well. So I currently don't see where I could de-couple these patches any further. If you have a specific requirement, please don't hesitate to ask.

> I feel like there is an antagonistic thread to this conversation, and
> some others that we've had. I hope I'm misreading that, because it's
> not my intent to piss you off. I'm just offering my honest feedback.
> Your mileage may vary; others may feel differently; none of it is
> personal.

That's absolutely fine. I'm thankful for your feedback. Also note that I initially didn't even want to add the bgworker patches to the commit fest. I've de-coupled and published these separately from Postgres-R with a) the hope of getting feedback (more than for the overall Postgres-R patch) and b) the aim of showing others that such a facility exists and is ready to be reused. I didn't really expect them to get accepted to Postgres core at the moment. But the Postgres team normally asks for sharing concepts and ideas as early as possible...

> OK, I think I understand what you're trying to say now. I guess I
> feel like the ideal architecture for any sort of solution that needs
> a pool of workers would be to keep around the workers that most
> recently proved to be useful. Upon needing a new worker, you look for
> one that's available and already bound to the correct database.
> If you find one, you assign him to the new task.

That's mostly how bgworkers are designed, yes. The min/max idle background worker GUCs allow loose control over how many spare processes you want to allow hanging around doing nothing.

> If not, you find the one that's been idle longest and either (a) kill
> him off and start a new one that is bound to the correct database or,
> even better, (b) tell him to flush his caches and rebind to the
> correct database.

Hm... sorry if I didn't express this more clearly. What I'm trying to say is that (b) isn't worth implementing, because it doesn't offer enough of an improvement over (a). The only saving would be the fork() and some basic process initialization. Being able to re-use a bgworker connected to the correct database already gives you most of the benefit, namely not having to fork() *and* re-connect to the database for every job.

Back to the technical issues, let me try to summarize the feedback and what I do with it. In general, there's not much use for bgworkers with autovacuum as the only background job. I agree. Tom raised the 'lots of databases' issue. I agree that the
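The pooling policy discussed above - prefer an idle worker already bound to the right database, otherwise evict the longest-idle worker and start a fresh one (Robert's variant (a)) - can be sketched roughly as follows. This is a hypothetical illustration in Python, not actual PostgreSQL code; the names `Worker`, `WorkerPool`, and `assign` are invented for the example.

```python
import time

class Worker:
    """Stand-in for a background worker process (hypothetical)."""
    def __init__(self, dbname):
        self.dbname = dbname              # database this worker is bound to
        self.idle_since = time.monotonic()
        self.busy = False

class WorkerPool:
    """Coordinator-side bookkeeping for variant (a): reuse an idle
    worker already bound to the requested database; otherwise kill
    the longest-idle worker and start a new one for that database."""
    def __init__(self, max_workers):
        self.max_workers = max_workers
        self.workers = []

    def assign(self, dbname):
        # 1. Prefer an idle worker already connected to dbname:
        #    saves both the fork() and the database re-connect.
        for w in self.workers:
            if not w.busy and w.dbname == dbname:
                w.busy = True
                return w
        # 2. Pool full: terminate the longest-idle worker
        #    (variant (b) would instead tell it to flush caches
        #    and rebind, saving only the fork()).
        if len(self.workers) >= self.max_workers:
            idle = [w for w in self.workers if not w.busy]
            if not idle:
                return None    # all workers busy; caller must wait
            victim = min(idle, key=lambda w: w.idle_since)
            self.workers.remove(victim)
        # 3. Start a new worker bound to dbname (stands in for
        #    fork() plus connecting to the database).
        w = Worker(dbname)
        w.busy = True
        self.workers.append(w)
        return w

    def release(self, w):
        # Job done: keep the worker around as a bound spare.
        w.busy = False
        w.idle_since = time.monotonic()
```

As the thread notes, step 1 is where most of the benefit lies; step 3 is the expensive path that a per-database pool of spares avoids.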
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Thu, Sep 16, 2010 at 4:47 AM, Markus Wanner mar...@bluegap.ch wrote:
> <showing-off>
> BTW, that'd be what I call a huge patch:
> bgworkers, excluding dynshmem and imessages:
>   34 files changed, 2910 insertions(+), 1421 deletions(-)
> from there to Postgres-R:
>   98 files changed, 14856 insertions(+), 230 deletions(-)
> </showing-off>

Yeah, that's huge. :-)

>> That conversation probably needs to start from the other end - is
>> the overall architecture correct for us? - before we get down to
>> specific patches. On the other hand, I'm very interested in laying
>> the groundwork for parallel query
>
> Cool. Maybe we should take another look at bgworkers, as soon as a
> parallel querying feature gets planned?

Well, that will obviously depend somewhat on the wishes of whoever decides to work on parallel query, but it seems reasonable to me. It would be nice to get some pieces of this committed incrementally, but as I say, I fear there is too much dependency on what might happen later, at least the way things are structured now.

>> and I think there are probably a number of bits of architecture,
>> both from this project and Postgres-XC, that could be valuable
>> contributions to PostgreSQL;
>
> (...note that Postgres-R is license compatible, as opposed to the
> GPL'ed Postgres-XC project...)

Yeah. +1 for license compatibility.

> As you can certainly imagine, it's important for me that any
> modification to such a patch from Postgres-R would still be
> compatible with what I use it for in Postgres-R and not cripple any
> functionality there, because that'd probably create more work for me
> than not getting the patch accepted upstream at all.

That's an understandable goal, but it may be difficult to achieve.

> As long as you don't consider imessages and dynshmem a part of
> Postgres-R, they are independent of the rest of Postgres-R in the
> technical sense. And for any kind of parallel querying feature,
> imessages and dynshmem might be of help as well. So I currently don't
> see where I could de-couple these patches any further.

I agree.
I've already said my piece on how I think that stuff would need to be reworked to be acceptable, so we might have to agree to disagree on those, especially if your goal is to get something committed that doesn't involve a major rewrite on your end.

> I didn't really expect them to get accepted to Postgres core at the
> moment. But the Postgres team normally asks for sharing concepts and
> ideas as early as possible...

Absolutely.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Morning,

On 09/16/2010 04:26 PM, Robert Haas wrote:
> I agree. I've already said my piece on how I think that stuff would
> need to be reworked to be acceptable, so we might have to agree to
> disagree on those, especially if your goal is to get something
> committed that doesn't involve a major rewrite on your end.

Just for clarification: you are referring to imessages and dynshmem here, right? I agree that dynshmem needs to be reworked and rethought. And imessages simply depends on dynshmem.

If you are referring to the bgworker stuff, I'm not quite clear about what I could do to make bgworker more acceptable (except perhaps for removing the dependency on imessages).

Regards

Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Thu, Sep 16, 2010 at 1:20 PM, Markus Wanner mar...@bluegap.ch wrote:
> On 09/16/2010 04:26 PM, Robert Haas wrote:
>> I agree. I've already said my piece on how I think that stuff would
>> need to be reworked to be acceptable, so we might have to agree to
>> disagree on those, especially if your goal is to get something
>> committed that doesn't involve a major rewrite on your end.
>
> Just for clarification: you are referring to imessages and dynshmem
> here, right? I agree that dynshmem needs to be reworked and
> rethought. And imessages simply depends on dynshmem.

Yes, I was referring to imessages and dynshmem.

> If you are referring to the bgworker stuff, I'm not quite clear about
> what I could do to make bgworker more acceptable. (Except perhaps for
> removing the dependency on imessages.)

I'm not sure, either. It would be nice if there were a way to create a general facility here that we could then build various applications on, but I'm not sure whether that's the case. We had some back-and-forth about what is best for replication vs. what is best for vacuum vs. what is best for parallel query. If we could somehow conceive of a system that could serve all of those needs without introducing any more configuration complexity than what we have now, that would of course be very interesting. But judging by your comments I'm not very sure such a thing is feasible, so perhaps wait-and-see is the best approach.
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi,

On 09/15/2010 03:44 AM, Robert Haas wrote:
> Hmm. So what happens if you have 1000 databases with a minimum of 1
> worker per database and an overall limit of 10 workers?

The first 10 databases would get an idle worker. As soon as real jobs arrive, the idle workers on databases that don't have any pending jobs get terminated in favor of the databases for which there are pending jobs. Admittedly, that mechanism isn't too clever yet; i.e. if there always are enough jobs for one database, the others could starve.

With 1000 databases and a max of only 10 workers, chances of having a spare worker for the database that gets the next job are pretty low, yes. But that's the case with the proposed 5-minute timeout as well. Lowering that timeout wouldn't increase the chance. And while it might make the start of a new bgworker quicker in the above-mentioned case, I think there's not much advantage over just setting max_idle_background_workers = 0. OTOH such a timeout would be easy enough to implement. The admin would be faced with yet another GUC, though.

> Hmm, I see. That's probably not helpful for autovacuum, but I can see
> it being useful for replication.

Glad to hear.

> I still think maybe we ought to try to crack the nut of allowing
> backends to rebind to a different database. That would simplify
> things here a good deal, although then again maybe it's too complex
> to be worth it.

Also note that it would re-introduce some of the costs we try to avoid by keeping the connected bgworker around. And in case you can afford having at least a few spare bgworkers per database (i.e. fewer than 10 or 20 databases), potential savings seem to be negligible again.

Regards

Markus Wanner
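The low hit rate with many databases can be illustrated with a toy simulation (hypothetical, not PostgreSQL code): with far more databases than pooled workers and uniformly distributed jobs, the chance that a job finds a spare worker already bound to its database stays close to pool_size / n_databases, regardless of idle-timeout tuning.

```python
import random

def spare_worker_hit_rate(n_databases, pool_size, n_jobs, seed=42):
    """Fraction of jobs that find an idle worker already bound to the
    right database, under a simple evict-longest-idle pool policy.
    Jobs target databases uniformly at random."""
    rng = random.Random(seed)
    pool = []      # databases the pooled workers are bound to,
                   # ordered from longest-idle to most recently used
    hits = 0
    for _ in range(n_jobs):
        db = rng.randrange(n_databases)
        if db in pool:
            hits += 1
            pool.remove(db)    # reuse: worker becomes most recent
        elif len(pool) >= pool_size:
            pool.pop(0)        # evict the longest-idle worker
        pool.append(db)        # worker runs the job, then idles
    return hits / n_jobs

# With 1000 databases and only 10 workers, reuse is rare
# (about 10/1000 = 1% in expectation); with 10 databases,
# nearly every job finds a bound spare worker.
print(spare_worker_hit_rate(1000, 10, 100_000))
print(spare_worker_hit_rate(10, 10, 100_000))
```

This is consistent with the point above: when databases vastly outnumber workers, neither a 5-minute timeout nor max_idle_background_workers = 0 changes the picture much.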
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Wed, Sep 15, 2010 at 2:48 AM, Markus Wanner mar...@bluegap.ch wrote:
>> Hmm. So what happens if you have 1000 databases with a minimum of 1
>> worker per database and an overall limit of 10 workers?
>
> The first 10 databases would get an idle worker. As soon as real jobs
> arrive, the idle workers on databases that don't have any pending
> jobs get terminated in favor of the databases for which there are
> pending jobs. Admittedly, that mechanism isn't too clever, yet. I.e.
> if there always are enough jobs for one database, the others could
> starve.

I haven't scrutinized your code, but it seems like the minimum-per-database might be complicating things more than necessary. You might find that you can make the logic simpler without that. I might be wrong, though.

I guess the real issue here is whether it's possible to, and whether you're interested in, extracting a committable subset of this work, and if so what that subset should look like. There's sort of a chicken-and-egg problem with large patches; if you present them as one giant monolithic patch, they're too large to review. But if you break them down into smaller patches, it doesn't really fix the problem unless the pieces have independent value. Even in the two years I've been involved in the project, a number of different contributors have gone through the experience of submitting a patch that only made sense if you assumed that the follow-on patch was also going to get accepted, and as no one was willing to assume that, the first patch didn't get committed either. Where people have been able to break things down into a series of small to medium-sized incremental improvements, things have gone more smoothly. For example, Simon was able to get a patch to start the bgwriter during archive recovery committed to 8.4. That didn't have a lot of independent value, but it had some, and it paved the way for Hot Standby in 9.0.
Had someone thought of a way to decompose that project into more than two truly independent pieces, I suspect it might have even gone more smoothly (although of course that's an arguable point and YMMV).

>> I still think maybe we ought to try to crack the nut of allowing
>> backends to rebind to a different database. That would simplify
>> things here a good deal, although then again maybe it's too complex
>> to be worth it.
>
> Also note that it would re-introduce some of the costs we try to
> avoid with keeping the connected bgworker around.

How?
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Robert,

On 09/15/2010 07:23 PM, Robert Haas wrote:
> I haven't scrutinized your code but it seems like the
> minimum-per-database might be complicating things more than
> necessary. You might find that you can make the logic simpler without
> that. I might be wrong, though.

I still think of that as a feature, not something I'd like to simplify away. Are you arguing it could raise the chance of adoption of bgworkers into Postgres? I'm not seeing your point here.

> I guess the real issue here is whether it's possible to, and whether
> you're interested in, extracting a committable subset of this work,
> and if so what that subset should look like.

Well, as it doesn't currently provide any real benefit for autovacuum, it depends on how much hackers like to prepare for something like Postgres-R or parallel querying.

> There's sort of a chicken-and-egg problem with large patches; if you
> present them as one giant monolithic patch, they're too large to
> review. But if you break them down into smaller patches, it doesn't
> really fix the problem unless the pieces have independent value.

I don't quite get what you are trying to say here. I split the bgworker project from Postgres-R into 6 separate patches. Are you saying that's too few or too many? You are welcome to argue about independent patches; i.e. this patch 1 of 6 (as $SUBJECT implies) might have some value, according to Alvaro. Admittedly, patch 2 of 6 is the biggest and most important chunk of functionality of the whole set.

>> Also note that it would re-introduce some of the costs we try to
>> avoid with keeping the connected bgworker around.
>
> How?

I'm talking about the cost of connecting to a database (and disconnecting), most probably flushing caches, and very probably some kind of re-registering with the coordinator. Most of what a normal backend does at startup. About the only thing you'd save here is the fork() and very basic process setup. I really doubt that's worth the effort.
Having some more idle processes around doesn't hurt that much and solves the problem, I think.

Thanks for your feedback.

Regards

Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Wed, Sep 15, 2010 at 2:28 PM, Markus Wanner mar...@bluegap.ch wrote:
>> I guess the real issue here is whether it's possible to, and whether
>> you're interested in, extracting a committable subset of this work,
>> and if so what that subset should look like.
>
> Well, as it doesn't currently provide any real benefit for
> autovacuum, it depends on how much hackers like to prepare for
> something like Postgres-R or parallel querying.

I think that the bar for committing to another in-core replication solution right now is probably fairly high. I am pretty doubtful that our current architecture is going to get us to the full feature set we'd eventually like to have - multi-master, partial replication, etc. But we're not ever going to have ten replication solutions in core, so we need to think pretty carefully about what we accept. That conversation probably needs to start from the other end - is the overall architecture correct for us? - before we get down to specific patches. On the other hand, I'm very interested in laying the groundwork for parallel query, and I think there are probably a number of bits of architecture, both from this project and Postgres-XC, that could be valuable contributions to PostgreSQL; however, in neither case do I expect them to be accepted without significant modification.

>> There's sort of a chicken-and-egg problem with large patches; if you
>> present them as one giant monolithic patch, they're too large to
>> review. But if you break them down into smaller patches, it doesn't
>> really fix the problem unless the pieces have independent value.
>
> I don't quite get what you are trying to say here. I split the
> bgworker project from Postgres-R into 6 separate patches. Are you
> saying that's too few or too many?

I'm saying it's hard to think about committing any of them because they aren't really independent of each other or of other parts of Postgres-R.

I feel like there is an antagonistic thread to this conversation, and some others that we've had.
I hope I'm misreading that, because it's not my intent to piss you off. I'm just offering my honest feedback. Your mileage may vary; others may feel differently; none of it is personal.

>>> Also note that it would re-introduce some of the costs we try to
>>> avoid with keeping the connected bgworker around.
>>
>> How?
>
> I'm talking about the cost of connecting to a database (and
> disconnecting), most probably flushing caches, and very probably some
> kind of re-registering with the coordinator. Most of what a normal
> backend does at startup. About the only thing you'd save here is the
> fork() and very basic process setup. I really doubt that's worth the
> effort. Having some more idle processes around doesn't hurt that much
> and solves the problem, I think.

OK, I think I understand what you're trying to say now. I guess I feel like the ideal architecture for any sort of solution that needs a pool of workers would be to keep around the workers that most recently proved to be useful. Upon needing a new worker, you look for one that's available and already bound to the correct database. If you find one, you assign him to the new task. If not, you find the one that's been idle longest and either (a) kill him off and start a new one that is bound to the correct database or, even better, (b) tell him to flush his caches and rebind to the correct database.
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Mon, Aug 30, 2010 at 11:30 AM, Markus Wanner mar...@bluegap.ch wrote:
> On 08/30/2010 04:52 PM, Tom Lane wrote:
>> Let me just point out that awhile back we got a *measurable*
>> performance boost by eliminating a single indirect fetch from the
>> buffer addressing code path.
>
> I'll take a look at that, thanks.
>
>> So I don't have any faith in untested assertions
>
> Neither do I. Thus I'm probably going to try my approach.

As a matter of project management, I am inclined to think that until we've hammered out this issue, there's not a whole lot useful that can be done on any of the BG worker patches. So I am wondering if we should set those to Returned with Feedback or bump them to a future CommitFest. The good news is that, after a lot of back and forth, I think we've identified the reason underpinning much of why Markus and I have been disagreeing about dynshmem and imessages - namely, whether or not it's possible to allocate shared_buffers as something other than one giant slab without taking an unacceptable performance hit.
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On 09/14/2010 06:26 PM, Robert Haas wrote:
> As a matter of project management, I am inclined to think that until
> we've hammered out this issue, there's not a whole lot useful that
> can be done on any of the BG worker patches. So I am wondering if we
> should set those to Returned with Feedback or bump them to a future
> CommitFest.

I agree in general. I certainly don't want to hold back the commit fest. What bugs me a bit is that I didn't really get much feedback regarding the *bgworker* portion of the code, especially as that's the part I'm most interested in feedback on. However, I currently don't have any time to work on these patches, so I'm fine with dropping them from the current commit fest.

> The good news is that, after a lot of back and forth, I think we've
> identified the reason underpinning much of why Markus and I have been
> disagreeing about dynshmem and imessages - namely, whether or not
> it's possible to allocate shared_buffers as something other than one
> giant slab without taking an unacceptable performance hit.

Agreed.

Regards

Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Excerpts from Markus Wanner's message of mar sep 14 12:56:59 -0400 2010:
> What bugs me a bit is that I didn't really get much feedback
> regarding the *bgworker* portion of code. Especially as that's the
> part I'm most interested in feedback.

I think we've had enough problems with the current design of forking a new autovac process every once in a while that I'd like to have them as permanent processes instead, waiting for orders from the autovac launcher. From that POV, bgworkers would make sense. I cannot promise a timely review, however :-(

--
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Alvaro Herrera alvhe...@commandprompt.com writes:
> I think we've had enough problems with the current design of forking
> a new autovac process every once in a while, that I'd like to have
> them as permanent processes instead, waiting for orders from the
> autovac launcher. From that POV, bgworkers would make sense.

That seems like a fairly large can of worms to open: we have never tried to make backends switch from one database to another, and I don't think I'd want to start such a project with autovac.

regards, tom lane
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Excerpts from Tom Lane's message of mar sep 14 13:46:17 -0400 2010:
> Alvaro Herrera alvhe...@commandprompt.com writes:
>> I think we've had enough problems with the current design of forking
>> a new autovac process every once in a while, that I'd like to have
>> them as permanent processes instead, waiting for orders from the
>> autovac launcher. From that POV, bgworkers would make sense.
>
> That seems like a fairly large can of worms to open: we have never
> tried to make backends switch from one database to another, and I
> don't think I'd want to start such a project with autovac.

Yeah, what I was thinking is that each worker would still die after completing the run, but a new one would be started immediately; it would go to sleep until a new assignment arrived. (What got me into this was the whole latch thing, actually.) This is a very raw idea, however, so don't mind me much.
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi,

On 09/14/2010 07:46 PM, Tom Lane wrote:
> Alvaro Herrera alvhe...@commandprompt.com writes:
>> I think we've had enough problems with the current design of forking
>> a new autovac process every once in a while, that I'd like to have
>> them as permanent processes instead, waiting for orders from the
>> autovac launcher. From that POV, bgworkers would make sense.

Okay, great.

> That seems like a fairly large can of worms to open: we have never
> tried to make backends switch from one database to another, and I
> don't think I'd want to start such a project with autovac.

They don't. Even with bgworker, every backend stays connected to the same database. You configure the min and max amounts of idle backends *per database*, plus the overall max of background workers, IIRC.

Regards

Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Markus Wanner mar...@bluegap.ch writes:
> On 09/14/2010 07:46 PM, Tom Lane wrote:
>> That seems like a fairly large can of worms to open: we have never
>> tried to make backends switch from one database to another, and I
>> don't think I'd want to start such a project with autovac.
>
> They don't. Even with bgworker, every backend stays connected to the
> same database. You configure the min and max amounts of idle backends
> *per database*. Plus the overall max of background workers, IIRC.

So there is a minimum of one avworker per database? That's a guaranteed nonstarter. There are many people with thousands of databases, but no need for thousands of avworkers. I'm also pretty unclear why you speak of min and max numbers of workers when the proposal (AIUI) is to have the workers there always, rather than have them come and go.

regards, tom lane
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi,

I'm glad discussion on this begins.

On 09/14/2010 07:55 PM, Tom Lane wrote:
> So there is a minimum of one avworker per database?

Nope, you can set that to 0. You don't *need* to keep idle workers around.

> That's a guaranteed nonstarter. There are many people with thousands
> of databases, but no need for thousands of avworkers.

Well, yeah, bgworkers are primarily designed to be used for Postgres-R, where you easily get more background workers than normal backends. And having idle backends around waiting for the next job is certainly preferable to having to re-connect every time. I've advertised the bgworker infrastructure for use for parallel querying as well. Again, that use case easily leads to having more background workers than normal backends. And you don't want to wait for them all to re-connect for every query they need to help with.

> I'm also pretty unclear why you speak of min and max numbers of
> workers when the proposal (AIUI) is to have the workers there always,
> rather than have them come and go.

This step 1 of the bgworker set of patches turns the av*launcher* (coordinator) into a permanent process (even if autovacuum is off). The background workers can still come and go. However, they don't necessarily *need* to terminate after having done their job. The coordinator controls them and requests new workers or commands idle ones to terminate *as required*.

I don't think there's that much difference from the current implementation. Setting both the min and max number of idle bgworkers to 0 should in fact give you the exact same behavior as we currently have: namely, terminating each av/bgworker after its job is done, never having idle workers around. Which might or might not be the optimal configuration for users with lots of databases; that's hard to predict, and it depends a lot on the load distribution over the databases and on how cleverly the coordinator manages the bgworkers.
Regards

Markus Wanner
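For illustration, the min/max idle scheme described above might look roughly like this in postgresql.conf. Only max_idle_background_workers is actually named upthread; the other two GUC names are assumed analogues for this sketch and may differ from the patch.

```ini
# Hypothetical bgworker pool settings (names modeled on the
# max_idle_background_workers GUC mentioned in this thread).

max_background_workers      = 10   # overall cap on worker processes
min_idle_background_workers = 0    # per database: no spare workers kept
max_idle_background_workers = 0    # 0/0 reproduces today's behavior:
                                   # each worker exits once its job is done
```

Raising the min/max idle values for a busy database trades memory for avoiding the fork() and database re-connect on each job, as discussed above.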
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On 09/14/2010 08:06 PM, Robert Haas wrote:
> One idea I had was to have autovacuum workers stick around for a
> period of time after finishing their work. When we need to autovacuum
> a database, first check whether there's an existing worker that we
> can use, and if so use him. If not, start a new one. If that puts us
> over the max number of workers, kill off the one that's been waiting
> the longest. But workers will exit anyway if not reused after a
> certain period of time.

That's pretty close to how bgworkers are implemented now, except for the need to terminate after a certain period of time. What is that intended to be good for? Especially considering that the avlauncher/coordinator knows the current amount of work (number of jobs) per database.

> The idea here would be to try to avoid all the backend startup costs:
> process creation, priming the caches, etc. But I'm not really sure
> it's worth the effort. I think we need to look for ways to further
> reduce the overhead of vacuuming, but this doesn't necessarily seem
> like the thing that would have the most bang for the buck.

Well, the pressure has simply been bigger for Postgres-R. It should be possible to do benchmarks using Postgres-R and compare against a max_idle_background_workers = 0 configuration that leads to termination and re-connecting for every remote transaction to be applied. However, that's not going to say anything about whether or not it's worth it for autovacuum.

Regards

Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Tue, Sep 14, 2010 at 1:56 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Tom Lane's message of mar sep 14 13:46:17 -0400 2010: Alvaro Herrera alvhe...@commandprompt.com writes: I think we've had enough problems with the current design of forking a new autovac process every once in a while, that I'd like to have them as permanent processes instead, waiting for orders from the autovac launcher. From that POV, bgworkers would make sense. That seems like a fairly large can of worms to open: we have never tried to make backends switch from one database to another, and I don't think I'd want to start such a project with autovac. Yeah, what I was thinking is that each worker would still die after completing the run, but a new one would be started immediately; it would go to sleep until a new assignment arrived. (What got me into this was the whole latch thing, actually.) This is a very raw idea however, so don't mind me much. What would be the advantage of that? One idea I had was to have autovacuum workers stick around for a period of time after finishing their work. When we need to autovacuum a database, first check whether there's an existing worker that we can use, and if so use him. If not, start a new one. If that puts us over the max number of workers, kill off the one that's been waiting the longest. But workers will exit anyway if not reused after a certain period of time. The idea here would be to try to avoid all the backend startup costs: process creation, priming the caches, etc. But I'm not really sure it's worth the effort. I think we need to look for ways to further reduce the overhead of vacuuming, but this doesn't necessarily seem like the thing that would have the most bang for the buck. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Tue, Sep 14, 2010 at 2:26 PM, Markus Wanner mar...@bluegap.ch wrote: On 09/14/2010 08:06 PM, Robert Haas wrote: One idea I had was to have autovacuum workers stick around for a period of time after finishing their work. When we need to autovacuum a database, first check whether there's an existing worker that we can use, and if so use him. If not, start a new one. If that puts us over the max number of workers, kill off the one that's been waiting the longest. But workers will exit anyway if not reused after a certain period of time. That's pretty close to how bgworkers are implemented now. Except for the need to terminate after a certain period of time. What is that intended to be good for? To avoid consuming system resources forever if they're not being used. Especially considering that the avlauncher/coordinator knows the current amount of work (number of jobs) per database. The idea here would be to try to avoid all the backend startup costs: process creation, priming the caches, etc. But I'm not really sure it's worth the effort. I think we need to look for ways to further reduce the overhead of vacuuming, but this doesn't necessarily seem like the thing that would have the most bang for the buck. Well, the pressure has simply been bigger for Postgres-R. It should be possible to do benchmarks using Postgres-R and compare against a max_idle_background_workers = 0 configuration that leads to termination and re-connecting for every remote transaction to be applied. Well, presumably that would be fairly disastrous. I would think, though, that you would not have a min/max number of workers PER DATABASE, but an overall limit on the upper size of the total pool - I can't see any reason to limit the minimum size of the pool, but I might be missing something. However, that's not going to say anything about whether or not it's worth it for autovacuum. 
Personally, my position is that if someone does something that is only a small improvement on its own but which has the potential to help with other things later, that's a perfectly legitimate patch and we should try to accept it. But if it's not a clear (even if small) win then the bar is a lot higher, at least in my book.
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On 09/14/2010 08:41 PM, Robert Haas wrote: To avoid consuming system resources forever if they're not being used. Well, what timeout would you choose? And how would you justify it compared to the amount of system resources consumed by an idle process sitting there waiting for a job? I'm not against such a timeout, but so far I felt that unlimited would be the best default. Well, presumably that would be fairly disastrous. I would think, though, that you would not have a min/max number of workers PER DATABASE, but an overall limit on the upper size of the total pool That already exists (in addition to the other parameters). - I can't see any reason to limit the minimum size of the pool, but I might be missing something. I tried to mimic what others do, for example apache pre-fork. Maybe it's just another way of trying to keep the overall resource consumption at a reasonable level. The minimum is helpful to eliminate waits for backends starting up. Note here that the coordinator can only request to fork one new bgworker at a time. It then needs to wait until that new bgworker registers with the coordinator before it can request further bgworkers from the postmaster. (That's due to the limitation in communication between the postmaster and coordinator). Personally, my position is that if someone does something that is only a small improvement on its own but which has the potential to help with other things later, that's a perfectly legitimate patch and we should try to accept it. But if it's not a clear (even if small) win then the bar is a lot higher, at least in my book. I don't think it's an improvement over the current autovacuum behavior. Not intended to be one. But it certainly shouldn't hurt, either. It only has the potential to help with other things, namely parallel querying. And of course replication (Postgres-R). Or any other kind of background job you come up with (where background means not requiring a client connection). 
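For reference, the pool-sizing knobs discussed in this exchange would presumably appear in postgresql.conf along these lines. The GUC names are the ones mentioned in the thread; the values and comments are purely illustrative:

```
# apache-prefork-style pool sizing (bgworker patches; illustrative values)
min_spare_background_workers = 2   # pre-forked idle workers, avoids startup waits
max_idle_background_workers = 8    # 0 mimics today's terminate-after-each-job behavior
```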
Regards Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Tue, Sep 14, 2010 at 2:59 PM, Markus Wanner mar...@bluegap.ch wrote: On 09/14/2010 08:41 PM, Robert Haas wrote: To avoid consuming system resources forever if they're not being used. Well, what timeout would you choose? And how would you justify it compared to the amount of system resources consumed by an idle process sitting there waiting for a job? I'm not against such a timeout, but so far I felt that unlimited would be the best default. I don't have a specific number in mind. 5 minutes? Well, presumably that would be fairly disastrous. I would think, though, that you would not have a min/max number of workers PER DATABASE, but an overall limit on the upper size of the total pool That already exists (in addition to the other parameters). Hmm. So what happens if you have 1000 databases with a minimum of 1 worker per database and an overall limit of 10 workers? - I can't see any reason to limit the minimum size of the pool, but I might be missing something. I tried to mimic what others do, for example apache pre-fork. Maybe it's just another way of trying to keep the overall resource consumption at a reasonable level. The minimum is helpful to eliminate waits for backends starting up. Note here that the coordinator can only request to fork one new bgworker at a time. It then needs to wait until that new bgworker registers with the coordinator before it can request further bgworkers from the postmaster. (That's due to the limitation in communication between the postmaster and coordinator). Hmm, I see. That's probably not helpful for autovacuum, but I can see it being useful for replication. I still think maybe we ought to try to crack the nut of allowing backends to rebind to a different database. That would simplify things here a good deal, although then again maybe it's too complex to be worth it. 
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi, On 08/27/2010 10:46 PM, Robert Haas wrote: What other subsystems are you imagining servicing with a dynamic allocator? If there were a big demand for this functionality, we probably would have been forced to implement it already, but that's not the case. We've already discussed the fact that there are massive problems with using it for something like shared_buffers, which is by far the largest consumer of shared memory. Understood. I certainly plan to look into that for a better understanding of the problems those pose for dynamically allocated memory. I think it would be great if we could bring some more flexibility to our memory management. There are really two layers of problems here. Full ACK. One is resizing the segment itself, and one is resizing structures within the segment. As far as I can tell, there is no portable API that can be used to resize the shm itself. For so long as that remains the case, I am of the opinion that any meaningful resizing of the objects within the shm is basically unworkable. So we need to solve that problem first. Why should resizing of the objects within the shmem be unworkable? Doesn't my patch(es) prove the exact opposite? Being able to resize objects within the shm requires some kind of underlying dynamic allocation. And I rather like to be in control of that allocator than having to deal with two dozen different implementations on different OSes and their libraries. There are a couple of possible solutions, which have been discussed here in the past. I currently don't have much interest in dynamic resizing. Being able to resize the overall amount of shared memory on the fly would be nice, sure. But the total amount of RAM in a server changes rather infrequently. Being able to use what's available more efficiently is what I'm interested in. That doesn't need any kind of additional or different OS level support. It's just a matter of making better use of what's available - within Postgres itself. 
Next, we have to think about how we're going to resize data structures within this expandable shm. Okay, that's where I'm getting interested. Many of these structures are not things that we can easily move without bringing the system to a halt. For example, it's difficult to see how you could change the base address of shared buffers without ceasing all system activity, at which point there's not really much advantage over just forcing a restart. Similarly with LWLocks or the ProcArray. I guess that's what Bruce wanted to point out by saying our data structures are mostly contiguous. I.e. not dynamic lists or hash tables, but plain simple arrays. Maybe that's a subjective impression, but I seem to hear complaints about their fixed size and inflexibility quite often. Try to imagine the flexibility that dynamic lists could give us. And if you can't move them, then how will you grow them if (as will likely be the case) there's something immediately following them in memory. One possible solution is to divide up these data structures into slabs. For example, we might imagine allocating shared_buffers in 1GB chunks. Why 1GB, with yet another layer of dynamic allocation within that? The buffers are (by default) 8K, so allocate in chunks of 8K. Or a tiny bit more, for all of the book-keeping stuff. To make this work, we'd need to change the memory layout so that each chunk would include all of the miscellaneous stuff that we need to do bookkeeping for that chunk, such as the LWLocks and buffer descriptors. That doesn't seem completely impossible, but there would be some performance penalty, because you could no longer index into shared buffers from a single base offset. AFAICT we currently have four fixed-size blocks to manage shared buffers: the buffer blocks themselves, the buffer descriptors, the strategy status (for the freelist) and the buffer lookup table. It's not obvious to me why these data structures should perform better than a dynamically allocated layout. 
One could rather argue that combining (some of) the bookkeeping stuff with data itself would lead to better locality and thus perform better. Instead, you'd need to determine which chunk contains the buffer you want, look up the base address for that chunk, and then index into the chunk. Maybe that overhead wouldn't be significant (or maybe it would); at any rate, it's not completely free. There's also the problem of handling the partial chunk at the end, especially if that happens to be the only chunk. This sounds way too complicated, yes. Use 8K chunks and most of the problems vanish. I think the problems for other arrays are similar, or more severe. I can't see, for example, how you could resize the ProcArray using this approach. Try not to think in terms of resizing, but dynamic allocation. I.e. being able to resize ProcArray (and thus being able to alter max_connections on the fly) would take a lot more work. Just
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
(Sorry, need to disable Ctrl-Return, which quite often sends mails earlier than I really want.. continuing my mail) On 08/27/2010 10:46 PM, Robert Haas wrote: Yeah, probably. I think designing something that works efficiently over a network is a somewhat different problem than designing something that works on an individual node, and we probably shouldn't let the designs influence each other too much. Agreed. Thus I've left out any kind of congestion avoidance stuff from imessages so far. There's no padding or sophisticated allocation needed. You just need a pointer to the last byte read (P1), the last byte allowed to be read (P2), and the last byte allocated (P3). Writers take a spinlock, advance P3, release the spinlock, write the message, take the spinlock, advance P2, release the spinlock, and signal the reader. That would block parallel writers (i.e. only one process can write to the queue at any time). I feel like there's probably some variant of this idea that works around that problem. The problem is that when a worker finishes writing a message, he needs to know whether to advance P2 only over his own message or also over some subsequent message that has been fully written in the meantime. I don't know exactly how to solve that problem off the top of my head, but it seems like it might be possible. I've tried pretty much that before. And failed. Because the allocation order (i.e. the time the message gets created in preparation for writing to it) isn't necessarily the same as the sending order (i.e. when the process has finished writing and decides to send the message). To satisfy the FIFO property WRT the sending order, you need to decouple allocation from the ordering (i.e. queuing logic). (And yes, it took me a while to figure out what was wrong in Postgres-R before I even noticed that design bug). Readers take the spinlock, read P1 and P2, release the spinlock, read the data, take the spinlock, advance P1, and release the spinlock. 
It would require copying data in case a process only needs to forward the message. That's a quick pointer dequeue and enqueue exercise ATM. If we need to do that, that's a compelling argument for having a single messaging area rather than one per backend. Absolutely, yes. But I'm not sure I see why we would need that sort of capability. Why wouldn't you just arrange for the sender to deliver the message directly to the final recipient? A process can read and even change the data of the message before forwarding it. Something the coordinator in Postgres-R does sometimes. (As it is the interface to the GCS and thus to the rest of the nodes in the cluster). For parallel querying (on a single node) that's probably less important a feature. So, they know in advance how large the message will be but not what the contents will be? What are they doing? Filling the message until it's (mostly) full and then continuing with the next one. At least that's how the streaming approach on top of imessages works. But yes, it's somewhat annoying to have to know the message size in advance. I didn't implement realloc so far. Nor can I think of any other solution. Note that separation of allocation and queue ordering is required anyway for the above reasons. Well, the fact that something is commonly used doesn't mean it's right for us. Tabula rasa, we might design the whole system differently, but changing it now is not to be undertaken lightly. Hopefully the above comments shed some light on my concerns. In short, (1) I don't want to preallocate a big chunk of memory we might not use, Isn't that exactly what we do now for lots of sub-systems, and what I'd like to improve (i.e. reduce to a single big chunk)? (2) I fear reducing the overall robustness of the system, and Well, that applies to pretty much every new feature you add. (3) I'm uncertain what other systems would be able to leverage a dynamic allocator of the sort you propose. 
Okay, that's up to me to show evidence (or at least a PoC). Regards Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Markus Wanner mar...@bluegap.ch writes: AFAICT we currently have three fixed size blocks to manage shared buffers: the buffer blocks themselves, the buffer descriptors, the strategy status (for the freelist) and the buffer lookup table. It's not obvious to me how these data structures should perform better than a dynamically allocated layout. Let me just point out that awhile back we got a *measurable* performance boost by eliminating a single indirect fetch from the buffer addressing code path. We used to have an array of pointers pointing to the actual buffers, and we removed that in favor of assuming the buffers were laid out in a contiguous array, so that the address of buffer N could be computed with a shift-and-add, eliminating the pointer fetch. I forget exactly what the numbers were, but it was significant enough to make us change it. So I don't have any faith in untested assertions that we can convert these data structures to use dynamic allocation with no penalty. It's very difficult to see how you'd do that without introducing a new layer of indirection, and our experience shows that that layer will cost you. regards, tom lane
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi, On 08/30/2010 04:52 PM, Tom Lane wrote: Let me just point out that awhile back we got a *measurable* performance boost by eliminating a single indirect fetch from the buffer addressing code path. I'll take a look at that, thanks. So I don't have any faith in untested assertions Neither do I. Thus I'm probably going to try my approach. Regards Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi, On 08/26/2010 11:57 PM, Robert Haas wrote: It wouldn't require you to preallocate a big chunk of shared memory Agreed, you wouldn't have to allocate it in advance. We would still want a configurable upper limit. So this can be seen as another approach for an implementation of a dynamic allocator. (Which should be separate from the exact imessages implementation, just for the sake of modularization already, IMO). In addition, it means that maximum_message_queue_size_per_backend (or whatever it's called) can be changed on-the-fly; that is, it can be PGC_SIGHUP rather than PGC_POSTMASTER. That's certainly a point. However, as you are proposing a solution to just one subsystem (i.e. imessages), I don't find it half as convincing. If you are saying it *should* be possible to resize shared memory in a portable way, why not do it for *all* subsystems right away? I still remember Tom saying it's not something that's doable in a portable way. Why and how should it be possible on a per-backend basis? How portable is mmap() really? Why don't we use it in Postgres as of now? I certainly think that these are orthogonal issues: whether to use fixed boundaries or to dynamically allocate the memory available is one thing, dynamic resizing is another. If the latter is possible, I'm certainly not opposed to it. (But would still favor dynamic allocation). As to efficiency, the process is not much different once the initial setup is completed. I fully agree with that. I'm more concerned about ease of use for developers. Simply being able to alloc() from shared memory makes things easier than having to invent a separate allocation method for every subsystem, again and again (the argument that people are more used to the multi-threaded model). Doing the extra setup just to send one or two messages might suck. But maybe that just means this isn't the right mechanism for those cases (e.g. the existing XID-wraparound logic should still use signal multiplexing rather than this system). 
I see the value of this as being primarily for streaming big chunks of data, not so much for sending individual, very short messages. I agree that simple signals don't need a full imessage. But as soon as you want to send some data (like which database to vacuum), or require the delivery guarantee (i.e. no single message gets lost, as opposed to signals), then imessages should be cheap enough. The current approach uses plain spinlocks, which are more efficient. Note that both, appending as well as removing from the queue are writing operations, from the point of view of the queue. So I don't think LWLocks buy you anything here, either. I agree that this might not be useful. We don't really have all the message types defined yet, though, so it's hard to say. What does the type of lock used have to do with message types? IMO it doesn't matter what kind of message or what size you want to send. For appending or removing a pointer to or from a message queue, a spinlock seems to be just the right thing to use. I understand the need to limit the amount of data in flight, but I don't think that sending any type of message should ever block. Messages are atomic in that regard. Either they are ready to be delivered (in entirety) or not. Thus the sender needs to hold back the message, if the recipient is overloaded. (Also note that currently imessages are bound to a maximum size of around 8 KB). That's functionally equivalent to blocking, isn't it? I think that's just a question of what API you want to expose. Hm.. well, yeah, depends on what level you are arguing. The imessages API can be used in a completely non-blocking fashion. So a process can theoretically do other work while waiting for messages. For parallel querying, the helper/worker backends would probably need to block, if the origin backend is not ready to accept more data, yes. However, making it accept and process another job in the mean time seems hard to do. But not an imessages problem per se. 
(While with the above streaming layer I've mentioned, that would not be possible, because that blocks). For replication, that might be the case, but for parallel query, per-queue seems about right. At any rate, no design we've discussed will let individual queues grow without bound. Extend parallel querying to multiple nodes and you are back at the same requirement. However, it's certainly something that can be done atop imessages. I'm unsure if doing it as part of imessages is a good thing or not. Given the above requirement, I don't currently think so. Using multiple queues with different priorities, as you proposed, would probably make it more feasible. You probably need this, but 8KB seems like a pretty small chunk size. For node-internal messaging, I probably agree. Would need benchmarking, as it's a compromise between latency and overhead, IMO. I've chosen 8KB so these messages (together with some GCS
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Fri, Aug 27, 2010 at 2:17 PM, Markus Wanner mar...@bluegap.ch wrote: In addition, it means that maximum_message_queue_size_per_backend (or whatever it's called) can be changed on-the-fly; that is, it can be PGC_SIGHUP rather than PGC_POSTMASTER. That's certainly a point. However, as you are proposing a solution to just one subsystem (i.e. imessages), I don't find it half as convincing. What other subsystems are you imagining servicing with a dynamic allocator? If there were a big demand for this functionality, we probably would have been forced to implement it already, but that's not the case. We've already discussed the fact that there are massive problems with using it for something like shared_buffers, which is by far the largest consumer of shared memory. If you are saying it *should* be possible to resize shared memory in a portable way, why not do it for *all* subsystems right away? I still remember Tom saying it's not something that's doable in a portable way. I think it would be great if we could bring some more flexibility to our memory management. There are really two layers of problems here. One is resizing the segment itself, and one is resizing structures within the segment. As far as I can tell, there is no portable API that can be used to resize the shm itself. For so long as that remains the case, I am of the opinion that any meaningful resizing of the objects within the shm is basically unworkable. So we need to solve that problem first. There are a couple of possible solutions, which have been discussed here in the past. One very appealing option is to use POSIX shm rather than sysv shm. AFAICT, it is possible to portably resize a POSIX shm using ftruncate(), though I am not sure to what extent this is supported on Windows. 
One significant advantage of using POSIX shm is that the default limits for POSIX shm on many operating systems are much higher than the corresponding limits for sysv shm; in fact, some people have expressed the opinion that it might be worth making the switch for that reason alone, since it is no secret that a default value of 32MB or less for shared_buffers is not enough to get reasonable performance on many modern systems. I believe, however, that Tom Lane thinks we need to get a bit more out of it than that to make it worthwhile. One obstacle to making the switch is that POSIX shm does not provide a way to fetch the number of processes attached to the shared memory segment, which is a critical part of our infrastructure to prevent accidentally running multiple postmasters on the same data directory at the same time. Consequently, it seems hard to see how we can make that switch completely. At a minimum, we'll probably need to maintain a small sysv shm for interlock purposes. OK, so let's suppose we use POSIX shm for most of the shared memory segment, and keep only our fixed-size data structures in the sysv shm. Then what? Well, then we can potentially resize it. Because we are using a process-based model, this will require some careful gymnastics. Let's say we're growing the shm. The backend that is initiating the operation will call ftruncate() and then send signal all of the other backends (using a sinval message or a multiplexed signal or some such mechanism) to unmap and remap the shared memory segment. Any failure to remap the shared memory segment is at least a FATAL for that backend, and very likely a PANIC, so this had better not be something we plan to do routinely - for example, we wouldn't want to do this as a way of adapting to changing load conditions. 
It would probably be acceptable to do it in a situation such as a postgresql.conf reload, to accommodate a change in the server parameter that can't otherwise be changed without a restart, since the worst case scenario is, well, we have to restart anyway. Once all that's done, it's safe to start allocating memory from the newly added portion of the shm. Conversely, if we want to shrink the shm, the considerations are similar, but we have to do everything in the opposite order. First, we must ensure that the portion of the shm we're about to release is unused. Then, we tell all the backends to unmap and remap it. Once we've confirmed that they have done so, we ftruncate() it to the new size. Next, we have to think about how we're going to resize data structures within this expandable shm. Many of these structures are not things that we can easily move without bringing the system to a halt. For example, it's difficult to see how you could change the base address of shared buffers without ceasing all system activity, at which point there's not really much advantage over just forcing a restart. Similarly with LWLocks or the ProcArray. And if you can't move them, then how will you grow them if (as will likely be the case) there's something immediately following them in memory. One possible solution is to divide up these data structures into slabs. For example,
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi, thanks for your feedback on this, it sort of got lost below the discussion about the dynamic shared memory stuff, IMO. On 08/26/2010 04:39 AM, Robert Haas wrote: It's not clear to me whether it's better to have a single coordinator process that handles both autovacuum and other things, or whether it's better to have two separate processes. It has been proposed by Alvaro and/or Tom (IIRC) to reduce code duplication. Compared to the former approach, it certainly seems cleaner that way and it has helped reduce duplicate code a lot. I'm envisioning such a coordinator process to handle coordination of other background processes as well, for example for distributed and/or parallel querying. Having just only one process reduces the amount of interaction required between coordinators (i.e. autovacuum shouldn't ever start on databases for which replication didn't start, yet, as the autovacuum worker would be unable to connect to the database at that stage). It also reduces the amount of extra processes required, and thus I think also general complexity. What'd be the benefits of having separate coordinator processes? They'd be doing pretty much the same: coordinate background processes. (And yes, I clearly consider autovacuum to be just one kind of background process). I agree with this criticism, but the other thing that strikes me as a nonstarter is having the postmaster participate in the imessages framework. This is simply not the case (anymore). (And one of the reasons a separate coordinator process is required, instead of letting the postmaster do this kind of coordination). Our general rule is that the postmaster must avoid touching shared memory; else a backend that scribbles on shared memory might take out the postmaster, leading to a failure of the crash-and-restart logic. That rule is well understood and followed by the bg worker infrastructure patches. If you find code for which that isn't true, please point at it. 
The crash-and-restart logic should work just as it did with the autovacuum launcher. Regards Markus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Itagaki-san, thanks for reviewing this. On 08/26/2010 03:39 AM, Itagaki Takahiro wrote: Other changes in the patch don't seem to be always needed for the purpose. In other words, the patch is not minimal. Hm.. yeah, maybe the separation between step1 and step2 is a bit arbitrary. I'll look into it. The original purpose could be done without IMessage. Agreed, that's the one exception. I've mentioned why that is and I don't currently feel like coding an unneeded variant which doesn't use imessages. Also, min/max_spare_background_workers are not used in the patch at all. You are right, it only starts to get used in step2, so the addition should probably move there, right. (BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is maybe a typo.) Uh, correct, thank you for pointing this out. (I had originally named these processes helpers. Merging with autovacuum, it made more sense to name them background *workers*.) The most questionable point for me is why you didn't add any hook functions in the coordinator process. Because I'm a hook-hater. ;-) No, seriously: I don't see what problem hooks could have solved. I'm coding in C and extending the Postgres code. Deciding on hooks and an API to use them requires good knowledge of where exactly you want to hook and what API you want to provide. Then that API needs to remain stable for an extended time. I don't think any part of the bg worker infrastructure currently is anywhere close to that. With the patch, you can extend the coordinator protocols with IMessage, but it requires patches to core at handle_imessage(). If you want fully-extensible workers, we should provide a method to develop worker codes in an external plugin. It was originally intended as internal infrastructure. Offering its capabilities to the outside would require stabilization, security control and working out an API. All of which is certainly not something I intend to do. Is it possible to develop your own codes in the plugin? 
If possible, you can use IMessage as a private protocol freely in the plugin. Am I missing something? Well, what problem(s) are you trying to solve with such a thing? I've no idea what direction you are aiming at, sorry. However, it's certainly different from, or an extension of, bg worker, so it would need to be a separate patch, IMO. Regards Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On 08/26/2010 05:01 AM, Itagaki Takahiro wrote: Markus, do you need B? Or A + standard backend processes are enough? If you need B eventually, starting with B might be better. No, I certainly don't need B. Why not just use an ordinary backend to do user defined background processing? It covers all of the API stability and the security issues I've raised. Regards Markus Wanner
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Thu, Aug 26, 2010 at 7:42 PM, Markus Wanner mar...@bluegap.ch wrote: Markus, do you need B? Or A + standard backend processes are enough? No, I certainly don't need B. OK, I see why you proposed a coordinator hook (yeah, I call it a hook :) rather than adding user-defined processes. Why not just use an ordinary backend to do user defined background processing? It covers all of the API stability and the security issues I've raised. However, we have autovacuum worker processes in addition to normal backend processes. Does that suggest there are some jobs we cannot run in normal backends? For example, normal backends cannot do anything in idle time, so a time-based polling job is difficult in backends. It might be ok to fork processes for each interval when the polling interval is long, but it is not effective for short intervals. I'd like to use that kind of process as an additional stats collector. -- Itagaki Takahiro
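[Editor's note: the short-interval polling job Itagaki-san describes might be sketched as below. This is purely illustrative C; `polling_loop` and the commented-out `collect_stats()` are hypothetical names, not part of any patch. The point is simply that a dedicated worker can sleep in short intervals and wake up to do periodic work, which an ordinary backend, driven only by client commands, cannot do in its idle time.]

```c
#include <time.h>

/* Hypothetical sketch: a dedicated background worker's main loop,
 * waking up every interval_ms milliseconds to do some periodic work.
 * Returns the number of samples taken (bounded here for testability;
 * a real worker would loop until told to shut down). */
static int
polling_loop(int iterations, long interval_ms)
{
    struct timespec ts = { interval_ms / 1000,
                           (interval_ms % 1000) * 1000000L };
    int         samples = 0;
    int         i;

    for (i = 0; i < iterations; i++)
    {
        /* collect_stats() or similar periodic work would go here */
        samples++;
        nanosleep(&ts, NULL);
    }
    return samples;
}
```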
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Itagaki-san, On 08/26/2010 01:02 PM, Itagaki Takahiro wrote: OK, I see why you proposed a coordinator hook (yeah, I call it a hook :) rather than adding user-defined processes. I see. If you call that a hook, I'm definitely not a hook-hater ;-) at least not according to your definition. However, we have autovacuum worker processes in addition to normal backend processes. Does that suggest there are some jobs we cannot run in normal backends? Hm.. understood. You can use VACUUM from a cron job. And that's the problem autovacuum solves. So in a way, that's just a convenience feature. You want the same for general purpose user defined background processing, right? For example, normal backends cannot do anything in idle time, so a time-based polling job is difficult in backends. It might be ok to fork processes for each interval when the polling interval is long, but it is not effective for short intervals. I'd like to use that kind of process as an additional stats collector. Did you follow the discussion I had with Dimitri, who was trying something similar, IIRC? See the bg worker - overview thread. There might be some interesting bits pointing in that direction. Regards Markus
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Thu, Aug 26, 2010 at 6:07 AM, Markus Wanner mar...@bluegap.ch wrote: What'd be the benefits of having separate coordinator processes? They'd be doing pretty much the same: coordinate background processes. (And yes, I clearly consider autovacuum to be just one kind of background process). I dunno. It was just a thought. I haven't actually looked at the code to see how much synergy there is. (Sorry, been really busy...) I agree with this criticism, but the other thing that strikes me as a nonstarter is having the postmaster participate in the imessages framework. This is simply not the case (anymore). (And one of the reasons a separate coordinator process is required, instead of letting the postmaster do this kind of coordination). Oh, OK. I see now that I misinterpreted what you wrote. On the more general topic of imessages, I had one other thought that might be worth considering. Instead of using shared memory, what about using a file that is shared between the sender and receiver? So for example, perhaps each receiver will read messages from a file called pg_messages/%d, where %d is the backend ID. And writers will write into that file. Perhaps both readers and writers mmap() the file, or perhaps there's a way to make it work with just read() and write(). If you actually mmap() the file, you could probably manage it in a fashion pretty similar to what you had in mind for wamalloc, or some other setup that minimizes locking. In particular, ISTM that if we want this to be usable for parallel query, we'll want to be able to have one process streaming data in while another process streams data out, with minimal interference between these two activities. On the other hand, for processes that only send and receive messages occasionally, this might just be overkill (and overhead). You'd be just as well off wrapping the access to the file in an LWLock: the reader takes the lock, reads the data, marks it read, and releases the lock. 
The writer takes the lock, writes data, and releases the lock. It almost seems to me that there are two different kinds of messages here: control messages and data messages. Control messages are things like vacuum this database! or flush your cache! or execute this query and send the results to backend %d! or cancel the currently executing query!. They are relatively small (in some cases, fixed-size), relatively low-volume, don't need complex locking, and can generally be processed serially but with high priority. Data messages are streams of tuples, either from a remote database from which we are replicating, or between backends that are executing a parallel query. These messages may be very large and extremely high-volume, are very sensitive to concurrency problems, but are not high-priority. We want to process them as quickly as possible, of course, but the work may get interrupted by control messages. Another point is that it's reasonable, at least in the case of parallel query, for the action of sending a data message to *block*. If one part of the query is too far ahead of the rest of the query, we don't want to queue up results forever, perhaps using CPU or I/O resources that some other backend needs to catch up, exhausting available disk space, etc. Instead, at some point, we just block and wait for the queue to drain. I suppose there's no avoiding the possibility that sending a control message might also block, but certainly we wouldn't like a control message to block because the relevant queue is full of data messages. So I kind of wonder whether we ought to have two separate systems, one for data and one for control, with somewhat different characteristics. I notice that one of your bg worker patches is for OOO-messages. I apologize again for not having read through it, but how much does that resemble separating the control and data channels? 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
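[Editor's note: the blocking data-message send Robert describes — a producer that waits when the queue is full rather than queueing results forever — could look roughly like the following. This is a minimal single-process sketch using POSIX threads primitives; the names (`msg_queue`, `queue_send`, `queue_recv`) are hypothetical and not taken from any of the patches.]

```c
#include <pthread.h>
#include <string.h>

#define QUEUE_CAPACITY 4

typedef struct
{
    int             items[QUEUE_CAPACITY];
    int             head;
    int             count;
    pthread_mutex_t lock;
    pthread_cond_t  not_full;
    pthread_cond_t  not_empty;
} msg_queue;

static void
queue_init(msg_queue *q)
{
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Blocks while the queue is full: this is the backpressure that keeps
 * a fast producer from exhausting memory ahead of a slow consumer. */
static void
queue_send(msg_queue *q, int msg)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAPACITY)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = msg;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks while the queue is empty; draining it wakes blocked senders. */
static int
queue_recv(msg_queue *q)
{
    int         msg;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    msg = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return msg;
}
```

Note that a separate, never-full control queue would still be needed alongside this, so that a control message is never stuck behind a backlog of data messages.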
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Robert, On 08/26/2010 02:44 PM, Robert Haas wrote: I dunno. It was just a thought. I haven't actually looked at the code to see how much synergy there is. (Sorry, been really busy...) No problem, was just wondering if there's any benefit you had in mind. On the more general topic of imessages, I had one other thought that might be worth considering. Instead of using shared memory, what about using a file that is shared between the sender and receiver? What would that buy us, at the price of more system calls and disk I/O? Remember that the current approach (IIRC) uses exactly one syscall to send a message: kill() to send the (multiplexed) signal. (Except on strange platforms or setups that don't have a user-space spinlock implementation and need to use system mutexes). So for example, perhaps each receiver will read messages from a file called pg_messages/%d, where %d is the backend ID. And writers will write into that file. Perhaps both readers and writers mmap() the file, or perhaps there's a way to make it work with just read() and write(). If you actually mmap() the file, you could probably manage it in a fashion pretty similar to what you had in mind for wamalloc, or some other setup that minimizes locking. That would still require proper locking, then. So I'm not seeing the benefit. In particular, ISTM that if we want this to be usable for parallel query, we'll want to be able to have one process streaming data in while another process streams data out, with minimal interference between these two activities. That's entirely possible with the current approach. About the only limitation is that a receiver can only consume the messages in the order they got into the queue. But pretty much any backend can send messages to any other backend concurrently. (Well, except that I think there currently are bugs in wamalloc). On the other hand, for processes that only send and receive messages occasionally, this might just be overkill (and overhead). 
You'd be just as well off wrapping the access to the file in an LWLock: the reader takes the lock, reads the data, marks it read, and releases the lock. The writer takes the lock, writes data, and releases the lock. The current approach uses plain spinlocks, which are more efficient. Note that both appending to and removing from the queue are writing operations, from the point of view of the queue. So I don't think LWLocks buy you anything here, either. It almost seems to me that there are two different kinds of messages here: control messages and data messages. Control messages are things like vacuum this database! or flush your cache! or execute this query and send the results to backend %d! or cancel the currently executing query!. They are relatively small (in some cases, fixed-size), relatively low-volume, don't need complex locking, and can generally be processed serially but with high priority. Data messages are streams of tuples, either from a remote database from which we are replicating, or between backends that are executing a parallel query. These messages may be very large and extremely high-volume, are very sensitive to concurrency problems, but are not high-priority. We want to process them as quickly as possible, of course, but the work may get interrupted by control messages. Another point is that it's reasonable, at least in the case of parallel query, for the action of sending a data message to *block*. If one part of the query is too far ahead of the rest of the query, we don't want to queue up results forever, perhaps using CPU or I/O resources that some other backend needs to catch up, exhausting available disk space, etc. I agree that such a thing isn't currently covered. And it might be useful. However, adding two separate queues with different priority would be very simple to do. (Note, however, that there already are the standard unix signals for very simple kinds of control signals. I.e. 
for aborting a parallel query, you could simply send SIGINT to all background workers involved). I understand the need to limit the amount of data in flight, but I don't think that sending any type of message should ever block. Messages are atomic in that regard. Either they are ready to be delivered (in entirety) or not. Thus the sender needs to hold back the message, if the recipient is overloaded. (Also note that currently imessages are bound to a maximum size of around 8 KB). It might be interesting to note that I've just implemented some kind of streaming mechanism *atop* of imessages for Postgres-R. A data stream gets fragmented into single messages. As you pointed out, there should be some kind of congestion control. However, in my case, that needs to cover the inter-node connection as well, not just imessages. So I think the solution to that problem needs to be found on a higher level. I.e. in the Postgres-R case, I want to limit the *overall* amount of recovery data
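[Editor's note: the fragmentation Markus describes — a data stream split into single messages of at most ~8 KB each — might be sketched as below. `stream_send` and `IMSG_MAX_PAYLOAD` are illustrative stand-ins, not the actual Postgres-R API; the `send_one` callback stands in for handing a fragment to the imessages layer.]

```c
#include <stddef.h>

/* Illustrative imessage payload limit (the thread mentions ~8 KB). */
#define IMSG_MAX_PAYLOAD 8192

/* Splits a data stream into fragments no larger than IMSG_MAX_PAYLOAD
 * and hands each one to send_one().  Returns the number of fragments
 * produced.  Reassembly on the receiving side would concatenate the
 * fragments in queue order, which the imessage queue guarantees. */
static int
stream_send(const char *data, size_t len,
            void (*send_one) (const char *frag, size_t fraglen))
{
    int         nfrags = 0;

    while (len > 0)
    {
        size_t      fraglen = (len < IMSG_MAX_PAYLOAD) ?
                              len : IMSG_MAX_PAYLOAD;

        send_one(data, fraglen);
        data += fraglen;
        len -= fraglen;
        nfrags++;
    }
    return nfrags;
}
```

Congestion control would then sit above this loop: the sender simply stops fragmenting (holds back the rest of the stream) while the recipient's queue or memory pool is exhausted.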
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Markus Wanner mar...@bluegap.ch writes: On 08/26/2010 02:44 PM, Robert Haas wrote: On the more general topic of imessages, I had one other thought that might be worth considering. Instead of using shared memory, what about using a file that is shared between the sender and receiver? What would that buy us? Not having to have a hard limit on the space for unconsumed messages? The current approach uses plain spinlocks, which are more efficient. Please note the coding rule that says that the code should not execute more than a few straight-line instructions while holding a spinlock. If you're copying long messages while holding the lock, I don't think spinlocks are acceptable. regards, tom lane
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On 08/26/2010 09:22 PM, Tom Lane wrote: Not having to have a hard limit on the space for unconsumed messages? Ah, I see. However, spilling to disk is unwanted for the current use cases of imessages. Instead the sender needs to be able to deal with out-of-(that-specific-part-of-shared)-memory conditions. Please note the coding rule that says that the code should not execute more than a few straight-line instructions while holding a spinlock. If you're copying long messages while holding the lock, I don't think spinlocks are acceptable. Writing the payload data for imessages to shared memory doesn't need any kind of lock. (Because the relevant chunk of shared memory got allocated via wamalloc, which grants the allocator exclusive control over the returned chunk). Only appending and removing (the pointer to the data) to and from the queue requires taking a spinlock. And I think that still qualifies. However, your concern is valid for wamalloc, which is more critical in that regard. Regards Markus
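[Editor's note: the locking discipline Markus describes — payload written without any lock into a privately allocated chunk, spinlock held only around the pointer manipulation that links the chunk into the queue — can be sketched as follows. C11 atomics and malloc() stand in here for s_lock.h and wamalloc, so this is an approximation of the scheme, not the patch's code. The lock protects only a few straight-line instructions, consistent with the coding rule Tom cites.]

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

typedef struct imessage
{
    struct imessage *next;
    size_t          len;
    char            payload[];  /* flexible array member */
} imessage;

static imessage *queue_head = NULL;
static atomic_flag queue_lock = ATOMIC_FLAG_INIT;

static void
imessage_send(const char *data, size_t len)
{
    imessage   *msg;

    /* No lock held: the chunk is private until it is enqueued.
     * (In the patch, wamalloc would hand out the exclusive chunk.) */
    msg = malloc(sizeof(imessage) + len);
    msg->len = len;
    memcpy(msg->payload, data, len);

    /* Spinlock held only around the pointer manipulation. */
    while (atomic_flag_test_and_set(&queue_lock))
        ;                       /* spin */
    msg->next = queue_head;
    queue_head = msg;
    atomic_flag_clear(&queue_lock);
}
```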
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Thu, Aug 26, 2010 at 3:03 PM, Markus Wanner mar...@bluegap.ch wrote: On the more general topic of imessages, I had one other thought that might be worth considering. Instead of using shared memory, what about using a file that is shared between the sender and receiver? What would that buy us? (At the price of more system calls and disk I/O)? Remember that the current approach (IIRC) uses exactly one syscall to send a message: kill() to send the (multiplexed) signal. (Except on strange platforms or setups that don't have a user-space spinlock implementation and need to use system mutexes). It wouldn't require you to preallocate a big chunk of shared memory without knowing how much of it you'll actually need. For example, suppose we implement parallel query. If the message queues can be allocated on the fly, then you can just say maximum_message_queue_size_per_backend = 16MB and that'll probably be good enough for most installations. On systems where parallel query is not used (e.g. because they only have 1 or 2 processors) then it costs nothing. On systems where parallel query is used extensively (e.g. because they have 32 processors), you'll allocate enough space for the number of backends that actually need message buffers, and not more than that. Furthermore, if parallel query is used at some times (say, for nightly reporting) but not others (say, for daily OLTP queries), the buffers can be deallocated when the helper backends exit (or paged out if they are idle), and that memory can be reclaimed for other use. In addition, it means that maximum_message_queue_size_per_backend (or whatever it's called) can be changed on-the-fly; that is, it can be PGC_SIGHUP rather than PGC_POSTMASTER. Being able to change GUCs without shutting down the postmaster is a *big deal* for people running in 24x7 operations. 
Even things like wal_level that aren't apt to be changed more than once in a blue moon are a problem (once you go from not having a standby to having a standby, you're unlikely to want to go backwards), and this would likely need more tweaking. You might find that you need more memory for better throughput, or that you need to reclaim memory for other purposes. Especially if it's a hard allocation for any number of backends, rather than something that backends can allocate only as and when they need it. As to efficiency, the process is not much different once the initial setup is completed. Just because you write to a memory-mapped file rather than a shared memory segment doesn't mean that you're necessarily doing disk I/O. On systems that support it, you could also choose to map a named POSIX shm rather than a disk file. Either way, there might be a little more overhead at startup but that doesn't seem so bad; presumably the amount of work that the worker is doing is large compared to the overhead of a few system calls, or you're probably in trouble anyway, since our process startup overhead is pretty substantial already. The only time it seems like the overhead would be annoying is if a process is going to use this system, but only lightly. Doing the extra setup just to send one or two messages might suck. But maybe that just means this isn't the right mechanism for those cases (e.g. the existing XID-wraparound logic should still use signal multiplexing rather than this system). I see the value of this as being primarily for streaming big chunks of data, not so much for sending individual, very short messages. On the other hand, for processes that only send and receive messages occasionally, this might just be overkill (and overhead). You'd be just as well off wrapping the access to the file in an LWLock: the reader takes the lock, reads the data, marks it read, and releases the lock. The writer takes the lock, writes data, and releases the lock. 
The current approach uses plain spinlocks, which are more efficient. Note that both appending to and removing from the queue are writing operations, from the point of view of the queue. So I don't think LWLocks buy you anything here, either. I agree that this might not be useful. We don't really have all the message types defined yet, though, so it's hard to say. I understand the need to limit the amount of data in flight, but I don't think that sending any type of message should ever block. Messages are atomic in that regard. Either they are ready to be delivered (in entirety) or not. Thus the sender needs to hold back the message, if the recipient is overloaded. (Also note that currently imessages are bound to a maximum size of around 8 KB). That's functionally equivalent to blocking, isn't it? I think that's just a question of what API you want to expose. It might be interesting to note that I've just implemented some kind of streaming mechanism *atop* of imessages for Postgres-R. A data stream gets fragmented into single messages. As you pointed out, there should be some kind of congestion control.
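[Editor's note: Robert's mmap()'d message-file idea from upthread — a file such as pg_messages/%d, where %d is the backend ID, shared between sender and receiver — might look roughly like this. The path and the `msgbuf_map` helper are hypothetical; a temp-file path stands in for a real data-directory location, and locking over the mapped region is omitted. Both processes mapping the same file with MAP_SHARED see each other's writes, which is what makes on-the-fly, resizable message buffers possible without preallocating shared memory.]

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MSGBUF_SIZE 4096

/* Creates (if needed), sizes, and maps a shared message file.
 * Every process mapping the same path shares the same pages. */
static char *
msgbuf_map(const char *path)
{
    int         fd;
    char       *buf;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, MSGBUF_SIZE) < 0)
    {
        close(fd);
        return NULL;
    }
    buf = mmap(NULL, MSGBUF_SIZE, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    return (buf == MAP_FAILED) ? NULL : buf;
}
```

Deallocating is just munmap() plus unlink(), which is what would let an installation reclaim the memory when helper backends exit.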
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Thu, Aug 26, 2010 at 3:40 PM, Markus Wanner mar...@bluegap.ch wrote: On 08/26/2010 09:22 PM, Tom Lane wrote: Not having to have a hard limit on the space for unconsumed messages? Ah, I see. However, spilling to disk is unwanted for the current use cases of imessages. Instead the sender needs to be able to deal with out-of-(that-specific-part-of-shared)-memory conditions. Shared memory can be paged out, too, if it's not being used enough to keep the OS from deciding to evict it. And I/O to a mmap()'d file or shared memory region can remain in RAM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner mar...@bluegap.ch wrote: This patch turns the existing autovacuum launcher into an always running process, partly called the coordinator. If autovacuum is disabled, the coordinator process still gets started and keeps around, but it doesn't dispatch vacuum jobs. I think this part is a reasonable proposal, but... The coordinator process now uses imessages to communicate with background (autovacuum) workers and to trigger a vacuum job. It also adds two new controlling GUCs: min_spare_background_workers and max_spare_background_workers. Other changes in the patch don't seem to be always needed for the purpose. In other words, the patch is not minimal. The original purpose could be done without IMessage. Also, min/max_spare_background_workers are not used in the patch at all. (BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is maybe a typo.) The most questionable point for me is why you didn't add any hook functions in the coordinator process. With the patch, you can extend the coordinator protocols with IMessage, but it requires patches to core at handle_imessage(). If you want fully-extensible workers, we should provide a method to develop worker codes in an external plugin. Is it possible to develop your own codes in the plugin? If possible, you can use IMessage as a private protocol freely in the plugin. Am I missing something? -- Itagaki Takahiro
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Wed, Aug 25, 2010 at 9:39 PM, Itagaki Takahiro itagaki.takah...@gmail.com wrote: On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner mar...@bluegap.ch wrote: This patch turns the existing autovacuum launcher into an always running process, partly called the coordinator. If autovacuum is disabled, the coordinator process still gets started and keeps around, but it doesn't dispatch vacuum jobs. I think this part is a reasonable proposal, but... It's not clear to me whether it's better to have a single coordinator process that handles both autovacuum and other things, or whether it's better to have two separate processes. The coordinator process now uses imessages to communicate with background (autovacuum) workers and to trigger a vacuum job. It also adds two new controlling GUCs: min_spare_background_workers and max_spare_background_workers. Other changes in the patch don't seem to be always needed for the purpose. In other words, the patch is not minimal. The original purpose could be done without IMessage. Also, min/max_spare_background_workers are not used in the patch at all. (BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is maybe a typo.) The most questionable point for me is why you didn't add any hook functions in the coordinator process. With the patch, you can extend the coordinator protocols with IMessage, but it requires patches to core at handle_imessage(). If you want fully-extensible workers, we should provide a method to develop worker codes in an external plugin. I agree with this criticism, but the other thing that strikes me as a nonstarter is having the postmaster participate in the imessages framework. Our general rule is that the postmaster must avoid touching shared memory; else a backend that scribbles on shared memory might take out the postmaster, leading to a failure of the crash-and-restart logic. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
On Thu, Aug 26, 2010 at 11:39 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner mar...@bluegap.ch wrote: This patch turns the existing autovacuum launcher into an always running process, partly called the coordinator. It's not clear to me whether it's better to have a single coordinator process that handles both autovacuum and other things, or whether it's better to have two separate processes. Ah, we can separate the proposal into two topics: A. Support to run non-vacuum jobs from autovacuum launcher B. Support user defined background processes A was proposed in the original 1 of 6 patch, but B might be more general. If we have a separate coordinator, B will be required. Markus, do you need B? Or A + standard backend processes are enough? If you need B eventually, starting with B might be better. -- Itagaki Takahiro