Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Tomas Vondra writes: > I've checked how this worked in 9.2 (before the 9.3 patch that split the > file per db), and back then last_statsrequest (transformed to > request_time) was used to decide whether we need to write something. But > now we do that by simply checking whether the list is empty. Right. In effect, 9.3 moved the decision about "do we need to write stats because of this request" from the main loop to pgstat_recv_inquiry() ... but we forgot to incorporate any check for whether the request was already satisfied into pgstat_recv_inquiry(). We can do that, though, as per either of the patches under discussion --- and once we do so, maintaining DBWriteRequest.request_time seems a bit pointless. It's conceivable that we'd try to implement merging of close-together requests in a way that would take account of how far back the oldest unsatisfied request is. But I think that a global oldest-request time would be sufficient for that; we don't need to track it per-database. In any case, it's hard to see exactly how to make that work without putting a gettimeofday() call into the inner message handling loop, which I believe we won't want to do on performance grounds. The previous speculation about doing writes every N messages or when we have no input to process seems more likely to lead to a useful answer. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
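Tom's point about checking whether a request was already satisfied can be sketched as a standalone predicate. This is a hypothetical simplification (the struct and function names below are invented for illustration, not taken from pgstat.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;    /* stand-in for the PostgreSQL typedef */

/* Hypothetical, stripped-down version of a per-database stats entry. */
typedef struct
{
    TimestampTz stats_timestamp;    /* when this DB's stats file was last written */
} DBEntry;

/*
 * Return true when an inquiry still needs a fresh write: the file on disk
 * is older than the requesting backend's cutoff time.  If the last write
 * round already produced a file new enough, the request is satisfied and
 * can be dropped without queueing another write.
 */
static bool
inquiry_needs_write(const DBEntry *dbentry, TimestampTz cutoff_time)
{
    return dbentry->stats_timestamp < cutoff_time;
}
```

With this check done inside the inquiry handler, a request_time field on the queued request has nothing left to decide, which is why maintaining it becomes pointless.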
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
On 05/31/2016 07:24 PM, Tom Lane wrote: Tomas Vondra writes: On 05/31/2016 06:59 PM, Tom Lane wrote: I'm confused here --- are you speaking of having removed if (msg->cutoff_time > req->request_time) req->request_time = msg->cutoff_time; ? That is not a check for clock skew, it's intending to be sure that req->request_time reflects the latest request for this DB when we've seen more than one request. But since req->request_time isn't actually being used anywhere, it's useless code. Ah, you're right. I've made the mistake of writing the e-mail before drinking any coffee today, and I got distracted by the comment change. I reformatted the actual check for clock skew, but I do not think I changed its behavior. I'm not sure it does not change the behavior, though. request_time only became unused after you removed the two places that set the value (one of them in the clock skew check). Well, it's unused in the sense that the if-test quoted above is the only place in HEAD that examines the value of request_time. And since that if-test only controls whether we change the value, and not whether we proceed to make the clock skew check, I don't see how it's related to clock skew or indeed anything else at all. I see, in that case it indeed is useless. I've checked how this worked in 9.2 (before the 9.3 patch that split the file per db), and back then last_statsrequest (transformed to request_time) was used to decide whether we need to write something. But now we do that by simply checking whether the list is empty. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Tomas Vondra writes: > On 05/31/2016 06:59 PM, Tom Lane wrote: >> I'm confused here --- are you speaking of having removed >> if (msg->cutoff_time > req->request_time) >> req->request_time = msg->cutoff_time; >> ? That is not a check for clock skew, it's intending to be sure that >> req->request_time reflects the latest request for this DB when we've >> seen more than one request. But since req->request_time isn't >> actually being used anywhere, it's useless code. > Ah, you're right. I've made the mistake of writing the e-mail before > drinking any coffee today, and I got distracted by the comment change. >> I reformatted the actual check for clock skew, but I do not think I >> changed its behavior. > I'm not sure it does not change the behavior, though. request_time only > became unused after you removed the two places that set the value (one > of them in the clock skew check). Well, it's unused in the sense that the if-test quoted above is the only place in HEAD that examines the value of request_time. And since that if-test only controls whether we change the value, and not whether we proceed to make the clock skew check, I don't see how it's related to clock skew or indeed anything else at all. regards, tom lane
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
On 05/31/2016 06:59 PM, Tom Lane wrote: Tomas Vondra writes: On 05/26/2016 10:10 PM, Tom Lane wrote: I posted a patch at https://www.postgresql.org/message-id/13023.1464213...@sss.pgh.pa.us which I think is functionally equivalent to what you have here, but it goes to some lengths to make the code more readable, whereas this is just adding another layer of complication to something that's already a mess (eg, the request_time field is quite useless as-is). So I'd like to propose pushing that in place of this patch ... do you care to review it first? I've reviewed the patch today, and it seems fine to me - correct and achieving the same goal as the patch posted to this thread (plus fixing the issue with shared catalogs and fixing many comments). Thanks for reviewing! FWIW do you still plan to back-patch this? Minimizing the amount of changes was one of the things I had in mind when writing "my" patch, which is why I ended up with parts that are less readable. Yeah, I think it's a bug fix and should be back-patched. I'm not in favor of making things more complicated just to reduce the number of lines a patch touches. The one change I'm not quite sure about is the removal of clock skew detection in pgstat_recv_inquiry(). You've removed the first check on the inquiry, replacing it with this comment: It seems sufficient to check for clock skew once per write round. But the first check was comparing msg/req, while the second check looks at dbentry/cur_ts. I don't see how those two clock skew checks are redundant - if they are, the comment should explain that I guess. I'm confused here --- are you speaking of having removed if (msg->cutoff_time > req->request_time) req->request_time = msg->cutoff_time; ? That is not a check for clock skew, it's intending to be sure that req->request_time reflects the latest request for this DB when we've seen more than one request. But since req->request_time isn't actually being used anywhere, it's useless code. Ah, you're right. 
I've made the mistake of writing the e-mail before drinking any coffee today, and I got distracted by the comment change. I reformatted the actual check for clock skew, but I do not think I changed its behavior. I'm not sure it does not change the behavior, though. request_time only became unused after you removed the two places that set the value (one of them in the clock skew check). I'm not sure this is a bad change, though. But there was a dependency between the new request and the preceding one. Another thing is that if you believe merging requests across databases is a silly idea, maybe we should bite the bullet and replace the list of requests with a single item. I'm not convinced about this, though. No, I don't want to do that either. We're not spending much code by having pending_write_requests be a list rather than a single entry, and we might eventually figure out a reasonable way to time the flushes so that we can merge requests. +1 regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
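The clock-reset handling being discussed can be illustrated with a toy helper: if the collector's clock moved backwards, the recorded write timestamp can lie in the future, and no inquiry would ever look stale enough to trigger a write. A sketch of the idea only — the function name and the back-dating offset of 1 are invented here, not taken from pgstat.c:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t TimestampTz;    /* stand-in for the PostgreSQL typedef */

/*
 * If the last-write timestamp is in the future relative to the current
 * time, the system clock must have been reset backwards.  Back-date the
 * timestamp so the next inquiry forces a fresh write instead of being
 * ignored forever.  (The real code also emits a log message.)
 */
static TimestampTz
correct_stats_timestamp(TimestampTz stats_timestamp, TimestampTz cur_ts)
{
    if (cur_ts < stats_timestamp)
        return cur_ts - 1;      /* clock went backwards: pretend the file is stale */
    return stats_timestamp;     /* normal case: leave it alone */
}
```

This is the "correction a few lines down" that Noah refers to elsewhere in the thread: a large backward reset makes both the clock_time and cutoff_time comparisons misfire, so the correction has to run regardless of which earlier check an inquiry passed.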
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Tomas Vondra writes: > On 05/26/2016 10:10 PM, Tom Lane wrote: >> I posted a patch at >> https://www.postgresql.org/message-id/13023.1464213...@sss.pgh.pa.us >> which I think is functionally equivalent to what you have here, but >> it goes to some lengths to make the code more readable, whereas this >> is just adding another layer of complication to something that's >> already a mess (eg, the request_time field is quite useless as-is). >> So I'd like to propose pushing that in place of this patch ... do you >> care to review it first? > I've reviewed the patch today, and it seems fine to me - correct and > achieving the same goal as the patch posted to this thread (plus fixing > the issue with shared catalogs and fixing many comments). Thanks for reviewing! > FWIW do you still plan to back-patch this? Minimizing the amount of > changes was one of the things I had in mind when writing "my" patch, > which is why I ended up with parts that are less readable. Yeah, I think it's a bug fix and should be back-patched. I'm not in favor of making things more complicated just to reduce the number of lines a patch touches. > The one change I'm not quite sure about is the removal of clock skew > detection in pgstat_recv_inquiry(). You've removed the first check on > the inquiry, replacing it with this comment: > It seems sufficient to check for clock skew once per write round. > But the first check was comparing msg/req, while the second check looks > at dbentry/cur_ts. I don't see how those two clock skew checks are > redundant - if they are, the comment should explain that I guess. I'm confused here --- are you speaking of having removed if (msg->cutoff_time > req->request_time) req->request_time = msg->cutoff_time; ? That is not a check for clock skew, it's intending to be sure that req->request_time reflects the latest request for this DB when we've seen more than one request. But since req->request_time isn't actually being used anywhere, it's useless code. 
I reformatted the actual check for clock skew, but I do not think I changed its behavior. > Another thing is that if you believe merging requests across databases > is a silly idea, maybe we should bite the bullet and replace the list of > requests with a single item. I'm not convinced about this, though. No, I don't want to do that either. We're not spending much code by having pending_write_requests be a list rather than a single entry, and we might eventually figure out a reasonable way to time the flushes so that we can merge requests. regards, tom lane
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Hi, On 05/26/2016 10:10 PM, Tom Lane wrote: Tomas Vondra writes: Attached is a patch that should fix the coalescing, including the clock skew detection. In the end I reorganized the code a bit, moving the check to the end, after the clock skew detection. Otherwise I'd have to do the clock skew detection in multiple places, and that seemed ugly. I hadn't been paying any attention to this thread, I must confess. But I rediscovered this no-coalescing problem while pursuing the poor behavior for shared catalogs that Peter complained of in https://www.postgresql.org/message-id/56ad41ac.1030...@gmx.net I posted a patch at https://www.postgresql.org/message-id/13023.1464213...@sss.pgh.pa.us which I think is functionally equivalent to what you have here, but it goes to some lengths to make the code more readable, whereas this is just adding another layer of complication to something that's already a mess (eg, the request_time field is quite useless as-is). So I'd like to propose pushing that in place of this patch ... do you care to review it first? I've reviewed the patch today, and it seems fine to me - correct and achieving the same goal as the patch posted to this thread (plus fixing the issue with shared catalogs and fixing many comments). FWIW do you still plan to back-patch this? Minimizing the amount of changes was one of the things I had in mind when writing "my" patch, which is why I ended up with parts that are less readable. The one change I'm not quite sure about is the removal of clock skew detection in pgstat_recv_inquiry(). You've removed the first check on the inquiry, replacing it with this comment: It seems sufficient to check for clock skew once per write round. But the first check was comparing msg/req, while the second check looks at dbentry/cur_ts. I don't see how those two clock skew checks are redundant - if they are, the comment should explain that I guess. 
Another thing is that if you believe merging requests across databases is a silly idea, maybe we should bite the bullet and replace the list of requests with a single item. I'm not convinced about this, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Tomas Vondra writes: > On 05/26/2016 10:10 PM, Tom Lane wrote: >> In view of 52e8fc3e2, there's more or less no case in which we'd be >> writing stats without writing stats for the shared catalogs. So I'm >> tempted to propose that we try to reduce the overhead by merging the >> shared-catalog stats back into the global-stats file, thereby >> halving the filesystem metadata traffic for updating those. > I find this a bit contradictory with the previous paragraph. If you > believe that reducing the filesystem metadata traffic will have a > measurable benefit, then surely merging writes for multiple dbs (thus > not writing the global/shared files multiple times) will have even > higher impact, no? Well, my thinking is that this is something we could get "for free" without any penalty in response time. Going further will require some sort of tradeoff. regards, tom lane
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
On Thu, May 26, 2016 at 6:43 PM, Tomas Vondra wrote: > On 05/26/2016 10:10 PM, Tom Lane wrote: >> Tomas Vondra writes: >> In view of 52e8fc3e2, there's more or less no case in which we'd be >> writing stats without writing stats for the shared catalogs. So I'm >> tempted to propose that we try to reduce the overhead by merging the >> shared-catalog stats back into the global-stats file, thereby >> halving the filesystem metadata traffic for updating those. > > [...] > > That being said, I'm not opposed to merging the shared catalog into the > global-stats file - it's not really a separate database so having it in a > separate file is a bit weird. While looking at this stuff, to be honest I got surprised that shared relation stats are located in a file whose name depends on InvalidOid, so +1 from here as well to merge that into the global stats file. -- Michael
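Michael's observation is easy to see from how per-database stats filenames are built. The sketch below assumes a simplified "db_<oid>.stat" naming pattern with no directory prefix (the real code builds the name under the stats temp directory); the helper name is invented for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/*
 * Build the filename for one database's stats file.  Because shared
 * relations are tracked under the entry for InvalidOid, they end up in
 * "db_0.stat" -- a file that looks like just another database's file,
 * which is the oddity being pointed out above.
 */
static void
db_stat_filename(char *buf, size_t len, Oid databaseid)
{
    snprintf(buf, len, "db_%u.stat", databaseid);
}
```

Merging the shared-catalog stats into the global-stats file would remove this special case along with one file write per round.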
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Hi, On 05/26/2016 10:10 PM, Tom Lane wrote: Tomas Vondrawrites: Attached is a patch that should fix the coalescing, including the clock skew detection. In the end I reorganized the code a bit, moving the check at the end, after the clock skew detection. Otherwise I'd have to do the clock skew detection on multiple places, and that seemed ugly. I hadn't been paying any attention to this thread, I must confess. But I rediscovered this no-coalescing problem while pursuing the poor behavior for shared catalogs that Peter complained of in https://www.postgresql.org/message-id/56ad41ac.1030...@gmx.net I posted a patch at https://www.postgresql.org/message-id/13023.1464213...@sss.pgh.pa.us which I think is functionally equivalent to what you have here, but it goes to some lengths to make the code more readable, whereas this is just adding another layer of complication to something that's already a mess (eg, the request_time field is quite useless as-is). So I'd like to propose pushing that in place of this patch ... do you care to review it first? I do care and I'll look at it over the next few days. FWIW when writing that patch I intentionally refrained from major changes, as I think the plan was to backpatch it. But +1 for more readable code from me. Reacting to the thread overall: I see Noah's concern about wanting to merge the write work for requests about different databases. I've got mixed feelings about that: it's arguable that any such change would make things worse not better. In particular, it's inevitable that trying to merge writes will result in delaying the response to the first request, whether or not we are able to merge anything. That's not good in itself, and it means that we can't hope to merge requests over any very long interval, which very possibly will prevent any merging from happening in real situations. 
Also, considering that we know the stats collector can be pretty slow to respond at all under load, I'm worried that this would result in more outright failures. Moreover, what we'd hope to gain from merging is fewer writes of the global stats file and the shared-catalog stats file; but neither of those are very big, so I'm skeptical of what we'd win. Yep. Clearly there's a trade-off between slowing down response to the first request vs. speeding-up the whole process, but as you point out we probably can't gain enough to justify that. I wonder if that's still true on clusters with many databases (say, shared systems with thousands of dbs). Perhaps walking the list just once would save enough CPU to make this a win. In any case, if we decide to abandon the idea of merging requests for multiple databases, that probably means we can further simplify the code. last_statrequests is a list but it actually never contains more than just a single request. We kept it that way because of the plan to add the merging. But if that's not worth it ... In view of 52e8fc3e2, there's more or less no case in which we'd be writing stats without writing stats for the shared catalogs. So I'm tempted to propose that we try to reduce the overhead by merging the shared-catalog stats back into the global-stats file, thereby halving the filesystem metadata traffic for updating those. I find this a bit contradictory with the previous paragraph. If you believe that reducing the filesystem metadata traffic will have a measurable benefit, then surely merging writes for multiple dbs (thus not writing the global/shared files multiple times) will have even higher impact, no? E.g. let's assume we're still writing the global+shared+db files for each database. If there are requests for 10 databases, we'll write 30 files. If we merge those requests first, we're writing only 12 files. So I'm not sure about the metadata traffic argument, we'd need to see some numbers showing it really makes a difference. 
That being said, I'm not opposed to merging the shared catalog into the global-stats file - it's not really a separate database so having it in a separate file is a bit weird. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
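The 30-versus-12 arithmetic from the example above can be captured in a tiny helper. Illustrative only — the "global + shared + per-db" breakdown is as described in the mail, and the function name is invented:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Files written in one write round for a batch of requests, one request
 * per database.  Unmerged, every request rewrites global + shared + its
 * own db file.  Merged, global and shared are written once, followed by
 * one file per database.
 */
static int
files_written(int ndatabases, bool merge_requests)
{
    if (merge_requests)
        return 2 + ndatabases;  /* global + shared once, then each db */
    return 3 * ndatabases;      /* global + shared + db for every request */
}
```

The gap widens linearly with the number of databases, which is why the merging question matters most on shared clusters with many databases.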
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Tomas Vondra writes: > Attached is a patch that should fix the coalescing, including the clock > skew detection. In the end I reorganized the code a bit, moving the > check to the end, after the clock skew detection. Otherwise I'd have to > do the clock skew detection in multiple places, and that seemed ugly. I hadn't been paying any attention to this thread, I must confess. But I rediscovered this no-coalescing problem while pursuing the poor behavior for shared catalogs that Peter complained of in https://www.postgresql.org/message-id/56ad41ac.1030...@gmx.net I posted a patch at https://www.postgresql.org/message-id/13023.1464213...@sss.pgh.pa.us which I think is functionally equivalent to what you have here, but it goes to some lengths to make the code more readable, whereas this is just adding another layer of complication to something that's already a mess (eg, the request_time field is quite useless as-is). So I'd like to propose pushing that in place of this patch ... do you care to review it first? Reacting to the thread overall: I see Noah's concern about wanting to merge the write work for requests about different databases. I've got mixed feelings about that: it's arguable that any such change would make things worse not better. In particular, it's inevitable that trying to merge writes will result in delaying the response to the first request, whether or not we are able to merge anything. That's not good in itself, and it means that we can't hope to merge requests over any very long interval, which very possibly will prevent any merging from happening in real situations. Also, considering that we know the stats collector can be pretty slow to respond at all under load, I'm worried that this would result in more outright failures. Moreover, what we'd hope to gain from merging is fewer writes of the global stats file and the shared-catalog stats file; but neither of those are very big, so I'm skeptical of what we'd win. 
In view of 52e8fc3e2, there's more or less no case in which we'd be writing stats without writing stats for the shared catalogs. So I'm tempted to propose that we try to reduce the overhead by merging the shared-catalog stats back into the global-stats file, thereby halving the filesystem metadata traffic for updating those. regards, tom lane
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Hi, On 03/15/2016 03:04 AM, Noah Misch wrote: On Mon, Mar 14, 2016 at 01:33:08PM +0100, Tomas Vondra wrote: On 03/14/2016 07:14 AM, Noah Misch wrote: On Mon, Mar 14, 2016 at 02:00:03AM +0100, Tomas Vondra wrote: +* XXX Maybe this should also care about the clock skew, just like the +* block a few lines down. Yes, it should. (The problem is large (>~100s), backward clock resets, not skew.) A clock reset causing "msg->clock_time < dbentry->stats_timestamp" will usually also cause "msg->cutoff_time < dbentry->stats_timestamp". Such cases need the correction a few lines down. I'll look into that. I have to admit I have a hard time reasoning about the code handling clock skew, so it might take some time, though. No hurry; it would be no problem to delay this several months. Attached is a patch that should fix the coalescing, including the clock skew detection. In the end I reorganized the code a bit, moving the check to the end, after the clock skew detection. Otherwise I'd have to do the clock skew detection in multiple places, and that seemed ugly. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services pgstat-coalesce-v3.patch Description: binary/octet-stream
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
On Mon, Mar 14, 2016 at 01:33:08PM +0100, Tomas Vondra wrote: > On 03/14/2016 07:14 AM, Noah Misch wrote: > >On Mon, Mar 14, 2016 at 02:00:03AM +0100, Tomas Vondra wrote: > >>+* XXX Maybe this should also care about the clock skew, just like the > >>+* block a few lines down. > > > >Yes, it should. (The problem is large (>~100s), backward clock resets, not > >skew.) A clock reset causing "msg->clock_time < dbentry->stats_timestamp" > >will usually also cause "msg->cutoff_time < dbentry->stats_timestamp". Such > >cases need the correction a few lines down. > > I'll look into that. I have to admit I have a hard time reasoning about the > code handling clock skew, so it might take some time, though. No hurry; it would be no problem to delay this several months. > >The other thing needed here is to look through and update comments about > >last_statrequests. For example, this loop is dead code due to us writing > >files as soon as we receive one inquiry: > > > > /* > > * Find the last write request for this DB. If it's older than the > > * request's cutoff time, update it; otherwise there's nothing to do. > > * > > * Note that if a request is found, we return early and skip the below > > * check for clock skew. This is okay, since the only way for a DB > > * request to be present in the list is that we have been here since the > > * last write round. > > */ > > slist_foreach(iter, &last_statrequests) ... > > > >I'm okay keeping the dead code for future flexibility, but the comment should > >reflect that. > > Yes, that's another thing that I'd like to look into. Essentially the > problem is that we always process the inquiries one by one, so we never > actually see a list with more than a single element. Correct? Correct. > I think the best way to address that is to first check how much > data is in the UDP queue, and then fetch all of that before actually > doing the writes. 
Peeking at the number of requests first (or even some > reasonable hard-coded limit) should prevent starving the inquirers in > case of a steady stream of inquiries. Now that you mention it, a hard-coded limit sounds good: write the files for pending inquiries whenever the socket empties or every N messages processed, whichever comes first. I don't think the amount of pending UDP data is portably available, and I doubt it's important. Breaking every, say, one thousand messages will make the collector predictably responsive to inquiries, and that is the important property. I would lean toward making this part 9.7-only; it would be a distinct patch from the one previously under discussion. Thanks, nm
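Noah's "socket empties or every N messages" policy boils down to a simple predicate driving the collector's message loop. A minimal sketch under invented names (FLUSH_EVERY_N_MESSAGES and should_flush are not real pgstat identifiers, and 1000 is just the "say, one thousand" figure from the mail):

```c
#include <assert.h>
#include <stdbool.h>

#define FLUSH_EVERY_N_MESSAGES 1000     /* hypothetical hard-coded limit */

/*
 * Decide whether to write out pending inquiries now: either the socket
 * has no more input (the collector is about to go idle anyway), or it
 * has handled N messages since the last flush and owes the inquirers a
 * response.  Either way the wait for a requester is bounded, which is
 * the "predictably responsive" property described above.
 */
static bool
should_flush(int messages_since_flush, bool socket_empty)
{
    return socket_empty || messages_since_flush >= FLUSH_EVERY_N_MESSAGES;
}
```

The counter would be reset after each flush; the point of the design is that responsiveness no longer depends on being able to measure how much data is queued in the UDP socket, which is not portably available.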
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Hi, On 03/14/2016 07:14 AM, Noah Misch wrote: [Aside: your new mail editor is rewrapping lines in quoted material, and the result is messy. I have rerewrapped one paragraph below.] Thanks, I've noticed that too. I've been testing Evolution in the past few days, and apparently the line wrapping algorithm is broken. I've switched back to Thunderbird, so hopefully that'll fix it. On Mon, Mar 14, 2016 at 02:00:03AM +0100, Tomas Vondra wrote: On Sun, 2016-03-13 at 18:46 -0400, Noah Misch wrote: I've not attempted to study the behavior on slow hardware. Instead, my report used stat-coalesce-v1.patch[1] to simulate slow writes. (That diagnostic patch no longer applies cleanly, so I'm attaching a rebased version. I've changed the patch name from "stat-coalesce" to "slow-stat-simulate" to more-clearly distinguish it from the "pgstat-coalesce" patch.) Problems remain after applying your patch; consider "VACUUM pg_am" behavior: 9.2 w/ stat-coalesce-v1.patch: VACUUM returns in 3s, stats collector writes each file 1x over 3s HEAD w/ slow-stat-simulate-v2.patch: VACUUM returns in 3s, stats collector writes each file 5x over 15s HEAD w/ slow-stat-simulate-v2.patch and your patch: VACUUM returns in 10s, stats collector writes no files over 10s Oh damn, the timestamp comparison in pgstat_recv_inquiry should be in the opposite direction. After fixing that "VACUUM pg_am" completes in 3 seconds and writes each file just once. That fixes things. "make check" passes under an 8s stats write time. OK, good. --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -4836,6 +4836,20 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len) } /* +* Ignore requests that are already resolved by the last write. +* +* We discard the queue of requests at the end of pgstat_write_statsfiles(), +* so the requests already waiting on the UDP socket at that moment can't +* be discarded in the previous loop. 
+* +* XXX Maybe this should also care about the clock skew, just like the +* block a few lines down. Yes, it should. (The problem is large (>~100s), backward clock resets, not skew.) A clock reset causing "msg->clock_time < dbentry->stats_timestamp" will usually also cause "msg->cutoff_time < dbentry->stats_timestamp". Such cases need the correction a few lines down. I'll look into that. I have to admit I have a hard time reasoning about the code handling clock skew, so it might take some time, though. The other thing needed here is to look through and update comments about last_statrequests. For example, this loop is dead code due to us writing files as soon as we receive one inquiry: /* * Find the last write request for this DB. If it's older than the * request's cutoff time, update it; otherwise there's nothing to do. * * Note that if a request is found, we return early and skip the below * check for clock skew. This is okay, since the only way for a DB * request to be present in the list is that we have been here since the * last write round. */ slist_foreach(iter, &last_statrequests) ... I'm okay keeping the dead code for future flexibility, but the comment should reflect that. Yes, that's another thing that I'd like to look into. Essentially the problem is that we always process the inquiries one by one, so we never actually see a list with more than a single element. Correct? I think the best way to address that is to first check how much data is in the UDP queue, and then fetch all of that before actually doing the writes. Peeking at the number of requests first (or even some reasonable hard-coded limit) should prevent starving the inquirers in case of a steady stream of inquiries. 
regards Tomas -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
[Aside: your new mail editor is rewrapping lines in quoted material, and the result is messy. I have rerewrapped one paragraph below.] On Mon, Mar 14, 2016 at 02:00:03AM +0100, Tomas Vondra wrote: > On Sun, 2016-03-13 at 18:46 -0400, Noah Misch wrote: > > I've not attempted to study the behavior on slow hardware. Instead, my > > report used stat-coalesce-v1.patch[1] to simulate slow writes. (That > > diagnostic patch no longer applies cleanly, so I'm attaching a rebased > > version. I've changed the patch name from "stat-coalesce" to > > "slow-stat-simulate" to more-clearly distinguish it from the > > "pgstat-coalesce" patch.) Problems remain after applying your patch; > > consider "VACUUM pg_am" behavior: > > > > 9.2 w/ stat-coalesce-v1.patch: > > VACUUM returns in 3s, stats collector writes each file 1x over 3s > > HEAD w/ slow-stat-simulate-v2.patch: > > VACUUM returns in 3s, stats collector writes each file 5x over 15s > > HEAD w/ slow-stat-simulate-v2.patch and your patch: > > VACUUM returns in 10s, stats collector writes no files over 10s > > Oh damn, the timestamp comparison in pgstat_recv_inquiry should be in > the opposite direction. After fixing that "VACUUM pg_am" completes in 3 > seconds and writes each file just once. That fixes things. "make check" passes under an 8s stats write time. > --- a/src/backend/postmaster/pgstat.c > +++ b/src/backend/postmaster/pgstat.c > @@ -4836,6 +4836,20 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len) > } > > /* > + * Ignore requests that are already resolved by the last write. > + * > + * We discard the queue of requests at the end of > pgstat_write_statsfiles(), > + * so the requests already waiting on the UDP socket at that moment > can't > + * be discarded in the previous loop. > + * > + * XXX Maybe this should also care about the clock skew, just like the > + * block a few lines down. Yes, it should. (The problem is large (>~100s), backward clock resets, not skew.) 
A clock reset causing "msg->clock_time < dbentry->stats_timestamp" will
usually also cause "msg->cutoff_time < dbentry->stats_timestamp".  Such cases
need the correction a few lines down.

The other thing needed here is to look through and update comments about
last_statrequests.  For example, this loop is dead code, because we write
files as soon as we receive one inquiry:

	/*
	 * Find the last write request for this DB.  If it's older than the
	 * request's cutoff time, update it; otherwise there's nothing to do.
	 *
	 * Note that if a request is found, we return early and skip the below
	 * check for clock skew.  This is okay, since the only way for a DB
	 * request to be present in the list is that we have been here since the
	 * last write round.
	 */
	slist_foreach(iter, &last_statrequests)
	...

I'm okay keeping the dead code for future flexibility, but the comment should
reflect that.

Thanks,
nm

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
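The two conditions discussed above (the inquiry's cutoff already satisfied by the last write, and no backward clock reset on the sender) can be combined into a single predicate. The following is a minimal, compilable sketch with stand-in types, not the actual pgstat.c code; the field names follow the real structs, but everything else is an assumption for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins for the pgstat types; field names follow pgstat.c,
 * but the surrounding collector infrastructure is omitted. */
typedef int64_t TimestampTz;

typedef struct
{
	TimestampTz clock_time;		/* sender's current clock reading */
	TimestampTz cutoff_time;	/* minimum acceptable stats-file timestamp */
} PgStat_MsgInquiry;

typedef struct
{
	TimestampTz stats_timestamp;	/* time of last write for this DB */
} PgStat_StatDBEntry;

/*
 * An inquiry is already satisfied by the last write only if the cutoff
 * predates that write AND the sender's clock has not run backward past
 * it.  A backward clock reset (msg->clock_time < stats_timestamp) must
 * not let us skip the write: in that case cutoff_time would also look
 * stale even though the backend genuinely wants fresher stats.
 */
static bool
inquiry_already_satisfied(const PgStat_MsgInquiry *msg,
						  const PgStat_StatDBEntry *dbentry)
{
	if (dbentry == NULL)
		return false;			/* no entry yet: must handle the request */
	if (msg->clock_time < dbentry->stats_timestamp)
		return false;			/* clock went backward: don't trust cutoff */
	return msg->cutoff_time <= dbentry->stats_timestamp;
}
```

Under this sketch, the early-return added by the patch would call the predicate before creating a new DBWriteRequest.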
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
On Sun, 2016-03-13 at 18:46 -0400, Noah Misch wrote:
> On Thu, Mar 03, 2016 at 06:08:07AM +0100, Tomas Vondra wrote:
> > On 02/03/2016 06:46 AM, Noah Misch wrote:
> > > On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote:
> > > > On 12/22/2015 03:49 PM, Noah Misch wrote:
> > > > > On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote:
> > > > > > I have pushed it now. Further testing, of course, is always
> > > > > > welcome.
> > > > >
> > > > > While investigating stats.sql buildfarm failures, mostly on animals
> > > > > axolotl and shearwater, I found that this patch (commit 187492b)
> > > > > inadvertently removed the collector's ability to coalesce inquiries.
> > > > > Every PGSTAT_MTYPE_INQUIRY received now causes one stats file write.
> > > > > Before, pgstat_recv_inquiry() did:
> > > > >
> > > > > 	if (msg->inquiry_time > last_statrequest)
> > > > > 		last_statrequest = msg->inquiry_time;
> > > > >
> > > > > and pgstat_write_statsfile() did:
> > > > >
> > > > > 	globalStats.stats_timestamp = GetCurrentTimestamp();
> > > > > 	... (work of writing a stats file) ...
> > > > > 	last_statwrite = globalStats.stats_timestamp;
> > > > > 	last_statrequest = last_statwrite;
> > > > >
> > > > > If the collector entered pgstat_write_statsfile() with more inquiries
> > > > > waiting in its socket receive buffer, it would ignore them as being
> > > > > too old once it finished the write and resumed message processing.
> > > > > Commit 187492b converted last_statrequest to a "last_statrequests"
> > > > > list that we wipe after each write.
> >
> > So I've been looking at this today, and I think the attached patch should
> > do the trick. I can't really verify it, because I've been unable to
> > reproduce the non-coalescing - I presume it requires a much slower system
> > (axolotl is RPi, not sure about shearwater).
> > The patch simply checks DBEntry.stats_timestamp in pgstat_recv_inquiry()
> > and ignores requests that are already resolved by the last write (maybe
> > this should accept a file written up to PGSTAT_STAT_INTERVAL in the
> > past).
> >
> > The required field is already in DBEntry (otherwise it'd be impossible to
> > determine if backends need to send inquiries or not), so this is a pretty
> > trivial change. I can't think of a simpler patch.
> >
> > Can you try applying the patch on a machine where the problem is
> > reproducible? I might have some RPi machines laying around (for other
> > purposes).
>
> I've not attempted to study the behavior on slow hardware. Instead, my
> report used stat-coalesce-v1.patch[1] to simulate slow writes. (That
> diagnostic patch no longer applies cleanly, so I'm attaching a rebased
> version. I've changed the patch name from "stat-coalesce" to
> "slow-stat-simulate" to more-clearly distinguish it from the
> "pgstat-coalesce" patch.) Problems remain after applying your patch;
> consider "VACUUM pg_am" behavior:
>
>   9.2 w/ stat-coalesce-v1.patch:
>     VACUUM returns in 3s, stats collector writes each file 1x over 3s
>   HEAD w/ slow-stat-simulate-v2.patch:
>     VACUUM returns in 3s, stats collector writes each file 5x over 15s
>   HEAD w/ slow-stat-simulate-v2.patch and your patch:
>     VACUUM returns in 10s, stats collector writes no files over 10s

Oh damn, the timestamp comparison in pgstat_recv_inquiry should be in the
opposite direction. After fixing that, "VACUUM pg_am" completes in 3 seconds
and writes each file just once.
regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 14afef6..e750d46 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4836,6 +4836,20 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
 	}
 
 	/*
+	 * Ignore requests that are already resolved by the last write.
+	 *
+	 * We discard the queue of requests at the end of pgstat_write_statsfiles(),
+	 * so the requests already waiting on the UDP socket at that moment can't
+	 * be discarded in the previous loop.
+	 *
+	 * XXX Maybe this should also care about the clock skew, just like the
+	 * block a few lines down.
+	 */
+	dbentry = pgstat_get_db_entry(msg->databaseid, false);
+	if ((dbentry != NULL) && (msg->cutoff_time <= dbentry->stats_timestamp))
+		return;
+
+	/*
 	 * There's no request for this DB yet, so create one.
 	 */
 	newreq = palloc(sizeof(DBWriteRequest));
@@ -4852,7 +4866,6 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
 	 * retreat in the system clock reading could otherwise cause us to neglect
 	 * to update the stats file for a long time.
 	 */
-	dbentry = pgstat_get_db_entry(msg->databaseid, false);
 	if ((dbentry != NULL)
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
On Thu, Mar 03, 2016 at 06:08:07AM +0100, Tomas Vondra wrote:
> On 02/03/2016 06:46 AM, Noah Misch wrote:
> > On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote:
> > > On 12/22/2015 03:49 PM, Noah Misch wrote:
> > > > On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote:
> > > > > I have pushed it now. Further testing, of course, is always welcome.
> > > >
> > > > While investigating stats.sql buildfarm failures, mostly on animals
> > > > axolotl and shearwater, I found that this patch (commit 187492b)
> > > > inadvertently removed the collector's ability to coalesce inquiries.
> > > > Every PGSTAT_MTYPE_INQUIRY received now causes one stats file write.
> > > > Before, pgstat_recv_inquiry() did:
> > > >
> > > > 	if (msg->inquiry_time > last_statrequest)
> > > > 		last_statrequest = msg->inquiry_time;
> > > >
> > > > and pgstat_write_statsfile() did:
> > > >
> > > > 	globalStats.stats_timestamp = GetCurrentTimestamp();
> > > > 	... (work of writing a stats file) ...
> > > > 	last_statwrite = globalStats.stats_timestamp;
> > > > 	last_statrequest = last_statwrite;
> > > >
> > > > If the collector entered pgstat_write_statsfile() with more inquiries
> > > > waiting in its socket receive buffer, it would ignore them as being too
> > > > old once it finished the write and resumed message processing. Commit
> > > > 187492b converted last_statrequest to a "last_statrequests" list that
> > > > we wipe after each write.
>
> So I've been looking at this today, and I think the attached patch should do
> the trick. I can't really verify it, because I've been unable to reproduce
> the non-coalescing - I presume it requires a much slower system (axolotl is
> RPi, not sure about shearwater).
>
> The patch simply checks DBEntry.stats_timestamp in pgstat_recv_inquiry() and
> ignores requests that are already resolved by the last write (maybe this
> should accept a file written up to PGSTAT_STAT_INTERVAL in the past).
> The required field is already in DBEntry (otherwise it'd be impossible to
> determine if backends need to send inquiries or not), so this is a pretty
> trivial change. I can't think of a simpler patch.
>
> Can you try applying the patch on a machine where the problem is
> reproducible? I might have some RPi machines laying around (for other
> purposes).

I've not attempted to study the behavior on slow hardware.  Instead, my report
used stat-coalesce-v1.patch[1] to simulate slow writes.  (That diagnostic
patch no longer applies cleanly, so I'm attaching a rebased version.  I've
changed the patch name from "stat-coalesce" to "slow-stat-simulate" to
more-clearly distinguish it from the "pgstat-coalesce" patch.)  Problems
remain after applying your patch; consider "VACUUM pg_am" behavior:

  9.2 w/ stat-coalesce-v1.patch:
    VACUUM returns in 3s, stats collector writes each file 1x over 3s
  HEAD w/ slow-stat-simulate-v2.patch:
    VACUUM returns in 3s, stats collector writes each file 5x over 15s
  HEAD w/ slow-stat-simulate-v2.patch and your patch:
    VACUUM returns in 10s, stats collector writes no files over 10s

[1] http://www.postgresql.org/message-id/20151222144950.ga2553...@tornado.leadboat.com

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 14afef6..4308df2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3444,6 +3444,7 @@ pgstat_send_bgwriter(void)
 NON_EXEC_STATIC void
 PgstatCollectorMain(int argc, char *argv[])
 {
+	unsigned	total = 0;
 	int			len;
 	PgStat_Msg	msg;
 	int			wr;
@@ -3555,6 +3556,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					 errmsg("could not read statistics message: %m")));
 		}
 
+		elog(LOG, "stats %d: %u + %u = %u",
+			 msg.msg_hdr.m_type, total, len, total + len);
+		total += len;
+
 		/*
 		 * We ignore messages that are smaller than our common header
 		 */
@@ -3947,6 +3952,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	 */
 	fputc('E', fpout);
 
+	if (1)
+	{
+		PG_SETMASK();
+		pg_usleep(3 * 1000000L);
+		PG_SETMASK();
+	}
+
 	if (ferror(fpout))
 	{
 		ereport(LOG,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a811265..064cf9f 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -63,6 +63,9 @@ begin
   SELECT (n_tup_ins > 0) INTO updated3
   FROM pg_stat_user_tables WHERE relname='trunc_stats_test';
 
+  raise log 'stats updated as of % snapshot: 1:% 2:% 3:%',
+    pg_stat_get_snapshot_timestamp(), updated1, updated2, updated3;
+
   exit when updated1 and updated2 and updated3;
 
   -- wait a little
diff --git a/src/test/regress/sql/stats.sql
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Hi,

On 02/03/2016 06:46 AM, Noah Misch wrote:
> On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote:
> > On 12/22/2015 03:49 PM, Noah Misch wrote:
> > > On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote:
> > > > I have pushed it now. Further testing, of course, is always welcome.
> > >
> > > While investigating stats.sql buildfarm failures, mostly on animals
> > > axolotl and shearwater, I found that this patch (commit 187492b)
> > > inadvertently removed the collector's ability to coalesce inquiries.
> > > Every PGSTAT_MTYPE_INQUIRY received now causes one stats file write.
> > > Before, pgstat_recv_inquiry() did:
> > >
> > > 	if (msg->inquiry_time > last_statrequest)
> > > 		last_statrequest = msg->inquiry_time;
> > >
> > > and pgstat_write_statsfile() did:
> > >
> > > 	globalStats.stats_timestamp = GetCurrentTimestamp();
> > > 	... (work of writing a stats file) ...
> > > 	last_statwrite = globalStats.stats_timestamp;
> > > 	last_statrequest = last_statwrite;
> > >
> > > If the collector entered pgstat_write_statsfile() with more inquiries
> > > waiting in its socket receive buffer, it would ignore them as being too
> > > old once it finished the write and resumed message processing. Commit
> > > 187492b converted last_statrequest to a "last_statrequests" list that we
> > > wipe after each write.

So I've been looking at this today, and I think the attached patch should do
the trick. I can't really verify it, because I've been unable to reproduce
the non-coalescing - I presume it requires a much slower system (axolotl is
RPi, not sure about shearwater).

The patch simply checks DBEntry.stats_timestamp in pgstat_recv_inquiry() and
ignores requests that are already resolved by the last write (maybe this
should accept a file written up to PGSTAT_STAT_INTERVAL in the past).

The required field is already in DBEntry (otherwise it'd be impossible to
determine if backends need to send inquiries or not), so this is a pretty
trivial change. I can't think of a simpler patch.

Can you try applying the patch on a machine where the problem is
reproducible?
I might have some RPi machines laying around (for other purposes).

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

[Attachment: pgstat-coalesce-v1.patch]
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
On 02/03/2016 06:46 AM, Noah Misch wrote:
> On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote:
> > On 12/22/2015 03:49 PM, Noah Misch wrote:
> > > ...
> > > If the collector entered pgstat_write_statsfile() with more inquiries
> > > waiting in its socket receive buffer, it would ignore them as being too
> > > old once it finished the write and resumed message processing. Commit
> > > 187492b converted last_statrequest to a "last_statrequests" list that we
> > > wipe after each write.
> >
> > So essentially we remove the list of requests, and thus on the next round
> > we don't know the timestamp of the last request and write the file again
> > unnecessarily. Do I get that right?
>
> Essentially right. Specifically, for each database, we must remember the
> globalStats.stats_timestamp of the most recent write. It could be okay to
> forget the last request timestamp. (I now doubt I picked the best lines to
> quote, above.)
>
> > What if we instead kept the list but marked the requests as 'invalid' so
> > that we know the timestamp? In that case we'd be able to do pretty much
> > exactly what the original code did (but at per-db granularity).
>
> The most natural translation of the old code would be to add a write_time
> field to struct DBWriteRequest. One can infer "invalid" from write_time and
> request_time. There are other reasonable designs, though.

OK, makes sense. I'll look into that.

> > We'd have to clean up the list once in a while so it doesn't grow
> > excessively large, but something like removing entries older than
> > PGSTAT_STAT_INTERVAL should be enough.
>
> Specifically, if you assume the socket delivers messages in the order sent,
> you may as well discard entries having write_time at least
> PGSTAT_STAT_INTERVAL older than the most recent cutoff_time seen in a
> PgStat_MsgInquiry. That delivery order assumption does not hold in general,
> but I expect it's close enough for this purpose.

Agreed. If I get that right, it might result in some false negatives (in the
sense that we'll remove a record too early, forcing us to write the database
file again).
But I expect that to be a rare case.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
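The write_time design discussed in this exchange could look roughly like the sketch below. The struct is a simplified stand-in: the real DBWriteRequest in pgstat.c uses slist links and, at this point in the thread, has no write_time field, so this only illustrates the "infer invalid from write_time and request_time" idea, not committed code:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t TimestampTz;	/* stand-in for the real TimestampTz */

/*
 * Simplified DBWriteRequest carrying the suggested write_time field.
 * request_time is the latest inquiry seen for this database; write_time
 * stays 0 until a stats-file write satisfies the request.
 */
typedef struct DBWriteRequest
{
	unsigned	databaseid;
	TimestampTz request_time;
	TimestampTz write_time;
} DBWriteRequest;

/*
 * "Invalid" (already satisfied) is inferred rather than stored as a
 * flag: a request is satisfied once a write at least as new as the
 * request has been recorded for its database.
 */
static int
request_is_satisfied(const DBWriteRequest *req)
{
	return req->write_time != 0 && req->write_time >= req->request_time;
}
```

Keeping satisfied entries around (instead of wiping the list) is what restores the pre-187492b ability to ignore inquiries that a recent write already answered.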
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
On Mon, Feb 01, 2016 at 07:03:45PM +0100, Tomas Vondra wrote:
> On 12/22/2015 03:49 PM, Noah Misch wrote:
> > On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote:
> > > I have pushed it now. Further testing, of course, is always welcome.
> >
> > While investigating stats.sql buildfarm failures, mostly on animals
> > axolotl and shearwater, I found that this patch (commit 187492b)
> > inadvertently removed the collector's ability to coalesce inquiries.
> > Every PGSTAT_MTYPE_INQUIRY received now causes one stats file write.
> > Before, pgstat_recv_inquiry() did:
> >
> > 	if (msg->inquiry_time > last_statrequest)
> > 		last_statrequest = msg->inquiry_time;
> >
> > and pgstat_write_statsfile() did:
> >
> > 	globalStats.stats_timestamp = GetCurrentTimestamp();
> > 	... (work of writing a stats file) ...
> > 	last_statwrite = globalStats.stats_timestamp;
> > 	last_statrequest = last_statwrite;
> >
> > If the collector entered pgstat_write_statsfile() with more inquiries
> > waiting in its socket receive buffer, it would ignore them as being too
> > old once it finished the write and resumed message processing. Commit
> > 187492b converted last_statrequest to a "last_statrequests" list that we
> > wipe after each write.
>
> So essentially we remove the list of requests, and thus on the next round we
> don't know the timestamp of the last request and write the file again
> unnecessarily. Do I get that right?

Essentially right.  Specifically, for each database, we must remember the
globalStats.stats_timestamp of the most recent write.  It could be okay to
forget the last request timestamp.  (I now doubt I picked the best lines to
quote, above.)

> What if we instead kept the list but marked the requests as 'invalid' so
> that we know the timestamp? In that case we'd be able to do pretty much
> exactly what the original code did (but at per-db granularity).

The most natural translation of the old code would be to add a write_time
field to struct DBWriteRequest.
One can infer "invalid" from write_time and request_time.  There are other
reasonable designs, though.

> We'd have to cleanup the list once in a while not to grow excessively
> large, but something like removing entries older than PGSTAT_STAT_INTERVAL
> should be enough.

Specifically, if you assume the socket delivers messages in the order sent,
you may as well discard entries having write_time at least
PGSTAT_STAT_INTERVAL older than the most recent cutoff_time seen in a
PgStat_MsgInquiry.  That delivery order assumption does not hold in general,
but I expect it's close enough for this purpose.

> > As for how to fix this, the most direct translation of the old logic is
> > to retain last_statrequests entries that could help coalesce inquiries.
> > I lean toward that for an initial, back-patched fix.
>
> That seems reasonable and I believe it's pretty much the idea I came up
> with above, right? Depending on how you define "entries that could help
> coalesce inquiries".

Yes.
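The pruning rule described above can be sketched as follows. This is a stand-alone illustration, not pgstat.c code: the real list uses the slist_* primitives, the list layout here is invented, and the interval value (PGSTAT_STAT_INTERVAL is 500 ms in pgstat.h, converted to microseconds here) is only assumed to match TimestampTz units:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t TimestampTz;

/* Assumed: 500ms, expressed in microseconds to match TimestampTz. */
#define PGSTAT_STAT_INTERVAL_US (500 * 1000)

/* Hypothetical singly linked request list; pgstat.c uses slist_*. */
typedef struct DBWriteRequest
{
	int			databaseid;
	TimestampTz request_time;	/* timestamp of the latest inquiry */
	TimestampTz write_time;		/* 0 until the request is satisfied */
	struct DBWriteRequest *next;
} DBWriteRequest;

/*
 * Drop satisfied requests whose write_time is at least
 * PGSTAT_STAT_INTERVAL older than the newest cutoff_time seen in any
 * PgStat_MsgInquiry, relying on (approximately) in-order delivery on
 * the stats socket.  Unsatisfied requests (write_time == 0) are kept.
 */
static DBWriteRequest *
prune_satisfied_requests(DBWriteRequest *head, TimestampTz newest_cutoff)
{
	DBWriteRequest **link = &head;

	while (*link != NULL)
	{
		DBWriteRequest *req = *link;

		if (req->write_time != 0 &&
			req->write_time + PGSTAT_STAT_INTERVAL_US <= newest_cutoff)
			*link = req->next;	/* unlink; caller owns freeing */
		else
			link = &req->next;
	}
	return head;
}
```

As noted in the thread, out-of-order delivery can make this drop a record slightly too early; the cost is only an extra write, not lost stats.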
Re: [HACKERS] Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Hi,

On 12/22/2015 03:49 PM, Noah Misch wrote:
> On Mon, Feb 18, 2013 at 06:19:12PM -0300, Alvaro Herrera wrote:
> > I have pushed it now. Further testing, of course, is always welcome.
>
> While investigating stats.sql buildfarm failures, mostly on animals axolotl
> and shearwater, I found that this patch (commit 187492b) inadvertently
> removed the collector's ability to coalesce inquiries. Every
> PGSTAT_MTYPE_INQUIRY received now causes one stats file write. Before,
> pgstat_recv_inquiry() did:
>
> 	if (msg->inquiry_time > last_statrequest)
> 		last_statrequest = msg->inquiry_time;
>
> and pgstat_write_statsfile() did:
>
> 	globalStats.stats_timestamp = GetCurrentTimestamp();
> 	... (work of writing a stats file) ...
> 	last_statwrite = globalStats.stats_timestamp;
> 	last_statrequest = last_statwrite;
>
> If the collector entered pgstat_write_statsfile() with more inquiries
> waiting in its socket receive buffer, it would ignore them as being too old
> once it finished the write and resumed message processing. Commit 187492b
> converted last_statrequest to a "last_statrequests" list that we wipe after
> each write.

So essentially we remove the list of requests, and thus on the next round we
don't know the timestamp of the last request and write the file again
unnecessarily. Do I get that right?

What if we instead kept the list but marked the requests as 'invalid' so that
we know the timestamp? In that case we'd be able to do pretty much exactly
what the original code did (but at per-db granularity). We'd have to clean up
the list once in a while so it doesn't grow excessively large, but something
like removing entries older than PGSTAT_STAT_INTERVAL should be enough.

Actually, I think that was the idea when I wrote the patch, but apparently I
got distracted and it did not make it into the code.

> I modeled a machine with slow stats writes using the attached diagnostic
> patch (not for commit). It has pgstat_write_statsfiles() sleep just before
> renaming the temporary file, and it logs each stats message received.
> A three second delay causes stats.sql to fail the way it did on
> shearwater[1] and on axolotl[2]. Inquiries accumulate until the socket
> receive buffer overflows, at which point the socket drops stats messages
> whose effects we later test for. The 3s delay makes us drop some 85% of
> installcheck stats messages. Logs from a single VACUUM show receipt of five
> inquiries ("stats 1:") with 3s between them:
>
> 24239 2015-12-10 04:21:03.439 GMT LOG:  connection authorized: user=nmisch database=test
> 24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 1888 + 936 = 2824
> 24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 2824 + 376 = 3200
> 24236 2015-12-10 04:21:03.454 GMT LOG:  stats 2: 3200 + 824 = 4024
> 24239 2015-12-10 04:21:03.455 GMT LOG:  statement: vacuum pg_class
> 24236 2015-12-10 04:21:03.455 GMT LOG:  stats 1: 4024 + 32 = 4056
> 24236 2015-12-10 04:21:06.458 GMT LOG:  stats 12: 4056 + 88 = 4144
> 24236 2015-12-10 04:21:06.458 GMT LOG:  stats 1: 4144 + 32 = 4176
> 24239 2015-12-10 04:21:06.463 GMT LOG:  disconnection: session time: 0:00:03.025 user=nmisch database=test host=[local]
> 24236 2015-12-10 04:21:09.486 GMT LOG:  stats 1: 4176 + 32 = 4208
> 24236 2015-12-10 04:21:12.503 GMT LOG:  stats 1: 4208 + 32 = 4240
> 24236 2015-12-10 04:21:15.519 GMT LOG:  stats 1: 4240 + 32 = 4272
> 24236 2015-12-10 04:21:18.535 GMT LOG:  stats 9: 4272 + 48 = 4320
> 24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 4320 + 936 = 5256
> 24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 5256 + 376 = 5632
> 24236 2015-12-10 04:21:18.535 GMT LOG:  stats 2: 5632 + 264 = 5896
> 24236 2015-12-10 04:21:18.535 GMT LOG:  stats 12: 5896 + 88 = 5984
>
> As for how to fix this, the most direct translation of the old logic is to
> retain last_statrequests entries that could help coalesce inquiries. I lean
> toward that for an initial, back-patched fix.

That seems reasonable and I believe it's pretty much the idea I came up with
above, right? Depending on how you define "entries that could help coalesce
inquiries".
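One possible definition of "entries that could help coalesce inquiries", sketched with stand-in types (the predicate name and the write_time field are hypothetical here, and the actual back-patched fix may define this differently): a retained, already-satisfied request can absorb a new inquiry whenever the write it recorded is at least as new as the inquiry's cutoff.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t TimestampTz;

/* Stand-in request record; write_time is 0 until a write satisfied it. */
typedef struct
{
	TimestampTz request_time;
	TimestampTz write_time;
} DBWriteRequest;

/*
 * An inquiry asking for stats no older than cutoff_time needs no new
 * write if the retained request already recorded a write at least that
 * new.  This is the coalescing that commit 187492b lost by wiping the
 * request list after each write.
 */
static int
request_covers_inquiry(const DBWriteRequest *req, TimestampTz cutoff_time)
{
	return req->write_time != 0 && req->write_time >= cutoff_time;
}
```

Under this definition, an entry stops being useful once every plausible future cutoff_time exceeds its write_time, which is exactly what the PGSTAT_STAT_INTERVAL-based pruning discussed earlier exploits.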
> It would be good, though, to process two closely-spaced, different-database
> inquiries in one pgstat_write_statsfiles() call. We do one-database writes
> and all-databases writes, but we never write "1 < N < all" databases despite
> the code being prepared to do so. I tried calling pgstat_write_statsfiles()
> only when the socket receive buffer empties. That's dead simple to implement
> and aggressively coalesces inquiries (even a 45s sleep did not break
> stats.sql), but it starves inquirers if the socket receive buffer stays full
> persistently. Ideally, I'd want to process inquiries when the buffer empties
> _or_ when the oldest inquiry is X seconds old. I don't have a more-specific
> design in mind, though.

That's a nice idea, but I agree that binding the coalescing to the buffer
like this seems like a pretty bad idea exactly because of the starving. What
might work though is if we could look at how much data is there in the