Re: [HACKERS] Enabling Checksums
On 3/18/13 10:52 AM, Bruce Momjian wrote:
> With a potential 10-20% overhead, I am unclear who would enable this at
> initdb time.

If you survey people who are running PostgreSQL on cloud hardware, be it Amazon's EC2 or similar options from other vendors, you will find a high percentage of them would pay quite a bit of performance to make their storage more reliable. To pick one common measurement of popularity, a Google search on "ebs corruption" returns 17 million hits. To quote one of those, Baron Schwartz of Percona talking about MySQL on EC2:

> BTW, I have seen data corruption on EBS volumes. It's not clear whether
> it was InnoDB's fault (extremely unlikely IMO), the operating system's
> fault, EBS's fault, or something else.

http://www.mysqlperformanceblog.com/2011/08/04/mysql-performance-on-ec2ebs-versus-rds/

*That* uncertainty is where a lot of the demand for this feature is coming from. People deploy into the cloud, their data gets corrupted, and no one can tell them what/why/how it happened. And that means they don't even know what to change to make it better. The only people I see really doing something about this problem all seem years off, and I'm not sure they are going to help--especially since some of them are targeting enterprise storage rather than cloud-style installations.

> I assume a user would wait until they suspected corruption to turn it
> on, and because it is only initdb-enabled, they would have to
> dump/reload their cluster. The open question is whether this is a
> usable feature as written, or whether we should wait until 9.4.

The reliability issues of both physical and virtual hardware are so widely known that many people will deploy with this on as their default configuration. If you don't trust your existing data, you can't retroactively check it. A checksum of an already corrupt block is useless. Accordingly, there is no use case for converting an installation with real or even suspected problems to a checksummed one.
If you wait until you suspect corruption to care about checksums, it's really too late. There is only one available next step: you must do a dump to figure out what's readable. That is the spot all of the incoming data recovery customers we see at 2ndQuadrant are already in when we're called. The cluster is suspect, sometimes they can get data out of it with a dump, and if we hack up their install we can usually recover a bit more than they could.

After the data from a partially corrupted database is dumped, someone who has just been through that pain might decide they should turn checksums on when they restore the dump. With checksums on, they can assess future damage easily at the block level when it happens, and possibly repair it without doing a full dump/reload. What's implemented in the feature we're talking about has a good enough UI to handle this entire cycle that I see damaged installations go through.

In fact, this feature is going to need pg_upgrade changes to detect from pg_controldata that the old/new clusters have the same checksum setting. I think that's done already, but it's certainly something to test out too.

Good questions, Bruce; I don't think the reasons behind this feature's demand have been highlighted very well before. I try not to spook the world by talking regularly about how many corrupt PostgreSQL databases I've seen, but they do happen. Most of my regular ranting on crappy SSDs that lie about writes comes from a TB-scale PostgreSQL install that got corrupted due to the write-cache flaws of the early Intel SSDs--twice. They would have happily lost even the worst-case 20% of regular performance to avoid going down for two days each time they saw corruption, where we had to dump/reload to get them going again. If the install had checksums, I could have figured out which blocks were damaged and manually fixed them. Without checksums, there's no way to even tell for sure what is broken.
--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.4] row level security
On 9/1/13 9:38 AM, Heikki Linnakangas wrote:
> To phrase it differently: We already have RLS. It's shipped as an
> extension called Veil. Now please explain what's wrong with that
> statement, if anything.

Veil was last updated to work against 9.1, so the first thing is that it's two versions behind current. The main improvement of a few now-core features, compared to their external/extension predecessors, is that they went through a real review process. I suspect a lot of the criticisms being lobbed against the core RLS feature would also hit Veil if it were evaluated to the same standard.

Regardless, I'm seeing a few review themes pop up from this thread:

- Comparison against the Veil feature set.
- Competitive review against industry expectations, AKA checkbox compliance.
- Confirm the feature set is useful to government security clearance applications and multi-tenant applications. There's also a secured web application use case that's popped up a few times too; KaiGai has used secured Apache installs as an example.
- Summary of known covert channels, with documentation coverage.
- Assess the odds of this implementation's future issues turning into security bugs. My personal hotspot here is that I'd like minimal code exposure for people who don't use this feature at all. Are there parts here that should be compile-time enabled?

Of course those are all on top of the usual code quality review. Did I miss any big themes on that list?
Re: [HACKERS] [v9.4] row level security
On 9/1/13 5:54 PM, Greg Stark wrote:
> So I think up above Tom is wrong about why it's ok that INSERT leaks
> keys when it reports a unique key violation. The reason why it's ok
> that there's a covert channel there is that the DBA can avoid that
> covert channel by being careful when creating unique constraints. He or
> she should be aware that creating a unique constraint implicitly
> provides a kind of limited access to data to users who have INSERT
> privilege even if they lack the real SELECT privilege.

And if someone can INSERT values that they can't actually see once they're committed, that's a similarly bad case we should describe. People shouldn't be dumping their trash in their neighbor's yard.

I think eventually this needs to be wrestled to the ground in a robust way. I want to see if all unique violations might be changed to give less information in this sort of RLS context. One rough early idea is to create a new error condition that means "you hit something protected by RLS", but doesn't leak any more information than that. Just a generic "security restriction" error that comes out of fishing for keys, inserting outside your area, etc. I want to think through some use cases and review the code to see whether that concept helps or not.
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:
> I think that it is almost same as small dirty_background_ratio or
> dirty_background_bytes.

The main difference here is that all writes pushed out this way will be to a single 1GB relation chunk. The odds are better that multiple writes will combine, and that the I/O will involve a lower than average amount of random seeking. Whereas shrinking the size of the write cache always results in more random seeking.

> The essential improvement is not dirty page size in fsync() but
> scheduling of fsync phase. I can't understand why postgres does not
> consider scheduling of fsync phase.

Because it cannot get the sort of latency improvements I think people want. I proved that to myself during the last 9.2 CF, when I submitted several fsync scheduling changes. By the time you get to the fsync phase, on a system that's always writing heavily, there is way too much backlog to possibly cope with by then. There just isn't enough time left before the checkpoint should end to write everything out. You have to force writes to actual disk to start happening earlier to keep a predictable schedule.

Basically, the longer you go without issuing an fsync, the more uncertainty there is around how long it might take to fire. My proposal lets someone keep all I/O from ever reaching the point where the uncertainty is that high.

In the simplest case to explain, imagine that a checkpoint includes a 1GB relation segment that is completely dirty in shared_buffers. When a checkpoint hits this, it will have 1GB of I/O to push out. If you have waited this long to fsync the segment, the problem is now too big to fix by checkpoint time. Even if the 1GB of writes are themselves nicely ordered and grouped on disk, the concurrent background activity is going to chop the combination up into more random I/O than the ideal. Regular consumer disks have a worst-case random I/O throughput of less than 2MB/s.
My observed progress rates for such systems show you're lucky to get 10MB/s of writes out of them. So how long will the dirty 1GB in the segment take to write? 1GB @ 10MB/s = 102.4 *seconds*. And that's exactly what I saw whenever I tried to play with checkpoint sync scheduling. No matter what you do there, periodically you'll hit a segment that has over a minute of dirty data accumulated, and 60-second latency pauses result. By the time you've reached the checkpoint, you're dead when you call fsync on that relation. You *must* hit that segment with fsync more often than once per checkpoint to achieve reasonable latency.

With this linear slider idea, I might instead tune such that no segment will ever get more than 256MB of writes before hitting an fsync. I can't guarantee that will work usefully, but the shape of the idea seems to match the problem.

> Taken together my checkpoint proposal method,
> * write phase
>   - Almost same, but considering fsync phase schedule.
>   - Considering case of background-write in OS, sort buffer before
>     starting checkpoint write.

This cannot work, for the reasons I've outlined here. I guarantee you I will easily find a test workload where it performs worse than what's happening right now. If you want to play with this to learn more about the trade-offs involved, that's fine, but expect me to vote against accepting any change of this form. I would prefer you not submit them, because it will waste a large amount of reviewer time to reach that conclusion yet again. And I'm not going to be that reviewer.

> * fsync phase
>   - Considering checkpoint schedule and write-phase schedule
>   - Executing separated sync_file_range() and sleep, in final fsync().

If you can figure out how to use sync_file_range() to fine-tune how much fsync is happening at any time, that would be useful on all the platforms that support it.
I haven't tried it, just because that looked to me like a large job of refactoring the entire fsync absorb mechanism, and I've never had enough funding to take it on. That approach has a lot of good properties, if it could be made to work without a lot of code changes.
Re: [HACKERS] Backup throttling
On 8/27/13 7:58 AM, Robert Haas wrote:
> We have a *general* need to be able to throttle server-side resource
> utilization, particularly I/O. This is a problem not only for
> pg_basebackup, but for COPY, CLUSTER, VACUUM, and even things like
> UPDATE. Of all of those, the only one for which we currently have any
> kind of a solution is VACUUM.

I didn't mention it specifically, but I always presumed that the "Cost limited statements" RFC proposal I floated at http://www.postgresql.org/message-id/519ea5ff.5040...@2ndquadrant.com (and am still working on) would handle the base backup case too; pg_basebackup is just another client. Some sort of purpose-made solution for pg_basebackup alone may be useful, but I'd be shocked if the general mechanism I described there wasn't good enough to handle many of the backup limiting cases too.

Also, once that and the block write counters I also sent an RFC out for are in place, I have a plan for adding a server-wide throttling mechanism. I want to extract a cluster-wide read and write rate and put a cluster-wide limit on the whole thing. It seemed too hard to jump into without these other two pieces of plumbing in place first.
Re: [HACKERS] Does larger i/o size make sense?
On 8/27/13 3:54 PM, Josh Berkus wrote:
> I believe that Greenplum currently uses 128K. There's a definite
> benefit for the DW use-case.

Since Linux read-ahead can easily give big gains on fast storage, I normally set that to at least 4096 sectors = 2048KB. That's a lot bigger than even this, and definitely necessary for reaching maximum storage speed. I don't think the block size change alone will necessarily duplicate the gains on seq scans that Greenplum gets, though. They've done a lot more performance optimization on that part of the read path than just the larger block size.

As far as quantifying whether this is worth chasing, the most useful thing to do here is find some fast storage and profile the code with different block sizes at a large read-ahead. I wouldn't spend a minute trying to come up with a more complicated management scheme until the potential gain is measured.
Re: [HACKERS] [v9.4] row level security
On 7/20/13 10:08 AM, Kohei KaiGai wrote:
> Hmm. I didn't have this idea. It seems to me fair enough and kills the
> necessity to enhance RangeTblEntry and getrelid() indeed. I try to fix
> up this implementation according to your suggestion.

How is that going? I'm going to do a serious review of this myself over the next few weeks. I have a good chunk of time set aside for it as part of a larger project. I'm hoping to get more people here involved in that effort too, starting in the November CF if that works out.

I've been trying to catch up with your larger plan for this feature for 9.4. You made this comment earlier:

> Also, I'd like to have discussion for this feature in earlier half of
> v9.4 to keep time for the remaining features, such as check on
> writer-side, integration with selinux, and so on

Is any of that code around yet? I see that you have split your submissions so that a smaller program can be reviewed today. I'd like to start taking a look at the next step too, though. For the project I'm starting to work on here, getting the integration with labeling done is a very important thing to target for 9.4. It would be nice to see how that fits together today, even if the code for it isn't being reviewed heavily yet. I don't quite understand yet what's missing on the writer side. If you could help explain what's missing there, I would like to read about that.
Re: [HACKERS] Unsafe GUCs and ALTER SYSTEM WAS: Re: ALTER SYSTEM SET
On 8/5/13 2:36 PM, Josh Berkus wrote:
> Most of our users not on Heroku are running with superuser as the app
> user now. Like, 95% of them based on my personal experience (because
> our object permissions management sucks).

My percentage wouldn't be nearly that high. 95% of database installs done by developers? That I could believe. But setups done by ex-{Oracle,DB2,Sybase...} DBAs avoid superuser rights for the application, and that's around 1/3 of the systems I see. In general I agree that there certainly are a lot of superuser-only installs out there; I just don't think it's quite as bad as you're painting it.

The important thing to realize here is that ALTER SYSTEM SET is a convenience function. It's useful to the extent that it makes people feel more comfortable with changing things in the database. We will never stop people from shooting themselves in the foot. And if the barriers added for that purpose are too high, like disabling changes to shared_buffers altogether, all you'll do is make this so restricted that it doesn't satisfy anyone.

The original spec I outlined for how to implement this feature allowed disabling the whole thing, by just commenting out the includedir directive from postgresql.conf. Allowing people to disable use of this feature in a managed configuration environment is important. Beyond that, I don't expect that we'll ever make this foolproof.

After arguing out this topic in person with Stephen last night, I've lumped the ideas here into three major approaches that could be followed:

1) Don't allow changing unsafe GUCs.

2) Provide a way for the server to start with a bad setting, and force the administrator to fix the problem they introduced. Some sort of maintenance mode that only allows superuser connections would force someone to clean up this class of problem at the next restart, while still giving them enough access to run a new ALTER SYSTEM SET command.

3) Require extra syntax to modify an unsafe GUC.
As far as fixing bad settings goes, there's already code in initdb to detect how high shared_buffers and max_connections can go. Any bogus shared_buffers/max_connections value above the kernel limits could be worked around with that approach.

Here's how I would try to communicate that a change is unsafe, only allowing it after proving the user is paying some attention to the danger:

# ALTER SYSTEM SET shared_buffers = '8GB';
NOTICE: Changing shared_buffers only takes effect after a server restart.
ERROR: Changing shared_buffers incorrectly can prevent the server from
starting normally. Use the FORCE option to modify the value anyway.

# ALTER SYSTEM SET shared_buffers = '8GB' FORCE;
NOTICE: Changing shared_buffers only takes effect after a server restart.
ALTER SYSTEM

Will bad examples pop up on the Internet that just use FORCE all the time? Sure they will, and people will cut and paste them without paying attention. I don't see why that possibility has to block this feature from being adopted, though. That line of thinking leads toward removing trust authentication, because that's similarly abused with cut-and-paste tutorials.
Re: [HACKERS] Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: Proposal for Allow postgresql.conf values to be changed via SQL [review])
On 8/2/13 4:48 PM, Stephen Frost wrote:
> * Josh Berkus (j...@agliodbs.com) wrote:
> > I really think this is the wrong approach. If we start removing
> > unsafe parameters from ALTER SYSTEM SET, we basically hobble the
> > feature to the point of uselessness. Out of the 15 or so parameters
> > 80% of our users touch, half of them are on your unsafe list.
> Reflecting on this a bit more, I'm curious what your list-of-15 is.

You can get a top-15-ish list from https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

The parameters I see changed regularly that are on your unsafe list are: listen_addresses, port, shared_buffers, log_directory, log_filename. The obvious thing missing from your unsafe list that is also changed regularly is max_connections. I count 6 parameters that are both unsafe and changed regularly.
Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])
On 7/29/13 3:46 PM, Josh Berkus wrote:
> Based on the ongoing discussion of this patch, I have moved it to
> 9.4CF2 (9-2013). Mind you, it would be good to commit some version of
> it before September.

Quite a bit of the patch adds some refactored GUC parameter validation code that seems necessary for this feature, and that's not where the controversial parts are either. That's the part I thought should be looked at for commit now. I'd like to get the change size Amit is carrying around (and other people are reviewing) reduced here.
Re: [HACKERS] Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: Proposal for Allow postgresql.conf values to be changed via SQL [review])
On 8/1/13 10:47 AM, David Johnston wrote:
> Minor request: could someone enlighten me as to why making the
> directory location a compile-time option is undesirable.

The ongoing argument here is whether to allow moving the directory at all, or to make it fixed to $PGDATA the way recovery.conf is. If you accept that it should float, then it actually needs to be a start-time option. Software like Debian moves around the postgresql.conf with something like this:

pg_ctl start -o "-c config_file=/etc/postgresql/9.3/main/postgresql.conf" ...

The way this argument has been going the last few days, I'm starting to think it's worth breaking this style of config directory setup out into its own feature now. Whether ALTER SYSTEM SET takes advantage of the config directory seems a still-raging question. I've been coupling the two together because I think the design of ALTER SYSTEM SET should consider a config directory based approach. But from the perspective of what can get committed first, the config directory really should go first.
Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])
On 7/28/13 10:49 PM, Tom Lane wrote:
> Josh Berkus <j...@agliodbs.com> writes:
> > > I think if we can design conf.d separately for config files of
> > > management tools, then it is better to have postgresql.auto.conf to
> > > be in $PGDATA rather than in $PGDATA/conf.d
> > One of the biggest current complaints about recovery.conf from
> > Debian/Ubuntu users is the fact that it lives in $PGDATA. Won't we
> > just get the same thing here?
> I don't think that's the same case, but ... why exactly don't they like
> recovery.conf, and can you show that the location of the file has
> anything to do with the underlying complaint?

Here is a quick survey of config files:

RHEL6: Cluster configuration file at /etc/sysconfig/pgsql, writeable by root. postgresql.conf is in $HOME/8.4/data

Debian Squeeze: All configuration files are in /etc/postgresql/8.4/main and are owned by the postgres user.

That last one shows why there's a complaint, and why it's 75% due to location rather than the split role. Debian users expect all of the config files to be relocated to /etc/postgresql or /etc/postgresql-common. Those files are also writeable by the postgres user. recovery.conf doesn't fit that model, and people don't like that. The admin who has to manage servers with puppet would be much happier if recovery.conf was in the /etc/postgresql tree they are already managing.

I get this complaint every single time I introduce an experienced Debian admin to PostgreSQL replication setup. They aren't sure whether it's the packagers or PostgreSQL that is to blame for recovery.conf being in $PGDATA instead of /etc, but it's obvious to them that someone has screwed up.

Given that Debian is the major example of a relocated postgresql.conf in the field, and that work has been done such that a) all config files are in /etc and b) they are all writeable by the postgres user, I don't understand why people are suggesting the auto.conf file or conf.d directory will go anywhere else on that platform anymore.
I think the only claims suggesting otherwise are thinking of an older, now unsupported version of the Debian packaging. Debian packaging absolutely will want to relocate any new files added here into its /etc directory tree as well, where they will be writeable. They will not go into the $PGDATA directory.
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/25/13 6:02 PM, didier wrote:
> It was surely already discussed, but why isn't postgresql writing its
> cache sequentially to a temporary file?

If you do that, reads of the data will have to traverse that temporary file to assemble their data. You'll make every later reader pay the random I/O penalty that's being avoided right now. Checkpoints are already postponing these random writes as long as possible. You have to take care of them eventually, though.
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/26/13 5:59 AM, Hannu Krosing wrote:
> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
> random fs pages on one large disk page and having an extra index layer
> for resolving random-to-sequential ordering.

If your solution to avoiding random writes now is to do sequential ones into a buffer, you'll pay for it with more expensive random reads later. In the SSD write buffer case, that works only because those random reads are very cheap. Do the same thing on a regular drive, and you'll be paying a painful penalty on *every* read in return for saving work *once* when you write. That only makes sense when your workload is near write-only.

It's possible to improve on this situation by having some sort of background process that goes back and cleans up the random data, converting it back into sequentially ordered writes again. SSD controllers also have firmware that does this sort of work, and Postgres might do it as part of vacuum cleanup. But note that such work faces exactly the same problems as writing the data out in the first place.
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/26/13 9:14 AM, didier wrote:
> During recovery you have to load the log in cache first before applying WAL.

Checkpoints exist to bound recovery time after a crash. That is their only purpose. What you're suggesting moves a lot of work into the recovery path, which will increase how long that path takes to process. More work at recovery time means someone who uses the default of checkpoint_timeout='5 minutes', expecting that crash recovery won't take very long, will discover it takes longer now. They'll be forced to shrink the value to get the same recovery time as they do currently. You might need to make checkpoint_timeout 3 minutes instead, if crash recovery now has all this extra work to deal with. And when the time between checkpoints drops, it reduces the fundamental efficiency of checkpoint processing. You will end up writing out more data in the end.

The interval between checkpoints and recovery time are all related. If you let any one side of the current requirements slip, it makes the rest easier to deal with. Those are all trade-offs though, not improvements. And this particular one is already an option. If you want less checkpoint I/O per capita and don't care about recovery time, you don't need a code change to get it. Just make checkpoint_timeout huge. A lot of checkpoint I/O issues go away if you only do one checkpoint per hour, because instead of random writes you're getting sequential ones to the WAL. But when you crash, expect to be down for a significant chunk of an hour as you go back to sort out all of the work postponed before.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/26/13 8:32 AM, Tom Lane wrote:
> What I'd point out is that that is exactly what WAL does for us, ie
> convert a bunch of random writes into sequential writes.

But sooner or later you have to put the data where it belongs. Hannu was observing that SSDs often don't do that at all. They can maintain logical-to-physical translation tables that decode where each block was written to forever. When read seeks are really inexpensive, the only pressure to reorder blocks is wear leveling.

That doesn't really help with regular drives though, where the low seek time assumption doesn't play out so well. The whole idea of writing things sequentially and then sorting them out later was all the rage in 2001 for ext3 on Linux, as part of the data=journal mount option. You can go back and see that people were confused but excited about the performance at http://www.ibm.com/developerworks/linux/library/l-fs8/index.html Spoiler: if you use a workload that has checkpoint issues, it doesn't help PostgreSQL latency. Just like using a large write cache, you gain some burst performance, but eventually you pay for it with extra latency somewhere.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
On 7/23/13 2:16 AM, Satoshi Nagayasu wrote:
> I've been working on new pgstattuple function to allow block sampling [1]
> in order to reduce block reads while scanning a table. A PoC patch is
> attached.

Take a look at all of the messages linked in https://commitfest.postgresql.org/action/patch_view?id=778 Jaime and I tried to do what you're working on then, including a random block sampling mechanism modeled on the stats_target mechanism. We didn't do that as part of pgstattuple though, which was a mistake.

Noah created some test cases as part of his thorough review that were not computing the correct results. Getting the results correct for all of the various types of PostgreSQL tables and indexes ended up being much harder than the sampling part. See http://www.postgresql.org/message-id/20120222052747.ge8...@tornado.leadboat.com in particular for that.

> This new function, pgstattuple2(), samples only 3,000 blocks (which
> accounts for about 24MB) from the table randomly, and estimates several
> parameters of the entire table.

There should be an input parameter to the function for how much sampling to do, and if it's possible to make its scale look like the ANALYZE one, that's helpful too. I have a project for this summer that includes reviving this topic and making sure it works on some real-world systems. If you want to work on this too, I can easily combine that project into what you're doing.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
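The sample-and-scale idea behind this can be sketched outside of PostgreSQL. This is a toy stand-in, not pgstattuple2's code; the table size, the dead-tuple counter, and all names here are hypothetical:

```c
/*
 * Illustrative sketch of block sampling: estimate a table-wide count from
 * a random subset of blocks, then scale the per-block average up to the
 * whole table.  All names and numbers are hypothetical, not pgstattuple2's.
 */
#include <assert.h>
#include <stdlib.h>

#define TABLE_BLOCKS  100000   /* pretend table size in blocks */
#define SAMPLE_BLOCKS 3000     /* 3,000 blocks * 8KB = ~24MB of reads */

/* stand-in for reading block blkno and counting its dead tuples */
static int
dead_tuples_in_block(int blkno)
{
    return (blkno % 5 == 0) ? 3 : 1;   /* true table-wide total: 140,000 */
}

/* sample SAMPLE_BLOCKS random blocks, scale the average to the table */
static double
estimate_dead_tuples(unsigned seed)
{
    long    total = 0;
    int     i;

    srand(seed);
    for (i = 0; i < SAMPLE_BLOCKS; i++)
        total += dead_tuples_in_block(rand() % TABLE_BLOCKS);
    return (double) total / SAMPLE_BLOCKS * TABLE_BLOCKS;
}
```

The estimate lands close to the true value while reading only 3% of the blocks, which is the whole appeal--and also why a sample-size input parameter matters, since the error shrinks as the sample grows.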
Re: [HACKERS] [9.4 CF 1] And then there were 5
On 7/23/13 9:06 AM, Robert Haas wrote:
> But just BTW, you kind of messed up that CommitFest entry. The link you
> added as a new version of the patch was actually to a different patch
> from the same flock.

Fixed now, since I find myself looking back through history on submissions often enough to care about it being right. My eyes were starting to glaze over by the end of staring at this flock.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/23/13 10:56 AM, Robert Haas wrote:
> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote:
>> We know that a 1GB relation segment can take a really long time to write
>> out. That could include up to 128 changed 8K pages, and we allow all of
>> them to get dirty before any are forced to disk with fsync.
>
> By my count, it can include up to 131,072 changed 8K pages.

Even better! I can pinpoint exactly what time last night I got tired enough to start making trivial mistakes. Everywhere I said 128 it's actually 131,072, which just changes the range of the GUC I proposed.

Getting the number right really highlights just how bad the current situation is. Would you expect the database to dump up to 128K writes into a file and then have low latency when it's flushed to disk with fsync? Of course not. But that's the job the checkpointer process is trying to do right now. And it's doing it blind--it has no idea how many dirty pages might have accumulated before it started.

I'm not exactly sure how best to use the information collected. fsync every N writes is one approach. Another is to use accumulated writes to predict how long fsync on that relation should take. Whenever I tried to spread fsync calls out before, the scale of the piled-up writes from backends was the input I really wanted available. The segment write count gives an alternate way to sort the blocks too; you might start with the most heavily hit ones.

In all these cases, the fundamental I keep coming back to is wanting to cue off past write statistics. If you want to predict relative I/O delay times with any hope of accuracy, you have to start the checkpoint knowing something about the backend and background writer activity since the last one.
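To make the corrected arithmetic concrete, here is a minimal sketch of the page count and of the "fsync every N writes" idea mentioned above. The structure and names are hypothetical illustrations, not the checkpointer's actual code:

```c
/*
 * Sketch of segment write counting.  The names and the per-segment state
 * are hypothetical, not the proposed patch's implementation.
 */
#include <assert.h>

#define SEGMENT_BYTES (1024L * 1024L * 1024L)       /* 1GB relation segment */
#define BLCKSZ        8192L                         /* 8K pages */
#define PAGES_PER_SEGMENT (SEGMENT_BYTES / BLCKSZ)  /* 131,072, not 128 */

/* hypothetical per-segment state for the "fsync every N writes" approach */
typedef struct
{
    long writes_since_sync;  /* dirty writes absorbed since last fsync */
    long sync_count;         /* how many fsyncs this policy has issued */
} SegWriteState;

/* count one write to the segment; fsync once every sync_every writes */
static void
note_write(SegWriteState *st, long sync_every)
{
    if (++st->writes_since_sync >= sync_every)
    {
        /* an actual issue_fsync(segment) call would go here */
        st->sync_count++;
        st->writes_since_sync = 0;
    }
}
```

Writing out a fully dirty segment with sync_every=1024 turns one giant end-of-segment flush into 128 smaller ones spread across the write phase, at the price of more total fsync calls.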
-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] [v9.4] row level security
On 7/23/13 12:10 PM, Josh Berkus wrote:
> Apparently it's a little much for experienced reviewers to chew on, too,
> since I've been trying to get someone to review it since the beginning of
> the Commitfest.

It's more than the available experienced reviewers are willing to chew on fully as volunteers. The reward for spending review time is pretty low right now.

> While I understand the call for resources, this is a bit unfair to KaiGai,
> who has put in his time reviewing other people's patches.

If you read Dean Rasheed's comments, it's obvious he put a useful amount of work into his review suggestions. It is not the case here that KaiGai worked on a review and got nothing in return. Unfortunately that has happened to this particular patch before, but the community did a little better this time.

The goal of the CF is usually to reach one of these outcomes for every submission:

-The submission is ready for commit
-The submission needs improvement in X

Review here went far enough to identify an X to be improved. It was a big enough issue that a rewrite is needed, commit at this time isn't possible, and now KaiGai has something we hope is productive he can continue working on. That's all we can really promise here--that if we say something isn't ready for commit yet, there's some feedback as to why.

I would have preferred to see multiple X issues identified here, given that we know there has to be more than just the one in a submission of this size. The rough fairness promises of the CommitFest seem satisfied to me though.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] [v9.4] row level security
On 7/23/13 2:30 PM, Josh Berkus wrote:
> You know as well as me that, as consultants, we can get clients to pay for
> 10% extra time for review in the course of developing a feature

Before this number gets quoted anywhere, I think it's on the low side. I've spent a good bit of time measuring how long it takes to do a fair offsetting review--one where you put in as much time as your own submission takes to review--and I keep getting numbers more in the 20 to 25% range. The work involved to do a high quality review takes a while.

I happen to think the review structure is one of the most important components of PostgreSQL release quality. It used to be a single-level review with just the committers; now it's a two-level structure. The reason the Postgres code is so good isn't that the submitted development is inherently any better than other projects'. There's plenty of bogus material submitted here. It's the high standards for review and commit that are the key filter. The importance of the process to the result isn't weighed as heavily as I think it should be.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Attached is an update that I think sorts out all of the documentation concerns. I broke this section into paragraphs now that it's getting so long too. The only code change is that this now labels the controversial lag here average rate limit schedule lag. That way if someone wants to introduce other measures of rate limit lag, like a more transaction oriented one, you might call that average rate limit transaction lag and tell the two apart. The rewritten documentation here tries to communicate that there is a schedule that acts like it was pre-computed at the start of each client too. It's not ever adjusted based on what individual transactions do. I also noted the way this can cause schedule lag for some time after a slow transaction finishes, since that's the main issue observed so far. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c new file mode 100644 index 2ad8f0b..4e6b608 *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** int unlogged_tables = 0; *** 137,142 --- 137,148 doublesample_rate = 0.0; /* + * When threads are throttled to a given rate limit, this is the target delay + * to reach that rate in usec. 0 is the default and means no throttling. 
+ */ + int64 throttle_delay = 0; + + /* * tablespace selection */ char *tablespace = NULL; *** typedef struct *** 202,212 --- 208,220 int listen; /* 0 indicates that an async query has been * sent */ int sleeping; /* 1 indicates that the client is napping */ + boolthrottling; /* whether nap is for throttling */ int64 until; /* napping until (usec) */ Variable *variables; /* array of variable definitions */ int nvariables; instr_time txn_begin; /* used for measuring transaction latencies */ instr_time stmt_begin; /* used for measuring statement latencies */ + boolis_throttled; /* whether transaction throttling is done */ int use_file; /* index in sql_files for this client */ boolprepared[MAX_FILES]; } CState; *** typedef struct *** 224,229 --- 232,240 instr_time *exec_elapsed; /* time spent executing cmds (per Command) */ int*exec_count; /* number of cmd executions (per Command) */ unsigned short random_state[3]; /* separate randomness for each thread */ + int64 throttle_trigger; /* previous/next throttling (us) */ + int64 throttle_lag; /* total transaction lag behind throttling */ + int64 throttle_lag_max; /* max transaction lag */ } TState; #define INVALID_THREAD((pthread_t) 0) *** typedef struct *** 232,237 --- 243,250 { instr_time conn_time; int xacts; + int64 throttle_lag; + int64 throttle_lag_max; } TResult; /* *** usage(void) *** 356,361 --- 369,375 -N, --skip-some-updates skip updates of pgbench_tellers and pgbench_branches\n -P, --progress=NUM show thread progress report every NUM seconds\n -r, --report-latencies report average latency per command\n +-R, --rate SPEC target rate in transactions per second\n -s, --scale=NUM report this scale factor in output\n -S, --select-onlyperform SELECT-only transactions\n -t, --transactions number of transactions each client runs *** doCustom(TState *thread, CState *st, ins *** 898,914 { PGresult *res; Command **commands; top: commands = sql_files[st-use_file]; if (st-sleeping) { /* are we sleeping? 
*/ instr_time now; INSTR_TIME_SET_CURRENT(now); ! if (st-until = INSTR_TIME_GET_MICROSEC(now)) st-sleeping = 0; /* Done sleeping, go ahead with next command */ else return true;/* Still sleeping, nothing to do here */ } --- 912,973 { PGresult *res; Command **commands; + booltrans_needs_throttle = false; top: commands = sql_files[st-use_file
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Very minor update with V19 here, to reflect Alvaro's comments. The tricky part now reads like this: High rate limit schedule lag values, that is lag values that are large compared to the actual transaction latency, indicate that something is amiss in the throttling process. High schedule lag can highlight a subtle problem there even if the target rate limit is met in the end. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c new file mode 100644 index 2ad8f0b..4e6b608 *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** int unlogged_tables = 0; *** 137,142 --- 137,148 doublesample_rate = 0.0; /* + * When threads are throttled to a given rate limit, this is the target delay + * to reach that rate in usec. 0 is the default and means no throttling. + */ + int64 throttle_delay = 0; + + /* * tablespace selection */ char *tablespace = NULL; *** typedef struct *** 202,212 --- 208,220 int listen; /* 0 indicates that an async query has been * sent */ int sleeping; /* 1 indicates that the client is napping */ + boolthrottling; /* whether nap is for throttling */ int64 until; /* napping until (usec) */ Variable *variables; /* array of variable definitions */ int nvariables; instr_time txn_begin; /* used for measuring transaction latencies */ instr_time stmt_begin; /* used for measuring statement latencies */ + boolis_throttled; /* whether transaction throttling is done */ int use_file; /* index in sql_files for this client */ boolprepared[MAX_FILES]; } CState; *** typedef struct *** 224,229 --- 232,240 instr_time *exec_elapsed; /* time spent executing cmds (per Command) */ int*exec_count; /* number of cmd executions (per Command) */ unsigned short random_state[3]; /* separate randomness for each thread */ + int64 throttle_trigger; /* previous/next throttling (us) */ + int64 throttle_lag; /* total transaction lag 
behind throttling */ + int64 throttle_lag_max; /* max transaction lag */ } TState; #define INVALID_THREAD((pthread_t) 0) *** typedef struct *** 232,237 --- 243,250 { instr_time conn_time; int xacts; + int64 throttle_lag; + int64 throttle_lag_max; } TResult; /* *** usage(void) *** 356,361 --- 369,375 -N, --skip-some-updates skip updates of pgbench_tellers and pgbench_branches\n -P, --progress=NUM show thread progress report every NUM seconds\n -r, --report-latencies report average latency per command\n +-R, --rate SPEC target rate in transactions per second\n -s, --scale=NUM report this scale factor in output\n -S, --select-onlyperform SELECT-only transactions\n -t, --transactions number of transactions each client runs *** doCustom(TState *thread, CState *st, ins *** 898,914 { PGresult *res; Command **commands; top: commands = sql_files[st-use_file]; if (st-sleeping) { /* are we sleeping? */ instr_time now; INSTR_TIME_SET_CURRENT(now); ! if (st-until = INSTR_TIME_GET_MICROSEC(now)) st-sleeping = 0; /* Done sleeping, go ahead with next command */ else return true;/* Still sleeping, nothing to do here */ } --- 912,973 { PGresult *res; Command **commands; + booltrans_needs_throttle = false; top: commands = sql_files[st-use_file]; + /* +* Handle throttling once per transaction by sleeping. It is simpler +* to do this here rather than at the end, because so much complicated +* logic happens below when statements finish. +*/ + if (throttle_delay ! st-is_throttled) + { + /* +* Use inverse transform sampling to randomly generate a delay
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
The v3 patch applies perfectly here now. Attached is a spreadsheet with test results from two platforms, a Mac laptop and a Linux server. I used systems with high disk speed because that seemed like a worst case for this improvement. The actual improvement from shrinking WAL should be even better on a system with slower disks.

There are enough problems with the "hundred tiny fields" results that I think this is not quite ready for commit yet. More comments on that below. This seems close though, close enough that I am planning to follow up to see if the slow disk results are better.

Reviewing the wal-update-testsuite.sh test program, I think the only case missing that would be useful to add is "ten tiny fields, one changed". I think that one is interesting to highlight because it's what benchmark programs like pgbench do very often.

The GUC added by the program looks like this:

postgres=# show wal_update_compression_ratio ;
 wal_update_compression_ratio
------------------------------
 25

Is it possible to add a setting here that disables the feature altogether? That always makes it easier to consider a commit, knowing people can roll the change back if it makes performance worse. It would make performance testing easier too. wal-update-testsuite.sh takes as long as 13 minutes; that's long enough that I'd like the easier-to-automate comparison that disabling via GUC adds. If that's not practical to do given the intrusiveness of the code, it's not really necessary. I haven't looked at the change enough to be sure how hard this is.

On the Mac, the only case that seems to have a slowdown now is "hundred tiny fields, half nulled". It would be nice to understand just what is going on with that one. I got some ugly results in "two short fields, no change" too, along with a couple of other weird results, but I think those were testing procedure issues that can be ignored. The pgbench throttle work I did recently highlights that I can't really make a Mac quiet/consistent for benchmarking very well.
Note that I ran all of the Mac tests with assertions on, to try and catch platform-specific bugs. The Linux ones used the default build parameters.

On Linux, "hundred tiny fields, half nulled" was also by far the worst performing case, with a 30% increase in duration despite the 14% drop in WAL. Exactly what's going on there really needs to be investigated before this seems safe to commit. All of the "hundred tiny fields" cases seem pretty bad on Linux, with the rest of them running about an 11% duration increase.

This doesn't seem ready to commit for this CF, but the number of problem cases is getting pretty small now. Now that I've gotten more familiar with the test programs and the feature, I can run more performance tests on this at any time really. If updates addressing the trouble cases are ready from Amit or Hari before the next CF, send them out and I can look at them without waiting until that one starts. This is a very promising looking performance feature.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com

WAL-lz-v3.xls
Description: MS-Excel spreadsheet
Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])
On 7/22/13 12:58 PM, Amit kapila wrote:
> As per discussion, updated patch contains following changes:
> 1. Regression tests for Alter System are removed
> 2. Parsed the auto file automatically after parsing postgresql.conf
> 3. Removed addition of include directive in postgresql.conf
> 4. Removed error handling for parsing errors

These changes have shrunk the diff down to 1411 lines of code. I'd like to identify which committer might take this on at this point. In a few respects this is Ready For Committer now, because the committer who takes this on is going to get a strong vote on how to resolve most of the remaining fundamental issues:

-Is this the point to finally commit to the config directory approach?

-If this does set the config directory usage precedent, is the name used for it appropriate? Whoever suggested the change from conf.d to config made an error IMHO. Every example I've found of other projects doing this style of config refactoring picked either conf.d or a unique, no-two-are-alike name. I'd rather not see Postgres add yet another unique one. (I think the 'postgresql' part of postgresql.auto.conf as the name of the file is redundant too--what else would be in the Postgres config directory but postgresql files?--but that's not a major issue)

This could definitely use a round of committer-level review of how the GUC handling is being done now too. That part of the code seems to have settled, and things like using the new validate_conf_option could be committed even with other parts still being discussed. Exactly how best to break this out into useful commits is another decision that really needs some input from the potential committer though.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 7/22/13 2:57 PM, Andres Freund wrote:
> * I'd be very surprised if this doesn't make WAL replay of update heavy
> workloads slower by at least factor of 2.

I was thinking about what a benchmark of WAL replay would look like last year. I don't think that data is captured very well yet, and it should be. My idea was to break the benchmark into two pieces. One would take a base backup, then run a series of tests and archive the resulting WAL. I doubt you can make a useful benchmark here without a usefully populated database; that's why the base backup step is needed. The first useful result then is to measure how long commit/archiving took and the WAL volume, which is what's done by the test harness for this program.

Then the resulting backup would be set up for replay. tarring up the backup and WAL archive could even give you a repeatable test set for runs where only the replay side changes. Then the main number that's useful, total replay time, would be measured.

The main thing I wanted this for wasn't code changes; it was to benchmark configuration changes. I'd like to be able to answer questions like "which I/O scheduler is best for a standby" in a way that has real test data behind it. The same approach should be useful for answering your concerns about the replay performance impact of this change too though.

> * It makes changeset extraction either more expensive or it would have to
> be disabled there.

That argues that, if this is committed at all, the ability to turn it off that I was asking about would be necessary. It sounds like this *could* work like how minimal WAL archiving levels allow optimizations that are disabled at higher ones--like the COPY into a truncated/new table cheat.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] [v9.4] row level security
On 7/20/13 10:08 AM, Kohei KaiGai wrote:
>> With that change to expand_targetlist(), the change to getrelid() may be
>> unnecessary, and then also the new rowsec_relid field in RangeTblEntry
>> may not be needed.
>
> Hmm. I didn't have this idea. It seems to me fair enough and kills
> necessity to enhance RangeTblEntry and getrelid() indeed. I try to fix up
> this implementation according to your suggestion.

Great, there's one useful bit of feedback for you then, and that seems to address Tom's getrelid concern too.

For the active CommitFest, I don't see any place we can go with this right now except for Returned with Feedback. We really need more reviewers willing to put a significant amount of time into going through this code. Anyone who would like to see RLS committed should consider how you can get resources to support a skilled PostgreSQL reviewer spending time on it. (This is a bit much to expect new reviewers to chew on usefully) I've been working on that here, but I don't have anything I can publicly commit to yet.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List
On 6/24/13 12:57 PM, Josh Berkus wrote:
> Maciej is correct that this policy also belongs on the how to submit a
> patch wiki page. I will remedy that.

I just reviewed and heavily updated the new section you added to https://wiki.postgresql.org/wiki/Submitting_a_Patch That included the idea that the reviewed patch should be similar in size/scope to the submitted one, as well as a slightly deeper discussion of how to fit this work into a feature review quota. I find myself needing to explain this whole subject to potential feature sponsors enough that I've tried a few ways of describing it. The closest analog I've found so far is the way carbon offset work is accounted for. I sometimes refer to the mutual review as an "offsetting review" in conversation, and I threw that term into the guidelines as well.

As far as motivating new reviewers goes, let's talk about positive feedback. Anything that complicates the release notes is a non-starter, because that resource is tightly controlled by a small number of people and it's trying to satisfy a lot of purposes. What I would like to see instead is an official but simple Review Leaderboard for each release. Each time someone writes a review and adds it to the CF app with a Review entry, the account that entered it gets a point. Sum the points at the end and there's your weighted list for T-shirt prizes. It should be possible to get that count with a trivial SELECT query out of the CF database, and then produce a simple HTML table at the end of each CF or release. Anything that takes more work than that, and anything that takes *any* time that must come from a committer, is too much accounting.

This idea would be a bit unfair to people who review big patches instead of little ones. But an approach that disproportionately rewards new contributors working on small things isn't that terrible. As long as the review tests whether the code compiles and passes the regression tests, that's good enough to deserve a point. I'd be happy if we suddenly had a problem where people were doing only that to try to game their leaderboard ranking.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List
On 7/3/13 7:25 PM, Bruce Momjian wrote:
> The extrapolation of Josh's approach is that committers have to do work
> that the community wants to maintain their commit rights, but their
> commit rights are helping the community, so why would people care if you
> take them away --- you only hurt the community further by doing so.

The main problem with having inactive committers (which isn't meant to include the important subject-matter committers, who I'll get to at the end here) is that they skew the public information about who commits in a counterproductive way. People visit https://wiki.postgresql.org/wiki/Committers , see that list of names, and then make conclusions based on its content. And some of those conclusions are wrong because the data is inconsistent. The standards listed are applied when new committers are promoted, but they are not applied symmetrically to remove ones who don't meet them anymore.

The #1 obstacle to my getting more time to work on core PostgreSQL is that companies presume regular submitters who are also non-committers don't do a very good job. If you are competent and have a consistent track record of contributions to an open source project, the project would make you a committer, right? Conversely, if you've been contributing for a while but aren't a committer, the most likely explanation is that your work quality is poor. That is a completely reasonable viewpoint based on how most open source projects work. The really terrible part is that it means the longer you've been submitting patches, the *less* competent you're assumed to be. When I tell people I've been submitting things since 2007 but am not a committer, the only logical explanation is that my submissions must suck very hard, right?

From that perspective, people who are listed as committers but don't actively do work for the project are causing me a serious problem. When someone who rarely commits can obviously qualify, that *proves* to casual observers that the bar for PostgreSQL committers is actually very low. That's the message the project is inadvertently sending by leaving committers on there if they stop working actively.

The main thing I'd like to see is having the committer list, and its associated guidelines, updated to reflect that there are subject-matter experts committing too. That would pull them out of any "what have you done for me lately?" computations, and possibly open up a way to get more of them usefully. Here are the first two obvious labels like that:

Michael Meskes (meskes) - embedded SQL
Teodor Sigaev (teodor) - inverted indexes

When even Josh Berkus doesn't know all of this information, it's clearly way too obscure to expect the rest of the world to figure it out. It also boggles my mind that there isn't already an entry like this on there too:

Thom Browne - documentation

Each time Thom passes through yet another correction patch that is committed with no change, I find it downright bizarre that a community with such limited committer resources wastes their time with that gatekeeping. The standards for nominating committers seem based on whether they can commit just about anything. I think it's more important to consider whether people are trusted to keep commits within their known area(s) of expertise.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] [PATCH] Revive line type
On 6/26/13 9:34 PM, Peter Eisentraut wrote: Still wondering whether to use an A,B,C-based output format per Tom's comment. Wouldn't it also be helpful to remove "The points used in the output are not necessarily the points used on input" by making that not true? There are three obvious ways you might output a line:

-Math class expectations of slope-intercept form: y = mx + b. Intercept forms don't handle both horizontal and vertical lines though, so that's easily out.

-Pair of points. A casual observer might get lucky and observe putting a pair of points in and getting the same pair back again, then incorrectly assume that's normally the case. Seems better to never make that possible in the infinite line case. I'm reminded of how row output usually arriving in order gives a bad impression about ordering guarantees in SQL.

-Ax+By+C=0

Outputting that third one, when it's also the internal form, handles any type of line; will show that any assumptions about individual points being preserved are wrong; and avoids rounding errors too. The only downside seems to be that bounded lines are easier to show with a pair of points. I think that suggests an alternate, secondary output format would be useful though, rather than that a different default one is a good idea. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
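To make the Ax+By+C=0 argument concrete, here is a minimal sketch of deriving that implicit form from two input points. This is illustrative Python, not the actual line type code, and the function name is made up:

```python
def line_from_points(p1, p2):
    """Build the implicit form A*x + B*y + C = 0 through two points.

    Unlike slope-intercept form, this handles vertical lines (no
    division, so no infinite slope problem).
    """
    (x1, y1), (x2, y2) = p1, p2
    # From (y2-y1)*(x-x1) - (x2-x1)*(y-y1) = 0, expanded:
    a = y2 - y1
    b = x1 - x2
    c = x2 * y1 - x1 * y2
    return (a, b, c)

# Vertical and horizontal lines both work:
print(line_from_points((2, 0), (2, 5)))   # (5, 0, -10), i.e. 5x - 10 = 0 -> x = 2
print(line_from_points((0, 3), (4, 3)))   # (0, -4, 12), i.e. -4y + 12 = 0 -> y = 3
```

Note how feeding two different point pairs from the same line produces proportional (A,B,C) tuples rather than echoing the input points back, which illustrates why "points in, same points out" can never be guaranteed.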
Re: [HACKERS] [9.4 CF 1] And then there were 5
After pushes from a few people, the remaining submissions are now waiting for commit. I updated each of those to have the latest info in the CF app, and tried to identify which committers have already looked at them.

Access to calls stack from GET DIAGNOSTICS statement: ?

Add more regression tests mostly for ALTER OPERATOR FAMILY .. ADD/DROP: ? Robert Haas gave a quick review last week that's generated an update since. But that looks like it was just a quick scan for issues.

Remove unused targets from plan: Alvaro? (He reviewed it already)

SQL Command to edit postgresql.conf (ALTER SYSTEM patch): Alvaro volunteered to look at this if no one else gets to it first.

UNNEST() (and other functions) WITH ORDINALITY: Greg Stark

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/22/13 4:52 AM, KONDO Mitsumasa wrote: The writeback source code I indicated is almost the same as the community kernel (2.6.32.61). I also read Linux kernel 3.9.7, and this part is almost the same there too. The main source code difference comes from going back to the RedHat 5 kernel, which means 2.6.18. For many of these versions, you are right that it is only the tuning parameters that were changed in newer versions. Optimizing performance for the old RHEL5 kernel isn't the most important thing, but it's helpful to know the things it does very badly. My fsync patch only sleeps after fsync returns success, and the maximum sleep time is set to 10 seconds. It does not make this problem worse. It's easy to have hundreds of relations that are getting fsync calls during a checkpoint. If you have 100 relations getting a 10 second sleep each, you could potentially delay checkpoints by nearly 17 minutes this way. I regularly see systems where shared_buffers=8GB and there are 200 to 400 relation segments that need a sync during a checkpoint. This is the biggest problem with your submission. Once you give up following the checkpoint schedule carefully, it is very easy to end up with large checkpoint deadline misses on production servers. If someone thinks they are doing a checkpoint every 5 minutes, but your patch makes them take 20 minutes instead, that is bad. They will not expect that a crash might have to replay that much activity before the server is useful again. You also don't seem to appreciate yet just how bad exceeding the checkpoint timeout is. I think it is important to ask why this problem happens. We should try to find which program has the bug or problem. The checkpointer process is the problem. There's no filesystem bug or complicated issue involved in many of the bad cases.
Here is a simple example that shows how the toughest problem cases happen:

-64GB of RAM
-10% dirty_background_ratio = 6GB of dirty writes = 6144MB
-2MB/s random I/O when concurrent reads are heavy
-3072 seconds to clear the cache = 51 minutes

That's how you get to an example like the one in my slides:

LOG: checkpoint complete: wrote 33282 buffers (3.2%); 0 transaction log file(s) added, 60 removed, 129 recycled; write=228.848 s, sync=4628.879 s, total=4858.859 s

It's very hard to do better on these, and I don't expect any change to help this a lot. But I don't want to see a change committed that makes this sort of checkpoint 17 minutes longer if there's 100 relations involved either. My patch not only improves throughput but also realizes stable response times during the fsync phase of the checkpoint. The main reason your patch improves latency and throughput is that it makes checkpoints farther apart. That's why I drew you a graph showing how the time between checkpoints lined up perfectly with TPS. If it was only a small problem it would be worth considering, but I think it's likely to end up with the 15 minute delays I've outlined here instead. And I surveyed the ext3 file system. I wouldn't worry too much about the problems ext3 has. Like the old RHEL5 kernel I was commenting about above, there are a lot of ext3 systems out there. But we can't do a lot about getting good performance from them. It's only important to test that you're not making them a lot worse with a change. My system block size is 4096, but 8192 or more seems better. It will decrease the number of inodes and give larger sequential disk fields. I normally increase read-ahead on Linux systems to get faster speed on sequential disk throughput. Changing the block size might work better in some cases, but not many people are willing to do that. Read-ahead is very easy to change at any time.
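The drain-time arithmetic in that example is easy to verify directly; this snippet just restates the figures given above:

```python
# Back-of-the-envelope check of the cache drain example above.
dirty_mb = 6144        # ~10% dirty_background_ratio on a 64GB box
write_mb_per_s = 2     # random I/O rate while concurrent reads are heavy

seconds = dirty_mb / write_mb_per_s
print(seconds)              # 3072.0 seconds
print(round(seconds / 60))  # ~51 minutes to clear the cache
```

That 51 minutes lines up with the sync=4628s log line: once the fsync phase has to wait behind that backlog, the checkpoint duration is dominated by the OS write cache, not by anything PostgreSQL scheduled.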
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Design proposal: fsync absorb linear slider
Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative. I have a simple one in mind that captures the biggest problem I see: that the number of backend and checkpoint writes to a file are not connected at all. We know that a 1GB relation segment can take a really long time to write out. That could include up to 128 changed 8K pages, and we allow all of them to get dirty before any are forced to disk with fsync. Rather than second-guess the I/O scheduling, I'd like to take this on directly by recognizing that the size of the problem is proportional to the number of writes to a segment. If you turned off fsync absorption altogether, you'd be at an extreme that allows only 1 write before fsync. That's low latency for each write, but terrible throughput. The maximum throughput case of 128 writes has the terrible latency we get reports about. But what if that trade-off was just a straight, linear slider going from 1 to 128? Just move it to the latency vs. throughput position you want, and see how that works out. The implementation I had in mind was this one:

-Add an absorption_count to the fsync queue.
-Add a new latency vs. throughput GUC I'll call max_segment_absorb. Its default value is -1 (or 0), which corresponds to ignoring this new behavior.
-Whenever the background writer absorbs an fsync call for a relation that's already in the queue, increment the absorption count.
-When max_segment_absorb > 0, have the background writer scan for relations where absorption_count > max_segment_absorb. When it finds one, call fsync on that segment.

Note that it's possible for this simple scheme to be fooled when writes are actually touching a small number of pages. A process that constantly overwrites the same page is the worst case here. Overwrite it 128 times, and this method would assume you've dirtied every page, while only 1 will actually go to disk when you call fsync. It's possible to track this better.
The count mechanism could be replaced with a bitmap of the 128 blocks, so that each absorb sets a bit instead of incrementing a count. My gut feel is that this is more complexity than is really necessary here. If in fact the fsync is cheaper than expected, paying for it too early isn't the worst problem to have here. I'd like to build this myself, but if someone else wants to take a shot at it I won't mind. Just be aware the review is the big part here. I should be honest about one thing: I have zero incentive to actually work on this. The moderate amount of sponsorship money I've raised for 9.4 so far isn't getting anywhere near this work. The checkpoint patch review I have been doing recently is coming out of my weekend volunteer time. And I can't get too excited about making this my volunteer effort when I consider what the resulting credit will look like. Coding is by far the smallest part of work like this, coming in just behind the design work itself. And both of those are way, way behind how long review benchmarking takes on something like this. The way credit is distributed for this sort of feature puts coding first, design not credited at all, and maybe you'll see some small review credit for benchmarks. That's completely backwards from the actual work ratio. If all I'm getting out of something is credit, I'd at least like it to be an appropriate amount of it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
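The four design steps above can be sketched in a few lines. This is simplified Python with a stand-in queue structure, not checkpointer code; the names absorption_count and max_segment_absorb come from the proposal, everything else is made up for illustration:

```python
# Sketch of the proposed fsync-absorb slider, assuming a simplified
# per-segment queue.  Not actual PostgreSQL backend code.

class FsyncQueue:
    def __init__(self, max_segment_absorb=-1):
        self.max_segment_absorb = max_segment_absorb  # <= 0 disables the behavior
        self.pending = {}  # segment name -> absorption_count

    def absorb(self, segment):
        """Background writer absorbs an fsync request for a segment."""
        if segment in self.pending:
            self.pending[segment] += 1  # repeat write to an already-queued segment
        else:
            self.pending[segment] = 0   # first request: nothing absorbed yet

    def scan(self, do_fsync):
        """Background writer pass: force out segments past the threshold."""
        if self.max_segment_absorb <= 0:
            return
        for segment, count in list(self.pending.items()):
            if count > self.max_segment_absorb:
                do_fsync(segment)
                del self.pending[segment]

synced = []
q = FsyncQueue(max_segment_absorb=16)
for _ in range(20):                  # 20 writes absorbed into one queue entry
    q.absorb("base/16384/16397.1")   # hypothetical relation segment name
q.absorb("base/16384/16400")         # a lightly written segment
q.scan(synced.append)
print(synced)                        # only the heavily written segment is forced out
```

Sliding max_segment_absorb toward 1 trades throughput for latency, exactly the linear slider described; the overwrite-the-same-page weakness is visible here too, since absorb() counts writes, not distinct pages.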
Re: [HACKERS] [9.4 CF 1] And then there were 5
On 7/22/13 10:48 PM, Tom Lane wrote: Greg Smith g...@2ndquadrant.com writes: Remove unused targets from plan: Alvaro? (He reviewed it already) Really I should do that one, but it seems like all my available cycles have been going into bug fixing lately :-( I just put you down as the committer on it, given Alvaro waved that way too. A further look at recent ALTER OPERATOR FAMILY changes shows that Robert has been working on those quite a bit. I'm putting him down as the committer on this last one. If the CF drags on a bit due to committer time being spent on bugs, given the date here I think that's OK. I would put next week as a loose target for finishing these off. I worry a lot less about committers keeping to their schedules than when there were still 50 patches from regular contributors moving around. That leaves only 1 patch where I have no idea who will commit it: access to calls stack from GET DIAGNOSTICS statement -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/20/13 4:48 AM, didier wrote: With your tests did you try to write the hot buffers first? i.e. buffers with a high refcount, either by sorting them on refcount or at least sweeping the buffer list in reverse? I never tried that version. After a few rounds of seeing that all changes I tried were just rearranging the good and bad cases, I got pretty bored with trying new changes in that same style. By writing the buffers least likely to be recycled to the OS first, it may have less work to do at fsync time; hopefully they have been written by the OS background task during the spread and are not re-dirtied by other backends. That is the theory. In practice write caches are so large now, there is almost no pressure forcing writes to happen until the fsync calls show up. It's easily possible to enter the checkpoint fsync phase only to discover there are 4GB of dirty writes ahead of you, ones that have nothing to do with the checkpoint's I/O. Backends are constantly pounding the write cache with new writes in situations with checkpoint spikes. The writes and fsync calls made by the checkpoint process are only a fraction of the real I/O going on. The volume of data being squeezed out by each fsync call is based on total writes to that relation since the checkpoint. That's connected to the writes to that relation happening during the checkpoint, but the checkpoint writes can easily be the minority there. It is not a coincidence that the next feature I'm working on attempts to quantify the total writes to each 1GB relation chunk. That's the most promising path forward on the checkpoint problem I've found. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
was measuring my theories as directly as possible. I would like to get an itemized list from you of what would prove my patch works, because one DBT-2 benchmark setting takes about 3-4 hours to test. That's great, but to add some perspective here I have spent over 1 year of my life running tests like this. The development cycle to do something useful in this area is normally measured in months of machine time running benchmarks, not hours or days. You're doing well so far, but you're just getting started. My itemized list is simple: throw out all results where the checkpoint end goes more than 5% beyond its targets. When that happens, no matter what you think is causing your gain, I will assume it's actually fewer total writes that are improving things. I'm willing to consider an optional, sloppy checkpoint approach that uses heavy load to adjust how often checkpoints happen. But if we're going to do that, it has to be extremely clear that the reason for the gain is the checkpoint spacing--and there is going to be a crash recovery time penalty paid for it. And this patch is not how I would do that. It's not really clear yet where the gains you're seeing are really coming from. If you re-ran all your tests with pg_stat_bgwriter before/after snapshots, logged every fsync call, and then built some tools to analyze the fsync call latency, then you'd have enough data to talk about this usefully. That's what I consider the bare minimum evidence to consider changing something here. I have all of those features in pgbench-tools with checkpoint logging turned way up, but they're not all in the dbt2 toolset yet as far as I know. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
Please stop all this discussion of patents in this area. Bringing up a US patent here makes US list members more likely to be treated as willful infringers of that patent if the PostgreSQL code duplicates that method one day: http://www.ipwatchdog.com/patent/advanced-patent/willful-patent-infringement/ The idea of surveying patents in some area so that their methods can be avoided in code you develop, that is a reasonable private stance to take. But don't do that on the lists. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/18/13 11:04 AM, Robert Haas wrote: On a system where fsync is sometimes very very slow, that might result in the checkpoint overrunning its time budget - but SO WHAT? Checkpoints provide a boundary on recovery time. That is their only purpose. You can always do better by postponing them, but you've now changed the agreement with the user about how long recovery might take. And if you don't respect the checkpoint boundary, what you can't do is then claim better execution performance than something that did. It's always possible to improve throughput by postponing I/O. SO WHAT? If that's OK, you don't need complicated logic to do that. Increase checkpoint_timeout. The system with checkpoint_timeout at 6 minutes will always outperform one where it's 5. You don't need to introduce a feedback loop--something that has significant schedule stability implications if it gets out of control--just to spread I/O out further. I'd like to wander down the road of load-sensitive feedback for database operations, especially for maintenance work. But if I build something that works mainly because it shifts the right edge of the I/O deadline forward, I am not fooled into thinking I did something awesome. That's cheating, getting better performance mainly by throwing out the implied contract with the user--the one over their expected recovery time after a crash. And I'm not excited about complicating the PostgreSQL code to add a new way to do that, not when checkpoint_timeout is already there with a direct, simple control on the exact same trade-off. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/18/13 12:00 PM, Alvaro Herrera wrote: I think the idea is to have a system in which most of the time the recovery time will be that for checkpoint_timeout=5, but in those (hopefully rare) cases where checkpoints take a bit longer, the recovery time will be that for checkpoint_timeout=6. I understand the implementation. My point is that if you do that, the fair comparison is to benchmark it against a current system where checkpoint_timeout=6 minutes. That is a) simpler, b) requires no code change, and c) makes the looser standards the server is now settling for transparent to the administrator. Also, my expectation is that it would perform better all of the time, not just during the periods this new behavior kicks in. Right now we have checkpoint_completion_target as a GUC for controlling what's called the spread of a checkpoint over time. That sometimes goes over, but that's happening against the best attempts of the server to do better. The first word that comes to mind for just disregarding the end time is that it's a sloppy checkpoint. There is all sorts of sloppy behavior you might do here, but I've worked under the assumption that ignoring the contract with the administrator was frowned on by this project. If people want this sort of behavior in the server, I'm satisfied my distaste for the idea and the reasoning behind it is clear now. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 7/9/13 12:09 AM, Amit Kapila wrote: I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person. The testcase used is posted at below link: http://www.postgresql.org/message-id/51366323.8070...@vmware.com That seems easy enough to do here, Heikki's test script is excellent. The latest patch Hari posted on July 2 has one hunk that doesn't apply anymore now. Inside src/backend/utils/adt/pg_lzcompress.c the patch tries to change this code:

-    if (hent)
+    if (hentno != INVALID_ENTRY)

But that line looks like this now:

    if (hent != INVALID_ENTRY_PTR)

Definitions of those:

    #define INVALID_ENTRY      0
    #define INVALID_ENTRY_PTR  (&hist_entries[INVALID_ENTRY])

I'm not sure if different error handling may be needed here now due to the commit that changed this, or if the patch wasn't referring to the right type of error originally. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hardware donation
On 7/10/13 12:53 PM, Benedikt Grundmann wrote: The server will probably be most interesting for the disks in it. That is where we spend the largest amount of time optimizing (for sequential scan speed in particular): 22x600GB disks in a Raid6+0 (Raid0 of 2x 10disk raid 6 arrays) + 2 spare disks. Overall size 8.7 TB in that configuration. What is the RAID controller used in the server? That doesn't impact the donation, I'm just trying to fit this one into my goals for finding useful community performance testing equipment. There are a good number of systems floating around the community with HP controllers--I have even one myself now--but we could use more LSI Logic and Adaptec based systems. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 7/18/13 6:45 PM, Tatsuo Ishii wrote: I'm not a native English speaker either... Greg, could you please review the document? Yes, I already took a look at it briefly. The updates move in the right direction, but I can edit them usefully before commit. I'll have that done by tomorrow and send out a new version. I'm hopeful that v18 will finally be the one that everyone likes. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])
On 7/18/13 4:02 PM, Alvaro Herrera wrote: I think we should just put the config directives of ALTER SYSTEM into a file, not dir, alongside postgresql.conf; I would suggest postgresql.auto.conf. This file is parsed automatically after postgresql.conf, without requiring an include directive in postgresql.conf. It is possible to build ALTER SYSTEM SET this way. One of the goals of the implementation style used here wasn't just to accomplish that narrow feature though. We keep running into situations where a standard, directory based configuration system would make things easier. Each time that happens, someone comes up with another way to avoid doing it. I think that approach contributes to stagnation in the terrible status quo of PostgreSQL configuration management. I thought this was a good spot to try and re-draw this line because I don't want just one program that is able to create new configuration entries easily. I want to see a whole universe of them. ALTER SYSTEM SET, tuning helpers, replication helpers, logging helpers, vacuum schedulers. All of them *could* just dump a simple file into a config directory with code anyone can write. And having ALTER SYSTEM SET do that provides a strong precedent for how it can be done. (I'd like to see initdb do that instead of hacking the system postgresql.conf as if sed-style edits were still the new hotness, but that's a future change) My claim, and that's one informed by writing pgtune, is that by far the hardest part of writing a configuration addition tool is parsing a postgresql.conf file. The expectation right now is that all changes must happen there. Creating a new snippet of configuration settings is easy, but no tools know where to put one right now. Instead we just keep coming up with single, hard-coded file names that people have to know in order to manage their installations. This seems the simplest way; config tools such as Puppet know they always need to consider the postgresql.auto.conf file. Like this idea. 
What administrators really want, whether they realize it or not, is to point Puppet at a configuration directory. Then the problem of what the config files are in the new version of Postgres happens once more and then it's over. Exactly what the files in there are named should be completely under the control of the administrator. We're never going to reach that unless we lead by example though. The database's configuration pushes people toward using a small number of files with magic names--postgresql.conf, recovery.conf, and now postgresql.auto.conf in your proposal. Meanwhile, every sensible UNIX-style project with a complicated configuration in text files has moved toward a configuration directory full of them. Here are some directories on the last RHEL6 system I was editing configuration on this week:

/etc/httpd/conf.d/
/etc/munin/conf.d/
/etc/ld.so.conf.d/
/etc/dovecot/conf.d/
/etc/yum/pluginconf.d/

Some of them didn't get the memo that the right standard name is conf.d now, but they're the minority. It's fine to have a postgresql.conf file that you *can* make all your changes to, for people who want to stay with the old monolithic approach. But if there were also a conf.d directory under there, many classes of administrator would breathe a sigh of relief. Given all the precedents people have already run into, with the right naming it can be made discoverable and familiar to a lot of administrators. Telling them instead that there's this new postgresql.auto.conf file that suddenly they have to worry about--I'd say don't even bother if that's how you want to do it. That's making the problem I've been trying to simplify for five years now even worse. I think the whole business of parsing the file, and then verifying whether the auto file has been parsed, is nonsensical and should be removed from the patch. That was just trying to keep people from screwing up their configuration while they get used to things.
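As a sketch of how that directory style could look, postgresql.conf's include directives can already pull in a whole directory of snippets; the directory and file names here are only examples, not anything the project has standardized on:

```ini
# postgresql.conf -- the monolithic file still works as before
shared_buffers = '8GB'

# Read every *.conf file from the conf.d directory, in filename order.
# A tuning or replication helper can then drop in its own snippet,
# e.g. conf.d/20-tuning.conf, without having to parse this file.
include_dir = 'conf.d'
```

With that in place, a tool like pgtune or Puppet only needs to write a new file into conf.d, which is exactly the "anyone can write a config snippet" property being argued for above.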
That and some of the other sanity checking is not necessary, it was just trying to make the transition to the new configuration approach less error-prone. I don't think anyone would disagree that the current patch is doing enough of that error checking work that the error checking itself is the most likely thing to break. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.4] row level security
On 7/18/13 11:03 PM, Stephen Frost wrote: Wasn't there a wiki page about this feature which might also help? Bigger question- is it correct for the latest version of the patch? https://wiki.postgresql.org/wiki/RLS has accumulated a lot of older debris that may or may not be useful here. This useful piece was just presented at PGCon: http://www.pgcon.org/2013/schedule/attachments/273_PGcon2013-kaigai-row-level-security.pdf That is very up to date intro to the big picture issues. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 7/17/13 2:31 AM, Tatsuo Ishii wrote: ./pgbench -p 5433 -S -T 10 -R 1 test average rate limit lag: 862.534 ms (max 2960.913 ms) tps = 7133.745911 (including connections establishing) tps = 7135.130810 (excluding connections establishing) What does average rate limit lag mean? The whole concept of lag with the rate limit is complicated. At one point I thought this should be a debugging detail, rather than exposing the user to it. The problem is that if you do that, you might not notice that your limit failed to work as expected. Maybe it's good enough in a case like this that the user will see they tried to limit at 1, but they only got 7135, so something must not have worked as expected. Tatsuo: most of my tests were on Mac OS and Linux, I actually tested the Mac version a lot more than any other here. I didn't do any testing on Windows. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/16/13 11:36 PM, Ants Aasma wrote: As you know running a full suite of write benchmarks takes a very long time, with results often being inconclusive (noise is greater than effect we are trying to measure). I didn't say that. What I said is that over a full suite of write benchmarks, the effect of changes like this has always averaged out to zero. You should try it sometime. Then we can have a useful discussion of non-trivial results instead of you continuing to tell me I don't understand things. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 7/17/13 9:16 PM, Tatsuo Ishii wrote: Now suppose we have 3 transactions and each has the following values:

d(0) = 10 d(1) = 20 d(2) = 30
t(0) = 100 t(1) = 110 t(2) = 120

That says pgbench expects a duration of 10 for each transaction. Actually, the first transaction runs slowly for some reason and the lag = 100 - 10 = 90. However, tx(1) and tx(2) finish on schedule because they spend only 10 each (110-100 = 10, 120-110 = 10). So the expected average lag would be 90/3 = 30. The clients are not serialized here in any significant way, even when they share a single process/thread. I did many rounds of tracing through this code with timestamps on each line, and the sequence of events here will look like this:

client 0: send SELECT... to server. yield to next client.
client 1: send SELECT... to server. yield to next client.
client 2: send SELECT... to server. yield to next client.
select(): wait for the first response from any client.
client 0: receive response. complete transaction, compute lag.
client 1: receive response. complete transaction, compute lag.
client 2: receive response. complete transaction, compute lag.

There is nothing here that queues the clients one after the other. If (0) takes 100ms before its reply comes back, (1) and (2) can receive their replies and continue forward at any time. They are not waiting for (0); it has yielded control while waiting for a response. All three times are independent once you reach the select() point where all are active. In this situation, if the server gets stuck doing something such that it takes 100ms before any client receives a response, it is correct to penalize every client for that latency. All three clients could have received the information earlier if the server had any to send them. If they did not, they all were suffering from some sort of lag. I'm not even sure why you spaced the start times out at intervals of 10.
If I were constructing an example like this, I'd have them start at times of 0, 1, 2--as fast as the CPU can fire off statements basically--and then start waiting from that point. If client 1 takes 10 units of time to send its query out before client 2 runs, and the rate goal requires 10 units of time, the rate you're asking for is impossible.

For sorting out what's going on with your two systems, I would recommend turning on debugging output with -d and looking at the new per-transaction latency numbers that the feature reports. If your theory that the lag is going up as the test proceeds is true, that should show up in the individual latency numbers too.

Based on what I saw during weeks of testing here, I would be more suspicious that there's a system level difference between your two servers than to blame the latency calculation. I saw a *lot* of weird system issues myself when I started looking that carefully at sustained throughput. The latency reports from the perspective of Fabien's code were always reasonable though. When something delays every client, it counts that against every active client's lag, and that's the right thing to do.
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 7/17/13 11:34 PM, Tatsuo Ishii wrote:
> My example is for 1 client case. So concurrent clients are not the issue here.

Sorry about that; with your clarification I see what you were trying to explain now. The code initializes the target time like this:

    thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);

And then each time a transaction fires, it advances the reference time forward based on the expected rate:

    thread->throttle_trigger += wait;

It does *not* reset thread->throttle_trigger based on when the previous transaction ended / when the next transaction started. If the goal is 10us transaction times, it beats a steady drum saying the transactions should come at 10us, 20us, 30us (on average--there's some randomness in the goals). It does not pay any attention to when the previous transactions finished. That means that if an early transaction takes an extra 1000us, every transaction after that will also show as 1000us late--even if each of them takes 10us.

You expect that those later transactions will show 0 lag, since they took the right duration. For that to happen, thread->throttle_trigger would need to be re-initialized with the current time at the end of each completed transaction.

The lag computation was not the interesting part of this feature to me. As I said before, I considered it more of a debugging level thing than a number people would analyze as much as you did. I understand why you don't like it though. If the reference time was moved forward to match the transaction end each time, I think that would give the lag definition you're looking for. That's fine with me too, if Fabien doesn't have a good reason to reject the idea. We would need to make sure that doesn't break some part of the design too.
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/16/13 12:46 PM, Ants Aasma wrote:
> Spread checkpoints sprinkles the writes out over a long period and the general tuning advice is to heavily bound the amount of memory the OS is willing to keep dirty.

That's arguing that you can make this feature useful if you tune in a particular way. That's interesting, but the goal here isn't to prove the existence of some workload that a change is useful for. You can usually find a test case that validates any performance patch as helpful if you search for one. Everyone who has submitted a sorted checkpoint patch, for example, has found some setup where it shows significant gains. We're trying to keep performance stable across a much wider set of possibilities though.

Let's talk about default parameters instead, which quickly demonstrates where your assumptions fail. The server I happen to be running pgbench tests on today has 72GB of RAM running SL6 with RedHat derived kernel 2.6.32-358.11.1. This is a very popular middle grade server configuration nowadays. There dirty_background_ratio and dirty_ratio are 10 (percent). That means that roughly 7GB of RAM can be used for write caching. Note that this is a fairly low write cache tuning compared to a survey of systems in the field--lots of people have servers with earlier kernels where these numbers can be as high as 20 or even 40% instead.

The current feasible tuning for shared_buffers suggests a value of 8GB is near the upper limit, beyond which cache related overhead makes increases counterproductive. Your examples are showing 53% of shared_buffers dirty at checkpoint time; that's typical. The checkpointer is then writing out just over 4GB of data. With that background, which process here has more data to make decisions with?

-The operating system has 7GB of writes it's trying to optimize. That potentially includes backend, background writer, checkpoint, temp table, statistics, log, and WAL data. The scheduler is also considering read operations.
-The checkpointer process has 4GB of writes from rarely written shared memory it's trying to optimize.

This is why if you take the opposite approach of yours today--go searching for workloads where sorting is counterproductive--those are equally easy to find. Any test of write speed I do starts with about 50 different scale/client combinations. Why do I suggest pgbench-tools as a way to do performance tests? It's because an automated sweep of client setups like it does is the minimum necessary to create enough variation in workload for changing the database's write path. It's really amazing how often doing that shows a proposed change is just shuffling the good and bad cases around. That's been the case for every sorting and fsync delay change submitted so far. I'm not even interested in testing today's submission because I tried that particular approach for a few months, twice so far, and it fell apart on just as many workloads as it helped.

> The checkpointer has the best long term overview of the situation here, OS scheduling only has the short term view of outstanding read and write requests.

True only if shared_buffers is large compared to the OS write cache, which was not the case in the example I generated with all of a minute's work. I regularly see servers where Linux's Dirty area becomes a multiple of the dirty buffers written by a checkpoint. I can usually make that happen at will with CLUSTER and VACUUM on big tables. The idea that the checkpointer has a long-term view while the OS has only a short one presumes a setup that I would say is possible but not common.

> kernel settings: dirty_background_bytes = 32M, dirty_bytes = 128M.

You disclaimed this as a best case scenario. It is a low throughput / low latency tuning. That's fine, but if Postgres optimizes itself toward those cases it runs the risk of high throughput servers with large caches being detuned.
I've posted examples before showing very low write caches like this leading to VACUUM running at 1/2 its normal speed or worse, as a simple example of where a positive change in one area can backfire badly on another workload. That particular problem was so common I updated pgbench-tools recently to track table maintenance time between tests, because that demonstrated an issue even when the TPS numbers all looked fine.
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
To clarify what state this is all in: Fabien's latest pgbench-throttle-v15.patch is the ready for a committer version. The last two revisions are just tweaking the comments at this point, and his version is more correct than my last one.

My little pgbench-delay-finish-v1.patch is a brand new bug fix of sorts that, while trivial, isn't ready for commit yet. I'll start a separate e-mail thread and CF entry for that later. Fabien has jumped into initial review comments of that already below, but the throttle feature isn't dependent on that. The finish delay just proves that the latency spikes I was getting here aren't directly tied to the throttle feature.

On 7/14/13 5:42 PM, Fabien COELHO wrote:
> * ISTM that the impact of the chosen 1000 should appear somewhere.

I don't have a problem with that, but I didn't see that the little table you included was enough to do that. I think if someone knows how this type of random generation works, they don't need the comment to analyze the impact. And if they don't know, that comment alone wasn't enough to help them figure it out. That's why I added some terms that might help point the right way for someone who wanted to search for more information instead. The text of pgbench is not really the right place to put a lecture about how to generate numbers with a target probability distribution function. Normally the code comments try to recommend references for this sort of thing instead. I didn't find a really good one in a quick search though.

> About 123456 12345 vs 123456.012345: My data parser is usually gnuplot or my eyes, and in both cases the latter option is better :-)

pgbench-tools uses gnuplot too. If I were doing this again today from scratch, I would recommend using the epoch time format compatible with it you suggested. I need to look into this whole topic a little more before we get into that though. This patch just wasn't the right place to get into that change.
About the little patch:

> Parameter ok should be renamed to something meaningful (say do_finish?).

It's saying whether the connection finished ok or not. I think exactly what is done with that information is an implementation detail that doesn't need to be exposed in the function interface. We might change how this is tied to PQfinish later.

> Also, it seems that when timer is exceeded in doCustom it is called with true, but maybe you intended that it should be called with false??

The way timeouts are handled right now is a known messy thing. Exactly what you should do with a client that has hit one isn't obvious. Try again? Close the connection and abort? The code has a way it handles that now, and I didn't want to change it any.

> it is important to remove the connection because it serves as a marker to know whether a client must run or not.

This little hack moved around how clients finished enough to get rid of the weird issue with your patch I was bothered by. You may be right that the change isn't really correct because of how the connection is compared to null as a way to see if it's active. I initially added a more complicated finished state to the whole mess that tracked this more carefully. I may need to return to that idea now.
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 7/13/13 12:13 PM, Fabien COELHO wrote:
> My 0.02€: if it means adding complexity to the pgbench code, I think that it is not worth it. The point of pgbench is to look at a steady state, not to end in the most graceful possible way as far as measures are concerned.

That's how some people use pgbench. I'm just as likely to use it to characterize system latency. If there's a source of latency that's specific to the pgbench code, I want that out of there even if it's hard. But we don't have to argue about that, because it isn't. The attached new patch seems to fix the latency spikes at the end, with -2 lines of new code!

With that resolved I did a final pass across the rate limit code too, attached as a v14 and ready for a committer. I don't really care what order these two changes are committed; there's no hard dependency, but I would like to see them both go in eventually. No functional code was changed from your v13 except for tweaking the output. The main thing I did was expand/edit comments and rename a few variables to try and make this easier to read. If you have any objections to my cosmetic changes feel free to post an update. I've put a good bit of time into trying to simplify this further, thinking it can't really be this hard. But this seems to be the minimum complexity that still works given the mess of the pgbench state machine. Every change I try now breaks something.

To wrap up the test results motivating my little pgbench-delay-finish patch, the throttled cases that were always giving 100ms of latency clustered at the end here now look like this:

    average rate limit lag: 0.181 ms (max 53.108 ms)
    tps = 10088.727398 (including connections establishing)
    tps = 10105.775864 (excluding connections establishing)

There are still some of these cases where latency spikes, but they're not as big and they're randomly distributed throughout the run. The problem I had with the ones at the end is how they tended to happen a few times in a row.
I kept seeing multiple of these ~50ms lulls adding up to a huge one, because the source of the lag kept triggering at every connection close. pgbench was already cleaning up all of its connections at the end, after all the transactions were finished. It looks safe to me to just rely on that for calling PQfinish in the normal case. And calls to client_done already label themselves ok or abort; the code just didn't do anything with that state before. I tried adding some more complicated state tracking, but that adds complexity while doing the exact same thing as the simple implementation I did.

The only part of your code change I reverted was altering the latency log transaction timestamps to read like 1373821907.65702 instead of 1373821907 65702. Both formats were considered when I added that feature, and I completely understand why you'd like to change it. One problem is that doing so introduces a new class of float parsing and rounding issues to consumers of that data. I'd rather not revisit that without a better reason to break the output format. Parsing tools like my pgbench-tools already struggle trying to support multiple versions of pgbench, and I don't think there's enough benefit to the float format to bother breaking them today.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
new file mode 100644
index 08095a9..8dc81e5
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
*** preparedStatementName(char *buffer, int
*** 862,875 ****
  static bool
  clientDone(CState *st, bool ok)
  {
! 	/*
! 	 * When the connection finishes normally, don't call PQfinish yet.
! 	 * PQfinish can cause significant delays in other clients that are
! 	 * still running.  Rather than finishing all of them here, in the
! 	 * normal case clients are instead closed in bulk by disconnect_all,
! 	 * after they have all stopped.
! 	 */
!
 	if ((st->con != NULL) && ok)
 	{
 		PQfinish(st->con);
 		st->con = NULL;
--- 862,870 ----
  static bool
  clientDone(CState *st, bool ok)
  {
! 	(void) ok;				/* unused */
!
! 	if (st->con != NULL)
 	{
 		PQfinish(st->con);
 		st->con = NULL;

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 80203d6..da88bd7 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@
 int			unlogged_tables = 0;
 double		sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec. 0 is the default and means no throttling.
+ */
+int64		throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 6/16/13 10:27 AM, Heikki Linnakangas wrote:
> Yeah, the checkpoint scheduling logic doesn't take into account the heavy WAL activity caused by full page images... Rationalizing a bit, I could even argue to myself that it's a *good* thing. At the beginning of a checkpoint, the OS write cache should be relatively empty, as the checkpointer hasn't done any writes yet. So it might make sense to write a burst of pages at the beginning, to partially fill the write cache first, before starting to throttle. But this is just handwaving - I have no idea what the effect is in real life.

That's exactly right. When a checkpoint finishes, the OS write cache is clean. That means all of the full-page writes aren't even hitting disk in many cases. They just pile up in the OS dirty memory, often sitting there all the way until the next checkpoint's fsyncs start. That's why I never wandered down the road of changing FPW behavior. I have never seen a benchmark workload hit a write bottleneck until long after the big burst of FPW pages is over. I could easily believe that there are low-memory systems where the FPW write pressure becomes a problem earlier. And slim VMs make sense as the place this behavior is being seen at.

I'm a big fan of instrumenting the code around a performance change before touching anything, as a companion patch that might make sense to commit on its own. In the case of a change to FPW spacing, I'd want to see some diagnostic output in something like pg_stat_bgwriter that tracks how many FPW pages are being modified. A pgstat_bgwriter.full_page_writes counter would be perfect here; then graph that data over time as the benchmark runs.

> Another thought is that rather than trying to compensate for that effect in the checkpoint scheduler, could we avoid the sudden rush of full-page images in the first place?
> The current rule for when to write a full page image is conservative: you don't actually need to write a full page image when you modify a buffer that's sitting in the buffer cache, if that buffer hasn't been flushed to disk by the checkpointer yet, because the checkpointer will write and fsync it later. I'm not sure how much it would smoothen WAL write I/O, but it would be interesting to try.

There I also think the right way to proceed is instrumenting that area first.

> A long time ago, Itagaki wrote a patch to sort the checkpoint writes: www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp. He posted very promising performance numbers, but it was dropped because Tom couldn't reproduce the numbers, and because sorting requires allocating a large array, which has the risk of running out of memory, which would be bad when you're trying to checkpoint.

I updated and re-reviewed that in 2011: http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com and commented on why I think the improvement was difficult to reproduce back then. The improvement didn't follow for me either. It would take a really amazing bit of data to get me to believe write sorting code is worthwhile after that. On large systems capable of dirtying enough blocks to cause a problem, the operating system and RAID controllers are already sorting blocks. And *that* sorting is also considering concurrent read requests, which are a lot more important to an efficient schedule than anything the checkpoint process knows about. The database doesn't have nearly enough information yet to compete against OS level sorting.

> Bad point of my patch is longer checkpoint. Checkpoint time was increased about 10% - 20%. But it can work correctly on schedule-time in checkpoint_timeout. Please see checkpoint result (http://goo.gl/NsbC6).
> For a fair comparison, you should increase the checkpoint_completion_target of the unpatched test, so that the checkpoints run for roughly the same amount of time with and without the patch. Otherwise the benefit you're seeing could be just because of a more lazy checkpoint.

Heikki has nailed the problem with the submitted dbt-2 results here. If you spread checkpoints out more, you cannot fairly compare the resulting TPS or latency numbers anymore. Simple example: 20 minute long test. Server A does a checkpoint every 5 minutes. Server B has modified parameters or server code such that checkpoints happen every 6 minutes. If you run both to completion, A will have hit 4 checkpoints that flush the buffer cache, B only 3. Of course B will seem faster. It didn't do as much work.

pgbench-tools measures the number of checkpoints during the test, as well as the buffer count statistics. If those numbers are very different between two tests, I have to throw them out as unfair. A lot of things that seem promising turn out to have this sort of problem.
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 6/27/13 11:08 AM, Robert Haas wrote:
> I'm pretty sure Greg Smith tried the fixed-sleep thing before and it didn't work that well.

That's correct, I spent about a year whipping that particular horse and submitted improvements on it to the community. http://www.postgresql.org/message-id/4d4f9a3d.5070...@2ndquadrant.com and its updates downthread are good ones to compare this current work against.

The important thing to realize about just delaying fsync calls is that it *cannot* increase TPS throughput. Not possible in theory, obviously doesn't happen in practice. The most efficient way to write things out is to delay those writes as long as possible. The longer you postpone a write, the more elevator sorting and write combining you get out of the OS. This is why operating systems like Linux come tuned for such delayed writes in the first place. Throughput and latency are linked; any patch that aims to decrease latency will probably slow throughput. Accordingly, the current behavior--no delay--is already the best possible throughput.

If you apply a write timing change and it seems to increase TPS, that's almost certainly because it executed fewer checkpoint writes. It's not a fair comparison. You have to adjust any delaying to still hit the same end point on the checkpoint schedule. That's what my later submissions did, and under that sort of controlled condition most of the improvements went away.

Now, I still do really believe that better spacing of fsync calls helps latency in the real world. Far as I know the server that I developed that patch for originally in 2010 is still running with that change. The result is not a throughput change though; it is a throughput drop with a latency improvement. That is the unbreakable trade-off in this area if all you touch is scheduling.
The reason why I was ignoring this discussion and working on pgbench throttling until now is that you need to measure latency at a constant throughput to advance on this topic, and that's exactly what the new pgbench feature enables. If we can take the current checkpoint scheduler and an altered one, run both at exactly the same rate, and one gives lower latency, now we're onto something. It's possible to do that with DBT-2 as well, but I wanted something really simple that people could replicate results with in pgbench.
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 7/14/13 2:48 PM, Fabien COELHO wrote:
> You attached my v13. Could you send your v14?

Correct patch (and the little one from me again) attached this time.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
new file mode 100644
index 08095a9..6564057
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
*** int unlogged_tables = 0;
*** 137,142 ****
--- 137,148 ----
  double		sample_rate = 0.0;
 
+ /*
+  * When threads are throttled to a given rate limit, this is the target delay
+  * to reach that rate in usec. 0 is the default and means no throttling.
+  */
+ int64		throttle_delay = 0;
+ 
  /*
   * tablespace selection
   */
  char	   *tablespace = NULL;
*** typedef struct
*** 200,210 ****
--- 206,218 ----
  	int			listen;			/* 0 indicates that an async query has been
  								 * sent */
  	int			sleeping;		/* 1 indicates that the client is napping */
+ 	bool		throttling;		/* whether nap is for throttling */
  	int64		until;			/* napping until (usec) */
  	Variable   *variables;		/* array of variable definitions */
  	int			nvariables;
  	instr_time	txn_begin;		/* used for measuring transaction latencies */
  	instr_time	stmt_begin;		/* used for measuring statement latencies */
+ 	bool		is_throttled;	/* whether transaction should be throttled */
  	int			use_file;		/* index in sql_files for this client */
  	bool		prepared[MAX_FILES];
  } CState;
*** typedef struct
*** 222,227 ****
--- 230,238 ----
  	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
  	int		   *exec_count;		/* number of cmd executions (per Command) */
  	unsigned short random_state[3];		/* separate randomness for each thread */
+ 	int64		throttle_trigger;	/* previous/next throttling (us) */
+ 	int64		throttle_lag;		/* total transaction lag behind throttling */
+ 	int64		throttle_lag_max;	/* max transaction lag */
  } TState;
 
  #define INVALID_THREAD		((pthread_t) 0)
*** typedef struct
*** 230,235 ****
--- 241,248 ----
  {
  	instr_time	conn_time;
  	int			xacts;
+ 	int64		throttle_lag;
+
+ 	int64		throttle_lag_max;
  } TResult;
 
  /*
*** usage(void)
*** 353,358 ****
--- 366,372 ----
  	-n, --no-vacuum          do not run VACUUM before tests\n
  	-N, --skip-some-updates  skip updates of pgbench_tellers and pgbench_branches\n
  	-r, --report-latencies   report average latency per command\n
+ 	-R, --rate SPEC          target rate in transactions per second\n
  	-s, --scale=NUM          report this scale factor in output\n
  	-S, --select-only        perform SELECT-only transactions\n
  	-t, --transactions       number of transactions each client runs
*** doCustom(TState *thread, CState *st, ins
*** 900,916 ****
  {
  	PGresult   *res;
  	Command   **commands;
 
  top:
  	commands = sql_files[st->use_file];
 
  	if (st->sleeping)
  	{							/* are we sleeping? */
  		instr_time	now;
 
  		INSTR_TIME_SET_CURRENT(now);
! 		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
  			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
  		else
  			return true;		/* Still sleeping, nothing to do here */
  	}
--- 914,973 ----
  {
  	PGresult   *res;
  	Command   **commands;
+ 	bool		trans_needs_throttle = false;
 
  top:
  	commands = sql_files[st->use_file];
 
+ 	/*
+ 	 * Handle throttling once per transaction by sleeping. It is simpler
+ 	 * to do this here rather than at the end, because so much complicated
+ 	 * logic happens below when statements finish.
+ 	 */
+ 	if (throttle_delay && !st->is_throttled)
+ 	{
+ 		/*
+ 		 * Use inverse transform sampling to randomly generate a delay, such
+ 		 * that the series of delays will approximate a Poisson distribution
+ 		 * centered on the throttle_delay time.
+ 		 *
+ 		 * If transactions are too slow or a given wait is shorter than
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/3/13 9:39 AM, Andres Freund wrote:
> I wonder how much of this could be gained by doing a sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing the original checkpoint-pass through the buffers or when fsyncing the files.

The fsync calls decompose into the queued set of block writes. If they all need to go out eventually to finish a checkpoint, the most efficient way from a throughput perspective is to dump them all at once. I'm not sure sync_file_range targeting checkpoint writes will turn out any differently than block sorting. Let's say the database tries to get involved in forcing a particular write order that way. Right now it's going to be making that ordering decision without the benefit of also knowing what blocks are being read. That makes it hard to do better than the OS, which knows a different--and potentially more useful in a read-heavy environment--set of information about all the pending I/O. And it would be very expensive to make all the backends start sharing information about what they read to ever pull that logic into the database.

It's really easy to wander down the path where you assume you must know more than the OS does, which leads to things like direct I/O. I am skeptical of that path in general. I really don't want Postgres to be competing with the innovation rate in Linux kernel I/O if we can ride it instead.

One idea I was thinking about that overlaps with a sync_file_range refactoring is simply tracking how many blocks have been written to each relation. If there was a rule like "fsync any relation that's gotten more than 100 8K writes", we'd never build up the sort of backlog that causes the worst latency issues. You really need to start tracking the file range there, just to fairly account for multiple writes to the same block. One of the reasons I don't mind all the work I'm planning to put into block write statistics is that I think that will make it easier to build this sort of facility too.
The original page write and the fsync call that eventually flushes it out are very disconnected right now, and file range data seems the right missing piece to connect them well.
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/11/13 8:29 AM, KONDO Mitsumasa wrote:
> I use linear combination method for considering about total checkpoint schedule which are write phase and fsync phase. V3 patch was considered about only fsync phase, V4 patch was considered about write phase and fsync phase, and v5 patch was considered about only fsync phase.

Your v5 now looks like my "Self-tuning checkpoint sync spread" series: https://commitfest.postgresql.org/action/patch_view?id=514 which I did after deciding write phase delays didn't help. It looks to me like some, maybe all, of your gain is coming from how any added delays spread out the checkpoints. The self-tuning part I aimed at was trying to stay on exactly the same checkpoint end time even with the delays in place. I got that part to work, but the performance gain went away once the schedule was a fair comparison. You are trying to solve a very hard problem. How long are you running your dbt-2 tests for? I didn't see that listed anywhere.

> ** Average checkpoint duration (sec) (Not include during loading time)
>                       | write_duration | sync_duration | total
> fsync v3-0.7          | 296.6          | 251.8898      | 548.48 | OK
> fsync v3-0.9          | 292.086        | 276.4525      | 568.53 | OK
> fsync v3-0.7_disabled | 303.5706       | 155.6116      | 459.18 | OK
> fsync v4-0.7          | 273.8338       | 355.6224      | 629.45 | OK
> fsync v4-0.9          | 329.0522       | 231.77        | 560.82 | OK

I graphed the total times against the resulting NOTPM values and attached that. I expect transaction rate to increase along with the time between checkpoints, and that's what I see here. The fsync v4-0.7 result is worse than the rest for some reason, but all the rest line up nicely. Notice how fsync v3-0.7_disabled has the lowest total time between checkpoints, at 459.18. That is why it has the most I/O and therefore runs more slowly than the rest. If you take your fsync v3-0.7_disabled and increase checkpoint_segments and/or checkpoint_timeout until that test is averaging about 550 seconds between checkpoints, NOTPM should also increase.
That's interesting to know, but you don't need any change to Postgres for that. That's what always happens when you have fewer checkpoints per run.

If you get a checkpoint time table like this where the total duration is very close--within +/-20 seconds is the sort of noise I would expect there--at that point I would say you have all your patches on the same checkpoint schedule. And then you can compare the NOTPM numbers usefully. When the checkpoint times are in a large range like 459.18 to 629.45 in this table, as my graph shows the associated NOTPM numbers are going to be based on that time.

I would recommend taking a snapshot of pg_stat_bgwriter before and after the test runs, and then showing the difference between all of those numbers too. If the test runs for a while--say 30 minutes--the total number of checkpoints should be very close too.

* Test Server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB (15k) * 4, RAID1+0
RAID controller: P410i/256MB
(Add) Set off energy efficient function in BIOS and OS.

Excellent, here I have a DL160 G6 with 2 processors, 72GB of RAM, and that same P410 controller + 4 disks. I've been meaning to get DBT-2 running on there usefully, and your research gives me a reason to do that.

You seem to be in a rush due to the commitfest schedule. I have some bad news for you there. You're not going to see a change here committed in this CF based on where it's at, so you might as well think about the best longer term plan. I would be shocked if anything came out of this in less than 3 months really. That's the shortest amount of time I've ever done something useful in this area. Each useful benchmarking run takes me about 3 days of computer time, so it's not a very fast development cycle. Even if all of your results were great, we'd need to get someone to duplicate them on another server, and we'd need to make sure they didn't make other workloads worse.
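The before/after pg_stat_bgwriter comparison suggested above is easy to script. A minimal sketch in Python, assuming the two snapshots have already been fetched into dicts (column names are from pg_stat_bgwriter; the numbers below are made up purely for illustration):

```python
# Sketch: diff two pg_stat_bgwriter snapshots taken before and after a test run.
# Snapshots are dicts of column -> counter value, e.g. from
# "SELECT * FROM pg_stat_bgwriter" via any driver.

def diff_bgwriter(before, after):
    """Return per-counter deltas between two pg_stat_bgwriter snapshots."""
    return {col: after[col] - before[col] for col in before}

# Hypothetical counter values for illustration only:
before = {"checkpoints_timed": 10, "checkpoints_req": 2,
          "buffers_checkpoint": 50000, "buffers_backend": 12000}
after  = {"checkpoints_timed": 13, "checkpoints_req": 2,
          "buffers_checkpoint": 98000, "buffers_backend": 30500}

delta = diff_bgwriter(before, after)
# If two patched builds show very different checkpoint counts over the same
# run length, their checkpoint schedules were not comparable.
print(delta["checkpoints_timed"] + delta["checkpoints_req"])  # checkpoints during the run
```

The same delta dict also gives writes/transaction once divided by the transaction count from the benchmark, which is the waste metric discussed elsewhere in this thread.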
DBT-2 is very useful, but no one is going to get a major change to the write logic in the database committed based on one benchmark. Past changes like this have used both DBT-2 and a large number of pgbench tests to get enough evidence of improvement to commit. I can help with that part when you get to something I haven't tried already. I am very interested in improving this area, it just takes a lot of work to do it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com attachment: NOTPM-Checkpoints.png -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/14/13 5:28 PM, james wrote: Some random seeks during sync can't be helped, but if they are done when we aren't waiting for sync completion then they are in effect free.

That happens sometimes, but if you measure you'll find this doesn't actually occur usefully in the situation everyone dislikes. In a write heavy environment where the database doesn't fit in RAM, backends and/or the background writer are constantly writing data out to the OS. WAL is going out constantly as well, and in many cases that's competing for the disks too. The most popular blocks in the database get high usage counts and they never leave shared_buffers except at checkpoint time. That's easy to prove to yourself with pg_buffercache. And once the write cache fills, every I/O operation is now competing. There is nothing happening for free. You're stealing I/O from something else any time you force a write out.

The optimal throughput path for checkpoints turns out to be delaying every single bit of I/O as long as possible, in favor of the [backend|bgwriter] writes and WAL. Whenever you delay a buffer write, you have increased the possibility that someone else will write the same block again. And the buffers being written by the checkpointer are, on average, the most popular ones in the database. Writing any of them to disk pre-emptively has high odds of writing the same block more than once per checkpoint. And that easy-to-measure waste--it shows up as more writes/transaction in pg_stat_bgwriter--hurts throughput more than any reduction in seek overhead you might otherwise get from early writes. The big gain isn't chasing after cheap seeks. The best path is the one that decreases the total volume of writes.

We played this game with the background writer work for 8.3. The main reason the one committed improved on the original design is that it completely eliminated doing work on popular buffers in advance.
Everything happens at the last possible time, which is the optimal throughput situation. The 8.1/8.2 BGW used to try and write things out before they were strictly necessary, in hopes that that I/O would be free. But it rarely was, while there was always a cost to forcing them to disk early. And that cost is highest when you're talking about the higher usage blocks the checkpointer tends to write. When in doubt, always delay the write in hopes it will be written to again and you'll save work. So it occurs to me that perhaps we can watch for patterns where we have groups of adjacent writes that might stream, and when they form we might schedule them... Stop here. I mentioned something upthread that is worth repeating. The checkpointer doesn't know what concurrent reads are happening. We can't even easily make it know, not without adding a whole new source of IPC and locking contention among clients. Whatever scheduling decision the checkpointer might make with its limited knowledge of system I/O is going to be poor. You might find a 100% write benchmark that it helps, but those are not representative of the real world. In any mixed read/write case, the operating system is likely to do better. That's why things like sorting blocks sometimes seem to help someone, somewhere, with one workload, but then aren't repeatable. We can decide to trade throughput for latency by nudging the OS to deal with its queued writes more regularly. That will result in more total writes, which is the reason throughput drops. But the idea that PostgreSQL is going to do a better global job of I/O scheduling, that road is a hard one to walk. It's only going to happen if we pull all of the I/O into the database *and* do a better job on the entire process than the existing OS kernel does. That sort of dream, of outperforming the filesystem, it is very difficult to realize. There's a good reason that companies like Oracle stopped pushing so hard on recommending raw partitions. 
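The write-coalescing argument above can be illustrated with a toy simulation (my own sketch, not PostgreSQL's actual buffer manager): dirty a skewed stream of blocks, then compare writing a block out every time it is dirtied versus keeping blocks dirty and writing each one once at checkpoint time.

```python
import random

def total_writes(dirties, eager):
    """Count disk writes for a stream of block-dirtying events.
    eager=True  -> write a block out as soon as it is dirtied (early write);
    eager=False -> keep blocks dirty, write each distinct block once at checkpoint."""
    if eager:
        return len(dirties)       # every dirtying event forces a write
    return len(set(dirties))      # repeat-dirtied blocks coalesce into one write

random.seed(42)
# Skewed access pattern: a few hot blocks absorb a large share of the updates.
population = [1, 2, 3] * 10 + list(range(4, 100))
dirties = [random.choice(population) for _ in range(1000)]

eager = total_writes(dirties, eager=True)    # 1000 writes
lazy = total_writes(dirties, eager=False)    # at most 99 distinct blocks
print(eager, lazy)
```

The gap between the two counts is exactly the pg_stat_bgwriter writes/transaction waste the text describes; the hotter the blocks, the bigger the gap.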
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Changing recovery.conf parameters into GUCs
On 7/10/13 9:39 AM, Heikki Linnakangas wrote: On 10.07.2013 02:54, Josh Berkus wrote: One bit of complexity I'd like to see go away is that we have two trigger files: one to put a server into replication, and one to promote it. The promotion trigger file is a legacy of warm standby, I believe. Maybe, now that we have pg_ctl promote available, we can eliminate the promotion trigger? No, see http://www.postgresql.org/message-id/5112a54b.8090...@vmware.com.

Right, you had some good points in there. STONITH is so hard already, we need to be careful about eliminating options there. All the summaries added here have actually managed to revive this one usefully early in the release cycle! Well done. I just tried to apply Michael's 20130325_recovery_guc_v3.patch and the bit rot isn't that bad either. 5 rejection files, nothing big in them.

The only overlap between the recovery.conf GUC work and refactoring the trigger file is that the trigger file is referenced in there, and we really need to point it somewhere to be most useful. Personally, I don't care if we support the old recovery.conf-trigger behavior as long as I'm not forced to use it. If you accept Heikki's argument for why the file can't go away altogether, and it's pretty reasonable, I think two changes get this to a point where everyone can live with it:

-We need a useful default filename for trigger_file to point at.
-pg_ctl failover creates that file.

As for the location to default to, the pg_standby docs suggest /tmp/pgsql.trigger.5432 while the Binary Replication Tutorial on the wiki uses a RedHat directory layout $PGDATA/failover. The main reason I've preferred something in the data directory is that triggering a standby is too catastrophic for me to be comfortable putting it in /tmp. Any random hooligan with a shell account can trigger a standby and break its replication. Putting the unix socket into /tmp only works because the server creates the file as part of startup.
Here that's not possible, because creating the trigger is the signalling mechanism. I don't think there needs to be a CLI interface for putting the alternate possible text into the trigger--that you can ask for 'fast' startup. It's nice to have available as an expert, but it's fine for that to be harder to do. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/30/13 2:04 AM, Fabien COELHO wrote: My guess is the OS. PQfinish or select do/are system calls that present opportunities to switch context. I think that the OS is passing time with other processes on the same host, especially postgres backends, when it is not with the client.

I went looking for other instances of this issue in pgbench results; that's what I got lost in for the last two weeks. It's subtle because the clients normally all end in one very short burst of time, but I have found evidence of PQfinish issues elsewhere. Evidence still seems to match the theory that throttling highlights this only because it spreads out the ending a bit more. Also, it happens to be a lot worse on the Mac I did initial testing with, and I don't have nearly as many Mac pgbench results.

There's a refactoring possible here that seems to make this whole class of problem go away. If I change pgbench so that PQfinish isn't called for any client until *all* of the clients are actually finished with transactions, the whole issue goes away. I'm going to package that hack the right way into its own little change, revisit the throttling code, and then this all should wrap up nicely. I'd like to get this one out of the commitfest so I can move onto looking at something else.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 7/5/13 2:50 AM, Jeff Davis wrote: So, my simple conclusion is that glibc emulation should be about the same as what we're doing now, so there's no reason to avoid it. That means, if posix_fallocate() is present, we should use it, because it's either the same (if emulated in glibc) or significantly faster (if implemented in the kernel).

That's what I'm seeing everywhere too. I'm happy that we've spent enough time chasing after potential issues without finding anything now. Pull out the GUC that was added for default and this is ready to commit.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
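The "use posix_fallocate() if present, otherwise write zeros" decision can be sketched outside of C as well. A Python approximation of that logic (not the patch itself; Python's os.posix_fallocate is only defined on some platforms, which stands in here for the configure-time check, and the 1MB size is just to keep the example fast):

```python
import os
import tempfile

ZERO_CHUNK = b"\0" * 8192

def allocate_file(path, size):
    """Preallocate a file of `size` bytes, preferring posix_fallocate."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        if hasattr(os, "posix_fallocate"):
            # Kernel-accelerated where the kernel supports fallocate;
            # may be emulated with writes on older systems.
            os.posix_fallocate(fd, 0, size)
        else:
            # Fallback: write zeros, the way the existing WAL code does.
            remaining = size
            while remaining > 0:
                remaining -= os.write(fd, ZERO_CHUNK[:min(remaining, len(ZERO_CHUNK))])
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "walfile")
    allocate_file(path, 1024 * 1024)
    print(os.path.getsize(path))   # 1048576 either way; only the speed differs
```

Either branch produces an identical fully-allocated file, which is why the conclusion above is that there's no correctness reason to avoid the call.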
Re: [HACKERS] Block write statistics WIP
On 7/1/13 3:10 AM, Satoshi Nagayasu wrote: Or should we add some pointer, which is associated with the Relation, into BufferDesc? Maybe OID?

That is the other option here, I looked at it but didn't like it. The problem is that at the point when a new page is created, it's not always clear yet what relation it's going to hold. That means that if the buffer page itself is where you look up the relation OID, every code path that manipulates relation IDs will need to worry about maintaining this information. It's possible to do it that way, but I don't know that it's any less work than making all the write paths keep track of it. It would need some extra locking around updating the relation tag in the buffer pages too, and I'd prefer not to add that.

The other thing that I like about the writing side is that I can guarantee the code is correct pretty easily. Yes, it's going to take days worth of modifying writing code. But the compile will force you to fix all of them, and once they're all updated I'd be surprised if something unexpected happened. If instead the data moves onto the buffer page header, we could be chasing bugs similar to the relcache ones forever. Also, as a tie breaker, buffer pages will get bigger and require more locking this way. Making writers tell us the relation doesn't need any of that.

I'm thinking of FlushBuffer() too. How can we count physical writes for each relation in FlushBuffer()? Or any other appropriate place?

SMgrRelation is available there, so tagging the relation should be easy in this case.

BTW, Greg's first patch could not be applied to the latest HEAD. I guess it needs to have some fix, I couldn't clarify it though.

Since this is a WIP patch, what I was looking for was general design approach feedback, with my bit as a simple example. I didn't expect anyone to spend much time doing a formal review of that quick hack. You're welcome to try and play with that code to add things, I'm not very attached to it yet.
I've basically gotten what I wanted to here from this submission; I'm marking it returned with feedback. If anyone else has comments, I'm still open to discussion here too. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 7/1/13 3:44 PM, Robert Haas wrote: Yeah. If the patch isn't going to be a win on RHEL 5, I'd consider that a good reason to scrap it for now and revisit it in 3 years. There are still a LOT of people running RHEL 5, and the win isn't big enough to engineer a more complex solution.

I'm still testing, and expect to have this wrapped up on my side by tomorrow. So much of the runtime here is the file setup/close that a 2:1 difference in the number of writes--which is what happens on the old platforms--is hard to get excited about.

I don't think the complexity to lock out RHEL5 here is that bad even if it turns out to be a good idea. Just add another configure check for fallocate, and on Linux if it's not there don't use posix_fallocate either. Maybe 5 lines of macro code? RHEL5 sure isn't going away anytime soon, but at the same time there won't be that many 9.4 deployments on that version.

I've been digging into the main situation where this feature helps, and it won't be easy to duplicate in a benchmark situation. Using Linux's fallocate works as a hint that the whole 16MB should be allocated at once, and therefore together on disk if feasible. The resulting WAL files should be less prone to fragmentation. That's actually the biggest win of this approach, but I can't easily duplicate the sort of real-world fragmentation I see on live servers here. Given that, I'm leaning toward saying that unless there's a clear regression on older platforms, above the noise floor, this is still the right thing to do.

I fully agree that this needs to be fully automatic--no GUC--before it's worth committing. If we can't figure out the right thing to do now, there's little hope anyone else will in a later tuning expedition.
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 6/30/13 2:01 PM, Jeff Davis wrote: Simple test program attached, which creates two files and fills them: one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then, I just cmp the resulting files (and also ls them, to make sure they are 16MB). This makes platform level testing a lot easier, thanks. Attached is an updated copy of that program with some error checking. If the files it creates already existed, the code didn't notice, and a series of write errors happened. If you set the test up right it's not a problem, but it's better if a bad setup is caught. I wrapped the whole test with a shell script, also attached, which insures the right test sequence and checks. Your C test program compiles and passes on RHEL5/6 here, doesn't on OS X Darwin. No surprises there, there's a long list of platforms that don't support this call at https://www.gnu.org/software/gnulib/manual/html_node/posix_005ffallocate.html and the Mac is on it. Many other platforms I was worried about don't support it too--older FreeBSD, HP-UX 11, Solaris 10, mingw, MSVC--so that cuts down on testing quite a bit. If it runs faster on Linux, that's the main target here, just like the existing effective_io_concurrency fadvise code. The specific thing I was worried about is that this interface might have a stub that doesn't work perfectly in older Linux kernels. After being surprised to find this interface worked on RHEL5 with your test program, I dug into this more. It works there, but it may actually be slower. posix_fallocate is actually implemented by glibc on Linux. Been there since 2.1.94 according to the Linux man pages. But Linux itself didn't add the feature until kernel 2.6.20: http://lwn.net/Articles/226436/ The biggest thing I was worried about--the call might be there in early kernels but with a non-functional implementation--that's not the case. Looking at the diff, before that patch there's no fallocate at all. 
So what happened in earlier kernels, where there was no kernel level fallocate available? According to https://www.redhat.com/archives/fedora-devel-list/2009-April/msg00110.html what glibc does is check for kernel fallocate(), and if it's not there it writes a bunch of zeros to create the file instead. What is actually happening on a RHEL5 system (with kernel 2.6.18) is that calling posix_fallocate does this fallback behavior, where it basically does the same thing the existing WAL clearing code does.

I can even prove that's the case. On RHEL5, if you run strace -o out ./fallocate the main write loop looks like this:

write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192

But when you call posix_fallocate, you still get a bunch of writes, just one byte at the tail of each 4096 byte block:

pwrite(4, "\0", 1, 16769023) = 1
pwrite(4, "\0", 1, 16773119) = 1
pwrite(4, "\0", 1, 16777215) = 1

That's glibc helpfully converting your call to posix_fallocate into small writes, because the OS doesn't provide a better way in that kernel. It's not hard to imagine this being slower than what the WAL code is doing right now. I'm not worried about correctness issues anymore, but my gut paranoia about this not working as expected on older systems was justified. Everyone who thought I was just whining owes me a cookie.

This is what I plan to benchmark specifically next. If the posix_fallocate approach is actually slower than what's done now when it's not getting kernel acceleration, which is the case on RHEL5 era kernels, we might need to make the configure time test more complicated. Whether posix_fallocate is defined isn't sensitive enough; on Linux it may be the case that this only is usable when fallocate() is also there.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#!/bin/sh
rm -f fallocate /tmp/afile /tmp/bfile
gcc fallocate.c -o fallocate
if [ ! -x fallocate ] ; then
    echo "Test program did not compile, posix_fallocate may not be supported"
    exit
fi
./fallocate
if [ -f /tmp/afile ] ; then
    sizea=`du /tmp/afile | cut -f 1`
    sizeb=`du /tmp/bfile | cut -f 1`
    if [ $sizea -eq $sizeb ] ; then
        cmp /tmp/afile /tmp/bfile
        if [ $? -ne 0 ] ; then
            echo "Test failed, files do not match"
        else
            echo "Test passed"
        fi
    else
        echo "Test failed, sizes do not match"
    fi
fi

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

char buf[8192] = {0};

int main() {
    int i;
    int written;
    int fda = open("/tmp/afile", O_CREAT | O_EXCL | O_WRONLY, 0600);
    int fdb = open("/tmp/bfile", O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fda < 0 || fdb < 0) {
        printf("Opening files failed\n");
        return 1;
    }
    for (i = 0; i < 2048; i++) {
        written = write(fda, buf, 8192);
        if (written < 8192) {
            printf("Write to file failed\n");
            return 2;
        }
    }
    posix_fallocate(fdb, 0, 16 * 1024 * 1024);
    close(fda);
    close(fdb);
    return 0;
}
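The glibc fallback visible in the strace output above (a single zero byte pwritten at the tail of each block) is easy to imitate. A Python approximation of that emulation (my sketch; the 4096-byte block size is inferred from the pwrite offsets, and real glibc also reads each block first to avoid clobbering existing data):

```python
import os
import tempfile

BLOCK = 4096

def emulated_fallocate(fd, size):
    """Imitate the glibc pre-2.6.20-kernel fallback: pwrite one zero byte
    at the last offset of every block, forcing each block to be allocated."""
    for end in range(BLOCK - 1, size, BLOCK):
        os.pwrite(fd, b"\0", end)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "emulated")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    emulated_fallocate(fd, 16 * 1024 * 1024)
    os.close(fd)
    print(os.path.getsize(path))   # 16777216: last pwrite lands at offset 16777215
```

Note the last loop offset is 16777215, matching the final pwrite in the strace capture, which is how you can tell the RHEL5 behavior really is this emulation path.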
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/28/13 10:00 PM, Jon Nelson wrote: A note: The attached test program uses *fsync* instead of *fdatasync* after calling fallocate (or writing out 16MB of zeroes), per an earlier suggestion.

I tried this out on the RHEL5 platform I'm worried about now. There's something weird about the test program there. If I run it once it shows posix_fallocate running much faster:

without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

The problem is that I'm seeing the gap between the two get smaller the more iterations I run, which makes me wonder if the test is completely fair:

without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s
without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s
without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s

You didn't show any output from the latest program on your system, so I'm not sure how it behaved for you here.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 6/30/13 9:28 PM, Jon Nelson wrote: The performance of the latter (new) test sometimes seems to perform worse and sometimes seems to perform better (usually worse) than either of the other two. In all cases, posix_fallocate performs better, but I don't have a sufficiently old kernel to test with.

This updated test program looks reliable now. The numbers are very tight when I'd expect them to be, and there's nowhere with the huge differences I saw in the earlier test program. Here's results from a few sets of popular older platforms:

RHEL5, ext3
method: classic. 10 open/close iterations, 10 rewrite in 22.6949s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 23.0113s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 22.4921s
method: classic. 10 open/close iterations, 10 rewrite in 23.2808s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.4736s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.9871s
method: classic. 10 open/close iterations, 10 rewrite in 22.4812s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.2393s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.6940s

RHEL6, ext4
method: classic. 10 open/close iterations, 10 rewrite in 56.0483s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 61.5092s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 53.8569s
method: classic. 10 open/close iterations, 10 rewrite in 57.0361s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 55.9840s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 64.9437s

RHEL6, ext3
method: classic. 10 open/close iterations, 10 rewrite in 14.4080s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.1395s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 16.9657s
method: classic. 10 open/close iterations, 10 rewrite in 15.2825s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.5315s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 14.8115s

The win for posix_fallocate is there in most cases, but it's pretty hard to see on these older systems. That could be OK. As long as the difference is no more than noise, and that is the case, this could be good enough to commit. If there are significantly better results on the new platforms, the old ones just need to not get worse.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
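Whether those deltas are above the noise floor can be checked directly from the numbers posted. A quick calculation over the RHEL5/ext3 runs above, using Python's statistics module:

```python
from statistics import mean, pstdev

# RHEL5/ext3 rewrite times (seconds) from the three runs of each method above:
classic = [22.6949, 23.2808, 22.4812]
fallocate = [23.0113, 22.4736, 22.2393]
glibc = [22.4921, 23.9871, 23.6940]

for name, runs in [("classic", classic),
                   ("posix_fallocate", fallocate),
                   ("glibc emulation", glibc)]:
    print(f"{name}: mean {mean(runs):.2f}s, stddev {pstdev(runs):.2f}s")

# The means differ by roughly a quarter second, less than one run-to-run
# standard deviation, i.e. no difference distinguishable from noise here.
```

That agrees with the eyeball conclusion: on this old platform the methods are statistically indistinguishable, which is the "no regression" bar being set for commit.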
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/22/13 12:54 PM, Fabien COELHO wrote: After some poking around, and pursuing various red herrings, I resorted to measure the delay for calling PQfinish(), which is really the only special thing going around at the end of pgbench run...

This wasn't what I was seeing, but it's related. I've proved to myself the throttle change isn't responsible for the weird stuff I'm seeing now. I'd like to rearrange when PQfinish happens now based on what I'm seeing, but that's not related to this review.

I duplicated the PQfinish problem you found too. On my Linux system, calls to PQfinish are normally about 36 us long. They will sometimes get lost for 15ms before they return. That's a different problem though, because the ones I'm seeing on my Mac are sometimes 150ms. PQfinish never takes quite that long. PQfinish doesn't pause for a long time on this platform. But it does *something* that causes socket select() polling to stutter. I have instrumented everything interesting in this part of the pgbench code, and here is the problem event:

1372531862.062236 select with no timeout sleeping=0
1372531862.109111 select returned 6 sockets latency 46875 us

Here select() is called with 0 sleeping processes, 11 that are done, and 14 that are running. The running ones have all sent SELECT statements to the server, and they are waiting for a response. Some of them received some data from the server, but they haven't gotten the entire response back. (The PQfinish calls could be involved in how that happened) With that setup, select runs for 47 *ms* before it gets the next byte to a client. During that time 6 clients get responses back, but it stays stuck in there for a long time anyway.

Why? I don't know exactly why, but I am sure that pgbench isn't doing anything weird. It's either libpq acting funny, or the OS. When pgbench is waiting on a set of sockets, and none of them are returning anything, that's interesting. But there's nothing pgbench can do about it.
The cause/effect here is that the randomness added by the throttling code spreads out the point when all the connections end a bit. There are more times during which you might have 20 connections finished while 5 still run.

I need to catch up with revisions done to this feature since I started instrumenting my copy more heavily. I hope I can get this ready for commit by Monday. I've certainly beaten on the feature for long enough now.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Move unused buffers to freelist
On 6/28/13 8:50 AM, Robert Haas wrote: On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com wrote: 4. Separate processes for writing dirty buffers and moving buffers to freelist I think this part might be best pushed to a separate patch, although I agree we probably need it.

This might be necessary eventually, but it's going to make things more complicated. And I don't think it's a blocker for creating something useful. The two most common workloads are:

1) Lots of low usage count data, typically data that is updated sparsely across a larger database. These are helped by a process that writes dirty buffers in the background. These benefit from the current background writer. Kevin's system he was just mentioning again is the best example of this type that there's public data on.

2) Lots of high usage count data, because there are large hotspots in things like index blocks. Most writes happen at checkpoint time, because the background writer won't touch them. Because there are only a small number of re-usable pages, the clock sweep goes around very fast looking for them. This is the type of workload that should benefit from putting buffers into the free list. pgbench provides a simple example of this type, which is why Amit's tests using it have been useful.

If you had a process that tried to handle both background writes and freelist management, I suspect one path would be hot and the other almost idle in each type of system. I don't expect that splitting those into two separate processes would buy a lot of value, so that can easily be pushed to a later patch.

The background writer would just have a high and a low watermark. When the number of buffers on the freelist drops below the low watermark, the allocating backend sets the latch and bgwriter wakes up and begins adding buffers to the freelist. When the number of buffers on the free list reaches the high watermark, the background writer goes back to sleep.
This will work fine for all of the common workloads. The main challenge is keeping the buffer allocation counting from turning into a hotspot. Busy systems now can easily hit 100K buffer allocations/second. I'm not too worried about it because those allocations are making the free list lock a hotspot right now. One of the consistently controversial parts of the current background writer is how it tries to loop over the buffer cache every 2 minutes, regardless of activity level. The idea there was that on bursty workloads, buffers would be cleaned during idle periods with that mechanism. Part of why that's in there is to deal with the relatively long pause between background writer runs. This refactoring idea will make that hard to keep around. I think this is OK though. Switching to a latch based design should eliminate the bgwriter_delay, which means you won't have this worst case of a 200ms stall while heavy activity is incoming. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
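The high/low watermark scheme described above can be sketched as a simple loop. This is my own toy illustration of the proposed design, not actual bgwriter code; the latch is modeled as a plain flag and the clock sweep as an incrementing counter:

```python
from collections import deque

LOW_WATERMARK = 4
HIGH_WATERMARK = 16

class FreelistManager:
    """Toy model of the proposed latch-driven freelist refill."""
    def __init__(self):
        self.freelist = deque(range(HIGH_WATERMARK))  # prefilled buffer ids
        self.next_victim = HIGH_WATERMARK             # stand-in for the clock sweep
        self.latch_set = False

    def alloc(self):
        """Backend side: take a buffer, waking the bgwriter when low."""
        buf = self.freelist.popleft()
        if len(self.freelist) < LOW_WATERMARK:
            self.latch_set = True                     # set the bgwriter's latch
        return buf

    def bgwriter_cycle(self):
        """Bgwriter side: refill to the high watermark, then sleep again."""
        if not self.latch_set:
            return                                    # latch not set: keep sleeping
        while len(self.freelist) < HIGH_WATERMARK:
            self.freelist.append(self.next_victim)    # clock sweep would supply these
            self.next_victim += 1
        self.latch_set = False

m = FreelistManager()
for _ in range(14):            # burn the freelist down below the low watermark
    m.alloc()
assert m.latch_set             # a backend signalled the bgwriter on the way down
m.bgwriter_cycle()
print(len(m.freelist))         # back at the high watermark: 16
```

Because refills only happen between the watermarks, the bgwriter sleeps indefinitely under light load instead of polling on a bgwriter_delay timer, which is the latency win being described.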
Re: [HACKERS] Kudos for Reviewers -- straw poll
On 6/25/13 2:44 PM, Josh Berkus wrote: On the other hand, I will point out that we currently have a shortage of reviewers, and we do NOT have a shortage of patch submitters.

That's because reviewing is harder than initial development. The only people who think otherwise are developers who don't do enough review.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
I'm still getting the same sort of pauses waiting for input with your v11. This is a pretty frustrating problem; I've spent about two days so far trying to narrow down how it happens. I've attached the test program I'm using. It seems related to my picking a throttled rate that's close to (but below) the maximum possible on your system. I'm using 10,000 on a system that can do about 16,000 TPS when running pgbench in debug mode. This problem is 100% reproducible here; happens every time. This is a laptop running Mac OS X. It's possible the problem is specific to that platform. I'm doing all my tests with the database itself setup for development, with debug and assertions on. The lag spikes seem smaller without assertions on, but they are still there. Here's a sample: transaction type: SELECT only scaling factor: 10 query mode: simple number of clients: 25 number of threads: 1 duration: 30 s number of transactions actually processed: 301921 average transaction lag: 1.133 ms (max 137.683 ms) tps = 10011.527543 (including connections establishing) tps = 10027.834189 (excluding connections establishing) And those slow ones are all at the end of the latency log file, as shown in column 3 here: 22 11953 3369 0 1371578126 954881 23 11926 3370 0 1371578126 954918 3 12238 30310 0 1371578126 984634 7 12205 30350 0 1371578126 984742 8 12207 30359 0 1371578126 984792 11 12176 30325 0 1371578126 984837 13 12074 30292 0 1371578126 984882 0 12288 175452 0 1371578127 126340 9 12194 171948 0 1371578127 126421 12 12139 171915 0 1371578127 126466 24 11876 175657 0 1371578127 126507 Note that there are two long pauses here, which happens maybe half of the time. There's a 27 ms pause and then a 145 ms one. The exact sequence for when the pauses show up is: -All remaining clients have sent their SELECT statement and are waiting for a response. They are not sleeping, they're waiting for the server to return a query result. 
-A client receives part of the data it wants, but there is still data left. This is the path through pgbench.c where the if (PQisBusy(st->con)) test is true after receiving some information. I hacked up some logging that distinguishes those as a "client %d partial receive" to make this visible. -A select() call is made with no timeout, so it can only be satisfied by more data being returned. -Around ~100ms (can be less, can be more) goes by before that select() returns more data to the client, where normally it would happen in ~2ms. You were concerned about a possible bug in the timeout code. If you hack up the select statement to show some state information, the setup for the pauses at the end always looks like this: calling select min_usec=9223372036854775807 sleeping=0 When no one is sleeping, the timeout becomes infinite, so only returning data will break it. This is intended behavior though. This exact same sequence and select() call parameters happen in the pgbench code every time at the end of a run. But clients without the throttling patch applied never seem to get into the state where they pause for so long, waiting for one of the active clients to receive the next bit of result. I don't think the st->listen related code has anything to do with this either. That flag is only used to track when clients have completed sending their first query over to the server. After reaching that point once, they should be listening for results each time they exit the doCustom() code. st->listen goes to 1 very soon after startup and then it stays there, and that logic seems fine too.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attached test script:

#!/bin/bash
make
make install
rm pgbench_log* nohup.out
#nohup pgbench -c 25 -T 30 -l -d -f select.sql -s 10 pgbench
nohup pgbench -c 25 -T 30 -l -d -S -R 1 pgbench
tail -n 50 pgbench_log*
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/12/13 3:19 AM, Fabien COELHO wrote: If you are still worried: if you run the very same command without throttling and measure the same latency, does the same thing happen at the end? My guess is that it should be yes. If it is no, I'll try out pgbench-tools. It looks like it happens rarely for one client without the rate limit, but that increases to every time for multiple clients with limiting in place. pgbench-tools just graphs the output from the latency log. Here's a setup that runs the test I'm doing: $ createdb pgbench $ pgbench -i -s 10 pgbench $ pgbench -S -c 25 -T 30 -l pgbench $ tail -n 40 pgbench_log* Sometimes there are no slow entries, but I've seen this once so far: 0 21822 1801 0 1371217462 945264 1 21483 1796 0 1371217462 945300 8 20891 1931 0 1371217462 945335 14 20520 2084 0 1371217462 945374 15 20517 1991 0 1371217462 945410 16 20393 1928 0 1371217462 945444 17 20183 2000 0 1371217462 945479 18 20277 2209 0 1371217462 945514 23 20316 2114 0 1371217462 945549 22 20267 250128 0 1371217463 193656 The third column is the latency for that transaction. Notice how it's a steady ~2000 us except for the very last transaction, which takes 250,128 us. That's the weird thing; these short SELECT statements should never take that long. It suggests there's something weird happening with how the client exits, probably that its latency number is being computed after more work than it should. Here's what a rate limited run looks like for me.
Note that I'm still using the version I re-submitted since that's where I ran into this issue, I haven't merged your changes to split the rate among each client here--which means this is 400 TPS per client == 10,000 TPS total: $ pgbench -S -c 25 -T 30 -R 400 -l pgbench $ tail -n 40 pgbench_log 7 12049 2070 0 1371217859 195994 22 12064 2228 0 1371217859 196115 18 11957 1570 0 1371217859 196243 23 12130 989 0 1371217859 196374 8 11922 1598 0 1371217859 196646 11 12229 4833 0 1371217859 196702 21 11981 1943 0 1371217859 196754 20 11930 1026 0 1371217859 196799 14 11990 13119 0 1371217859 208014 ^^^ fast section vvv delayed section 1 11982 91926 0 1371217859 287862 2 12033 116601 0 1371217859 308644 6 12195 115957 0 1371217859 308735 17 12130 114375 0 1371217859 308776 0 12026 115507 0 1371217859 308822 3 11948 118228 0 1371217859 308859 4 12061 113484 0 1371217859 308897 5 12110 113586 0 1371217859 308933 9 12032 117744 0 1371217859 308969 10 12045 114626 0 1371217859 308989 12 11953 113372 0 1371217859 309030 13 11883 114405 0 1371217859 309066 15 12018 116069 0 1371217859 309101 16 11890 115727 0 1371217859 309137 19 12140 114006 0 1371217859 309177 24 11884 115782 0 1371217859 309212 There's almost 90,000 usec of latency showing up between epoch time 1371217859.208014 and 1371217859.287862 here. What's weird about it is that the longer the test runs, the larger the gap is. If collecting the latency data itself caused the problem, that would make sense, so maybe this is related to flushing that out to disk. If you want to look just at the latency numbers without the other columns in the way you can use: cat pgbench_log.* | awk {'print $3'} That is how I was evaluating the smoothness of the rate limit, by graphing those latency values. pgbench-tools takes those and a derived TPS/s number and plots them, which made it easier for me to spot this weirdness.
But I've already moved on to analyzing the raw latency data instead; I can see the issue without the graph once I've duplicated the conditions. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 6/14/13 1:06 PM, Jeff Davis wrote: Why have a GUC here at all? Perhaps this was already discussed, and I missed it? Is it just for testing purposes, or did you intend for it to be in the final version? You have guessed correctly! I suggested it stay in there only to make review benchmarking easier. I started looking at this patch and it looks like we are getting a consensus that it's the right approach. Microbenchmarks appear to show a benefit, and (thanks to Noah's comment) it seems like the change is safe. Are there any remaining questions or objections? I'm planning to duplicate Jon's test program on a few machines here, and then see if that turns into a useful latency improvement for clients. I'm trying to get this pgbench rate limit stuff working first though, because one of the tests I had in mind for WAL creation overhead would benefit from it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
I don't have this resolved yet, but I think I've identified the cause. Updating here mainly so Fabien doesn't duplicate my work trying to track this down. I'm going to keep banging at this until it's resolved now that I got this far. Here's a slow transaction: 1371226017.568515 client 1 executing \set naccounts 10 * :scale 1371226017.568537 client 1 throttling 6191 us 1371226017.747858 client 1 executing \setrandom aid 1 :naccounts 1371226017.747872 client 1 sending SELECT abalance FROM pgbench_accounts WHERE aid = 268721; 1371226017.789816 client 1 receiving That confirms it is getting stuck at the throttling step. Looks like the code pauses there because it's trying to overload the sleeping state that was already in pgbench, but handles it in a special way inside of doCustom(), and that doesn't always work. The problem is that pgbench doesn't always stay inside doCustom when a client sleeps. It exits there to poll for incoming messages from the other clients, via select() on a shared socket. It's not safe to assume doCustom will be running regularly; that's only true if clients keep returning messages. So as long as other clients keep banging on the shared socket, doCustom is called regularly, and everything works as expected. But at the end of the test run that happens less often, and that's when the problem shows up. pgbench already has a \sleep command, and the way that delay is handled happens inside threadRun() instead. The pausing of the rate limit throttle needs to operate in the same place. I have to redo a few things to confirm this actually fixes the issue, as well as look at Fabien's later updates to this since I wandered off debugging. I'm sure it's in the area of code I'm poking at now though.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/14/13 3:50 PM, Fabien COELHO wrote: I think that the weirdness really comes from the way transaction times are measured, their interactions with throttling, and latent bugs in the code. Measurement times, no; interactions with throttling, no. If it was either of those I'd have finished this off days ago. Latent bugs, possibly. We may discover there's nothing wrong with your code at the end here, that it just makes hitting this bug more likely. Unfortunately today is the day *some* bug is popping up, and I want to get it squashed before I'll be happy. The lag is actually happening during a kernel call that isn't working as expected. I'm not sure whether this bug was there all along if \sleep was used, or if it's specific to the throttle sleep. Also, flag st->listen is set to 1 but *never* set back to 0... I noticed that st->listen was weird too, and that's on my short list of suspicious things I haven't figured out yet. I added a bunch more logging as pgbench steps through its work to track down where it gets stuck. Until the end all transactions look like this: 1371238832.084783 client 10 throttle lag 2 us 1371238832.084783 client 10 executing \setrandom aid 1 :naccounts 1371238832.084803 client 10 sending SELECT abalance FROM pgbench_accounts WHERE aid = 753099; 1371238832.084840 calling select 1371238832.086539 client 10 receiving 1371238832.086539 client 10 finished All clients who hit lag spikes at the end are going through this sequence instead: 1371238832.085912 client 13 throttle lag 790 us 1371238832.085912 client 13 executing \setrandom aid 1 :naccounts 1371238832.085931 client 13 sending SELECT abalance FROM pgbench_accounts WHERE aid = 564894; 1371238832.086592 client 13 receiving 1371238832.086662 calling select 1371238832.235543 client 13 receiving 1371238832.235543 client 13 finished Note the "calling select" here that wasn't in the normal length transaction before it.
The client is receiving something here, but rather than it finishing the transaction it falls through and ends up at the select() system call outside of doCustom. All of the clients that are sleeping when the system slips into one of these long select() calls are getting stuck behind it. I'm not 100% sure, but I think this only happens when all remaining clients are sleeping. Here's another one, it hits the receive that doesn't finish the transaction earlier (1371238832.086587) but then falls into the same select() call at 1371238832.086662: 1371238832.085884 client 12 throttle lag 799 us 1371238832.085884 client 12 executing \setrandom aid 1 :naccounts 1371238832.085903 client 12 sending SELECT abalance FROM pgbench_accounts WHERE aid = 299080; 1371238832.086587 client 12 receiving 1371238832.086662 calling select 1371238832.231032 client 12 receiving 1371238832.231032 client 12 finished Investigation is still going here... -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/10/13 6:02 PM, Fabien COELHO wrote: - the tps is global, with a mutex to share the global stochastic process - there is an adaptation for the fork emulation - I do not know whether this works with Win32 pthread stuff. Instead of this complexity, can we just split the TPS input per client? That's all I was thinking of here, not adding a new set of threading issues. If 10,000 TPS is requested and there are 10 clients, just set the delay so that each of them targets 1,000 TPS. I'm guessing it's more accurate to have them all communicate as you've done here, but it seems like a whole class of new bugs and potential bottlenecks could come out of that. Whenever someone touches the threading model for pgbench it usually gives a stack of build farm headaches. Better to avoid those unless there's really a compelling reason to go through that. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 6/11/13 12:22 PM, Merlin Moncure wrote: Personally I think this patch should go in regardless -- the concerns made IMNSHO are specious. That's nice, but we have this process for validating whether features go in or not that relies on review instead of opinions. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/11/13 4:11 PM, Fabien COELHO wrote: - ISTM that the thread start time should be initialized at the beginning of threadRun instead of in the loop *before* thread creation, otherwise the thread creation delays are incorporated in the performance measure, but ISTM that the point of pgbench is not to measure thread creation performance... I noticed that, but it seemed a pretty minor issue. Did you look at the giant latency spikes at the end of the test run I submitted the graph for? I wanted to nail down what was causing those before worrying about the startup timing. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/1/13 5:00 AM, Fabien COELHO wrote: Question 1: should it report the maximum lag encountered? I haven't found the lag measurement to be very useful yet, outside of debugging the feature itself. Accordingly I don't see a reason to add even more statistics about the number outside of testing the code. I'm seeing some weird lag problems right now that this will be useful for though; more on that in a few places below. Question 2: the next step would be to have the current lag shown under option --progress, but that would mean having a combined --throttle --progress patch submission, or maybe dependencies between patches. This is getting too far ahead. Let's get the throttle part nailed down before introducing even more moving parts into this. I've attached an updated patch that changes a few things around already. I'm not done with this yet and it needs some more review before commit, but it's not too far away from being ready. This feature works quite well. On a system that will run at 25K TPS without any limit, I did a run with 25 clients and a rate of 400/second, aiming at 10,000 TPS, and that's what I got: number of clients: 25 number of threads: 1 duration: 60 s number of transactions actually processed: 599620 average transaction lag: 0.307 ms tps = 9954.779317 (including connections establishing) tps = 9964.947522 (excluding connections establishing) I never thought of implementing the throttle like this before, but it seems to work out well so far. Check out tps.png to see the smoothness of the TPS curve (the graphs came out of pgbench-tools). There's a little more play outside of the target than ideal for this case. Maybe it's worth tightening the Poisson curve a bit around its center? The main implementation issue I haven't looked into yet is why things can get weird at the end of the run. See the latency.png graph attached and you can see what I mean. I didn't like the naming on this option or all of the ways you could specify the delay.
None of those really added anything, since you can get every other behavior by specifying a non-integer TPS. And using the word throttle inside the code is fine, but I didn't like exposing that implementation detail more than it had to be. What I did instead was think of this as a transaction rate target, which makes the help a whole lot simpler: -R SPEC, --rate SPEC target rate per client in transactions per second Made the documentation easier to write too. I'm not quite done with that yet, the docs wording in this updated patch could still be better. I personally would like this better if --rate specified a *total* rate across all clients. However, there are examples of both types of settings in the program already, so there's no one precedent for which is right here. -t is per-client and now -R is too; I'd prefer it to be like -T instead. It's not that important though, and the code is cleaner as it's written right now. Maybe this is better; I'm not sure. I did some basic error handling checks on this and they seemed good, the program rejects target rates of <= 0. On the topic of this weird latency spike issue, I did see that show up in some of the results too. Here's one where I tried to specify a rate higher than the system can actually handle, 8 TPS total on a SELECT-only test: $ pgbench -S -T 30 -c 8 -j 4 -R1tps pgbench starting vacuum...end. transaction type: SELECT only scaling factor: 100 query mode: simple number of clients: 8 number of threads: 4 duration: 30 s number of transactions actually processed: 761779 average transaction lag: 10298.380 ms tps = 25392.312544 (including connections establishing) tps = 25397.294583 (excluding connections establishing) It was actually limited by the capabilities of the hardware, 25K TPS. 10298 ms of lag per transaction can't be right though.
Some general patch submission suggestions for you as a new contributor: -When re-submitting something with improvements, it's a good idea to add a version number to the patch so reviewers can tell them apart easily. But there is no reason to change the subject line of the e-mail each time. I followed that standard here. If you updated this again I would name the file pgbench-throttle-v9.patch but keep the same e-mail subject. -There were some extra carriage return characters in your last submission. Wasn't a problem this time, but if you can get rid of those that makes for a better patch. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com attachment: tps.png attachment: latency.png diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 8c202bf..799dfcd 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -137,6 +137,12 @@ int unlogged_tables = 0; double sample_rate = 0.0; /* + * When clients
Re: [HACKERS] Redesigning checkpoint_segments
On 6/7/13 2:43 PM, Robert Haas wrote: name. What I would like to see is a single number here in memory units that replaces both checkpoint_segments and wal_keep_segments. This isn't really making sense to me. I don't think we should assume that someone who wants to keep WAL around for replication also wants to wait longer between checkpoints. Those are two quite different things. It's been years since I saw anyone actually using checkpoint_segments as that sort of limit. I see a lot of sites pushing the segments limit up and then using checkpoint_timeout carefully. It's pretty natural to say "I don't want to go more than X minutes between checkpoints." The case for wanting to say "I don't want to go more than X MB between checkpoints" instead, motivated by not wanting too much activity to queue between them, I'm just not seeing demand for that now. The main reason I do see people paying attention to checkpoint_segments still is to try and put a useful bound on WAL disk space usage. That's the use case I think overlaps with wal_keep_segments such that you might replace both of them. I think we really only need one control that limits how much WAL space is expected inside of pg_xlog, and it should be easy and obvious how to set it. The more I look at this checkpoint_segments patch, the more I wonder why it's worth even bothering with anything but a disk space control here. checkpoint_segments is turning into an internal implementation detail most sites I see wouldn't miss at all. Rather than put work into autotuning it, I'd be happy to eliminate checkpoint_segments altogether, in favor of a WAL disk space limit. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Cost limited statements RFC
On 6/8/13 4:43 PM, Jeff Janes wrote: Also, in all the anecdotes I've been hearing about autovacuum causing problems from too much IO, in which people can identify the specific problem, it has always been the write pressure, not the read, that caused the problem. Should the default be to have the read limit be inactive and rely on the dirty-limit to do the throttling? That would be bad; I have to carefully constrain both of them on systems that are short on I/O throughput. There are all sorts of cases where cleanup of a large and badly cached relation will hit the read limit right now. I suspect the reason we don't see as many complaints is that a lot more systems can handle 7.8MB/s of random reads than there are ones that can do 3.9MB/s of random writes. If we removed that read limit, a lot more complaints would start rolling in about the read side. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Cost limited statements RFC
On 6/8/13 5:17 PM, Jeff Janes wrote: But my gut feeling is that if autovacuum is trying to read faster than the hardware will support, it will just automatically get throttled, by inherent IO waits, at a level which can be comfortably supported. And this will cause minimal interference with other processes. If this were true all the time autovacuum tuning would be a lot easier. You can easily make a whole server unresponsive by letting loose one rogue process doing a lot of reads. Right now this isn't a problem for autovacuum because any one process running at 7.8MB/s is usually not a big deal. It doesn't take too much in the way of read-ahead logic and throughput to satisfy that. But I've seen people try and push the read rate upwards who didn't get very far beyond that before it was way too obtrusive. I could collect some data from troubled servers to see how high I can push the read rate before they suffer. Maybe there's a case there for increasing the default read rate because the write one is a good enough secondary limiter. I'd be surprised if we could get away with more than a 2 or 3X increase though, and the idea of going unlimited is really scary. It took me a year of before/after data collection before I was confident that it's OK to run unrestricted in all cache hit situations. Why is there so much random IO? Do your systems have autovacuum_vacuum_scale_factor set far below the default? Unless they do, most of the IO (both read and write) should be sequential. Insert one process doing sequential reads into a stream of other activity and you can easily get random I/O against the disks out of the mix. You don't necessarily need the other activity to be random to get that. N sequential readers eventually acts like N random readers for high enough values of N. On busy servers, autovacuum is normally competing against multiple random I/O processes though. 
Also, the database's theoretical model that block number correlates directly with location on disk can break down. I haven't put a hard number to measuring it directly, but systems with vacuum problems seem more likely to have noticeable filesystem level fragmentation. I've been thinking about collecting data from a few systems with filefrag to see if I'm right about that. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cost limited statements RFC
On 6/8/13 5:20 PM, Kevin Grittner wrote: I'll believe that all of that is true, but I think there's another reason. With multiple layers of write cache (PostgreSQL shared_buffers, OS cache, controller or SSD cache) I think there's a tendency for an avalanche effect. Dirty pages stick to cache at each level like snow on the side of a mountain, accumulating over time. When it finally breaks loose at the top, it causes more from lower levels to break free as it passes. I explained this once as being like a tower of leaky buckets where each one drips into the one below. Buckets drain out of the bottom at one rate, and new water comes in at another. You can add water much faster than it drains, for a while. But once one of the buckets fills you've got a serious mess. I think the sudden onset of problems from write saturation contributes to the complaints. It's also important to realize that vacuum itself doesn't even do the writes in many cases. If you have a large shared_buffers value, it wanders off making things dirty without any concern for what's going to disk. When the next checkpoint shows up is when pressure increases at the top. The way this discussion has wandered off has nicely confirmed I was right to try and avoid going into this area again :( -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Cost limited statements RFC
On 6/7/13 10:14 AM, Robert Haas wrote: If the page hit limit goes away, the user with a single core server who is used to having autovacuum only pillage shared_buffers at 78MB/s might complain if it became unbounded. Except that it shouldn't become unbounded, because of the ring-buffer stuff. Vacuum can pillage the OS cache, but the degree to which a scan of a single relation can pillage shared_buffers should be sharply limited. I wasn't talking about disruption of the data that's in the buffer cache. The only time the scenario I was describing plays out is when the data is already in shared_buffers. The concern is damage done to the CPU's data cache by this activity. Right now you can't even reach 100MB/s of damage to your CPU caches in an autovacuum process. Ripping out the page hit cost will eliminate that cap. Autovacuum could introduce gigabytes per second of memory-to-L1-cache transfers. That's what all my details about memory bandwidth were trying to put into context. I don't think it really matters much because the new bottleneck will be the processing speed of a single core, and that's still a decent cap to most people now. I think you're missing my point here, which is that we shouldn't have any such thing as a cost limit. We should limit reads and writes *completely separately*. IMHO, there should be a limit on reading, and a limit on dirtying data, and those two limits should not be tied to any common underlying cost limit. If they are, they will not actually enforce precisely the set limit, but some other composite limit which will just be weird. I see the distinction you're making now, don't need a mock up to follow you. The main challenge of moving this way is that read and write rates never end up being completely disconnected from one another. A read will only cost some fraction of what a write does, but they shouldn't be completely independent.
Just because I'm comfortable doing 10MB/s of reads and 5MB/s of writes, I may not be happy with the server doing 9MB/s read + 5MB/s write = 14MB/s of I/O in an implementation where they float independently. It's certainly possible to disconnect the two like that, and people will be able to work something out anyway. I personally would prefer not to lose some ability to specify how expensive read and write operations should be considered in relation to one another. Related aside: shared_buffers is becoming a decreasing fraction of total RAM each release, because it's stuck with this rough 8GB limit right now. As the OS cache becomes a larger multiple of the shared_buffers size, the expense of the average read is dropping. Reads are getting more likely to be in the OS cache but not shared_buffers, which makes the average cost of any one read shrink. But writes are as expensive as ever. Real-world tunings I'm doing now reflect that; typically in servers with 128GB of RAM, they have gone this far in that direction: vacuum_cost_page_hit = 0 vacuum_cost_page_miss = 2 vacuum_cost_page_dirty = 20 That's 4MB/s of writes, 40MB/s of reads, or some blended mix that considers writes 10X as expensive as reads. The blend is a feature. The logic here is starting to remind me of how the random_page_cost default has been justified. Real-world random reads are actually close to 50X as expensive as sequential ones. But the average read from the executor's perspective is effectively discounted by OS cache hits, so 4.0 is still working OK. In large memory servers, random reads keep getting cheaper via better OS cache hit odds, and it's increasingly becoming something important to tune for. Some of this mess would go away if we could crack the shared_buffers scaling issues for 9.4. There's finally enough dedicated hardware around to see the issue and work on it, but I haven't gotten a clear picture of any reproducible test workload that gets slower with large buffer cache sizes.
If anyone has a public test case that gets slower when shared_buffers goes from 8GB to 16GB, please let me know; I've got two systems set up that I could chase that down on now. -- Greg Smith, 2ndQuadrant US, g...@2ndquadrant.com, Baltimore, MD. PostgreSQL Training, Services, and 24x7 Support: www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
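As a cross-check on the hand-tuned values quoted in that message, the cost model arithmetic can be sketched as follows. This assumes the stock 8kB block size, the default vacuum_cost_limit = 200 and vacuum_cost_delay = 20ms pair, and the quoted tuning of hit = 0, miss = 2, dirty = 20:

```python
# Translate vacuum_cost_* settings into the effective MB/s caps they
# imply for an autovacuum worker. Assumes 8kB blocks and the default
# vacuum_cost_limit/vacuum_cost_delay; page costs are the hand-tuned
# values quoted above (hit = 0, miss = 2, dirty = 20).
BLOCK_KB = 8

def rate_caps_mb_s(cost_limit=200, cost_delay_ms=20, page_miss=2, page_dirty=20):
    """Return (read_cap, dirty_cap) in MB/s."""
    credits_per_sec = cost_limit * (1000.0 / cost_delay_ms)
    read_pages = credits_per_sec / page_miss    # every page read from OS/disk
    dirty_pages = credits_per_sec / page_dirty  # every page dirtied
    return read_pages * BLOCK_KB / 1024.0, dirty_pages * BLOCK_KB / 1024.0

read_cap, dirty_cap = rate_caps_mb_s()
# Roughly 39MB/s of reads or 3.9MB/s of dirtying: the "40MB/s read,
# 4MB/s write, writes 10X as expensive" blend described above.
```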
Re: [HACKERS] Cost limited statements RFC
On 6/7/13 12:42 PM, Robert Haas wrote: GUCs in terms of units that are meaningful to the user. One could have something like io_rate_limit (measured in MB/s), io_read_multiplier = 1.0, io_dirty_multiplier = 1.0, and I think that would be reasonably clear. There's one other way to frame this:

io_read_limit = 7.8MB/s      # Maximum read rate
io_dirty_multiplier = 2.0    # How expensive writes are considered relative to reads

That still gives all of the behavior I'd like to preserve, as well as not changing the default I/O pattern. I don't think it's too complicated to ask people to grapple with that pair.
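A quick sketch of how that pair of knobs could combine. The accounting rule here is my assumption about the proposal, not settled behavior:

```python
# Charge reads at face value and dirtied pages at a multiplier, both
# against a single read-rate budget. Parameter names follow the
# proposal above; the combination rule is an assumption.

def io_allowed(read_mb_s, dirty_mb_s, io_read_limit=7.8, io_dirty_multiplier=2.0):
    """True if the blended charge fits within the read-rate budget."""
    return read_mb_s + io_dirty_multiplier * dirty_mb_s <= io_read_limit

# An all-read workload can do the full 7.8MB/s, an all-write workload
# is capped at 7.8 / 2.0 = 3.9MB/s, and mixes land in between, which
# preserves the relative expense of reads and writes.
```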
Re: [HACKERS] Redesigning checkpoint_segments
On 6/6/13 4:42 AM, Joshua D. Drake wrote: On 6/6/2013 1:11 AM, Heikki Linnakangas wrote: (I'm sure you know this, but:) If you perform a checkpoint as fast and short as possible, the sudden burst of writes and fsyncs will overwhelm the I/O subsystem, and slow down queries. That's what we saw before spread checkpoints: when a checkpoint happens, the response times of queries jumped up. That isn't quite right. Previously we had lock issues as well and checkpoints would take considerable time to complete. What I am talking about is that the background writer (and wal writer where applicable) have done all the work before a checkpoint is even called. That is not possible, and if you look deeper at a lot of workloads you'll eventually see why. I'd recommend grabbing snapshots of pg_buffercache output from a lot of different types of servers and seeing what the usage count distribution looks like. That's what I did in order to create all of the behaviors the current background writer code caters to. Attached is a small spreadsheet that shows the main two extremes here, from one of my old talks. The "effective buffer cache" system is full of usage count 5 pages, while the "minimally effective buffer cache" one is all usage count 1 or 0. We don't have redundant systems here; we have two that aim at distinctly different workloads. That's one reason why splitting them apart ended up being necessary to move forward; they really don't overlap very much on some servers. Sampling a few servers that way was where the controversial idea of scanning the whole buffer pool every few minutes, even without activity, came from too. I found a bursty real-world workload where that was necessary to keep buffers clean usefully, and that heuristic helped them a lot. I'd be happy to revisit the exact logic used, and I could cook up a test case where it's useful again if people really doubt it has any value. There's one in the 2007 archives somewhere. 
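For anyone wanting to take those pg_buffercache snapshots themselves, here is a rough sketch. The SQL is the standard pg_buffercache grouping query; the classifier and its 50% thresholds are arbitrary assumptions of mine, since real servers land on a spectrum between the two extremes:

```python
# Group shared_buffers by usagecount, then crudely label which of the
# two extremes from the spreadsheet a server resembles.

USAGE_COUNT_SQL = """
SELECT usagecount, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;
"""

def classify(distribution):
    """distribution maps usagecount -> buffer count. The 50% cutoffs
    here are arbitrary illustration, not a real heuristic."""
    total = sum(distribution.values())
    if distribution.get(5, 0) / total > 0.5:
        return "effective buffer cache"
    if (distribution.get(0, 0) + distribution.get(1, 0)) / total > 0.5:
        return "minimally effective buffer cache"
    return "mixed"

print(classify({5: 800, 4: 50, 1: 100, 0: 50}))  # hot data pinned at 5
print(classify({0: 600, 1: 300, 5: 100}))        # churning workload
```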
The reason the checkpointer code has to do this work, and it has to spread the writes out, is that on some systems the hot data set hits a high usage count. If shared_buffers is 8GB and at any moment 6GB of it has a usage count of 5, which absolutely happens on many busy servers, the background writer will do almost nothing useful. It won't and shouldn't touch buffers unless their usage count is low. Those heavily referenced blocks will only be written to disk once per checkpoint cycle. Without the spreading, in this example you will drop 6GB into Dirty Memory on a Linux server, call fdatasync, and the server might stop doing any work at all for *minutes* of time. The easiest way to see it happen is to set checkpoint_completion_target to 0, put the filesystem on ext3, and have a server with lots of RAM. I have a monitoring tool that graphs Dirty Memory over time because this problem is so nasty even with the spreading code in place. There is this idea that pops up sometimes that a background writer write is better than a checkpoint one. This is backwards. A dirty block must be written at least once per checkpoint. If you only write it once per checkpoint, inside of the checkpoint process, that is the ideal. It's what you want for best performance when it's possible. At the same time, some workloads churn through a lot of low usage count data, rather than building up a large block of high usage count stuff. On those your best hope for low latency is to crank up the background writer and let it try to stay ahead of backends with the writes. The checkpointer won't have nearly as much work to do in that situation. [Attachment: bgwriter-snapshot.xls, MS-Excel spreadsheet]
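That minutes-long stall is easy to ballpark. A sketch of the arithmetic, where the 100MB/s sustained write rate is an assumption for illustration:

```python
# Estimate how long an unspread checkpoint's fsync storm can stall a
# server: all the dirty data must reach disk before fsync returns.

def stall_seconds(dirty_gb, disk_mb_per_s=100):
    """Time to retire dirty_gb of writeback at a fixed disk rate."""
    return dirty_gb * 1024.0 / disk_mb_per_s

# Dropping 6GB of high usage count buffers into Dirty Memory at once,
# against storage doing 100MB/s, is about a minute of forced writeback.
print(round(stall_seconds(6)))  # ~61 seconds
```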
Re: [HACKERS] Redesigning checkpoint_segments
On 6/6/13 4:41 AM, Heikki Linnakangas wrote: I was thinking of letting the estimate decrease like a moving average, but react to any increases immediately. Same thing we do in bgwriter to track buffer allocations: Combine what your submitted patch does and this idea, and you'll have something I prototyped a few years ago. I took the logic and tested it out in user space by parsing the output from log_checkpoints to see how many segments were being used. That approach coughed out a value about as good for checkpoint_segments as I picked by hand. The main problem was that it liked to over-tune the segments based on small bursts of activity, leaving a value higher than you might want to use the rest of the time. The background writer didn't worry about this very much because the cost of making a mistake for one 200ms cycle was pretty low. Setting checkpoint_segments too high is a more expensive mistake. When I set these by hand, I'll aim more to cover the 99th percentile of the maximum segments number rather than every worst case seen. I don't think that improvement is worth spending very much effort on though. The moving average approach is more than good enough in most cases. I've wanted checkpoint_segments to go away in exactly this fashion for a while. The general complaint the last time I suggested a change in this area, to make checkpoint_segments larger for the average user, was that some people had seen workloads where that was counterproductive. Pretty sure Kevin Grittner said he'd seen that happen. That's how I remember this general idea dying the last time, and I still don't have enough data to show it doesn't happen. As far as the UI goes, if it's a soft limit I'd suggest wal_size_target for the name. What I would like to see is a single number here in memory units that replaces both checkpoint_segments and wal_keep_segments. 
If you're willing to use a large chunk of disk space to handle either activity spikes or the class of replication issues wal_keep_segments targets, I don't see why you'd want to ban using that space for the other one too. To put some perspective on how far we've been able to push this in the field with minimal gripes, the repmgr tool requires wal_keep_segments >= 5000, which works out to 78GB. I still see some people use 73GB SAS drives in production servers for their WAL files, but that's the only time I've seen that number become scary when deploying repmgr. Meanwhile, the highest value for checkpoint_segments I've set based on real activity levels was 1024, on a server where checkpoint_timeout is 15 minutes (and can be no shorter without checkpoint spikes). At no point during that fairly difficult bit of tuning work did checkpoint_segments do anything but get in the way.
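The estimator Heikki describes above, decrease like a moving average but react to any increase immediately, can be sketched in a few lines. The decay factor here is an assumption; the real bgwriter smoothing uses its own constant:

```python
# Asymmetric smoothing: jump to any new maximum instantly, then let
# the estimate drift back down instead of snapping to the new low.

def update_estimate(current, observed, decay=0.9):
    if observed >= current:
        return float(observed)                       # spike: track it now
    return current * decay + observed * (1 - decay)  # lull: decay slowly

est = 0.0
for segments_used in [10, 50, 12, 12, 12]:
    est = update_estimate(est, segments_used)
# est peaked at 50 the moment the burst appeared, and three quiet
# checkpoints later it has only drifted down to about 39.7, so a short
# burst still dominates the estimate for a while afterwards.
```

That lingering overestimate is exactly the over-tuning-on-bursts behavior described above; a percentile-based estimator would forget the burst faster.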
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 6:49 AM, Robert Haas wrote: On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote: So we don't even know whether we can read. I think that means we need to zero the file anyway... Surely this is undue pessimism. There have been many occasions where I've found the Linux kernel defining support for POSIX behavior with a NOP stub that basically says "we should make this work one day". I don't know whether the fallocate code is one of those or a fully implemented call. Based on that history, until I see a reader that validates the resulting files are good, I have to assume they're not.
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 7:17 AM, Andres Freund wrote: That argument in contrast I find not very convincing though. What was the last incidence of such a system call that did not just error out with ENOTSUPP or such? http://linux.die.net/man/2/posix_fadvise talks about POSIX_FADV_NOREUSE and POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op, depending on your version. I know there were more examples than just that one that popped up during the testing of effective_io_concurrency. My starting position has to assume that posix_fallocate can have the same sort of surprising behavior that showed up repeatedly when we were trying to use posix_fadvise more aggressively. The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same thing) was a similar issue, and that's the first one that left me forever skeptical of Linux kernel claims in this area until they are explicitly validated: http://lwn.net/Articles/350225/
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 7:52 AM, Andres Freund wrote: fadvise is a hint which is pretty different from a fallocate where ignoring would have way much more severe consequences. Yes, it will. That's why I want to see it tested. There are more than enough past examples of bad behavior here to be skeptical that this sort of API may not work exactly as specified. If you're willing to believe the spec, that's fine, but I think that's dangerously optimistic.
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 8:50 AM, Stephen Frost wrote: I don't think this solves the locking issue around the relation extension lock, but it might help some and it may make it easier to tweak that logic to allocate larger chunks in the future. It might make it a bit faster, but it doesn't make it any easier to implement. The messy part of extending relations in larger chunks is how to communicate that back into the buffer manager usefully. The extension path causing trouble is RelationGetBufferForTuple calling ReadBufferBI. All of that is passing a single buffer around. There's no simple way I can see to rewrite it to handle more than one at a time. I have a test case for relation extension that I'm going to package up soon. That makes it easy to see where the problem is. As far as I can tell, the reason it hasn't been fixed before now is that it's a pain to write the code.
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/30/13 11:21 AM, Alvaro Herrera wrote: Greg Smith wrote: The messy part of extending relations in larger chunks is how to communicate that back into the buffer manager usefully. The extension path causing trouble is RelationGetBufferForTuple calling ReadBufferBI. All of that is passing a single buffer around. There's no simple way I can see to rewrite it to handle more than one at a time. No, but we can have it create several pages and insert them into the FSM. So they aren't returned to the original caller but are available to future users. There's actually a code comment wondering about this topic for the pages that are already created, in src/backend/access/heap/hio.c: "Remember the new page as our target for future insertions. XXX should we enter the new page into the free space map immediately, or just keep it for this backend's exclusive use in the short run (until VACUUM sees it)? Seems to depend on whether you expect the current backend to make more insertions or not, which is probably a good bet most of the time. So for now, don't add it to FSM yet." We have to be careful about touching too much at that particular point, because it's holding a relation extension lock at the obvious spot to make a change. There's an interesting overlap with these questions about how files are extended too, with this comment in that file, just before the above: "XXX This does an lseek - rather expensive - but at the moment it is the only way to accurately determine how many blocks are in a relation. Is it worth keeping an accurate file length in shared memory someplace, rather than relying on the kernel to do it for us?" That whole sequence of code took the easy way forward when it was written, but it's obvious the harder one (also touching the FSM) was considered even then. The whole sequence needs to be revisited to pull off multiple page extension. 
I wouldn't say it's hard, but it's enough work that I haven't been able to find a block of time to go through the whole thing.
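Putting Alvaro's suggestion together with the hio.c comments quoted above, the shape of a multi-page extension might look like the toy model below. Every name here is hypothetical; the real change would live in RelationGetBufferForTuple and the FSM code:

```python
# Toy model: extend by a chunk of pages under one extension lock,
# return one page to the inserting backend, and publish the rest to a
# stand-in for the free space map so other backends can find them.

class ToyRelation:
    def __init__(self):
        self.nblocks = 0   # what the expensive lseek currently reports
        self.fsm = []      # pages advertised as having free space

    def extend(self, chunk=8):
        first_new = self.nblocks
        self.nblocks += chunk
        # Keep the first new page as this backend's insertion target...
        target = first_new
        # ...and make the remainder visible to everyone else.
        self.fsm.extend(range(first_new + 1, self.nblocks))
        return target

rel = ToyRelation()
page = rel.extend()
# One lock acquisition now yields 8 pages: page 0 for this backend,
# pages 1 through 7 discoverable through the FSM.
```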
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/28/13 11:12 AM, Jon Nelson wrote: It opens a new file, fallocates 16MB, calls fdatasync. Outside of the run for performance testing, I think it would be good at this point to validate that there is really a 16MB file full of zeroes resulting from these operations. I am not really concerned that posix_fallocate might be slower in some cases; that seems unlikely. I am concerned that it might result in a file that isn't structurally the same as the 16MB of zero writes implementation used now. The timing program you're writing has some aspects that are similar to the contrib/pg_test_fsync program. You might usefully borrow some code from there. To clarify the suggestion I was making before about including performance test results: that doesn't necessarily mean the testing code must run using only the database. That's better if possible, but as Robert says it may not be for some optimizations. The important thing is to have something measuring the improvement that a reviewer can duplicate, and if that's a standalone benchmark program, that's still very useful. The main thing I'm wary of is any "this should be faster" claims that don't come with any repeatable measurements at all. Very often theories about the fastest way to do something don't match what's actually seen in testing.
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 5/28/13 10:00 PM, Jon Nelson wrote: On Tue, May 28, 2013 at 10:36 AM, Greg Smith g...@2ndquadrant.com wrote: On 5/28/13 11:12 AM, Jon Nelson wrote: It opens a new file, fallocates 16MB, calls fdatasync. Outside of the run for performance testing, I think it would be good at this point to validate that there is really a 16MB file full of zeroes resulting from these operations. I am not really concerned that posix_fallocate might be slower in some cases; that seems unlikely. I am concerned that it might result in a file that isn't structurally the same as the 16MB of zero writes implementation used now. util-linux comes with fallocate which (might?) suffice for testing in that respect, no? If that is a real concern, it could be made part of the autoconf testing, perhaps. I was just thinking of something to run in your test program, not another build time check. Just run the new allocation sequence, and then check the resulting WAL file for a) the correct length, and b) 16MB of zero bytes. I would like to build some confidence that posix_fallocate is operating correctly in this context on at least one platform. My experience with how Linux handles this class of functions has left me skeptical of them working until that's proven to be the case.
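A validation sketch along those lines. This uses os.posix_fallocate where available, with a plain zero-write fallback; it only demonstrates the length-and-zeroes check being suggested, not the C implementation under discussion:

```python
# Allocate a 16MB file with posix_fallocate (falling back to explicit
# zero writes), then verify the length and that every byte is zero.
import os
import tempfile

SEGMENT_SIZE = 16 * 1024 * 1024

def make_segment(path):
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        if hasattr(os, "posix_fallocate"):
            os.posix_fallocate(fd, 0, SEGMENT_SIZE)
        else:
            os.write(fd, b"\0" * SEGMENT_SIZE)
    finally:
        os.close(fd)

def validate_segment(path):
    """The check suggested above: a) correct length, b) all zero bytes."""
    if os.path.getsize(path) != SEGMENT_SIZE:
        return False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(1 << 20)
            if not chunk:
                return True
            if chunk.count(0) != len(chunk):
                return False

path = os.path.join(tempfile.mkdtemp(), "000000010000000000000001")
make_segment(path)
print(validate_segment(path))  # True
```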
Re: [HACKERS] Cost limited statements RFC
On 5/24/13 10:36 AM, Jim Nasby wrote: Instead of KB/s, could we look at how much time one process is spending waiting on IO vs the rest of the cluster? Is it reasonable for us to measure IO wait time for every request, at least on the most popular OSes? It's not just an OS specific issue. The overhead of collecting timing data varies massively based on your hardware, which is why there's the pg_test_timing tool now to help quantify that. I have a design I'm working on that exposes the system load to the database usefully. That's what I think people really want if the goal is to be adaptive based on what else is going on. My idea is to use what uptime collects as a starting set of useful numbers to quantify what's going on. If you have both a short term load measurement and a longer term one like uptime provides, you can quantify both the overall load and whether it's rising or falling. I want to swipe some ideas on how moving averages are used to determine trends in stock trading systems: http://www.onlinetradingconcepts.com/TechnicalAnalysis/MASimple2.html Dynamic load-sensitive statement limits and autovacuum are completely feasible on UNIX-like systems. The work to insert a cost delay point needs to get done before building more complicated logic on top of it though, so I'm starting with this part.
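The uptime-based trend idea can be sketched by comparing a short-term load average against a longer-term one, the way moving average crossovers are read in trading systems. The band threshold is an illustrative assumption:

```python
# Classify system load as rising, falling, or steady by comparing the
# 1-minute load average against the 15-minute one.
import os

def load_trend(short=None, long_=None, band=0.1):
    if short is None or long_ is None:
        one, _, fifteen = os.getloadavg()  # UNIX-like systems only
        short, long_ = one, fifteen
    if short > long_ * (1 + band):
        return "rising"
    if short < long_ * (1 - band):
        return "falling"
    return "steady"

# A load-sensitive throttle could lengthen its cost delays when the
# trend is "rising" and relax them again when it is "falling".
print(load_trend(2.0, 1.0))  # rising
```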
Re: [HACKERS] Cost limited statements RFC
On 5/24/13 9:21 AM, Robert Haas wrote: But I wonder if we wouldn't be better off coming up with a little more user-friendly API. Instead of exposing a cost delay, a cost limit, and various charges, perhaps we should just provide limits measured in KB/s, like dirty_rate_limit = amount of data you can dirty per second, in kB and read_rate_limit = amount of data you can read into shared buffers per second, in kB. I already made and lost the argument for doing vacuum in KB/s units, so I wasn't planning on putting that in the way of this one. I still think it's possible to switch to real-world units and simplify all of those parameters. Maybe I'll get the energy to fight this battle again for 9.4. I do have a lot more tuning data from production deployments to use as evidence now. I don't think the UI end changes the bulk of the implementation work though. The time consuming part of this development is inserting all of the cost delay hooks and validating they work. Exactly which parameters and logic fire when they are called can easily be refactored later.
Re: [HACKERS] Block write statistics WIP
On 5/20/13 7:51 AM, Heikki Linnakangas wrote: The way that MarkDirty requires this specific underlying storage manager behavior to work properly strikes me as a bit of a layering violation too. I'd like the read and write paths to have a similar API, but here they don't even operate on the same type of inputs. Addressing that is probably harder than just throwing a hack on the existing code though. To be honest, I don't understand what you mean by that. Let's say you were designing a storage layer API from scratch for what Postgres does. That might take a relation as its input, like ReadBuffer does. Hiding the details of how that turns into a physical file operation would be a useful goal of such a layer. I'd then consider it a problem if that exposed things like the actual mapping of relations into files to callers. What we actually have right now is this MarkDirty function that operates on BufferTag data, which points directly to the underlying file. I see that as cutting the storage API in half and calling a function in the middle of the implementation. It strikes me as kind of weird that the read side and write side are not more symmetrical.
[HACKERS] Cost limited statements RFC
I'm working on a new project here that I wanted to announce, just to keep from duplicating effort in this area. I've started to add a cost limit delay for regular statements. The idea is that you set a new statement_cost_delay setting before running something, and it will restrict total resources the same way autovacuum does. I'll be happy with it when it's good enough to throttle I/O on SELECT and CREATE INDEX CONCURRENTLY. Modifying the buffer manager to account for statement-based cost accumulation isn't difficult. The tricky part here is finding the right spot to put the delay. In the vacuum case, it's easy to insert a call to check for a delay after every block of I/O. It should be possible to find a single spot, or a small number of spots, to put a delay check in the executor. But I expect that every utility command may need to be modified individually to find a useful delay point. This is starting to remind me of the SEPostgres refactoring, because all of the per-command uniqueness ends up requiring a lot of work to modify in a unified way. The main unintended-consequences issue I've found so far is when a cost delayed statement holds a heavy lock. Autovacuum has some protection against letting processes with an exclusive lock on a table go to sleep. It won't be easy to do that with arbitrary statements. There's a certain amount of allowing the user to shoot themselves in the foot here that will be time consuming (if not impossible) to eliminate. The person who runs an exclusive CLUSTER that's limited by statement_cost_delay may suffer from holding the lock too long. But that might be their intention in setting the value. It's hard to idiot-proof this without eliminating useful options too. 
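A minimal sketch of the mechanism being described. statement_cost_delay exists only as a proposal here, and the charge values mimic the vacuum cost model rather than any committed code:

```python
# Statement-level cost throttling in the style of the autovacuum cost
# model: charge each block touched, sleep once the accumulated balance
# passes the limit. The delay point is the part that is hard to place.
import time

class CostThrottle:
    def __init__(self, cost_limit=200, cost_delay_s=0.02):
        self.cost_limit = cost_limit
        self.cost_delay_s = cost_delay_s
        self.balance = 0
        self.sleeps = 0

    def charge(self, cost):
        """Called for each block touched, e.g. from the buffer manager."""
        self.balance += cost
        if self.balance >= self.cost_limit:
            time.sleep(self.cost_delay_s)  # the delay point
            self.sleeps += 1
            self.balance = 0

throttle = CostThrottle()
for _ in range(1000):   # e.g. a SELECT reading 1000 cold blocks
    throttle.charge(2)  # page-miss style charge
# 1000 blocks at cost 2 is 2000 credits: 10 sleeps of 20ms each,
# stretching the scan by about 200ms.
```

The danger flagged above is visible even in this toy: nothing stops charge() from sleeping while the statement holds an exclusive lock.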
Re: [HACKERS] Cost limited statements RFC
On 5/23/13 7:34 PM, Claudio Freire wrote: Why not make the delay conditional on the amount of concurrency, kinda like the commit_delay? Although in this case, it should only count unwaiting connections. The test run by commit_delay is way too heavy to run after every block is processed. That code is only hit when there's a commit, which already assumes a lot of overhead is going on (the disk flush to WAL), so burning some processing/lock acquisition time isn't a big deal. The spot where statement delay is going is so performance sensitive that everything it touches needs to be local to the backend. For finding cost delayed statements that are causing trouble because they are holding locks, the only place I've thought of that runs infrequently and is poking at the right data is the deadlock detector. Turning that into a more general mechanism for finding priority inversion issues is an interesting idea. It's a bit down the road from what I'm staring at now though.
Re: [HACKERS] Cost limited statements RFC
On 5/23/13 7:56 PM, Claudio Freire wrote: Besides the obvious option of making a lighter check (it doesn't have to be 100% precise), wouldn't this check be done when it would otherwise sleep? Is it so heavy still in that context? A commit to a typical 7200RPM disk takes about 10ms, while autovacuum_vacuum_cost_delay is 20ms. If the statement cost limit logic were no more complicated than commit_delay, it would be feasible to do something similar each time a statement was being put to sleep. I suspect that the cheapest useful thing will be more expensive than commit_delay's test. That's a guess though. I'll have to think about this more when I circle back toward usability. Thanks for the implementation idea.
[HACKERS] Block write statistics WIP
I have some time now for working on the mystery of why there are no block write statistics in the database. I hacked out the statistics collection and reporting side already; a rough patch is attached if you want to see the boring parts. I guessed that there had to be a gotcha behind why this wasn't done before now, and I've found one so far. All of the read statistics are collected with a macro that expects to know a Relation number. Callers to ReadBuffer pass one. On the write side, the two places that increment the existing write counters (pg_stat_statements, vacuum) are MarkBufferDirty and MarkBufferDirtyHint. Neither of those is given a Relation though. Inspecting the Buffer passed can only find the buffer tag's RelFileNode. I've thought of two paths to get a block write count out of that so far:

- Provide a function to find the Relation from the RelFileNode. There is a warning about the perils of assuming you can map that way from a buftag value in buf_internals.h though: "Note: the BufferTag data must be sufficient to determine where to write the block, without reference to pg_class or pg_tablespace entries. It's possible that the backend flushing the buffer doesn't even believe the relation is visible yet (its xact may have started before the xact that created the rel). The storage manager must be able to cope anyway."

- Modify every caller of MarkDirty* to include a relation when that information is available. There are about 200 callers of those functions around, so that won't be fun. I noted that many of them are in the XLog replay functions, which won't have the relation--but those aren't important to count anyway.

Neither of these options feels very good to me, so selecting between the two feels like picking the lesser of two evils. I'm fine with chugging through all of the callers to modify MarkDirty*, but I didn't want to do that only to have the whole approach rejected as wrong. That's why I stopped here to get some feedback. 
The way that MarkDirty requires this specific underlying storage manager behavior to work properly strikes me as a bit of a layering violation too. I'd like the read and write paths to have a similar API, but here they don't even operate on the same type of inputs. Addressing that is probably harder than just throwing a hack on the existing code though.

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a03bfa6..1773d59 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -467,15 +467,19 @@ CREATE VIEW pg_statio_all_tables AS
             pg_stat_get_blocks_fetched(C.oid) -
                     pg_stat_get_blocks_hit(C.oid) AS heap_blks_read,
             pg_stat_get_blocks_hit(C.oid) AS heap_blks_hit,
+            pg_stat_get_blocks_dirtied(C.oid) AS heap_blks_dirtied,
             sum(pg_stat_get_blocks_fetched(I.indexrelid) -
                     pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_read,
             sum(pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_hit,
+            sum(pg_stat_get_blocks_dirtied(I.indexrelid))::bigint AS idx_blks_dirtied,
             pg_stat_get_blocks_fetched(T.oid) -
                     pg_stat_get_blocks_hit(T.oid) AS toast_blks_read,
             pg_stat_get_blocks_hit(T.oid) AS toast_blks_hit,
+            pg_stat_get_blocks_dirtied(T.oid) AS toast_blks_dirtied,
             pg_stat_get_blocks_fetched(X.oid) -
                     pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read,
             pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit
+            pg_stat_get_blocks_dirtied(X.oid) AS tidx_blks_dirtied
     FROM pg_class C LEFT JOIN
             pg_index I ON C.oid = I.indrelid LEFT JOIN
             pg_class T ON C.reltoastrelid = T.oid LEFT JOIN
@@ -530,6 +534,7 @@ CREATE VIEW pg_statio_all_indexes AS
             pg_stat_get_blocks_fetched(I.oid) -
                     pg_stat_get_blocks_hit(I.oid) AS idx_blks_read,
             pg_stat_get_blocks_hit(I.oid) AS idx_blks_hit
+            pg_stat_get_blocks_dirtied(I.oid) AS idx_blks_dirtied
     FROM pg_class C JOIN
             pg_index X ON C.oid = X.indrelid JOIN
             pg_class I ON I.oid = X.indexrelid
@@ -554,6 +559,7 @@ CREATE VIEW pg_statio_all_sequences AS
             pg_stat_get_blocks_fetched(C.oid) -
                     pg_stat_get_blocks_hit(C.oid) AS blks_read,
             pg_stat_get_blocks_hit(C.oid) AS blks_hit
+            pg_stat_get_blocks_dirtied(C.oid) AS blks_dirtied
     FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
     WHERE C.relkind = 'S';
@@ -622,6 +628,7 @@ CREATE VIEW pg_stat_database AS
             pg_stat_get_db_blocks_fetched(D.oid) -
                     pg_stat_get_db_blocks_hit(D.oid