Re: [HACKERS] Enabling Checksums

2013-09-12 Thread Greg Smith

On 3/18/13 10:52 AM, Bruce Momjian wrote:

With a potential 10-20% overhead, I am unclear who would enable this at
initdb time.


If you survey people who are running PostgreSQL on cloud hardware, be 
it Amazon's EC2 or similar options from other vendors, you will find a 
high percentage of them would pay quite a bit of performance to make 
their storage more reliable.  To pick one common measurement for 
popularity, a Google search on "ebs corruption" returns 17 million hits. 
 To quote one of those, Baron Schwartz of Percona talking about MySQL 
on EC2:


BTW, I have seen data corruption on EBS volumes. It’s not clear whether 
it was InnoDB’s fault (extremely unlikely IMO), the operating system’s 
fault, EBS’s fault, or something else.


http://www.mysqlperformanceblog.com/2011/08/04/mysql-performance-on-ec2ebs-versus-rds/

*That* uncertainty is where a lot of the demand for this feature is 
coming from.  People deploy into the cloud, their data gets corrupted, 
and no one can tell them what/why/how it happened.  And that means they 
don't even know what to change to make it better.  The only people I see 
really doing something about this problem all seem years off, and I'm 
not sure they are going to help--especially since some of them are 
targeting enterprise storage rather than the cloud-style installations.



I assume a user would wait until they suspected corruption to turn it
on, and because it is only initdb-enabled, they would have to
dump/reload their cluster.  The open question is whether this is a
usable feature as written, or whether we should wait until 9.4.


The reliability issues of both physical and virtual hardware are so 
widely known that many people will deploy with this on as their default 
configuration.


If you don't trust your existing data, you can't retroactively check it. 
 A checksum of an already corrupt block is useless.  Accordingly, there 
is no use case for converting an installation with real or even 
suspected problems to a checksummed one.  If you wait until you suspect 
corruption to care about checksums, it's really too late.  There is only 
one available next step:  you must do a dump to figure out what's 
readable.  That is the spot that all of the incoming data recovery 
customers we see at 2ndQuadrant are already in when we're called.  The 
cluster is suspicious, sometimes they can get data out of it with a 
dump, and if we hack up their install we can usually recover a bit more 
than they could.


After the data from a partially corrupted database is dumped, someone 
who has just been through that pain might decide they should turn 
checksums on when they restore the dump.  When it's on, they can assess 
future damage easily at the block level when it happens, and possibly 
repair it without doing a full dump/reload.  What's implemented in the 
feature we're talking about has a good enough UI to handle this entire 
cycle that I see damaged installations go through.
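
For anyone in that position, the mechanics are straightforward.  Here is a 
minimal sketch of the restore-with-checksums cycle, assuming a 9.3 or later 
initdb, with made-up paths and file names:

$ initdb --data-checksums -D /var/lib/pgsql/new-data
$ pg_ctl -D /var/lib/pgsql/new-data -w start
$ psql -f salvaged_dump.sql postgres
$ pg_controldata /var/lib/pgsql/new-data | grep -i checksum
Data page checksum version:           1

From then on, any block whose stored checksum doesn't match what's read back 
is reported when it's accessed, which is the block-level visibility described 
above.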



In fact, this feature is going to need
pg_upgrade changes to detect from pg_controldata that the old/new
clusters have the same checksum setting.


I think that's done already, but it's certainly something to test out too.

Good questions, Bruce, I don't think the reasons behind this feature's 
demand have been highlighted very well before.  I try not to spook the 
world by talking regularly about how many corrupt PostgreSQL databases 
I've seen, but they do happen.  Most of my regular ranting on crappy 
SSDs that lie about writes comes from a TB scale PostgreSQL install that 
got corrupted due to the write-cache flaws of the early Intel 
SSDs--twice.  They would have happily lost even the worst-case 20% of 
regular performance to avoid going down for two days each time they saw 
corruption, where we had to dump/reload to get them going again.  If the 
install had checksums, I could have figured out which blocks were 
damaged and manually fixed them.  Without checksums, there's no way to 
even tell for sure what is broken.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: [HACKERS] [v9.4] row level security

2013-09-01 Thread Greg Smith

On 9/1/13 9:38 AM, Heikki Linnakangas wrote:

To phrase it differently: We already have RLS. It's shipped as an
extension called Veil. Now please explain what's wrong with that
statement, if anything.


Veil was last updated for 9.1 to work against that version, so the first 
thing is that it's two versions back from being current.


The main improvement for a few now core features, compared to their 
external/extension predecessors, is that they go through a real review 
process.  I suspect a lot of the criticisms being leveled against the 
core RLS feature would also hit Veil if it were evaluated to the same 
standard.


Regardless, I'm seeing a few review themes pop up from this thread:

-Comparison against the Veil feature set.
-Competitive review against industry expectations, AKA checkbox 
compliance.
-Confirm feature set is useful to government security clearance 
applications and multi-tenant applications.  There's also a secured web 
application use case that's popped up a few times too; KaiGai has used 
secured Apache installs for example.

-Summary of known covert channels, with documentation coverage.
-Assess odds of this implementation's future issues turning into 
security bugs.  My personal hotspot here is that I'd like minimal code 
exposure to people who don't use this feature at all.  Are there parts 
here that should be compile time enabled?


Of course those are all on top of the usual code quality review.  Did I 
miss any big themes on that list?


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [v9.4] row level security

2013-09-01 Thread Greg Smith

On 9/1/13 5:54 PM, Greg Stark wrote:

So I think up above Tom is wrong about why it's ok that INSERT leaks
keys when it reports a unique key violation. The reason why it's ok
that there's a covert channel there is that the DBA can avoid that
covert channel by being careful when creating unique constraints. He
or she should be aware that creating a unique constraint implicitly
provides a kind of limited access to data to users who have INSERT
privilege even if they lack the real SELECT privilege.


And if someone can INSERT values that they can't actually see once 
they're committed, that's a similarly bad case we should describe.  People 
shouldn't be dumping their trash in their neighbor's yard.  I think 
eventually this needs to be wrestled to the ground in a robust way.  I 
want to see if all unique violations might be changed to give less 
information in this sort of RLS context.
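
To make the leak concrete, here's a sketch of the existing behavior with 
nothing more than a unique constraint involved.  Names are invented for the 
example, and this is how current releases behave for a role that has INSERT 
but no SELECT on the table:

postgres=# CREATE ROLE probe_user LOGIN;
postgres=# CREATE TABLE accounts (id integer PRIMARY KEY, owner text);
postgres=# INSERT INTO accounts VALUES (42, 'alice');
postgres=# GRANT INSERT ON accounts TO probe_user;
postgres=# SET ROLE probe_user;
postgres=> INSERT INTO accounts VALUES (42, 'guess');
ERROR:  duplicate key value violates unique constraint "accounts_pkey"
DETAIL:  Key (id)=(42) already exists.

The error confirms which keys exist even though probe_user can't SELECT a 
single row, which is exactly the fishing-for-keys channel being discussed.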


One rough early idea is to create a new error condition that means you 
hit something protected by RLS, but doesn't leak any more information 
than that.  Just a generic "security restriction" error that comes 
out of fishing for keys, inserting outside your area, etc.  I want to 
think through some use cases and review the code to see whether that 
concept helps or not.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-08-27 Thread Greg Smith

On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:

I think that it is almost same as small dirty_background_ratio or
dirty_background_bytes.


The main difference here is that all writes pushed out this way will be 
to a single 1GB relation chunk.  The odds are better that multiple 
writes will combine, and that the I/O will involve a lower than average 
amount of random seeking.  Whereas shrinking the size of the write cache 
always results in more random seeking.



The essential improvement is not dirty page size in fsync() but
scheduling of fsync phase.
I can't understand why postgres does not consider scheduling of fsync
phase.


Because it cannot get the sort of latency improvements I think people 
want.  I proved to myself it's impossible during the last 9.2 CF when I 
submitted several fsync scheduling change submissions.


By the time you get to the checkpoint sync phase, on a system that's always 
writing heavily there is way too much backlog to possibly cope with by 
then.  There just isn't enough time left before the checkpoint should 
end to write everything out.  You have to force writes to actual disk to 
start happening earlier to keep a predictable schedule.  Basically, the 
longer you go without issuing a fsync, the more uncertainty there is 
around how long it might take to fire.  My proposal lets someone keep 
all I/O from ever reaching the point where the uncertainty is that high.


In the simplest to explain case, imagine that a checkpoint includes a 
1GB relation segment that is completely dirty in shared_buffers.  When a 
checkpoint hits this, it will have 1GB of I/O to push out.


If you have waited this long to fsync the segment, the problem is now 
too big to fix by checkpoint time.  Even if the 1GB of writes are 
themselves nicely ordered and grouped on disk, the concurrent background 
activity is going to chop the combination up into more random I/O than 
the ideal.


Regular consumer disks have a worst case random I/O throughput of less 
than 2MB/s.  My observed progress rates for such systems show you're 
lucky to get 10MB/s of writes out of them.  So how long will the dirty 
1GB in the segment take to write?  1GB @ 10MB/s = 102.4 *seconds*.  And 
that's exactly what I saw whenever I tried to play with checkpoint sync 
scheduling.  No matter what you do there, periodically you'll hit a 
segment that has over a minute of dirty data accumulated, and 60 second 
latency pauses result.  By the time you've reached the checkpoint, you're 
dead when you call fsync on that relation.  You *must* hit that segment 
with fsync more often than once per checkpoint to achieve reasonable 
latency.


With this linear slider idea, I might tune such that no segment will 
ever get more than 256MB of writes before hitting a fsync instead.  I 
can't guarantee that will work usefully, but the shape of the idea seems 
to match the problem.
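
Plugging in the same worst-case numbers from above shows the sort of bound 
that buys you:

256MB @ 10MB/s = ~26 seconds of backlog per fsync, versus 102.4 seconds 
for a fully dirty 1GB segment

Still not a small pause, but it becomes a ceiling you can dial down with the 
slider rather than something that only shows up at checkpoint time.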



Taken together my checkpoint proposal method,
* write phase
   - Almost same, but considering fsync phase schedule.
   - Considering case of background-write in OS, sort buffer before
starting checkpoint write.


This cannot work for the reasons I've outlined here.  I guarantee you I 
will easily find a test workload where it performs worse than what's 
happening right now.  If you want to play with this to learn more about 
the trade-offs involved, that's fine, but expect me to vote against 
accepting any change of this form.  I would prefer you to not submit 
them because it will waste a large amount of reviewer time to reach that 
conclusion yet again.  And I'm not going to be that reviewer.



* fsync phase
   - Considering checkpoint schedule and write-phase schedule
   - Executing separated sync_file_range() and sleep, in final fsync().


If you can figure out how to use sync_file_range() to fine tune how much 
fsync is happening at any time, that would be useful on all the 
platforms that support it.  I haven't tried it just because that looked 
to me like a large job refactoring the entire fsync absorb mechanism, 
and I've never had enough funding to take it on.  That approach has a 
lot of good properties, if it could be made to work without a lot of 
code changes.
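
To show the shape I mean, here is a rough sketch of the Linux call sequence 
only--not a patch, and the chunk size and sleep are arbitrary numbers picked 
for the example:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Sketch:  push an already-written segment out in 32MB chunks, asking the
 * kernel to start writeback on each chunk in turn and pausing briefly
 * between them, so the final fsync() has little left to flush.
 */
static void
sync_segment_in_chunks(int fd, off_t segment_size)
{
	const off_t chunk = 32 * 1024 * 1024;
	off_t		offset;

	for (offset = 0; offset < segment_size; offset += chunk)
	{
		sync_file_range(fd, offset, chunk,
						SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
		usleep(10000);			/* leave some room for other I/O */
	}
	fsync(fd);					/* catch whatever is still pending */
}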


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Backup throttling

2013-08-27 Thread Greg Smith

On 8/27/13 7:58 AM, Robert Haas wrote:

 We have a *general* need to be able to throttle server-side resource
utilization, particularly I/O.  This is a problem not only for
pg_basebackup, but for COPY, CLUSTER, VACUUM, and even things like
UPDATE.  Of all of those, the only one for which we currently have any
kind of a solution is VACUUM.


I didn't mention it specifically, but I always presumed that the "Cost 
limited statements" RFC proposal I floated: 
http://www.postgresql.org/message-id/519ea5ff.5040...@2ndquadrant.com 
(and am still working on) would handle the base backup case too. 
pg_basebackup is just another client.  Some sort of purpose made 
solution for pg_basebackup alone may be useful, but I'd be shocked if 
the sort of general mechanism I described there wasn't good enough to 
handle many of the backup limiting cases too.


Also, once that and the block write counters I also sent an RFC out for 
are in place, I have a plan for adding a server-wide throttling 
mechanism.  I want to extract a cluster wide read and write rate and put 
a cluster wide limit on that whole thing.  It seemed too hard to jump 
into without these other two pieces of plumbing in place first.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Does larger i/o size make sense?

2013-08-27 Thread Greg Smith

On 8/27/13 3:54 PM, Josh Berkus wrote:

I believe that Greenplum currently uses 128K.  There's a definite
benefit for the DW use-case.


Since Linux read-ahead can easily give big gains on fast storage, I 
normally set that to at least 4096 sectors = 2048KB.  That's a lot 
bigger than even this, and definitely necessary for reaching maximum 
storage speed.
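
For reference, the read-ahead setting I'm talking about is the block device 
one; on Linux it can be set either way shown here, with the device name just 
an example:

# 4096 x 512 byte sectors = 2048KB of read-ahead
blockdev --setra 4096 /dev/sda
blockdev --getra /dev/sda

# the same knob via sysfs, in kilobytes
echo 2048 > /sys/block/sda/queue/read_ahead_kb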


I don't think that the block size change alone will necessarily 
duplicate the gains on seq scans that Greenplum gets though.  They've 
done a lot more performance optimization on that part of the read path 
than just the larger block size.


As far as quantifying whether this is worth chasing, the most useful 
thing to do here is find some fast storage and profile the code with 
different block sizes at a large read-ahead.  I wouldn't spend a minute 
on trying to come up with a more complicated management scheme until the 
potential gain is measured.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [v9.4] row level security

2013-08-27 Thread Greg Smith

On 7/20/13 10:08 AM, Kohei KaiGai wrote:

Hmm. I didn't have this idea. It seems to me fair enough and kills
necessity to enhance RangeTblEntry and getrelid() indeed.
I try to fix up this implementation according to your suggestion.


How is that going?  I'm going to do a serious review of this myself over 
the next few weeks.  I have a good chunk of time set aside for it as 
part of a larger project.  I'm hoping to get more people here involved 
in that effort too, starting in the November CF if that works out.


I've been trying to catch up with your larger plan for this feature for 
9.4.  You made this comment earlier:


 Also, I'd like to have discussion for this feature in earlier half of
 v9.4 to keep time for the remaining features, such as check on
 writer-side, integration with selinux, and so on

Is any of that code around yet?  I see that you have split your 
submissions so that a smaller program can be reviewed today.  I'd like 
to start taking a look at the next step too though.  For the project I'm 
starting to work on here, getting the integration with labeling also 
done is a very important thing to target for 9.4.  It would be nice to 
see how that fits together today, even if the code for it isn't being 
reviewed heavily yet.


I don't quite understand yet what's missing on the writer side.  If you 
could help explain what's missing there, I would like to read about that.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Unsafe GUCs and ALTER SYSTEM WAS: Re: ALTER SYSTEM SET

2013-08-06 Thread Greg Smith

On 8/5/13 2:36 PM, Josh Berkus wrote:

Most of our users not on Heroku are running with superuser as the app
user now.  Like, 95% of them based on my personal experience (because
our object permissions management sucks).


My percentage wouldn't be nearly that high.  95% of database installs 
done by developers?  That I could believe.  But setups done by 
ex-{Oracle,DB2,Sybase...} DBAs avoid superuser rights for the 
application, and that's around 1/3 of the systems I see.  In general I 
agree that there certainly are a lot of superuser only installs out 
there, I just don't think it's quite as bad as you're painting it.


The important thing to realize here is that ALTER SYSTEM SET is a 
convenience function.  It’s useful to the extent that it makes people 
feel more comfortable with changing things in the database.  We will 
never stop people from shooting themselves in the foot.  And if the 
barriers added for that purpose are too high, like disabling changes to 
shared_buffers altogether, all you’ll do is make this so restricted that 
it doesn’t satisfy anyone.


The original spec I outlined for how to implement this feature allowed 
disabling the whole thing.  Then it was just commenting out the 
includedir directive from the postgresql.conf.  Allowing people to 
disable use of this feature in a managed configuration environment is 
important.  Beyond that, I don’t expect that we’ll ever make this foolproof.


After arguing out this topic in person with Stephen last night, I’ve 
lumped ideas here into three major approaches that could be followed:


1) Don’t allow changing unsafe GUCs.

2) Provide a way for the server to start with a bad setting and force the 
administrator to fix the problem they introduced.  Some sort of 
maintenance mode that only allows superuser connections would force 
someone to clean up this class of problem at next restart, while still 
giving them enough access to run a new ALTER SYSTEM SET command.


3) Require extra syntax to modify an unsafe GUC.

As far as fixing the bad settings goes, there’s already code in initdb 
to detect how high shared_buffers and max_connections can go.  Any bogus 
shared_buffers/max_connections value above the kernel limits could be 
worked around with that approach.


Here’s how I would try and communicate that a change is unsafe, only 
allowing it after proving the user is paying some attention to the danger:


# ALTER SYSTEM SET shared_buffers = ‘8GB’;
NOTICE:  Changing shared_buffers only takes effect after a server restart.
ERROR:  Changing shared_buffers incorrectly can prevent the server from 
starting normally.  Use the FORCE option to modify the value anyway.


# ALTER SYSTEM SET shared_buffers = ‘8GB’ FORCE;
NOTICE:  Changing shared_buffers only takes effect after a server restart.
ALTER SYSTEM

Will bad examples pop up on the Internet that just use FORCE all the 
time?  Sure they will, and people will cut and paste them without paying 
attention.  I don't see why that possibility has to block this feature 
from being adopted though.  That line of thinking leads toward removing 
trust authentication, because that's similarly abused with cut and paste 
tutorials.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-08-03 Thread Greg Smith

On 8/2/13 4:48 PM, Stephen Frost wrote:

* Josh Berkus (j...@agliodbs.com) wrote:

I really think this is the wrong approach.  If we start removing
unsafe parameters from ALTER SYSTEM SET, we basically hobble the
feature to the point of uselessness.  Out of the 15 or so parameters 80%
of our users touch, half of them are on your unsafe list.


Reflecting on this a bit more, I'm curious what your list-of-15 is.


You can get a top 15-ish list from 
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server


The parameters I see changed regularly that are on your unsafe list are:

listen_addresses, port, shared_buffers, log_directory, log_filename

The obvious missing entry from your unsafe list that is also changed 
regularly is max_connections.  I count 6 parameters that are both unsafe 
and changed regularly.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-08-02 Thread Greg Smith

On 7/29/13 3:46 PM, Josh Berkus wrote:

Based on the ongoing discussion of this patch, I have moved it to 9.4CF2
(9-2013).

Mind you, it would be good to commit some version of it before September.


Quite a bit of the patch adds some refactored GUC parameter validation 
code that seems necessary for this feature, and that's not where the 
controversial parts are either.  That's the part I thought should be 
looked at for commit now.  I'd like to get the size of the change Amit is 
carrying around (and other people are reviewing) reduced here.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-08-02 Thread Greg Smith

On 8/1/13 10:47 AM, David Johnston wrote:

Minor request: could someone enlighten me as to why making the directory
location a compile-time option is undesirable.


The ongoing argument here is whether to allow moving the directory at 
all, or to make it fixed to $PGDATA the way recovery.conf is.  If you 
accept that it should float, then it actually needs to be a start time 
option.  Software like Debian moves around the postgresql.conf like this:


pg_ctl -o '-c config_file=/etc/postgresql/9.3/main/postgresql.conf' ...

The way this argument is going the last few days, I'm starting to think 
that it's worth breaking this style of config directory setup out into 
its own feature now.


Whether ALTER SYSTEM SET takes advantage of the config directory or not 
still seems to be a raging question.  I've been coupling the two 
together because I think the design of ALTER SYSTEM SET should consider 
a config directory based approach.  But from the perspective of what can 
get committed first, the config directory really should go first.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-07-30 Thread Greg Smith

On 7/28/13 10:49 PM, Tom Lane wrote:

Josh Berkus j...@agliodbs.com writes:

I think if we can design conf.d separately for config files of management 
tools, then
it is better to have postgresql.auto.conf to be in $PGDATA rather than in
$PGDATA/conf.d



One of the biggest current complaints about recovery.conf from
Debian/Ubuntu users is the fact that it lives in $PGDATA.  Won't we just
get the same thing here?


I don't think that's the same case, but ... why exactly don't they like
recovery.conf, and can you show that the location of the file has
anything to do with the underlying complaint?


Here is a quick survey of config files:

RHEL6:
Cluster configuration file at /etc/sysconfig/pgsql, writeable by root. 
postgresql.conf is in $HOME/8.4/data


Debian Squeeze:
All configuration files in /etc/postgresql/8.4/main and are owned by the 
postgres user.


That last one shows why there's a complaint, and why it's 75% due to 
location rather than split role.  Debian users expect all of the config 
files to be relocated to /etc/postgresql or /etc/postgresql-common. 
Those files are also writeable by the postgres user.  recovery.conf 
doesn't fit that model, and people don't like that.  The admin who has 
to manage servers with puppet would be much happier if recovery.conf was 
in the /etc/postgresql tree they are already managing.  I get this 
complaint every single time I introduce an experienced Debian admin to 
PostgreSQL replication setup.  They aren't sure whether it's the 
packagers or PostgreSQL that is to blame for recovery.conf being in 
$PGDATA instead of /etc, but it's obvious to them someone has screwed up.


Given that Debian is the major example of a relocated postgresql.conf in 
the field, and that work has been done such that a) all config files are 
in /etc and b) they are all writeable by the postgres user, I don't 
understand why people are suggesting the auto.conf file or conf.d 
directory will go anywhere else on that platform anymore.  I think the 
only people suggesting otherwise are thinking about an older, 
now unsupported version of the Debian packaging.  Debian packaging 
absolutely will want to relocate any new files added here into their 
/etc directory tree as well, where they will be writeable.  They will 
not go into the $PGDATA directory.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/25/13 6:02 PM, didier wrote:

It was surely already discussed but why isn't postgresql writing
sequentially its cache in a temporary file?


If you do that, reads of the data will have to traverse that temporary 
file to assemble their data.  You'll make every later reader pay the 
random I/O penalty that's being avoided right now.  Checkpoints are 
already postponing these random writes as long as possible. You have to 
take care of them eventually though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 5:59 AM, Hannu Krosing wrote:

Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
random
fs pages on one large disk page and having an extra index layer for
resolving
random-to-sequential ordering.


If your solution to avoiding random writes now is to do sequential ones 
into a buffer, you'll pay for it by having more expensive random reads 
later.  In the SSD write buffer case, that works only because those 
random reads are very cheap.  Do the same thing on a regular drive, and 
you'll be paying a painful penalty *every* time you read in return for 
saving work *once* when you write.  That only makes sense when your 
workload is near write-only.


It's possible to improve on this situation by having some sort of 
background process that goes back and cleans up the random data, 
converting it back into sequentially ordered writes again.  SSD 
controllers also have firmware that does this sort of work, and Postgres 
might do it as part of vacuum cleanup.  But note that such work faces 
exactly the same problems as writing the data out in the first place.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 9:14 AM, didier wrote:

During recovery you have to load the log in cache first before applying WAL.


Checkpoints exist to bound recovery time after a crash.  That is their 
only purpose.  What you're suggesting moves a lot of work into the 
recovery path, which will slow down how long it takes to process.


More work at recovery time means someone who uses the default of 
checkpoint_timeout='5 minutes', expecting that crash recovery won't take 
very long, will discover it does take a longer time now.  They'll be 
forced to shrink the value to get the same recovery time as they do 
currently.  You might need to make checkpoint_timeout 3 minutes instead, 
if crash recovery now has all this extra work to deal with.  And when 
the time between checkpoints drops, it will slow the fundamental 
efficiency of checkpoint processing down.  You will end up writing out 
more data in the end.


The interval between checkpoints and recovery time are all related.  If 
you let any one side of the current requirements slip, it makes the rest 
easier to deal with.  Those are all trade-offs though, not improvements. 
 And this particular one is already an option.


If you want less checkpoint I/O per capita and don't care about recovery 
time, you don't need a code change to get it.  Just make 
checkpoint_timeout huge.  A lot of checkpoint I/O issues go away if you 
only do a checkpoint per hour, because instead of random writes you're 
getting sequential ones to the WAL.  But when you crash, expect to be 
down for a significant chunk of an hour, as you go back to sort out all 
of the work postponed before.
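
In configuration terms that's nothing more exotic than something like this 
sketch; the segment count is a made-up number sized so the timeout rather 
than WAL volume is what triggers checkpoints:

checkpoint_timeout = 60min          # the maximum allowed value
checkpoint_segments = 1024          # keep WAL volume from forcing earlier checkpoints
checkpoint_completion_target = 0.9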


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 8:32 AM, Tom Lane wrote:

What I'd point out is that that is exactly what WAL does for us, ie
convert a bunch of random writes into sequential writes.  But sooner or
later you have to put the data where it belongs.


Hannu was observing that SSD often doesn't do that at all.  They can 
maintain logical to physical translation tables that decode where each 
block was written to forever.  When read seeks are really inexpensive, 
the only pressure to reorder blocks is wear leveling.


That doesn't really help with regular drives though, where the low seek 
time assumption doesn't play out so well.  The whole idea of writing 
things sequentially and then sorting them out later was the rage in 2001 
for ext3 on Linux, as part of the data=journal mount option.  You can 
go back and see that people are confused but excited about the 
performance at 
http://www.ibm.com/developerworks/linux/library/l-fs8/index.html


Spoiler:  if you use a workload that has checkpoint issues, it doesn't 
help PostgreSQL latency.  Just like using a large write cache, you gain 
some burst performance, but eventually you pay for it with extra latency 
somewhere.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read

2013-07-23 Thread Greg Smith
On 7/23/13 2:16 AM, Satoshi Nagayasu wrote:
 I've been working on new pgstattuple function to allow
 block sampling [1] in order to reduce block reads while
 scanning a table. A PoC patch is attached.

Take a look at all of the messages linked in
https://commitfest.postgresql.org/action/patch_view?id=778

Jaime and I tried to do what you're working on then, including a random
block sampling mechanism modeled on the stats_target mechanism.  We
didn't do that as part of pgstattuple though, which was a mistake.

Noah created some test cases as part of his thorough review that were
not computing the correct results.  Getting the results correct for all
of the various types of PostgreSQL tables and indexes ended up being
much harder than the sampling part.  See
http://www.postgresql.org/message-id/20120222052747.ge8...@tornado.leadboat.com
in particular for that.

 This new function, pgstattuple2(), samples only 3,000 blocks
 (which accounts 24MB) from the table randomly, and estimates
 several parameters of the entire table.

There should be an input parameter to the function for how much sampling
to do, and if it's possible to make the scale for it to look like
ANALYZE that's helpful too.
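
Something along these lines is the sort of interface I have in mind--purely
a hypothetical signature to sketch the idea, not a worked out proposal:

-- sample size supplied by the caller, on a scale similar to
-- default_statistics_target
pgstattuple2(rel regclass, sample_target integer DEFAULT 100)

SELECT * FROM pgstattuple2('pgbench_accounts', 500);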

I have a project for this summer that includes reviving this topic and
making sure it works on some real-world systems.  If you want to work on
this too, I can easily combine that project into what you're doing.

-- 
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [9.4 CF 1] And then there were 5

2013-07-23 Thread Greg Smith

On 7/23/13 9:06 AM, Robert Haas wrote:

But just BTW, you kind of messed up that CommitFest entry.  The link you added 
as a new version of the
patch was actually to a different patch from the same flock.


Fixed now, since I find myself looking back through history on 
submissions often enough to care about it being right.  My eyes were 
starting to glaze over by the end of staring at this flock.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Greg Smith

On 7/23/13 10:56 AM, Robert Haas wrote:

On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote:

We know that a 1GB relation segment can take a really long time to write
out.  That could include up to 128 changed 8K pages, and we allow all of
them to get dirty before any are forced to disk with fsync.


By my count, it can include up to 131,072 changed 8K pages.


Even better!  I can pinpoint exactly what time last night I got tired 
enough to start making trivial mistakes.  Everywhere I said 128 it's 
actually 131,072, which just changes the range of the GUC I proposed.


Getting the number right really highlights just how bad the current 
situation is.  Would you expect the database to dump up to 128K writes 
into a file and then have low latency when it's flushed to disk with 
fsync?  Of course not.  But that's the job the checkpointer process is 
trying to do right now.  And it's doing it blind--it has no idea how 
many dirty pages might have accumulated before it started.


I'm not exactly sure how best to use the information collected.  "fsync 
every N writes" is one approach.  Another is to use accumulated writes to 
predict how long fsync on that relation should take.  Whenever I tried 
to spread fsync calls out before, the scale of the piled-up writes from 
backends was the input I really wanted available.  The segment write 
count gives an alternate way to sort the blocks too; you might start 
with the heaviest hit ones.
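
The "fsync every N writes" variation is simple enough to sketch out; assume 
a per-segment counter maintained where the checkpointer absorbs fsync 
requests, with all of the names here invented for the illustration:

#include <unistd.h>

typedef struct PendingWrites
{
	int		fd;					/* open descriptor for the 1GB segment */
	int		writes_absorbed;	/* dirty writes seen since the last sync */
} PendingWrites;

static void
absorb_write(PendingWrites *seg, int sync_threshold)
{
	/* force the backlog to disk once it passes the configured threshold */
	if (++seg->writes_absorbed >= sync_threshold)
	{
		fsync(seg->fd);
		seg->writes_absorbed = 0;
	}
}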


In all these cases, the fundamental thing I keep coming back to is wanting 
to cue off of past write statistics.  If you want to predict relative I/O 
delay times with any hope of accuracy, you have to start the checkpoint 
knowing something about the backend and background writer activity since 
the last one.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [v9.4] row level security

2013-07-23 Thread Greg Smith

On 7/23/13 12:10 PM, Josh Berkus wrote:

Apparently it's a little much for experienced reviewers to chew on, too,
since I've been trying to get someone to review it since the beginning
of the Commitfest.


It's more than the available experienced reviewers are willing to chew 
on fully as volunteers.  The reward for spending review time is pretty 
low right now.



While I understand the call for resources, this is a bit unfair to
KaiGai, who has put in his time reviewing other people's patches.


If you read Dean Rasheed's comments, it's obvious he put a useful amount 
of work into his review suggestions.  It is not the case here that 
KaiGai worked on a review and got nothing in return.  Unfortunately that 
has happened to this particular patch before, but the community did a 
little better this time.


The goal of the CF is usually to reach one of these outcomes for every 
submission:


-The submission is ready for commit
-The submission needs improvement in X

Review here went far enough to identify an X to be improved.  It was a 
big enough issue that a rewrite is needed, commit at this time isn't 
possible, and now KaiGai has something we hope is productive he can 
continue working on.  That's all we can really promise here--that if we 
say something isn't ready for commit yet, that there's some feedback as 
to why.


I would have preferred to see multiple X issues identified here, given 
that we know there has to be more than just the one in a submission of 
this size.  The rough fairness promises of the CommitFest seem satisfied 
to me though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [v9.4] row level security

2013-07-23 Thread Greg Smith

On 7/23/13 2:30 PM, Josh Berkus wrote:


You know as well as me that, as consultants, we can get clients to pay for 10% 
extra time
for review in the course of developing a feature


Before this number gets quoted anywhere, I think it's on the low side. 
I've spent a good bit of time measuring how much time it takes to do a 
fair offsetting review--one where you put as much time in as it takes to 
review your submission--and I keep getting numbers more in the 20 to 25% 
range.  The work involved to do a high quality review takes a while.


I happen to think the review structure is one of the most important 
components to PostgreSQL release quality.  It used to be a single level 
review with just the committers, now it's a two level structure.  The 
reason the Postgres code is so good isn't that the submitted development 
is inherently any better than other projects.  There's plenty of bogus 
material submitted here.  It's the high standards for review and commit 
that are the key filter.  The importance of the process to the result 
isn't weighed as heavily as I think it should be.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-22 Thread Greg Smith
Attached is an update that I think sorts out all of the documentation 
concerns.  I broke this section into paragraphs now that it's getting so 
long too.


The only code change is that this now labels the controversial lag here 
"average rate limit schedule lag".  That way if someone wants to 
introduce other measures of rate limit lag, like a more transaction 
oriented one, you might call that "average rate limit transaction lag" 
and tell the two apart.


The rewritten documentation here tries to communicate that there is a 
schedule that acts like it was pre-computed at the start of each client 
too.  It's not ever adjusted based on what individual transactions do. 
I also noted the way this can cause schedule lag for some time after a 
slow transaction finishes, since that's the main issue observed so far.
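
The delay computation itself uses the inverse transform sampling trick 
visible in the patch comments:  each wait is drawn from an exponential 
distribution whose mean is the configured per-transaction delay, so arrivals 
form a Poisson-style stream rather than a fixed drumbeat.  A standalone 
sketch of just that piece, not the patch code itself:

#include <math.h>
#include <stdlib.h>

/* throttle_delay is the target delay between transactions in usec */
static long
next_throttle_wait(long throttle_delay)
{
	double	uniform;

	/* uniform in (0, 1]; avoids taking log(0) */
	uniform = 1.0 - (double) rand() / ((double) RAND_MAX + 1.0);
	return (long) (-log(uniform) * throttle_delay);
}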


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
new file mode 100644
index 2ad8f0b..4e6b608
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
*** int unlogged_tables = 0;
*** 137,142 
--- 137,148 
  doublesample_rate = 0.0;
  
  /*
+  * When threads are throttled to a given rate limit, this is the target delay
+  * to reach that rate in usec.  0 is the default and means no throttling.
+  */
+ int64 throttle_delay = 0;
+ 
+ /*
   * tablespace selection
   */
  char *tablespace = NULL;
*** typedef struct
*** 202,212 
--- 208,220 
int listen; /* 0 indicates that an 
async query has been
 * sent */
int sleeping;   /* 1 indicates that the 
client is napping */
+   boolthrottling; /* whether nap is for throttling */
int64   until;  /* napping until (usec) */
Variable   *variables;  /* array of variable definitions */
int nvariables;
instr_time  txn_begin;  /* used for measuring 
transaction latencies */
instr_time  stmt_begin; /* used for measuring statement 
latencies */
+   boolis_throttled;   /* whether transaction throttling is 
done */
int use_file;   /* index in sql_files 
for this client */
boolprepared[MAX_FILES];
  } CState;
*** typedef struct
*** 224,229 
--- 232,240 
instr_time *exec_elapsed;   /* time spent executing cmds (per 
Command) */
int*exec_count; /* number of cmd executions 
(per Command) */
unsigned short random_state[3]; /* separate randomness for each 
thread */
+   int64   throttle_trigger;   /* previous/next throttling (us) */
+   int64   throttle_lag;   /* total transaction lag behind 
throttling */
+   int64   throttle_lag_max;   /* max transaction lag */
  } TState;
  
  #define INVALID_THREAD((pthread_t) 0)
*** typedef struct
*** 232,237 
--- 243,250 
  {
instr_time  conn_time;
int xacts;
+   int64   throttle_lag;
+   int64   throttle_lag_max;
  } TResult;
  
  /*
*** usage(void)
*** 356,361 
--- 369,375 
 -N, --skip-some-updates  skip updates of pgbench_tellers 
and pgbench_branches\n
 -P, --progress=NUM   show thread progress report 
every NUM seconds\n
 -r, --report-latencies   report average latency per 
command\n
+-R, --rate SPEC  target rate in transactions per 
second\n
 -s, --scale=NUM  report this scale factor in 
output\n
 -S, --select-onlyperform SELECT-only 
transactions\n
 -t, --transactions   number of transactions each 
client runs 
*** doCustom(TState *thread, CState *st, ins
*** 898,914 
  {
PGresult   *res;
Command   **commands;
  
  top:
commands = sql_files[st->use_file];
  
if (st-sleeping)
{   /* are we 
sleeping? */
instr_time  now;
  
INSTR_TIME_SET_CURRENT(now);
!   if (st->until <= INSTR_TIME_GET_MICROSEC(now))
st-sleeping = 0;   /* Done sleeping, go ahead with 
next command */
else
return true;/* Still sleeping, nothing to 
do here */
}
--- 912,973 
  {
PGresult   *res;
Command   **commands;
+   booltrans_needs_throttle = false;
  
  top:
commands = sql_files[st->use_file

Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-22 Thread Greg Smith
Very minor update with V19 here, to reflect Alvaro's comments.  The 
tricky part now reads like this:


High rate limit schedule lag values, that is lag values that are large 
compared to the actual transaction latency, indicate that something is 
amiss in the throttling process.  High schedule lag can highlight a 
subtle problem there even if the target rate limit is met in the end.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
new file mode 100644
index 2ad8f0b..4e6b608
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
*** int unlogged_tables = 0;
*** 137,142 
--- 137,148 
  doublesample_rate = 0.0;
  
  /*
+  * When threads are throttled to a given rate limit, this is the target delay
+  * to reach that rate in usec.  0 is the default and means no throttling.
+  */
+ int64 throttle_delay = 0;
+ 
+ /*
   * tablespace selection
   */
  char *tablespace = NULL;
*** typedef struct
*** 202,212 
--- 208,220 
int listen; /* 0 indicates that an 
async query has been
 * sent */
int sleeping;   /* 1 indicates that the 
client is napping */
+   boolthrottling; /* whether nap is for throttling */
int64   until;  /* napping until (usec) */
Variable   *variables;  /* array of variable definitions */
int nvariables;
instr_time  txn_begin;  /* used for measuring 
transaction latencies */
instr_time  stmt_begin; /* used for measuring statement 
latencies */
+   boolis_throttled;   /* whether transaction throttling is 
done */
int use_file;   /* index in sql_files 
for this client */
boolprepared[MAX_FILES];
  } CState;
*** typedef struct
*** 224,229 
--- 232,240 
instr_time *exec_elapsed;   /* time spent executing cmds (per 
Command) */
int*exec_count; /* number of cmd executions 
(per Command) */
unsigned short random_state[3]; /* separate randomness for each 
thread */
+   int64   throttle_trigger;   /* previous/next throttling (us) */
+   int64   throttle_lag;   /* total transaction lag behind 
throttling */
+   int64   throttle_lag_max;   /* max transaction lag */
  } TState;
  
  #define INVALID_THREAD((pthread_t) 0)
*** typedef struct
*** 232,237 
--- 243,250 
  {
instr_time  conn_time;
int xacts;
+   int64   throttle_lag;
+   int64   throttle_lag_max;
  } TResult;
  
  /*
*** usage(void)
*** 356,361 
--- 369,375 
 -N, --skip-some-updates  skip updates of pgbench_tellers 
and pgbench_branches\n
 -P, --progress=NUM   show thread progress report 
every NUM seconds\n
 -r, --report-latencies   report average latency per 
command\n
+-R, --rate SPEC  target rate in transactions per 
second\n
 -s, --scale=NUM  report this scale factor in 
output\n
 -S, --select-onlyperform SELECT-only 
transactions\n
 -t, --transactions   number of transactions each 
client runs 
*** doCustom(TState *thread, CState *st, ins
*** 898,914 
  {
PGresult   *res;
Command   **commands;
  
  top:
commands = sql_files[st->use_file];
  
if (st-sleeping)
{   /* are we 
sleeping? */
instr_time  now;
  
INSTR_TIME_SET_CURRENT(now);
!   if (st->until <= INSTR_TIME_GET_MICROSEC(now))
st-sleeping = 0;   /* Done sleeping, go ahead with 
next command */
else
return true;/* Still sleeping, nothing to 
do here */
}
--- 912,973 
  {
PGresult   *res;
Command   **commands;
+   booltrans_needs_throttle = false;
  
  top:
commands = sql_files[st->use_file];
  
+   /*
+* Handle throttling once per transaction by sleeping.  It is simpler
+* to do this here rather than at the end, because so much complicated
+* logic happens below when statements finish.
+*/
+   if (throttle_delay && ! st->is_throttled)
+   {
+   /*
+* Use inverse transform sampling to randomly generate a delay

Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-07-22 Thread Greg Smith
The v3 patch applies perfectly here now.  Attached is a spreadsheet with 
test results from two platforms, a Mac laptop and a Linux server.  I 
used systems with high disk speed because that seemed like a worst case 
for this improvement.  The actual improvement for shrinking WAL should 
be even better on a system with slower disks.


There are enough problems with the "hundred tiny fields" results that I 
think this is not quite ready for commit yet.  More comments on that below. 
 This seems close though, close enough that I am planning to follow up 
to see if the slow disk results are better.


Reviewing the wal-update-testsuite.sh test program, I think the only 
case missing that would be useful to add is "ten tiny fields, one 
changed".  I think that one is interesting to highlight because it's 
what benchmark programs like pgbench do very often.


The GUC added by the program looks like this:

postgres=# show wal_update_compression_ratio ;
 wal_update_compression_ratio
------------------------------
 25

Is it possible to add a setting here that disables the feature altogether? 
 That always makes it easier to consider a commit, knowing people can 
roll back the change if it makes performance worse.  That would make 
performance testing easier too.  wal-update-testsuite.sh takes as long as 
13 minutes; that's long enough that I'd like the easier-to-automate 
comparison that disabling the GUC would add.  If that's not practical given the 
intrusiveness of the code, it's not really necessary.  I haven't looked 
at the change enough to be sure how hard this is.


On the Mac, the only case that seems to have a slowdown now is "hundred 
tiny fields, half nulled".  It would be nice to understand just what is 
going on with that one.  I got some ugly results in "two short fields, 
no change" too, along with a couple of other weird results, but I think 
those were testing procedure issues that can be ignored.  The pgbench 
throttle work I did recently highlights that I can't really make a Mac 
quiet/consistent for benchmarking very well.  Note that I ran all of the 
Mac tests with assertions on, to try and catch platform specific bugs. 
The Linux ones used the default build parameters.


On Linux "hundred tiny fields, half nulled" was also by far the worst 
performing one, with a 30% increase in duration despite the 14% drop in 
WAL.  Exactly what's going on there really needs to be investigated 
before this seems safe to commit.  All of the "hundred tiny fields" 
cases seem pretty bad on Linux, with the rest of them running about an 
11% duration increase.


This doesn't seem ready to commit for this CF, but the number of problem 
cases is getting pretty small now.  Now that I've gotten more familiar 
with the test programs and the feature, I can run more performance tests 
on this at any time really.  If updates addressing the trouble cases are 
ready from Amit or Hari before the next CF, send them out and I can look 
at them without waiting until that one starts.  This is a very promising 
looking performance feature.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


WAL-lz-v3.xls
Description: MS-Excel spreadsheet



Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-07-22 Thread Greg Smith

On 7/22/13 12:58 PM, Amit kapila wrote:

As per discussion, updated patch contains following changes:
1. Regression tests for Alter System are removed
2. Parsed the auto file automatically after parsing postgresql.conf
3. Removed addition of include directive in postgresql.conf
4. Removed error handling for parsing errors


These changes have shrunk the diff down to 1411 lines of code.

I'd like to identify which committer might take this on at this point. 
In a few respects this is Ready For Committer now, because the 
committer who takes this on is going to get a strong vote on how to 
resolve most of the remaining fundamental issues:


-Is this the point to finally commit to the config directory approach?

-If this does set the config directory usage precedent, is the name used 
for that appropriate?  Whoever suggested the change from "conf.d" to 
"config" made an error IMHO.  Every example I've found of other projects 
doing this style of config refactoring picked either conf.d or a 
unique, no-two-are-alike name.  I'd rather not see Postgres add yet 
another unique one.  (I think the 'postgresql' part of 
postgresql.auto.conf as the name of the file is redundant too--what else 
would be in the Postgres config directory but postgresql files?--but 
that's not a major issue)


This could definitely use a round of committer level review of how the 
GUC handling is being done now too.  That part of the code seems to have 
settled, and things like using the new validate_conf_option could be 
committed even with other parts still being discussed.  Exactly how to 
best break this out into useful commits is another decision that really 
needs some input from the potential committer though.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-07-22 Thread Greg Smith

On 7/22/13 2:57 PM, Andres Freund wrote:

* I'd be very surprised if this doesn't make WAL replay of update heavy
   workloads slower by at least factor of 2.


I was thinking about what a benchmark of WAL replay would look like last 
year.  I don't think that data is captured very well yet, and it should be.


My idea was to break the benchmark into two pieces.  One would take a 
base backup, then run a series of tests and archive the resulting 
WAL.  I doubt you can make a useful benchmark here without a usefully 
populated database; that's why the base backup step is needed.


The first useful result then is to measure how long commit/archiving 
took and the WAL volume, which is what's done by the test harness for 
this program.  Then the resulting backup would be set up for replay. 
Tarring up the backup and WAL archive could even give you a repeatable 
test set for cases where only the replay side is changing.  Then the 
main number that's useful, total replay time, would be measured.


The main thing I wanted this for wasn't for code changes; it was to 
benchmark configuration changes.  I'd like to be able to answer 
questions like which I/O scheduler is best for a standby in a way that 
has real test data behind it.  The same approach should be useful for 
answering your concerns about the replay performance impact of this 
change too though.



* It makes changeset extraction either more expensive or it would have
   to be disabled there.


That argues that, if this is committed at all, the ability to turn it off 
that I was asking about would be necessary.  It sounds like this *could* work like 
how minimal WAL archiving levels allow optimizations that are disabled 
at higher ones--like the COPY into a truncated/new table cheat.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [v9.4] row level security

2013-07-22 Thread Greg Smith

On 7/20/13 10:08 AM, Kohei KaiGai wrote:


With that change to expand_targetlist(), the change to getrelid() may
be unnecessary, and then also the new rowsec_relid field in
RangeTblEntry may not be needed.


Hmm. I didn't have this idea. It seems to me fair enough and kills
necessity to enhance RangeTblEntry and getrelid() indeed.
I try to fix up this implementation according to your suggestion.


Great, there's one useful bit of feedback for you then, and that seems 
to address Tom's getrelid concern too.


For the active CommitFest, I don't see any place we can go with this 
right now except for Returned with Feedback.  We really need more 
reviewers willing to put a significant amount of time into going through 
this code.


Anyone who would like to see RLS committed should consider how you can 
get resources to support a skilled PostgreSQL reviewer spending time on 
it.  (This is a bit much to expect new reviewers to chew on usefully) 
I've been working on that here, but I don't have anything I can publicly 
commit to yet.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List

2013-07-22 Thread Greg Smith

On 6/24/13 12:57 PM, Josh Berkus wrote:

Maciej is correct that this policy also belongs on the how to submit a
patch wiki page.  I will remedy that.


I just reviewed and heavily updated the new section you added to 
https://wiki.postgresql.org/wiki/Submitting_a_Patch  That included the 
idea that the reviewed patch should be similar in size/scope to the 
submitted one, as well as a slightly deeper discussion of how to fit this 
work into a feature review quote.


I find myself needing to explain this whole subject to potential feature 
sponsors enough that I've tried a few ways of describing it.  The 
closest analog I've found so far is the way carbon offset work is 
accounted for.  I sometimes refer to the mutual review as an offsetting 
review in conversation, and I threw that term into the guidelines as well.


As far as motivating new reviewers goes, let's talk about positive 
feedback.  Anything that complicates the release notes is a non-starter 
because that resource is tightly controlled by a small number of people, 
and it's trying to satisfy a lot of purposes.  What I would like to see 
is an official but simple Review Leaderboard for each release instead.


Each time someone writes a review and adds it to the CF app with a Review 
entry, the account that entered it gets a point.  Sum the points at the 
end and there's your weighted list for T-shirt prizes.  It should be 
possible to get that count with a trivial SELECT query out of the CF 
database, and then produce a simple HTML table at the end of each CF or 
release.  Anything that takes more work than that, and anything that 
takes *any* time that must come from a committer, is too much accounting.


This idea would be a bit unfair to people who review big patches instead 
of little ones.  But an approach that disproportionately rewards new 
contributors working on small things isn't that terrible.  As long as 
the review tests whether the code compiles and passes the regression 
tests, that's good enough to deserve a point.  I'd be happy if we 
suddenly had a problem where people were doing only that to try to game 
their leaderboard ranking.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List

2013-07-22 Thread Greg Smith

On 7/3/13 7:25 PM, Bruce Momjian wrote:

The extrapolation of Josh's approach is that committers
have to do work that the community wants to maintain their commit
rights, but their commit rights are helping the community, so why would
people care if you take them away --- you only hurt the community
further by doing so.


The main problem with having inactive committers (which I don't intend 
to include the important subject matter committers, who I'll get into at 
the end here) is that they skew the public information about who commits 
in a counterproductive way.  People visit 
https://wiki.postgresql.org/wiki/Committers , see that list of names, 
and then make conclusions based on its content.  And some of those 
conclusions are wrong because the data is inconsistent.  The standards 
listed are applied when new committers are promoted, but they are not 
applied symmetrically to remove ones who don't qualify anymore.


The #1 obstacle to my getting more time to work on core PostgreSQL is 
that companies presume regular submitters who are also non-committers 
don't do a very good job.  If you are competent and have a consistent 
track record of contributions to an open source project, the project 
would make you a committer, right?  Conversely, if you've been 
contributing for a while but aren't a committer, the most likely 
explanation is that your work quality is poor.  That is a completely 
reasonable viewpoint based on how most open source projects work.  The 
really terrible part is that it means the longer you've been submitting 
patches, the *less* competent you're assumed to be.  When I tell people 
I've been submitting things since 2007 but am not a committer, the only 
logical explanation is that my submissions must suck very hard, right?


From that perspective, people who are listed as committers but don't 
actively do work for the project are causing me a serious problem.  When 
someone who rarely commits can obviously qualify, that *proves* the bar 
for PostgreSQL committers is actually very low to casual observers. 
That's the message the project is inadvertently sending by leaving 
committers on there if they stop working actively.


The main thing I'd like to see is having the committer list, and its 
associated guidelines, updated to reflect that there are subject matter 
experts committing too.  That would pull them out of any "what have you 
done for me lately?" computations, and possibly open up a way to get 
more of them usefully.  Here are the first two obvious labels like that:


Michael Meskes (meskes) - embedded SQL
Teodor Sigaev (teodor) - inverted indexes

When even Josh Berkus doesn't know all of this information, it's 
clearly way too obscure to expect the rest of the world to figure it out.


It also boggles my mind that there isn't already an entry like this on 
there too:


Thom Browne - documentation

Each time Thom passes through yet another correction patch that is 
committed with no change, I find it downright bizarre that a community 
with such limited committer resources wastes their time with that 
gatekeeping.  The standards for nominating committers seem based on 
whether they can commit just about anything.  I think it's more 
important to consider whether people are trusted to keep commits within 
their known area(s) of expertise.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] Revive line type

2013-07-22 Thread Greg Smith

On 6/26/13 9:34 PM, Peter Eisentraut wrote:

Still wondering whether to use a A,B,C-based output
format per Tom's comment.


Wouldn't it also be helpful to remove "The points used in the output are 
not necessarily the points used on input" by making that not true?


There are three obvious ways you might output a line:

-Math class expectations of slope-intercept form:  y = mx + b. 
Intercept forms don't handle both horizontal and vertical lines though, 
so that's easily out.


-Pair of points.  A casual observer might get lucky and observe putting 
a pair of points in and getting the same pair back again, then 
incorrectly assume that's normally the case.  Seems better to never make 
that possible in the infinite line case.  I'm reminded of how row output 
usually being "in order" gives a bad impression about ordering guarantees in SQL.


-Ax+By+C=0

Outputting that third one, when it's also the internal form, handles any 
type of line; will show that any assumptions about individual points being 
preserved are wrong; and avoids rounding errors too.  The only downside 
seems to be that bounded lines are easier to show with a pair of points. 
 I think that suggests an alternate, secondary output format would be 
useful though, rather than that a different default one is a good idea.
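
To make the Ax+By+C=0 option concrete, here is a minimal sketch--mine, not 
anything from the patch, and line_coefficients is a made-up name--of how a 
pair of input points reduces to those coefficients.  Vertical and horizontal 
lines fall out with no special cases, which is exactly what the 
slope-intercept form can't manage:

/* Sketch only: derive A, B, C for the line through two distinct points.
 * The coefficients are only unique up to a scale factor. */
#include <stdio.h>

static void
line_coefficients(double x1, double y1, double x2, double y2,
                  double *A, double *B, double *C)
{
    *A = y2 - y1;
    *B = x1 - x2;
    *C = -(*A * x1 + *B * y1);
}

int
main(void)
{
    double      A, B, C;

    /* vertical line through (1,0) and (1,5): 5x + 0y - 5 = 0, i.e. x = 1 */
    line_coefficients(1, 0, 1, 5, &A, &B, &C);
    printf("{%g, %g, %g}\n", A, B, C);

    /* horizontal line through (0,2) and (3,2): 0x - 3y + 6 = 0, i.e. y = 2 */
    line_coefficients(0, 2, 3, 2, &A, &B, &C);
    printf("{%g, %g, %g}\n", A, B, C);
    return 0;
}

Since the coefficients are only defined up to scaling, a default output 
format along these lines would still want to pick a normalization so that 
equal lines always print identically.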


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [9.4 CF 1] And then there were 5

2013-07-22 Thread Greg Smith
After pushes from a few people, the remaining submissions are now 
waiting for commit.  I updated each of those to have the latest info in 
the CF app, and tried to identify what committers have already looked at 
them.


Access to calls stack from GET DIAGNOSTICS statement:  ?

Add more regression tests mostly for ALTER OPERATOR FAMILY .. ADD/DROP: 
 ?  Robert Haas gave a quick review last week that's generated an 
update since.  But that looks like it was just a quick scan for issues.


Remove unused targets from plan:  Alvaro?  (He reviewed it already)

SQL Command to edit postgresql.conf (ALTER SYSTEM patch):  Alvaro 
volunteered to look at this if no one else gets to it first.


UNNEST() (and other functions) WITH ORDINALITY:  Greg Stark

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-22 Thread Greg Smith

On 7/22/13 4:52 AM, KONDO Mitsumasa wrote:

The writeback source code which I indicated part of writeback is almost
same as community kernel (2.6.32.61). I also read linux kernel 3.9.7,
but it is almost same this part.


The main source code difference comes from going back to the RedHat 5 
kernel, which means 2.6.18.  For many of these versions, you are right 
that it is only the tuning parameters that were changed in newer versions.


Optimizing performance for the old RHEL5 kernel isn't the most important 
thing, but it's helpful to know the things it does very badly.



My fsync patch is only sleep returned succece of fsync and maximum sleep
time is set to 10 seconds. It does not cause bad for this problem.


It's easy to have hundreds of relations that are getting fsync calls 
during a checkpoint.  If you have 100 relations getting a 10 second 
sleep each, you could potentially delay checkpoints by 17 minutes this 
way.  I regularly see systems where shared_buffers=8GB and there are 200 
to 400 relation segments that need a sync during a checkpoint.


This is the biggest problem with your submission.  Once you give up 
following the checkpoint schedule carefully, it is very easy to end up 
with large checkpoint deadline misses on production servers.  If someone 
thinks they are doing a checkpoint every 5 minutes, but your patch makes 
them take 20 minutes instead, that is bad.  They will not expect that a 
crash might have to replay that much activity before the server is 
useful again.



You also don't seem afraid of how exceeding the
checkpoint timeout is a very bad thing yet.

I think it is important that why this problem was caused. We should try
to find the cause of which program has bug or problem.


The checkpointer process is the problem.  There's no filesystem bug or 
complicated issues involved in many of the bad cases.  Here is a simple 
example that shows how the toughest problem cases happen:


-64GB of RAM
-10% dirty_background_ratio = 6GB of dirty writes = 6144MB
-2MB/s random I/O when concurrent reads are heavy
-6144MB / 2MB/s = 3072 seconds to clear the cache = 51 minutes

That's how you get to an example like the one in my slides:

LOG: checkpoint complete: wrote 33282 buffers (3.2%); 0 transaction log 
file(s) added, 60 removed, 129 recycled; write=228.848 s, sync=4628.879 
s, total=4858.859 s


It's very hard to do better on these, and I don't expect any change to 
help this a lot.  But I don't want to see a change committed that makes 
this sort of checkpoint 17 minutes longer if there are 100 relations 
involved either.



My patch not only improvement of throughput but also
realize stable response time at fsync phase in checkpoint.


The main reason your patch improves latency and throughput is that it 
makes checkpoints farther apart.  That's why I drew you a graph showing 
how the time between checkpoints lined up perfectly with TPS.  If it was 
only a small problem it would be worth considering, but I think it's 
likely to end up with the sort of 15+ minute delays I've outlined here instead.



And I servey about ext3 file system.


I wouldn't worry too much about the problems ext3 has.  Like the old 
RHEL5 kernel I was commenting about above, there are a lot of ext3 
systems out there.  But we can't do a lot about getting good performance 
from them.  It's only important to test that you're not making them a 
lot worse with a change.



My system block size is 4096, but
8192 or more seems to better. It will decrease number of inode and get
more large sequential disk fields.


I normally increase read-ahead on Linux systems to get faster speed on 
sequential disk throughput.  Changing the block size might work better 
in some cases, but not many people are willing to do that.  Read-ahead 
is very easy to change at any time.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] Design proposal: fsync absorb linear slider

2013-07-22 Thread Greg Smith
Recently I've been dismissing a lot of suggested changes to checkpoint 
fsync timing without suggesting an alternative.  I have a simple one in 
mind that captures the biggest problem I see:  that the number of 
backend and checkpoint writes to a file are not connected at all.


We know that a 1GB relation segment can take a really long time to write 
out.  That could include up to 128 changed 8K pages, and we allow all of 
them to get dirty before any are forced to disk with fsync.


Rather than second guess the I/O scheduling, I'd like to take this on 
directly by recognizing that the size of the problem is proportional to 
the number of writes to a segment.  If you turned off fsync absorption 
altogether, you'd be at an extreme that allows only 1 write before 
fsync.  That's low latency for each write, but terrible throughput.  The 
maximum throughput case of 128 writes has the terrible latency we get 
reports about.  But what if that trade-off was just a straight, linear 
slider going from 1 to 128?  Just move it to the latency vs. throughput 
position you want, and see how that works out.


The implementation I had in mind was this one:

-Add an absorption_count to the fsync queue.

-Add a new latency vs. throughput GUC I'll call max_segment_absorb.  Its 
default value is -1 (or 0), which corresponds to ignoring this new behavior.


-Whenever the background writer absorbs an fsync call for a relation 
that's already in the queue, increment the absorption count.


-When max_segment_absorb > 0, have the background writer scan for relations 
where absorption_count > max_segment_absorb.  When it finds one, call 
fsync on that segment.
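
Putting those pieces together, here's a minimal sketch of the background 
writer side.  The struct and function names here are hypothetical, and the 
real fsync request queue bookkeeping lives in the checkpointer code, so 
treat this as pseudo-code in C syntax rather than a patch:

/* Sketch only.  max_segment_absorb is the proposed GUC; <= 0 keeps the
 * current "absorb everything until checkpoint" behavior. */
int         max_segment_absorb = -1;

typedef struct PendingFsyncEntry
{
    int         segment_id;         /* which 1GB relation segment */
    int         absorption_count;   /* writes absorbed since last fsync */
} PendingFsyncEntry;

/* Called when another write/fsync request arrives for a segment that is
 * already sitting in the queue. */
static void
absorb_fsync_request(PendingFsyncEntry *entry)
{
    entry->absorption_count++;
}

/* Periodic background writer scan: any segment that has absorbed more
 * writes than the slider allows gets an early fsync, rather than having
 * all of its accumulated writes pushed out by the checkpoint sync phase. */
static void
maybe_sync_hot_segments(PendingFsyncEntry *queue, int nentries)
{
    int         i;

    if (max_segment_absorb <= 0)
        return;

    for (i = 0; i < nentries; i++)
    {
        if (queue[i].absorption_count > max_segment_absorb)
        {
            /* issue fsync on this segment here (hand-waved) */
            queue[i].absorption_count = 0;
        }
    }
}

The bitmap variant described below would just replace absorption_count 
with per-block dirty bits, trading a little memory for accuracy.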


Note that it's possible for this simple scheme to be fooled when writes 
are actually touching a small number of pages.  A process that 
constantly overwrites the same page is the worst case here.  Overwrite 
it 128 times, and this method would assume you've dirtied every page, 
while only 1 will actually go to disk when you call fsync.  It's 
possible to track this better.  The count mechanism could be replaced 
with a bitmap of the 128 blocks, so that absorbs set a bit instead of 
incrementing a count.  My gut feel is that this is more complexity than 
is really necessary here.  If in fact the fsync is slimmer than 
expected, paying for it too much isn't the worst problem to have here.


I'd like to build this myself, but if someone else wants to take a shot 
at it I won't mind.  Just be aware the review is the big part here.  I 
should be honest about one thing: I have zero incentive to actually work 
on this.  The moderate amount of sponsorship money I've raised for 9.4 
so far isn't getting anywhere near this work.  The checkpoint patch 
review I have been doing recently is coming out of my weekend volunteer 
time.


And I can't get too excited about making this as my volunteer effort 
when I consider what the resulting credit will look like.  Coding is by 
far the smallest part of work like this, first behind coming up with the 
design in the first place.  And both of those are way, way behind how 
long review benchmarking takes on something like this.  The way credit 
is distributed for this sort of feature puts coding first, design not 
credited at all, and maybe you'll see some small review credit for 
benchmarks.  That's completely backwards from the actual work ratio.  If 
all I'm getting out of something is credit, I'd at least like it to be 
an appropriate amount of it.

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [9.4 CF 1] And then there were 5

2013-07-22 Thread Greg Smith

On 7/22/13 10:48 PM, Tom Lane wrote:

Greg Smith g...@2ndquadrant.com writes:

Remove unused targets from plan:  Alvaro?  (He reviewed it already)


Really I should do that one, but it seems like all my available cycles
have been going into bug fixing lately :-(


I just put you down as the committer on it, given Alvaro waved that way too.

A further look at recent ALTER OPERATOR FAMILY changes shows that 
Robert has been working on those quite a bit.  I'm putting him down as 
the committer on this last one.


If the CF drags on a bit due to committer time being spent on bugs, 
given the date here I think that's OK.  I would put next week as a loose 
target for finishing these off.  I worry a lot less about committers 
keeping to their schedules than when there are 50 regular contributors 
still moving around.


That leaves only 1 patch where I have no idea who will commit it:

  access to calls stack from GET DIAGNOSTICS statement

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-20 Thread Greg Smith

On 7/20/13 4:48 AM, didier wrote:

With your tests did you try to write the hot buffers first? ie buffers
with a high  refcount, either by sorting them on refcount or at least
sweeping the buffer list in reverse?


I never tried that version.  After a few rounds of seeing that all 
changes I tried were just rearranging the good and bad cases, I got 
pretty bored with trying new changes in that same style.



by writing to the OS the less likely to be recycle buffers first it may
have less work to do at fsync time, hopefully they have been written by
the OS background task during the spread and are not re-dirtied by other
backends.


That is the theory.  In practice write caches are so large now, there is 
almost no pressure forcing writes to happen until the fsync calls show 
up.  It's easily possible to enter the checkpoint fsync phase only to 
discover there are 4GB of dirty writes ahead of you, ones that have 
nothing to do with the checkpoint's I/O.


Backends are constantly pounding the write cache with new writes in 
situations with checkpoint spikes.  The writes and fsync calls made by 
the checkpoint process are only a fraction of the real I/O going on. 
The volume of data being squeezed out by each fsync call is based on 
total writes to that relation since the checkpoint.  That's connected to 
the writes to that relation happening during the checkpoint, but the 
checkpoint writes can easily be the minority there.


It is not a coincidence that the next feature I'm working on attempts to 
quantify the total writes to each 1GB relation chunk.  That's the most 
promising path forward on the checkpoint problem I've found.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-19 Thread Greg Smith
 was measuring my theories as 
directly as possible.



I would like to make a itemizing list which can be proof of my patch
from you. Because DBT-2 benchmark spent lot of time about 1 setting test
per 3 - 4 hours.


That's great, but to add some perspective here I have spent over 1 year 
of my life running tests like this.  The development cycle to do 
something useful in this area is normally measured in months of machine 
time running benchmarks, not hours or days.  You're doing well so far, 
but you're just getting started.


My itemized list is simple:  throw out all results where the checkpoint 
end goes more than 5% beyond its targets.  When that happens, no matter 
what you think is causing your gain, I will assume it's actually less 
total writes that are improving things.


I'm willing to consider an optional, sloppy checkpoint approach that 
uses heavy load to adjust how often checkpoints happen.  But if we're 
going to do that, it has to be extremely clear that the reason for the 
gain is the checkpoint spacing--and there is going to be a crash 
recovery time penalty paid for it.  And this patch is not how I would do 
that.


It's not really clear yet where the gains you're seeing are really 
coming from.  If you re-ran all your tests with pg_stat_bgwriter 
before/after snapshots, logged every fsync call, and then build some 
tools to analyze the fsync call latency, then you'll have enough data to 
talk about this usefully.  That's what I consider the bare minimum 
evidence to consider changing something here.  I have all of those 
features in pgbench-tools with checkpoint logging turned way up, but 
they're not all in the dbt2 toolset yet as far as I know.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Greg Smith
Please stop all this discussion of patents in this area.  Bringing up a 
US patent here makes US list members more likely to be treated as 
willful infringers of that patent: 
http://www.ipwatchdog.com/patent/advanced-patent/willful-patent-infringement/ 
if the PostgreSQL code duplicates that method one day.


The idea of surveying patents in some area so that their methods can be 
avoided in code you develop, that is a reasonable private stance to 
take.  But don't do that on the lists.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Greg Smith

On 7/18/13 11:04 AM, Robert Haas wrote:

On a system where fsync is sometimes very very slow, that
might result in the checkpoint overrunning its time budget - but SO
WHAT?


Checkpoints provide a boundary on recovery time.  That is their only 
purpose.  You can always do better by postponing them, but you've now 
changed the agreement with the user about how long recovery might take.


And if you don't respect the checkpoint boundary, what you can't do is 
then claim better execution performance than something that did.  It's 
always possible to improvement throughput by postponing I/O.  SO WHAT? 
If that's OK, you don't need complicated logic to do that.  Increase 
checkpoint_timeout.  The system with checkpoint_timeout at 6 minutes 
will always outperform one where it's 5.


You don't need to introduce a feedback loop--something that has 
significant schedule stability implications if it gets out of 
control--just to spread I/O out further.  I'd like to wander down the 
road of load-sensitive feedback for database operations, especially for 
maintenance work.  But if I build something that works mainly because it 
shifts the right edge of the I/O deadline forward, I am not fooled into 
thinking I did something awesome.  That's cheating, getting better 
performance mainly by throwing out the implied contract with the 
user--the one over their expected recovery time after a crash.  And I'm 
not excited about complicating the PostgreSQL code to add a new way to 
do that, not when checkpoint_timeout is already there with a direct, 
simple control on the exact same trade-off.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Greg Smith

On 7/18/13 12:00 PM, Alvaro Herrera wrote:

I think the idea is to have a system in which most of the time the
recovery time will be that for checkpoint_timeout=5, but in those
(hopefully rare) cases where checkpoints take a bit longer, the recovery
time will be that for checkpoint_timeout=6.


I understand the implementation.  My point is that if you do that, the 
fair comparison is to benchmark it against a current system where 
checkpoint_timeout=6 minutes.  That is a) simpler, b) requires no code 
change, and c) makes the looser standards the server is now settling for 
transparent to the administrator.  Also, my expectation is that it would 
perform better all of the time, not just during the periods this new 
behavior kicks in.


Right now we have checkpoint_completion_target as a GUC for controlling 
what's called the spread of a checkpoint over time.  That sometimes goes 
over, but that's happening against the best attempts of the server to do 
better.


The first word that comes to mind for just disregarding the end time 
is that it's a "sloppy" checkpoint.  There are all sorts of sloppy behaviors 
you might do here, but I've worked under the assumption that ignoring 
the contract with the administrator was frowned on by this project.  If 
people want this sort of behavior in the server, I'm satisfied my 
distaste for the idea and the reasoning behind it is clear now.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-07-18 Thread Greg Smith

On 7/9/13 12:09 AM, Amit Kapila wrote:

   I think the first thing to verify is whether the results posted can be 
validated in some other environment setup by another person.
   The testcase used is posted at below link:
   http://www.postgresql.org/message-id/51366323.8070...@vmware.com


That seems easy enough to do here, Heikki's test script is excellent. 
The latest patch Hari posted on July 2 has one hunk that doesn't apply 
anymore now.  Inside src/backend/utils/adt/pg_lzcompress.c the patch 
tries to change this code:


-   if (hent)
+   if (hentno != INVALID_ENTRY)

But that line looks like this now:

if (hent != INVALID_ENTRY_PTR)

Definitions of those:

#define INVALID_ENTRY   0
#define INVALID_ENTRY_PTR   (&hist_entries[INVALID_ENTRY])

I'm not sure if different error handling may be needed here now due to the 
commit that changed this, or if the patch wasn't referring to the right 
type of error originally.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] hardware donation

2013-07-18 Thread Greg Smith

On 7/10/13 12:53 PM, Benedikt Grundmann wrote:

The server will probably be most interesting for the disks in it.  That
is where we spend the largest amount of time optimizing (for sequential
scan speed in particular):
22x600GB disks in a Raid6+0 (Raid0 of 2x 10disk raid 6 arrays) + 2 spare
disks. Overall size 8.7 TB in that configuration.


What is the RAID controller used in the server?  That doesn't impact the 
donation, I'm just trying to fit this one into my goals for finding 
useful community performance testing equipment.


There are a good number of systems floating around the community with HP 
controllers--I have even one myself now--but we could use more LSI Logic 
and Adaptec based systems.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-18 Thread Greg Smith

On 7/18/13 6:45 PM, Tatsuo Ishii wrote:

I'm not a native English speaker either... Greg, could you please
review the document?


Yes, I already took a look at it briefly.  The updates move in the 
right direction, but I can edit them usefully before commit.  I'll have 
that done by tomorrow and send out a new version.  I'm hopeful that v18 
will finally be the one that everyone likes.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-07-18 Thread Greg Smith

On 7/18/13 4:02 PM, Alvaro Herrera wrote:


I think we should just put the config directives of ALTER SYSTEM into a
file, not dir, alongside postgresql.conf; I would suggest
postgresql.auto.conf.  This file is parsed automatically after
postgresql.conf, without requiring an include directive in
postgresql.conf.


It is possible to build ALTER SYSTEM SET this way.  One of the goals of 
the implementation style used here wasn't just to accomplish that narrow 
feature though.  We keep running into situations where a standard, 
directory based configuration system would make things easier.  Each 
time that happens, someone comes up with another way to avoid doing it. 
 I think that approach contributes to stagnation in the terrible status 
quo of PostgreSQL configuration management.


I thought this was a good spot to try and re-draw this line because I 
don't want just one program that is able to create new configuration 
entries easily.  I want to see a whole universe of them.  ALTER SYSTEM 
SET, tuning helpers, replication helpers, logging helpers, vacuum 
schedulers.  All of them *could* just dump a simple file into a config 
directory with code anyone can write.  And having ALTER SYSTEM SET do 
that provides a strong precedent for how it can be done.  (I'd like to 
see initdb do that instead of hacking the system postgresql.conf as if 
sed-style edits were still the new hotness, but that's a future change)


My claim, and that's one informed by writing pgtune, is that by far the 
hardest part of writing a configuration addition tool is parsing a 
postgresql.conf file.  The expectation right now is that all changes 
must happen there.  Creating a new snippet of configuration settings is 
easy, but no tools know where to put one right now.  Instead we just 
keep coming up with single, hard-coded file names that people have to 
know in order to manage their installations.



This seems the simplest way; config tools such as Puppet know they
always need to consider the postgresql.auto.conf file.


Like this idea.  What administrators really want, whether they realize 
it or not, is to point Puppet at a configuration directory.  Then the 
problem of "what are the config files in the new version of Postgres?" 
happens once more and then it's over.  Exactly what the files in there 
are named should be completely under control of the administrator.


We're never going to reach that unless we lead by example though.  The 
database's configuration pushes people toward using a small number of 
files with magic names--postgresql.conf, recovery.conf, and now 
postgresql.auto.conf in your proposal.  Meanwhile, all sensible 
UNIX-style projects with complicated configurations in text files have 
moved toward a configuration directory full of them.  Here are some 
directories on the last RHEL6 system I was editing configuration on this 
week:


/etc/httpd/conf.d/
/etc/munin/conf.d/
/etc/ld.so.conf.d/
/etc/dovecot/conf.d/
/etc/yum/pluginconf.d/

Some of them didn't get the memo that the right standard name is conf.d 
now, but they're the minority.


It's fine to have a postgresql.conf file that you *can* make all your 
changes to, for people who want to stay in the old monolithic approach. 
 But if there were also a conf.d directory under there, many classes of 
administrator would breathe a sigh of relief.  With all the precedents 
people may have already run into, the right naming can make it 
discoverable and familiar to a lot of administrators.


Telling them instead that there's this new postgresql.auto.conf file 
that suddenly they have to worry about--I'd say don't even bother if 
that's how you want to do it.  That's making the problem I've been 
trying to simplify for five years now even worse.



I think the whole business of parsing the file, and then verifying
whether the auto file has been parsed, is nonsensical and should be
removed from the patch.


That was just trying to keep people from screwing up their configuration 
while they get used to things.  That and some of the other sanity 
checking is not necessary, it was just trying to make the transition to 
the new configuration approach less error-prone.  I don't think anyone 
would disagree that the current patch is doing enough of that error 
checking work that the error checking itself is the most likely thing to 
break.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [v9.4] row level security

2013-07-18 Thread Greg Smith

On 7/18/13 11:03 PM, Stephen Frost wrote:

Wasn't there a wiki page about this
feature which might also help?  Bigger question- is it correct for the
latest version of the patch?


https://wiki.postgresql.org/wiki/RLS has accumulated a lot of older 
debris that may or may not be useful here.


This useful piece was just presented at PGCon: 
http://www.pgcon.org/2013/schedule/attachments/273_PGcon2013-kaigai-row-level-security.pdf

That is very up to date intro to the big picture issues.

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-17 Thread Greg Smith

On 7/17/13 2:31 AM, Tatsuo Ishii wrote:

./pgbench -p 5433 -S -T 10 -R 1 test
average rate limit lag: 862.534 ms (max 2960.913 ms)
tps = 7133.745911 (including connections establishing)
tps = 7135.130810 (excluding connections establishing)

What does average rate limit lag mean?


The whole concept of lag with the rate limit is complicated.  At one 
point I thought this should be a debugging detail, rather than exposing 
the user to it.


The problem is that if you do that, you might not notice that your limit 
failed to work as expected.  Maybe it's good enough in a case like this 
that the user will see they tried to limit at 1, but they only got 
7135, so something must not have worked as expected.


Tatsuo:  most of my tests were on Mac OS and Linux, I actually tested 
the Mac version a lot more than any other here.  I didn't do any testing 
on Windows.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-17 Thread Greg Smith

On 7/16/13 11:36 PM, Ants Aasma wrote:

As you know running a full suite of write benchmarks takes a very long
time, with results often being inconclusive (noise is greater than
effect we are trying to measure).


I didn't say that.  What I said is that over a full suite of write 
benchmarks, the effect of changes like this has always averaged out to 
zero.  You should try it sometime.  Then we can have a useful discussion 
of non-trivial results instead of you continuing to tell me I don't 
understand things.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-17 Thread Greg Smith

On 7/17/13 9:16 PM, Tatsuo Ishii wrote:

Now suppose we have 3 transactions and each has following values:

d(0) = 10
d(1) = 20
d(2) = 30

t(0) = 100
t(1) = 110
t(2) = 120

That says pgbench expects the duration 10 for each
transaction. Actually, the first transaction runs slowly for some
reason and the lag = 100 - 10 = 90. However, tx(1) and tx(2) are
finished on schedule because they spend only 10 (110-10 = 10, 120-110
= 10). So the expected average lag would be 90/3 = 30.


The clients are not serialized here in any significant way, even when 
they share a single process/thread.  I did many rounds of tracing 
through this code with timestamps on each line, and the sequence of 
events here will look like this:


client 0:  send SELECT... to server.  yield to next client.
client 1:  send SELECT... to server.  yield to next client.
client 2:  send SELECT... to server.  yield to next client.
select():  wait for the first response from any client.
client 0:  receive response.  complete transaction, compute lag.
client 1:  receive response.  complete transaction, compute lag.
client 2:  receive response.  complete transaction, compute lag.

There is nothing here that is queuing the clients one after the other. 
If (0) takes 100ms before its reply comes back, (1) and (2) can receive 
their reply back and continue forward at any time.  They are not waiting 
for (0); it has yielded control while waiting for a response.  All three 
times are independent once you reach the select() point where all are 
active.


In this situation, if the server gets stuck doing something such that it 
takes 100ms before any client receives a response, it is correct to 
penalize every client for that latency.  All three clients could have 
received the information earlier if the server had any to send them.  If 
they did not, they all were suffering from some sort of lag.


I'm not even sure why you spaced the start times out at intervals of 10. 
 If I were constructing an example like this, I'd have them start at 
times of 0, 1, 2--as fast as the CPU can fire off statements 
basically--and then start waiting from that point.  If client 1 takes 10 
units of time to send its query out before client 2 runs, and the rate 
goal requires 10 units of time, the rate you're asking for is impossible.


For sorting out what's going on with your two systems, I would recommend 
turning on debugging output with -d and looking at the new 
per-transaction latency numbers that the feature reports.  If your 
theory that the lag is going up as the test proceeds is true, that 
should show up in the individual latency numbers too.


Based on what I saw during weeks of testing here, I would be more 
suspicious that there's a system level difference between your two 
servers than to blame the latency calculation.  I saw a *lot* of weird 
system issues myself when I started looking that carefully at sustained 
throughput.  The latency reports from the perspective of Fabien's code 
were always reasonable though.  When something delays every client, it 
counts that against every active client's lag, and that's the right 
thing to do.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-17 Thread Greg Smith

On 7/17/13 11:34 PM, Tatsuo Ishii wrote:

My example is for 1 client case. So concurrent clients are not the
issue here.


Sorry about that, with your clarification I see what you were trying to 
explain now.  The code initializes the target time like this:


thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);

And then each time a transaction fires, it advances the reference time 
forward based on the expected rate:


thread->throttle_trigger += wait;

It does *not* reset thread->throttle_trigger based on when the previous 
transaction ended / when the next transaction started.  If the goal is 
10us transaction times, it beats a steady drum saying the transactions 
should come at 10us, 20us, 30us (on average--there's some randomness in 
the goals).  It does not pay any attention to when the previous 
transactions finished.


That means that if an early transaction takes an extra 1000us, every 
transaction after that will also show as 1000us late--even if all of 
them take 10us.  You expect that those later transactions will show 0 
lag, since they took the right duration.  For that to happen, 
thread->throttle_trigger would need to be re-initialized with the 
current time at the end of each completed transaction.
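
To spell out the two definitions being discussed, here is a simplified 
sketch--the ThrottleState struct and function names are stand-ins of mine, 
not pgbench's actual state machine variables--contrasting the two ways of 
advancing the trigger time:

/* Sketch only.  "now" is the time at which lag gets measured. */
#include <stdint.h>

typedef struct ThrottleState
{
    int64_t     throttle_trigger;   /* scheduled time, in microseconds */
} ThrottleState;

/* Current behavior: the schedule marches forward on its own, so one slow
 * transaction makes every later one report as late too. */
static int64_t
schedule_based_lag(ThrottleState *t, int64_t wait, int64_t now)
{
    int64_t     lag = now - t->throttle_trigger;

    t->throttle_trigger += wait;        /* next slot relative to the schedule */
    return lag;
}

/* The alternative definition: re-anchor the schedule each time, so only
 * transactions that themselves overran their slot report any lag. */
static int64_t
reset_based_lag(ThrottleState *t, int64_t wait, int64_t now)
{
    int64_t     lag = now - t->throttle_trigger;

    t->throttle_trigger = now + wait;   /* next slot relative to "now" */
    return lag;
}

Either definition can be computed from the same information; the question 
is really which number matches what people expect the report to mean.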


The lag computation was not the interesting part of this feature to me. 
 As I said before, I considered it more of a debugging level thing than 
a number people would analyze as much as you did.  I understand why you 
don't like it though.  If the reference time was moved forward to match 
the transaction end each time, I think that would give the lag 
definition you're looking for.  That's fine to me too, if Fabien doesn't 
have a good reason to reject the idea.  We would need to make sure that 
doesn't break some part of the design too.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-16 Thread Greg Smith

On 7/16/13 12:46 PM, Ants Aasma wrote:


Spread checkpoints sprinkles the writes out over a long
period and the general tuning advice is to heavily bound the amount of
memory the OS willing to keep dirty.


That's arguing that you can make this feature be useful if you tune in a 
particular way.  That's interesting, but the goal here isn't to prove 
the existence of some workload that a change is useful for.  You can 
usually find a test case that validates any performance patch as helpful 
if you search for one.  Everyone who has submitted a sorted checkpoint 
patch for example has found some setup where it shows significant gains. 
 We're trying to keep performance stable across a much wider set of 
possibilities though.


Let's talk about default parameters instead, which quickly demonstrates 
where your assumptions fail.  The server I happen to be running pgbench 
tests on today has 72GB of RAM running SL6 with RedHat derived kernel 
2.6.32-358.11.1.  This is a very popular middle grade server 
configuration nowadays.  There dirty_background_ratio and 
dirty_ratio are 10 (percent).  That means that roughly 7GB of 
RAM can be used for write caching.  Note that this is a fairly low write 
cache tuning compared to a survey of systems in the field--lots of 
people have servers with earlier kernels where these numbers can be as 
high as 20 or even 40% instead.


The current feasible tuning for shared_buffers suggests a value of 8GB 
is near the upper limit, beyond which cache related overhead makes 
increases counterproductive.  Your examples are showing 53% of 
shared_buffers dirty at checkpoint time; that's typical.  The 
checkpointer is then writing out just over 4GB of data.


With that background, which process here has more data to make decisions with?

-The operating system has 7GB of writes it's trying to optimize.  That 
potentially includes backend, background writer, checkpoint, temp table, 
statistics, log, and WAL data.  The scheduler is also considering read 
operations.


-The checkpointer process has 4GB of writes from rarely written shared 
memory it's trying to optimize.


This is why if you take the opposite approach of yours today--go 
searching for workloads where sorting is counterproductive--those are 
equally easy to find.  Any test of write speed I do starts with about 50 
different scale/client combinations.  Why do I suggest pgbench-tools as 
a way to do performance tests?  It's because an automated sweep of 
client setups like it does is the minimum necessary to create enough 
variation in workload for changing the database's write path.  It's 
really amazing how often doing that shows a proposed change is just 
shuffling the good and bad cases around.  That's been the case for every 
sorting and fsync delay change submitted so far.  I'm not even 
interested in testing today's submission because I tried that particular 
approach for a few months, twice so far, and it fell apart on just as 
many workloads as it helped.



The checkpointer has the best long term overview of the situation here, OS
scheduling only has the short term view of outstanding read and write
requests.


True only if shared_buffers is large compared to the OS write cache, 
which was not the case on the example I generated with all of a minute's 
work.  I regularly see servers where Linux's Dirty area becomes a 
multiple of the dirty buffers written by a checkpoint.  I can usually 
make that happen at will with CLUSTER and VACUUM on big tables.  The 
idea that the checkpointer has a long-term view while the OS has a short 
one, that presumes a setup that I would say is possible but not common.



kernel settings: dirty_background_bytes = 32M,
dirty_bytes = 128M.


You disclaimed this as a best case scenario.  It is a low throughput / 
low latency tuning.  That's fine, but if Postgres optimizes itself 
toward those cases it runs the risk of high throughput servers with 
large caches being detuned.  I've posted examples before showing very 
low write caches like this leading to VACUUM running at 1/2 its normal 
speed or worse, as a simple example of where a positive change in one 
area can backfire badly on another workload.  That particular problem 
was so common I updated pgbench-tools recently to track table 
maintenance time between tests, because that demonstrated an issue even 
when the TPS numbers all looked fine.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-15 Thread Greg Smith
To clarify what state this is all in:  Fabien's latest 
pgbench-throttle-v15.patch is the "ready for a committer" version.  The 
last two revisions are just tweaking the comments at this point, and his 
version is more correct than my last one.


My little pgbench-delay-finish-v1.patch is a brand new bug fix of sorts 
that, while trivial, isn't ready for commit yet.  I'll start a separate 
e-mail thread and CF entry for that later.  Fabien has jumped into 
initial review comments of that already below, but the throttle feature 
isn't dependent on that.  The finish delay just proves that the latency 
spikes I was getting here aren't directly tied to the throttle feature.


On 7/14/13 5:42 PM, Fabien COELHO wrote:

* ISTM that the impact of the chosen 1000 should appear somewhere.


I don't have a problem with that, but I didn't see that the little table 
you included was enough to do that.  I think if someone knows how this 
type of random generation works, they don't need the comment to analyze 
the impact.  And if they don't know, that comment alone wasn't enough to 
help them figure it out.  That's why I added some terms that might help 
point the right way for someone who wanted to search for more 
information instead.


The text of pgbench is not really the right place to put a lecture about 
how to generate numbers with a target probability distribution function. 
 Normally the code comments tries to recommend references for this sort 
of thing instead.  I didn't find a really good one in a quick search though.



About 123456 12345 vs 123456.012345: My data parser is usually gnuplot
or my eyes, and in both cases the later option is better:-)


pgbench-tools uses gnuplot too.  If I were doing this again today from 
scratch, I would recommend using the epoch time format compatible with 
it you suggested.  I need to look into this whole topic a little more 
before we get into that though.  This patch just wasn't the right place 
to get into that change.



About the little patch: Parameter ok should be renamed to something
meaningful (say do_finish?).


It's saying if the connection finished ok or not.  I think exactly 
what is done with that information is an implementation detail that 
doesn't need to be exposed to the function interface.  We might change 
how this is tied to PQfinish later.



Also, it seems that when timer is
exceeded in doCustom it is called with true, but maybe you intended that
it should be called with false??


The way timeouts are handled right now is a known messy thing.  Exactly 
what you should do with a client that has hit one isn't obvious.  Try 
again?  Close the connection and abort?  The code has a way it handles 
that now, and I didn't want to change it any.



it is important to remove the connection because it serves as a marker
to know whether a client must run or not.


This little hack moved around how clients finished enough to get rid of 
the weird issue with your patch I was bothered by.  You may be right 
that the change isn't really correct because of how the connection is 
compared to null as a way to see if it's active.  I initially added a 
more complicated "finished" state to the whole mess that tracked this 
more carefully.  I may need to return to that idea now.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-14 Thread Greg Smith

On 7/13/13 12:13 PM, Fabien COELHO wrote:

My 0.02€: if it means adding complexity to the pgbench code, I think
that it is not worth it. The point of pgbench is to look at a steady
state, not to end in the most graceful possible way as far as measures
are concerned.


That's how some people use pgbench.  I'm just as likely to use it to 
characterize system latency.  If there's a source of latency that's 
specific to the pgbench code, I want that out of there even if it's hard.


But we don't have to argue about that because it isn't.  The attached 
new patch seems to fix the latency spikes at the end, with -2 lines of 
new code!  With that resolved I did a final pass across the rate limit 
code too, attached as a v14 and ready for a committer.  I don't really 
care what order these two changes are committed, there's no hard 
dependency, but I would like to see them both go in eventually.


No functional code was changed from your v13 except for tweaking the 
output.  The main thing I did was expand/edit comments and rename a few 
variables to try and make this easier to read.  If you have any 
objections to my cosmetic changes feel free to post an update.  I've put 
a good bit of time into trying to simplify this further, thinking it 
can't really be this hard.  But this seems to be the minimum complexity 
that still works given the mess of the pgbench state machine.  Every 
change I try now breaks something.


To wrap up the test results motivating my little pgbench-delay-finish 
patch, the throttled cases that were always giving 100ms of latency 
clustered at the end here now look like this:


average rate limit lag: 0.181 ms (max 53.108 ms)
tps = 10088.727398 (including connections establishing)
tps = 10105.775864 (excluding connections establishing)

There are still some of these cases where latency spikes, but they're 
not as big and they're randomly distributed throughout the run.  The 
problem I had with the ones at the end is how they tended to happen a 
few times in a row.  I kept seeing multiple of these ~50ms lulls adding 
up to a huge one, because the source of the lag kept triggering at every 
connection close.


pgbench was already cleaning up all of its connections at the end, after 
all the transactions were finished.  It looks safe to me to just rely on 
that for calling PQfinish in the normal case.  And calls to client_done 
already label themselves "ok" or "abort"; the code just didn't do anything 
with that state before.  I tried adding some more complicated state 
tracking, but that adds complexity while doing the exact same thing as 
the simple implementation I did.


The only part of your code change I reverted was altering the latency 
log transaction timestamps to read like 1373821907.65702 instead of 
1373821907 65702.  Both formats were considered when I added that 
feature, and I completely understand why you'd like to change it.  One 
problem is that doing so introduces a new class of float parsing and 
rounding issues to consumers of that data.  I'd rather not revisit that 
without a better reason to break the output format.  Parsing tools like 
my pgbench-tools already struggle trying to support multiple versions of 
pgbench, and I don't think there's enough benefit to the float format to 
bother breaking them today.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
new file mode 100644
index 08095a9..8dc81e5
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
*** preparedStatementName(char *buffer, int 
*** 862,875 
  static bool
  clientDone(CState *st, bool ok)
  {
!   /*
!* When the connection finishes normally, don't call PQfinish yet.
!* PQfinish can cause significant delays in other clients that are
!* still running.  Rather than finishing all of them here, in the
!* normal case clients are instead closed in bulk by disconnect_all,
!* after they have all stopped.
!*/
!   if ((st->con != NULL) && ok)
{
PQfinish(st->con);
st->con = NULL;
--- 862,870 
  static bool
  clientDone(CState *st, bool ok)
  {
!   (void) ok;  /* unused */
! 
!   if (st->con != NULL)
{
PQfinish(st->con);
st->con = NULL;
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 80203d6..da88bd7 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@ intunlogged_tables = 0;
 double sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec.  0 is the default and means no throttling.
+ */
+int64  throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char  *tablespace = NULL;

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 6/16/13 10:27 AM, Heikki Linnakangas wrote:

Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images...
Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.


That's exactly right.  When a checkpoint finishes the OS write cache is 
clean.  That means all of the full-page writes aren't even hitting disk 
in many cases.  They just pile up in the OS dirty memory, often sitting 
there all the way until when the next checkpoint fsyncs start.  That's 
why I never wandered down the road of changing FPW behavior.  I have 
never seen a benchmark workload hit a write bottleneck until long after 
the big burst of FPW pages is over.


I could easily believe that there are low-memory systems where the FPW 
write pressure becomes a problem earlier.  And slim VMs are a plausible 
place for this behavior to show up.


I'm a big fan of instrumenting the code around a performance change 
before touching anything, as a companion patch that might make sense to 
commit on its own.  In the case of a change to FPW spacing, I'd want to 
see some diagnostic output in something like pg_stat_bgwriter that 
tracks how many FPW pages are being modified.  A 
pgstat_bgwriter.full_page_writes counter would be perfect here, and then 
graph that data over time as the benchmark runs.



Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smoothen WAL write I/O, but it would be interesting to try.


There I also think the right way to proceed is instrumenting that area 
first.



A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.


I updated and re-reviewed that in 2011: 
http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com 
and commented on why I think the improvement was difficult to reproduce 
back then.  The improvement didn't follow for me either.  It would take 
a really amazing bit of data to get me to believe write sorting code is 
worthwhile after that.  On large systems capable of dirtying enough 
blocks to cause a problem, the operating system and RAID controllers are 
already sorting blocks.  And *that* sorting is also considering 
concurrent read requests, which are a lot more important to an efficient 
schedule than anything the checkpoint process knows about.  The database 
doesn't have nearly enough information yet to compete against OS level 
sorting.




Bad point of my patch is longer checkpoint. Checkpoint time was
increased about 10% - 20%. But it can work correctry on schedule-time in
checkpoint_timeout. Please see checkpoint result (http://goo.gl/NsbC6).


For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
more lazy checkpoint.


Heikki has nailed the problem with the submitted dbt-2 results here.  If 
you spread checkpoints out more, you cannot fairly compare the resulting 
TPS or latency numbers anymore.


Simple example:  20 minute long test.  Server A does a checkpoint every 
5 minutes.  Server B has modified parameters or server code such that 
checkpoints happen every 6 minutes.  If you run both to completion, A 
will have hit 4 checkpoints that flush the buffer cache, B only 3.  Of 
course B will seem faster.  It didn't do as much work.


pgbench_tools measures the number of checkpoints during the test, as 
well as the buffer count statistics.  If those numbers are very 
different between two tests, I have to throw them out as unfair.  A lot 
of things that seem promising turn out to have this sort of problem.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 6/27/13 11:08 AM, Robert Haas wrote:

I'm pretty sure Greg Smith tried it the fixed-sleep thing before and
it didn't work that well.


That's correct, I spent about a year whipping that particular horse and 
submitted improvements on it to the community. 
http://www.postgresql.org/message-id/4d4f9a3d.5070...@2ndquadrant.com 
and its updates downthread are good ones to compare this current work 
against.


The important thing to realize about just delaying fsync calls is that 
it *cannot* increase TPS throughput.  Not possible in theory, obviously 
doesn't happen in practice.  The most efficient way to write things out 
is to delay those writes as long as possible.  The longer you postpone a 
write, the more elevator sorting and write combining you get out of the 
OS.  This is why operating systems like Linux come tuned for such 
delayed writes in the first place.  Throughput and latency are linked; 
any patch that aims to decrease latency will probably slow throughput.


Accordingly, the current behavior--no delay--is already the best 
possible throughput.  If you apply a write timing change and it seems to 
increase TPS, that's almost certainly because it executed less 
checkpoint writes.  It's not a fair comparison.  You have to adjust any 
delaying to still hit the same end point on the checkpoint schedule. 
That's what my later submissions did, and under that sort of controlled 
condition most of the improvements went away.


Now, I still do really believe that better spacing of fsync calls helps 
latency in the real world.  Far as I know the server that I developed 
that patch for originally in 2010 is still running with that change. 
The result is not a throughput change though; there is a throughput drop 
with a latency improvement.  That is the unbreakable trade-off in this 
area if all you touch is scheduling.


The reason why I was ignoring this discussion and working on pgbench 
throttling until now is that you need to measure latency at a constant 
throughput to advance here on this topic, and that's exactly what the 
new pgbench feature enables.  If we can take the current checkpoint 
scheduler and an altered one, run both at exactly the same rate, and one 
gives lower latency, now we're onto something.  It's possible to do that 
with DBT-2 as well, but I wanted something really simple that people 
could replicate results with in pgbench.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-14 Thread Greg Smith

On 7/14/13 2:48 PM, Fabien COELHO wrote:

You attached my v13. Could you send your v14?


Correct patch (and the little one from me again) attached this time.

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
new file mode 100644
index 08095a9..6564057
*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
*** int unlogged_tables = 0;
*** 137,142 
--- 137,148 
  doublesample_rate = 0.0;
  
  /*
+  * When threads are throttled to a given rate limit, this is the target delay
+  * to reach that rate in usec.  0 is the default and means no throttling.
+  */
+ int64 throttle_delay = 0;
+ 
+ /*
   * tablespace selection
   */
  char *tablespace = NULL;
*** typedef struct
*** 200,210 
--- 206,218 
int listen; /* 0 indicates that an 
async query has been
 * sent */
int sleeping;   /* 1 indicates that the 
client is napping */
+   boolthrottling; /* whether nap is for throttling */
int64   until;  /* napping until (usec) */
Variable   *variables;  /* array of variable definitions */
int nvariables;
instr_time  txn_begin;  /* used for measuring 
transaction latencies */
instr_time  stmt_begin; /* used for measuring statement 
latencies */
+   boolis_throttled;   /* whether transaction should be 
throttled */
int use_file;   /* index in sql_files 
for this client */
boolprepared[MAX_FILES];
  } CState;
*** typedef struct
*** 222,227 
--- 230,238 
instr_time *exec_elapsed;   /* time spent executing cmds (per 
Command) */
int*exec_count; /* number of cmd executions 
(per Command) */
unsigned short random_state[3]; /* separate randomness for each 
thread */
+   int64   throttle_trigger;   /* previous/next throttling (us) */
+   int64   throttle_lag;   /* total transaction lag behind 
throttling */
+   int64   throttle_lag_max;   /* max transaction lag */
  } TState;
  
  #define INVALID_THREAD((pthread_t) 0)
*** typedef struct
*** 230,235 
--- 241,248 
  {
instr_time  conn_time;
int xacts;
+   int64   throttle_lag;
+   int64   throttle_lag_max;
  } TResult;
  
  /*
*** usage(void)
*** 353,358 
--- 366,372 
 -n, --no-vacuum  do not run VACUUM before tests\n
 -N, --skip-some-updates  skip updates of pgbench_tellers 
and pgbench_branches\n
 -r, --report-latencies   report average latency per 
command\n
+-R, --rate SPEC  target rate in transactions per 
second\n
 -s, --scale=NUM  report this scale factor in 
output\n
 -S, --select-onlyperform SELECT-only 
transactions\n
 -t, --transactions   number of transactions each 
client runs 
*** doCustom(TState *thread, CState *st, ins
*** 900,916 
  {
PGresult   *res;
Command   **commands;
  
  top:
commands = sql_files[st->use_file];
  
if (st->sleeping)
{   /* are we 
sleeping? */
instr_time  now;
  
INSTR_TIME_SET_CURRENT(now);
!   if (st->until <= INSTR_TIME_GET_MICROSEC(now))
st->sleeping = 0;   /* Done sleeping, go ahead with 
next command */
else
return true;/* Still sleeping, nothing to 
do here */
}
--- 914,973 
  {
PGresult   *res;
Command   **commands;
+   booltrans_needs_throttle = false;
  
  top:
commands = sql_files[st->use_file];
  
+   /*
+* Handle throttling once per transaction by sleeping.  It is simpler
+* to do this here rather than at the end, because so much complicated
+* logic happens below when statements finish.
+*/
+   if (throttle_delay && ! st->is_throttled)
+   {
+   /*
+* Use inverse transform sampling to randomly generate a delay, 
such
+* that the series of delays will approximate a Poisson 
distribution
+* centered on the throttle_delay time.
+*
+* If transactions are too slow or a given wait is shorter than

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 7/3/13 9:39 AM, Andres Freund wrote:

I wonder how much of this could be gained by doing a
sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
the original checkpoint-pass through the buffers or when fsyncing the
files.


The fsync calls decompose into the queued set of block writes.  If 
they all need to go out eventually to finish a checkpoint, the most 
efficient way from a throughput perspective is to dump them all at once.


I'm not sure sync_file_range targeting checkpoint writes will turn out 
any differently than block sorting.  Let's say the database tries to get 
involved in forcing a particular write order that way.  Right now it's 
going to be making that ordering decision without the benefit of also 
knowing what blocks are being read.  That makes it hard to do better 
than the OS, which knows a different--and potentially more useful in a 
read-heavy environment--set of information about all the pending I/O. 
And it would be very expensive to make all the backends start sharing 
information about what they read to ever pull that logic into the 
database.  It's really easy to wander down the path where you assume you 
must know more than the OS does, which leads to things like direct I/O. 
 I am skeptical of that path in general.  I really don't want Postgres 
to be competing with the innovation rate in Linux kernel I/O if we can 
ride it instead.


One idea I was thinking about that overlaps with a sync_file_range 
refactoring is simply tracking how many blocks have been written to each 
relation.  If there were a rule like "fsync any relation that's gotten 
more than 100 8K writes", we'd never build up the sort of backlog that 
causes the worst latency issues.  You really need to start tracking the 
file range there, just to fairly account for multiple writes to the same 
block.  One of the reasons I don't mind all the work I'm planning to put 
into block write statistics is that I think that will make it easier to 
build this sort of facility too.  The original page write and the fsync 
call that eventually flushes it out are very disconnected right now, and 
file range data seems the right missing piece to connect them well.
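
As a rough illustration of the bookkeeping that rule implies, here's a 
minimal sketch.  The names, threshold, and fixed-size table are made up; 
it counts whole relations rather than file ranges, and a real version 
would live next to the existing fsync request queue:

#include <stdio.h>

#define MAX_TRACKED_RELS	64
#define FSYNC_THRESHOLD		100		/* flush after this many 8K writes */

/* Hypothetical per-relation write counters; not PostgreSQL structures. */
typedef struct
{
	unsigned int relation_id;
	int			pending_writes;
} RelWriteCount;

static RelWriteCount counts[MAX_TRACKED_RELS];
static int	ncounts = 0;

/* Called once per buffer written out; fsync when the backlog gets big. */
static void
count_relation_write(unsigned int relation_id)
{
	int			i;

	for (i = 0; i < ncounts; i++)
	{
		if (counts[i].relation_id == relation_id)
			break;
	}
	if (i == ncounts)
	{
		if (ncounts == MAX_TRACKED_RELS)
			return;				/* sketch only: a real version would evict */
		counts[ncounts].relation_id = relation_id;
		counts[ncounts].pending_writes = 0;
		ncounts++;
	}

	if (++counts[i].pending_writes >= FSYNC_THRESHOLD)
	{
		printf("would fsync relation %u after %d writes\n",
			   relation_id, counts[i].pending_writes);
		counts[i].pending_writes = 0;
	}
}

int
main(void)
{
	int			i;

	for (i = 0; i < 250; i++)
		count_relation_write(16384);	/* arbitrary example relation id */
	return 0;
}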


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith
On 7/11/13 8:29 AM, KONDO Mitsumasa wrote:
 I use linear combination method for considering about total checkpoint 
 schedule
 which are write phase and fsync phase. V3 patch was considered about only 
 fsync
 phase, V4 patch was considered about write phase and fsync phase, and v5 patch
 was considered about only fsync phase.

Your v5 now looks like my "Self-tuning checkpoint sync spread" series:
https://commitfest.postgresql.org/action/patch_view?id=514 which I did
after deciding write phase delays didn't help.  It looks to me like
some, maybe all, of your gain is coming from how any added delays spread
out the checkpoints.  The self-tuning part I aimed at was trying to
stay on exactly the same checkpoint end time even with the delays in
place.  I got that part to work, but the performance gain went away once
the schedule was a fair comparison.  You are trying to solve a very hard
problem.

How long are you running your dbt-2 tests for?  I didn't see that listed
anywhere.

 ** Average checkpoint duration (sec) (Not include during loading time)
   | write_duration | sync_duration | total
 fsync v3-0.7 | 296.6  | 251.8898  | 548.48 | OK
 fsync v3-0.9 | 292.086| 276.4525  | 568.53 | OK
 fsync v3-0.7_disabled| 303.5706   | 155.6116  | 459.18 | OK
 fsync v4-0.7 | 273.8338   | 355.6224  | 629.45 | OK
 fsync v4-0.9 | 329.0522   | 231.77| 560.82 | OK

I graphed the total times against the resulting NOTPM values and
attached that.  I expect transaction rate to increase along with the
time between checkpoints, and that's what I see here.  The fsync v4-0.7
result is worse than the rest for some reason, but all the rest line up
nicely.

Notice how fsync v3-0.7_disabled has the lowest total time between
checkpoints, at 459.18.  That is why it has the most I/O and therefore
runs more slowly than the rest.  If you take your fsync v3-0.7_disabled
and increase checkpoint_segments and/or checkpoint_timeout until that
test is averaging about 550 seconds between checkpoints, NOTPM should
also increase.  That's interesting to know, but you don't need any
change to Postgres for that.  That's what always happens when you have
less checkpoints per run.

If you get a checkpoint time table like this where the total duration is
very close--within +/-20 seconds is the sort of noise I would expect
there--at that point I would say you have all your patches on the same
checkpoint schedule.  And then you can compare the NOTPM numbers
usefully.  When the checkpoint times are in a large range like 459.18 to
629.45 in this table, as my graph shows the associated NOTPM numbers are
going to be based on that time.

I would recommend taking a snapshot of pg_stat_bgwriter before and after
the test runs, and then showing the difference between all of those
numbers too.  If the test runs for a while--say 30 minutes--the total
number of checkpoints should be very close too.
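
One simple way to grab that snapshot from a test harness is a tiny libpq
program run once before and once after the test, then diffing the two
outputs.  This is only a sketch; the connection string is a placeholder
and the column list assumes the standard pg_stat_bgwriter view:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* placeholder connection string--adjust for the test database */
	PGconn	   *conn = PQconnectdb("dbname=pgbench");
	PGresult   *res;
	int			i;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	res = PQexec(conn,
				 "SELECT checkpoints_timed, checkpoints_req, "
				 "buffers_checkpoint, buffers_clean, buffers_backend "
				 "FROM pg_stat_bgwriter");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
		PQclear(res);
		PQfinish(conn);
		return 1;
	}

	for (i = 0; i < PQnfields(res); i++)
		printf("%s = %s\n", PQfname(res, i), PQgetvalue(res, 0, i));

	PQclear(res);
	PQfinish(conn);
	return 0;
}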

 * Test Server
Server: HP Proliant DL360 G7
CPU:Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk:   146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
(Add) Set off energy efficient function in BIOS and OS.

Excellent, here I have a DL160 G6 with 2 processors, 72GB of RAM, and
that same P410 controller + 4 disks.  I've been meaning to get DBT-2
running on there usefully, your research gives me a reason to do that.

You seem to be in a rush due to the commitfest schedule.  I have some
bad news for you there.  You're not going to see a change here committed
in this CF based on where it's at, so you might as well think about the
best longer term plan.  I would be shocked if anything came out of this
in less than 3 months really.  That's the shortest amount of time I've
ever done something useful in this area.  Each useful benchmarking run
takes me about 3 days of computer time, so it's not a very fast development
cycle.

Even if all of your results were great, we'd need to get someone to
duplicate them on another server, and we'd need to make sure they didn't
make other workloads worse.  DBT-2 is very useful, but no one is going
to get a major change to the write logic in the database committed based
on one benchmark.  Past changes like this have used both DBT-2 and a
large number of pgbench tests to get enough evidence of improvement to
commit.  I can help with that part when you get to something I haven't
tried already.  I am very interested in improving this area; it just
takes a lot of work to do it.

-- 
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
attachment: NOTPM-Checkpoints.png
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 7/14/13 5:28 PM, james wrote:

Some random seeks during sync can't be helped, but if they are done when
we aren't waiting for sync completion then they are in effect free.


That happens sometimes, but if you measure you'll find this doesn't 
actually occur usefully in the situation everyone dislikes.  In a write 
heavy environment where the database doesn't fit in RAM, backends and/or 
the background writer are constantly writing data out to the OS.  WAL is 
going out constantly as well, and in many cases that's competing for the 
disks too.  The most popular blocks in the database get high usage 
counts and they never leave shared_buffers except at checkpoint time. 
That's easy to prove to yourself with pg_buffercache.


And once the write cache fills, every I/O operation is now competing. 
There is nothing happening for free.  You're stealing I/O from something 
else any time you force a write out.  The optimal throughput path for 
checkpoints turns out to be delaying every single bit of I/O as long as 
possible, in favor of the [backend|bgwriter] writes and WAL.  Whenever 
you delay a buffer write, you have increased the possibility that 
someone else will write the same block again.  And the buffers being 
written by the checkpointer are, on average, the most popular ones in 
the database.  Writing any of them to disk pre-emptively has high odds 
of writing the same block more than once per checkpoint.  And that 
easy-to-measure waste--it shows up as more writes/transaction in 
pg_stat_bgwriter--hurts throughput more than every reduction in seek 
overhead you might otherwise get from early writes.  The big gain isn't 
chasing after cheap seeks.  The best path is the one that decreases the 
total volume of writes.


We played this game with the background writer work for 8.3.  The main 
reason the one committed improved on the original design is that it 
completely eliminated doing work on popular buffers in advance. 
Everything happens at the last possible time, which is the optimal 
throughput situation.  The 8.1/8.2 BGW used to try and write things out 
before they were strictly necessary, in hopes that that I/O would be 
free.  But it rarely was, while there was always a cost to forcing them 
to disk early.  And that cost is highest when you're talking about the 
higher usage blocks the checkpointer tends to write.  When in doubt, 
always delay the write in hopes it will be written to again and you'll 
save work.



So it occurs to me that perhaps we can watch for patterns where we have
groups of adjacent writes that might stream, and when they form we might
schedule them...


Stop here.  I mentioned something upthread that is worth repeating.

The checkpointer doesn't know what concurrent reads are happening.  We 
can't even easily make it know, not without adding a whole new source of 
IPC and locking contention among clients.


Whatever scheduling decision the checkpointer might make with its 
limited knowledge of system I/O is going to be poor.  You might find a 
100% write benchmark that it helps, but those are not representative of 
the real world.  In any mixed read/write case, the operating system is 
likely to do better.  That's why things like sorting blocks sometimes 
seem to help someone, somewhere, with one workload, but then aren't 
repeatable.


We can decide to trade throughput for latency by nudging the OS to deal 
with its queued writes more regularly.  That will result in more total 
writes, which is the reason throughput drops.


But the idea that PostgreSQL is going to do a better global job of I/O 
scheduling, that road is a hard one to walk.  It's only going to happen 
if we pull all of the I/O into the database *and* do a better job on the 
entire process than the existing OS kernel does.  That sort of dream, of 
outperforming the filesystem, it is very difficult to realize.  There's 
a good reason that companies like Oracle stopped pushing so hard on 
recommending raw partitions.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Changing recovery.conf parameters into GUCs

2013-07-14 Thread Greg Smith

On 7/10/13 9:39 AM, Heikki Linnakangas wrote:

On 10.07.2013 02:54, Josh Berkus wrote:

One bit of complexity I'd like to see go away is that we have two
trigger files: one to put a server into replication, and one to promote
it.  The promotion trigger file is a legacy of warm standby, I believe.
  Maybe, now that we have pg_ctl promote available, we can eliminate the
promotion trigger?


No, see http://www.postgresql.org/message-id/5112a54b.8090...@vmware.com.


Right, you had some good points in there.  STONITH is so hard already, 
we need to be careful about eliminating options there.


All the summaries added here have actually managed to revive this one 
usefully early in the release cycle!  Well done.  I just tried to apply 
Michael's 20130325_recovery_guc_v3.patch and the bit rot isn't that bad 
either.  5 rejection files, nothing big in them.


The only overlap between the recovery.conf GUC work and refactoring the 
trigger file is that the trigger file is referenced in there, and we 
really need to point it somewhere to be most useful.



Personally, I don't care if we support the old recovery.conf-trigger
behavior as long as I'm not forced to use it.


If you accept Heikki's argument for why the file can't go away 
altogether, and it's pretty reasonable, I think two changes reach a 
point where everyone can live with this:


-We need a useful default filename for trigger_file to point at.
-pg_ctl failover creates that file.

As for the location to default to, the pg_standby docs suggest 
/tmp/pgsql.trigger.5432, while the Binary Replication Tutorial on the 
wiki uses a RedHat directory layout: $PGDATA/failover.


The main reason I've preferred something in the data directory is that 
triggering a standby is too catastrophic for me to be comfortable 
putting it in /tmp.  Any random hooligan with a shell account can 
trigger a standby and break its replication.  Putting the unix socket 
into /tmp only works because the server creates the file as part of 
startup.  Here that's not possible, because creating the trigger is the 
signalling mechanism.


I don't think there needs to be a CLI interface for putting the 
alternate possible text into the trigger--that you can ask for 'fast' 
startup.  It's nice to have available as an expert, but it's fine for 
that to be harder to do.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-13 Thread Greg Smith

On 6/30/13 2:04 AM, Fabien COELHO wrote:

My guess is the OS. PQfinish or select do/are systems calls that
present opportunities to switch context. I think that the OS is passing
time with other processes on the same host, expecially postgres
backends, when it is not with the client.


I went looking for other instances of this issue in pgbench results, 
that's what I got lost in the last two weeks.  It's subtle because the 
clients normally all end in one very short burst of time, but I have 
found evidence of PQfinish issues elsewhere.  Evidence still seems to 
match the theory that throttling highlights this only because it spreads 
out the ending a bit more.  Also, it happens to be a lot worse on the 
Mac I did initial testing with, and I don't have nearly as many Mac 
pgbench results.


There's a refactoring possible here that seems to make this whole class 
of problem go away.  If I change pgbench so that PQfinish isn't called 
for any client until *all* of the clients are actually finished with 
transactions, the whole issue goes away.  I'm going to package that hack 
the right way into its own little change, revisit the throttling code, 
and then this all should wrap up nicely.  I'd like to get this one out 
of the commitfest so I can move onto looking at something else.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-05 Thread Greg Smith

On 7/5/13 2:50 AM, Jeff Davis wrote:

So, my simple conclusion is that glibc emulation should be about the
same as what we're doing now, so there's no reason to avoid it. That
means, if posix_fallocate() is present, we should use it, because it's
either the same (if emulated in glibc) or significantly faster (if
implemented in the kernel).


That's what I'm seeing everywhere too.  I'm happy that we've spent 
enough time chasing after potential issues without finding anything now. 
 Pull out the GUC that was added for default and this is ready to commit.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Block write statistics WIP

2013-07-05 Thread Greg Smith

On 7/1/13 3:10 AM, Satoshi Nagayasu wrote:

Or should we add some pointer, which is accociated with the Relation,
into BufferDesc? Maybe OID?


That is the other option here, I looked at it but didn't like it.  The 
problem is that at the point when a new page is created, it's not always 
clear yet what relation it's going to hold.  That means that if the 
buffer page itself is where you look up the relation OID, every code 
path that manipulates relation IDs will need to worry about maintaining 
this information.  It's possible to do it that way, but I don't know 
that it's any less work than making all the write paths keep track of 
it.  It would need some extra locking around updating the relation tag 
in the buffer pages too, and I'd prefer not to.


The other thing that I like about the writing side is that I can 
guarantee the code is correct pretty easily.  Yes, it's going to take 
days worth of modifying writing code.  But the compile will force you to 
fix all of them, and once they're all updated I'd be surprised if 
something unexpected happened.


If instead the data moves onto the buffer page header instead, we could 
be chasing bugs similar to the relcache ones forever.  Also, as a tie 
breaker, buffer pages will get bigger and require more locking this way. 
 Making writers tell us the relation doesn't need any of that.



I'm thinking of FlushBuffer() too. How can we count physical write
for each relation in FlushBuffer()? Or any other appropriate place?


SMgrRelation is available there, so tagging the relation should be easy 
in this case.



BTW, Greg's first patch could not be applied to the latest HEAD.
I guess it need to have some fix, I couldn't clafiry it though.


Since this is a WIP patch, what I was looking for was general design 
approach feedback, with my bit as a simple example.  I didn't expect 
anyone to spend much time doing a formal review of that quick hack. 
You're welcome to try and play with that code to add things, I'm not 
very attached to it yet.  I've basically gotten what I wanted to here 
from this submission; I'm marking it returned with feedback.  If anyone 
else has comments, I'm still open to discussion here too.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Greg Smith

On 7/1/13 3:44 PM, Robert Haas wrote:

Yeah.  If the patch isn't going to be a win on RHEL 5, I'd consider
that a good reason to scrap it for now and revisit it in 3 years.
There are still a LOT of people running RHEL 5, and the win isn't big
enough to engineer a more complex solution.


I'm still testing, expect to have this wrapped up on my side by 
tomorrow.  So much of the runtime here is the file setup/close that 
having a 2:1 difference in number of writes, what happens on the old 
platforms, it is hard to get excited about.


I don't think the complexity to lock out RHEL5 here is that bad even if 
it turns out to be a good idea.  Just add another configure check for 
fallocate, and on Linux if it's not there don't use posix_fallocate 
either.  Maybe 5 lines of macro code?  RHEL5 sure isn't going away 
anytime soon, but at the same time there won't be that many 9.4 
deployments on that version.


I've been digging into the main situation where this feature helps, and 
it won't be easy to duplicate in a benchmark situation.  Using Linux's 
fallocate works as a hint that the whole 16MB should be allocated at 
once, and therefore together on disk if feasible.  The resulting WAL 
files should be less prone to fragmentation.  That's actually the 
biggest win of this approach, but I can't easily duplicate the sort of 
real-world fragmentation I see on live servers here.  Given that, I'm 
leaning toward saying that unless there's a clear regression on older 
platforms, above the noise floor, this is still the right thing to do.


I fully agree that this needs to be fully automatic--no GUC--before it's 
worth committing.  If we can't figure out the right thing to do now, 
there's little hope anyone else will in a later tuning expedition.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Greg Smith

On 6/30/13 2:01 PM, Jeff Davis wrote:

Simple test program attached, which creates two files and fills them:
one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then,
I just cmp the resulting files (and also ls them, to make sure they
are 16MB).


This makes platform level testing a lot easier, thanks.  Attached is an 
updated copy of that program with some error checking.  If the files it 
creates already existed, the code didn't notice, and a series of write 
errors happened.  If you set the test up right it's not a problem, but 
it's better if a bad setup is caught.  I wrapped the whole test with a 
shell script, also attached, which insures the right test sequence and 
checks.


Your C test program compiles and passes on RHEL5/6 here, doesn't on OS X 
Darwin.  No surprises there, there's a long list of platforms that don't 
support this call at 
https://www.gnu.org/software/gnulib/manual/html_node/posix_005ffallocate.html 
and the Mac is on it.  Many other platforms I was worried about don't 
support it too--older FreeBSD, HP-UX 11, Solaris 10, mingw, MSVC--so 
that cuts down on testing quite a bit.  If it runs faster on Linux, 
that's the main target here, just like the existing 
effective_io_concurrency fadvise code.


The specific thing I was worried about is that this interface might have 
a stub that doesn't work perfectly in older Linux kernels.  After being 
surprised to find this interface worked on RHEL5 with your test program, 
I dug into this more.  It works there, but it may actually be slower.


posix_fallocate is actually implemented by glibc on Linux.  Been there 
since 2.1.94 according to the Linux man pages.  But Linux itself didn't 
add the feature until kernel 2.6.20:  http://lwn.net/Articles/226436/ 
The biggest thing I was worried about--the call might be there in early 
kernels but with a non-functional implementation--that's not the case. 
Looking at the diff, before that patch there's no fallocate at all.


So what happened in earlier kernels, where there was no kernel level 
fallocate available?  According to 
https://www.redhat.com/archives/fedora-devel-list/2009-April/msg00110.html 
what glibc does is check for kernel fallocate(), and if it's not there 
it writes a bunch of zeros to create the file instead.  What is actually 
happening on a RHEL5 system (with kernel 2.6.18) is that calling 
posix_fallocate does this fallback behavior, where it basically does the 
same thing the existing WAL clearing code does.
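
A rough sketch of what that fallback amounts to is below: one zero byte 
written at the end of each block so the filesystem allocates it.  This is 
illustrative only--it assumes 4K blocks, which is what the strace output 
below suggests, and the real glibc code detects the actual block size and 
is careful not to clobber existing data:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch of the glibc emulation path for posix_fallocate(): touch one
 * zero byte at the end of every block so the filesystem allocates it.
 * Assumes a 4K block size; not the actual glibc implementation.
 */
static int
fallocate_emulated(int fd, off_t offset, off_t len)
{
	off_t		blocksize = 4096;
	off_t		pos;

	for (pos = offset + blocksize - 1; pos < offset + len; pos += blocksize)
	{
		if (pwrite(fd, "", 1, pos) != 1)
			return -1;
	}
	/* cover a final partial block, if len isn't block aligned */
	if (pwrite(fd, "", 1, offset + len - 1) != 1)
		return -1;
	return 0;
}

int
main(void)
{
	int			fd = open("/tmp/emulated", O_CREAT | O_EXCL | O_WRONLY, 0600);

	if (fd < 0 || fallocate_emulated(fd, 0, 16 * 1024 * 1024) != 0)
	{
		perror("emulated fallocate");
		return 1;
	}
	close(fd);
	return 0;
}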


I can even prove that's the case.  On RHEL5, if you run "strace -o out 
./fallocate" the main write loop looks like this:


write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
8192) = 8192


But when you call posix_fallocate, you still get a bunch of writes, but 
4 bytes at a time:


pwrite(4, "\0", 1, 16769023) = 1
pwrite(4, "\0", 1, 16773119) = 1
pwrite(4, "\0", 1, 16777215) = 1

That's glibc helpfully converting your call to posix_fallocate into 
small writes, because the OS doesn't provide a better way in that 
kernel.  It's not hard to imagine this being slower than what the WAL 
code is doing right now.  I'm not worried about correctness issues 
anymore, but my gut paranoia about this not working as expected on older 
systems was justified.  Everyone who thought I was just whining owes me 
a cookie.


This is what I plan to benchmark specifically next.  If the 
posix_fallocate approach is actually slower than what's done now when 
it's not getting kernel acceleration, which is the case on RHEL5 era 
kernels, we might need to make the configure time test more complicated. 
 Whether posix_fallocate is defined isn't sensitive enough; on Linux it 
may be the case that this is only usable when fallocate() is also there.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
#!/bin/sh
rm -f fallocate /tmp/afile /tmp/bfile
gcc fallocate.c -o fallocate
if [ ! -x fallocate ] ; then
  echo "Test program did not compile, posix_fallocate may not be supported"
  exit
fi

./fallocate
if [ -f /tmp/afile ] ; then
  sizea=`du /tmp/afile | cut -f 1`
  sizeb=`du /tmp/bfile | cut -f 1`
  if [ $sizea -eq $sizeb ] ; then
cmp /tmp/afile /tmp/bfile
if [ $? -ne 0 ] ; then
  echo "Test failed, files do not match"
else
  echo "Test passed"
fi
  else
echo "Test failed, sizes do not match"
  fi
fi
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

char buf[8192] = {0};

int main()
{
	int i;
	int written;
	int fda = open("/tmp/afile", O_CREAT | O_EXCL | O_WRONLY, 0600);
	int fdb = open("/tmp/bfile", O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fda < 0 || fdb < 0)
	{
		printf("Opening files failed\n");
		return(1);
	}
	for(i = 0; i < 2048; i++)
		{
			written=write(fda, buf, 8192);
			if (written < 8192)
			{
				printf("Write to file failed");
return(2);
			}
		}

	posix_fallocate(fdb, 0, 16*1024*1024);

	close(fda);
	close(fdb);

	return 0;
}

Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Greg Smith

On 5/28/13 10:00 PM, Jon Nelson wrote:


A note: The attached test program uses *fsync* instead of *fdatasync*
after calling fallocate (or writing out 16MB of zeroes), per an
earlier suggestion.


I tried this out on the RHEL5 platform I'm worried about now.  There's 
something weird about the test program there.  If I run it once it shows 
posix_fallocate running much faster:


without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

The problem is that I'm seeing the gap between the two get smaller the 
more iterations I run, which makes me wonder if the test is completely fair:


without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s

without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s

without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s

You didn't show any output from the latest program on your system, so 
I'm not sure how it behaved for you here.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Greg Smith

On 6/30/13 9:28 PM, Jon Nelson wrote:

The performance of the latter (new) test sometimes seems to perform
worse and sometimes seems to perform better (usually worse) than
either of the other two. In all cases, posix_fallocate performs
better, but I don't have a sufficiently old kernel to test with.


This updated test program looks reliable now.  The numbers are very 
tight when I'd expect them to be, and there's nowhere with the huge 
differences I saw in the earlier test program.


Here's results from a few sets of popular older platforms:

RHEL5, ext3

method: classic. 10 open/close iterations, 10 rewrite in 22.6949s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 23.0113s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 22.4921s

method: classic. 10 open/close iterations, 10 rewrite in 23.2808s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.4736s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.9871s

method: classic. 10 open/close iterations, 10 rewrite in 22.4812s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.2393s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.6940s

RHEL6, ext4

method: classic. 10 open/close iterations, 10 rewrite in 56.0483s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 61.5092s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 53.8569s

method: classic. 10 open/close iterations, 10 rewrite in 57.0361s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 55.9840s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 64.9437s


RHEL6, ext3

method: classic. 10 open/close iterations, 10 rewrite in 14.4080s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.1395s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 16.9657s

method: classic. 10 open/close iterations, 10 rewrite in 15.2825s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.5315s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 14.8115s

The win for posix_fallocate is there in most cases, but it's pretty hard 
to see in these older systems.  That could be OK.  As long as the 
difference is no more than noise, and that is the case, this could be 
good enough to commit.  If there are significantly better results on the 
new platforms, the old ones need to just not get worse.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-29 Thread Greg Smith

On 6/22/13 12:54 PM, Fabien COELHO wrote:

After some poking around, and pursuing various red herrings, I resorted
to measure the delay for calling PQfinish(), which is really the only
special thing going around at the end of pgbench run...


This wasn't what I was seeing, but it's related.  I've proved to myself 
the throttle change isn't reponsible for the weird stuff I'm seeing now. 
 I'd like to rearrange when PQfinish happens now based on what I'm 
seeing, but that's not related to this review.


I duplicated the PQfinish problem you found too.  On my Linux system, 
calls to PQfinish are normally about 36 us long.  They will sometimes 
get lost for 15ms before they return.  That's a different problem 
though, because the ones I'm seeing on my Mac are sometimes 150ms. 
PQfinish never takes quite that long.


PQfinish doesn't pause for a long time on this platform.  But it does 
*something* that causes socket select() polling to stutter.  I have 
instrumented everything interesting in this part of the pgbench code, 
and here is the problem event.


1372531862.062236 select with no timeout sleeping=0
1372531862.109111 select returned 6 sockets latency 46875 us

Here select() is called with 0 sleeping processes, 11 that are done, and 
14 that are running.  The running ones have all sent SELECT statements 
to the server, and they are waiting for a response.  Some of them 
received some data from the server, but they haven't gotten the entire 
response back.  (The PQfinish calls could be involved in how that happened)


With that setup, select runs for 47 *ms* before it gets the next byte to 
a client.  During that time 6 clients get responses back to it, but it 
stays stuck in there for a long time anyway.  Why?  I don't know exactly 
why, but I am sure that pgbench isn't doing anything weird.  It's either 
libpq acting funny, or the OS.  When pgbench is waiting on a set of 
sockets, and none of them are returning anything, that's interesting. 
But there's nothing pgbench can do about it.


The cause/effect here is that the randomness the throttling code adds 
spreads out when all the connections end a bit.  There are more times 
during which you might have 20 connections finished while 5 still run.


I need to catch up with revisions done to this feature since I started 
instrumenting my copy more heavily.  I hope I can get this ready for 
commit by Monday.  I've certainly beaten on the feature for long enough now.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-06-28 Thread Greg Smith

On 6/28/13 8:50 AM, Robert Haas wrote:

On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com wrote:

4. Separate processes for writing dirty buffers and moving buffers to
freelist


I think this part might be best pushed to a separate patch, although I
agree we probably need it.


This might be necessary eventually, but it's going to make things more 
complicated.  And I don't think it's a blocker for creating something 
useful.  The two most common workloads are:


1) Lots of low usage count data, typically data that is updated sparsely 
across a larger database.  These are helped by a process that writes 
dirty buffers in the background.  These benefit from the current 
background writer.  Kevin's system he was just mentioning again is the 
best example of this type that there's public data on.


2) Lots of high usage count data, because there are large hotspots in 
things like index blocks.  Most writes happen at checkpoint time, 
because the background writer won't touch them.  Because there are only 
a small number of re-usable pages, the clock sweep goes around very fast 
looking for them.  This is the type of workload that should benefit from 
putting buffers into the free list.  pgbench provides a simple example 
of this type, which is why Amit's tests using it have been useful.


If you had a process that tried to handle both background writes and 
freelist management, I suspect one path would be hot and the other 
almost idle in each type of system.  I don't expect that splitting those 
into two separate process would buy a lot of value, that can easily be 
pushed to a later patch.



The background writer would just
have a high and a low watermark.  When the number of buffers on the
freelist drops below the low watermark, the allocating backend sets
the latch and bgwriter wakes up and begins adding buffers to the
freelist.  When the number of buffers on the free list reaches the
high watermark, the background writer goes back to sleep.


This will work fine for all of the common workloads.  The main challenge 
is keeping the buffer allocation counting from turning into a hotspot. 
Busy systems now can easily hit 100K buffer allocations/second.  I'm not 
too worried about it because those allocations are making the free list 
lock a hotspot right now.
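
To make the proposal concrete, the loop being described is roughly the 
following.  All of the names and numbers here are made-up placeholders 
rather than code from any patch:

#include <stdbool.h>
#include <stdio.h>

#define FREELIST_LOW_WATERMARK	64
#define FREELIST_HIGH_WATERMARK	256

/* Hypothetical stand-ins for the real freelist and buffer machinery. */
static int	freelist_len = 10;

static bool
find_and_free_one_buffer(void)
{
	/* run the clock sweep, write the buffer if dirty, push it on the list */
	freelist_len++;
	return true;
}

/* One wakeup of the background process after a backend sets its latch. */
static void
bgwriter_refill_freelist(void)
{
	while (freelist_len < FREELIST_HIGH_WATERMARK)
	{
		if (!find_and_free_one_buffer())
			break;				/* nothing evictable right now */
	}
}

int
main(void)
{
	/* A backend noticing the list is short would set the latch... */
	if (freelist_len < FREELIST_LOW_WATERMARK)
		bgwriter_refill_freelist();

	printf("freelist length now %d\n", freelist_len);
	return 0;
}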


One of the consistently controversial parts of the current background 
writer is how it tries to loop over the buffer cache every 2 minutes, 
regardless of activity level.  The idea there was that on bursty 
workloads, buffers would be cleaned during idle periods with that 
mechanism.  Part of why that's in there is to deal with the relatively 
long pause between background writer runs.


This refactoring idea will make that hard to keep around.  I think this 
is OK though.  Switching to a latch based design should eliminate the 
bgwriter_delay, which means you won't have this worst case of a 200ms 
stall while heavy activity is incoming.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-27 Thread Greg Smith

On 6/25/13 2:44 PM, Josh Berkus wrote:

On the other hand, I will point out that we currently have a shortage of
reviewers, and we do NOT have a shortage of patch submitters.


That's because reviewing is harder than initial development.  The only 
people who think otherwise are developers who don't do enough review.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-18 Thread Greg Smith
I'm still getting the same sort of pauses waiting for input with your 
v11.  This is a pretty frustrating problem; I've spent about two days so 
far trying to narrow down how it happens.  I've attached the test 
program I'm using.  It seems related to my picking a throttled rate 
that's close to (but below) the maximum possible on your system.  I'm 
using 10,000 on a system that can do about 16,000 TPS when running 
pgbench in debug mode.


This problem is 100% reproducible here; happens every time.  This is a 
laptop running Mac OS X.  It's possible the problem is specific to that 
platform.  I'm doing all my tests with the database itself setup for 
development, with debug and assertions on.  The lag spikes seem smaller 
without assertions on, but they are still there.


Here's a sample:

transaction type: SELECT only
scaling factor: 10
query mode: simple
number of clients: 25
number of threads: 1
duration: 30 s
number of transactions actually processed: 301921
average transaction lag: 1.133 ms (max 137.683 ms)
tps = 10011.527543 (including connections establishing)
tps = 10027.834189 (excluding connections establishing)

And those slow ones are all at the end of the latency log file, as shown 
in column 3 here:


22 11953 3369 0 1371578126 954881
23 11926 3370 0 1371578126 954918
3 12238 30310 0 1371578126 984634
7 12205 30350 0 1371578126 984742
8 12207 30359 0 1371578126 984792
11 12176 30325 0 1371578126 984837
13 12074 30292 0 1371578126 984882
0 12288 175452 0 1371578127 126340
9 12194 171948 0 1371578127 126421
12 12139 171915 0 1371578127 126466
24 11876 175657 0 1371578127 126507

Note that there are two long pauses here, which happens maybe half of 
the time.  There's a 27 ms pause and then a 145 ms one.


The exact sequence for when the pauses show up is:

-All remaining clients have sent their SELECT statement and are waiting 
for a response.  They are not sleeping, they're waiting for the server 
to return a query result.
-A client receives part of the data it wants, but there is still data 
left.  This is the path through pgbench.c where the if 
(PQisBusy(st->con)) test is true after receiving some information.  I 
hacked up some logging that distinguishes those as a "client %d partial 
receive" to make this visible.
-A select() call is made with no timeout, so it can only be satisfied by 
more data being returned.
-Around ~100ms (can be less, can be more) goes by before that select() 
returns more data to the client, where normally it would happen in ~2ms.


You were concerned about a possible bug in the timeout code.  If you 
hack up the select statement to show some state information, the setup 
for the pauses at the end always looks like this:


calling select min_usec=9223372036854775807 sleeping=0

When no one is sleeping, the timeout becomes infinite, so only returning 
data will break it.  This is intended behavior though.  This exact same 
sequence and select() call parameters happen in pgbench code every time 
at the end of a run.  But clients without the throttling patch applied 
never seem to get into the state where they pause for so long, waiting 
for one of the active clients to receive the next bit of result.
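
Here's the shape of that decision in a standalone form (a hypothetical 
helper for illustration, not pgbench's actual code); with the sentinel 
min_usec value above, the timeout pointer is NULL and only returning data 
can wake the call up:

/* Illustration of the select() timeout decision described above. */
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

static struct timeval *select_timeout(int64_t min_usec, struct timeval *tv)
{
    if (min_usec == INT64_MAX)
        return NULL;                    /* no one is sleeping: block indefinitely */
    tv->tv_sec = min_usec / 1000000;
    tv->tv_usec = min_usec % 1000000;
    return tv;                          /* someone is sleeping: wake up in time for them */
}

int main(void)
{
    struct timeval tv;

    printf("client sleeping 2000 us -> finite timeout? %s\n",
           select_timeout(2000, &tv) ? "yes" : "no (NULL)");
    printf("min_usec=9223372036854775807 -> finite timeout? %s\n",
           select_timeout(INT64_MAX, &tv) ? "yes" : "no (NULL)");
    return 0;
}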


I don't think the st->listen related code has anything to do with this 
either.  That flag is only used to track when clients have completed 
sending their first query over to the server.  Once they reach that 
point, they should be listening for results each time they exit the 
doCustom() code.  st->listen goes to 1 very soon after startup and then 
it stays there, and that logic seems fine too.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
#!/bin/bash
make
make install
rm pgbench_log* nohup.out
#nohup pgbench -c 25 -T 30 -l -d -f select.sql -s 10 pgbench
nohup pgbench -c 25 -T 30 -l -d -S -R 10000 pgbench
tail -n 50 pgbench_log*




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith

On 6/12/13 3:19 AM, Fabien COELHO wrote:

If you are still worried: if you run the very same command without
throttling and measure the same latency, does the same thing happens at
the end? My guess is that it should be yes. If it is no, I'll try out
pgbench-tools.


It looks like it happens rarely for one client without the rate limit, 
but that increases to every time for multiple clients with limiting in 
place.  pgbench-tools just graphs the output from the latency log. 
Here's a setup that runs the test I'm doing:


$ createdb pgbench
$ pgbench -i -s 10 pgbench
$ pgbench -S -c 25 -T 30 -l pgbench && tail -n 40 pgbench_log*

Sometimes there are no slow entries, but I've seen this once so far:

0 21822 1801 0 1371217462 945264
1 21483 1796 0 1371217462 945300
8 20891 1931 0 1371217462 945335
14 20520 2084 0 1371217462 945374
15 20517 1991 0 1371217462 945410
16 20393 1928 0 1371217462 945444
17 20183 2000 0 1371217462 945479
18 20277 2209 0 1371217462 945514
23 20316 2114 0 1371217462 945549
22 20267 250128 0 1371217463 193656

The third column is the latency for that transaction.  Notice how it's a 
steady ~2000 us except for the very last transaction, which takes 
250,128 us.  That's the weird thing; these short SELECT statements 
should never take that long.  It suggests there's something weird 
happening with how the client exits, probably that its latency number is 
being computed after more work than it should.


Here's what a rate limited run looks like for me.  Note that I'm still 
using the version I re-submitted since that's where I ran into this 
issue; I haven't merged your changes to split the rate among each client 
here--which means this is 400 TPS per client == 10000 TPS total:


$ pgbench -S -c 25 -T 30 -R 400 -l pgbench && tail -n 40 pgbench_log

7 12049 2070 0 1371217859 195994
22 12064 2228 0 1371217859 196115
18 11957 1570 0 1371217859 196243
23 12130 989 0 1371217859 196374
8 11922 1598 0 1371217859 196646
11 12229 4833 0 1371217859 196702
21 11981 1943 0 1371217859 196754
20 11930 1026 0 1371217859 196799
14 11990 13119 0 1371217859 208014
^^^ fast section
vvv delayed section
1 11982 91926 0 1371217859 287862
2 12033 116601 0 1371217859 308644
6 12195 115957 0 1371217859 308735
17 12130 114375 0 1371217859 308776
0 12026 115507 0 1371217859 308822
3 11948 118228 0 1371217859 308859
4 12061 113484 0 1371217859 308897
5 12110 113586 0 1371217859 308933
9 12032 117744 0 1371217859 308969
10 12045 114626 0 1371217859 308989
12 11953 113372 0 1371217859 309030
13 11883 114405 0 1371217859 309066
15 12018 116069 0 1371217859 309101
16 11890 115727 0 1371217859 309137
19 12140 114006 0 1371217859 309177
24 11884 115782 0 1371217859 309212

There's almost 90,000 usec of latency showing up between epoch time 
1371217859.208014 and 1371217859.287862 here.  What's weird about it is 
that the longer the test runs, the larger the gap is.  If collecting the 
latency data itself caused the problem, that would make sense, so maybe 
this is related to flushing that out to disk.


If you want to look just at the latency numbers without the other 
columns in the way you can use:


cat pgbench_log.* | awk {'print $3'}

That is how I was evaluating the smoothness of the rate limit, by 
graphing those latency values.  pgbench-tools takes those and a derived 
TPS/s number and plots them, which made it easier for me to spot this 
weirdness.  But I've already moved onto analyzing the raw latency data 
instead, I can see the issue without the graph once I've duplicated the 
conditions.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Greg Smith

On 6/14/13 1:06 PM, Jeff Davis wrote:

Why have a GUC here at all? Perhaps this was already discussed, and I
missed it? Is it just for testing purposes, or did you intend for it to
be in the final version?


You have guessed correctly!  I suggested it stay in there only to make 
review benchmarking easier.



I started looking at this patch and it looks like we are getting a
consensus that it's the right approach. Microbenchmarks appear to show a
benefit, and (thanks to Noah's comment) it seems like the change is
safe. Are there any remaining questions or objections?


I'm planning to duplicate Jon's test program on a few machines here, and 
then see if that turns into a useful latency improvement for clients. 
I'm trying to get this pgbench rate limit stuff working first though, 
because one of the tests I had in mind for WAL creation overhead would 
benefit from it.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith
I don't have this resolved yet, but I think I've identified the cause. 
Updating here mainly so Fabien doesn't duplicate my work trying to track 
this down.  I'm going to keep banging at this until it's resolved now 
that I got this far.


Here's a slow transaction:

1371226017.568515 client 1 executing \set naccounts 100000 * :scale
1371226017.568537 client 1 throttling 6191 us
1371226017.747858 client 1 executing \setrandom aid 1 :naccounts
1371226017.747872 client 1 sending SELECT abalance FROM pgbench_accounts 
WHERE aid = 268721;

1371226017.789816 client 1 receiving

That confirms it is getting stuck at the throttling step.  Looks like 
the code pauses there because it's trying to overload the "sleeping" 
state that was already in pgbench, but handle it in a special way inside 
of doCustom(), and that doesn't always work.


The problem is that pgbench doesn't always stay inside doCustom when a 
client sleeps.  It exits there to poll for incoming messages from the 
other clients, via select() on a shared socket.  It's not safe to assume 
doCustom will be running regularly; that's only true if clients keep 
returning messages.


So as long as other clients keep banging on the shared socket, doCustom 
is called regularly, and everything works as expected.  But at the end 
of the test run that happens less often, and that's when the problem 
shows up.


pgbench already has a \sleep command, and the way that delay is 
handled happens inside threadRun() instead.  The pausing of the rate 
limit throttle needs to operate in the same place.  I have to redo a few 
things to confirm this actually fixes the issue, as well as look at 
Fabien's later updates to this since I wandered off debugging.  I'm sure 
it's in the area of code I'm poking at now though.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith

On 6/14/13 3:50 PM, Fabien COELHO wrote:

I think that the weirdness really comes from the way transactions times
are measured, their interactions with throttling, and latent bugs in the
code.


measurement times, no; interactions with throttling, no.  If it was 
either of those I'd have finished this off days ago.  Latent bugs, 
possibly.  We may discover there's nothing wrong with your code at the 
end here, that it just makes hitting this bug more likely. 
Unfortunately today is the day *some* bug is popping up, and I want to 
get it squashed before I'll be happy.


The lag is actually happening during a kernel call that isn't working as 
expected.  I'm not sure whether this bug was there all along if \sleep 
was used, or if it's specific to the throttle sleep.


 Also, flag st->listen is set to 1 but *never* set back to 0...

I noticed that st->listen was weird too, and that's on my short list of 
suspicious things I haven't figured out yet.


I added a bunch more logging as pgbench steps through its work to track 
down where it's getting stuck.  Until the end, all transactions look like this:


1371238832.084783 client 10 throttle lag 2 us
1371238832.084783 client 10 executing \setrandom aid 1 :naccounts
1371238832.084803 client 10 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 753099;

1371238832.084840 calling select
1371238832.086539 client 10 receiving
1371238832.086539 client 10 finished

All clients who hit lag spikes at the end are going through this 
sequence instead:


1371238832.085912 client 13 throttle lag 790 us
1371238832.085912 client 13 executing \setrandom aid 1 :naccounts
1371238832.085931 client 13 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 564894;

1371238832.086592 client 13 receiving
1371238832.086662 calling select
1371238832.235543 client 13 receiving
1371238832.235543 client 13 finished

Note the "calling select" here that wasn't in the normal-length 
transaction before it.  The client is receiving something here, but 
rather than it finishing the transaction it falls through and ends up at 
the select() system call outside of doCustom.  All of the clients that 
are sleeping when the system slips into one of these long select() calls 
are getting stuck behind it.  I'm not 100% sure, but I think this only 
happens when all remaining clients are sleeping.


Here's another one, it hits the receive that doesn't finish the 
transaction earlier (1371238832.086587) but then falls into the same 
select() call at 1371238832.086662:


1371238832.085884 client 12 throttle lag 799 us
1371238832.085884 client 12 executing \setrandom aid 1 :naccounts
1371238832.085903 client 12 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 299080;

1371238832.086587 client 12 receiving
1371238832.086662 calling select
1371238832.231032 client 12 receiving
1371238832.231032 client 12 finished

Investigation is still going here...

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-11 Thread Greg Smith

On 6/10/13 6:02 PM, Fabien COELHO wrote:

  - the tps is global, with a mutex to share the global stochastic process
  - there is an adaptation for the fork emulation
  - I do not know whether this works with Win32 pthread stuff.


Instead of this complexity, can we just split the TPS input per client? 
 That's all I was thinking of here, not adding a new set of threading 
issues.  If 10000 TPS is requested and there are 10 clients, just set the 
delay so that each of them targets 1000 TPS.


I'm guessing it's more accurate to have them all communicate as you've 
done here, but it seems like a whole class of new bugs and potential 
bottlenecks could come out of that.  Whenever someone touches the 
threading model for pgbench it usually gives a stack of build farm 
headaches.  Better to avoid those unless there's really a compelling 
reason to go through that.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Greg Smith

On 6/11/13 12:22 PM, Merlin Moncure wrote:


Personally I think this patch should go in regardless -- the concerns
made IMNSHO are specious.


That's nice, but we have this process for validating whether features go 
in or not that relies on review instead of opinions.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-11 Thread Greg Smith

On 6/11/13 4:11 PM, Fabien COELHO wrote:

  - ISTM that the thread start time should be initialized at the
beginning of threadRun instead of in the loop *before* thread creation,
otherwise the thread creation delays are incorporated in the
performance measure, but ISTM that the point of pgbench is not to
measure thread creation performance...


I noticed that, but it seemed a pretty minor issue.  Did you look at the 
giant latency spikes at the end of the test run I submitted the graph 
for?  I wanted to nail down what was causing those before worrying about 
the startup timing.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-08 Thread Greg Smith

On 6/1/13 5:00 AM, Fabien COELHO wrote:

Question 1: should it report the maximum lag encountered?


I haven't found the lag measurement to be very useful yet, outside of 
debugging the feature itself.  Accordingly I don't see a reason to add 
even more statistics about the number outside of testing the code.  I'm 
seeing some weird lag problems right now that this will be useful for; 
more on that in a few places below.



Question 2: the next step would be to have the current lag shown under
option --progress, but that would mean having a combined --throttle
--progress patch submission, or maybe dependencies between patches.


This is getting too far ahead.  Let's get the throttle part nailed down 
before introducing even more moving parts into this.  I've attached an 
updated patch that changes a few things around already.  I'm not done 
with this yet and it needs some more review before commit, but it's not 
too far away from being ready.


This feature works quite well.  On a system that will run at 25K TPS 
without any limit, I did a run with 25 clients and a rate of 400/second, 
aiming at 10,000 TPS, and that's what I got:


number of clients: 25
number of threads: 1
duration: 60 s
number of transactions actually processed: 599620
average transaction lag: 0.307 ms
tps = 9954.779317 (including connections establishing)
tps = 9964.947522 (excluding connections establishing)

I never thought of implementing the throttle like this before, but it 
seems to work out well so far.  Check out tps.png to see the smoothness 
of the TPS curve (the graphs came out of pgbench-tools).  There's a 
little more play outside of the target than ideal for this case.  Maybe 
it's worth tightening the Poisson curve a bit around its center?
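
For anyone who wants to play with the shape of that curve outside of 
pgbench, here's a sketch of the underlying idea as I understand it (my 
own simplification, not necessarily exactly what the patch computes): 
draw each delay from an exponential distribution whose mean matches the 
target rate, which is what makes the arrivals a Poisson process:

/* Sketch of Poisson-process throttling: exponential inter-transaction delays. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const double target_tps = 400.0;            /* per-client rate, as in the run above */
    const double mean_delay_us = 1e6 / target_tps;
    const int n = 100000;
    double total_us = 0.0;

    srandom(42);
    for (int i = 0; i < n; i++)
    {
        double u = (random() + 1.0) / 2147483648.0;   /* uniform in (0,1] */
        total_us += -log(u) * mean_delay_us;          /* exponential inter-arrival delay */
    }
    printf("mean delay %.1f us, effective rate %.1f TPS\n",
           total_us / n, 1e6 * n / total_us);
    return 0;
}

Individual delays vary a lot even though the long-run average hits the 
target; truncating or reshaping that distribution is what "tightening 
the curve" would amount to.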


The main implementation issue I haven't looked into yet is why things 
can get weird at the end of the run.  See the latency.png graph attached 
and you can see what I mean.


I didn't like the naming on this option or all of the ways you could 
specify the delay.  None of those really added anything, since you can 
get every other behavior by specifying a non-integer TPS.  And using the 
word throttle inside the code is fine, but I didn't like exposing that 
implementation detail more than it had to be.


What I did instead was think of this as a transaction rate target, which 
makes the help a whole lot simpler:


  -R SPEC, --rate SPEC
   target rate per client in transactions per second

Made the documentation easier to write too.  I'm not quite done with 
that yet, the docs wording in this updated patch could still be better.


I personally would like this better if --rate specified a *total* rate 
across all clients.  However, there are examples of both types of 
settings in the program already, so there's no one precedent for which 
is right here.  -t is per-client and now -R is too; I'd prefer it to be 
like -T instead.  It's not that important though, and the code is 
cleaner as it's written right now.  Maybe this is better; I'm not sure.


I did some basic error handling checks on this and they seemed good; the 
program rejects target rates of <= 0.


On the topic of this weird latency spike issue, I did see that show up 
in some of the results too.  Here's one where I tried to specify a rate 
higher than the system can actually handle, 80000 TPS total on a 
SELECT-only test:


$ pgbench -S -T 30 -c 8 -j 4 -R10000tps pgbench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 4
duration: 30 s
number of transactions actually processed: 761779
average transaction lag: 10298.380 ms
tps = 25392.312544 (including connections establishing)
tps = 25397.294583 (excluding connections establishing)

It was actually limited by the capabilities of the hardware, 25K TPS. 
10298 ms of lag per transaction can't be right though.


Some general patch submission suggestions for you as a new contributor:

-When re-submitting something with improvements, it's a good idea to add 
a version number to the patch so reviewers can tell them apart easily. 
But there is no reason to change the subject line of the e-mail each 
time.  I followed that standard here.  If you updated this again I would 
name the file pgbench-throttle-v9.patch but keep the same e-mail subject.


-There were some extra carriage return characters in your last 
submission.  Wasn't a problem this time, but if you can get rid of those 
that makes for a better patch.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
attachment: tps.png
attachment: latency.png

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8c202bf..799dfcd 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@ int		unlogged_tables = 0;
 double sample_rate = 0.0;
 
 /*
+ * When clients

Re: [HACKERS] Redesigning checkpoint_segments

2013-06-08 Thread Greg Smith

On 6/7/13 2:43 PM, Robert Haas wrote:

name.  What I would like to see is a single number here in memory units that
replaces both checkpoint_segments and wal_keep_segments.


This isn't really making sense to me.  I don't think we should assume
that someone who wants to keep WAL around for replication also wants
to wait longer between checkpoints.  Those are two quite different
things.


It's been years since I saw anyone actually using checkpoint_segments as 
that sort of limit.  I see a lot of sites pushing the segments limit up 
and then using checkpoint_timeout carefully.  It's pretty natural to say 
I don't want to go more than X minutes between checkpoints.  The case 
for wanting to say I don't want to go more than X MB between 
checkpoints instead, motivated by not wanting too much activity to 
queue between them, I'm just not seeing demand for that now.


The main reason I do see people paying attention to checkpoint_segments 
still is to try and put a useful bound on WAL disk space usage.  That's 
the use case I think overlaps with wal_keep_segments such that you might 
replace both of them.  I think we really only need one control that 
limits how much WAL space is expected inside of pg_xlog, and it should 
be easy and obvious how to set it.


The more I look at this checkpoint_segments patch, the more I wonder why 
it's worth even bothering with anything but a disk space control here. 
checkpoint_segments is turning into an internal implementation detail 
most sites I see wouldn't miss at all.  Rather than put work into 
autotuning it, I'd be happy to eliminate checkpoint_segments altogther, 
in favor of a WAL disk space limit.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-06-08 Thread Greg Smith

On 6/8/13 4:43 PM, Jeff Janes wrote:


Also, in all the anecdotes I've been hearing about autovacuum causing
problems from too much IO, in which people can identify the specific
problem, it has always been the write pressure, not the read, that
caused the problem.  Should the default be to have the read limit be
inactive and rely on the dirty-limit to do the throttling?


That would be bad, I have to carefully constrain both of them on systems 
that are short on I/O throughput.  There all sorts of cases where 
cleanup of a large and badly cached relation will hit the read limit 
right now.


I suspect the reason we don't see as many complaints is that a lot more 
systems can handle 7.8MB/s of random reads then there are ones that can 
do 3.9MB/s of random writes.  If we removed that read limit, a lot more 
complaints would start rolling in about the read side.
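
For reference, those numbers fall out of the defaults: a vacuum_cost_limit 
of 200 spent per 20ms autovacuum nap is roughly 10,000 cost units per 
second, which at the default vacuum_cost_page_miss of 10 allows about 
1,000 reads per second and at vacuum_cost_page_dirty of 20 about 500 
writes per second; at 8K per block that's where the 7.8MB/s and 3.9MB/s 
figures come from.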


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-06-08 Thread Greg Smith

On 6/8/13 5:17 PM, Jeff Janes wrote:


But my gut feeling is that if autovacuum is trying to read faster than
the hardware will support, it will just automatically get throttled, by
inherent IO waits, at a level which can be comfortably supported.  And
this will cause minimal interference with other processes.


If this were true all the time autovacuum tuning would be a lot easier. 
 You can easily make a whole server unresponsive by letting loose one 
rogue process doing a lot of reads.  Right now this isn't a problem for 
autovacuum because any one process running at 7.8MB/s is usually not a 
big deal.  It doesn't take too much in the way of read-ahead logic and 
throughput to satisfy that.  But I've seen people try and push the read 
rate upwards who didn't get very far beyond that before it was way too 
obtrusive.


I could collect some data from troubled servers to see how high I can 
push the read rate before they suffer.  Maybe there's a case there for 
increasing the default read rate because the write one is a good enough 
secondary limiter.  I'd be surprised if we could get away with more than 
a 2 or 3X increase though, and the idea of going unlimited is really 
scary.  It took me a year of before/after data collection before I was 
confident that it's OK to run unrestricted in all cache hit situations.



Why is there so much random IO?  Do your systems have
autovacuum_vacuum_scale_factor set far below the default?  Unless they
do, most of the IO (both read and write) should be sequential.


Insert one process doing sequential reads into a stream of other 
activity and you can easily get random I/O against the disks out of the 
mix.  You don't necessarily need the other activity to be random to get 
that.  N sequential readers eventually acts like N random readers for 
high enough values of N.  On busy servers, autovacuum is normally 
competing against multiple random I/O processes though.


Also, the database's theoretical model that block number correlates 
directly with location on disk can break down.  I haven't put a hard 
number to measuring it directly, but systems with vacuum problems seem 
more likely to have noticeable filesystem level fragmentation.  I've 
been thinking about collecting data from a few systems with filefrag to 
see if I'm right about that.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-06-08 Thread Greg Smith

On 6/8/13 5:20 PM, Kevin Grittner wrote:

I'll believe that all of that is true, but I think there's another
reason.  With multiple layers of write cache (PostgreSQL
shared_buffers, OS cache, controller or SSD cache) I think there's
a tendency for an avalanche effect.  Dirty pages stick to cache
at each level like snow on the side of a mountain, accumulating
over time.  When it finally breaks loose at the top, it causes more
from lower levels to break free as it passes.


I explained this once as being like a tower of leaky buckets where each 
one drips into the one below.  Buckets draining out of the bottom at one 
rate, and new water comes in at another.  You can add water much faster 
than it drains, for a while.  But once one of the buckets fills you've 
got a serious mess.



I think the sudden onset of problems from write
saturation contributes to the complaints.


It's also important to realize that vacuum itself doesn't even do the 
writes in many cases.  If you have a large shared_buffers value, it 
wanders off making things dirty without any concern for what's going to 
disk.  When the next checkpoint shows up is when pressure increases at 
the top.


The way this discussion has wandered off has nicely confirmed I was 
right to try and avoid going into this area again :(


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-06-07 Thread Greg Smith

On 6/7/13 10:14 AM, Robert Haas wrote:

If the page hit limit goes away, the user with a single core server who used
to having autovacuum only pillage shared_buffers at 78MB/s might complain
that if it became unbounded.


Except that it shouldn't become unbounded, because of the ring-buffer
stuff.  Vacuum can pillage the OS cache, but the degree to which a
scan of a single relation can pillage shared_buffers should be sharply
limited.


I wasn't talking about disruption of the data that's in the buffer 
cache.  The only time the scenario I was describing plays out is when 
the data is already in shared_buffers.  The concern is damage done to 
the CPU's data cache by this activity.  Right now you can't even reach 
100MB/s of damage to your CPU caches in an autovacuum process.  Ripping 
out the page hit cost will eliminate that cap.  Autovacuum could 
introduce gigabytes per second of memory -> L1 cache transfers.  That's 
what all my details about memory bandwidth were trying to put into 
context.  I don't think it really matter much because the new bottleneck 
will be the processing speed of a single core, and that's still a decent 
cap to most people now.
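
To quantify that cap: at the default cost limit of 200 per 20ms nap and a 
page hit cost of 1, autovacuum can touch roughly 10,000 already-cached 
buffers per second, which at 8K each is the ~78MB/s figure above.  Remove 
the hit cost and the only remaining bound is how fast one core can chew 
through buffers.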



I think you're missing my point here, which is is that we shouldn't
have any such things as a cost limit.  We should limit reads and
writes *completely separately*.  IMHO, there should be a limit on
reading, and a limit on dirtying data, and those two limits should not
be tied to any common underlying cost limit.  If they are, they will
not actually enforce precisely the set limit, but some other composite
limit which will just be weird.


I see the distinction you're making now, don't need a mock up to follow 
you.  The main challenge of moving this way is that read and write rates 
never end up being completely disconnected from one another.  A read 
will only cost some fraction of what a write does, but they shouldn't be 
completely independent.


Just because I'm comfortable doing 10MB/s of reads and 5MB/s of writes, 
I may not be happy with the server doing 9MB/s read + 5MB/s write=14MB/s 
of I/O in an implementation where they float independently.  It's 
certainly possible to disconnect the two like that, and people will be 
able to work something out anyway.  I personally would prefer not to 
lose some ability to specify how expensive read and write operations 
should be considered in relation to one another.


Related aside:  shared_buffers is becoming a decreasing fraction of 
total RAM each release, because it's stuck with this rough 8GB limit 
right now.  As the OS cache becomes a larger multiple of the 
shared_buffers size, the expense of the average read is dropping.  Reads 
are getting more likely to be in the OS cache but not shared_buffers, 
which makes the average cost of any one read shrink.  But writes are as 
expensive as ever.


Real-world tunings I'm doing now reflecting that, typically in servers 
with 128GB of RAM, have gone this far in that direction:


vacuum_cost_page_hit = 0
vacuum_cost_page_miss = 2
vacuum_cost_page_dirty = 20

That's 4MB/s of writes, 40MB/s of reads, or some blended mix that 
considers writes 10X as expensive as reads.  The blend is a feature.
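
The arithmetic behind that: with the default cost limit of 200 per 20ms 
nap, autovacuum gets about 10,000 cost units per second; a miss cost of 2 
allows roughly 5,000 reads per second (40MB/s at 8K blocks), while a 
dirty cost of 20 allows 500 writes per second (4MB/s).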


The logic here is starting to remind me of how the random_page_cost 
default has been justified.  Real-world random reads are actually close 
to 50X as expensive as sequential ones.  But the average read from the 
executor's perspective is effectively discounted by OS cache hits, so 
4.0 is still working OK.  In large memory servers, random reads keep 
getting cheaper via better OS cache hit odds, and it's increasingly 
becoming something important to tune for.


Some of this mess would go away if we could crack the shared_buffers 
scaling issues for 9.4.  There's finally enough dedicated hardware 
around to see the issue and work on it, but I haven't gotten a clear 
picture of any reproducible test workload that gets slower with large 
buffer cache sizes.  If anyone has a public test case that gets slower 
when shared_buffers goes from 8GB to 16GB, please let me know; I've got 
two systems setup I could chase that down on now.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-06-07 Thread Greg Smith

On 6/7/13 12:42 PM, Robert Haas wrote:

GUCs in terms of units that are meaningful to the user.  One could
have something like io_rate_limit (measured in MB/s),
io_read_multiplier = 1.0, io_dirty_multiplier = 1.0, and I think that
would be reasonably clear.


There's one other way to frame this:

io_read_limit = 7.8MB/s # Maximum read rate
io_dirty_multiplier = 2.0  # How expensive writes are considered 
relative to reads


That still gives all of the behavior I'd like to preserve, as well as 
not changing the default I/O pattern.  I don't think it's too 
complicated to ask people to grapple with that pair.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Greg Smith

On 6/6/13 4:42 AM, Joshua D. Drake wrote:


On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:


(I'm sure you know this, but:) If you perform a checkpoint as fast and
short as possible, the sudden burst of writes and fsyncs will
overwhelm the I/O subsystem, and slow down queries. That's what we saw
before spread checkpoints: when a checkpoint happens, the response
times of queries jumped up.


That isn't quite right. Previously we had lock issues as well and
checkpoints would take considerable time to complete. What I am talking
about is that the background writer (and wal writer where applicable)
have done all the work before a checkpoint is even called.


That is not possible, and if you look deeper at a lot of workloads 
you'll eventually see why.  I'd recommend grabbing snapshots of 
pg_buffercache output from a lot of different types of servers and see 
what the usage count distribution looks like.  That's what I did in order 
to create all of the behaviors the current background writer code caters 
to.  Attached is a small spreadsheet that shows the main two extremes 
here, from one of my old talks.  The "Effective buffer cache" system is 
full of usage count 5 pages, while the "Minimally effective buffer cache" 
one is all usage count 1 or 0.  We don't have redundant systems here; we 
have two that aim at distinctly different workloads.  That's one reason 
why splitting them apart ended up being necessary to move forward; they 
really don't overlap very much on some servers.


Sampling a few servers that way was where the controversial idea of 
scanning the whole buffer pool every few minutes even without activity 
came from too.  I found a bursty real world workload where that was 
necessary to keep buffers clean usefully, and that heuristic helped them 
a lot.  I too would like to visit the exact logic used, but I could cook 
up a test case where it's useful again if people really doubt it has any 
value.  There's one in the 2007 archives somewhere.


The reason the checkpointer code has to do this work, and it has to 
spread the writes out, is that on some systems the hot data set hits a 
high usage count.  If shared_buffers is 8GB and at any moment 6GB of it 
has a usage count of 5, which absolutely happens on many busy servers, 
the background writer will do almost nothing useful.  It won't and 
shouldn't touch buffers unless their usage count is low.  Those heavily 
referenced blocks will only be written to disk once per checkpoint cycle.


Without the spreading, in this example you will drop 6GB into Dirty 
Memory on a Linux server, call fdatasync, and the server might stop 
doing any work at all for *minutes* of time.  Easiest way to see it 
happen is to set checkpoint_completion_target to 0, put the filesystem 
on ext3, and have a server with lots of RAM.  I have a monitoring tool 
that graphs Dirty Memory over time because this problem is so nasty even 
with the spreading code in place.


There is this idea that pops up sometimes that a background writer write 
is better than a checkpoint one.  This is backwards.  A dirty block must 
be written at least once per checkpoint.  If you only write it once per 
checkpoint, inside of the checkpoint process, that is the ideal.  It's 
what you want for best performance when it's possible.


At the same time, some workloads churn through a lot of low usage count 
data, rather than building up a large block of high usage count stuff. 
On those your best hope for low latency is to crank up the background 
writer and let it try to stay ahead of backends with the writes.  The 
checkpointer won't have nearly as much work to do in that situation.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


bgwriter-snapshot.xls
Description: MS-Excel spreadsheet



Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Greg Smith

On 6/6/13 4:41 AM, Heikki Linnakangas wrote:


I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:


Combine what your submitted patch does and this idea, and you'll have 
something I prototyped a few years ago.  I took the logic and tested it 
out in user space by parsing the output from log_checkpoints to see how 
many segments were being used.  That approach coughed out a value about 
as good for checkpoint_segments as I picked by hand.
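
The logic I mean looks roughly like this (a toy model only, not the 
bgwriter's or my prototype's actual code): jump to any new peak 
immediately, then let the estimate decay slowly back down:

/* Toy model of the asymmetric moving-average estimate. */
#include <stdio.h>

int main(void)
{
    /* pretend these are WAL segments consumed per checkpoint cycle */
    int used[] = {12, 14, 13, 90, 15, 14, 13, 12, 14, 13, 15, 14};
    const int cycles = sizeof(used) / sizeof(used[0]);
    const double smoothing = 16.0;
    double estimate = 0.0;

    for (int i = 0; i < cycles; i++)
    {
        if (used[i] > estimate)
            estimate = used[i];                           /* react to increases at once */
        else
            estimate += (used[i] - estimate) / smoothing; /* decay like a moving average */
        printf("cycle %2d: used %2d, estimate %5.1f\n", i, used[i], estimate);
    }
    return 0;
}

Note how the single burst of 90 in that sample drags the estimate up and 
it only drifts back down slowly afterward.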


The main problem was it liked to over-tune the segments based on small 
bursts of activity, leaving a value higher than you might want to use 
the rest of the time.  The background writer didn't worry about this 
very much because the cost of making a mistake for one 200ms cycle was 
pretty low.  Setting checkpoint_segments high is a more expensive issue. 
 When I set these by hand, I'll aim more to cover a 99th percentile of 
the maximum segments number rather than every worst case seen.


I don't think that improvement is worth spending very much effort on 
though.  The moving average approach is more than good enough in most 
cases.  I've wanted checkpoint_segments to go away in exactly this 
fashion for a while.


The general complaint the last time I suggested a change in this area, 
to make checkpoint_segments larger for the average user, was that some 
people had seen workloads where that was counterproductive.  Pretty sure 
Kevin Grittner said he'd seen that happen.  That's how I remember this 
general idea dying the last time, and I still don't have enough data to 
refute that claim.


As far as the UI, if it's a soft limit I'd suggest wal_size_target for 
the name.  What I would like to see is a single number here in memory 
units that replaces both checkpoint_segments and wal_keep_segments.  If 
you're willing to use a large chunk of disk space to handle either one 
of activity spikes or the class of replication issues wal_keep_segments 
targets, I don't see why you'd want to ban using that space for the 
other one too.


To put some perspective on how far we've been able to push this in the 
field with minimal gripes, the repmgr tool requires wal_keep_segments be 
>=5000, which works out to 78GB.  I still see some people use 73GB SAS 
drives in production servers for their WAL files, but that's the only 
time I've seen that number become scary when deploying repmgr. 
Meanwhile, the highest value for checkpoint_segments I've set based on 
real activity levels was 1024, on a server where checkpoint_timeout is 
15 minutes (and can be no shorter without checkpoint spikes).  At no 
point during that fairly difficult bit of tuning work did 
checkpoint_segments do anything but get in the way.
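
For scale: at 16MB per WAL segment, 5000 segments is where that 78GB 
comes from, and 1024 segments works out to 16GB.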


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 6:49 AM, Robert Haas wrote:

On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote:

So we don't even know whether we can read. I think that means we need to
zero the file anyway...


Surely this is undue pessimism.


There have been many occasions where I've found the Linux kernel 
defining support for POSIX behavior with a NOP stub that basically says 
"we should make this work one day".  I don't know whether the fallocate 
code is one of those or a fully implemented call.  Based on that 
history, until I see a reader that validates the resulting files are 
good I have to assume they're not.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 7:17 AM, Andres Freund wrote:

That argument in contrast I find not very convincing though. What was
the last incidence of such a system call that did not just error out
with ENOTSUPP or such?


http://linux.die.net/man/2/posix_fadvise talks about  POSIX_FADV_NOREUSE 
and POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op, 
depending on your version.  I know there were more examples than just 
that one that popped up during the testing of effective_io_concurrency. 
 My starting position has to assume that posix_fallocate can have the 
same sort of surprising behavior that showed up repeatedly when we were 
trying to use posix_fadvise more aggressively.


The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same 
thing) was a similar issue, and that's the first one that left me 
forever skeptical of Linux kernel claims in this area until they are 
explicitly validated:  http://lwn.net/Articles/350225/


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 7:52 AM, Andres Freund wrote:

fadvise is a hint which is pretty
different from a fallocate where ignoring would have way much more
severe consequences.


Yes, it will.  That's why I want to see it tested.  There is more than 
enough past examples of bad behavior here to be skeptical that this sort 
of API may not work exactly as specified.  If you're willing to believe 
the spec, that's fine, but I think that's dangerously optimistic.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 8:50 AM, Stephen Frost wrote:

I don't think this solves the locking issue around the
relation extention lock, but it might help some and it may make it
easier to tweak that logic to allocate larger chunks in the future.


It might make it a bit faster, but it doesn't make it any easier to 
implement.  The messy part of extending relations in larger chunks is 
how to communicate that back into the buffer manager usefully.  The 
extension path causing trouble is RelationGetBufferForTuple calling 
ReadBufferBI.  All of that is passing a single buffer around.  There's 
no simple way I can see to rewrite it to handle more than one at a time.


I have a test case for relation extension that I'm going to package up 
soon.  That makes it easy to see where the problem is at.  Far as I can 
tell the reason it hasn't been fixed before now is that it's a pain to 
write the code.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 11:21 AM, Alvaro Herrera wrote:

Greg Smith escribió:


The messy part of extending relations in larger chunks
is how to communicate that back into the buffer manager usefully.
The extension path causing trouble is RelationGetBufferForTuple
calling ReadBufferBI.  All of that is passing a single buffer
around.  There's no simple way I can see to rewrite it to handle
more than one at a time.


No, but we can have it create several pages and insert them into the
FSM.  So they aren't returned to the original caller but are available
to future users.


There's actually a code comment wondering about this topic for the pages 
that are already created, in src/backend/access/heap/hio.c :


Remember the new page as our target for future insertions.
XXX should we enter the new page into the free space map immediately, or 
just keep it for this backend's exclusive use in the short run (until 
VACUUM sees it)?  Seems to depend on whether you expect the current 
backend to make more insertions or not, which is probably a good bet 
most of the time.  So for now, don't add it to FSM yet.


We have to be careful about touching too much at that particular point, 
because it's holding a relation extension lock at the obvious spot to 
make a change.


There's an interesting overlap with these questions about how files are 
extended too, with this comment in that file too, just before the above:


XXX This does an lseek - rather expensive - but at the moment it is the 
only way to accurately determine how many blocks are in a relation.  Is 
it worth keeping an accurate file length in shared memory someplace, 
rather than relying on the kernel to do it for us?


That whole sequence of code took the easy way forward when it was 
written, but it's obvious the harder one (also touching the FSM) was 
considered even then.  The whole sequence needs to be revisited to pull 
off multiple page extension.  I wouldn't say it's hard, but it's enough 
work that I haven't been able to find a block of time to go through the 
whole thing.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Greg Smith

On 5/28/13 11:12 AM, Jon Nelson wrote:

It opens a new file, fallocates 16MB, calls fdatasync.


Outside of the run for performance testing, I think it would be good at 
this point to validate that there is really a 16MB file full of zeroes 
resulting from these operations.  I am not really concerned that 
posix_fallocate might be slower in some cases; that seems unlikely.  I 
am concerned that it might result in a file that isn't structurally the 
same as the 16MB of zero writes implementation used now.


The timing program you're writing has some aspects that are similar to 
the contrib/pg_test_fsync program.  You might borrow some code from 
there usefully.


To clarify the suggestion I was making before about including 
performance test results:  that doesn't necessarily mean the testing 
code must run using only the database.  That's better if possible, but 
as Robert says it may not be for some optimizations.  The important 
thing is to have something measuring the improvement that a reviewer can 
duplicate, and if that's a standalone benchmark problem that's still 
very useful.  The main thing I'm wary of is any "this should be faster" 
claims that don't come with any repeatable measurements at all.  Very 
often theories about the fastest way to do something don't match what's 
actually seen in testing.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Greg Smith

On 5/28/13 10:00 PM, Jon Nelson wrote:

On Tue, May 28, 2013 at 10:36 AM, Greg Smith g...@2ndquadrant.com wrote:

On 5/28/13 11:12 AM, Jon Nelson wrote:


It opens a new file, fallocates 16MB, calls fdatasync.



Outside of the run for performance testing, I think it would be good at this
point to validate that there is really a 16MB file full of zeroes resulting
from these operations.  I am not really concerned that posix_fallocate might
be slower in some cases; that seems unlikely.  I am concerned that it might
result in a file that isn't structurally the same as the 16MB of zero writes
implementation used now.


util-linux comes with fallocate which (might?) suffice for testing in
that respect, no?
If that is a real concern, it could be made part of the autoconf
testing, perhaps.


I was just thinking of something to run in your test program, not 
another build time check.  Just run the new allocation sequence, and 
then check the resulting WAL file for a) correct length, and b) 16MB of 
zero bytes.  I would like to build some confidence that posix_fallocate 
is operating correctly in this context on at least one platform.  My 
experience with Linux handling this class of functions correctly has 
left me skeptical of them working until that's proven to be the case.
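
Something along these lines is the sort of standalone check I have in 
mind (simplified, and not meant to be the patch's code): allocate a 16MB 
file the new way, sync it, then read it back and confirm the length and 
that every byte really is zero:

/* Standalone sanity check for posix_fallocate: length and all-zero contents. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)

int main(void)
{
    const char *path = "fallocate_test.seg";    /* stand-in for a WAL segment */
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    if (posix_fallocate(fd, 0, SEG_SIZE) != 0 || fdatasync(fd) != 0)
    {
        fprintf(stderr, "allocate or sync failed\n");
        return 1;
    }

    char buf[8192];
    long total = 0;
    ssize_t n;

    lseek(fd, 0, SEEK_SET);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
    {
        for (ssize_t i = 0; i < n; i++)
        {
            if (buf[i] != 0)
            {
                fprintf(stderr, "non-zero byte at offset %ld\n", total + (long) i);
                return 1;
            }
        }
        total += n;
    }
    close(fd);
    printf("length %ld bytes (%s), contents all zero\n",
           total, total == SEG_SIZE ? "correct" : "WRONG");
    return 0;
}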


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Greg Smith

On 5/24/13 10:36 AM, Jim Nasby wrote:

Instead of KB/s, could we look at how much time one process is spending
waiting on IO vs the rest of the cluster? Is it reasonable for us to
measure IO wait time for every request, at least on the most popular OSes?


It's not just an OS specific issue.  The overhead of collecting timing 
data varies massively based on your hardware, which is why there's the 
pg_test_timing tool now to help quantify that.


I have a design I'm working on that exposes the system load to the 
database usefully.  That's what I think people really want if the goal 
is to be adaptive based on what else is going on.  My idea is to use 
what uptime collects as a starting useful set of numbers to quantify 
what's going on.  If you have both a short term load measurement and a 
longer term one like uptime provides, you can quantify both the overall 
load and whether it's rising or falling.  I want to swipe some ideas on 
how moving averages are used to determine trend in stock trading 
systems: 
http://www.onlinetradingconcepts.com/TechnicalAnalysis/MASimple2.html
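
The starting point could be as simple as this sketch (an illustration 
only, not the design itself): grab the load averages the way uptime does 
and compare the short window against the long one to get a trend:

/* Sketch: derive a load trend from the same numbers uptime reports. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double load[3];                     /* 1, 5, and 15 minute load averages */

    if (getloadavg(load, 3) < 3)
    {
        fprintf(stderr, "getloadavg failed\n");
        return 1;
    }
    printf("1m %.2f  5m %.2f  15m %.2f -> load is %s\n",
           load[0], load[1], load[2],
           load[0] > load[2] ? "rising" : "falling or steady");
    return 0;
}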


Dynamic load-sensitive statement limits and autovacuum are completely 
feasible on UNIX-like systems.  The work to insert a cost delay point 
needs to get done before building more complicated logic on top of it 
though, so I'm starting with this part.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Greg Smith

On 5/24/13 9:21 AM, Robert Haas wrote:


But I wonder if we wouldn't be better off coming up with a little more
user-friendly API.  Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = amount of data you can dirty per
second, in kB and read_rate_limit = amount of data you can read into
shared buffers per second, in kB.


I already made and lost the argument for doing vacuum in KB/s units, so 
I wasn't planning on putting that in the way of this one.  I still think 
it's possible to switch to real world units and simplify all of those 
parameters.  Maybe I'll get the energy to fight this battle again for 
9.4.  I do have a lot more tuning data from production deployments to 
use as evidence now.


I don't think the UI end changes the bulk of the implementation work 
though.  The time consuming part of this development is inserting all of 
the cost delay hooks and validating they work.  Exactly what parameters 
and logic fires when they are called can easily be refactored later.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Block write statistics WIP

2013-05-23 Thread Greg Smith

On 5/20/13 7:51 AM, Heikki Linnakangas wrote:

The way that MarkDirty requires this specific underlying storage manager
behavior to work properly strikes me as as a bit of a layering violation
too. I'd like the read and write paths to have a similar API, but here
they don't even operate on the same type of inputs. Addressing that is
probably harder than just throwing a hack on the existing code though.


To be honest, I don't understand what you mean by that. ?


Let's say you were designing a storage layer API from scratch for what 
Postgres does.  That might take a relation as its input, like ReadBuffer 
does.  Hiding the details of how that turns into a physical file 
operation would be a useful goal of such a layer.  I'd then consider it 
a problem if that layer exposed to callers things like the actual 
mapping of relations onto files.


What we actually have right now is this MarkDirty function that operates 
on BufferTag data, which points directly to the underlying file.  I see 
that as cutting the storage API in half and calling a function in the 
middle of the implementation.  It strikes me as kind of weird that the 
read side and write side are not more symmetrical.
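
To make the asymmetry concrete, the relevant declarations look roughly 
like this (paraphrased from bufmgr.h and buf_internals.h; the comments 
are mine):

/* Read side: the caller hands over a Relation, so per-relation counters
   are easy to bump */
Buffer ReadBuffer(Relation reln, BlockNumber blockNum);

/* Write side: the caller only hands over a Buffer.  All the buffer header
   can recover is its tag, which identifies the physical file, not the
   Relation. */
void MarkBufferDirty(Buffer buffer);

typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* block number relative to start of reln */
} BufferTag;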


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] Cost limited statements RFC

2013-05-23 Thread Greg Smith
I'm working on a new project here that I wanted to announce, just to 
keep from duplicating effort in this area.  I've started to add a cost 
limit delay for regular statements.  The idea is that you set a new 
statement_cost_delay setting before running something, and it will 
restrict total resources the same way autovacuum does.  I'll be happy 
with it when it's good enough to throttle I/O on SELECT and CREATE INDEX 
CONCURRENTLY.


Modifying the buffer manager to account for statement-based cost 
accumulation isn't difficult.  The tricky part here is finding the right 
spots to put the delay in.  In the vacuum case, it's easy to insert a 
call to check for a delay after every block of I/O.  It should be 
possible to find a single spot, or a small number of spots, to put a 
delay check in the executor.  But I expect that every utility command may 
need to be modified individually to find a useful delay point.  This is 
starting to remind me of the SEPostgres refactoring, because all of the 
per-command uniqueness ends up requiring a lot of work to modify in a 
unified way.
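
To give an idea of what such a delay point would look like, here's a 
hypothetical sketch modeled on vacuum_delay_point(); none of the 
Statement* names exist in the tree, they're stand-ins for whatever the 
buffer manager ends up accumulating per statement:

/* Hypothetical sketch only; the Statement* variables are invented names */
static void
statement_delay_point(void)
{
    CHECK_FOR_INTERRUPTS();

    if (StatementCostActive && !InterruptPending &&
        StatementCostBalance >= StatementCostLimit)
    {
        int     msec;

        /* Sleep in proportion to how far over budget we are, with a cap */
        msec = StatementCostDelay * StatementCostBalance / StatementCostLimit;
        if (msec > StatementCostDelay * 4)
            msec = StatementCostDelay * 4;

        pg_usleep(msec * 1000L);
        StatementCostBalance = 0;
    }
}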


The main unintended-consequences issue I've found so far is when a 
cost-delayed statement holds a heavy lock.  Autovacuum has some 
protection against letting processes with an exclusive lock on a table go 
to sleep.  It won't be easy to do that with arbitrary statements.  
There's a certain amount of allowing the user to shoot themselves in the 
foot here that will be time-consuming (if not impossible) to eliminate.  
The person who runs an exclusive CLUSTER that's limited by 
statement_cost_delay may suffer from holding the lock too long.  But that 
might be their intention in setting the value.  It's hard to idiot-proof 
this without eliminating useful options too.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-05-23 Thread Greg Smith

On 5/23/13 7:34 PM, Claudio Freire wrote:

Why not make the delay conditional on the amount of concurrency, kinda
like the commit_delay? Although in this case, it should only count
unwaiting connections.


The test run by commit_delay is way too heavy to run after every block 
is processed.  That code is only hit when there's a commit, which 
already assumes a lot of overhead is going on--the disk flush to WAL--so 
burning some processing/lock acquisition time isn't a big deal.  The 
spot where the statement delay is going is so performance-sensitive that 
everything it touches needs to be local to the backend.
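
For reference, the check commit_delay performs before sleeping is roughly 
this (paraphrased from the commit path); MinimumActiveBackends() has to 
walk the ProcArray, which is exactly the sort of shared-state poking I 
want to keep off the statement delay path:

/* Paraphrased: only nap if enough other backends are currently active */
if (CommitDelay > 0 && enableFsync &&
    MinimumActiveBackends(CommitSiblings))
    pg_usleep(CommitDelay);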


For finding cost-delayed statements that are causing trouble because 
they are holding locks, the only place I've thought of that runs 
infrequently and is poking at the right data is the deadlock detector. 
Turning that into a more general mechanism for finding priority 
inversion issues is an interesting idea.  It's a bit further down the 
road from what I'm staring at now though.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Cost limited statements RFC

2013-05-23 Thread Greg Smith

On 5/23/13 7:56 PM, Claudio Freire wrote:

Besides the obvious option of making a lighter check (it doesn't have
to be 100% precise), wouldn't this check be done when it would
otherwise sleep? Is it still so heavy in that context?


A commit to a typical 7200RPM disk takes about 10ms, while 
autovacuum_vacuum_cost_delay is 20ms.  If the statement cost limit logic 
were no more complicated than commit_delay's, it would be feasible to do 
something similar each time a statement was being put to sleep.


I suspect that the cheapest useful thing will be more expensive than 
commit_delay's test.  That's a guess though.  I'll have to think about 
this more when I circle back toward usability.  Thanks for the 
implementation idea.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] Block write statistics WIP

2013-05-19 Thread Greg Smith
I have some time now for working on the mystery of why there are no 
block write statistics in the database.  I hacked out the statistics 
collection and reporting side already; a rough patch is attached if you 
want to see the boring parts.


I guessed that there had to be a gotcha behind why this wasn't done 
before now, and I've found one so far.  All of the read statistics are 
collected with a macro that expects to know a Relation number.  Callers 
to ReadBuffer pass one.  On the write side, the two places that 
increment the existing write counters (pg_stat_statements, vacuum) are 
MarkBufferDirty and MarkBufferDirtyHint.  Neither of those is given a 
Relation though.  Inspecting the Buffer passed can only find the buffer 
tag's RelFileNode.


I've thought of two paths to get a block write count out of that so far:

-Provide a function to find the Relation from the RelFileNode.  There is 
a warning about the perils of assuming you can map that way from a 
buftag value in buf_internals.h though:


Note: the BufferTag data must be sufficient to determine where to write 
the block, without reference to pg_class or pg_tablespace entries.  It's 
possible that the backend flushing the buffer doesn't even believe the 
relation is visible yet (its xact may have started before the xact that 
created the rel).  The storage manager must be able to cope anyway.


-Modify every caller of MarkDirty* to include a relation when that 
information is available.  There are about 200 callers of those 
functions around, so that won't be fun.  I noted that many of them are 
in the XLog replay functions, which won't have the relation--but those 
aren't important to count anyway.
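
If the second path is the one that wins, the mechanical change at each 
call site would look something like this (hypothetical signature, not 
what's in the tree today):

/* Hypothetical: MarkBufferDirty() grows a Relation argument for accounting */
extern void MarkBufferDirty(Buffer buffer, Relation reln);

/* A typical call site in the heap or index code would then pass its relation */
MarkBufferDirty(buffer, relation);

/* XLog replay paths have no Relation; they'd pass NULL and skip the counter */
MarkBufferDirty(buffer, NULL);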


Neither of these options feels very good to me, so selecting between the 
two feels like picking the lesser of two evils.  I'm fine with chugging 
through all of the callers to modify MarkDirty*, but I didn't want to do 
that only to have the whole approach rejected as wrong.  That's why I 
stopped here to get some feedback.


The way that MarkDirty requires this specific underlying storage manager 
behavior to work properly strikes me as a bit of a layering violation 
too.  I'd like the read and write paths to have a similar API, but here 
they don't even operate on the same type of inputs.  Addressing that is 
probably harder than just throwing a hack on the existing code though.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a03bfa6..1773d59 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -467,15 +467,19 @@ CREATE VIEW pg_statio_all_tables AS
 pg_stat_get_blocks_fetched(C.oid) -
 pg_stat_get_blocks_hit(C.oid) AS heap_blks_read,
 pg_stat_get_blocks_hit(C.oid) AS heap_blks_hit,
+pg_stat_get_blocks_dirtied(C.oid) AS heap_blks_dirtied,
 sum(pg_stat_get_blocks_fetched(I.indexrelid) -
 pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_read,
 sum(pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_hit,
+sum(pg_stat_get_blocks_dirtied(I.indexrelid))::bigint AS idx_blks_dirtied,
 pg_stat_get_blocks_fetched(T.oid) -
 pg_stat_get_blocks_hit(T.oid) AS toast_blks_read,
 pg_stat_get_blocks_hit(T.oid) AS toast_blks_hit,
+pg_stat_get_blocks_dirtied(T.oid) AS toast_blks_dirtied,
 pg_stat_get_blocks_fetched(X.oid) -
 pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read,
-pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit
+pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit,
+pg_stat_get_blocks_dirtied(X.oid) AS tidx_blks_dirtied
 FROM pg_class C LEFT JOIN
 pg_index I ON C.oid = I.indrelid LEFT JOIN
 pg_class T ON C.reltoastrelid = T.oid LEFT JOIN
@@ -530,6 +534,7 @@ CREATE VIEW pg_statio_all_indexes AS
 pg_stat_get_blocks_fetched(I.oid) -
 pg_stat_get_blocks_hit(I.oid) AS idx_blks_read,
-pg_stat_get_blocks_hit(I.oid) AS idx_blks_hit
+pg_stat_get_blocks_hit(I.oid) AS idx_blks_hit,
+pg_stat_get_blocks_dirtied(I.oid) AS idx_blks_dirtied
 FROM pg_class C JOIN
 pg_index X ON C.oid = X.indrelid JOIN
 pg_class I ON I.oid = X.indexrelid
@@ -554,6 +559,7 @@ CREATE VIEW pg_statio_all_sequences AS
 pg_stat_get_blocks_fetched(C.oid) -
 pg_stat_get_blocks_hit(C.oid) AS blks_read,
-pg_stat_get_blocks_hit(C.oid) AS blks_hit
+pg_stat_get_blocks_hit(C.oid) AS blks_hit,
+pg_stat_get_blocks_dirtied(C.oid) AS blks_dirtied
 FROM pg_class C
 LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
 WHERE C.relkind = 'S';
@@ -622,6 +628,7 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_blocks_fetched(D.oid) -
 pg_stat_get_db_blocks_hit(D.oid