Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2013-02-26 Thread Robert Haas
On Mon, Feb 25, 2013 at 10:22 PM, Greg Stark st...@mit.edu wrote:
> On Mon, Feb 25, 2013 at 8:26 PM, Robert Haas robertmh...@gmail.com wrote:
>> On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby j...@nasby.net wrote:
>>> We actually do that in our application and have discovered that random
>>> sampling can end up significantly skewing your data.
>>
>> /me blinks.
>>
>> How so?
>
> Sampling is a pretty big area of statistics. There are dozens of
> sampling methods to deal with various problems that occur with
> different types of data distributions.
>
> One problem is that if you have some very rare events, random sampling
> can produce odd results: the rare events drop out entirely unless your
> sample is very large, whereas less rare events are represented
> proportionally. There are sampling methods that ensure that x% of the
> rare events are included even if those rare events are less than x% of
> your total data set. One of those might be appropriate for profiling
> data when you're looking for rare slow queries amongst many faster
> queries.

I'll grant all that, but it still seems to me like x% of all queries
plus all queries running longer than x milliseconds would cover most
of the interesting cases.
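
A sketch of that rule as a runnable SQL predicate (the function and
parameter names here are invented for illustration, not an existing or
proposed knob; the real decision would live in C inside the server):

    CREATE OR REPLACE FUNCTION should_log(duration_ms float8,
                                          threshold_ms float8 DEFAULT 100,
                                          sample_rate float8 DEFAULT 0.01)
    RETURNS boolean LANGUAGE sql AS $$
        -- keep every statement over the duration threshold, plus a
        -- random sample_rate fraction of everything else
        SELECT duration_ms >= threshold_ms OR random() < sample_rate;
    $$;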

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2013-02-26 Thread Jim Nasby

On 2/26/13 11:19 AM, Robert Haas wrote:

> On Mon, Feb 25, 2013 at 10:22 PM, Greg Stark st...@mit.edu wrote:
>> On Mon, Feb 25, 2013 at 8:26 PM, Robert Haas robertmh...@gmail.com wrote:
>>> On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby j...@nasby.net wrote:
>>>> We actually do that in our application and have discovered that random
>>>> sampling can end up significantly skewing your data.
>>>
>>> /me blinks.
>>>
>>> How so?
>>
>> Sampling is a pretty big area of statistics. There are dozens of
>> sampling methods to deal with various problems that occur with
>> different types of data distributions.
>>
>> One problem is that if you have some very rare events, random sampling
>> can produce odd results: the rare events drop out entirely unless your
>> sample is very large, whereas less rare events are represented
>> proportionally. There are sampling methods that ensure that x% of the
>> rare events are included even if those rare events are less than x% of
>> your total data set. One of those might be appropriate for profiling
>> data when you're looking for rare slow queries amongst many faster
>> queries.
>
> I'll grant all that, but it still seems to me like x% of all queries
> plus all queries running longer than x milliseconds would cover most
> of the interesting cases.


In our specific case, we were capturing statistics about webpage hits; when
we took random samples and multiplied back out, there were some
inconsistencies that we couldn't explain. We just turned the sampling off and
never really investigated, so it's possible that something in our
implementation was flawed.

However, randomness can also work against you in strange ways. You can easily
get a glut of samples that are skewed in one direction or another, and the
problem can potentially be far worse if your randomness is actually
influenced by some other aspect of the system.

For this case it might be good enough; I just wanted to add a note of caution.




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2013-02-25 Thread Robert Haas
On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby j...@nasby.net wrote:
> We actually do that in our application and have discovered that random
> sampling can end up significantly skewing your data.

/me blinks.

How so?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2013-02-25 Thread Greg Stark
On Mon, Feb 25, 2013 at 8:26 PM, Robert Haas robertmh...@gmail.com wrote:
> On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby j...@nasby.net wrote:
>> We actually do that in our application and have discovered that random
>> sampling can end up significantly skewing your data.
>
> /me blinks.
>
> How so?

Sampling is a pretty big area of statistics. There are dozens of
sampling methods to deal with various problems that occur with
different types of data distributions.

One problem is that if you have some very rare events, random sampling
can produce odd results: the rare events drop out entirely unless your
sample is very large, whereas less rare events are represented
proportionally. There are sampling methods that ensure that x% of the
rare events are included even if those rare events are less than x% of
your total data set. One of those might be appropriate for profiling
data when you're looking for rare slow queries amongst many faster
queries.
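
To make that concrete, here is a stratified-sampling sketch in plain SQL
over a hypothetical query_log(duration_ms) table: the rare slow stratum
is kept at 100% while the common fast one is sampled at 1%, so the rare
events can't drop out:

    SELECT *
    FROM query_log
    WHERE CASE WHEN duration_ms >= 1000
               THEN true                 -- rare slow stratum: keep all
               ELSE random() < 0.01      -- common fast stratum: keep ~1%
          END;

When multiplying such a sample back out to estimate totals, each stratum
has to be scaled by the inverse of its own rate (100x for the fast
stratum here); mixing the strata with a single scale factor produces
exactly the kind of skew Jim describes.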

-- 
greg




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2013-02-24 Thread Jim Nasby

Sorry for the late reply, but I think I can add some ideas here...

On 11/21/12 5:33 PM, Gavin Flower wrote:

> On 22/11/12 12:15, Greg Smith wrote:
>> On 11/8/12 2:16 PM, Josh Berkus wrote:
>>> Also, logging only the long-running queries is less useful than people
>>> on this list seem to think.  When I'm doing real performance analysis, I
>>> need to see *everything* which was run, not just the slow stuff.  Often
>>> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
>>> and gets run 400 times/second. Looking just at the slow queries won't
>>> tell you that.
>>
>> No argument here.  I've tried to be clear that some of these high-speed
>> but lossy things I'm hacking on are not going to be useful for everyone.
>> A solution that found most of these 1.8ms queries, but dropped some
>> percentage under the highest peak load conditions, would still be very
>> useful to me.
>>
>> An anecdote on this topic seems relevant.  I have a troublesome
>> production server that has moved log_min_duration_statement from 100ms
>> to 500ms to 2000ms as the system grew.  Even the original setting wasn't
>> short enough to catch everything we would like to watch *now*, but
>> seeing sub-second data is a dream at this point.  The increases have
>> been forced by logging contention becoming unmanageable when every
>> client has to fight for the log to write to disk.  I can see the buggers
>> stack up as waiting for locks if I try to log shorter statements,
>> stalling enough that it drags the whole server down under peak load.
>>
>> If I could just turn off logging during those periods--basically,
>> throwing entries away only when some output-rate-throttled component hit
>> its limit--I could still find them in the data collected the rest of the
>> time.  There are some types of problems that also only occur under peak
>> load that this idea would miss, but you'd still be likely to get *some*
>> of them, statistically.
>>
>> There's a really hard trade-off here:
>>
>> -Sometimes you must save data on every query to capture fine details
>> -Making every query wait for disk I/O is impractical
>>
>> The sort of ideas you threw out for making things like auto-explain
>> logging per-session can work.  The main limitation there, though, is
>> that it presumes you even know what area the problem is in the first
>> place.  I am not optimistic that covers a whole lot of ground either.
>>
>> Upthread I talked about a background process that persists shared memory
>> queues as a future consumer of timing events--one that might consume
>> slow query data too.  That is effectively acting as the ideal component
>> I described above, one that only loses things when it exceeds the
>> system's write capacity for saving them.  I wouldn't want to try and
>> retrofit the existing PostgreSQL logging facility for such a thing
>> though.  Log parsing as the way to collect data is filled with headaches
>> anyway.
>>
>> I don't see any other good way to resolve this trade-off.  To help with
>> the real-world situation you describe, ideally you dump all the query
>> data somewhere, fast, and have the client move on.  You can't make
>> queries wait for storage; something else (with a buffer!) needs to take
>> over that job.
>>
>> I can't find the URL right now, but at PG.EU someone was showing me a
>> module that grabbed the new 9.2 logging hook and shipped the result to
>> another server.  Clients burn a little CPU and network time and they
>> move on, and magically disk I/O goes out of their concerns.  How much
>> overhead persisting the data takes isn't the database server's job at
>> all then.  That's the sort of flexibility people need to have with
>> logging eventually: something so efficient that every client can afford
>> to do it; it is capable of saving all events under ideal conditions; but
>> under adverse ones, you have to keep going and accept the loss.


I definitely think we need a means to log stuff really, really fast and then
deal with it later. And ideally the log would not be text, because text is a
PITA to deal with and often far more data to handle (which makes all our
problems worse).

I like that Greg is proposing a method of logging information in a more
useful format than text. I suggest that the non-text structure be implemented
outside of any particular logging methodology if possible, perhaps as
something *equivalent to* a set of data types in PG. Of course, I doubt real
PG types could be used for the actual logging for performance reasons, but
the end interface should be able to use PG capabilities (inheritance is
another interesting possibility).

Josh also brings up an interesting use case, one where I really like the
methodology of Tom Kyte (the Oracle Ask Tom guy). When asked how to do
logging in PL/SQL, he proposed a system where you always associate some kind
of context with log data, so that you can enable logging *only for that
context*. In this case (and in plpgsql syntax), you'd do something like:

PERFORM debug( 'my function name', 'blah blah just happened' );
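
A minimal sketch of how such a debug() might look in plpgsql (the context
table and all names here are invented for illustration, not Tom Kyte's
actual implementation):

    CREATE TABLE debug_contexts (context text PRIMARY KEY);

    CREATE OR REPLACE FUNCTION debug(p_context text, p_message text)
    RETURNS void LANGUAGE plpgsql AS $$
    BEGIN
        -- emit the message only when this context has been enabled
        IF EXISTS (SELECT 1 FROM debug_contexts WHERE context = p_context) THEN
            RAISE LOG 'debug [%]: %', p_context, p_message;
        END IF;
    END;
    $$;

    -- enable logging for one function's context only:
    INSERT INTO debug_contexts VALUES ('my function name');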

That way you're not stuck trying to mess 

Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2012-11-22 Thread Alvaro Herrera
Greg Smith wrote:

> If I could just turn off logging during those periods--basically,
> throwing entries away only when some output-rate-throttled component
> hit its limit--I could still find them in the data collected the rest
> of the time.  There are some types of problems that also only occur
> under peak load that this idea would miss, but you'd still be likely
> to get *some* of them, statistically.

Uhm, what if you had a simple script that changed postgresql.conf and
SIGHUP'ed the postmaster?  You could have it turn logging on and off
depending on predefined system load watermarks.
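
Something like this, driven from a scheduler outside any transaction block
(a sketch only; ALTER SYSTEM arrived in 9.4, after this thread, so at the
time the script would have had to rewrite postgresql.conf itself before
sending the SIGHUP):

    -- load proxy: compare active backends against your chosen watermark
    SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

    -- above the high watermark: stop logging statement durations
    ALTER SYSTEM SET log_min_duration_statement = -1;
    SELECT pg_reload_conf();    -- same effect as SIGHUP'ing the postmaster

    -- back below the low watermark: resume logging
    ALTER SYSTEM SET log_min_duration_statement = '500ms';
    SELECT pg_reload_conf();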

-- 
Álvaro Herrera                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2012-11-21 Thread Greg Smith

On 11/8/12 2:16 PM, Josh Berkus wrote:


> Also, logging only the long-running queries is less useful than people
> on this list seem to think.  When I'm doing real performance analysis, I
> need to see *everything* which was run, not just the slow stuff.  Often
> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
> and gets run 400 times/second. Looking just at the slow queries won't
> tell you that.


No argument here.  I've tried to be clear that some of these high-speed 
but lossy things I'm hacking on are not going to be useful for everyone. 
 A solution that found most of these 1.8ms queries, but dropped some 
percentage under the highest peak load conditions, would still be very 
useful to me.


An anecdote on this topic seems relevant.  I have a troublesome 
production server that has moved log_min_duration_statement from 100ms 
to 500ms to 2000ms as the system grew.  Even the original setting wasn't 
short enough to catch everything we would like to watch *now*, but 
seeing sub-second data is a dream at this point.  The increases have 
been forced by logging contention becoming unmanageable when every client 
has to fight for the log to write to disk.  I can see the buggers stack 
up as waiting for locks if I try to log shorter statements, stalling 
enough that it drags the whole server down under peak load.


If I could just turn off logging during those periods--basically, 
throwing entries away only when some output-rate-throttled component hit 
its limit--I could still find them in the data collected the rest of the 
time.  There are some types of problems that also only occur under peak 
load that this idea would miss, but you'd still be likely to get *some* 
of them, statistically.


There's a really hard trade-off here:

-Sometimes you must save data on every query to capture fine details
-Making every query wait for disk I/O is impractical

The sort of ideas you threw out for making things like auto-explain 
logging per-session can work.  The main limitation there, though, is that 
it presumes you even know what area the problem is in the first place. 
I am not optimistic that covers a whole lot of ground either.


Upthread I talked about a background process that persists shared memory 
queues as a future consumer of timing events--one that might consume 
slow query data too.  That is effectively acting as the ideal component 
I described above, one that only loses things when it exceeds the 
system's write capacity for saving them.  I wouldn't want to try and 
retrofit the existing PostgreSQL logging facility for such a thing 
though.  Log parsing as the way to collect data is filled with headaches 
anyway.


I don't see any other good way to resolve this trade-off.  To help with 
the real-world situation you describe, ideally you dump all the query 
data somewhere, fast, and have the client move on.  You can't make 
queries wait for storage; something else (with a buffer!) needs to take 
over that job.


I can't find the URL right now, but at PG.EU someone was showing me a 
module that grabbed the new 9.2 logging hook and shipped the result to 
another server.  Clients burn a little CPU and network time and they 
move on, and magically disk I/O goes out of their concerns.  How much 
overhead persisting the data takes isn't the database server's job at 
all then.  That's the sort of flexibility people need to have with 
logging eventually: something so efficient that every client can afford 
to do it; it is capable of saving all events under ideal conditions; but 
under adverse ones, you have to keep going and accept the loss.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2012-11-21 Thread Gavin Flower

On 22/11/12 12:15, Greg Smith wrote:

> On 11/8/12 2:16 PM, Josh Berkus wrote:
>> Also, logging only the long-running queries is less useful than people
>> on this list seem to think.  When I'm doing real performance analysis, I
>> need to see *everything* which was run, not just the slow stuff.  Often
>> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
>> and gets run 400 times/second. Looking just at the slow queries won't
>> tell you that.
>
> No argument here.  I've tried to be clear that some of these high-speed
> but lossy things I'm hacking on are not going to be useful for everyone.
> A solution that found most of these 1.8ms queries, but dropped some
> percentage under the highest peak load conditions, would still be very
> useful to me.
>
> An anecdote on this topic seems relevant.  I have a troublesome
> production server that has moved log_min_duration_statement from 100ms
> to 500ms to 2000ms as the system grew.  Even the original setting wasn't
> short enough to catch everything we would like to watch *now*, but
> seeing sub-second data is a dream at this point.  The increases have
> been forced by logging contention becoming unmanageable when every
> client has to fight for the log to write to disk.  I can see the buggers
> stack up as waiting for locks if I try to log shorter statements,
> stalling enough that it drags the whole server down under peak load.
>
> If I could just turn off logging during those periods--basically,
> throwing entries away only when some output-rate-throttled component hit
> its limit--I could still find them in the data collected the rest of the
> time.  There are some types of problems that also only occur under peak
> load that this idea would miss, but you'd still be likely to get *some*
> of them, statistically.
>
> There's a really hard trade-off here:
>
> -Sometimes you must save data on every query to capture fine details
> -Making every query wait for disk I/O is impractical
>
> The sort of ideas you threw out for making things like auto-explain
> logging per-session can work.  The main limitation there, though, is
> that it presumes you even know what area the problem is in the first
> place.  I am not optimistic that covers a whole lot of ground either.
>
> Upthread I talked about a background process that persists shared
> memory queues as a future consumer of timing events--one that might
> consume slow query data too.  That is effectively acting as the ideal
> component I described above, one that only loses things when it exceeds
> the system's write capacity for saving them.  I wouldn't want to try
> and retrofit the existing PostgreSQL logging facility for such a thing
> though.  Log parsing as the way to collect data is filled with
> headaches anyway.
>
> I don't see any other good way to resolve this trade-off.  To help with
> the real-world situation you describe, ideally you dump all the query
> data somewhere, fast, and have the client move on.  You can't make
> queries wait for storage; something else (with a buffer!) needs to take
> over that job.
>
> I can't find the URL right now, but at PG.EU someone was showing me a
> module that grabbed the new 9.2 logging hook and shipped the result to
> another server.  Clients burn a little CPU and network time and they
> move on, and magically disk I/O goes out of their concerns.  How much
> overhead persisting the data takes isn't the database server's job at
> all then.  That's the sort of flexibility people need to have with
> logging eventually: something so efficient that every client can afford
> to do it; it is capable of saving all events under ideal conditions;
> but under adverse ones, you have to keep going and accept the loss.


Would it be useful to be able to specify a fixed sampling rate, say 1 in 
100?  That way you could get a well-defined statistical sample, yet not 
cause excessive I/O.
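
For reference, a knob along these lines was later added to auto_explain
itself (auto_explain.sample_rate, PostgreSQL 9.6); under that interface
the 1-in-100 idea is a one-liner for a superuser session, assuming the
module is loaded:

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;    -- consider every statement
    SET auto_explain.sample_rate = 0.01;      -- but explain only ~1 in 100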



Cheers,
Gavin




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2012-11-08 Thread Josh Berkus

> For the log volume, would it help if there was some unexpectedness
> threshold?  That is, if a statement exceeds the duration threshold, it
> gets explained.  But then it only gets logged if the actual duration
> divided by the cost estimate exceeds some threshold.

Thing is, pg_stat_plans makes this kind of use of auto_explain obsolete,
with a much more useful interface.

Where auto_explain would be useful would be to get all of the plans,
regardless of execution time, for a specific application session or a
specific stored procedure.  However, that requires the ability of the
application session to toggle auto_explain logging settings.  This was
part of the functionality which Itagaki demonstrated for auto_explain
when he first proposed it, but it was later disabled for security reasons
before commit, IIRC.  Writing a SECURITY DEFINER function to get around
the inability to toggle as a regular user has been a nonstarter when I've
proposed it to clients.
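
For concreteness, that rejected workaround looks something like this
(a sketch only; it assumes auto_explain is already loaded, e.g. via
shared_preload_libraries, and that a superuser owns the function):

    CREATE OR REPLACE FUNCTION enable_session_explain()
    RETURNS void LANGUAGE plpgsql SECURITY DEFINER AS $$
    BEGIN
        -- set_config(..., false) keeps the setting for the whole session,
        -- letting an unprivileged caller toggle the SUSET auto_explain GUC
        PERFORM set_config('auto_explain.log_min_duration', '0', false);
    END;
    $$;

    REVOKE ALL ON FUNCTION enable_session_explain() FROM PUBLIC;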

Also, logging only the long-running queries is less useful than people
on this list seem to think.  When I'm doing real performance analysis, I
need to see *everything* which was run, not just the slow stuff.  Often
the real problem is a query which used to take 1.1ms, now takes 1.8ms,
and gets run 400 times/second. Looking just at the slow queries won't
tell you that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2012-11-07 Thread Josh Berkus

> What prevents you from doing so?  The performance impact?  The volume
> of logs generated?

Yes and yes.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] auto_explain WAS: RFC: Timing Events

2012-11-07 Thread Jeff Janes
On Wed, Nov 7, 2012 at 4:55 PM, Josh Berkus j...@agliodbs.com wrote:

>> What prevents you from doing so?  The performance impact?  The volume
>> of logs generated?
>
> Yes and yes.

I've never noticed a performance impact, unless you have it analyze or
have it log every query.  How do you measure the impact?

For the log volume, would it help if there was some unexpectedness
threshold?  That is, if a statement exceeds the duration threshold, it
gets explained.  But then it only gets logged if the actual duration
divided by the cost estimate exceeds some threshold.
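
As a predicate, that filter is just a ratio test. A sketch with
illustrative names (total_cost would come from the top node of the plan,
and since cost units are abstract, the threshold is relative rather than
absolute):

    CREATE OR REPLACE FUNCTION unexpectedly_slow(duration_ms float8,
                                                 total_cost float8,
                                                 ratio_threshold float8)
    RETURNS boolean LANGUAGE sql AS $$
        -- NULLIF turns a zero cost into NULL instead of a
        -- division-by-zero error for degenerate plans
        SELECT duration_ms / NULLIF(total_cost, 0) >= ratio_threshold;
    $$;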

Cheers,

Jeff

