Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer


Andrew Dunstan schrieb am 2015-01-21:

> On 01/21/2015 11:21 AM, Arne Scheffer wrote:



> >Why is it a bad thing to call the column "stddev_samp" analog to the
> >aggregate function or make a note in the documentation, that the
> >sample stddev is used to compute the solution?


> I think you are making a mountain out of a molehill, frankly. These
> stats are not intended as anything other than a pretty indication of
> the
> shape, to see if they are significantly influenced by outliers. For
> any
> significantly large sample size the difference will be negligible.

You're right, I maybe exaggerated the statistics part a bit.
I wanted to help, because the patch is of interest for us.
I will try to keep focus in the future.


> But I will add a note to the documentation, that seems reasonable.


*happy*

Thx

Arne



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer



On Wed, 21 Jan 2015, Andrew Dunstan wrote:



On 01/21/2015 09:27 AM, Arne Scheffer wrote:

Sorry, corrected second try because of copy&paste mistakes:
VlG-Arne


Comments appreciated.
Definition var_samp = Sum of squared differences /n-1
Definition stddev_samp = sqrt(var_samp)
Example N=4
1.) Sum of squared differences
   1_4Sum(Xi-XM4)²
=
2.) adding nothing
   1_4Sum(Xi-XM4)²
  +0
  +0
  +0
=
3.) nothing changed
  1_4Sum(Xi-XM4)²
  +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²)
  +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM2)²)
  +(-1_1Sum(Xi-XM1)²+1_1Sum(Xi-XM1)²)
=
4.) parts reordered
   (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²)
  +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²)
  +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM1)²)
  +1_1Sum(X1-XM1)²
=
5.)
   (X4-XM4)(X4-XM3)
+ (X3-XM3)(X3-XM2)
+ (X2-XM2)(X2-XM1)
+ (X1-XM1)²
=
6.) XM1=X1 => There it is - The iteration part of Welfords Algorithm
(in
reverse order)
   (X4-XM4)(X4-XM3)
+ (X3-XM3)(X3-XM2)
+ (X2-XM2)(X2-X1)
+ 0
The missing piece is 4.) to 5.)
it's algebra, look at e.g.:
http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/







I have no idea what you are saying here.


I'm sorry for that statistics stuff, 
my attempt was only to visualize in detail 
the mathematical reason for 
the iterating part of Welfords algorithm

being computing the current sum of squared differences in every step

- therefore it's in my opinion better to call the variable sum_of_squared_diffs
  (every statistician will be confused bei "sum_of_variances",
   because:  sample variance = sum_of_squared_diffs / n-1,
   have a look at Mr. Cooks explanation)

- therefore deviding by n-1 is the unbiased estimator by definition.
  (have a look at Mr. Cooks explanation)

- therefore I suggested (as a minor nomenclature issue) to call the 
column/description
  stdev_samp (PostgreSQL-nomenclature) / sample_ to indicate that 
information.
  (have a look at the PostgreSQL aggregate functions, it's doing that the same 
way)



Here are comments in email to me from the author of 
<http://www.johndcook.com/blog/standard_deviation> regarding the divisor 
used:


  My code is using the unbiased form of the sample variance, dividing
  by n-1.



I am relieved, now we are at least two persons saying that. :-)
Insert into the commonly known definition


Definition stddev_samp = sqrt(var_samp)


from above, and it's exactly my point.

Maybe I should add that in the code comments. Otherwise, I don't think we 
need a change.


Huh?

Why is it a bad thing to call the column "stddev_samp" analog to the
aggregate function or make a note in the documentation, 
that the sample stddev is used to compute the solution?


I really think it not a good strategy having the user to make a test or dive
into the source code to determine the divisor used.

E.g. David expected stdev_pop, so there is a need for documentation for cases 
with a small sample.

VlG-Arne

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer
Sorry, corrected second try because of copy&paste mistakes:
VlG-Arne

> Comments appreciated.

> Definition var_samp = Sum of squared differences /n-1
> Definition stddev_samp = sqrt(var_samp)

> Example N=4

> 1.) Sum of squared differences
>   1_4Sum(Xi-XM4)²
> =
> 2.) adding nothing
>   1_4Sum(Xi-XM4)²
>  +0
>  +0
>  +0
> =
> 3.) nothing changed
>  1_4Sum(Xi-XM4)²
>  +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²)
>  +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM2)²)
>  +(-1_1Sum(Xi-XM1)²+1_1Sum(Xi-XM1)²)

> =
> 4.) parts reordered
>   (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²)
>  +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²)
>  +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM1)²)
>  +1_1Sum(X1-XM1)²
> =
> 5.)
>   (X4-XM4)(X4-XM3)
> + (X3-XM3)(X3-XM2)
> + (X2-XM2)(X2-XM1)
> + (X1-XM1)²
> =
> 6.) XM1=X1 => There it is - The iteration part of Welfords Algorithm
> (in
> reverse order)
>   (X4-XM4)(X4-XM3)
> + (X3-XM3)(X3-XM2)
> + (X2-XM2)(X2-X1)
> + 0

> The missing piece is 4.) to 5.)
> it's algebra, look at e.g.:
> http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer
> >>I don't understand. I'm following pretty exactly the calculations
> >>stated
> >>at ;

> >>I'm not a statistician. Perhaps others who are more literate in

Maybe I'm mistaken here,
but I think, the algorithm is not that complicated.
I try to explain it further:

Comments appreciated.

Definition var_samp = Sum of squared differences /n-1
Definition stddev_samp = sqrt(var_samp)

Example N=4

1.) Sum of squared differences
  1_4Sum(Xi-XM4)²
=
2.) adding nothing
  1_4Sum(Xi-XM4)²
 +0
 +0
 +0
=
3.) nothing changed
 1_4Sum(Xi-XM4)²
 +(-1_3Sum(Xi-XM3)²+1_3Sum(Xi-XM3)²)
 +(-1_2Sum(Xi-XM2)²+1_2Sum(Xi-XM3)²)
 +(-1_1Sum(Xi-XM2)²+1_1Sum(Xi-XM3)²)

=
4.) parts reordered
  (1_4Sum(Xi-XM4)²-1_3Sum(Xi-XM3)²)
 +(1_3Sum(Xi-XM3)²-1_2Sum(Xi-XM2)²)
 +(1_2Sum(Xi-XM2)²-1_1Sum(Xi-XM2)²)
 +1_1Sum(X1-XM1)²
=
5.)
  (X4-XM4)(X4-XM3)
+ (X3-XM3)(X3-XM2)
+ (X2-XM2)(X2-XM1)
+ (X1-XM1)²
=
6.) XM1=X1 => There it is - The iteration part of Welfords Algorithm (in
reverse order)
  (X4-XM4)(X4-XM3)
+ (X3-XM3)(X3-XM2)
+ (X2-XM2)(X2-X1)
+ 0

The missing piece is 4.) to 5.)
it's algebra, look at e.g.:
http://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/

> Thanks. Still not quite sure what to do, though :-) I guess in the
> end we want the answer to come up with similar results to the builtin
> stddev SQL function. I'll try to set up a test program, to see if we do.

If you want to go this way:
Maybe this is one of the very few times, you have to use a small sample
;-)

VlG-Arne


> cheers

> andrew


> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-21 Thread Arne Scheffer


David G Johnston schrieb am 2015-01-21:
> Andrew Dunstan wrote
> > On 01/20/2015 01:26 PM, Arne Scheffer wrote:

> >> And a very minor aspect:
> >> The term "standard deviation" in your code stands for
> >> (corrected) sample standard deviation, I think,
> >> because you devide by n-1 instead of n to keep the
> >> estimator unbiased.
> >> How about mentioning the prefix "sample"
> >> to indicate this beiing the estimator?


> > I don't understand. I'm following pretty exactly the calculations
> > stated
> > at <http://www.johndcook.com/blog/standard_deviation/>;


> > I'm not a statistician. Perhaps others who are more literate in
> > statistics can comment on this paragraph.

> I'm largely in the same boat as Andrew but...

> I take it that Arne is referring to:

> http://en.wikipedia.org/wiki/Bessel's_correction

Yes, it is.

> but the mere presence of an (n-1) divisor does not mean that is what
> is
> happening.  In this particular situation I believe the (n-1) simply
> is a
> necessary part of the recurrence formula and not any attempt to
> correct for
> sampling bias when estimating a population's variance.

That's wrong, it's applied in the end to the sum of squared differences
and therefore per definition the corrected sample standard deviation
estimator.

> In fact, as
> far as
> the database knows, the values provided to this function do represent
> an
> entire population and such a correction would be unnecessary.  I

That would probably be an exotic assumption in a working database
and it is not, what is computed here!

> guess it
> boils down to whether "future" queries are considered part of the
> population
> or whether the population changes upon each query being run and thus
> we are
> calculating the ever-changing population variance.

Yes, indeed correct.
And exactly to avoid that misunderstanding, I suggested to
use the "sample" term.
To speak in Postgresql terms; applied in Andrews/Welfords algorithm
is stddev_samp(le), not stddev_pop(ulation).
Therefore stddev in Postgres is only kept for historical reasons, look at
http://www.postgresql.org/docs/9.4/static/functions-aggregate.html
Table 9-43.

VlG-Arne

> Note point 3 in
> the
> linked Wikipedia article.





> David J.



> --
> View this message in context:
> http://postgresql.nabble.com/Add-min-and-max-execute-statement-time-in-pg-stat-statement-tp5774989p5834805.html
> Sent from the PostgreSQL - hackers mailing list archive at
> Nabble.com.


> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-20 Thread Arne Scheffer


Andrew Dunstan schrieb am 2015-01-20:

> On 01/20/2015 01:26 PM, Arne Scheffer wrote:
> >Interesting patch.
> >I did a quick review looking only into the patch file.

> >The "sum of variances" variable contains
> >the "sum of squared differences" instead, I think.

> Umm, no. It's not.

Umm, yes, i think, it is ;-)

>   e->counters.sum_var_time +=
>   (total_time - old_mean) * (total_time - e->counters.mean_time);
> This is not a square that's being added.

That's correct.
Nevertheless it's the difference between the computed sum of squared
differences
and the preceeding one, added in every step.

> old_mean is not the same as
> e->counters.mean_time.

> Since the variance is this value divided by (n - 1), AIUI, I think
> "sum
> of variances" isn't a bad description. I'm open to alternative
> suggestions.

> >And a very minor aspect:
> >The term "standard deviation" in your code stands for
> >(corrected) sample standard deviation, I think,
> >because you devide by n-1 instead of n to keep the
> >estimator unbiased.
> >How about mentioning the prefix "sample"
> >to indicate this beiing the estimator?

> I don't understand. I'm following pretty exactly the calculations
> stated
> at <http://www.johndcook.com/blog/standard_deviation/>

(There is nothing bad about that calculations, Welford's algorithm
 is simply sequently adding the differences mentioned above.)

VlG-Arne


> I'm not a statistician. Perhaps others who are more literate in
> statistics can comment on this paragraph.

> >And I'm sure I'm missing C specifics (again)
> >(or it's the reduced patch file scope),
> >but you introduce sqrtd, but sqrt is called?


> Good catch. Will fix.

> cheers

> andrew


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2015-01-20 Thread Arne Scheffer
Interesting patch.
I did a quick review looking only into the patch file.

The "sum of variances" variable contains
the "sum of squared differences" instead, I think.

And a very minor aspect:
The term "standard deviation" in your code stands for
(corrected) sample standard deviation, I think,
because you devide by n-1 instead of n to keep the
estimator unbiased.
How about mentioning the prefix "sample"
to indicate this beiing the estimator?

And I'm sure I'm missing C specifics (again)
(or it's the reduced patch file scope),
but you introduce sqrtd, but sqrt is called?

VlG

Arne












-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] explain sortorder

2015-01-14 Thread Arne Scheffer

Hi,

we will also remove the following is lc_collate hint in the next version, 
showing only mandatory info as suggested.

/* for those who use COLLATE although their default is already 
the wanted */
if (strcmp(collname, localeptr) == 0)
{
appendStringInfo(sortorderInformation, " (%s is 
LC_COLLATE)", collname);
}

Anybody insisting on that?

Arne

Note: I see, at the moment we use the wrong default for DESC. We'll fix that.

On Wed, 14 Jan 2015, Heikki Linnakangas wrote:


On 01/14/2015 05:26 PM, Timmer, Marius wrote:

Hello Heikki,

abbreviated version:
Sorry, the problem is only the unhandy patch text format, not different 
opinions how to proceed.


Long version:
The v7 patch file already addressed your suggestions,
but the file contained serveral (old) local commits,
the new ones at the end of the patch text/file.


Ah, missed that. I stopped reading when I saw the old stuff there :-).


v7.1 is attached and addresses this issue providing a clean patch file.


Ok, thanks, will take a look.


V8 will - as mentioned - add missing docs and regression tests,


Great!

- Heikki





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] explain sortorder

2014-12-26 Thread Arne Scheffer
>Heikki Linnakangas  writes:
>> I would suggest just adding the information to the Sort Key line. As
>> long as you don't print the modifiers when they are defaults (ASC and
>> NULLS LAST), we could print the information even in non-VERBOSE mode.

>+1.  I had assumed without looking that that was what it did already,
>else I'd have complained too.

>   regards, tom lane

We will change the patch according to Heikkis suggestions.

A nice Christmas & all the best in the New Year

Arne Scheffer

http://www.uni-muenster.de/ZIV/Mitarbeiter/ArneScheffer.shtml


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers