stddev is mapped to stdddev_samp. That is the general use case or rather
common use of standard deviation.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 July 2016 at 09:39, Sean Owen <so...@cloudera.com> wrote:

> I don't think that's relevant here. The question is why would samp
> give a different result to pop, not the result of "stddev". Neither
> one is a 'correct' definition of standard deviation in the abstract;
> one or the other is correct depending on what standard deviation you
> are trying to measure.
>
> On Thu, Jul 7, 2016 at 9:37 AM, Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
> > The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP.  That is
> the
> > correct one.
> >
> > You can actually work that one out yourself
> >
> >
> > BTW Hive also gives a wrong value. This is what I reported back in April
> > about Hive giving incorrect value
> >
> > Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
> > did tests with Spark 1.6 as well.  Spark correctly points STTDEV to
> > STDDEV_SAMP.
> >
> > The following query was used
> >
> > SELECT
> >
> >
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
> > AS MYSTDDEV,
> >         STDDEV(amount_sold) AS STDDEV,
> >         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
> >         STDDEV_POP(amount_sold) AS STDDEV_POP
> > from    sales;
> >
> > The following is from running the above query on Hive where STDDEV -->
> > STDDEV_POP which is incorrect
> >
> >
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > |      mystddev      |       stddev        |    stddev_samp     |
> > stddev_pop      |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
> > 260.72704617040444  |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> >
> > The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
> > correct
> >
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > |      mystddev      |       stddev        |    stddev_samp     |
> > stddev_pop      |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> > | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
> > 260.72704617042166  |
> >
> +--------------------+---------------------+--------------------+---------------------+--+
> >
> > HTH
> >
> >
> >
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
> >
> >
> > On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> No, because these are different values defined differently. If you
> >> have 1 data point, the sample stdev is undefined while population
> >> stdev is defined. Refer to their definition.
> >>
> >> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mungeol....@gmail.com>
> wrote:
> >> > Hello,
> >> >
> >> > As I mentioned at the title, stddev_samp function gives a NaN while
> >> > stddev_pop gives a numeric value on the same data.
> >> > The stddev_samp function will give a numeric value, if I cast it to
> >> > decimal.
> >> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
> >> > Is it a bug?
> >> >
> >> > Thanks
> >> >
> >> > - mungeol
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
> >
>

Reply via email to