I don't think that's relevant here. The question is why would samp give a different result to pop, not the result of "stddev". Neither one is a 'correct' definition of standard deviation in the abstract; one or the other is correct depending on what standard deviation you are trying to measure.
On Thu, Jul 7, 2016 at 9:37 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP. That is the > correct one. > > You can actually work that one out yourself > > > BTW Hive also gives a wrong value. This is what I reported back in April > about Hive giving incorrect value > > Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I > did tests with Spark 1.6 as well. Spark correctly points STTDEV to > STDDEV_SAMP. > > The following query was used > > SELECT > > SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) > AS MYSTDDEV, > STDDEV(amount_sold) AS STDDEV, > STDDEV_SAMP(amount_sold) AS STDDEV_SAMP, > STDDEV_POP(amount_sold) AS STDDEV_POP > from sales; > > The following is from running the above query on Hive where STDDEV --> > STDDEV_POP which is incorrect > > > +--------------------+---------------------+--------------------+---------------------+--+ > | mystddev | stddev | stddev_samp | > stddev_pop | > +--------------------+---------------------+--------------------+---------------------+--+ > | 260.7270919450411 | 260.72704617040444 | 260.7270722861465 | > 260.72704617040444 | > +--------------------+---------------------+--------------------+---------------------+--+ > > The following is from Spark-sql where STDDEV --> STDDEV_SAMP which is > correct > > +--------------------+---------------------+--------------------+---------------------+--+ > | mystddev | stddev | stddev_samp | > stddev_pop | > +--------------------+---------------------+--------------------+---------------------+--+ > | 260.7270919450411 | 260.7270722861637 | 260.7270722861637 | > 260.72704617042166 | > +--------------------+---------------------+--------------------+---------------------+--+ > > HTH > > > > > > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > > http://talebzadehmich.wordpress.com > > > Disclaimer: Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > > > > On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote: >> >> No, because these are different values defined differently. If you >> have 1 data point, the sample stdev is undefined while population >> stdev is defined. Refer to their definition. >> >> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mungeol....@gmail.com> wrote: >> > Hello, >> > >> > As I mentioned at the title, stddev_samp function gives a NaN while >> > stddev_pop gives a numeric value on the same data. >> > The stddev_samp function will give a numeric value, if I cast it to >> > decimal. >> > E.g. cast(stddev_samp(column_name) as decimal(16,3)) >> > Is it a bug? >> > >> > Thanks >> > >> > - mungeol >> > >> > --------------------------------------------------------------------- >> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> > >> >> --------------------------------------------------------------------- >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> > --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org