I don't think that's relevant here. The question is why would samp
give a different result to pop, not the result of "stddev". Neither
one is a 'correct' definition of standard deviation in the abstract;
one or the other is correct depending on what standard deviation you
are trying to measure.

On Thu, Jul 7, 2016 at 9:37 AM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
> The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP.  That is the
> correct one.
>
> You can actually work that one out yourself
>
>
> BTW Hive also gives a wrong value. This is what I reported back in April
> about Hive giving incorrect value
>
> Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
> did tests with Spark 1.6 as well.  Spark correctly points STTDEV to
> STDDEV_SAMP.
>
> The following query was used
>
> SELECT
>
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
> AS MYSTDDEV,
>         STDDEV(amount_sold) AS STDDEV,
>         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>         STDDEV_POP(amount_sold) AS STDDEV_POP
> from    sales;
>
> The following is from running the above query on Hive where STDDEV -->
> STDDEV_POP which is incorrect
>
>
> +--------------------+---------------------+--------------------+---------------------+--+
> |      mystddev      |       stddev        |    stddev_samp     |
> stddev_pop      |
> +--------------------+---------------------+--------------------+---------------------+--+
> | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
> 260.72704617040444  |
> +--------------------+---------------------+--------------------+---------------------+--+
>
> The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
> correct
>
> +--------------------+---------------------+--------------------+---------------------+--+
> |      mystddev      |       stddev        |    stddev_samp     |
> stddev_pop      |
> +--------------------+---------------------+--------------------+---------------------+--+
> | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
> 260.72704617042166  |
> +--------------------+---------------------+--------------------+---------------------+--+
>
> HTH
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote:
>>
>> No, because these are different values defined differently. If you
>> have 1 data point, the sample stdev is undefined while population
>> stdev is defined. Refer to their definition.
>>
>> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mungeol....@gmail.com> wrote:
>> > Hello,
>> >
>> > As I mentioned at the title, stddev_samp function gives a NaN while
>> > stddev_pop gives a numeric value on the same data.
>> > The stddev_samp function will give a numeric value, if I cast it to
>> > decimal.
>> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
>> > Is it a bug?
>> >
>> > Thanks
>> >
>> > - mungeol
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reply via email to