Re: Discriptency sample standard deviation pyspark and Excel

Mich Talebzadeh Wed, 20 Sep 2023 06:56:24 -0700

Spark uses the sample standard deviation stddev_samp by default, whereas
*Hive* uses population standard deviation stddev_pop as default.


My understanding is that Spark uses the sample standard deviation by default
because

   - It follows the SQL convention, where stddev is an alias for stddev_samp.
   - Data in a DataFrame is usually treated as a sample from a larger
   population, and dividing by N-1 (Bessel's correction) gives an unbiased
   estimate of the population variance.

Neither form is cheaper to calculate, and neither is more robust to outliers
(data points that differ significantly from other observations in a dataset,
caused for example by measurement errors or edge events). The two differ only
in the divisor: the sample standard deviation divides by N-1 instead of N, so
it equals the population standard deviation multiplied by sqrt(N/(N-1)). It
is therefore always slightly larger, a single outlier inflates both by the
same proportion, and for large N the difference becomes negligible.
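
To check how the two divisors behave on the same data, here is a minimal
sketch in plain Python (no Spark needed). The helper names sample_std and
population_std and the data values are made up for illustration; they are
meant to mirror what stddev_samp and stddev_pop compute, not to reproduce
Spark's implementation.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation: divide by n - 1 (Bessel's correction)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def population_std(values):
    """Population standard deviation: divide by n."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


# Illustrative data only; the second list adds one extreme value.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
with_outlier = data + [100.0]

for label, values in (("original", data), ("with outlier", with_outlier)):
    n = len(values)
    s = sample_std(values)
    p = population_std(values)
    # The ratio s / p depends only on n, not on the values themselves.
    print(label, "sample:", s, "population:", p,
          "ratio:", s / p, "sqrt(n/(n-1)):", math.sqrt(n / (n - 1)))

# Cross-check against the standard library's implementations.
assert math.isclose(sample_std(data), statistics.stdev(data))
assert math.isclose(population_std(data), statistics.pstdev(data))
```

When comparing against Excel, STDEV.S is the N-1 form and STDEV.P is the N
form, so a mismatch with stddev_samp is worth checking against which of the
two was used in the spreadsheet.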

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 19 Sept 2023 at 21:50, Sean Owen <[email protected]> wrote:

> Pyspark follows SQL databases here. stddev is stddev_samp, and sample
> standard deviation is the calculation with the Bessel correction, n-1 in
> the denominator. stddev_pop is simply standard deviation, with n in the
> denominator.
>
> On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe <[email protected]>
> wrote:
>
>> Hi!
>>
>>
>>
>> I am applying the stddev function (so actually stddev_samp), but when
>> comparing with the sample standard deviation in Excel the results do not
>> match.
>>
>> I cannot find in your documentation any more specifics on how the sample
>> standard deviation is calculated, so I cannot compare the difference toward
>> excel, which uses
>>
>> .
>>
>> I am trying to avoid using Excel at all costs, but if the stddev_samp
>> function is not calculating the standard deviation correctly I have a
>> problem.
>>
>> I hope you can help me resolve this issue.
>>
>>
>>
>> Kindest regards,
>>
>>
>>
>> *Helene Bøe*
>> *Graduate Project Engineer*
>> Recycling Process & Support
>>
>> M: +47 980 00 887
>> [email protected]
>>
>> Norsk Hydro ASA
>> Drammensveien 264
>> NO-0283 Oslo, Norway
>> www.hydro.com
>>
>>
>>
>
