Well, the truth is that we had this discussion back in 2016 :(. What Hive
calls the standard deviation function STDDEV is an alias for STDDEV_POP.
This is incorrect and has not been rectified yet!


Spark SQL, Oracle and Sybase map STDDEV to STDDEV_SAMP, not STDDEV_POP.
Run a test on *Hive*:


SELECT
        SQRT((SUM(POWER(amount_sold,2)) - (COUNT(1) * POWER(AVG(amount_sold),2))) / (COUNT(1) - 1)) AS MYSTDDEV,
        STDDEV(amount_sold)      AS STDDEV,
        STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
        STDDEV_POP(amount_sold)  AS STDDEV_POP
FROM    sales;



It comes back with

+---------------------+----------------------+--------------------+----------------------+
|      mystddev       |        stddev        |    stddev_samp     |      stddev_pop      |
+---------------------+----------------------+--------------------+----------------------+
| 458.59339942758265  | *458.5931211084447*  | 458.5933504051772  | *458.5931211084447*  |
+---------------------+----------------------+--------------------+----------------------+
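
As a side note, the MYSTDDEV expression is just the computational form of
the sample standard deviation, so it should track STDDEV_SAMP (floating-point
rounding aside), not STDDEV_POP. A minimal sketch with made-up numbers (not
the sales data) to verify the identity:

import math

# Hypothetical stand-in for the amount_sold column, just to check the algebra
x = [10.0, 12.5, 9.0, 11.75, 13.2]
n = len(x)
mean = sum(x) / n

# The MYSTDDEV expression from the query: sqrt((sum(x^2) - n*mean^2) / (n - 1))
mystddev = math.sqrt((sum(v * v for v in x) - n * mean * mean) / (n - 1))

# Textbook sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))
stddev_samp = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))

print(mystddev, stddev_samp)  # identical up to floating-point rounding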

*spark-sql* comes back with:


spark-sql> SELECT
         >         SQRT((SUM(POWER(amount_sold,2)) - (COUNT(1) * POWER(AVG(amount_sold),2))) / (COUNT(1) - 1)) AS MYSTDDEV,
         >         STDDEV(amount_sold)      AS STDDEV,
         >         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
         >         STDDEV_POP(amount_sold)  AS STDDEV_POP
         > FROM    sales;


+---------------------+----------------------+----------------------+---------------------+
|      mystddev       |        stddev        |     stddev_samp      |     stddev_pop      |
+---------------------+----------------------+----------------------+---------------------+
| 458.59339942758265  | *458.5933504051778*  | *458.5933504051778*  | 458.59312110844525  |
+---------------------+----------------------+----------------------+---------------------+

I just wanted to see what numpy would come back with.
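
For reference, np.std defaults to ddof=0 (the population form), so out of
the box it should agree with Hive's STDDEV / STDDEV_POP, while ddof=1 matches
STDDEV_SAMP. A minimal sketch with made-up numbers (not the sales data):

import numpy as np

# Hypothetical small sample standing in for the amount_sold column
amounts = np.array([10.0, 12.5, 9.0, 11.75, 13.2])

print(np.std(amounts))          # default ddof=0: population form, like STDDEV_POP
print(np.std(amounts, ddof=1))  # ddof=1: sample form, like STDDEV_SAMP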

Thanks



On Wed, 23 Dec 2020 at 23:50, Sean Owen <sro...@gmail.com> wrote:

> Why do you want to use this function instead of the built-in stddev
> function?
>
> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi,
>>
>>
>> This is a shot in the dark so to speak.
>>
>>
>> I would like to use the standard deviation function std offered by numpy
>> in PySpark. I am using SQL for now.
>>
>>
>> The code is as below:
>>
>>
>>   sqltext = f"""
>>   SELECT
>>           rs.Customer_ID
>>         , rs.Number_of_orders
>>         , rs.Total_customer_amount
>>         , rs.Average_order
>>         , rs.Standard_deviation
>>   FROM
>>   (
>>         SELECT cust_id AS Customer_ID,
>>         COUNT(amount_sold) AS Number_of_orders,
>>         SUM(amount_sold) AS Total_customer_amount,
>>         AVG(amount_sold) AS Average_order,
>>         STDDEV(amount_sold) AS Standard_deviation
>>         FROM {DB}.{table}
>>         GROUP BY cust_id
>>         HAVING SUM(amount_sold) > 94000
>>         AND AVG(amount_sold) < STDDEV(amount_sold)
>>   ) rs
>>   ORDER BY 3 DESC
>>   """
>>   spark.sql(sqltext)
>>
>> Now if I wanted to use a UDF based on the numpy std function, I can do:
>>
>> import numpy as np
>> from pyspark.sql.functions import UserDefinedFunction
>> from pyspark.sql.types import DoubleType
>> udf = UserDefinedFunction(np.std, DoubleType())
>>
>> How can I use that UDF with Spark SQL? I gather this is only possible
>> through functional programming?
>>
>> Thanks,
>>
>> Mich
>>
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>
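
PS. Coming back to the question in the quoted thread: a plain row-wise UDF
wrapping np.std will not aggregate anything, since it receives a single
scalar per row. A grouped-aggregate pandas UDF is one route that should work
on Spark 3.x; the sketch below assumes a DataFrame df with cust_id and
amount_sold columns (all names are illustrative):

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Grouped-aggregate pandas UDF (Spark 3.x type-hint style): receives one
# group's values as a pandas Series and returns a single double per group.
@pandas_udf("double")
def np_stddev(v: pd.Series) -> float:
    return float(np.std(v, ddof=1))  # ddof=1 to mirror STDDEV_SAMP

# DataFrame API usage (df is an assumed DataFrame with these columns):
df.groupBy("cust_id").agg(np_stddev("amount_sold").alias("Standard_deviation"))

# Registering the UDF also makes it callable from Spark SQL:
spark.udf.register("np_stddev", np_stddev)
spark.sql("SELECT cust_id, np_stddev(amount_sold) FROM sales GROUP BY cust_id")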
