Re: Using UDF based on Numpy functions in Spark SQL

2020-12-26 Thread Mich Talebzadeh
Well, I gave up on using anything except the standard functions offered by PySpark itself. The problem is that anything homemade (a UDF) is never going to be as performant as the functions offered by Spark itself. What I don't understand is why a numpy-provided STDDEV should be more performant than ...
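For reference, the built-in route described here might look like this in the DataFrame API; a minimal sketch with made-up data, using column names from the query quoted later in the thread:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("c1", 10.0), ("c1", 12.0), ("c1", 9.0), ("c2", 5.0), ("c2", 7.0)],
    ["Customer_ID", "amount"],
)

# stddev is an alias for stddev_samp in Spark, i.e. the Bessel-corrected
# sample standard deviation discussed further down the thread.
df.groupBy("Customer_ID").agg(
    F.count("amount").alias("Number_of_orders"),
    F.sum("amount").alias("Total_customer_amount"),
    F.avg("amount").alias("Average_order"),
    F.stddev("amount").alias("Standard_deviation"),
).show()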

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
Why not just use STDDEV_SAMP? It's probably more accurate than the differences-of-squares calculation. You can write an aggregate UDF that calls numpy and register it for SQL, but it is already a built-in. On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh wrote: > Thanks for the feedback. > > I h...
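For anyone following along, the aggregate-UDF route Sean describes could look roughly like this. A sketch only, assuming Spark 3.x, where a pandas Series-to-scalar UDF acts as a grouped aggregate and can be registered for SQL use; all names are illustrative:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# A Series-to-scalar pandas UDF is treated as a grouped aggregate.
# ddof=1 makes it match the built-in STDDEV_SAMP.
@pandas_udf("double")
def numpy_stddev(v: pd.Series) -> float:
    return float(np.std(v, ddof=1))

# Register it so it can be called from Spark SQL.
spark.udf.register("numpy_stddev", numpy_stddev)

df = spark.createDataFrame(
    [("c1", 10.0), ("c1", 12.0), ("c2", 5.0), ("c2", 7.0)],
    ["Customer_ID", "amount"],
)
df.createOrReplaceTempView("orders")

spark.sql("""
    SELECT Customer_ID
    ,      numpy_stddev(amount) AS np_std
    ,      STDDEV_SAMP(amount)  AS builtin_std
    FROM   orders
    GROUP BY Customer_ID
""").show()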

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Mich Talebzadeh
Thanks for the feedback. I have a question here. I want to use numpy STD as well, but just using SQL in PySpark, like below:

sqltext = f"""
SELECT rs.Customer_ID
,      rs.Number_of_orders
,      rs.Total_customer_amount
,      rs.Average_order
,      rs.Standard_deviation
...
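A hypothetical completion of that query, assuming the numpy-backed aggregate has been registered as numpy_stddev and the source table is called sales (both names are illustrative, not from the thread):

# Sketch only: numpy_stddev must already be registered, and "sales"
# stands in for whatever table the real query reads from.
sqltext = f"""
SELECT rs.Customer_ID
,      rs.Number_of_orders
,      rs.Total_customer_amount
,      rs.Average_order
,      rs.Standard_deviation
FROM
(
    SELECT Customer_ID
    ,      COUNT(amount)        AS Number_of_orders
    ,      SUM(amount)          AS Total_customer_amount
    ,      AVG(amount)          AS Average_order
    ,      numpy_stddev(amount) AS Standard_deviation
    FROM   sales
    GROUP BY Customer_ID
) rs
"""
spark.sql(sqltext).show()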

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
I don't know which one is 'correct' (it's not standard SQL?) or whether it's the sample stdev for a good reason or just historical now. But you can always call STDDEV_SAMP (in any DB) if needed. It's equivalent to numpy.std with ddof=1, the Bessel-corrected standard deviation. On Thu, Dec 24, 2020 ...
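A quick numpy check of that equivalence, with values chosen so the arithmetic is easy to verify by hand:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# numpy's default is the population standard deviation (ddof=0):
# sum of squared deviations divided by n. This matches STDDEV_POP.
print(np.std(x))          # 2.0

# With ddof=1 the divisor is n - 1 (Bessel's correction), matching
# STDDEV_SAMP.
print(np.std(x, ddof=1))  # ~2.138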

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Mich Talebzadeh
Well, the truth is that we had this discussion in 2016 :(. What Hive calls the Standard Deviation Function, STDDEV, is a pointer to STDDEV_POP. This is incorrect and has not been rectified yet! Spark SQL, Oracle and Sybase point STDDEV to STDDEV_SAMP, not STDDEV_POP. Run a test on *Hive*: SELECT S...
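A minimal sketch of such a test (illustrative table and column names): run the same statement in both engines and compare what the bare STDDEV alias returns against the two explicit functions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT STDDEV(amount)      AS stddev_alias
    ,      STDDEV_SAMP(amount) AS stddev_samp
    ,      STDDEV_POP(amount)  AS stddev_pop
    FROM   sales
""").show()
# In Spark SQL (as in Oracle and Sybase), stddev_alias equals stddev_samp;
# per Mich's test, the same statement in Hive gives stddev_alias == stddev_pop.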

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Sean Owen
Why do you want to use this function instead of the built-in stddev function? On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh wrote: > Hi, > This is a shot in the dark, so to speak. > I would like to use the standard deviation std offered by numpy in PySpark. I am using SQL for now ...

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Mich Talebzadeh
OK, thanks for the tip. I found this link useful for Python from Databricks: User-defined functions - Python — Databricks Documentation.

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Peyman Mohajerian
https://stackoverflow.com/questions/43484269/how-to-register-udf-to-use-in-sql-and-dataframe On Wed, Dec 23, 2020 at 12:52 PM Mich Talebzadeh wrote: > Hi, > This is a shot in the dark, so to speak. > I would like to use the standard deviation std offered by numpy in PySpark. I am using ...
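The pattern behind that link, in brief: a single spark.udf.register call makes a Python function callable from both SQL and the DataFrame API. The function and names below are illustrative, not from the thread:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

def double_it(x):
    return float(x) * 2.0

# register returns a UDF object usable in the DataFrame API, while the
# name "double_it" becomes callable from SQL.
double_it_udf = spark.udf.register("double_it", double_it, DoubleType())

spark.sql("SELECT double_it(3) AS y").show()          # SQL usage

df = spark.range(3).withColumnRenamed("id", "x")
df.select(double_it_udf(df["x"]).alias("y")).show()   # DataFrame usage

Note that a scalar UDF like this works row by row; an aggregation such as a standard deviation over a group needs the grouped-aggregate pandas UDF sketched earlier in this digest.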

Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Mich Talebzadeh
Hi, This is a shot in the dark, so to speak. I would like to use the standard deviation std offered by numpy in PySpark. I am using SQL for now. The code is as below:

sqltext = f"""
SELECT rs.Customer_ID
,      rs.Number_of_orders
,      rs.Total_customer_amount
,      ...
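For reference, the "differences-of-squares calculation" Sean contrasts with STDDEV_SAMP earlier in this digest is the classic hand-rolled identity s = sqrt((sum(x^2) - (sum(x))^2 / n) / (n - 1)). A sketch with illustrative names, shown only to make that comparison concrete:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "sales" and "amount" are placeholders, not names from the thread.
spark.sql("""
    SELECT SQRT( (SUM(amount * amount)
                  - SUM(amount) * SUM(amount) / COUNT(amount))
                 / (COUNT(amount) - 1) )  AS handrolled_std
    ,      STDDEV_SAMP(amount)            AS builtin_std
    FROM   sales
""").show()

The two agree on well-conditioned data, but the hand-rolled form can lose precision through cancellation when the mean is large relative to the spread, which is presumably Sean's accuracy point.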