Re: Jdbc Hook in Spark Batch Application

2020-12-24 Thread lec ssmi
Thanks.
But there is a problem: the classes referenced in the code would need to be
modified, and I want to avoid changing the existing code.

Gabor Somogyi wrote on Fri, 25 Dec 2020 at 00:16:

> One can wrap the JDBC driver, and in such a way everything can be sniffed.
>
> On Thu, 24 Dec 2020, 03:51 lec ssmi,  wrote:
>
>> Hi guys,
>>    I have some Spark programs that perform database connection
>> operations. I want to acquire the connection information, such as the JDBC
>> connection properties, without being too intrusive to the code.
>>   Any good ideas? Could a Java agent do it?
>>
>>
>



Re: Jdbc Hook in Spark Batch Application

2020-12-24 Thread Gabor Somogyi
One can wrap the JDBC driver, and in such a way everything can be sniffed.

On Thu, 24 Dec 2020, 03:51 lec ssmi,  wrote:

> Hi guys,
>    I have some Spark programs that perform database connection
> operations. I want to acquire the connection information, such as the JDBC
> connection properties, without being too intrusive to the code.
>   Any good ideas? Could a Java agent do it?
>
>
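A minimal sketch of that wrapping idea (the class and logger names are
hypothetical; it delegates every java.sql.Driver method and only logs in
connect(), where the URL and properties become visible):

import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Properties;
import java.util.logging.Logger;

// Hypothetical wrapper: delegates every call to the real driver and only
// intercepts connect(), where the JDBC URL and properties can be captured.
public class SniffingDriver implements Driver {

    private static final Logger LOG = Logger.getLogger("jdbc.sniffer");

    private final Driver delegate;   // the real vendor driver being wrapped

    public SniffingDriver(Driver delegate) {
        this.delegate = delegate;
    }

    @Override
    public Connection connect(String url, Properties info) throws SQLException {
        // Connection information is captured here, without touching the
        // application code that asked for the connection.
        LOG.info("JDBC connect: url=" + url + ", properties=" + info);
        return delegate.connect(url, info);
    }

    @Override
    public boolean acceptsURL(String url) throws SQLException {
        return delegate.acceptsURL(url);
    }

    @Override
    public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) throws SQLException {
        return delegate.getPropertyInfo(url, info);
    }

    @Override
    public int getMajorVersion() { return delegate.getMajorVersion(); }

    @Override
    public int getMinorVersion() { return delegate.getMinorVersion(); }

    @Override
    public boolean jdbcCompliant() { return delegate.jdbcCompliant(); }

    @Override
    public Logger getParentLogger() throws SQLFeatureNotSupportedException {
        return delegate.getParentLogger();
    }
}

Registering an instance up front, e.g. DriverManager.registerDriver(new
SniffingDriver(new org.postgresql.Driver())), or loading the wrapper through a
Java agent, keeps the existing application code unchanged; the exact wiring
depends on how the real driver is selected in the job.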


Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
Why not just use STDDEV_SAMP? It's probably more accurate than the
differences-of-squares calculation.
You can write an aggregate UDF that calls numpy and register it for SQL,
but it is already a built-in.
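
For reference, a minimal sketch of that register-for-SQL route (assumes Spark
3.x with PyArrow available; the sales table and cust_id/amount_sold columns
follow the thread's example, and the numpy_std name is made up). A plain
row-wise UDF wrapping np.std would only ever see one value per call, so the
aggregate case needs a grouped-aggregate pandas UDF:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Grouped-aggregate pandas UDF: each call receives all amount_sold values of
# one group as a pandas Series and returns a single double.
@pandas_udf("double")
def numpy_std(v: pd.Series) -> float:
    # ddof=1 matches STDDEV_SAMP; numpy's default (ddof=0) matches STDDEV_POP.
    return float(np.std(v, ddof=1))

# Register it under a SQL name so it can sit next to the built-ins.
spark.udf.register("numpy_std", numpy_std)

spark.sql("""
    SELECT cust_id,
           STDDEV_SAMP(amount_sold) AS stddev_samp,
           numpy_std(amount_sold)   AS numpy_stddev
    FROM sales
    GROUP BY cust_id
""").show()

For the standard deviation itself, STDDEV_SAMP gives the same number without
the Python round trip.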

On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh 
wrote:

> Thanks for the feedback.
>
> I have a question here. I want to use numpy STD as well but just using sql
> in pyspark. Like below
>
>   sqltext = f"""
>   SELECT
>   rs.Customer_ID
> , rs.Number_of_orders
> , rs.Total_customer_amount
> , rs.Average_order
> , rs.Standard_deviation
> , rs.mystddev
>   FROM
>   (
>SELECT cust_id AS Customer_ID
> ,  COUNT(amount_sold) AS Number_of_orders
> ,  SUM(amount_sold) AS Total_customer_amount
> ,  AVG(amount_sold) AS Average_order
> ,  STDDEV(amount_sold) AS Standard_deviation
> ,
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
> AS mystddev
>FROM {DB}.{table}
>GROUP BY cust_id
>HAVING SUM(amount_sold) > 94000
>AND AVG(amount_sold) < STDDEV(amount_sold)
>   ) rs
>   ORDER BY
>   3 DESC
>   """
>   spark.sql(sqltext)
>
> The mystddev expression is how I work out the standard deviation myself. I
> want to add numpy's STD to this SQL as well, but I don't think it is possible
> here. Can a UDF help? Or is it only possible with FP?
>
> thanks
>
>>
>



Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Mich Talebzadeh
Thanks for the feedback.

I have a question here. I want to use numpy STD as well but just using sql
in pyspark. Like below

  sqltext = f"""
  SELECT
  rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
, rs.mystddev
  FROM
  (
   SELECT cust_id AS Customer_ID
,  COUNT(amount_sold) AS Number_of_orders
,  SUM(amount_sold) AS Total_customer_amount
,  AVG(amount_sold) AS Average_order
,  STDDEV(amount_sold) AS Standard_deviation
,
SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
AS mystddev
   FROM {DB}.{table}
   GROUP BY cust_id
   HAVING SUM(amount_sold) > 94000
   AND AVG(amount_sold) < STDDEV(amount_sold)
  ) rs
  ORDER BY
  3 DESC
  """
  spark.sql(sqltext)

The mystddev expression is how I work out the standard deviation myself. I
want to add numpy's STD to this SQL as well, but I don't think it is possible
here. Can a UDF help? Or is it only possible with FP?

thanks


On Thu, 24 Dec 2020 at 13:47, Sean Owen  wrote:

> I don't know which one is 'correct' (it's not standard SQL?) or whether
> it's the sample stdev for a good reason or just historical now. But you can
> always call STDDEV_SAMP (in any DB) if needed. It's equivalent to numpy.std
> with ddof=1, the Bessel-corrected standard deviation.
>
> On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh 
> wrote:
>
>>
>> Well, the truth is that we had this discussion in 2016 :(. What Hive
>> calls the Standard Deviation function, STDDEV, is a pointer to STDDEV_POP. This
>> is incorrect and has not been rectified yet!
>>
>>
>> Spark-sql, Oracle and Sybase point STDDEV to STDDEV_SAMP and not
>> STDDEV_POP.  Run a test on *Hive*
>>
>>
>> SELECT
>>
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>> STDDEV(amount_sold) AS STDDEV,
>> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>> STDDEV_POP(amount_sold) AS STDDEV_POP
>> from sales;
>>
>>
>>
>> It comes back with
>>
>>
>> +---------------------+---------------------+---------------------+---------------------+
>> |      mystddev       |       stddev        |     stddev_samp     |     stddev_pop      |
>> +---------------------+---------------------+---------------------+---------------------+
>> | 458.59339942758265  | *458.5931211084447* | 458.5933504051772   | *458.5931211084447* |
>> +---------------------+---------------------+---------------------+---------------------+
>>
>> *spark-sql* comes back with
>>
>>
>> spark-sql> SELECT
>>          > SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
>>          > STDDEV(amount_sold) AS STDDEV,
>>          > STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>>          > STDDEV_POP(amount_sold) AS STDDEV_POP
>>          > from sales;
>>
>>
>>
>> +---------------------+---------------------+---------------------+---------------------+
>> |      mystddev       |       stddev        |     stddev_samp     |     stddev_pop      |
>> +---------------------+---------------------+---------------------+---------------------+
>> | 458.59339942758265  | *458.5933504051778* | *458.5933504051778* | 458.59312110844525  |
>> +---------------------+---------------------+---------------------+---------------------+
>>
>> Just wanted to see what numpy would come back with
>>
>> Thanks
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Dec 2020 at 23:50, Sean Owen  wrote:
>>
>>> Why do you want to use this function instead of the built-in stddev
>>> function?
>>>
>>> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,


 This is a shot in the dark so to speak.


 I would like to use the standard deviation std offered by numpy in
 PySpark. I am using SQL for now


 The code as below


   sqltext = f"""

   SELECT

   rs.Customer_ID

 , rs.Number_of_orders

 , rs.Total_customer_amount

 , rs.Average_order

 , rs.Standard_deviation

   FROM

   (

 SELECT cust_id AS Customer_ID,

 COUNT(amount_sold) AS Number_of_orders,

 SUM(amount_sold) AS Total_customer_amount,

 AVG(amount_sold) AS Average_order,

   *  STDDEV(amount_sold) AS Standard_deviation*

 FROM {DB}.{table}

 GROUP BY cust_id

 HAVING SUM(amount_sold) > 94000

 AND 

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
I don't know which one is 'correct' (it's not standard SQL?) or whether
it's the sample stdev for a good reason or just historical now. But you can
always call STDDEV_SAMP (in any DB) if needed. It's equivalent to numpy.std
with ddof=1, the Bessel-corrected standard deviation.
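
A quick way to see the ddof point in numpy itself (made-up numbers, shown only
to illustrate the two definitions):

import numpy as np

amounts = np.array([10.0, 12.5, 9.75, 11.0])   # illustrative sample only

print(np.std(amounts))           # ddof=0: population std dev, i.e. STDDEV_POP
print(np.std(amounts, ddof=1))   # ddof=1: Bessel-corrected sample std dev, i.e. STDDEV_SAMP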

On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh 
wrote:

>
> Well, the truth is that we had this discussion in 2016 :(. What Hive calls
> the Standard Deviation function, STDDEV, is a pointer to STDDEV_POP. This is
> incorrect and has not been rectified yet!
>
>
> Spark-sql, Oracle and Sybase point STDDEV to STDDEV_SAMP and not
> STDDEV_POP.  Run a test on *Hive*
>
>
> SELECT
>
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
> AS MYSTDDEV,
> STDDEV(amount_sold) AS STDDEV,
> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
> STDDEV_POP(amount_sold) AS STDDEV_POP
> from sales;
>
>
>
> It comes back with
>
>
> +---------------------+---------------------+---------------------+---------------------+
> |      mystddev       |       stddev        |     stddev_samp     |     stddev_pop      |
> +---------------------+---------------------+---------------------+---------------------+
> | 458.59339942758265  | *458.5931211084447* | 458.5933504051772   | *458.5931211084447* |
> +---------------------+---------------------+---------------------+---------------------+
>
> *spark-sql* comes back with
>
>
> spark-sql> SELECT
>          > SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
>          > STDDEV(amount_sold) AS STDDEV,
>          > STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>          > STDDEV_POP(amount_sold) AS STDDEV_POP
>          > from sales;
>
>
>
> +---------------------+---------------------+---------------------+---------------------+
> |      mystddev       |       stddev        |     stddev_samp     |     stddev_pop      |
> +---------------------+---------------------+---------------------+---------------------+
> | 458.59339942758265  | *458.5933504051778* | *458.5933504051778* | 458.59312110844525  |
> +---------------------+---------------------+---------------------+---------------------+
>
> Just wanted to see what numpy would come back with
>
> Thanks
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 23 Dec 2020 at 23:50, Sean Owen  wrote:
>
>> Why do you want to use this function instead of the built-in stddev
>> function?
>>
>> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> This is a shot in the dark so to speak.
>>>
>>>
>>> I would like to use the standard deviation std offered by numpy in
>>> PySpark. I am using SQL for now
>>>
>>>
>>> The code as below
>>>
>>>
>>>   sqltext = f"""
>>>
>>>   SELECT
>>>
>>>   rs.Customer_ID
>>>
>>> , rs.Number_of_orders
>>>
>>> , rs.Total_customer_amount
>>>
>>> , rs.Average_order
>>>
>>> , rs.Standard_deviation
>>>
>>>   FROM
>>>
>>>   (
>>>
>>> SELECT cust_id AS Customer_ID,
>>>
>>> COUNT(amount_sold) AS Number_of_orders,
>>>
>>> SUM(amount_sold) AS Total_customer_amount,
>>>
>>> AVG(amount_sold) AS Average_order,
>>>
>>>   *  STDDEV(amount_sold) AS Standard_deviation*
>>>
>>> FROM {DB}.{table}
>>>
>>> GROUP BY cust_id
>>>
>>> HAVING SUM(amount_sold) > 94000
>>>
>>> AND AVG(amount_sold) < STDDEV(amount_sold)
>>>
>>>   ) rs
>>>
>>>   ORDER BY
>>>
>>>   3 DESC
>>>
>>>   """
>>>
>>>   spark.sql(sqltext)
>>>
>>> Now if I wanted to use a UDF based on the numpy STD function, I could do
>>>
>>> import numpy as np
>>> from pyspark.sql.functions import UserDefinedFunction
>>> from pyspark.sql.types import DoubleType
>>> udf = UserDefinedFunction(np.std, DoubleType())
>>>
>>> How can I use that udf with spark SQL? I gather this is only possible
>>> through functional programming?
>>>
>>> Thanks,
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>


Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Mich Talebzadeh
Well, the truth is that we had this discussion in 2016 :(. What Hive calls
the Standard Deviation function, STDDEV, is a pointer to STDDEV_POP. This is
incorrect and has not been rectified yet!


Spark-sql, Oracle and Sybase point STDDEV to STDDEV_SAMP and not
STDDEV_POP.  Run a test on *Hive*


SELECT

SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
AS MYSTDDEV,
STDDEV(amount_sold) AS STDDEV,
STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
STDDEV_POP(amount_sold) AS STDDEV_POP
from sales;



It comes back with

+---------------------+---------------------+---------------------+---------------------+
|      mystddev       |       stddev        |     stddev_samp     |     stddev_pop      |
+---------------------+---------------------+---------------------+---------------------+
| 458.59339942758265  | *458.5931211084447* | 458.5933504051772   | *458.5931211084447* |
+---------------------+---------------------+---------------------+---------------------+

*spark-sql* comes back with


spark-sql> SELECT
         > SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
         > STDDEV(amount_sold) AS STDDEV,
         > STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
         > STDDEV_POP(amount_sold) AS STDDEV_POP
         > from sales;


+---------------------+---------------------+---------------------+---------------------+
|      mystddev       |       stddev        |     stddev_samp     |     stddev_pop      |
+---------------------+---------------------+---------------------+---------------------+
| 458.59339942758265  | *458.5933504051778* | *458.5933504051778* | 458.59312110844525  |
+---------------------+---------------------+---------------------+---------------------+
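
In other words (a restatement of what the two result sets show), mystddev is the
differences-of-squares form of the sample standard deviation:

    s = \sqrt{\frac{\sum_i x_i^2 - n\,\bar{x}^2}{n-1}}
      = \sqrt{\frac{1}{n-1}\sum_i (x_i - \bar{x})^2},
    \qquad
    \sigma_{pop} = \sqrt{\frac{1}{n}\sum_i (x_i - \bar{x})^2}.

So mystddev should match STDDEV_SAMP (numpy's std with ddof=1) up to the
floating-point error introduced by the subtraction, while STDDEV_POP (numpy's
default, ddof=0) is the slightly smaller value; the tables show Hive's STDDEV
following the population form and spark-sql's STDDEV following the sample form.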

Just wanted to see what numpy would come back with

Thanks


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 23 Dec 2020 at 23:50, Sean Owen  wrote:

> Why do you want to use this function instead of the built-in stddev
> function?
>
> On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>>
>> This is a shot in the dark so to speak.
>>
>>
>> I would like to use the standard deviation std offered by numpy in
>> PySpark. I am using SQL for now
>>
>>
>> The code as below
>>
>>
>>   sqltext = f"""
>>
>>   SELECT
>>
>>   rs.Customer_ID
>>
>> , rs.Number_of_orders
>>
>> , rs.Total_customer_amount
>>
>> , rs.Average_order
>>
>> , rs.Standard_deviation
>>
>>   FROM
>>
>>   (
>>
>> SELECT cust_id AS Customer_ID,
>>
>> COUNT(amount_sold) AS Number_of_orders,
>>
>> SUM(amount_sold) AS Total_customer_amount,
>>
>> AVG(amount_sold) AS Average_order,
>>
>>   *  STDDEV(amount_sold) AS Standard_deviation*
>>
>> FROM {DB}.{table}
>>
>> GROUP BY cust_id
>>
>> HAVING SUM(amount_sold) > 94000
>>
>> AND AVG(amount_sold) < STDDEV(amount_sold)
>>
>>   ) rs
>>
>>   ORDER BY
>>
>>   3 DESC
>>
>>   """
>>
>>   spark.sql(sqltext)
>>
>> Now if I wanted to use a UDF based on the numpy STD function, I could do
>>
>> import numpy as np
>> from pyspark.sql.functions import UserDefinedFunction
>> from pyspark.sql.types import DoubleType
>> udf = UserDefinedFunction(np.std, DoubleType())
>>
>> How can I use that udf with spark SQL? I gather this is only possible
>> through functional programming?
>>
>> Thanks,
>>
>> Mich
>>
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>