Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the PySpark DataFrame

2021-04-09 Thread ayan guha
Hi - interesting stuff. My stance has always been: use Spark native
functions, then pandas, then native Python - in that order.

To the OP - did you try the code? What kind of performance are you seeing?
Just curious: why do you think UDFs are bad?

On Sat, 10 Apr 2021 at 2:36 am, Sean Owen wrote:

> Actually, good question - I'm not sure. I don't think Spark would
> vectorize these operations across rows.
> In a pandas UDF, by contrast, you can apply operations like sin to
> thousands of values at once in native code via numpy. It's trivially
> vectorizable, and I've seen good wins over, at least, a single-row UDF.

--
Best Regards,
Ayan Guha


Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the PySpark DataFrame

2021-04-09 Thread Sean Owen
Actually, good question - I'm not sure. I don't think Spark would
vectorize these operations across rows.
In a pandas UDF, by contrast, you can apply operations like sin to
thousands of values at once in native code via numpy. It's trivially
vectorizable, and I've seen good wins over, at least, a single-row UDF.
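
A minimal sketch of such a pandas UDF (assuming Spark 3.x with pyarrow
installed; the DataFrame and its column names are illustrative):

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def haversine_km(lon1: pd.Series, lat1: pd.Series,
                 lon2: pd.Series, lat2: pd.Series) -> pd.Series:
    # numpy evaluates each trig function over the whole batch at once
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# illustrative usage:
# df = df.withColumn("distance_km",
#                    haversine_km("Longitude1", "Latitude1",
#                                 "Longitude2", "Latitude2"))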

On Fri, Apr 9, 2021 at 9:14 AM ayan guha wrote:

> Hi Sean - absolutely open to suggestions.
>
> My impression was that using Spark native functions should provide
> performance similar to Scala ones, because the serialization penalty
> should not be there, unlike with native Python UDFs.
>
> Is that understanding wrong?
>
> --
> Best Regards,
> Ayan Guha


Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the PySpark DataFrame

2021-04-09 Thread ayan guha
Hi Sean - absolutely open to suggestions.

My impression was that using Spark native functions should provide
performance similar to Scala ones, because the serialization penalty
should not be there, unlike with native Python UDFs.

Is that understanding wrong?
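
One way to sanity-check this is to compare query plans: a native-function
expression compiles entirely into the JVM plan, whereas a plain Python UDF
shows up as a separate BatchEvalPython stage. A sketch (the sample data and
column names are illustrative):

import math
from pyspark.sql import SparkSession
from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(77.59, 12.97, 72.88, 19.07)],
    ["Longitude1", "Latitude1", "Longitude2", "Latitude2"])

# Native column functions: the whole expression is evaluated in the JVM.
native = acos(
    sin(toRadians("Latitude1")) * sin(toRadians("Latitude2")) +
    cos(toRadians("Latitude1")) * cos(toRadians("Latitude2")) *
    cos(toRadians("Longitude1") - toRadians("Longitude2"))
) * lit(6371.0)

# Plain Python UDF: every row is serialized to a Python worker and back.
@udf(DoubleType())
def py_haversine(lon1, lat1, lon2, lat2):
    return 6371.0 * math.acos(
        math.sin(math.radians(lat1)) * math.sin(math.radians(lat2)) +
        math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
        math.cos(math.radians(lon1) - math.radians(lon2)))

df.withColumn("d", native).explain()   # no Python stage in the plan
df.withColumn("d", py_haversine(
    "Longitude1", "Latitude1", "Longitude2", "Latitude2")).explain()
# the second plan contains a BatchEvalPython node for the UDF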



On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru wrote:

> Hi All,
>
> Yes, I need to add the scenario-based code below to the running Spark
> job. While executing it, it takes a lot of time to complete. Please
> suggest the best way to meet the requirement below without using a UDF.
>
> Thanks,
>
> Ankamma Rao B

--
Best Regards,
Ayan Guha


Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the PySpark DataFrame

2021-04-09 Thread Rao Bandaru
Hi All,

Yes, I need to add the scenario-based code below to the running Spark
job. While executing it, it takes a lot of time to complete. Please
suggest the best way to meet the requirement below without using a UDF.

Thanks,
Ankamma Rao B

From: Sean Owen
Sent: Friday, April 9, 2021 6:11 PM
To: ayan guha
Cc: Rao Bandaru; User
Subject: Re: [Spark SQL]: to calculate distance between four coordinates
(Latitude1, Longitude1, Latitude2, Longitude2) in the PySpark DataFrame

Note: this can be significantly faster with a pandas UDF, because you can
vectorize the operations.

On Fri, Apr 9, 2021, 7:32 AM ayan guha <guha.a...@gmail.com> wrote:
Hi

We are using a haversine distance function for this, and wrapping it in a udf.

from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
from pyspark.sql.types import FloatType

def haversine_distance(long_x, lat_x, long_y, lat_y):
    return acos(
        sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
        cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
        cos(toRadians(long_x) - toRadians(long_y))
    ) * lit(6371.0)

distudf = udf(haversine_distance, FloatType())

In case you want to use just Spark SQL, you can still use the functions
shown above to implement it in SQL.

Any reason you do not want to use a UDF?

Credit: <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>

On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <rao.m...@outlook.com> wrote:
Hi All,



I have a requirement to calculate the distance between four coordinates
(Latitude1, Longitude1, Latitude2, Longitude2) in a PySpark DataFrame with
the help of "from geopy import distance", without using a UDF (user-defined
function). Please help with how to achieve this scenario.



Thanks,

Ankamma Rao B


--
Best Regards,
Ayan Guha


Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the PySpark DataFrame

2021-04-09 Thread Sean Owen
Note: this can be significantly faster with a pandas UDF, because you can
vectorize the operations.

On Fri, Apr 9, 2021, 7:32 AM ayan guha wrote:

> Hi
>
> We are using a haversine distance function for this, and wrapping it in
> a udf.
>
> from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
> from pyspark.sql.types import FloatType
>
> def haversine_distance(long_x, lat_x, long_y, lat_y):
>     return acos(
>         sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>         cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>         cos(toRadians(long_x) - toRadians(long_y))
>     ) * lit(6371.0)
>
> distudf = udf(haversine_distance, FloatType())
>
> In case you want to use just Spark SQL, you can still use the functions
> shown above to implement it in SQL.
>
> Any reason you do not want to use a UDF?
>
> Credit: <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>
> --
> Best Regards,
> Ayan Guha
>


Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the PySpark DataFrame

2021-04-09 Thread ayan guha
Hi

We are using a haversine distance function for this, and wrapping it in
a udf.

from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
from pyspark.sql.types import FloatType

# Great-circle distance in km between two (longitude, latitude) points,
# built entirely from Spark Column functions (6371.0 km is the Earth's
# mean radius).
def haversine_distance(long_x, lat_x, long_y, lat_y):
    return acos(
        sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
        cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
        cos(toRadians(long_x) - toRadians(long_y))
    ) * lit(6371.0)

distudf = udf(haversine_distance, FloatType())
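
Because the function body is built from Spark Column functions, it can also
be applied directly as a column expression, skipping the udf wrapper (and its
per-row Python overhead) entirely. (As written, the udf wrapper would in fact
fail at run time: a plain Python UDF receives ordinary floats, which the
pyspark column functions do not accept.) A sketch, assuming a DataFrame df
with illustrative column names:

from pyspark.sql.functions import col

df = df.withColumn(
    "distance_km",
    haversine_distance(col("Longitude1"), col("Latitude1"),
                       col("Longitude2"), col("Latitude2")))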

In case you want to use just Spark SQL, you can still use the functions
shown above to implement it in SQL.
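
For reference, a sketch of the same formula in plain Spark SQL, using only
built-in functions (assumes a SparkSession named spark; the view and column
names are illustrative):

df.createOrReplaceTempView("coords")
spark.sql("""
    SELECT *,
           ACOS(SIN(RADIANS(Latitude1)) * SIN(RADIANS(Latitude2)) +
                COS(RADIANS(Latitude1)) * COS(RADIANS(Latitude2)) *
                COS(RADIANS(Longitude1) - RADIANS(Longitude2))) * 6371.0
             AS distance_km
    FROM coords
""").show()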

Any reason you do not want to use a UDF?

Credit: <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>


On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru wrote:

> Hi All,
>
>
>
> I have a requirement to calculate the distance between four coordinates
> (Latitude1, Longitude1, Latitude2, Longitude2) in a PySpark DataFrame
> with the help of "from geopy import distance", without using a UDF
> (user-defined function). Please help with how to achieve this scenario.
>
>
>
> Thanks,
>
> Ankamma Rao B
>
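
Worth noting for the original geopy requirement: geopy's distance functions
are plain Python, so calling them from Spark does require some kind of UDF.
A hedged sketch using a pandas UDF wrapper (the function name and column
order are illustrative):

import pandas as pd
from geopy import distance
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def geodesic_km(lat1: pd.Series, lon1: pd.Series,
                lat2: pd.Series, lon2: pd.Series) -> pd.Series:
    # geopy still runs row by row, but Arrow batching at least avoids
    # per-row (de)serialization between the JVM and Python.
    return pd.Series(
        distance.distance((a, b), (c, d)).km
        for a, b, c, d in zip(lat1, lon1, lat2, lon2))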


-- 
Best Regards,
Ayan Guha