Hi - interesting stuff. My stand has always been to use Spark native functions, then pandas, then native Python, in that order.
To OP - did you try the code? What kind of perf are you seeing? Just
curious, why do you think UDFs are bad?

On Sat, 10 Apr 2021 at 2:36 am, Sean Owen <sro...@gmail.com> wrote:

> Actually, good question, I'm not sure. I don't think that Spark would
> vectorize these operations over rows.
> Whereas in a pandas UDF, given a DataFrame, you can apply operations like
> sin to 1000s of values at once in native code via numpy. It's trivially
> 'vectorizable' and I've seen good wins over, at least, a single-row UDF.
>
> On Fri, Apr 9, 2021 at 9:14 AM ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi Sean - absolutely open to suggestions.
>>
>> My impression was that using Spark native functions should provide
>> similar perf to Scala ones, because the serialization penalty should
>> not be there, unlike with native Python UDFs.
>>
>> Is that understanding wrong?
>>
>> On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru <rao.m...@outlook.com> wrote:
>>
>>> Hi All,
>>>
>>> Yes, I need to add the scenario-based code below to the running Spark
>>> job. While executing, it took a lot of time to complete; please suggest
>>> the best way to meet the requirement below without using a UDF.
>>>
>>> Thanks,
>>> Ankamma Rao B
>>> ------------------------------
>>> *From:* Sean Owen <sro...@gmail.com>
>>> *Sent:* Friday, April 9, 2021 6:11 PM
>>> *To:* ayan guha <guha.a...@gmail.com>
>>> *Cc:* Rao Bandaru <rao.m...@outlook.com>; User <user@spark.apache.org>
>>> *Subject:* Re: [Spark SQL]: to calculate distance between four
>>> coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the
>>> PySpark dataframe
>>>
>>> Note that this can be significantly faster with a pandas UDF, because
>>> you can vectorize the operations.
>>>
>>> On Fri, Apr 9, 2021, 7:32 AM ayan guha <guha.a...@gmail.com> wrote:
>>>
>>> Hi
>>>
>>> We are using a haversine distance function for this, and wrapping it
>>> in a udf.
>>>
>>> from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
>>> from pyspark.sql.types import FloatType
>>>
>>> def haversine_distance(long_x, lat_x, long_y, lat_y):
>>>     return acos(
>>>         sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>>>         cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>>>         cos(toRadians(long_x) - toRadians(long_y))
>>>     ) * lit(6371.0)
>>>
>>> distudf = udf(haversine_distance, FloatType())
>>>
>>> In case you want to use just Spark SQL, you can still utilize the
>>> functions shown above to implement it in SQL.
>>>
>>> Any reason you do not want to use a UDF?
>>>
>>> Credit:
>>> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>>>
>>> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <rao.m...@outlook.com>
>>> wrote:
>>>
>>> Hi All,
>>>
>>> I have a requirement to calculate the distance between four
>>> coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in a
>>> *PySpark dataframe*, with the help of *distance* from *geopy*, without
>>> using a *UDF* (user defined function). Please suggest how to achieve
>>> this scenario.
>>>
>>> Thanks,
>>> Ankamma Rao B

-- 
Best Regards,
Ayan Guha