Currently, Python UDFs run in Python instances and are MUCH slower than Scala ones (from 10x to 100x). There is a JIRA to improve the performance: https://issues.apache.org/jira/browse/SPARK-8632. Even after that, they will still be much slower than Scala ones (because Python itself is slower, and because of the overhead of calling into Python).
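To give a rough feel for where the "overhead for calling Python" comes from, here is a toy model in plain Python. It is only a sketch and not Spark's actual code path: it assumes rows are pickled in batches on their way to and from the Python side, and the names add_one, apply_in_process, apply_via_round_trip and batch_size are made up for illustration. The timings it prints say nothing about the real 10x to 100x gap; they only show the kind of extra work involved.

```python
import pickle
import timeit


def add_one(x):
    # Hypothetical stand-in for a user-defined function.
    return x + 1


def apply_in_process(func, rows):
    # Call the function directly on each value, with no serialization step.
    return [func(r) for r in rows]


def apply_via_round_trip(func, rows, batch_size=100):
    # Toy model of the out-of-process path (an assumption, not Spark's code):
    # each batch is serialized, deserialized on the "worker" side, processed,
    # and the results are serialized and deserialized again on the way back.
    out = []
    for start in range(0, len(rows), batch_size):
        payload = pickle.dumps(rows[start:start + batch_size])
        batch = pickle.loads(payload)
        result_payload = pickle.dumps([func(r) for r in batch])
        out.extend(pickle.loads(result_payload))
    return out


if __name__ == "__main__":
    rows = list(range(50000))
    direct = timeit.timeit(lambda: apply_in_process(add_one, rows), number=3)
    shipped = timeit.timeit(lambda: apply_via_round_trip(add_one, rows), number=3)
    print("direct calls:      %.4f s" % direct)
    print("pickle round trip: %.4f s" % shipped)
```

Both functions return the same results; the second one just does more work per batch. In a real job there is also the cost of moving the bytes between processes, which this sketch leaves out.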
On Mon, Jul 6, 2015 at 12:55 PM, Eskilson,Aleksander <alek.eskil...@cerner.com> wrote:
> Hi there,
>
> I’m trying to get a feel for how User Defined Functions from SparkSQL (as
> written in Python and registered using the udf function from
> pyspark.sql.functions) are run behind the scenes. Trying to grok the source
> it seems that the native Python function is serialized for distribution to
> the clusters. In practice, it seems to be able to check for other variables
> and functions defined elsewhere in the namespace and include those in the
> function’s serialization.
>
> Following all this though, when actually run, are Python interpreter
> instances on each node brought up to actually run the function against the
> RDDs, or can the serialized function somehow be run on just the JVM? If
> bringing up Python instances is the execution model, what is the overhead of
> PySpark UDFs like as compared to those registered in Scala?
>
> Thanks,
> Alek