DataFrame python UDF performance too slow

Bijay Pathak Thu, 24 Mar 2016 09:21:24 -0700

Hi,

I am running Spark 1.6.0 on EMR, and the job fails with OOM. I have a DataFrame
with 250 columns and I am applying a UDF to more than 50 of the columns. I am
registering the DataFrame as a temp table and applying the UDF in a hive_context
sql statement. The UDF is applied after a sort merge join of two DataFrames
(each around 4GB) and multiple broadcast joins with 22 Dim tables.


Below is how I am applying the UDF.

data_frame.registerTempTable("temp_table")
new_df = hive_context.sql(
    "select python_udf(column_1), python_udf(column_2), ... from temp_table")

There is a Jira for the same issue (
https://issues.apache.org/jira/browse/SPARK-8632), which is resolved for
1.6.0, but I am still running into a similar issue.

Thanks,

Bijay
