Ohad Raviv created SPARK-37752:
----------------------------------

             Summary: Python UDF fails when it should not get evaluated
                 Key: SPARK-37752
                 URL: https://issues.apache.org/jira/browse/SPARK-37752
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.4
            Reporter: Ohad Raviv
Haven't checked on newer versions yet. If I define in Python:

{code:python}
def udf1(col1):
    print(col1[2])
    return "blah"

spark.udf.register("udf1", udf1)
{code}

and then use it in SQL:

{code:sql}
select case when length(c)>2 then udf1(c) end from (
    select explode(array("123","234","12")) as c
)
{code}

it fails with:

{noformat}
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 155, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 76, in <lambda>
    return lambda *a: f(*a)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in udf1
IndexError: string index out of range
{noformat}

On the out-of-range row ("12") the UDF should not get evaluated at all, since the CASE WHEN only passes through strings longer than 2 characters. The same scenario works fine when we define a Scala UDF instead. Will check now whether it also happens on newer versions.
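For reference, a possible workaround until this is resolved: make the UDF itself defensive. This is a minimal sketch ({{udf1_safe}} is a hypothetical name, not part of the report), and it assumes the behavior described above, namely that Spark may evaluate the Python UDF on rows the surrounding CASE WHEN would filter out:

{code:python}
# Assumes an active SparkSession bound to `spark`, as in the report above.
def udf1_safe(col1):
    # Guard against rows the CASE WHEN was supposed to filter out:
    # the UDF may still be invoked on them, so check the length here.
    if col1 is None or len(col1) <= 2:
        return None  # the CASE WHEN discards these rows anyway
    print(col1[2])
    return "blah"

spark.udf.register("udf1_safe", udf1_safe)

spark.sql("""
    select case when length(c)>2 then udf1_safe(c) end
    from (select explode(array("123","234","12")) as c)
""").show()
{code}

Guarding inside the UDF keeps the query correct regardless of where the optimizer places the Python evaluation relative to the CASE WHEN.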