I threw together a quick example that replicates what you see, then looked
at the physical plan:

from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Row

# UDF from the original question
contains = udf(lambda s, arr: s in arr, BooleanType())

# The second row has a null list_names, which is what triggers the error
df = spark.createDataFrame([
    Row(list_names=['a', 'b', 'c', 'd'], name='a'),
    Row(list_names=None, name='a')])
df2 = df.withColumn('match_flag', col('list_names').isNull() |
                    contains(col('name'), col('list_names')))

Running df2.show() raises the error you mentioned. The physical plan
(from df2.explain()) shows why:

== Physical Plan ==
*(1) Project [list_names#27, name#28, (isnull(list_names#27) || pythonUDF0#47) AS match_flag#32]
+- BatchEvalPython [<lambda>(name#28, list_names#27)], [list_names#27, name#28, pythonUDF0#47]
   +- Scan ExistingRDD[list_names#27,name#28]

Spark evaluates the Python UDF for every row up front, in case its result
is needed. My guess is that the architecture of the Python UDF pipeline
requires the values to be processed together in a batch (the
BatchEvalPython step), so the UDF runs before the "||" is ever checked. It
appears that the result is stored in a column reference (pythonUDF0#47)
that is then used by the WholeStageCodegen phase that follows the UDF
evaluation:

[image: Screen Shot 2019-05-13 at 4.31.17 PM.png]

If you look at the code generated by the codegen, it seems like the "or"
condition might be optimized into a nested if/then/else statement, but I'm
not experienced in digging into codegen output.
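
Since the UDF appears to run on every row regardless of the isNull() check,
one possible workaround is to make the UDF itself tolerate a null list. The
sketch below is only an illustration: the names null_safe_contains and
add_match_flag are made up for this example, and it assumes a DataFrame with
the same 'name' and 'list_names' columns as above. I have not checked it
against your real data.

```python
def null_safe_contains(s, arr):
    """Membership test that tolerates a null (None) list.

    Hypothetical helper: returns None when arr is None, which Spark
    stores as a SQL NULL in the resulting column.
    """
    if arr is None:
        return None
    return s in arr


def add_match_flag(df):
    """Add match_flag to df; assumes columns 'name' and 'list_names'."""
    # Imported here so the plain Python function above can be tried
    # without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    contains = udf(null_safe_contains, BooleanType())
    return df.withColumn(
        'match_flag',
        col('list_names').isNull() | contains(col('name'), col('list_names')))
```

With this version the UDF returns NULL for rows where list_names is null, and
since "true OR NULL" is true in SQL, match_flag should still come out true
for those rows. Usage would be df2 = add_match_flag(df).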

Hope this helps!

-Nick

Nicholas Szandor Hakobian, Ph.D.
Principal Data Scientist
Rally Health




On Mon, May 13, 2019 at 8:38 AM Rishi Shah <rishishah.s...@gmail.com> wrote:

> Hi All,
>
> I am using or operator "|" in withColumn clause on a DataFrame in pyspark.
> However it looks like it always evaluates all the conditions regardless of
> first condition being true. Please find a sample below:
>
> contains = udf(lambda s, arr : s in arr, BooleanType())
>
> df.withColumn('match_flag', (col('list_names').isNull()) |
> (contains(col('name'), col('list_names'))))
>
> Here where list_names is null, it starts to throw an error : NoneType is
> not iterable.
>
> Any idea?
>
> --
> Regards,
>
> Rishi Shah
>
