Maciej Szymkiewicz created SPARK-19160:
------------------------------------------
             Summary: UDF creation
                 Key: SPARK-19160
                 URL: https://issues.apache.org/jira/browse/SPARK-19160
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark, SQL
    Affects Versions: 2.1.0, 2.0.0, 1.6.0, 1.5.0
            Reporter: Maciej Szymkiewicz


Right now there are a few ways we can create a UDF:

- With a standalone function:

{code}
def _add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1

add_one = udf(_add_one, IntegerType())
{code}

This allows for full control flow, including exception handling, but duplicates variables.

- With a `lambda` expression:

{code}
add_one = udf(lambda x: x + 1 if x is not None else None, IntegerType())
{code}

No variable duplication, but limited to pure expressions.

- Using a nested function with an immediate call:

{code}
def add_one(c):
    def add_one_(x):
        if x is not None:
            return x + 1
    return udf(add_one_, IntegerType())(c)
{code}

Quite verbose, but enables full control flow and clearly indicates the expected number of arguments.

- Using `udf` as a decorator:

{code}
@udf
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
{code}

Possible, but only with the default `returnType` (or curried `@partial(udf, returnType=IntegerType())`).

Proposed:

Add a `udf` decorator which can be used as follows:

{code}
from pyspark.sql.decorators import udf

@udf(IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
{code}

or

{code}
@udf()
def strip(x):
    """Strips String"""
    if x is not None:
        return x.strip()
{code}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
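For reference, the call shape the proposal relies on is the standard parameterized-decorator pattern: `udf(returnType)` returns the actual decorator, which then receives the function. The sketch below is illustrative only and does not touch Spark: the `udf` here is a hypothetical stand-in for the proposed `pyspark.sql.decorators.udf`, and the string `"integer"` stands in for an `IntegerType()` instance.

```python
import functools


def udf(returnType=None):
    """Hypothetical sketch of the proposed decorator (not the real
    pyspark API): called with an optional return type, it returns the
    decorator that wraps the function."""
    def decorator(f):
        @functools.wraps(f)  # preserves __name__ and the docstring
        def wrapper(*args):
            return f(*args)
        wrapper.returnType = returnType  # record the declared type
        return wrapper
    return decorator


@udf("integer")  # "integer" stands in for IntegerType()
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
```

With this shape both spellings from the proposal work: `@udf(IntegerType())` passes an explicit type, and `@udf()` falls back to the default `returnType`, while `functools.wraps` keeps the wrapped function's name and docstring intact.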