Maciej Szymkiewicz created SPARK-19160:
------------------------------------------

             Summary: UDF creation
                 Key: SPARK-19160
                 URL: https://issues.apache.org/jira/browse/SPARK-19160
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark, SQL
    Affects Versions: 2.1.0, 2.0.0, 1.6.0, 1.5.0
            Reporter: Maciej Szymkiewicz


Right now there are a few ways we can create a UDF:

- With a standalone function:
{code}
def _add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1

add_one = udf(_add_one, IntegerType())
{code}
    This allows for full control flow, including exception handling, but 
requires duplicating names (a private helper plus the UDF variable).

- With a `lambda` expression:
{code}
add_one = udf(lambda x: x + 1 if x is not None else None, IntegerType())
{code}
No name duplication, but limited to a single pure expression.

- Using a nested function with an immediate call:
{code}
def add_one(c):
    def add_one_(x):
        if x is not None:
            return x + 1
    return udf(add_one_, IntegerType())(c)
{code}
Quite verbose, but enables full control flow and clearly indicates the 
expected number of arguments.
- Using the `udf` function as a decorator:
{code}
@udf
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
{code}
Possible, but only with the default `returnType` (or a curried form such as 
`@partial(udf, returnType=IntegerType())`).
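The curried workaround mentioned above can be sketched in plain Python. The `udf` below is a hypothetical stand-in (it merely records the return type and hands the function back), so the example runs without a Spark session; the same currying trick is what makes `@partial(udf, returnType=IntegerType())` work against the real `udf`:

{code}
from functools import partial

# Hypothetical stand-in for pyspark.sql.functions.udf: attaches the
# declared return type and returns the function unchanged.
def udf(f, returnType="string"):
    f.returnType = returnType
    return f

# partial fixes returnType, leaving a one-argument callable that can
# then be used as a decorator.
@partial(udf, returnType="integer")
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
{code}

With this shim, `add_one(41)` returns `42` and `add_one.returnType` is `"integer"`.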
    
Proposed solution:

Add `udf` decorator which can be used as follows:

{code}
from pyspark.sql.decorators import udf

@udf(IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
{code}

or 

{code}
@udf()
def strip(x):
    """Strips String"""
    if x is not None:
        return x.strip()
{code}
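One way the proposed decorator could be implemented is as a parameterized (two-level) decorator: `udf(...)` returns the decorator proper. The sketch below uses plain strings in place of `DataType` instances so it runs without PySpark; the names and default return type are assumptions, not the actual implementation:

{code}
import functools

# Hypothetical sketch: udf(returnType) returns the actual decorator,
# which wraps the function and records the declared return type.
def udf(returnType="string"):
    def decorator(f):
        @functools.wraps(f)  # preserve __name__ and the docstring
        def wrapper(*args):
            return f(*args)
        wrapper.returnType = returnType
        return wrapper
    return decorator

@udf("integer")
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1

@udf()
def strip(x):
    """Strips String"""
    if x is not None:
        return x.strip()
{code}

Because `udf(...)` is always called before decorating, both `@udf(IntegerType())` and the bare `@udf()` form resolve to the same code path.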


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
