The syntax looks right. Are you still getting the error when you open a new Python session and run this same code? Are you running Spark in local mode on your laptop, or on a YARN-based cluster? It does seem like something in your Python session isn't getting serialized correctly, but it doesn't look like it's related to this code snippet.
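Since the traceback dies in pickle.loads while the worker is deserializing the udf, one quick sanity check is to express the same transform with the built-in when/otherwise column functions, which never ship a pickled Python function to the executors. A minimal sketch, reusing the file and column names from your snippet: if this version runs but the udf version still fails in a fresh session, that points at the serializer/environment rather than your code.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("udf-sanity-check") \
    .getOrCreate()

df = spark.read.csv("full201801.dat", header="true")

# Same transform as the udf, but built entirely from Column expressions,
# so nothing is pickled and sent to the Python workers.
df.select(
    df.PRODUCT_NC,
    F.when(df.PRODUCT_NC == '23040010', 'Non-Fat Dry Milk')
     .otherwise('foo')
     .alias('COMMODITY')
).show()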
On Thu, Apr 4, 2019 at 3:49 PM Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote:

> Are we not supposed to be using udfs anymore? I copied an example straight
> from a book and I'm getting weird results, and I think it's because the
> book is using a much older version of Spark. The code below is pretty
> straightforward, but I'm getting an error nonetheless. I've been doing a
> bunch of googling and not getting many results.
>
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
>
> spark = SparkSession \
>     .builder \
>     .appName("Python Spark SQL basic example") \
>     .getOrCreate()
>
> df = spark.read.csv("full201801.dat", header="true")
>
> columntransform = udf(lambda x: 'Non-Fat Dry Milk' if x == '23040010' else 'foo', StringType())
>
> df.select(df.PRODUCT_NC, columntransform(df.PRODUCT_NC).alias('COMMODITY')).show()
>
> Error:
>
> Py4JJavaError: An error occurred while calling o110.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 2.0 (TID 2, localhost, executor driver):
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 242, in main
>   File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 144, in read_udfs
>   File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 120, in read_single_udf
>   File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 60, in read_command
>   File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 171, in _read_with_length
>     return self.loads(obj)
>   File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 566, in loads
>     return pickle.loads(obj, encoding=encoding)
> TypeError: _fill_function() missing 4 required positional arguments: 'defaults', 'dict', 'module', and 'closure_values'
>
> B.