Hi, I've been trying to debug a Spark UDF for a couple of days now, but I can't figure out what is going on. The UDF pads a 2D array out to a fixed length. When the implementation uses NumPy, it fails with a PickleException; when I rewrite it in plain Python, it works like a charm.
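For reference, the NumPy call itself behaves as expected when I run it outside Spark (a quick check in a plain Python session, with made-up values):

    import numpy as np

    arr = [[1.0, 2.0], [3.0, 4.0]]
    # pad two rows of zeros before the existing rows, nothing along axis 1
    np.pad(arr, [(2, 0), (0, 0)], "constant", constant_values=0.0).tolist()
    # -> [[0.0, 0.0], [0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]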
This does not work:

    from typing import List

    import numpy as np
    from pyspark.sql.functions import udf

    @udf("array<array<float>>")
    def pad(arr: List[List[float]], n: int) -> List[List[float]]:
        return np.pad(arr, [(n, 0), (0, 0)], "constant", constant_values=0.0).tolist()

But this works:

    @udf("array<array<float>>")
    def pad(arr, n):
        padded_arr = []
        for i in range(n):
            padded_arr.append([0.0] * len(arr[0]))
        padded_arr.extend(arr)
        return padded_arr

The calling code is exactly the same for both (I wrap the length difference in lit() so it is passed as a literal column):

    from pyspark.sql.functions import col, lit

    df.withColumn("test", pad(col("array_col"), lit(expected_length - actual_length)))

What am I missing? The arrays contain no NaNs or nulls. Any thoughts, suggestions, or troubleshooting tips would be appreciated.

Best regards,
Sanket
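P.S. In case it helps anyone reproduce this, here is a minimal, self-contained version of what I am running. The schema and values are made up, and `spark` is an active SparkSession:

    from typing import List

    import numpy as np
    from pyspark.sql.functions import col, lit, udf

    @udf("array<array<float>>")
    def pad(arr: List[List[float]], n: int) -> List[List[float]]:
        return np.pad(arr, [(n, 0), (0, 0)], "constant", constant_values=0.0).tolist()

    # one row holding a 2x2 nested array
    df = spark.createDataFrame(
        [([[1.0, 2.0], [3.0, 4.0]],)],
        "array_col: array<array<float>>",
    )

    # pad every array up to 4 rows (actual length here is 2)
    df.withColumn("test", pad(col("array_col"), lit(4 - 2))).show(truncate=False)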