Hi,
I've been trying to debug a Spark UDF for a couple of days now, but I can't
seem to figure out what is going on. The UDF essentially pads a 2D array to
a certain fixed length. When the code uses NumPy, it fails with a
PickleException. When I rewrite it in plain Python, it works like a charm.

This does not work:


import numpy as np
from typing import List
from pyspark.sql.functions import udf

@udf("array<array<float>>")
def pad(arr: List[List[float]], n: int) -> List[List[float]]:
    return np.pad(arr, [(n, 0), (0, 0)], "constant", constant_values=0.0).tolist()

But this works:
@udf("array<array<float>>")
def pad(arr, n):
    # Build n rows of zeros, then append the original rows
    padded_arr = []
    for i in range(n):
        padded_arr.append([0.0] * len(arr[0]))
    padded_arr.extend(arr)
    return padded_arr

The code for calling them is exactly the same in both cases:
df.withColumn("test", pad(col("array_col"), expected_length - actual_length))

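In case it helps with reproducing this, here is a minimal sketch of how I exercise both versions of the UDF (the toy DataFrame and hard-coded lengths are made-up placeholders for my real data, and I wrap the length difference in lit() so the snippet is self-contained):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

# Toy 2x2 float array that should be padded to 4 rows
df = spark.createDataFrame(
    [([[1.0, 2.0], [3.0, 4.0]],)],
    "array_col array<array<float>>",
)

expected_length, actual_length = 4, 2
df.withColumn(
    "test", pad(col("array_col"), lit(expected_length - actual_length))
).show(truncate=False)
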
What am I missing?

The arrays do not contain any NaNs or nulls.
Any thoughts, suggestions, or tips for troubleshooting would be appreciated.

Best regards,
Sanket
