Regular Python UDFs don't use PyArrow under the hood. Yes, they could potentially benefit from it, but that can easily be worked around by switching to Pandas UDFs.
For instance, the two definitions below are virtually identical:

    @udf(...)
    def func(col):
        return col

    @pandas_udf(...)
    def pandas_func(col):
        return col.apply(lambda v: v)

If only a minimal change is needed, I would be positive about adding Arrow support to regular Python UDFs. Otherwise, I am not sure yet.

On Wed, Jul 17, 2019 at 1:19 PM, Abdeali Kothari <abdealikoth...@gmail.com> wrote:
> Hi,
> In Spark 2.3+ I saw that PyArrow was being used in a bunch of places in
> Spark, and I was trying to understand the benefit it provides in terms of
> serialization/deserialization.
>
> I understand that the new pandas UDFs work only if PyArrow is installed.
> But what about the plain old Python UDFs, which can be used in map()-style
> operations?
> Do they also use PyArrow under the hood to reduce the cost of serde? Or
> do they remain as before, so that no performance gain should be expected
> from them?
>
> If I'm not mistaken, plain old Python UDFs could also benefit from Arrow,
> as the cost of serializing/deserializing data between Java and Python
> still exists and could potentially be reduced by using Arrow.
> Is my understanding correct? Are there any plans to implement this?
>
> Pointers to any notes or JIRA tickets about this would be appreciated.
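To make the serde question above concrete, here is a small stdlib-only sketch (not Spark's actual implementation, and pickle standing in for Arrow batching) of why per-row transfer is costlier than batched transfer: pickling each value individually repeats framing overhead per value, while one batched payload amortizes it, which is the kind of saving Arrow's columnar batches provide:

    import pickle

    # Hypothetical column of 10,000 integer values.
    rows = list(range(10_000))

    # Row-at-a-time: one pickle per value, as with the plain @udf path.
    per_row_bytes = sum(len(pickle.dumps(v)) for v in rows)

    # Batched: one pickle for the whole column, standing in for an Arrow batch.
    batch_bytes = len(pickle.dumps(rows))

    # The batched payload carries far less per-value framing overhead.
    print(per_row_bytes, batch_bytes)
    assert batch_bytes < per_row_bytes

Real Arrow batches avoid even more work than this sketch suggests, since pandas UDF columns can be handed to Python as contiguous buffers without per-value object construction.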