Re: PyFlink UDF: When to use vectorized vs scalar

Dian Fu Sun, 18 Apr 2021 18:49:29 -0700

Hi Yik San,

It largely depends on what you want to do in your Python UDF implementation. As 
you know, for a vectorized Python UDF (a.k.a. Pandas UDF), the input data is 
organized in a columnar format. So if your Python UDF implementation could 
benefit from this, e.g. by making use of functionality provided by 
column-oriented libraries such as Pandas and NumPy, then a vectorized Python 
UDF is usually the better choice. However, if you have to process the input 
data one row at a time, then I guess the non-vectorized Python UDF is enough.
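
To make the distinction concrete, here is a minimal sketch of the two kinds of 
function bodies. These are plain Python functions with hypothetical names, not 
registered with Flink; in PyFlink they would be wrapped with `udf(...)`, and as 
far as I know the batch variant is the one declared with `func_type="pandas"`. 
The sketch assumes Pandas is installed.

```python
import json

import pandas as pd


def add_one_scalar(x):
    # Row-at-a-time body: invoked once per input row.
    return x + 1


def add_one_vectorized(xs: pd.Series) -> pd.Series:
    # Batch body: one call handles a whole column, and the addition
    # itself runs inside NumPy rather than in a Python loop.
    return xs + 1


def extract_field_vectorized(json_values: pd.Series, field_name: str) -> pd.Series:
    # JSON parsing has no columnar form: each string is still parsed
    # one at a time, so only the per-call overhead is batched here.
    return json_values.map(lambda s: json.loads(s)[field_name])


batch = pd.Series([1, 2, 3])
print(add_one_vectorized(batch).tolist())
print(extract_field_vectorized(pd.Series(['{"a": 1}', '{"a": 2}']), "a").tolist())
```

The first pair benefits from the columnar layout; the JSON case does not, 
because the work inside the batch is still done row by row.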


PS: You could also run some performance tests when it’s unclear which one is 
better.
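
A rough starting point for such a test, using only the standard library, might 
look like the sketch below. The function names and the sample data are made up 
for illustration. Note that it only measures Python-side call overhead, not the 
serialization between the JVM and the Python VM, so a real comparison should 
run both UDF variants in an actual Flink job on representative data.

```python
import json
import timeit

# Hypothetical sample data: 10,000 small JSON documents.
rows = [json.dumps({"id": i, "name": f"user-{i}"}) for i in range(10_000)]


def extract_scalar(json_value, field_name):
    # Same body as the scalar UDF in the question.
    return json.loads(json_value)[field_name]


def extract_batch(json_values, field_name):
    # Batch variant: one call per batch, but still one parse per row.
    return [json.loads(v)[field_name] for v in json_values]


def run_scalar():
    return [extract_scalar(v, "id") for v in rows]


def run_batch():
    return extract_batch(rows, "id")


for label, fn in (("row at a time", run_scalar), ("batched", run_batch)):
    # Take the best of 3 repeats, each making 5 passes over the data.
    seconds = min(timeit.repeat(fn, number=5, repeat=3))
    print(f"{label}: {seconds:.3f}s for 5 passes")
```

If the two timings come out close, that suggests the parsing itself dominates 
and batching the calls will not help much for this kind of function.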

Regards,
Dian

> On 16 Apr 2021, at 8:24 PM, Fabian Paul <fabianp...@data-artisans.com> wrote:
> 
> Hi Yik San,
> 
> I think the usefulness of vectorized UDFs depends heavily on your input and 
> output formats. For your example, my first impression is that parsing a JSON 
> string is always a rather expensive operation, and vectorization does not 
> have much impact on that.
> 
> I am cc'ing Dian Fu, who is more familiar with PyFlink.
> 
> Best,
> Fabian
> 
>> On 16. Apr 2021, at 11:04, Yik San Chan <evan.chanyik...@gmail.com 
>> <mailto:evan.chanyik...@gmail.com>> wrote:
>> 
>> The question is cross-posted on Stack Overflow:
>> https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar
>> 
>> Is there a simple set of rules to follow when deciding between vectorized 
>> and scalar PyFlink UDFs?
>> 
>> According to the 
>> [docs](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html),
>> vectorized UDFs have two advantages: (1) smaller ser-de and invocation 
>> overhead, and (2) vector calculations are highly optimized thanks to 
>> libraries such as NumPy.
>> 
>> > Vectorized Python user-defined functions are functions which are executed 
>> > by transferring a batch of elements between JVM and Python VM in Arrow 
>> > columnar format. The performance of vectorized Python user-defined 
>> > functions are usually much higher than non-vectorized Python user-defined 
>> > functions as the serialization/deserialization overhead and invocation 
>> > overhead are much reduced. Besides, users could leverage the popular 
>> > Python libraries such as Pandas, Numpy, etc for the vectorized Python 
>> > user-defined functions implementation. These Python libraries are highly 
>> > optimized and provide high-performance data structures and functions.
>> 
>> **QUESTION 1**: Is vectorized UDF ALWAYS preferred? 
>> 
>> Let's say, in my use case, I simply want to extract some fields from a JSON 
>> column, which is not yet supported by Flink's [built-in 
>> functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html),
>> so I need to define my UDF like this:
>> 
>> ```python
>> @udf(...)
>> def extract_field_from_json(json_value, field_name):
>>     import json
>>     return json.loads(json_value)[field_name]
>> ```
>> 
>> **QUESTION 2**: Will I also benefit from vectorized UDF in this case?
>> 
>> Best,
>> Yik San
>
