Re: pyflink sql中select,where都带udf，其中一个udf失效

Dian Fu Sun, 18 Oct 2020 18:52:03 -0700

这个问题是一个bug， 我创建了一个JIRA：https://issues.apache.org/jira/browse/FLINK-19675 
<https://issues.apache.org/jira/browse/FLINK-19675>


出现的条件：在一个Calc里同时有Python UDF、Where条件、复合列访问。

在没有修复之前， 可以这样work around一下：

tmp_table = st_env.from_path("source")\
    .select("kubernetes.get('container').get('name') as name, message, 
clusterName, '@timestamp' "
            "as ts")

data_stream = st_env._j_tenv.toAppendStream(tmp_table._j_table, 
tmp_table._j_table.getSchema().toRowType())
table = Table(st_env._j_tenv.fromDataStream(data_stream, "name, message, 
clusterName, ts"), st_env)

table\
    .where("error_exist(message) = true")\
    .select("log_get(message),'','','ERROR','Asia/Shanghai', 
'ts',name,clusterName") \
    .execute_insert("sink").get_job_client().get_job_execution_result().result()


> 在 2020年10月16日，上午10:47，whh_960101 <[email protected]> 写道：
> 
> 我摘取了plan其中一部分
> 在过滤数据这里
> == Abstract Syntax Tree ==
> 
> +- LogicalFilter(condition=[error_exist($1)])
> 
> 
> 
> 
> 
> 
> == Optimized Logical Plan ==
> 
>      +- PythonCalc(select=[message, kubernetes, clusterName, 
> error_exist(message) AS f0]) 
> #感觉应该是这个地方出问题了，这里应该不是select，应该是where或者filter，上面已经有了LogicalFilter(condition=[error_exist($1)])
> 
> 
> 
> 
> == Physical Execution Plan ==
> 
> 
> 
> 
>  Stage 3 : Operator
> 
>   content : StreamExecPythonCalc
> 
>   ship_strategy : FORWARD
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 在 2020-10-15 20:59:12，"Dian Fu" <[email protected]> 写道：
>> 可以把plan打印出来看一下？打印plan可以参考这个：https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/python/table-api-users-guide/intro_to_table_api.html#explain-tables
>> 
>>> 在 2020年10月15日，下午7:02，whh_960101 <[email protected]> 写道：
>>> 
>>> hi，
>>> 我刚才改了一下你的例子[1]，通过from_elements构建一个source表
>>> 然后使用我的udf
>>> source.where(udf1(msg)=True).select("udf2(msg)").execute_insert('sink').get_job_client().get_job_execution_result().result()
>>> 打印出来的结果能够很好的筛选出我想要的数据
>>> 
>>> 
>>> 
>>> 
>>> 但是通过StreamTableEnvironment，连接kafka，取消息队列来构建source表，即之前那种写法
>>> source  = st_env.from_path('source')
>>> source.where(udf1(msg)=True).select("udf2(msg)").execute_insert('sink').get_job_client().get_job_execution_result().result()
>>> 
>>> 这个where筛选就失效了，最后打印出全部数据
>>> 
>>> 
>>> 而只在where中使用udf，即
>>> source.where(udf1(msg)=True).select("msg").execute_insert('sink').get_job_client().get_job_execution_result().result()
>>> 打印结果就是经过筛选后的
>>> 
>>> 
>>> 
>>> 
>>> 想请问一下这种问题出在哪里？
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 在 2020-10-15 16:57:39，"Xingbo Huang" <[email protected]> 写道：
>>>> Hi,
>>>> 我们有例子覆盖你这种情况的[1]，你需要看下你udf1的逻辑是什么样的
>>>> 
>>>> 
>>>> [1]
>>>> https://github.com/apache/flink/blob/release-1.11/flink-python/pyflink/table/tests/test_udf.py#L67
>>>> 
>>>> Best,
>>>> Xingbo
>>>> 
>>>> whh_960101 <[email protected]> 于2020年10月15日周四 下午2:30写道：
>>>> 
>>>>> 您好，我使用pyflink时的代码如下，有如下问题：
>>>>> 
>>>>> 
>>>>> source  = st_env.from_path('source')
>>>>> #st_env是StreamTableEnvironment，source是kafka源端
>>>>> #只在where语句中加udf1 (input_types =DataTypes.STRING(), result_type =
>>>>> DataTypes.BOOLEAN())
>>>>> table =
>>>>> source.select("msg").where(udf1(msg)=True).execute_insert('sink').get_job_client().get_job_execution_result().result()
>>>>> 
>>>>> 
>>>>> 这样打印出来的结果很好的筛选了数据
>>>>> 
>>>>> 
>>>>> 但如果在select语句中加另一个udf2 (input_types =DataTypes.STRING(), result_type =
>>>>> DataTypes.STRING())（将msg简单处理为新的String）
>>>>> table =
>>>>> source.select("udf2(msg)").where(udf1(msg)=True).execute_insert('sink').get_job_client().get_job_execution_result().result()
>>>>> 这个where筛选就失效了，最后打印出全部数据
>>>>> 
>>>>> 
>>>>> 如果改成where在前也不行，换成filter也不行
>>>>> table =
>>>>> source.where(udf1(msg)=True).select("udf2(msg)").execute_insert('sink').get_job_client().get_job_execution_result().result()
>>>>> 
>>>>> 
>>>>> select、where中的udf会冲突吗？这个问题该怎么解决？
>>>>> 希望您们能够给予解答！感谢！
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>>> 
>>>

Re: pyflink sql中select,where都带udf，其中一个udf失效

回复