Re: What is the flink-kubernetes-operator webhook for?

2022-12-09 Thread Andrew Otto
Okay, thank you both.  We will disable webhook creation unless we end up
needing it.
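For anyone finding this thread later: assuming the operator's standard Helm chart, which exposes a `webhook.create` value, disabling it looks roughly like this (a sketch, not verified against every chart version):

```shell
# Sketch: install the operator without the admission webhook.
# Assumes the Helm chart's webhook.create value; check your chart's values.yaml.
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator \
  --set webhook.create=false
```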



On Fri, Dec 9, 2022 at 9:39 AM Gyula Fóra  wrote:

> To add to what Matyas said:
>
> Validation in itself is a mandatory step for every spec change that is
> submitted to guard against broken configs (things like negative parallelism
> etc).
>
> But validation can happen in 2 places. It can be done through the webhook,
> which would result in upfront rejection of the spec on validation error.
>
> Or it can happen during the regular processing/reconciliation flow, in
> which case errors are recorded in the status.
>
> The webhook is a nice way to get validation errors immediately, but as you
> can see it's not necessary, as validation happens anyway.
>
> Gyula
>
> On Fri, 9 Dec 2022 at 09:21, Őrhidi Mátyás 
> wrote:
>
>> Hi Otto,
>>
>> webhooks in general are optional components of the k8s operator pattern.
>> Mostly used for validation, sometimes for changing custom resources and
>> handling multiple versions, etc. It's an optional component in the Flink
>> Kubernetes Operator too.
>>
>> Regards,
>> Matyas
>>
>> On Fri, Dec 9, 2022 at 5:59 AM Andrew Otto  wrote:
>>
>>> Hello!
>>>
>>> What is the Flink Kubernetes Webhook
>>> 
>>> for?  I probably don't know just because I don't know k8s that well, but
>>> reading code and other docs didn't particularly enlighten me :)
>>>
>>> It looks like maybe it's for doing some extra validation of k8s API
>>> requests, and allows you to customize how those requests are validated and
>>> processed if you have special requirements to do so.
>>>
>>> Since it can be so easily disabled,
>>> do we need to install it for production use?  FWIW, we will not be using
>>> FlinkSessionJob, so perhaps we don't need it if we don't use that?
>>>
>>> Thanks!
>>> -Andrew Otto
>>>  Wikimedia Foundation
>>>
>>


Re: Questions about Flink's restart strategy

2022-12-09 Thread weijie guo
Hi,

1. In Flink, the JobMaster (JM) monitors the state of each Task. If a Task
fails for some reason, the JM triggers failover and decides which tasks
should be restarted. If the JM itself goes down, Flink supports a
high-availability (HA) setup: by persisting some information to an external
system, a standby JM can correctly take over the job.
2. The failover flow correctly handles both a single Task failure and a
TaskManager failure, and the handling is basically the same: a TaskManager
failure can be treated as the failure of all Tasks scheduled onto it.
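As a concrete example (a sketch, assuming the standard configuration keys), a fixed-delay restart strategy can be set in flink-conf.yaml; the JM then applies it when handling a Task failover:

```yaml
# Restart failed tasks up to 3 times, waiting 10 s between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```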

Best regards,

Weijie


Li Yi wrote on Fri, Dec 9, 2022 at 15:28:

> Hi, our team is evaluating Flink-related technology, and we are a bit
> confused about the failure restart strategies ("Task Failure Recovery" in
> the Apache Flink docs).
>
> 1. What mechanism triggers the restart after a failure? I have searched a
> lot of material; it all explains how restart strategies are configured, but
> not who triggers them. A crashed process can't restart itself, can it?
> 2. Does failure restart operate at the Task level or on the TaskManager
> service?
>
> We appreciate and support the Flink developers' work. Thanks!
>


Re: What is the flink-kubernetes-operator webhook for?

2022-12-09 Thread Gyula Fóra
To add to what Matyas said:

Validation in itself is a mandatory step for every spec change that is
submitted to guard against broken configs (things like negative parallelism
etc).

But validation can happen in 2 places. It can be done through the webhook,
which would result in upfront rejection of the spec on validation error.

Or it can happen during the regular processing/reconciliation flow, in
which case errors are recorded in the status.

The webhook is a nice way to get validation errors immediately, but as you
can see it's not necessary, as validation happens anyway.
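In case it helps: without the webhook, a validation failure should show up in the resource status. Something like the following (a sketch; the field path assumes the v1beta1 FlinkDeployment CRD) can surface it:

```shell
# Sketch: read the validation/reconciliation error recorded in the status.
# Field path assumes the v1beta1 FlinkDeployment CRD; "my-deployment" is a placeholder.
kubectl get flinkdeployment my-deployment -o jsonpath='{.status.error}'
```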

Gyula

On Fri, 9 Dec 2022 at 09:21, Őrhidi Mátyás  wrote:

> Hi Otto,
>
> webhooks in general are optional components of the k8s operator pattern.
> Mostly used for validation, sometimes for changing custom resources and
> handling multiple versions, etc. It's an optional component in the Flink
> Kubernetes Operator too.
>
> Regards,
> Matyas
>
> On Fri, Dec 9, 2022 at 5:59 AM Andrew Otto  wrote:
>
>> Hello!
>>
>> What is the Flink Kubernetes Webhook
>> 
>> for?  I probably don't know just because I don't know k8s that well, but
>> reading code and other docs didn't particularly enlighten me :)
>>
>> It looks like maybe it's for doing some extra validation of k8s API
>> requests, and allows you to customize how those requests are validated and
>> processed if you have special requirements to do so.
>>
>> Since it can be so easily disabled,
>> do we need to install it for production use?  FWIW, we will not be using
>> FlinkSessionJob, so perhaps we don't need it if we don't use that?
>>
>> Thanks!
>> -Andrew Otto
>>  Wikimedia Foundation
>>
>


Re: What is the flink-kubernetes-operator webhook for?

2022-12-09 Thread Őrhidi Mátyás
Hi Otto,

webhooks in general are optional components of the k8s operator pattern.
Mostly used for validation, sometimes for changing custom resources and
handling multiple versions, etc. It's an optional component in the Flink
Kubernetes Operator too.

Regards,
Matyas

On Fri, Dec 9, 2022 at 5:59 AM Andrew Otto  wrote:

> Hello!
>
> What is the Flink Kubernetes Webhook
> 
> for?  I probably don't know just because I don't know k8s that well, but
> reading code and other docs didn't particularly enlighten me :)
>
> It looks like maybe it's for doing some extra validation of k8s API
> requests, and allows you to customize how those requests are validated and
> processed if you have special requirements to do so.
>
> Since it can be so easily disabled,
> do we need to install it for production use?  FWIW, we will not be using
> FlinkSessionJob, so perhaps we don't need it if we don't use that?
>
> Thanks!
> -Andrew Otto
>  Wikimedia Foundation
>


What is the flink-kubernetes-operator webhook for?

2022-12-09 Thread Andrew Otto
Hello!

What is the Flink Kubernetes Webhook

for?  I probably don't know just because I don't know k8s that well, but
reading code and other docs didn't particularly enlighten me :)

It looks like maybe it's for doing some extra validation of k8s API
requests, and allows you to customize how those requests are validated and
processed if you have special requirements to do so.

Since it can be so easily disabled,
do we need to install it for production use?  FWIW, we will not be using
FlinkSessionJob, so perhaps we don't need it if we don't use that?

Thanks!
-Andrew Otto
 Wikimedia Foundation


Re: Could not restore keyed state backend for KeyedProcessOperator

2022-12-09 Thread Lars Skjærven
Lifecycle rules: None
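(Checked with something like the following, assuming the gsutil CLI; it prints the bucket's lifecycle configuration, or a message that none is set:)

```shell
# Sketch: inspect lifecycle rules on the checkpoint bucket (gsutil CLI assumed).
# Bucket name taken from the stack trace below.
gsutil lifecycle get gs://some-bucket-name
```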

On Fri, Dec 9, 2022 at 3:17 AM Hangxiang Yu  wrote:

> Hi, Lars.
> Could you check whether you have configured lifecycle rules on the Google
> Cloud Storage bucket[1]? That is not recommended when the bucket is used
> for Flink checkpoints.
>
> [1] https://cloud.google.com/storage/docs/lifecycle
>
> On Fri, Dec 9, 2022 at 2:02 AM Lars Skjærven  wrote:
>
>> Hello,
>> We had an incident today with a job that could not restore after a crash
>> (for an unknown reason). Specifically, it fails due to a missing checkpoint
>> file. We've experienced this a total of three times with Flink 1.15.2, but
>> never with 1.14.x. Last time was during a node upgrade, but that was not
>> the case this time.
>>
>> I've not been able to reproduce this issue. I've checked that I can kill
>> the taskmanager and jobmanager (using kubectl delete pod), and the job
>> restores as expected.
>>
>> The job is running with kubernetes high availability, rocksdb and
>> incremental checkpointing.
>>
>> Any tips are highly appreciated.
>>
>> Thanks,
>> Lars
>>
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed
>> state backend for
>> KeyedProcessOperator_bf374b554824ef28e76619f4fa153430_(2/2) from any of the
>> 1 provided restore options.
>> at
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
>> at
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
>> at
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
>> ... 11 more
>> Caused by: org.apache.flink.runtime.state.BackendBuildingException:
>> Caught unexpected exception.
>> at
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:395)
>> at
>> org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:483)
>> at
>> org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:97)
>> at
>> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
>> at
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
>> at
>> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>> ... 13 more
>> Caused by: java.io.FileNotFoundException: Item not found:
>> 'gs://some-bucket-name/flink-jobs/namespaces/default/jobs/d60a6c94-ddbc-42a1-947e-90f62749835a/checkpoints/d60a6c94ddbc42a1947e90f62749835a/shared/3cb2bb55-b4b0-44e5-948a-5d38ec088253'.
>> Note, it is possible that the live version is still available but the
>> requested generation is deleted.
>> at
>> com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(GoogleCloudStorageExceptions.java:46)
>>
>>
>
> --
> Best,
> Hangxiang.
>


Re: ArgoCD unable to process health from k8s FlinkDeployment - stuck in Processing

2022-12-09 Thread Gyula Fóra
Hi!

The resource lifecycle state is currently not shown explicitly in the
status.

You are confusing it with reconciliation status. At the moment you can only
get this through the Java client:

https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator-api/src/main/java/org/apache/flink/kubernetes/operator/api/status/CommonStatus.java

This seems to be a very common request so we should probably expose this
field directly in the status, even though it could technically be derived
from other fields.
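In the meantime, the raw fields it would be derived from are visible on the resource itself; a rough sketch (field paths assume the v1beta1 CRD shown in your output):

```shell
# Sketch: inspect the job state and the reconciliation state directly.
# "my-deployment" is a placeholder for your FlinkDeployment name.
kubectl get flinkdeployment my-deployment \
  -o jsonpath='{.status.jobStatus.state}{"  "}{.status.reconciliationStatus.state}'
```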

Could you please open a jira ticket for this improvement?

Cheers
Gyula

On Fri, 9 Dec 2022 at 05:46, Edgar H  wrote:

> Morning all,
>
> Recently been testing the Flink k8s operator (flink.apache.org/v1beta1)
> and although the jobs do startup and run perfectly fine, their status in
> ArgoCD is not yet as it should be, some details:
>
> When describing the flinkdeployment I'm currently trying to test, the
> following appears in its events:
>
>   Type    Reason         Age   From                  Message
>   ----    ------         ---   ----                  -------
>   Normal  Submit 22m   JobManagerDeployment  Starting deployment
>   Normal  StatusChanged  21m   Job   Job status changed
> from RECONCILING to CREATED
>   Normal  StatusChanged  20m   Job   Job status changed
> from CREATED to RUNNING
>
> On top of it, the reconciliation timestamp and the state are as follows:
>
> Reconciliation Timestamp:  1670581014190
> State: DEPLOYED
>
> From what I've read in the docs, the flinkdeployment is not considered
> healthy until that state: STABLE, right?
>
>
>- DEPLOYED : The resource is deployed/submitted to Kubernetes, but
>it’s not yet considered to be stable and might be rolled back in the future
>- STABLE : The resource deployment is considered to be stable and
>won’t be rolled back
>
>
> The jobs have been running for some hours already; one of them throws
> occasional exceptions, but without causing downtime. What does it take for
> a job to reach the STABLE state rather than just DEPLOYED? Could that be
> the cause of the "Processing..." health status in ArgoCD, or is it just
> that, internally in k8s, the Flink operator can't really tell that the pods
> are running well?
>


ArgoCD unable to process health from k8s FlinkDeployment - stuck in Processing

2022-12-09 Thread Edgar H
Morning all,

Recently been testing the Flink k8s operator (flink.apache.org/v1beta1) and
although the jobs do startup and run perfectly fine, their status in ArgoCD
is not yet as it should be, some details:

When describing the flinkdeployment I'm currently trying to test, the
following appears in its events:

  Type    Reason         Age   From                  Message
  ----    ------         ---   ----                  -------
  Normal  Submit 22m   JobManagerDeployment  Starting deployment
  Normal  StatusChanged  21m   Job   Job status changed
from RECONCILING to CREATED
  Normal  StatusChanged  20m   Job   Job status changed
from CREATED to RUNNING

On top of it, the reconciliation timestamp and the state are as follows:

Reconciliation Timestamp:  1670581014190
State: DEPLOYED

From what I've read in the docs, the flinkdeployment is not considered
healthy until that state: STABLE, right?

- DEPLOYED : The resource is deployed/submitted to Kubernetes, but it’s not
yet considered to be stable and might be rolled back in the future
- STABLE : The resource deployment is considered to be stable and won’t be
rolled back

The jobs have been running for some hours already; one of them throws
occasional exceptions, but without causing downtime. What does it take for
a job to reach the STABLE state rather than just DEPLOYED? Could that be
the cause of the "Processing..." health status in ArgoCD, or is it just
that, internally in k8s, the Flink operator can't really tell that the pods
are running well?
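(Side note for anyone comparing timestamps: the Reconciliation Timestamp above is epoch milliseconds; converting it, e.g. in Python:)

```python
from datetime import datetime, timezone

# The reconciliation timestamp from the FlinkDeployment status is epoch millis.
ts_millis = 1670581014190
dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
print(dt.isoformat())  # 2022-12-09T10:16:54.190000+00:00
```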


Re: ZLIB Vulnerability Exposure in Flink statebackend RocksDB

2022-12-09 Thread Martijn Visser
Hi Vidya,

Please keep in mind that the Flink project is driven by volunteers. If
you're noticing an outdated version of the lz4 compression library and an
update is required, it would be great if you could open a PR to update that
dependency yourself.
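For reference, assuming the version is controlled by a property in Flink's root pom.xml, the change would be roughly a one-liner (a sketch; the exact property name should be checked against the actual pom):

```xml
<!-- Hypothetical sketch of the bump in flink's root pom.xml;
     verify the property name against the real file before opening a PR. -->
<lz4.version>1.9.2</lz4.version>
```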

Best regards,

Martijn

On Thu, Dec 8, 2022 at 10:31 PM Vidya Sagar Mula 
wrote:

> Thank you Yanfei for taking this issue as a bug and planning a fix in the
> upcoming version.
>
> I have another vulnerability bug coming on our product. It is related to
> the "LZ4" compression library version. Can you please take a look at this
> link?
> https://nvd.nist.gov/vuln/detail/CVE-2019-17543
>
> I have noticed that the Flink code base is using *lz4 version 1.8.0* (in
> the root pom.xml). The vulnerability is present for versions *before 1.9.2*.
>
> https://github.com/apache/flink/blob/master/pom.xml
>
> Can you please look into this issue also and address it in the coming
> releases?
>
> Questions:
> ---
> - Is the code actually using this compression library? Can this
> vulnerability issue be ignored?
>
> - Can you please let me know if this is going to be addressed. If yes,
> until we move to the new Flink version to get the latest changes, would it
> be ok if we upgrade the version of LZ4 in our local cloned code base? I
> would like to understand the impact if we make changes in our local Flink
> code with regards to testing efforts and any other affected modules?
>
> Can you please clarify this?
>
> Thanks,
> Vidya Sagar.
>
>
> On Wed, Dec 7, 2022 at 7:59 AM Yanfei Lei  wrote:
>
>> Hi Vidya Sagar,
>>
>> Thanks for bringing this up.
>>
>> The RocksDB state backend defaults to Snappy[1]. If the compression
>> option is not specifically configured, this vulnerability of ZLIB has no
>> effect on the Flink application for the time being.
>>
>> *> is there any plan in the coming days to address this? *
>>
>> FRocksDB 6.20.3-ververica-1.0 does depend on ZLIB 1.2.11; FLINK-30321 [2]
>> has been created to address this.
>>
>> *> If this needs to be fixed, is there any plan from Ververica to address
>> this vulnerability?*
>>
>> Yes, we plan to publish a new version of FRocksDB[3] in Flink 1.17, and
>> FLINK-30321 [2] would be included in the new release.
>>
>> *> how to address this vulnerability issue as this is coming as a high
>> severity blocking issue to our product.*
>>
>> As a kind of mitigation, don't configure ZLIB compression for RocksDB
>> state backend.
>> If ZLIB must be used now and your product can't wait, maybe you can refer
>> to this release document[4] to release your own version.
>>
>> [1] https://github.com/facebook/rocksdb/wiki/Compression
>> [2] https://issues.apache.org/jira/browse/FLINK-30321
>> [3] https://cwiki.apache.org/confluence/display/FLINK/1.17+Release
>> [4]
>> https://github.com/ververica/frocksdb/blob/FRocksDB-6.20.3/FROCKSDB-RELEASE.md
>>
>> --
>> Best,
>> Yanfei
>> Ververica (Alibaba)
>>
Vidya Sagar Mula wrote on Wed, Dec 7, 2022 at 06:47:
>>
>>> Hi,
>>>
>>> There is a ZLIB vulnerability reported by the official National
>>> Vulnerability Database. This vulnerability causes memory corruption while
>>> deflating with ZLIB version less than 1.2.12.
>>> Here is the link for details...
>>>
>>>
>>> https://nvd.nist.gov/vuln/detail/cve-2018-25032#vulnCurrentDescriptionTitle
>>>
>>> *How is it linked to Flink?: *
>>> In the Flink RocksDB state backend, ZLIB version 1.2.11 is used as part
>>> of the .so file. Hence, there is vulnerability exposure here.
>>>
>>> *Flink code details/links:*
>>> In the latest Flink code base, the RocksDB state backend library
>>> *(frocksdbjni)* comes from Ververica. The pom.xml dependency snippet is
>>> here:
>>>
>>>
>>> https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/pom.xml
>>>
>>> <dependency>
>>>   <groupId>com.ververica</groupId>
>>>   <artifactId>frocksdbjni</artifactId>
>>>   <version>6.20.3-ververica-1.0</version>
>>> </dependency>
>>>
>>>
>>> When I see the frocksdbjni code base, the makefile is pointing to
>>> ZLIB_VER=1.2.11. This ZLIB version is vulnerable as per the NVD.
>>>
>>> https://github.com/ververica/frocksdb/blob/FRocksDB-6.20.3/Makefile
>>>
>>> *Questions:*
>>>
>>> - This vulnerability is marked as HIGH severity. How is it addressed at
>>> the Flink/Flink Stateback RocksDb? If not now, is there any plan in the
>>> coming days to address this?
>>>
>>> - As the Statebackend RocksDb is coming from Ververica, I am not seeing
>>> any latest artifacts published from them. As per the Maven Repository, the
>>> latest version is 6.20.3-ververica-1.0, and this is the one used in the
>>> Flink code base.
>>>
>>> https://mvnrepository.com/artifact/com.ververica/frocksdbjni
>>>
>>> If this needs to be fixed, is there any plan from Ververica to address
>>> this vulnerability?
>>>
>>> - From the Flink user perspective, it is not simple to make the changes