The Istio guideline implies that this is guidance, not a standard. Is that correct? Is there a standard (already)? I think that we, as Flink, should follow a standard and avoid implementing guidelines from different vendors/providers.

On Mon, Jun 20, 2022 at 13:36, Nathan Fisher <nfis...@junctionbox.ca> wrote:
> Would it make sense to add the annotations to the task manager and job
> manager? In a non-istio environment it’d be a noop.
>
> mTLS as a requirement is more complicated but having some docs around
> using cert-manager might be enough depending on the orgs requirement.
>
> On Mon, Jun 20, 2022 at 06:18, Őrhidi Mátyás <matyas.orh...@gmail.com>
> wrote:
>
>> It seems Istio must be configured to allow Akka cluster communication to
>> bypass the Istio sidecar proxy:
>> https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html
>>
>> On Mon, Jun 20, 2022 at 11:30 AM Sigalit Eliazov <e.siga...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> we have enabled HA as suggested, the task manager tries to reach the job
>>> manager via pod id as expected but
>>> the task manager is unable to connect to the job manager:
>>>
>>> 2022-06-19 22:14:45,101 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0(8a98fdb734615089485c685afb0f402d).
>>>
>>> 2022-06-19 22:14:45,242 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [/192.168.3.144:6123] failed with java.io.IOException: Connection reset by peer
>>>
>>> 2022-06-19 22:14:45,249 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@192.168.3.144:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@192.168.3.144:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>>>
>>> 2022-06-19 22:14:45,255 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0.
>>>
>>> 2022-06-
>>>
>>> Are there any additional definitions required for that?
>>>
>>> thanks
>>>
>>> Sigalit
>>>
>>> On Thu, Jun 16, 2022 at 2:28 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>>
>>>> Could you please have a try with high availability enabled[1]?
>>>>
>>>> If HA is enabled, the internal jobmanager rpc service will not be created.
>>>> Instead, the TaskManager retrieves the JobManager address via HA services
>>>> and connects to it via pod IP.
>>>>
>>>> [1]. https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> Elisha, Moshe (Nokia - IL/Kfar Sava) <moshe.eli...@nokia.com>
>>>> wrote on Thu, Jun 16, 2022 at 15:24:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are launching Flink deployments using the Flink Kubernetes Operator
>>>>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/>
>>>>> on a Kubernetes cluster with Istio and mTLS enabled.
>>>>>
>>>>> We found that the TaskManager is unable to communicate with the
>>>>> JobManager on the jobmanager-rpc port:
>>>>>
>>>>> 2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>>>>>
>>>>> The reason for the issue is that the JobManager service port
>>>>> definitions are not following the Istio guidelines
>>>>> https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/
>>>>> (see example below).
>>>>>
>>>>> We believe a change to the default port definitions is needed but for
>>>>> now, is there an immediate action we can take to work around the issue?
>>>>> Perhaps overriding the default port definitions somehow?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> flink-kubernetes-operator 1.0.0
>>>>> Flink 1.14-java11
>>>>> Kubernetes v1.19.5
>>>>> Istio 1.7.6
>>>>>
>>>>> # k get service inference-results-to-analytics-engine -o yaml
>>>>> apiVersion: v1
>>>>> kind: Service
>>>>> metadata:
>>>>>   ...
>>>>>   labels:
>>>>>     app: inference-results-to-analytics-engine
>>>>>     type: flink-native-kubernetes
>>>>>   name: inference-results-to-analytics-engine
>>>>> spec:
>>>>>   clusterIP: None
>>>>>   ports:
>>>>>   - name: jobmanager-rpc # should start with "tcp-" or add "appProtocol" property
>>>>>     port: 6123
>>>>>     protocol: TCP
>>>>>     targetPort: 6123
>>>>>   - name: blobserver # should start with "tcp-" or add "appProtocol" property
>>>>>     port: 6124
>>>>>     protocol: TCP
>>>>>     targetPort: 6124
>>>>>   selector:
>>>>>     app: inference-results-to-analytics-engine
>>>>>     component: jobmanager
>>>>>     type: flink-native-kubernetes
>>>>>   sessionAffinity: None
>>>>>   type: ClusterIP
>>>>> status:
>>>>>   loadBalancer: {}
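
For reference, the two options named in the comments of the quoted Service output (a "tcp-" name prefix, or an "appProtocol" property) might look roughly like the fragment below. This is only a sketch of the ports section: the name "tcp-jobmanager-rpc" is an illustrative assumption, not a name Flink generates, and nothing in this thread confirms that the operator or Flink's native Kubernetes integration lets you override these definitions.

```yaml
# Sketch only: ports section of the JobManager Service from the quoted output,
# adjusted to the two options mentioned in its comments.
# Port names shown here are hypothetical, not what Flink creates by default.
ports:
- name: tcp-jobmanager-rpc   # option 1: protocol prefix in the port name
  port: 6123
  protocol: TCP
  targetPort: 6123
- name: blobserver
  appProtocol: tcp           # option 2: explicit appProtocol field on the port
  port: 6124
  protocol: TCP
  targetPort: 6124
```

Either form only affects how Istio selects the protocol for traffic through the Service. With HA enabled the TaskManager connects to the JobManager via pod IP, as described above, so this fragment may not help in that setup.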