Hi Till,

I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103

I added the JobManager and TaskManager logs. I hope this helps to resolve the issue.

Regards,
Edward

2018-03-27 17:48 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:

> Hi Edward,
>
> could you please file a JIRA issue for this problem? It might be as simple
> as that the TaskManager's network stack uses the IP instead of the hostname
> as you suggested. But we have to look into this to be sure. Also the logs
> of the JobManager as well as the TaskManagers could be helpful.
>
> Cheers,
> Till
>
> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cjo...@gmail.com>
> wrote:
>
>> I suspect this relates to:
>> https://issues.apache.org/jira/browse/FLINK-5030
>>
>> For which there was a PR at some point but nothing has been done so far.
>> It seems the current code explicitly uses the IP vs Hostname for Netty SSL
>> configuration.
>>
>> Without that I'm really wondering how people are reasonably using SSL on
>> a Kubernetes Flink-based cluster, as every time a pod is (re)started it can
>> theoretically take a different IP? Or am I missing something?
>>
>> --
>> Christophe
>>
>> On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <
>> edward.roja...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Currently I have a Flink 1.4 cluster running on kubernetes and with SSL
>>> configuration based on
>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.
>>>
>>> However, as the IPs of the nodes are dynamic (from the nature of
>>> kubernetes), we are using only the DNS which we can control using
>>> kubernetes services. So we add to the Subject Alternative Name (SAN) the
>>> flink-jobmanager DNS and also the DNS for the task managers
>>> *.flink-taskmanager-svc (each task manager has a DNS in the form
>>> flink-taskmanager-0.flink-taskmanager-svc).
>>>
>>> Additionally we set the jobmanager.rpc.address property on all the nodes
>>> and each task manager sets the taskmanager.host property, all matching the
>>> ones on the certificate.
>>>
>>> This is working well when using a job with parallelism set to 1. The SSL
>>> validations are good and the Jobmanager can communicate with the Task
>>> manager and vice versa.
>>>
>>> But when we set the parallelism to more than 1 we have exceptions on the
>>> SSL validation like this:
>>>
>>> Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
>>>     at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
>>>     at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
>>>     at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
>>>     at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
>>>     at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
>>>     ... 21 more
>>>
>>> From the logs I see the Jobmanager is correctly registering the
>>> taskmanagers:
>>>
>>> org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.
>>>
>>> And also each taskmanager is correctly registered to use the hostname
>>> for communication:
>>>
>>> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
>>> ...
>>> akka.remote.Remoting - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
>>> ...
>>> org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>> ...
>>> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)
>>>
>>> But even with that, it seems like the taskmanagers are using the IP to
>>> communicate between them and the SSL validation fails.
>>>
>>> Do you know if it's possible to make the taskmanagers use the
>>> hostname to communicate instead of the IP?
>>> or
>>> Do you have any advice to get the SSL configuration to work in this
>>> environment?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Edward
>>>
>>
>> --
>> Christophe
>>
>
>

--
*Edward Alexander Rojas Clavijo*
*Software Engineer*
*Hybrid Cloud*
*IBM France*
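
For readers following the thread, here is a minimal sketch of why the quoted exception appears even though the certificate lists the task manager DNS names. It assumes the usual behaviour of TLS identity checks such as the JDK's HostnameChecker: when the peer is addressed by IP, only IP-type SAN entries are consulted, and DNS-type entries (wildcards included) are ignored. The function names (is_ip, dns_matches, peer_matches) are hypothetical, the wildcard rule is deliberately simplified, and this is not Flink or JDK code.

```python
import ipaddress


def is_ip(identity: str) -> bool:
    """Return True if the identity string parses as an IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(identity)
        return True
    except ValueError:
        return False


def dns_matches(pattern: str, hostname: str) -> bool:
    """Simplified DNS SAN match: '*' may stand for exactly the leftmost label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        return p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def peer_matches(identity: str, dns_sans: list, ip_sans: list) -> bool:
    """Check a peer identity against the SAN entries of its certificate."""
    if is_ip(identity):
        # An IP identity is compared only with IP-type SAN entries.
        wanted = ipaddress.ip_address(identity)
        return any(ipaddress.ip_address(s) == wanted for s in ip_sans)
    # A hostname identity is compared only with DNS-type SAN entries.
    return any(dns_matches(p, identity) for p in dns_sans)


# SAN entries as described in the original message: DNS names only, no IPs.
DNS_SANS = ["flink-jobmanager", "*.flink-taskmanager-svc"]
IP_SANS = []

by_name = peer_matches("flink-taskmanager-0.flink-taskmanager-svc", DNS_SANS, IP_SANS)
by_ip = peer_matches("172.30.247.163", DNS_SANS, IP_SANS)
print(by_name, by_ip)
```

Under these assumptions the connection addressed by hostname passes the check while the one addressed by IP does not, which is consistent with the suspicion voiced in the thread that the data connections between task managers are opened against the IP rather than the hostname.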