Hi Till,

I just created the JIRA ticket:
https://issues.apache.org/jira/browse/FLINK-9103

I added the JobManager and TaskManager logs. I hope this helps to resolve the
issue.
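
For anyone reading along, here is a minimal sketch of why the exception quoted
further down shows up once the peer is addressed by IP. It is a simplified
model and not the JDK's actual HostnameChecker code: the class SanMatchSketch
and its methods are hypothetical names used only for illustration, IPv6 is left
out, and the SAN values are the ones from my original mail. The idea is that an
IP identity is compared only against IP-type SAN entries, so a certificate that
carries only DNS names can never match it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Simplified illustration of SAN matching; NOT the JDK's HostnameChecker. */
final class SanMatchSketch {

    /** Rough check for a dotted-quad IPv4 literal (IPv6 omitted for brevity). */
    static boolean isIpv4Literal(String peer) {
        return peer.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    /** Matches one DNS SAN entry; '*' may only stand for the whole left-most label. */
    static boolean dnsNameMatches(String pattern, String host) {
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (!p.startsWith("*.")) {
            return p.equals(h);
        }
        int firstDot = h.indexOf('.');
        // Compare everything from the first dot on, e.g. ".flink-taskmanager-svc".
        return firstDot > 0 && h.substring(firstDot).equals(p.substring(1));
    }

    /** An IP identity is checked against IP SANs only, a host name against DNS SANs only. */
    static boolean peerMatches(String peer, List<String> dnsSans, List<String> ipSans) {
        if (isIpv4Literal(peer)) {
            return ipSans.contains(peer);
        }
        for (String san : dnsSans) {
            if (dnsNameMatches(san, peer)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // SAN entries as described in the original mail; no IP entries at all.
        List<String> dnsSans = Arrays.asList("flink-jobmanager", "*.flink-taskmanager-svc");
        List<String> ipSans = Collections.emptyList();

        System.out.println(peerMatches("flink-taskmanager-0.flink-taskmanager-svc", dnsSans, ipSans));
        System.out.println(peerMatches("172.30.247.163", dnsSans, ipSans));
    }
}
```

With the host name as the peer identity the wildcard entry applies; with the
IP as the peer identity there is nothing in the certificate to compare against,
which is consistent with the "No subject alternative names matching IP address"
message.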

Regards,
Edward

2018-03-27 17:48 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:

> Hi Edward,
>
> could you please file a JIRA issue for this problem? It might be as simple
> as that the TaskManager's network stack uses the IP instead of the hostname
> as you suggested. But we have to look into this to be sure. Also the logs
> of the JobManager as well as the TaskManagers could be helpful.
>
> Cheers,
> Till
>
> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cjo...@gmail.com>
> wrote:
>
>>
>> I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030
>>
>> There was a PR for it at some point, but nothing has been done so far.
>> It seems the current code explicitly uses the IP rather than the hostname
>> for the Netty SSL configuration.
>>
>> Without that I'm really wondering how people are reasonably using SSL on
>> a Kubernetes-based Flink cluster, as every time a pod is (re)started it can
>> theoretically take a different IP. Or am I missing something?
>>
>> --
>> Christophe
>>
>> On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <
>> edward.roja...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Currently I have a Flink 1.4 cluster running on Kubernetes with an SSL
>>> configuration based on
>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.
>>>
>>> However, as the IPs of the nodes are dynamic (by the nature of
>>> Kubernetes), we are using only the DNS names, which we can control using
>>> Kubernetes services. So we add to the Subject Alternative Name (SAN) the
>>> flink-jobmanager DNS name and also the DNS names for the task managers,
>>> *.flink-taskmanager-svc (each task manager has a DNS name of the form
>>> flink-taskmanager-0.flink-taskmanager-svc).
>>>
>>> Additionally we set the jobmanager.rpc.address property on all the nodes
>>> and each task manager sets the taskmanager.host property, all matching the
>>> ones on the certificate.
>>>
>>> This works well when running a job with parallelism set to 1. The SSL
>>> validation succeeds and the JobManager can communicate with the
>>> TaskManagers and vice versa.
>>>
>>> But when we set the parallelism to more than 1, we get exceptions in the
>>> SSL validation like this:
>>>
>>> Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
>>> at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
>>> at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
>>> at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
>>> at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
>>> at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
>>> at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
>>> at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
>>> ... 21 more
>>>
>>>
>>> From the logs I see the Jobmanager is correctly registering the
>>> taskmanagers:
>>>
>>> org.apache.flink.runtime.instance.InstanceManager   - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.
>>>
>>> And also each taskmanager is correctly registered to use the hostname
>>> for communication:
>>>
>>> org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
>>> ...
>>> akka.remote.Remoting   - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
>>> ...
>>> org.apache.flink.runtime.io.network.netty.NettyConfig   - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>> ...
>>> org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)
>>>
>>>
>>>
>>> But even with that, it seems like the taskmanagers are using the IP to
>>> communicate with each other, and the SSL validation fails.
>>>
>>> Do you know if it's possible to make the taskmanagers use the
>>> hostname instead of the IP to communicate?
>>> or
>>> Do you have any advice on getting the SSL configuration to work in this
>>> environment?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Edward
>>>
>>
>>
>>
>> --
>> Christophe
>>
>
>


-- 
*Edward Alexander Rojas Clavijo*



*Software Engineer, Hybrid Cloud, IBM France*
