To rule out a Minikube problem, you could also try running Flink on an older Minikube and an older K8s version. Our end-to-end tests use Minikube v1.8.2, for example.
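A command sketch of this suggestion (the pinned version below is a placeholder, not from the thread; `minikube start --kubernetes-version` accepts any supported release):

```shell
# Start over with an older Kubernetes release to rule out a version issue.
# A minikube binary can run Kubernetes versions older than itself.
minikube delete
minikube start --kubernetes-version=v1.15.12

# Confirm which client/server versions you ended up with.
kubectl version --short
```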
Cheers,
Till

On Thu, Sep 3, 2020 at 8:44 AM Yang Wang <danrtsey...@gmail.com> wrote:

> Sorry, I forgot that the JobManager binds its rpc address to
> flink-jobmanager, not to the pod IP. So you also need to update
> jobmanager-session-deployment.yaml with the following changes.
>
> ...
>       containers:
>       - name: jobmanager
>         env:
>         - name: JM_IP
>           valueFrom:
>             fieldRef:
>               apiVersion: v1
>               fieldPath: status.podIP
>         image: flink:1.11
>         args: ["jobmanager", "$(JM_IP)"]
> ...
>
> After that, the JobManager binds its rpc address to its pod IP.
>
> Best,
> Yang
>
> superainbower <superainbo...@163.com> wrote on Thu, Sep 3, 2020 at 11:38 AM:
>
>> Hi Yang,
>> I updated taskmanager-session-deployment.yaml like this:
>>
>> apiVersion: apps/v1
>> kind: Deployment
>> metadata:
>>   name: flink-taskmanager
>> spec:
>>   replicas: 1
>>   selector:
>>     matchLabels:
>>       app: flink
>>       component: taskmanager
>>   template:
>>     metadata:
>>       labels:
>>         app: flink
>>         component: taskmanager
>>     spec:
>>       containers:
>>       - name: taskmanager
>>         image: registry.cn-hangzhou.aliyuncs.com/superainbower/flink:1.11.1
>>         args: ["taskmanager", "-Djobmanager.rpc.address=172.18.0.5"]
>>         ports:
>>         - containerPort: 6122
>>           name: rpc
>>         - containerPort: 6125
>>           name: query-state
>>         livenessProbe:
>>           tcpSocket:
>>             port: 6122
>>           initialDelaySeconds: 30
>>           periodSeconds: 60
>>         volumeMounts:
>>         - name: flink-config-volume
>>           mountPath: /opt/flink/conf/
>>         securityContext:
>>           runAsUser: 9999  # refers to user _flink_ from the official flink image, change if necessary
>>       volumes:
>>       - name: flink-config-volume
>>         configMap:
>>           name: flink-config
>>           items:
>>           - key: flink-conf.yaml
>>             path: flink-conf.yaml
>>           - key: log4j-console.properties
>>             path: log4j-console.properties
>>       imagePullSecrets:
>>       - name: regcred
>>
>> Then I deleted the TaskManager pod and let it restart, but the logs print this:
>>
>> Could not resolve ResourceManager address
>> akka.tcp://flink@172.18.0.5:6123/user/rpc/resourcemanager_*, retrying in 10000 ms:
>> Could not connect to rpc endpoint under address
>> akka.tcp://flink@172.18.0.5:6123/user/rpc/resourcemanager_*
>>
>> It changed flink-jobmanager to 172.18.0.5.
>>
>> superainbower
>> superainbo...@163.com
>>
>> On 09/3/2020 11:09, Yang Wang <danrtsey...@gmail.com> wrote:
>>
>> I guess something is wrong with your kube proxy, which causes the
>> TaskManager to be unable to connect to the JobManager.
>> You could verify this by directly using the JobManager pod IP instead of
>> the service name.
>>
>> Please do as follows.
>> * Edit the TaskManager deployment (via kubectl edit deployment
>>   flink-taskmanager) and update the args field to the following,
>>   args: ["taskmanager", "-Djobmanager.rpc.address=172.18.0.5"]
>>   given that "172.18.0.5" is the JobManager pod IP.
>> * Delete the current TaskManager pod and let it restart.
>> * Then check the TaskManager logs to see whether it registers
>>   successfully.
>>
>> Best,
>> Yang
>>
>> superainbower <superainbo...@163.com> wrote on Thu, Sep 3, 2020 at 9:35 AM:
>>
>>> Hi Till,
>>> I found something that may be helpful.
>>> The Kubernetes dashboard shows the job-manager IP as 172.18.0.5 and the
>>> task-manager IP as 172.18.0.6.
>>> When I run 'kubectl exec -ti flink-taskmanager-74c68c6f48-jqpbn -- /bin/bash'
>>> and then 'ping 172.18.0.5', I get a response.
>>> But when I ping flink-jobmanager, there is no response.
>>>
>>> superainbower
>>> superainbo...@163.com
>>>
>>> On 09/3/2020 09:03, superainbower <superainbo...@163.com> wrote:
>>>
>>> Hi Till,
>>> This is the TaskManager log.
>>> As you can see, it prints (line 92) 'Could not connect to
>>> flink-jobmanager:6123', then prints (line 128) 'Could not resolve
>>> ResourceManager address
>>> akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*,
>>> retrying in 10000 ms: Could not connect to rpc endpoint under address
>>> akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*.'
>>> and keeps repeating this.
>>>
>>> A few minutes later, the TaskManager shuts down and restarts.
>>>
>>> These are my yaml files; could you help me confirm whether I
>>> omitted something? Thanks a lot!
>>> ---------------------------------------------------
>>> flink-configuration-configmap.yaml
>>>
>>> apiVersion: v1
>>> kind: ConfigMap
>>> metadata:
>>>   name: flink-config
>>>   labels:
>>>     app: flink
>>> data:
>>>   flink-conf.yaml: |+
>>>     jobmanager.rpc.address: flink-jobmanager
>>>     taskmanager.numberOfTaskSlots: 1
>>>     blob.server.port: 6124
>>>     jobmanager.rpc.port: 6123
>>>     taskmanager.rpc.port: 6122
>>>     queryable-state.proxy.ports: 6125
>>>     jobmanager.memory.process.size: 1024m
>>>     taskmanager.memory.process.size: 1024m
>>>     parallelism.default: 1
>>>   log4j-console.properties: |+
>>>     rootLogger.level = INFO
>>>     rootLogger.appenderRef.console.ref = ConsoleAppender
>>>     rootLogger.appenderRef.rolling.ref = RollingFileAppender
>>>     logger.akka.name = akka
>>>     logger.akka.level = INFO
>>>     logger.kafka.name = org.apache.kafka
>>>     logger.kafka.level = INFO
>>>     logger.hadoop.name = org.apache.hadoop
>>>     logger.hadoop.level = INFO
>>>     logger.zookeeper.name = org.apache.zookeeper
>>>     logger.zookeeper.level = INFO
>>>     appender.console.name = ConsoleAppender
>>>     appender.console.type = CONSOLE
>>>     appender.console.layout.type = PatternLayout
>>>     appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
>>>     appender.rolling.name = RollingFileAppender
>>>     appender.rolling.type = RollingFile
>>>     appender.rolling.append = false
>>>     appender.rolling.fileName = ${sys:log.file}
>>>     appender.rolling.filePattern = ${sys:log.file}.%i
>>>     appender.rolling.layout.type = PatternLayout
>>>     appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
>>>     appender.rolling.policies.type = Policies
>>>     appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
>>>     appender.rolling.policies.size.size = 100MB
>>>     appender.rolling.strategy.type = DefaultRolloverStrategy
>>>     appender.rolling.strategy.max = 10
>>>     logger.netty.name = org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline
>>>     logger.netty.level = OFF
>>> ---------------------------------------------------
>>> jobmanager-service.yaml
>>>
>>> apiVersion: v1
>>> kind: Service
>>> metadata:
>>>   name: flink-jobmanager
>>> spec:
>>>   type: ClusterIP
>>>   ports:
>>>   - name: rpc
>>>     port: 6123
>>>   - name: blob-server
>>>     port: 6124
>>>   - name: webui
>>>     port: 8081
>>>   selector:
>>>     app: flink
>>>     component: jobmanager
>>> ---------------------------------------------------
>>> jobmanager-session-deployment.yaml
>>>
>>> apiVersion: apps/v1
>>> kind: Deployment
>>> metadata:
>>>   name: flink-jobmanager
>>> spec:
>>>   replicas: 1
>>>   selector:
>>>     matchLabels:
>>>       app: flink
>>>       component: jobmanager
>>>   template:
>>>     metadata:
>>>       labels:
>>>         app: flink
>>>         component: jobmanager
>>>     spec:
>>>       containers:
>>>       - name: jobmanager
>>>         image: registry.cn-hangzhou.aliyuncs.com/superainbower/flink:1.11.1
>>>         args: ["jobmanager"]
>>>         ports:
>>>         - containerPort: 6123
>>>           name: rpc
>>>         - containerPort: 6124
>>>           name: blob-server
>>>         - containerPort: 8081
>>>           name: webui
>>>         livenessProbe:
>>>           tcpSocket:
>>>             port: 6123
>>>           initialDelaySeconds: 30
>>>           periodSeconds: 60
>>>         volumeMounts:
>>>         - name: flink-config-volume
>>>           mountPath: /opt/flink/conf
>>>         securityContext:
>>>           runAsUser: 9999  # refers to user _flink_ from the official flink image, change if necessary
>>>       volumes:
>>>       - name: flink-config-volume
>>>         configMap:
>>>           name: flink-config
>>>           items:
>>>           - key: flink-conf.yaml
>>>             path: flink-conf.yaml
>>>           - key: log4j-console.properties
>>>             path: log4j-console.properties
>>>       imagePullSecrets:
>>>       - name: regcred
>>> ---------------------------------------------------
>>> taskmanager-session-deployment.yaml
>>>
>>> apiVersion: apps/v1
>>> kind: Deployment
>>> metadata:
>>>   name: flink-taskmanager
>>> spec:
>>>   replicas: 1
>>>   selector:
>>>     matchLabels:
>>>       app: flink
>>>       component: taskmanager
>>>   template:
>>>     metadata:
>>>       labels:
>>>         app: flink
>>>         component: taskmanager
>>>     spec:
>>>       containers:
>>>       - name: taskmanager
>>>         image: registry.cn-hangzhou.aliyuncs.com/superainbower/flink:1.11.1
>>>         args: ["taskmanager"]
>>>         ports:
>>>         - containerPort: 6122
>>>           name: rpc
>>>         - containerPort: 6125
>>>           name: query-state
>>>         livenessProbe:
>>>           tcpSocket:
>>>             port: 6122
>>>           initialDelaySeconds: 30
>>>           periodSeconds: 60
>>>         volumeMounts:
>>>         - name: flink-config-volume
>>>           mountPath: /opt/flink/conf/
>>>         securityContext:
>>>           runAsUser: 9999  # refers to user _flink_ from the official flink image, change if necessary
>>>       volumes:
>>>       - name: flink-config-volume
>>>         configMap:
>>>           name: flink-config
>>>           items:
>>>           - key: flink-conf.yaml
>>>             path: flink-conf.yaml
>>>           - key: log4j-console.properties
>>>             path: log4j-console.properties
>>>       imagePullSecrets:
>>>       - name: regcred
>>>
>>> superainbower
>>> superainbo...@163.com
>>>
>>> On 09/2/2020 20:38, Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>> Hmm, this is indeed strange. Could you share the logs of the TaskManager
>>> with us? Ideally, set the log level to debug. Thanks a lot.
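To raise the log level as suggested, only the first line of the `log4j-console.properties` block in the configmap from this thread needs to change (a fragment; the remaining appender/logger lines stay as they are):

```yaml
  log4j-console.properties: |+
    rootLogger.level = DEBUG
    rootLogger.appenderRef.console.ref = ConsoleAppender
    rootLogger.appenderRef.rolling.ref = RollingFileAppender
    # ... remaining lines unchanged ...
```

After re-applying the configmap, the TaskManager pod has to be deleted and recreated for the mounted file to take effect.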
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Sep 2, 2020 at 12:45 PM art <superainbo...@163.com> wrote:
>>>
>>>> Hi Till,
>>>>
>>>> The full output when I run 'kubectl get all' looks like this:
>>>>
>>>> NAME                                     READY   STATUS    RESTARTS   AGE
>>>> pod/flink-jobmanager-85bdbd98d8-ppjmf    1/1     Running   0          2m34s
>>>> pod/flink-taskmanager-74c68c6f48-6jb5v   1/1     Running   0          2m34s
>>>>
>>>> NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
>>>> service/flink-jobmanager   ClusterIP   10.103.207.75   <none>        6123/TCP,6124/TCP,8081/TCP   2m34s
>>>> service/kubernetes         ClusterIP   10.96.0.1       <none>        443/TCP                      5d2h
>>>>
>>>> NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
>>>> deployment.apps/flink-jobmanager    1/1     1            1           2m34s
>>>> deployment.apps/flink-taskmanager   1/1     1            1           2m34s
>>>>
>>>> NAME                                           DESIRED   CURRENT   READY   AGE
>>>> replicaset.apps/flink-jobmanager-85bdbd98d8    1         1         1       2m34s
>>>> replicaset.apps/flink-taskmanager-74c68c6f48   1         1         1       2m34s
>>>>
>>>> And I can open the Flink UI, but the number of task managers is 0, so the
>>>> job manager itself works well.
>>>> I think the problem is that the taskmanager cannot register itself with
>>>> the jobmanager. Did I miss some configuration?
>>>>
>>>> On Sep 2, 2020, at 5:24 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>>>>
>>>> Hi art,
>>>>
>>>> could you check what `kubectl get services` returns? Usually if you run
>>>> `kubectl get all` you should also see the services. But in your case there
>>>> are no services listed. You should see something like
>>>> service/flink-jobmanager; otherwise the flink-jobmanager service (K8s
>>>> service) is not running.
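A sketch of that check, with an endpoints lookup added as an extra step not from the thread (an empty ENDPOINTS column would mean the Service's selector matches no running pod):

```shell
# Does the Service object exist at all?
kubectl get services flink-jobmanager

# Does it actually select the JobManager pod?
# An empty ENDPOINTS column means the selector
# (app: flink, component: jobmanager) matches no ready pod.
kubectl get endpoints flink-jobmanager
```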
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Wed, Sep 2, 2020 at 11:15 AM art <superainbo...@163.com> wrote:
>>>>
>>>>> Hi Till,
>>>>>
>>>>> I'm sure the jobmanager-service is started; I can find it in the
>>>>> Kubernetes dashboard.
>>>>>
>>>>> When I run 'kubectl get deployment' I get this:
>>>>> flink-jobmanager    1/1   1   1   33s
>>>>> flink-taskmanager   1/1   1   1   33s
>>>>>
>>>>> When I run 'kubectl get all' I get this:
>>>>> NAME                                     READY   STATUS    RESTARTS   AGE
>>>>> pod/flink-jobmanager-85bdbd98d8-ppjmf    1/1     Running   0          2m34s
>>>>> pod/flink-taskmanager-74c68c6f48-6jb5v   1/1     Running   0          2m34s
>>>>>
>>>>> So I think flink-jobmanager works well, but the taskmanager restarts
>>>>> every few minutes.
>>>>>
>>>>> My minikube version: v1.12.3
>>>>> Flink version: v1.11.1
>>>>>
>>>>> On Sep 2, 2020, at 4:27 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>> Hi art,
>>>>>
>>>>> could you verify that the jobmanager-service has been started? It
>>>>> looks as if the name flink-jobmanager is not resolvable. It would also
>>>>> help to know the Minikube and K8s versions you are using.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Wed, Sep 2, 2020 at 9:50 AM art <superainbo...@163.com> wrote:
>>>>>
>>>>>> Hi, I'm deploying Flink on minikube, following
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/deployment/kubernetes.html:
>>>>>>
>>>>>> kubectl create -f flink-configuration-configmap.yaml
>>>>>> kubectl create -f jobmanager-service.yaml
>>>>>> kubectl create -f jobmanager-session-deployment.yaml
>>>>>> kubectl create -f taskmanager-session-deployment.yaml
>>>>>>
>>>>>> But I got this:
>>>>>>
>>>>>> 2020-09-02 06:45:42,664 WARN  akka.remote.ReliableDeliverySupervisor [] -
>>>>>> Association with remote system [akka.tcp://flink@flink-jobmanager:6123]
>>>>>> has failed, address is now gated for [50] ms.
>>>>>> Reason: [Association failed with [akka.tcp://flink@flink-jobmanager:6123]]
>>>>>> Caused by: [java.net.UnknownHostException: flink-jobmanager: Temporary
>>>>>> failure in name resolution]
>>>>>> 2020-09-02 06:45:42,691 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
>>>>>> Could not resolve ResourceManager address
>>>>>> akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*,
>>>>>> retrying in 10000 ms: Could not connect to rpc endpoint under address
>>>>>> akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*.
>>>>>> 2020-09-02 06:46:02,731 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
>>>>>> Could not resolve ResourceManager address
>>>>>> akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*,
>>>>>> retrying in 10000 ms: Could not connect to rpc endpoint under address
>>>>>> akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*.
>>>>>> 2020-09-02 06:46:12,731 INFO  akka.remote.transport.ProtocolStateActor [] -
>>>>>> No response from remote for outbound association. Associate timed out
>>>>>> after [20000 ms].
>>>>>>
>>>>>> And when I run 'kubectl exec -ti flink-taskmanager-74c68c6f48-9tkvd -- /bin/bash'
>>>>>> and then 'ping flink-jobmanager', I find I cannot ping flink-jobmanager
>>>>>> from the taskmanager.
>>>>>>
>>>>>> I am new to k8s; can anyone point me to a tutorial? Thanks a lot!
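The UnknownHostException above points at cluster DNS rather than at Flink itself. A few commands to probe it (a sketch: the pod name is the one from this thread, adjust to yours, and it assumes nslookup/bash are present in the Flink image; note that a ClusterIP service often does not answer ping even when healthy, so name resolution plus a TCP check is the meaningful test):

```shell
# Is the cluster DNS (kube-dns / CoreDNS) running at all?
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Can the TaskManager pod resolve the service name?
kubectl exec -ti flink-taskmanager-74c68c6f48-9tkvd -- nslookup flink-jobmanager

# Is the JobManager rpc port actually reachable by name?
# (uses bash's /dev/tcp pseudo-device inside the container)
kubectl exec -ti flink-taskmanager-74c68c6f48-9tkvd -- \
  bash -c 'timeout 3 bash -c "</dev/tcp/flink-jobmanager/6123" && echo reachable'
```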