[jira] [Assigned] (FLINK-6273) Client can't connect to jobmanager whose hostname contains capital letters
[ https://issues.apache.org/jira/browse/FLINK-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng reassigned FLINK-6273: - Assignee: Yelei Feng > Client can't connect to jobmanager whose hostname contains capital letters > -- > > Key: FLINK-6273 > URL: https://issues.apache.org/jira/browse/FLINK-6273 > Project: Flink > Issue Type: Bug > Components: Client >Affects Versions: 1.2.0, 1.3.0 >Reporter: Yelei Feng >Assignee: Yelei Feng > Fix For: 1.3.0 > > > In non-HA mode, if we set jobmanager.rpc.address to a hostname with some > capital letters, the flink client can't connect to the jobmanager. > ERROR | [flink-akka.actor.default-dispatcher-4] | dropping message [class > akka.actor.ActorSelectionMessage] for non-local recipient > [Actor[akka.tcp://flink@szv1000258958:32586/]] arriving at > [akka.tcp://flink@szv1000258958:32586] inbound addresses are > [akka.tcp://flink@SZV1000258958:32586] | akka.remote.EndpointWriter > (Slf4jLogger.scala:65) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6273) Client can't connect to jobmanager whose hostname contains capital letters
Yelei Feng created FLINK-6273: - Summary: Client can't connect to jobmanager whose hostname contains capital letters Key: FLINK-6273 URL: https://issues.apache.org/jira/browse/FLINK-6273 Project: Flink Issue Type: Bug Components: Client Affects Versions: 1.2.0, 1.3.0 Reporter: Yelei Feng Fix For: 1.3.0 In non-HA mode, if we set jobmanager.rpc.address to a hostname with some capital letters, the flink client can't connect to the jobmanager. ERROR | [flink-akka.actor.default-dispatcher-4] | dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@szv1000258958:32586/]] arriving at [akka.tcp://flink@szv1000258958:32586] inbound addresses are [akka.tcp://flink@SZV1000258958:32586] | akka.remote.EndpointWriter (Slf4jLogger.scala:65)
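The log above shows the core of the bug: Akka compares the remote address case-sensitively, so the client's lower-cased lookup URL never matches the jobmanager's inbound address built from the mixed-case hostname. A minimal sketch of the kind of normalization that fixes this (the class and helper names here are illustrative, not Flink's actual API):

```java
import java.util.Locale;

public class AkkaUrlNormalizer {

    // Hostnames are case-insensitive (RFC 1035), so lower-casing the host part
    // before building the Akka URL makes both sides agree on one canonical form.
    public static String normalizeHost(String host) {
        return host.toLowerCase(Locale.ROOT);
    }

    // Build an actor-system URL like "akka.tcp://flink@szv1000258958:32586"
    // from a possibly mixed-case configured hostname.
    public static String akkaUrl(String protocol, String systemName, String host, int port) {
        return String.format("%s://%s@%s:%d", protocol, systemName, normalizeHost(host), port);
    }
}
```

With this, a configured `jobmanager.rpc.address` of `SZV1000258958` and the client's lower-cased lookup both resolve to the same Akka address string.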
[jira] [Assigned] (FLINK-6152) Yarn session CLI tries to shut cluster down too aggressively in interactive mode
[ https://issues.apache.org/jira/browse/FLINK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng reassigned FLINK-6152: - Assignee: Yelei Feng > Yarn session CLI tries to shut cluster down too aggressively in interactive mode > -- > > Key: FLINK-6152 > URL: https://issues.apache.org/jira/browse/FLINK-6152 > Project: Flink > Issue Type: Bug > Components: Client >Affects Versions: 1.2.0, 1.3.0 >Reporter: Yelei Feng >Assignee: Yelei Feng > > Once the yarn session CLI can't get the cluster status, it shuts the cluster down and > cleans up related files even if a new jobmanager will be created soon. As a result, > restarting the AM will fail due to missing files on HDFS. > reproduce steps: > 1. start a yarn session in interactive mode > 2. kill the jobmanager process > 3. the yarn session client can't get the cluster status within the lookup time and hence > triggers the shutdown hook, which deletes the local properties files and the files on > HDFS, but it can't shut down the cluster since it can't connect to the jobmanager.
[jira] [Created] (FLINK-6213) When the number of failed containers exceeds the maximum and the application is stopped, the AM container is only released 10 minutes later
Yelei Feng created FLINK-6213: - Summary: When the number of failed containers exceeds the maximum and the application is stopped, the AM container is only released 10 minutes later Key: FLINK-6213 URL: https://issues.apache.org/jira/browse/FLINK-6213 Project: Flink Issue Type: Bug Components: YARN Affects Versions: 1.2.0, 1.3.0 Reporter: Yelei Feng When the number of failed containers exceeds the maximum and the application is stopped, the AM container is only released 10 minutes later. I checked the yarn log and found out that after invoking {{unregisterApplicationMaster}}, the AM container is not released. After 10 minutes, the release is triggered by the RM ping check timeout.
[jira] [Created] (FLINK-6152) Yarn session CLI tries to shut cluster down too aggressively in interactive mode
Yelei Feng created FLINK-6152: - Summary: Yarn session CLI tries to shut cluster down too aggressively in interactive mode Key: FLINK-6152 URL: https://issues.apache.org/jira/browse/FLINK-6152 Project: Flink Issue Type: Bug Components: Client Affects Versions: 1.2.0, 1.3.0 Reporter: Yelei Feng Once the yarn session CLI can't get the cluster status, it shuts the cluster down and cleans up related files even if a new jobmanager will be created soon. As a result, restarting the AM will fail due to missing files on HDFS. reproduce steps: 1. start a yarn session in interactive mode 2. kill the jobmanager process 3. the yarn session client can't get the cluster status within the lookup time and hence triggers the shutdown hook, which deletes the local properties files and the files on HDFS, but it can't shut down the cluster since it can't connect to the jobmanager.
[jira] [Updated] (FLINK-6147) flink client can't detect cluster is down
[ https://issues.apache.org/jira/browse/FLINK-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng updated FLINK-6147: -- Description: I tested in yarn mode, reproduce steps: 1. flink run xx.jar 2. kill the yarn application The CLI hangs there, only showing "New JobManager elected. Connecting to null" instead of cleaning up and closing itself. After some digging, I found the main logic is in {{JobClientActor}}. It terminates itself once it receives the message {{ConnectionTimeout}}. It receives jobmanager status changes from two sources: zookeeper and akka deathwatch. The client sets the current {{leaderSessionId}} and unwatches the previous jobmanager on a zk notification, receives {{Terminated}} for the previous jobmanager from akka deathwatch, and sends {{ConnectionTimeout}} to itself after 60s. There is a good chance they will interfere with each other. Situation 1: 1. the client gets notified by zk, sets {{leaderSessionId}} to null 2. the client unwatches the previous jobmanager 3. msg {{Terminated}} for the previous jobmanager is never received Situation 2: 1. msg {{Terminated}} for the current jobmanager is received 2. msg {{ConnectionTimeout}} is scheduled after 60s 3. the client gets notified by zk, sets {{leaderSessionId}} to null in less than 60s 4. {{ConnectionTimeout}} is filtered out due to a different {{leaderSessionId}} was: (the same description, with {{leaderSessionId}} missing its closing braces)
> flink client can't detect cluster is down > - > > Key: FLINK-6147 > URL: https://issues.apache.org/jira/browse/FLINK-6147 > Project: Flink > Issue Type: Bug > Components: Client >Affects Versions: 1.2.0, 1.3.0 >Reporter: Yelei Feng > Labels: client > > I tested in yarn mode, reproduce steps: > 1. flink run xx.jar > 2. kill the yarn application > The CLI hangs there, only showing "New JobManager elected. Connecting to null" > instead of cleaning up and closing itself. > After some digging, I found the main logic is in {{JobClientActor}}. It > terminates itself once it receives the message {{ConnectionTimeout}}. It receives > jobmanager status changes from two sources: zookeeper and akka deathwatch. > The client sets the current {{leaderSessionId}} and unwatches the previous jobmanager on a > zk notification, receives {{Terminated}} for the previous jobmanager from akka deathwatch and > sends {{ConnectionTimeout}} to itself after 60s. There is a good chance they will > interfere with each other. > > Situation 1: > 1. the client gets notified by zk, sets {{leaderSessionId}} to null > 2. the client unwatches the previous jobmanager > 3. msg {{Terminated}} for the previous jobmanager is never received > Situation 2: > 1. msg {{Terminated}} for the current jobmanager is received > 2. msg {{ConnectionTimeout}} is scheduled after 60s > 3. the client gets notified by zk, sets {{leaderSessionId}} to null in less than > 60s > 4. {{ConnectionTimeout}} is filtered out due to a different > {{leaderSessionId}}
[jira] [Created] (FLINK-6147) flink client can't detect cluster is down
Yelei Feng created FLINK-6147: - Summary: flink client can't detect cluster is down Key: FLINK-6147 URL: https://issues.apache.org/jira/browse/FLINK-6147 Project: Flink Issue Type: Bug Components: Client Affects Versions: 1.2.0, 1.3.0 Reporter: Yelei Feng I tested in yarn mode, reproduce steps: 1. flink run xx.jar 2. kill the yarn application The CLI hangs there, only showing "New JobManager elected. Connecting to null" instead of cleaning up and closing itself. After some digging, I found the main logic is in {{JobClientActor}}. It terminates itself once it receives the message {{ConnectionTimeout}}. It receives jobmanager status changes from two sources: zookeeper and akka deathwatch. The client sets the current {{leaderSessionId}} and unwatches the previous jobmanager on a zk notification, receives {{Terminated}} for the previous jobmanager from akka deathwatch and sends {{ConnectionTimeout}} to itself after 60s. There is a good chance they will interfere with each other. Situation 1: 1. the client gets notified by zk, sets {{leaderSessionId}} to null 2. the client unwatches the previous jobmanager 3. msg {{Terminated}} for the previous jobmanager is never received Situation 2: 1. msg {{Terminated}} for the current jobmanager is received 2. msg {{ConnectionTimeout}} is scheduled after 60s 3. the client gets notified by zk, sets {{leaderSessionId}} to null in less than 60s 4. {{ConnectionTimeout}} is filtered out due to a different {{leaderSessionId}}
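The interference in Situation 2 comes down to the session-id check applied to {{ConnectionTimeout}}. A minimal model of that filtering logic, showing how a ZooKeeper notification that resets the session id causes a previously scheduled timeout to be silently dropped (class and method names are illustrative, not Flink's actual {{JobClientActor}} API):

```java
import java.util.UUID;

// Simplified stand-in for the actor's leader-session bookkeeping. A
// ConnectionTimeout message carries the session id it was scheduled under,
// and is only acted upon if that id still matches the current leader session.
public class LeaderSessionFilter {

    private UUID leaderSessionId;

    // ZooKeeper notification: a new (possibly null) leader was elected.
    public void notifyLeaderChange(UUID newSessionId) {
        this.leaderSessionId = newSessionId;
    }

    // Returns true if a ConnectionTimeout scheduled under the given session id
    // would still be accepted. The hang described above: zk sets the current
    // id to null before the 60s timeout fires, the ids no longer match, the
    // timeout is filtered out, and the client never terminates itself.
    public boolean acceptsTimeout(UUID scheduledUnder) {
        return scheduledUnder == null
                ? leaderSessionId == null
                : scheduledUnder.equals(leaderSessionId);
    }
}
```

Running the two-step sequence from the report against this model shows the timeout being dropped once the leader change arrives first.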
[jira] [Created] (FLINK-5920) port range support for config query.server.port
Yelei Feng created FLINK-5920: - Summary: port range support for config query.server.port Key: FLINK-5920 URL: https://issues.apache.org/jira/browse/FLINK-5920 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 1.3.0 Reporter: Yelei Feng Fix For: 1.3.0 we should support setting a port range for the config option {{query.server.port}}
[jira] [Assigned] (FLINK-5919) port range support for config taskmanager.data.port
[ https://issues.apache.org/jira/browse/FLINK-5919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng reassigned FLINK-5919: - Assignee: Yelei Feng > port range support for config taskmanager.data.port > --- > > Key: FLINK-5919 > URL: https://issues.apache.org/jira/browse/FLINK-5919 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.3.0 >Reporter: Yelei Feng >Assignee: Yelei Feng > Fix For: 1.3.0 > > > we should support setting a port range for the config option {{taskmanager.data.port}}
[jira] [Assigned] (FLINK-5918) port range support for config taskmanager.rpc.port
[ https://issues.apache.org/jira/browse/FLINK-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng reassigned FLINK-5918: - Assignee: Yelei Feng > port range support for config taskmanager.rpc.port > -- > > Key: FLINK-5918 > URL: https://issues.apache.org/jira/browse/FLINK-5918 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.3.0 >Reporter: Yelei Feng >Assignee: Yelei Feng > Fix For: 1.3.0 > > > we should support setting a port range for the config option {{taskmanager.rpc.port}}
[jira] [Assigned] (FLINK-5920) port range support for config query.server.port
[ https://issues.apache.org/jira/browse/FLINK-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng reassigned FLINK-5920: - Assignee: Yelei Feng > port range support for config query.server.port > --- > > Key: FLINK-5920 > URL: https://issues.apache.org/jira/browse/FLINK-5920 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.3.0 >Reporter: Yelei Feng >Assignee: Yelei Feng > Fix For: 1.3.0 > > > we should support setting a port range for the config option {{query.server.port}}
[jira] [Created] (FLINK-5919) port range support for config taskmanager.data.port
Yelei Feng created FLINK-5919: - Summary: port range support for config taskmanager.data.port Key: FLINK-5919 URL: https://issues.apache.org/jira/browse/FLINK-5919 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 1.3.0 Reporter: Yelei Feng Fix For: 1.3.0 we should support setting a port range for the config option {{taskmanager.data.port}}
[jira] [Created] (FLINK-5918) port range support for config taskmanager.rpc.port
Yelei Feng created FLINK-5918: - Summary: port range support for config taskmanager.rpc.port Key: FLINK-5918 URL: https://issues.apache.org/jira/browse/FLINK-5918 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 1.3.0 Reporter: Yelei Feng Fix For: 1.3.0 we should support setting a port range for the config option {{taskmanager.rpc.port}}
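The three port-range improvements above ({{query.server.port}}, {{taskmanager.data.port}}, {{taskmanager.rpc.port}}) all need the same primitive: a config value that accepts either a single port ("6123") or an inclusive range ("50100-50200") and yields candidate ports to try. A minimal sketch, assuming a hypothetical parser class (this is not Flink's actual configuration API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical port-range parser: "6123" yields one port, "50100-50200"
// yields every port in the inclusive range, in ascending order.
public class PortRange {

    public static Iterator<Integer> parse(String spec) {
        List<Integer> ports = new ArrayList<>();
        int dash = spec.indexOf('-');
        if (dash < 0) {
            // single-port form, e.g. "6123"
            ports.add(Integer.parseInt(spec.trim()));
        } else {
            // range form, e.g. "50100-50200" (both endpoints included)
            int start = Integer.parseInt(spec.substring(0, dash).trim());
            int end = Integer.parseInt(spec.substring(dash + 1).trim());
            for (int p = start; p <= end; p++) {
                ports.add(p);
            }
        }
        return ports.iterator();
    }
}
```

A caller would iterate the result and bind to the first free port, which keeps single-port configs backwards compatible while letting DevOps pin services into a firewall-friendly range.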
[jira] [Commented] (FLINK-5758) Port-range for the web interface via YARN
[ https://issues.apache.org/jira/browse/FLINK-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882886#comment-15882886 ] Yelei Feng commented on FLINK-5758: --- [~StephanEwen] Could you please help review the PR? We want to merge this into our own codebase once it's approved by the community ;) > Port-range for the web interface via YARN > - > > Key: FLINK-5758 > URL: https://issues.apache.org/jira/browse/FLINK-5758 > Project: Flink > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.2.0, 1.1.4, 1.3.0 >Reporter: Kanstantsin Kamkou >Assignee: Yelei Feng > Labels: network > > In case of YARN, the {{ConfigConstants.JOB_MANAGER_WEB_PORT_KEY}} [is > changed to > 0|https://github.com/apache/flink/blob/release-1.2.0/flink-yarn/src/main/java/org/apache/flink/yarn/YarnApplicationMasterRunner.java#L526]. > Please allow port ranges in this case. DevOps need that.
[jira] [Updated] (FLINK-5758) Port-range for the web interface via YARN
[ https://issues.apache.org/jira/browse/FLINK-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng updated FLINK-5758: -- Issue Type: Sub-task (was: Improvement) Parent: FLINK-5839 > Port-range for the web interface via YARN > - > > Key: FLINK-5758 > URL: https://issues.apache.org/jira/browse/FLINK-5758 > Project: Flink > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.2.0, 1.1.4, 1.3.0 >Reporter: Kanstantsin Kamkou >Assignee: Yelei Feng > Labels: network > > In case of YARN, the {{ConfigConstants.JOB_MANAGER_WEB_PORT_KEY}} [is > changed to > 0|https://github.com/apache/flink/blob/release-1.2.0/flink-yarn/src/main/java/org/apache/flink/yarn/YarnApplicationMasterRunner.java#L526]. > Please allow port ranges in this case. DevOps need that.
[jira] [Closed] (FLINK-5797) incorrect use of port range selector in BootstrapTools
[ https://issues.apache.org/jira/browse/FLINK-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng closed FLINK-5797. - Resolution: Not A Bug > incorrect use of port range selector in BootstrapTools > - > > Key: FLINK-5797 > URL: https://issues.apache.org/jira/browse/FLINK-5797 > Project: Flink > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0, 1.3.0 >Reporter: Yelei Feng >Assignee: Yelei Feng > Labels: yarn > Fix For: 1.3.0 > > > In method {{BootstrapTools.startActorSystem}}, the port range is iterated twice.
[jira] [Created] (FLINK-5797) incorrect use of port range selector in BootstrapTools
Yelei Feng created FLINK-5797: - Summary: incorrect use of port range selector in BootstrapTools Key: FLINK-5797 URL: https://issues.apache.org/jira/browse/FLINK-5797 Project: Flink Issue Type: Bug Components: YARN Affects Versions: 1.2.0, 1.3.0 Reporter: Yelei Feng Fix For: 1.3.0 In method {{BootstrapTools.startActorSystem}}, the port range is iterated twice.
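The property the report is probing is that port selection should make a single pass over the range: each candidate is consumed from the iterator exactly once and bound at most once. A sketch of single-pass selection under that assumption (this mirrors the intent of {{BootstrapTools.startActorSystem}}, not its actual code):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.Iterator;

// Illustrative single-pass port selection: walk the candidate iterator once,
// probe each port with a bind attempt, and stop at the first success.
public class PortSelector {

    public static int selectFirstFree(Iterator<Integer> candidates) throws IOException {
        while (candidates.hasNext()) {
            int port = candidates.next(); // each candidate is consumed exactly once
            try (ServerSocket probe = new ServerSocket(port)) {
                // Bind succeeded; getLocalPort() also resolves port 0 to the
                // ephemeral port the OS actually assigned.
                return probe.getLocalPort();
            } catch (IOException e) {
                // Port already in use; fall through to the next candidate.
            }
        }
        throw new IOException("No free port in the configured range");
    }
}
```

Iterating the same range iterator a second time would either see no remaining elements or re-probe ports already tried, which is the double-iteration the issue describes.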
[jira] [Assigned] (FLINK-5797) incorrect use of port range selector in BootstrapTools
[ https://issues.apache.org/jira/browse/FLINK-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng reassigned FLINK-5797: - Assignee: Yelei Feng > incorrect use of port range selector in BootstrapTools > - > > Key: FLINK-5797 > URL: https://issues.apache.org/jira/browse/FLINK-5797 > Project: Flink > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0, 1.3.0 >Reporter: Yelei Feng >Assignee: Yelei Feng > Labels: yarn > Fix For: 1.3.0 > > > In method {{BootstrapTools.startActorSystem}}, the port range is iterated twice.
[jira] [Created] (FLINK-5712) update several deprecated configuration options
Yelei Feng created FLINK-5712: - Summary: update several deprecated configuration options Key: FLINK-5712 URL: https://issues.apache.org/jira/browse/FLINK-5712 Project: Flink Issue Type: Bug Components: Documentation, Mesos Affects Versions: 1.2.0, 1.3.0 Reporter: Yelei Feng Priority: Minor Fix For: 1.3.0 1. We should use 'containerized.heap-cutoff-ratio' and 'containerized.heap-cutoff-min' instead of the deprecated yarn-specific options in the configuration doc. 2. In mesos mode, we still use the deprecated zookeeper naming convention - 'recovery.zookeeper.path.mesos-workers'. We should make it consistent with the other zookeeper options by using 'high-availability.zookeeper.path.mesos-workers'.
[jira] [Created] (FLINK-5708) we should remove duplicated configuration options
Yelei Feng created FLINK-5708: - Summary: we should remove duplicated configuration options Key: FLINK-5708 URL: https://issues.apache.org/jira/browse/FLINK-5708 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.3.0 Reporter: Yelei Feng Priority: Minor Fix For: 1.3.0 the option 'yarn.containers.vcores' is duplicated
[jira] [Closed] (FLINK-5578) Each time application is submitted to yarn, application id increases by two
[ https://issues.apache.org/jira/browse/FLINK-5578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yelei Feng closed FLINK-5578. - Resolution: Duplicate > Each time application is submitted to yarn, application id increases by two > --- > > Key: FLINK-5578 > URL: https://issues.apache.org/jira/browse/FLINK-5578 > Project: Flink > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Yelei Feng >Priority: Minor > Fix For: 1.2.0 > > > I tested running a long-running cluster and a single job on yarn; in both cases, > the application id increased by two each time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5578) Each time application is submitted to yarn, application id increases by two
Yelei Feng created FLINK-5578: - Summary: Each time application is submitted to yarn, application id increases by two Key: FLINK-5578 URL: https://issues.apache.org/jira/browse/FLINK-5578 Project: Flink Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Yelei Feng Priority: Minor Fix For: 1.2.0 I tested running a long-running cluster and a single job on yarn; in both cases, the application id increased by two each time.
[jira] [Created] (FLINK-5577) Each time application is submitted to yarn, application id increases by two
Yelei Feng created FLINK-5577: - Summary: Each time application is submitted to yarn, application id increases by two Key: FLINK-5577 URL: https://issues.apache.org/jira/browse/FLINK-5577 Project: Flink Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Yelei Feng Priority: Minor Fix For: 1.2.0 I tested running a long-running cluster and a single job on yarn; in both cases, the application id increased by two each time.
[jira] [Created] (FLINK-5427) Typo in the event_timestamps_watermarks doc
Yelei Feng created FLINK-5427: - Summary: Typo in the event_timestamps_watermarks doc Key: FLINK-5427 URL: https://issues.apache.org/jira/browse/FLINK-5427 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Yelei Feng Priority: Minor Fix For: 1.2.0 I was reading the watermark doc: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/event_timestamps_watermarks.html We should replace element with lastElement in the body of method checkAndGetNextWatermark: {code:java} public Watermark checkAndGetNextWatermark(MyEvent lastElement, long extractedTimestamp) { return element.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null; } {code}
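For reference, the corrected method body the report asks for simply uses the parameter name {{lastElement}} instead of the undefined {{element}}. A compilable sketch of the fix, with minimal stand-in stubs for {{Watermark}} and {{MyEvent}} (in the real documentation, Watermark comes from Flink and MyEvent is a user-defined type):

```java
// Stub: in Flink this is org.apache.flink.streaming.api.watermark.Watermark.
class Watermark {
    final long timestamp;
    Watermark(long timestamp) { this.timestamp = timestamp; }
}

// Stub for the user event type from the quoted doc snippet.
class MyEvent {
    private final boolean watermarkMarker;
    MyEvent(boolean watermarkMarker) { this.watermarkMarker = watermarkMarker; }
    boolean hasWatermarkMarker() { return watermarkMarker; }
}

public class PunctuatedAssigner {
    // Corrected body: reference the method's own parameter lastElement, not
    // the undefined name element used in the published doc snippet.
    public Watermark checkAndGetNextWatermark(MyEvent lastElement, long extractedTimestamp) {
        return lastElement.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null;
    }
}
```

With the original `element`, the snippet does not even compile, which is presumably how the typo was caught.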