[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15031083#comment-15031083 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-160431771 Un-assigning myself from the issue for now. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, 0.10.0 >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 1.0.0 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15031084#comment-15031084 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 closed the pull request at: https://github.com/apache/flink/pull/945 > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, 0.10.0 >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 1.0.0 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740755#comment-14740755 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r39268191 --- Diff: flink-core/src/main/java/org/apache/flink/util/VersionUtils.java --- @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.util; + +import org.apache.flink.configuration.Configuration; + +/** + * Version numbers to check compatibility between JobManager, TaskManager and JobClient. + */ +public class VersionUtils { + + public static final String FLINK_VERSION = Configuration.FLINK_VERSION; + /** +* Lower limit on client versions this job manager can work with. +*/ + public static final String JOB_CLIENT_VERSION_LOWER = "0.9.0"; + + /** +* Gets the minimum supported Client version by this Job Manager. +* +* @return Minimum supported client version number. +*/ + public static String getJobClientVersionLower() { + return JOB_CLIENT_VERSION_LOWER; + } + + /** +* Checks whether the given client version is compatible with this Job Manager +* +* @param clientVersion Version of the client +* @return Whether the given client is compatible with the Job Manager. +*/ + public static boolean isClientCompatible(String clientVersion) { + return versionComparator(JOB_CLIENT_VERSION_LOWER, clientVersion) <= 0 && versionComparator(clientVersion, FLINK_VERSION) <= 0; + } + + /** +* Checks which of the two given version strings is higher. +* +* @param version1 Version 1 +* @param version2 Version 2 +* @return 1 if version1 > version2, -1 if version1 < version2 and 0 if version1 = version2 +* +* Code taken from http://stackoverflow.com/questions/6701948/efficient-way-to-compare-version-strings-in-java";>Stack Overflow --- End diff -- Ah. Sorry. Will fix this. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740706#comment-14740706 ] ASF GitHub Bot commented on FLINK-2399: --- Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r39266023 --- Diff: flink-core/src/main/java/org/apache/flink/util/VersionUtils.java --- @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.util; + +import org.apache.flink.configuration.Configuration; + +/** + * Version numbers to check compatibility between JobManager, TaskManager and JobClient. + */ +public class VersionUtils { + + public static final String FLINK_VERSION = Configuration.FLINK_VERSION; + /** +* Lower limit on client versions this job manager can work with. +*/ + public static final String JOB_CLIENT_VERSION_LOWER = "0.9.0"; + + /** +* Gets the minimum supported Client version by this Job Manager. +* +* @return Minimum supported client version number. +*/ + public static String getJobClientVersionLower() { + return JOB_CLIENT_VERSION_LOWER; + } + + /** +* Checks whether the given client version is compatible with this Job Manager +* +* @param clientVersion Version of the client +* @return Whether the given client is compatible with the Job Manager. +*/ + public static boolean isClientCompatible(String clientVersion) { + return versionComparator(JOB_CLIENT_VERSION_LOWER, clientVersion) <= 0 && versionComparator(clientVersion, FLINK_VERSION) <= 0; + } + + /** +* Checks which of the two given version strings is higher. +* +* @param version1 Version 1 +* @param version2 Version 2 +* @return 1 if version1 > version2, -1 if version1 < version2 and 0 if version1 = version2 +* +* Code taken from http://stackoverflow.com/questions/6701948/efficient-way-to-compare-version-strings-in-java";>Stack Overflow --- End diff -- I doubt that it is OK to copy code from Stackoverflow. SO is under Creative Commons license. If I understood CC correctly, it requires that our code must be under CC as well if we want to use it. See http://creativecommons.org/licenses/by-sa/2.5/ > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740695#comment-14740695 ] ASF GitHub Bot commented on FLINK-2399: --- Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r39265532 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/jobmanager/JobSubmitTest.java --- @@ -192,4 +193,78 @@ public void initializeOnMaster(ClassLoader loader) throws Exception { fail(e.getMessage()); } } + + /** +* Verifies failure when the client version is lower than supported. +*/ + @Test --- End diff -- You can implement the test with `@Test(expected=JobSubmissionException.class)` which checks that the correct exception is produced. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723349#comment-14723349 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-136356151 Rebased to the current master. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716430#comment-14716430 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-135372659 @tillrohrmann , I have modified the way version information is passed to the Job Manager. It is now through the Configuration object passed. Further, rolled back on any verification for Job Manager and Task Manager versions. There is a possibility to handle compatibility between different versions by specifying a minimum supported version information. Note that this will easily be adaptable to Task manager version verification, if needed at some point. Let me know if this is a better solution. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14714398#comment-14714398 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r38000188 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -524,13 +534,24 @@ class JobManager( * @param listenToEvents true if the sender wants to listen to job status and execution state * change notifications. false if not. */ - private def submitJob(jobGraph: JobGraph, listenToEvents: Boolean): Unit = { + private def submitJob( + jobGraph: JobGraph, + listenToEvents: Boolean, + clientVersion: String) +: Unit = { if (jobGraph == null) { sender ! decorateMessage( Failure( new JobSubmissionException(null, "JobGraph must not be null.") ) ) +} else if(jobManagerVersionID != clientVersion){ + sender ! decorateMessage( +Failure( + new JobSubmissionException(jobGraph.getJobID, "Client version mismatches that of Job " + +"Manager.") --- End diff -- Yeah I know, but it might be interesting for the user who sees the `JobSubmissionException` to know the exact version. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14714395#comment-14714395 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r3882 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala --- @@ -547,7 +552,7 @@ class TaskManager( // successful registration. associate with the JobManager // we disambiguate duplicate or erroneous messages, to simplify debugging -case AcknowledgeRegistration(_, leaderSessionID, jobManager, id, blobPort) => +case AcknowledgeRegistration(_, leaderSessionID, jobManager, id, blobPort, jobManagerID) => --- End diff -- The initial AcknowledgeRegistration message can get lost and then the JM will send a `AlreadyRegistered` message. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713562#comment-14713562 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-135045372 Hi @tillrohrmann , thanks for the review. You're right. `getClass...` is not the right way to go. I had decided to just use this for the moment. We can certainly have a better version numbering for this purpose, to ensure the actors are compatible. I just am not sure if having a version string, which needs to be managed directly in the source code is a good idea. For example, it would make sense to extract the first digit, 0.9, 0.10 etc. since the minor bug fix releases are otherwise compatible anyway. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713549#comment-14713549 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37989581 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/jobmanager/JobSubmitTest.java --- @@ -170,4 +170,43 @@ public void initializeOnMaster(ClassLoader loader) throws Exception { fail(e.getMessage()); } } + + /** +* Verifies failure when the client and job manager versions mismatch +*/ + @Test + public void testFailureClientJobManagerVersionMismatch() { + try { + // create a simple job graph + + JobVertex jobVertex = new JobVertex("Vertex that fails in initializeOnMaster") { + + @Override + public void initializeOnMaster(ClassLoader loader) throws Exception { + throw new RuntimeException("test exception"); + } + }; + + jobVertex.setInvokableClass(Tasks.NoOpInvokable.class); + JobGraph jg = new JobGraph("test job", jobVertex); + + // submit the job + Future submitFuture = jobManager.ask(new JobManagerMessages.SubmitJob(jg, false, "RANDOM_CLIENT_VERSION"), timeout); + try { + Await.result(submitFuture, timeout); + } + catch (JobExecutionException e) { + // that is what we expect + // test that the exception nesting is not too deep + assertTrue(e.getCause() == null); --- End diff -- Yeah. I'll push a fix. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713547#comment-14713547 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37989473 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala --- @@ -547,7 +552,7 @@ class TaskManager( // successful registration. associate with the JobManager // we disambiguate duplicate or erroneous messages, to simplify debugging -case AcknowledgeRegistration(_, leaderSessionID, jobManager, id, blobPort) => +case AcknowledgeRegistration(_, leaderSessionID, jobManager, id, blobPort, jobManagerID) => --- End diff -- But the initial registration must have happened with `AcknowledgeRegistration`, right? Correct me if I'm wrong. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713545#comment-14713545 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37989277 --- Diff: flink-core/src/main/java/org/apache/flink/configuration/Configuration.java --- @@ -54,8 +54,12 @@ /** Stores the concrete key/value pairs of this configuration object. */ - private final HashMap confData; - + private final Map confData; + + /** Stores the program version */ + public final String FLINK_VERSION = getClass().getPackage().getImplementationVersion() == + null ? "FLINK_TEST_VERSION" : getClass().getPackage().getImplementationVersion(); --- End diff -- In such a case, the remote execution problem gets solved, since the version string would be properly set. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713543#comment-14713543 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37989214 --- Diff: flink-core/src/main/java/org/apache/flink/configuration/Configuration.java --- @@ -54,8 +54,12 @@ /** Stores the concrete key/value pairs of this configuration object. */ - private final HashMap confData; - + private final Map confData; + + /** Stores the program version */ + public final String FLINK_VERSION = getClass().getPackage().getImplementationVersion() == + null ? "FLINK_TEST_VERSION" : getClass().getPackage().getImplementationVersion(); --- End diff -- Ah yes. I had not considered this. In this case, the client version would be null. As for ensuring that compatible versions still work, you're right. This is why I'm not in favor of using the current version string. A better approach would be to have specific release versions in the configuration file, from where everything starts.` > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713531#comment-14713531 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37988692 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -524,13 +534,24 @@ class JobManager( * @param listenToEvents true if the sender wants to listen to job status and execution state * change notifications. false if not. */ - private def submitJob(jobGraph: JobGraph, listenToEvents: Boolean): Unit = { + private def submitJob( + jobGraph: JobGraph, + listenToEvents: Boolean, + clientVersion: String) +: Unit = { if (jobGraph == null) { sender ! decorateMessage( Failure( new JobSubmissionException(null, "JobGraph must not be null.") ) ) +} else if(jobManagerVersionID != clientVersion){ + sender ! decorateMessage( +Failure( + new JobSubmissionException(jobGraph.getJobID, "Client version mismatches that of Job " + +"Manager.") --- End diff -- The version is drawn from the Jar file's package version. Maven automatically adds this info. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713527#comment-14713527 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37988579 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -166,7 +169,8 @@ class JobManager( taskManager, connectionInfo, hardwareInformation, - numberOfSlots) => + numberOfSlots, + taskManagerVersionID) => --- End diff -- This was to ensure that the binaries on all machines are the same. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713447#comment-14713447 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-135026427 I had some comments concerning the implementation. I think as a first step we should not check whether the JM and TM have the same versions because in nearly all cases both classes will be started from the same binaries. If we should decide at a later point that we want to check that all used actors are compatible, then I would vote for using the decoration mechanism to transparently add a version ID instead of adding it manually to a subset of messages. The latter approach is IMO error-prone. Checking whether the client and the JM have compatible versions should be sufficient. I'm not quite sure whether `getClass().getPackage().getImplementationVersion()` is the right way to go here. Using this function won't allow compatible versions which only differ in a minor version number to work together. IMHO, this should be allowed. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713431#comment-14713431 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37981943 --- Diff: flink-core/src/main/java/org/apache/flink/configuration/Configuration.java --- @@ -54,8 +54,12 @@ /** Stores the concrete key/value pairs of this configuration object. */ - private final HashMap confData; - + private final Map confData; + + /** Stores the program version */ + public final String FLINK_VERSION = getClass().getPackage().getImplementationVersion() == + null ? "FLINK_TEST_VERSION" : getClass().getPackage().getImplementationVersion(); --- End diff -- What happens with minor versions which are compatible, e.g. 0.9.0 and 0.9.1? > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713429#comment-14713429 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37981839 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/client/JobClient.java --- @@ -205,7 +205,7 @@ public static void submitJobDetached( Preconditions.checkNotNull(timeout, "The timeout must not be null."); Future future = jobManagerGateway.ask( - new JobManagerMessages.SubmitJob(jobGraph, false), + new JobManagerMessages.SubmitJob(jobGraph, false, jobGraph.getClientVersion()), --- End diff -- Why don't you simply use the `getImplementationVersion` of the `JobClient` but the version of the `Configuration` which is part of the `jobGraph` instance? > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713074#comment-14713074 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37978132 --- Diff: flink-core/src/main/java/org/apache/flink/configuration/Configuration.java --- @@ -54,8 +54,12 @@ /** Stores the concrete key/value pairs of this configuration object. */ - private final HashMap confData; - + private final Map confData; + + /** Stores the program version */ + public final String FLINK_VERSION = getClass().getPackage().getImplementationVersion() == + null ? "FLINK_TEST_VERSION" : getClass().getPackage().getImplementationVersion(); --- End diff -- What's the value of `FLINK_VERSION` if you execute a Flink job on a standalone cluster from your IDE using the `RemoteExecutionEnvironment`? Will it work? > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713072#comment-14713072 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37978039 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/jobmanager/JobSubmitTest.java --- @@ -170,4 +170,43 @@ public void initializeOnMaster(ClassLoader loader) throws Exception { fail(e.getMessage()); } } + + /** +* Verifies failure when the client and job manager versions mismatch +*/ + @Test + public void testFailureClientJobManagerVersionMismatch() { + try { + // create a simple job graph + + JobVertex jobVertex = new JobVertex("Vertex that fails in initializeOnMaster") { + + @Override + public void initializeOnMaster(ClassLoader loader) throws Exception { + throw new RuntimeException("test exception"); --- End diff -- Why using a faulty `JobVertex` here. This might also be the reason for the "expected" `JobExecutionException` then. And you don't check properly whether that's not the cause for it. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713071#comment-14713071 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37977961 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/jobmanager/JobSubmitTest.java --- @@ -170,4 +170,43 @@ public void initializeOnMaster(ClassLoader loader) throws Exception { fail(e.getMessage()); } } + + /** +* Verifies failure when the client and job manager versions mismatch +*/ + @Test + public void testFailureClientJobManagerVersionMismatch() { + try { + // create a simple job graph + + JobVertex jobVertex = new JobVertex("Vertex that fails in initializeOnMaster") { + + @Override + public void initializeOnMaster(ClassLoader loader) throws Exception { + throw new RuntimeException("test exception"); + } + }; + + jobVertex.setInvokableClass(Tasks.NoOpInvokable.class); + JobGraph jg = new JobGraph("test job", jobVertex); + + // submit the job + Future submitFuture = jobManager.ask(new JobManagerMessages.SubmitJob(jg, false, "RANDOM_CLIENT_VERSION"), timeout); + try { + Await.result(submitFuture, timeout); + } + catch (JobExecutionException e) { + // that is what we expect + // test that the exception nesting is not too deep + assertTrue(e.getCause() == null); --- End diff -- Better to test for the proper exception message. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713068#comment-14713068 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37977899 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/jobmanager/JobSubmitTest.java --- @@ -170,4 +170,43 @@ public void initializeOnMaster(ClassLoader loader) throws Exception { fail(e.getMessage()); } } + + /** +* Verifies failure when the client and job manager versions mismatch +*/ + @Test + public void testFailureClientJobManagerVersionMismatch() { + try { + // create a simple job graph + + JobVertex jobVertex = new JobVertex("Vertex that fails in initializeOnMaster") { + + @Override + public void initializeOnMaster(ClassLoader loader) throws Exception { + throw new RuntimeException("test exception"); + } + }; + + jobVertex.setInvokableClass(Tasks.NoOpInvokable.class); + JobGraph jg = new JobGraph("test job", jobVertex); + + // submit the job + Future submitFuture = jobManager.ask(new JobManagerMessages.SubmitJob(jg, false, "RANDOM_CLIENT_VERSION"), timeout); + try { + Await.result(submitFuture, timeout); + } + catch (JobExecutionException e) { + // that is what we expect + // test that the exception nesting is not too deep + assertTrue(e.getCause() == null); + } + catch (Exception e) { + fail("Wrong exception type"); + } + } + catch (Exception e) { + e.printStackTrace(); + fail(e.getMessage()); --- End diff -- Why do you print the stack trace and let the test fail with an `AssertionError` instead of letting the test fail with the true reason, e.g. `e`. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713067#comment-14713067 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37977813 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala --- @@ -547,7 +552,7 @@ class TaskManager( // successful registration. associate with the JobManager // we disambiguate duplicate or erroneous messages, to simplify debugging -case AcknowledgeRegistration(_, leaderSessionID, jobManager, id, blobPort) => +case AcknowledgeRegistration(_, leaderSessionID, jobManager, id, blobPort, jobManagerID) => --- End diff -- This does not work because a `TaskManager's` registration can also be acknowledged by an `AlreadyRegistered` message. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713065#comment-14713065 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37977713 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -524,13 +534,24 @@ class JobManager( * @param listenToEvents true if the sender wants to listen to job status and execution state * change notifications. false if not. */ - private def submitJob(jobGraph: JobGraph, listenToEvents: Boolean): Unit = { + private def submitJob( + jobGraph: JobGraph, + listenToEvents: Boolean, + clientVersion: String) +: Unit = { if (jobGraph == null) { sender ! decorateMessage( Failure( new JobSubmissionException(null, "JobGraph must not be null.") ) ) +} else if(jobManagerVersionID != clientVersion){ + sender ! decorateMessage( +Failure( + new JobSubmissionException(jobGraph.getJobID, "Client version mismatches that of Job " + +"Manager.") --- End diff -- What versions do the client and the job manager have? > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713062#comment-14713062 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37977616 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -184,26 +188,32 @@ class JobManager( } else { try { - val instanceID = instanceManager.registerTaskManager( -taskManager, -connectionInfo, -hardwareInformation, -numberOfSlots, -leaderSessionID) + + if(jobManagerVersionID == taskManagerVersionID) { +val instanceID = instanceManager.registerTaskManager( + taskManager, + connectionInfo, + hardwareInformation, + numberOfSlots, + leaderSessionID) - // IMPORTANT: Send the response to the "sender", which is not the - //TaskManager actor, but the ask future! - sender() ! decorateMessage( -AcknowledgeRegistration( - registrationSessionID, - leaderSessionID.get, - self, - instanceID, - libraryCacheManager.getBlobServerPort) - ) +// IMPORTANT: Send the response to the "sender", which is not the +//TaskManager actor, but the ask future! +sender() ! decorateMessage( + AcknowledgeRegistration( +registrationSessionID, +leaderSessionID.get, +self, +instanceID, +libraryCacheManager.getBlobServerPort, +jobManagerVersionID) +) - // to be notified when the taskManager is no longer reachable - context.watch(taskManager) +// to be notified when the taskManager is no longer reachable +context.watch(taskManager) + } else{ +throw new Exception("Version mismatch error between Job Manager and Task Manager") --- End diff -- This will kill the `JobManager`. This is not a good solution. Best you revert the verification between the `TaskManager` and the `JobManager`. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713061#comment-14713061 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37977495 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -166,7 +169,8 @@ class JobManager( taskManager, connectionInfo, hardwareInformation, - numberOfSlots) => + numberOfSlots, + taskManagerVersionID) => --- End diff -- Why do you check the version when a `TaskManager` registers at the `JobManager`. I thought the critical part would be the `JobClient` and `JobManager` communication. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713060#comment-14713060 ] ASF GitHub Bot commented on FLINK-2399: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/945#discussion_r37977399 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -121,11 +121,14 @@ class JobManager( override val leaderSessionID = Some(UUID.randomUUID()) + private var jobManagerVersionID: String = _ --- End diff -- Better to use `Option` instead of `null`. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693235#comment-14693235 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-130241985 I've already added a version check between `JobClient` and `JobManager`. Will there be any further review of this? > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646279#comment-14646279 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-125991194 1. Added version checks between JobClient and JobManager. 2. Versions are accessed from the Configuration class now, since flink-core gets built first and version is readily available, leading to exact verification. 3. Added a dummy version string to allow passing of tests in IDE. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645737#comment-14645737 ] ASF GitHub Bot commented on FLINK-2399: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-125886476 Using `getClass().getPackage().getImplementationVersion()` would be a decent first approach then, I guess. The critical part seems to be the Client-to-JobManager communication. How about the following as a first step: The client sends its version together with the `SubmitJob` message (just add a field there). The JobManager would check the version and respond with a failure, if it does not match. You can probably make the JobManager part very simple, no need to add extra constructor parameters, etc. That way, the change would be minimally invasive, and we could see how well it addresses the issues, and whether we should extend this to other messages as well. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644933#comment-14644933 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-125734265 I had decided to work with my own understanding of what version means since nobody replied to the JIRA comment. getClass.getPackage.getImplementationVersion would return, 0.10 SNAPSHOT for the current build. In case this string doesn't match for the TM and JM, it would fail the registration process. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644909#comment-14644909 ] ASF GitHub Bot commented on FLINK-2399: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/945#issuecomment-125732043 I think we need to decide first what we actually mean by version and what is supposed to match what. What is actually the value `getClass().getPackage().getImplementationVersion()`? > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Assignee: Sachin Goel >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643192#comment-14643192 ] ASF GitHub Bot commented on FLINK-2399: --- GitHub user sachingoel0101 opened a pull request: https://github.com/apache/flink/pull/945 [FLINK-2399] Version checks for Job Manager and Task Manager [Reopening #944. The travis build failed somehow. ] Both Job Manager and Task Manager now communicate their respective version IDs to each other as part of registration messages. Version ID is accessed by the information Maven adds in the manifest. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachingoel0101/flink FLINK-2399 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/945.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #945 commit 6de306bccb4687177c49d9baa6e40940db7a8ff9 Author: Sachin Goel Date: 2015-07-24T15:41:34Z Added version checks for Job Manager and Task Manager > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642828#comment-14642828 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 closed the pull request at: https://github.com/apache/flink/pull/944 > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642743#comment-14642743 ] ASF GitHub Bot commented on FLINK-2399: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/944#issuecomment-125216770 Adding these checks however already make sure the older task manager or job manager will be able to register with the new ones because the message objects have changed. [I'm not sure however. Is it possible that the older messages will be cast into the newer ones when transmitted over network?] If that is the case, we should add new names for the messages, and specifically fail for the older messages. > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642733#comment-14642733 ] ASF GitHub Bot commented on FLINK-2399: --- GitHub user sachingoel0101 opened a pull request: https://github.com/apache/flink/pull/944 [FLINK-2399] Version checks for Job Manager and Task Manager Both Job Manager and Task Manager now communicate their respective version IDs to each other as part of registration messages. Version ID is accessed by the information Maven adds in the manifest. However, for enabling the tests to run in IDE[is this necessary?], where the version number is null, added a special provision to set the version ID to "", when it is null. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachingoel0101/flink FLINK-2399 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/944.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #944 commit b9d517fc26b7bc53ada215b253a3086e744781df Author: Sachin Goel Date: 2015-07-24T15:41:34Z Added version checks for Job Manager and Task Manager > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2399) Fail when actor versions don't match
[ https://issues.apache.org/jira/browse/FLINK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640543#comment-14640543 ] Sachin Goel commented on FLINK-2399: Should this work when the task managers and the job manager are from different releases? Or just for the old and new Job Manager? > Fail when actor versions don't match > > > Key: FLINK-2399 > URL: https://issues.apache.org/jira/browse/FLINK-2399 > Project: Flink > Issue Type: Improvement > Components: JobManager, TaskManager >Affects Versions: 0.9, master >Reporter: Ufuk Celebi >Priority: Minor > Fix For: 0.10 > > > Problem: there can be subtle errors when actors from different Flink versions > communicate with each other, for example when an old client (e.g. Flink 0.9) > communicates with a new JobManager (e.g. Flink 0.10-SNAPSHOT). > We can check that the versions match on first communication between the > actors and fail if they don't match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)