[jira] [Assigned] (FLINK-35509) Slack community invite link has expired
[ https://issues.apache.org/jira/browse/FLINK-35509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-35509: --- Assignee: Santwana Verma > Slack community invite link has expired > --- > > Key: FLINK-35509 > URL: https://issues.apache.org/jira/browse/FLINK-35509 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Ufuk Celebi >Assignee: Santwana Verma >Priority: Major > > The Slack invite link on the website has expired. > I've generated a new invite link without expiration here: > [https://join.slack.com/t/apache-flink/shared_invite/zt-2k0fdioxx-D0kTYYLh3pPjMu5IItqx3Q] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35509) Slack community invite link has expired
Ufuk Celebi created FLINK-35509: --- Summary: Slack community invite link has expired Key: FLINK-35509 URL: https://issues.apache.org/jira/browse/FLINK-35509 Project: Flink Issue Type: Bug Components: Project Website Reporter: Ufuk Celebi The Slack invite link on the website has expired. I've generated a new invite link without expiration here: [https://join.slack.com/t/apache-flink/shared_invite/zt-2k0fdioxx-D0kTYYLh3pPjMu5IItqx3Q] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35481) Add HISTOGRAM function in SQL & Table API
Ufuk Celebi created FLINK-35481: --- Summary: Add HISTOGRAM function in SQL & Table API Key: FLINK-35481 URL: https://issues.apache.org/jira/browse/FLINK-35481 Project: Flink Issue Type: Improvement Components: Table SQL / API Reporter: Ufuk Celebi Fix For: 1.20.0 Consider adding a HISTOGRAM aggregate function similar to ksqlDB (https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/aggregate-functions/#histogram). -- This message was sent by Atlassian Jira (v8.20.10#820010)
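For context, the ksqlDB-style HISTOGRAM aggregate referenced above maps each distinct input value to its number of occurrences. A minimal sketch of those semantics in plain Java (not the Flink Table API; class and method names here are illustrative only):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of HISTOGRAM semantics: count occurrences per
// distinct value, preserving first-seen order of the keys.
public class HistogramSketch {
    public static Map<String, Long> histogram(Iterable<String> values) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum); // increment the count for v
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(histogram(List.of("a", "b", "a"))); // {a=2, b=1}
    }
}
```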
[jira] [Created] (FLINK-35480) Add FIELD function in SQL & Table API
Ufuk Celebi created FLINK-35480: --- Summary: Add FIELD function in SQL & Table API Key: FLINK-35480 URL: https://issues.apache.org/jira/browse/FLINK-35480 Project: Flink Issue Type: Improvement Components: Table SQL / API Reporter: Ufuk Celebi Fix For: 1.20.0 Add support for the {{FIELD}} function to return the position of {{str}} in {{{}args{}}}, or 0 if not found. *References* * ksqlDB: [https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/scalar-functions/#field] * MySQL: https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_elt -- This message was sent by Atlassian Jira (v8.20.10#820010)
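The requested semantics can be sketched in plain Java (an illustration of the MySQL/ksqlDB behaviour referenced above, not the proposed Flink implementation; all names are hypothetical):

```java
// Hypothetical sketch of FIELD(str, args...): the 1-based position of
// the first argument equal to str, or 0 if str is not found. In the
// MySQL semantics referenced above, a NULL str also yields 0.
public class FieldSketch {
    public static int field(String str, String... args) {
        if (str == null) {
            return 0;
        }
        for (int i = 0; i < args.length; i++) {
            if (str.equals(args[i])) {
                return i + 1; // positions are 1-based
            }
        }
        return 0; // not found
    }

    public static void main(String[] args) {
        System.out.println(field("b", "a", "b", "c")); // prints 2
        System.out.println(field("x", "a", "b", "c")); // prints 0
    }
}
```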
[jira] [Assigned] (FLINK-35109) Add support for Flink 1.20-SNAPSHOT in Flink Kafka connector and drop support for 1.17 and 1.18
[ https://issues.apache.org/jira/browse/FLINK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-35109: --- Assignee: Fabian Paul (was: Ufuk Celebi) > Add support for Flink 1.20-SNAPSHOT in Flink Kafka connector and drop support > for 1.17 and 1.18 > --- > > Key: FLINK-35109 > URL: https://issues.apache.org/jira/browse/FLINK-35109 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / Kafka >Reporter: Martijn Visser >Assignee: Fabian Paul >Priority: Blocker > Fix For: kafka-4.0.0 > > > The Flink Kafka connector currently can't compile against Flink > 1.20-SNAPSHOT. An example failure can be found at > https://github.com/apache/flink-connector-kafka/actions/runs/8659822490/job/23746484721#step:15:169 > The {code:java} TypeSerializerUpgradeTestBase{code} has had issues before, > see FLINK-32455. See also specifically the comment in > https://issues.apache.org/jira/browse/FLINK-32455?focusedCommentId=17739785&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17739785 > Next to that, there's also FLINK-25509 which can only be supported with Flink > 1.19 and higher. > So we should: > * Drop support for 1.17 and 1.18 > * Refactor the Flink Kafka connector to use the new > {code:java}MigrationTest{code} > We will support the Flink Kafka connector for Flink 1.18 via the v3.1 branch; > this change will be a new v4.0 version with support for Flink 1.19 and the > upcoming Flink 1.20 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-35239) 1.19 docs show outdated warning
[ https://issues.apache.org/jira/browse/FLINK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi resolved FLINK-35239. - Resolution: Fixed > 1.19 docs show outdated warning > --- > > Key: FLINK-35239 > URL: https://issues.apache.org/jira/browse/FLINK-35239 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.19.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: Screenshot 2024-04-25 at 15.01.57.png > > > The docs for 1.19 are currently marked as outdated although it's the > currently stable release. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35239) 1.19 docs show outdated warning
Ufuk Celebi created FLINK-35239: --- Summary: 1.19 docs show outdated warning Key: FLINK-35239 URL: https://issues.apache.org/jira/browse/FLINK-35239 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.19.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Fix For: 1.19.0 Attachments: Screenshot 2024-04-25 at 15.01.57.png The docs for 1.19 are currently marked as outdated although it's the currently stable release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-35109) Add support for Flink 1.20-SNAPSHOT in Flink Kafka connector and drop support for 1.17 and 1.18
[ https://issues.apache.org/jira/browse/FLINK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-35109: --- Assignee: Ufuk Celebi > Add support for Flink 1.20-SNAPSHOT in Flink Kafka connector and drop support > for 1.17 and 1.18 > --- > > Key: FLINK-35109 > URL: https://issues.apache.org/jira/browse/FLINK-35109 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / Kafka >Reporter: Martijn Visser >Assignee: Ufuk Celebi >Priority: Blocker > Fix For: kafka-4.0.0 > > > The Flink Kafka connector currently can't compile against Flink > 1.20-SNAPSHOT. An example failure can be found at > https://github.com/apache/flink-connector-kafka/actions/runs/8659822490/job/23746484721#step:15:169 > The {code:java} TypeSerializerUpgradeTestBase{code} has had issues before, > see FLINK-32455. See also specifically the comment in > https://issues.apache.org/jira/browse/FLINK-32455?focusedCommentId=17739785&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17739785 > Next to that, there's also FLINK-25509 which can only be supported with Flink > 1.19 and higher. > So we should: > * Drop support for 1.17 and 1.18 > * Refactor the Flink Kafka connector to use the new > {code:java}MigrationTest{code} > We will support the Flink Kafka connector for Flink 1.18 via the v3.1 branch; > this change will be a new v4.0 version with support for Flink 1.19 and the > upcoming Flink 1.20 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-35009) Change on getTransitivePredecessors breaks connectors
[ https://issues.apache.org/jira/browse/FLINK-35009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-35009: --- Assignee: Martijn Visser > Change on getTransitivePredecessors breaks connectors > - > > Key: FLINK-35009 > URL: https://issues.apache.org/jira/browse/FLINK-35009 > Project: Flink > Issue Type: Bug > Components: API / Core, Connectors / Kafka >Affects Versions: 1.18.2, 1.20.0, 1.19.1 >Reporter: Martijn Visser >Assignee: Martijn Visser >Priority: Blocker > > {code:java} > Error: Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile > (default-testCompile) on project flink-connector-kafka: Compilation failure: > Compilation failure: > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/streaming/connectors/kafka/testutils/DataGenerators.java:[214,24] > > org.apache.flink.streaming.connectors.kafka.testutils.DataGenerators.InfiniteStringsGenerator.MockTransformation > is not abstract and does not override abstract method > getTransitivePredecessorsInternal() in org.apache.flink.api.dag.Transformation > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/streaming/connectors/kafka/testutils/DataGenerators.java:[220,44] > getTransitivePredecessors() in > org.apache.flink.streaming.connectors.kafka.testutils.DataGenerators.InfiniteStringsGenerator.MockTransformation > cannot override getTransitivePredecessors() in > org.apache.flink.api.dag.Transformation > Error:overridden method is final > {code} > Example: > https://github.com/apache/flink-connector-kafka/actions/runs/8494349338/job/23269406762#step:15:167 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35038) Bump test dependency org.yaml:snakeyaml to 2.2
Ufuk Celebi created FLINK-35038: --- Summary: Bump test dependency org.yaml:snakeyaml to 2.2 Key: FLINK-35038 URL: https://issues.apache.org/jira/browse/FLINK-35038 Project: Flink Issue Type: Technical Debt Components: Connectors / Kafka Affects Versions: 3.1.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Fix For: 3.1.0 Usage of SnakeYAML via {{flink-shaded}} was replaced by an explicit test scope dependency on {{org.yaml:snakeyaml:1.31}} with FLINK-34193. This outdated version of SnakeYAML triggers security warnings. These should not be an actual issue given the test scope, but we should consider bumping the version for security hygiene purposes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-21928) DuplicateJobSubmissionException after JobManager failover
Ufuk Celebi created FLINK-21928: --- Summary: DuplicateJobSubmissionException after JobManager failover Key: FLINK-21928 URL: https://issues.apache.org/jira/browse/FLINK-21928 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.12.2, 1.11.3, 1.10.3 Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, High Availability enabled Reporter: Ufuk Celebi Consider the following scenario: * Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, high availability enabled * Flink job reaches a globally terminal state * Flink job is marked as finished in the high-availability service's RunningJobsRegistry * The JobManager fails over On recovery, the [Dispatcher throws DuplicateJobSubmissionException, because the job is marked as done in the RunningJobsRegistry|https://github.com/apache/flink/blob/release-1.12.2/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L332-L340]. When this happens, users cannot get out of the situation without manually redeploying the JobManager process and changing the job ID^1^. The desired semantics are that we don't want to re-execute a job that has reached a globally terminal state. In this particular case, we know that the job has already reached such a state (as it has been marked in the registry). Therefore, we could handle this case by executing the regular termination sequence instead of throwing a DuplicateJobSubmission. --- ^1^ With ZooKeeper HA, the respective node is not ephemeral. In Kubernetes HA, there is no notion of ephemeral data that is tied to a session in the first place afaik. -- This message was sent by Atlassian Jira (v8.3.4#803005)
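The proposed handling can be sketched as follows. This is a simplified, hypothetical model: {{RunningJobsRegistry}} and {{JobSchedulingStatus}} stand in for the real Flink classes linked above, and the string results merely label the two code paths:

```java
// Hypothetical sketch of the proposed recovery behaviour: if the HA
// registry marks the job as done (globally terminal state reached
// before the failover), run the regular termination sequence instead
// of throwing DuplicateJobSubmissionException.
public class DispatcherSketch {
    static String onSubmission(RunningJobsRegistry registry, String jobId) {
        if (registry.getJobSchedulingStatus(jobId) == JobSchedulingStatus.DONE) {
            // Job already finished: terminate normally, do not re-execute.
            return "terminate-normally";
        }
        return "schedule"; // normal submission path
    }

    public static void main(String[] args) {
        System.out.println(onSubmission(id -> JobSchedulingStatus.DONE, "fixed-job-id"));
        System.out.println(onSubmission(id -> JobSchedulingStatus.RUNNING, "fixed-job-id"));
    }
}

// Simplified stand-ins for the Flink types referenced in the ticket.
enum JobSchedulingStatus { PENDING, RUNNING, DONE }

interface RunningJobsRegistry {
    JobSchedulingStatus getJobSchedulingStatus(String jobId);
}
```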
[jira] [Commented] (FLINK-21351) Incremental checkpoint data would be lost once a non-stop savepoint completed
[ https://issues.apache.org/jira/browse/FLINK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294788#comment-17294788 ] Ufuk Celebi commented on FLINK-21351: - [~roman_khachatryan] The ticket includes fixVersion 1.11.4. Are you planning to open a PR against the release-1.11 branch? > Incremental checkpoint data would be lost once a non-stop savepoint completed > - > > Key: FLINK-21351 > URL: https://issues.apache.org/jira/browse/FLINK-21351 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.3, 1.12.1, 1.13.0 >Reporter: Yun Tang >Assignee: Roman Khachatryan >Priority: Blocker > Labels: pull-request-available > Fix For: 1.11.4, 1.12.2, 1.13.0 > > > FLINK-10354 counted savepoints as retained checkpoints so that jobs could > fail over from the latest position. This is reasonable; however, the current > implementation loses incremental checkpoint data immediately once a non-stop > savepoint completes. > The general flow for incremental checkpoints is: once a newer checkpoint > completes, it is added to the checkpoint store, and if the number of completed > checkpoints exceeds the maximum retained limit, the oldest one is subsumed. > Subsuming decreases the reference count of the incremental data by one, and > the data is deleted once the reference count reaches zero. Since we always > register the newer checkpoint before unregistering the older one, this flow > works as expected. > However, if a non-stop savepoint (a manually triggered savepoint taken > mid-run) completes, it is also added to the checkpoint store and subsumes the > previously added checkpoint (in the default retain-one-checkpoint case). This > unregisters the older checkpoint without a newer checkpoint being registered, > leading to data loss. > Thanks to [~banmoy] for reporting this problem first. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-10841) Reduce the number of ListObjects calls when checkpointing to S3
[ https://issues.apache.org/jira/browse/FLINK-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262719#comment-17262719 ] Ufuk Celebi commented on FLINK-10841: - [~uberspot] Presto's S3 FileSystem seems to only use v1 requests [1] whereas newer versions of the Hadoop S3 FileSystem should use v2 requests by default [2]. If both file system plugins are on the class path you have to explicitly pick Hadoop via {{s3a://}}. See also: [https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/filesystems/s3.html#hadooppresto-s3-file-systems-plugins] Does this help? [1] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java#L523-L527 [2] https://github.com/apache/hadoop/blob/rel/release-3.1.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L297-L302 > Reduce the number of ListObjects calls when checkpointing to S3 > --- > > Key: FLINK-10841 > URL: https://issues.apache.org/jira/browse/FLINK-10841 > Project: Flink > Issue Type: Improvement > Components: FileSystems >Affects Versions: 1.5.5, 1.6.2 >Reporter: Pawel Bartoszek >Priority: Minor > > With S3 configured as checkpoint store using S3 AWS Hadoop filesystem we see > loads of ListObjects calls. For instance the job with ~1600 tasks requires > around 23000 ListObjects calls for every checkpoint including clearing it up > by Flink. With checkpoint interval set to 5 minutes this adds up to hundreds > of dollars pay month just for ListObjects calls. I am aware that > implementation details might be hidden in Hadoop jar and maybe difficult to > change, but at least maybe some workaround might be suggested? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18192) Upgrade to Avro version 1.10.0 from 1.8.2
[ https://issues.apache.org/jira/browse/FLINK-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222800#comment-17222800 ] Ufuk Celebi commented on FLINK-18192: - [~trohrmann] Thanks for the ping. If the upgrade is feasible for a 1.11 patch release, it would be a clear yes from my side. If not, I would understand. ;) > Upgrade to Avro version 1.10.0 from 1.8.2 > - > > Key: FLINK-18192 > URL: https://issues.apache.org/jira/browse/FLINK-18192 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Reporter: Lucas Heimberg >Assignee: Dawid Wysakowicz >Priority: Major > Fix For: 1.12.0 > > > As of version 1.11, Flink (i.e., flink-avro) still uses Avro in version 1.8.2. > Avro 1.9.2 contains many bugfixes, in particular with respect to the support > for logical types. A further advantage is that an upgrade to Avro 1.9.2 would > also allow using the Confluent Schema Registry client and Avro deserializer > in version 5.5.0, which finally support schema references. > Therefore it would be great if Flink could use Avro 1.9.2 or higher > in future releases. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s
[ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202044#comment-17202044 ] Ufuk Celebi commented on FLINK-18828: - [~trohrmann] [~fly_in_gis] Just noticed that this ticket has fix versions for 1.10 and 1.11. I think it would be better to only do this with a minor version upgrade (e.g. 1.12.0), because down stream users may rely on the current behaviour (I do). > Terminate jobmanager process with zero exit code to avoid unexpected > restarting by K8s > -- > > Key: FLINK-18828 > URL: https://issues.apache.org/jira/browse/FLINK-18828 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Yang Wang >Priority: Major > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Currently, Flink jobmanager process terminates with a non-zero exit code if > the job reaches the {{ApplicationStatus.FAILED}}. It is not ideal in K8s > deployment, since non-zero exit code will cause unexpected restarting. Also > from a framework's perspective, a FAILED job does not mean that Flink has > failed and, hence, the return code could still be 0. > > Note: > This is a special case for standalone K8s deployment. For > standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is > harmless. And a non-zero exit code could help to check the job result quickly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s
[ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186623#comment-17186623 ] Ufuk Celebi commented on FLINK-18828: - Thanks for the summary, Till. I think that's a good way to frame this problem. {quote}One could argue then that users should configure their restart strategies to always restart if they don't want to reach a FAILED state. {quote} Practically speaking I think it's the only option we currently have in the context of Kubernetes standalone deployments. > Terminate jobmanager process with zero exit code to avoid unexpected > restarting by K8s > -- > > Key: FLINK-18828 > URL: https://issues.apache.org/jira/browse/FLINK-18828 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Yang Wang >Priority: Major > Fix For: 1.12.0, 1.11.2, 1.10.3 > > > Currently, Flink jobmanager process terminates with a non-zero exit code if > the job reaches the {{ApplicationStatus.FAILED}}. It is not ideal in K8s > deployment, since non-zero exit code will cause unexpected restarting. Also > from a framework's perspective, a FAILED job does not mean that Flink has > failed and, hence, the return code could still be 0. > > Note: > This is a special case for standalone K8s deployment. For > standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is > harmless. And a non-zero exit code could help to check the job result quickly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s
[ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186532#comment-17186532 ] Ufuk Celebi edited comment on FLINK-18828 at 8/28/20, 1:19 PM: --- [~fly_in_gis] Thanks for the pointers. The Flink job would only transition to FAILED when the Flink-level restart strategy has been exhausted. In your example of fixed-delay with 3 attempts, the first three restarts would _not_ cause the container to exit; only on the 4th failure would the job transition to FAILED and the container exit. I think the bigger problem with my proposal to set the policy to Never is that it would not restart in other failure scenarios (e.g. OOM killed), so I don't think it's a viable option. Overall, I don't see a good way around this problem without your proposed change. --- Maybe as a follow-up we want to resurrect https://issues.apache.org/jira/browse/FLINK-10948? That way, users would at least be able to determine the final Flink job status.
> Terminate jobmanager process with zero exit code to avoid unexpected > restarting by K8s > -- > > Key: FLINK-18828 > URL: https://issues.apache.org/jira/browse/FLINK-18828 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Yang Wang >Priority: Major > Fix For: 1.12.0, 1.11.2, 1.10.3 > > > Currently, Flink jobmanager process terminates with a non-zero exit code if > the job reaches the {{ApplicationStatus.FAILED}}. It is not ideal in K8s > deployment, since non-zero exit code will cause unexpected restarting. Also > from a framework's perspective, a FAILED job does not mean that Flink has > failed and, hence, the return code could still be 0. > > Note: > This is a special case for standalone K8s deployment. For > standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is > harmless. And a non-zero exit code could help to check the job result quickly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s
[ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186532#comment-17186532 ] Ufuk Celebi commented on FLINK-18828: - [~fly_in_gis] Thanks for the pointers. The Flink job would only transition to FAILED when the Flink-level restart strategy has been exhausted. In your example of fixed-delay with 3 attempts, the first three restarts would _not_ cause the container to exit; only on the 4th failure would the job transition to FAILED and the container exit. I think the bigger problem with my proposal to set the policy to Never is that it would not restart in other failure scenarios (e.g. OOM killed), so I don't think it's a viable option. Overall, I don't see a good way around this problem without your proposed change. --- Maybe as a follow-up we want to resurrect https://issues.apache.org/jira/browse/FLINK-10948? That way, users would at least be able to determine the final Flink job status. > Terminate jobmanager process with zero exit code to avoid unexpected > restarting by K8s > -- > > Key: FLINK-18828 > URL: https://issues.apache.org/jira/browse/FLINK-18828 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Yang Wang >Priority: Major > Fix For: 1.12.0, 1.11.2, 1.10.3 > > > Currently, Flink jobmanager process terminates with a non-zero exit code if > the job reaches the {{ApplicationStatus.FAILED}}. It is not ideal in K8s > deployment, since non-zero exit code will cause unexpected restarting. Also > from a framework's perspective, a FAILED job does not mean that Flink has > failed and, hence, the return code could still be 0. > > Note: > This is a special case for standalone K8s deployment. For > standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is > harmless. And a non-zero exit code could help to check the job result quickly. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s
[ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186327#comment-17186327 ] Ufuk Celebi commented on FLINK-18828: - [~fly_in_gis] I think it makes sense to keep a non-zero exit code for failed jobs. How would users figure out whether the job has succeeded or not if we change the exit code? Regarding the unexpected restarts: What about updating the {{restartPolicy}} to {{Never}} in the spec of the Kubernetes Job (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy)? That way, we would still have the information from the exit code and we wouldn't see any restarts by default. Users would also have the flexibility to change the behaviour depending on their use case by setting {{restartPolicy: OnFailure}} again. > Terminate jobmanager process with zero exit code to avoid unexpected > restarting by K8s > -- > > Key: FLINK-18828 > URL: https://issues.apache.org/jira/browse/FLINK-18828 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.10.1, 1.12.0, 1.11.1 >Reporter: Yang Wang >Priority: Major > Fix For: 1.12.0, 1.11.2, 1.10.3 > > > Currently, Flink jobmanager process terminates with a non-zero exit code if > the job reaches the {{ApplicationStatus.FAILED}}. It is not ideal in K8s > deployment, since non-zero exit code will cause unexpected restarting. Also > from a framework's perspective, a FAILED job does not mean that Flink has > failed and, hence, the return code could still be 0. > > Note: > This is a special case for standalone K8s deployment. For > standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is > harmless. And a non-zero exit code could help to check the job result quickly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
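For reference, the {{restartPolicy}} discussed above is set in the pod template of the Kubernetes Job manifest. A minimal, hypothetical fragment (resource names and image tag are placeholders, not taken from the ticket):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager        # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never      # OnFailure would restart on non-zero exit codes
      containers:
        - name: jobmanager
          image: flink:1.11     # hypothetical image tag
          args: ["standalone-job"]
```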
[jira] [Commented] (FLINK-18695) Allow NettyBufferPool to allocate heap buffers
[ https://issues.apache.org/jira/browse/FLINK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166902#comment-17166902 ] Ufuk Celebi commented on FLINK-18695: - I agree with Stephan that it should not have much of an impact. In particular for the SSL handler, the comment in the [respective Netty PR|https://github.com/netty/netty/commit/39cc7a673939dec96258ff27f5b1874671838af0#diff-2fe7b22a8d650f1ea0bf56a809c061f9R303-R310] says that the direct buffer was copied back to a heap byte[] by the SSL engine anyway. Therefore, allowing heap buffers should be a net positive in the context of the SSL handler. > Allow NettyBufferPool to allocate heap buffers > -- > > Key: FLINK-18695 > URL: https://issues.apache.org/jira/browse/FLINK-18695 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network >Reporter: Chesnay Schepler >Priority: Major > Fix For: 1.12.0 > > > In 4.1.43, Netty changed its SslHandler to always use heap buffers > for JDK SSLEngine implementations, to avoid an additional memory copy. > However, our {{NettyBufferPool}} forbids heap buffer allocations. > We will either have to allow heap buffer allocations, or create a custom > SslHandler implementation that does not use heap buffers (although this seems > ill-advised?). > /cc [~sewen] [~uce] [~NicoK] [~zjwang] [~pnowojski] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17500) Deploy JobGraph from file in StandaloneClusterEntrypoint
[ https://issues.apache.org/jira/browse/FLINK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101541#comment-17101541 ] Ufuk Celebi commented on FLINK-17500: - [~zhuzh] Good points. I would agree to keep the JobGraph an internal API and don't make any guarantees about compatibility here. As indicated by the proposed configuration option keys I consider this internal configuration only. (In our use case we strictly tie the JobGraph to the Flink version and don't re-use it across job executions.) > Deploy JobGraph from file in StandaloneClusterEntrypoint > > > Key: FLINK-17500 > URL: https://issues.apache.org/jira/browse/FLINK-17500 > Project: Flink > Issue Type: Wish > Components: Deployment / Docker >Reporter: Ufuk Celebi >Priority: Minor > > We have a requirement to deploy a pre-generated {{JobGraph}} from a file in > {{StandaloneClusterEntrypoint}}. > Currently, {{StandaloneClusterEntrypoint}} only supports deployment of a > Flink job from the class path using {{ClassPathPackagedProgramRetriever}}. > Our desired behaviour would be as follows: > If {{internal.jobgraph-path}} is set, prepare a {{PackagedProgram}} from a > local {{JobGraph}} file using {{FileJobGraphRetriever}}. Otherwise, deploy > using {{ClassPathPackagedProgramRetriever}} (current behavior). > --- > I understand that this requirement is pretty niche, but wanted to get > feedback whether the Flink community would be open to supporting this > nonetheless. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17500) Deploy JobGraph from file in StandaloneClusterEntrypoint
[ https://issues.apache.org/jira/browse/FLINK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098866#comment-17098866 ] Ufuk Celebi commented on FLINK-17500: - # Yes, that makes sense. (y) # I just skimmed the code for {{YarnJobClusterEntrypoint}} and it looks like this would work nicely. > Deploy JobGraph from file in StandaloneClusterEntrypoint > > > Key: FLINK-17500 > URL: https://issues.apache.org/jira/browse/FLINK-17500 > Project: Flink > Issue Type: Wish > Components: Deployment / Docker >Reporter: Ufuk Celebi >Priority: Minor > > We have a requirement to deploy a pre-generated {{JobGraph}} from a file in > {{StandaloneClusterEntrypoint}}. > Currently, {{StandaloneClusterEntrypoint}} only supports deployment of a > Flink job from the class path using {{ClassPathPackagedProgramRetriever}}. > Our desired behaviour would be as follows: > If {{internal.jobgraph-path}} is set, prepare a {{PackagedProgram}} from a > local {{JobGraph}} file using {{FileJobGraphRetriever}}. Otherwise, deploy > using {{ClassPathPackagedProgramRetriever}} (current behavior). > --- > I understand that this requirement is pretty niche, but wanted to get > feedback whether the Flink community would be open to supporting this > nonetheless. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17500) Deploy JobGraph from file in StandaloneClusterEntrypoint
Ufuk Celebi created FLINK-17500: --- Summary: Deploy JobGraph from file in StandaloneClusterEntrypoint Key: FLINK-17500 URL: https://issues.apache.org/jira/browse/FLINK-17500 Project: Flink Issue Type: Wish Components: Deployment / Docker Reporter: Ufuk Celebi We have a requirement to deploy a pre-generated {{JobGraph}} from a file in {{StandaloneClusterEntrypoint}}. Currently, {{StandaloneClusterEntrypoint}} only supports deployment of a Flink job from the class path using {{ClassPathPackagedProgramRetriever}}. Our desired behaviour would be as follows: If {{internal.jobgraph-path}} is set, prepare a {{PackagedProgram}} from a local {{JobGraph}} file using {{FileJobGraphRetriever}}. Otherwise, deploy using {{ClassPathPackagedProgramRetriever}} (current behavior). --- I understand that this requirement is pretty niche, but wanted to get feedback whether the Flink community would be open to supporting this nonetheless. -- This message was sent by Atlassian Jira (v8.3.4#803005)
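The fallback described in the ticket — use {{FileJobGraphRetriever}} when {{internal.jobgraph-path}} is set, otherwise keep the current {{ClassPathPackagedProgramRetriever}} behavior — can be sketched as follows. This is a minimal illustration, not Flink code: only the configuration key comes from the ticket, and the function and its return values are hypothetical stand-ins for the real retriever types.

```python
# Hedged sketch of the proposed retriever selection in
# StandaloneClusterEntrypoint. Only the key "internal.jobgraph-path"
# comes from the ticket; everything else is a hypothetical stand-in.

def choose_retriever(config):
    """Return a description of the program retriever that would be used."""
    jobgraph_path = config.get("internal.jobgraph-path")
    if jobgraph_path is not None:
        # A pre-generated JobGraph file was configured: deserialize it
        # directly instead of scanning the classpath.
        return "FileJobGraphRetriever(%s)" % jobgraph_path
    # Current behavior: discover the job's main class on the classpath.
    return "ClassPathPackagedProgramRetriever"
```

The point of the sketch is that the file-based path is strictly an opt-in override; with no {{internal.jobgraph-path}} entry, behavior is unchanged.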
[jira] [Comment Edited] (FLINK-16499) Flink shaded hadoop could not work when Yarn timeline service is enabled
[ https://issues.apache.org/jira/browse/FLINK-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059342#comment-17059342 ] Ufuk Celebi edited comment on FLINK-16499 at 3/14/20, 1:13 PM: --- [~fly_in_gis] For proper Hadoop setups, the expected usage is to provide the Hadoop classpath as documented in https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/hadoop.html#adding-hadoop-classpaths. A similar issue was also reported in FLINK-12493. Edit: I'm just pointing this out for the general case. I understand that it might not be relevant for your usage scenario. was (Author: uce): [~fly_in_gis] For proper Hadoop setups, the expected usage is to provide the Hadoop classpath as documented in https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/hadoop.html#adding-hadoop-classpaths. A similar issue was also reported in FLINK-12493. > Flink shaded hadoop could not work when Yarn timeline service is enabled > > > Key: FLINK-16499 > URL: https://issues.apache.org/jira/browse/FLINK-16499 > Project: Flink > Issue Type: Bug > Components: BuildSystem / Shaded >Reporter: Yang Wang >Priority: Major > > When the Yarn timeline service is enabled (via > {{yarn.timeline-service.enabled=true}} in yarn-site.xml), flink-shaded-hadoop > could not be used to submit a Flink job to a Yarn cluster. The following exception > will be thrown. > > The root cause is that {{jersey-core-xx.jar}} is not bundled into > {{flink-shaded-hadoop-xx}}{{.jar}}. 
> > {code:java} > 2020-03-09 03:35:34,396 ERROR org.apache.flink.client.cli.CliFrontend > [] - Fatal error while running command line interface.2020-03-09 > 03:35:34,396 ERROR org.apache.flink.client.cli.CliFrontend > [] - Fatal error while running command line > interface.java.lang.NoClassDefFoundError: javax/ws/rs/ext/MessageBodyReader > at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > ~[?:1.8.0_242] at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at > java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at > java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at > java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at > java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at > sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > ~[?:1.8.0_242] at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at > java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at > java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at > java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at > 
java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at > sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > ~[?:1.8.0_242] at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at > java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at > java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at > java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at > java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at > sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242]
[jira] [Commented] (FLINK-16499) Flink shaded hadoop could not work when Yarn timeline service is enabled
[ https://issues.apache.org/jira/browse/FLINK-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059342#comment-17059342 ] Ufuk Celebi commented on FLINK-16499: - [~fly_in_gis] For proper Hadoop setups, the expected usage is to provide the Hadoop classpath as documented in https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/hadoop.html#adding-hadoop-classpaths. A similar issue was also reported in FLINK-12493. > Flink shaded hadoop could not work when Yarn timeline service is enabled > > > Key: FLINK-16499 > URL: https://issues.apache.org/jira/browse/FLINK-16499 > Project: Flink > Issue Type: Bug > Components: BuildSystem / Shaded >Reporter: Yang Wang >Priority: Major > > When the Yarn timeline service is enabled (via > {{yarn.timeline-service.enabled=true}} in yarn-site.xml), flink-shaded-hadoop > could not be used to submit a Flink job to a Yarn cluster. The following exception > will be thrown. > > The root cause is that {{jersey-core-xx.jar}} is not bundled into > {{flink-shaded-hadoop-xx}}{{.jar}}. 
> > {code:java} > 2020-03-09 03:35:34,396 ERROR org.apache.flink.client.cli.CliFrontend > [] - Fatal error while running command line interface.2020-03-09 > 03:35:34,396 ERROR org.apache.flink.client.cli.CliFrontend > [] - Fatal error while running command line > interface.java.lang.NoClassDefFoundError: javax/ws/rs/ext/MessageBodyReader > at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > ~[?:1.8.0_242] at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at > java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at > java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at > java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at > java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at > sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > ~[?:1.8.0_242] at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at > java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at > java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at > java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at > 
java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at > sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at > java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > ~[?:1.8.0_242] at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at > java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at > java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at > java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at > java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at > sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at > org.apache.hadoop.yarn.util.timeline.TimelineUtils.(TimelineUtils.java:50) > ~[flink-shaded-hadoop-2-uber-2.8.3-7.0.jar:2.8.3-7.0] at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:179) > ~[flink-shaded-hadoop-2-uber-2.8.3-7.0.jar:2.8.3-7.0] at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > ~[flink-shaded-hadoop-2-uber-2.8.3-7.0.jar:2.8.3-7.0] at > org.apache.flink.yarn.YarnClusterCl
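The {{NoClassDefFoundError}} above stems from a class ({{javax.ws.rs.ext.MessageBodyReader}} from jersey-core) that is simply absent from the shaded jar. Whether a jar bundles a class can be checked mechanically, since a jar is a zip archive with one {{.class}} entry per class. A minimal sketch — the helper name is hypothetical; only the jar and class names come from the report:

```python
import io
import zipfile

def jar_contains_class(jar, class_name):
    """Return True if the jar bundles the given class.

    `jar` may be a path or a file-like object; `class_name` uses dotted
    notation, e.g. "javax.ws.rs.ext.MessageBodyReader".
    """
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar) as zf:
        return entry in zf.namelist()

# Build a tiny in-memory "jar" bundling only a Hadoop class, mimicking a
# flink-shaded-hadoop jar that was built without jersey-core.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("org/apache/hadoop/conf/Configuration.class", b"")

has_jersey = jar_contains_class(buf, "javax.ws.rs.ext.MessageBodyReader")
```

Running the same check against a real flink-shaded-hadoop-2-uber jar would confirm the missing dependency before the class loader fails at submission time.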
[jira] [Comment Edited] (FLINK-14812) Add custom libs to Flink classpath with an environment variable.
[ https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033449#comment-17033449 ] Ufuk Celebi edited comment on FLINK-14812 at 2/10/20 9:07 AM: -- [~elanv] I'm reopening again, because the initial context of this ticket was not the Docker entrypoint script, but the general startup scripts. If there is agreement that the Docker entrypoint change is enough, we can close this again. :) [~aljoscha] What do you think? was (Author: uce): [~elanv] I'm reopening again, because the initial context of this ticket was not the Docker entrypoint script, but the general startup scripts. If there is agreement that the Docker entrypoint change is enough, we can close this again. :) > Add custom libs to Flink classpath with an environment variable. > > > Key: FLINK-14812 > URL: https://issues.apache.org/jira/browse/FLINK-14812 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Deployment / Scripts >Reporter: Eui Heo >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > To use a plugin library you need to add it to the Flink classpath. The > documentation explains to put the jar file in the lib path. > https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter > However, to deploy metric-enabled Flink on a Kubernetes cluster, we have the > burden of creating and managing another container image. It would be more > efficient to add the classpath using environment variables inside the > constructFlinkClassPath function in the config.sh file. > In particular, it seems inconvenient for me to create separate images to use > the jars, even though the /opt/flink/opt of the stock image already contains > them. 
> For example, there are metrics libs and file system libs: > flink-azure-fs-hadoop-1.9.1.jar > flink-s3-fs-hadoop-1.9.1.jar > flink-metrics-prometheus-1.9.1.jar > flink-metrics-influxdb-1.9.1.jar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (FLINK-14812) Add custom libs to Flink classpath with an environment variable.
[ https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reopened FLINK-14812: - > Add custom libs to Flink classpath with an environment variable. > > > Key: FLINK-14812 > URL: https://issues.apache.org/jira/browse/FLINK-14812 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Deployment / Scripts >Reporter: Eui Heo >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > To use a plugin library you need to add it to the Flink classpath. The > documentation explains to put the jar file in the lib path. > https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter > However, to deploy metric-enabled Flink on a Kubernetes cluster, we have the > burden of creating and managing another container image. It would be more > efficient to add the classpath using environment variables inside the > constructFlinkClassPath function in the config.sh file. > In particular, it seems inconvenient for me to create separate images to use > the jars, even though the /opt/flink/opt of the stock image already contains > them. > For example, there are metrics libs and file system libs: > flink-azure-fs-hadoop-1.9.1.jar > flink-s3-fs-hadoop-1.9.1.jar > flink-metrics-prometheus-1.9.1.jar > flink-metrics-influxdb-1.9.1.jar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14812) Add custom libs to Flink classpath with an environment variable.
[ https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033449#comment-17033449 ] Ufuk Celebi commented on FLINK-14812: - [~elanv] I'm reopening again, because the initial context of this ticket was not the Docker entrypoint script, but the general startup scripts. If there is agreement that the Docker entrypoint change is enough, we can close this again. :) > Add custom libs to Flink classpath with an environment variable. > > > Key: FLINK-14812 > URL: https://issues.apache.org/jira/browse/FLINK-14812 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Deployment / Scripts >Reporter: Eui Heo >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > To use a plugin library you need to add it to the Flink classpath. The > documentation explains to put the jar file in the lib path. > https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter > However, to deploy metric-enabled Flink on a Kubernetes cluster, we have the > burden of creating and managing another container image. It would be more > efficient to add the classpath using environment variables inside the > constructFlinkClassPath function in the config.sh file. > In particular, it seems inconvenient for me to create separate images to use > the jars, even though the /opt/flink/opt of the stock image already contains > them. > For example, there are metrics libs and file system libs: > flink-azure-fs-hadoop-1.9.1.jar > flink-s3-fs-hadoop-1.9.1.jar > flink-metrics-prometheus-1.9.1.jar > flink-metrics-influxdb-1.9.1.jar -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-15831) Add Docker image publication to release documentation
[ https://issues.apache.org/jira/browse/FLINK-15831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-15831: --- Assignee: Patrick Lucas > Add Docker image publication to release documentation > - > > Key: FLINK-15831 > URL: https://issues.apache.org/jira/browse/FLINK-15831 > Project: Flink > Issue Type: Sub-task > Components: Documentation >Reporter: Ufuk Celebi >Assignee: Patrick Lucas >Priority: Major > > The [release documentation in the project > Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release] > describes the release process. > We need to add a note to follow up with the Docker image publication process > as part of the release checklist. The actual documentation should probably be > self-contained in the apache/flink-docker repository, but we should > definitely link to it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker
[ https://issues.apache.org/jira/browse/FLINK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-15830. --- Resolution: Fixed apache/flink and docker-flink/docker-flink now have the same histories up to f26d84c when we decided to migrate. > Migrate docker-flink/docker-flink to apache/flink-docker > > > Key: FLINK-15830 > URL: https://issues.apache.org/jira/browse/FLINK-15830 > Project: Flink > Issue Type: Sub-task >Reporter: Ufuk Celebi >Assignee: Patrick Lucas >Priority: Major > > * Migrate contents including commit history. > * Add note to docker-flink/docker-flink and potentially archive -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-15829) Request apache/flink-docker repository
[ https://issues.apache.org/jira/browse/FLINK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi resolved FLINK-15829. - Resolution: Fixed > Request apache/flink-docker repository > -- > > Key: FLINK-15829 > URL: https://issues.apache.org/jira/browse/FLINK-15829 > Project: Flink > Issue Type: Sub-task >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15829) Request apache/flink-docker repository
[ https://issues.apache.org/jira/browse/FLINK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027376#comment-17027376 ] Ufuk Celebi commented on FLINK-15829: - The repository has been created in https://github.com/apache/flink-docker. There is a small hiccup in the repository description. I asked INFRA to fix it in https://issues.apache.org/jira/browse/INFRA-19800. > Request apache/flink-docker repository > -- > > Key: FLINK-15829 > URL: https://issues.apache.org/jira/browse/FLINK-15829 > Project: Flink > Issue Type: Sub-task >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker
[ https://issues.apache.org/jira/browse/FLINK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-15830: Description: Migrate contents including commit history. > Migrate docker-flink/docker-flink to apache/flink-docker > > > Key: FLINK-15830 > URL: https://issues.apache.org/jira/browse/FLINK-15830 > Project: Flink > Issue Type: Sub-task >Reporter: Ufuk Celebi >Assignee: Patrick Lucas >Priority: Major > > Migrate contents including commit history. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker
[ https://issues.apache.org/jira/browse/FLINK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-15830: Description: * Migrate contents including commit history. * Add note to docker-flink/docker-flink and potentially archive was:Migrate contents including commit history. > Migrate docker-flink/docker-flink to apache/flink-docker > > > Key: FLINK-15830 > URL: https://issues.apache.org/jira/browse/FLINK-15830 > Project: Flink > Issue Type: Sub-task >Reporter: Ufuk Celebi >Assignee: Patrick Lucas >Priority: Major > > * Migrate contents including commit history. > * Add note to docker-flink/docker-flink and potentially archive -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15831) Add Docker image publication to release documentation
Ufuk Celebi created FLINK-15831: --- Summary: Add Docker image publication to release documentation Key: FLINK-15831 URL: https://issues.apache.org/jira/browse/FLINK-15831 Project: Flink Issue Type: Sub-task Components: Documentation Reporter: Ufuk Celebi The [release documentation in the project Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release] describes the release process. We need to add a note to follow up with the Docker image publication process as part of the release checklist. The actual documentation should probably be self-contained in the apache/flink-docker repository, but we should definitely link to it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker
Ufuk Celebi created FLINK-15830: --- Summary: Migrate docker-flink/docker-flink to apache/flink-docker Key: FLINK-15830 URL: https://issues.apache.org/jira/browse/FLINK-15830 Project: Flink Issue Type: Sub-task Reporter: Ufuk Celebi Assignee: Patrick Lucas -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-15829) Request apache/flink-docker repository
[ https://issues.apache.org/jira/browse/FLINK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-15829: --- Assignee: Ufuk Celebi > Request apache/flink-docker repository > -- > > Key: FLINK-15829 > URL: https://issues.apache.org/jira/browse/FLINK-15829 > Project: Flink > Issue Type: Sub-task >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15829) Request apache/flink-docker repository
Ufuk Celebi created FLINK-15829: --- Summary: Request apache/flink-docker repository Key: FLINK-15829 URL: https://issues.apache.org/jira/browse/FLINK-15829 Project: Flink Issue Type: Sub-task Reporter: Ufuk Celebi -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15828) Integrate docker-flink/docker-flink into Flink release process
Ufuk Celebi created FLINK-15828: --- Summary: Integrate docker-flink/docker-flink into Flink release process Key: FLINK-15828 URL: https://issues.apache.org/jira/browse/FLINK-15828 Project: Flink Issue Type: Improvement Components: Deployment / Docker, Release System Reporter: Ufuk Celebi This ticket tracks the first phase of Flink Docker image build consolidation. The goal of this story is to integrate Docker image publication with the Flink release process and provide convenience packages of released Flink artifacts on DockerHub. For more details, check the [DISCUSS|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html] and [VOTE|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36982.html] threads on the mailing list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-14812) Add metric libs to Flink classpath with an environment variable.
[ https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978315#comment-16978315 ] Ufuk Celebi edited comment on FLINK-14812 at 11/20/19 11:08 AM: [~elanv] I think your workaround of adding single JARs to {{HADOOP_CLASSPATH}} is the best option currently. I think the question for this ticket is independent of metrics and Kubernetes. The general question is whether we want to add a Hadoop-independent environment variable such as {{CLASSPATH_ADDITIONAL_JARS}} that we document and commit to maintain for our startup scripts. Personally, I would find this beneficial in the standalone job use case. [~plucas] What do you think? --- I'm not sure who in the Flink community has good insight on the scripts and can help us make a decision here. [~aljoscha] or [~chesnay] maybe? was (Author: uce): [~elanv] I think your workaround of adding single JARs to {{HADOOP_CLASSPATH}} is the best option currently. I think the question for this ticket is independent of metrics and Kubernetes. The general question is whether we want to add a Hadoop-independent environment variable such as {{CLASSPATH_ADDITIONAL_JARS}} that we document and commit to maintain for our startup scripts. Personally, I would find this beneficial in the standalone job use case. [~plucas] What do you think? --- I'm not sure who in the Flink community has good insight on the scripts and can help us make a decision here. [~aljoscha] or [~chesnay] maybe? > Add metric libs to Flink classpath with an environment variable. > > > Key: FLINK-14812 > URL: https://issues.apache.org/jira/browse/FLINK-14812 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Deployment / Scripts >Reporter: Eui Heo >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > To use the Flink metric lib you need to add it to the Flink classpath. 
The > documentation explains to put the jar file in the lib path. > https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter > However, to deploy metric-enabled Flinks on a kubernetes cluster, we have the > burden of creating and managing another container image. It would be more > efficient to add the classpath using environment variables inside the > constructFlinkClassPath function in the config.sh file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14812) Add metric libs to Flink classpath with an environment variable.
[ https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978315#comment-16978315 ] Ufuk Celebi commented on FLINK-14812: - [~elanv] I think your workaround of adding single JARs to {{HADOOP_CLASSPATH}} is the best option currently. I think the question for this ticket is independent of metrics and Kubernetes. The general question is whether we want to add a Hadoop-independent environment variable such as {{CLASSPATH_ADDITIONAL_JARS}} that we document and commit to maintain for our startup scripts. Personally, I would find this beneficial in the standalone job use case. [~plucas] What do you think? --- I'm not sure who in the Flink community has good insight on the scripts and can help us make a decision here. [~aljoscha] or [~chesnay] maybe? > Add metric libs to Flink classpath with an environment variable. > > > Key: FLINK-14812 > URL: https://issues.apache.org/jira/browse/FLINK-14812 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Deployment / Scripts >Reporter: Eui Heo >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > To use the Flink metric lib you need to add it to the Flink classpath. The > documentation explains to put the jar file in the lib path. > https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter > However, to deploy metric-enabled Flink on a Kubernetes cluster, we have the > burden of creating and managing another container image. It would be more > efficient to add the classpath using environment variables inside the > constructFlinkClassPath function in the config.sh file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
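The proposal in the comment — a documented, Hadoop-independent environment variable such as {{CLASSPATH_ADDITIONAL_JARS}} picked up by {{constructFlinkClassPath}} — would amount to appending the variable's entries to the jars found under lib/. A rough sketch of that semantics; the variable name is only the one floated in the comment, and the function is a hypothetical model of the shell logic in config.sh, not the real implementation:

```python
import os

def construct_flink_classpath(lib_jars, env=None):
    """Model of extending constructFlinkClassPath (config.sh) so that jars
    listed in CLASSPATH_ADDITIONAL_JARS are appended to the lib/ jars."""
    if env is None:
        env = os.environ
    entries = list(lib_jars)
    extra = env.get("CLASSPATH_ADDITIONAL_JARS", "")
    # Use the usual ':'-separated Java classpath convention and drop
    # empty segments, so an unset variable changes nothing.
    entries += [jar for jar in extra.split(":") if jar]
    return ":".join(entries)
```

This keeps the default behavior untouched while letting a Kubernetes pod spec inject jars (e.g. the metrics reporters already shipped under /opt/flink/opt) via an environment variable instead of a custom image.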
[jira] [Assigned] (FLINK-14242) Collapse task names in job graph visualization if too long
[ https://issues.apache.org/jira/browse/FLINK-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-14242: --- Assignee: Yadong Xie > Collapse task names in job graph visualization if too long > -- > > Key: FLINK-14242 > URL: https://issues.apache.org/jira/browse/FLINK-14242 > Project: Flink > Issue Type: Improvement > Components: Runtime / Web Frontend >Affects Versions: 1.9.0 >Reporter: Paul Lin >Assignee: Yadong Xie >Priority: Minor > > For some complex jobs, especially SQL jobs, the task names are quite long > which makes the job graph hard to read. We could auto collapse these task > names if they exceed a certain length, and provide an uncollapse button for > the full task names. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-14145: Description: The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns the wrong checkpoint if: * checkpoints are preferred ({{getLatestCheckpoint(true)}}), * the latest checkpoint is *not* a savepoint, * more than a single checkpoint is retained. The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. You can apply the patch via {{git am -3 < *.patch}}. was: The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns the wrong checkpoint if checkpoints are preferred ({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a savepoint. The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. You can apply the patch via {{git am -3 < *.patch}}. > getLatestCheckpoint(true) returns wrong checkpoint > -- > > Key: FLINK-14145 > URL: https://issues.apache.org/jira/browse/FLINK-14145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: > 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > > > The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns > the wrong checkpoint if: > * checkpoints are preferred ({{getLatestCheckpoint(true)}}), > * the latest checkpoint is *not* a savepoint, > * more than a single checkpoint is retained. > The current implementation assumes that the latest checkpoint is a savepoint > and skips over it. I attached a patch for > {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > You can apply the patch via {{git am -3 < *.patch}}. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
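The intended selection logic of FLINK-14145 — and the spot where the reported bug bites — can be sketched with a tiny model of the completed checkpoint store. Entries are ordered oldest to newest and flagged as savepoint or not; the names are hypothetical, only the semantics of {{getLatestCheckpoint(true)}} come from the ticket description:

```python
# Tiny model of CompletedCheckpointStore.getLatestCheckpoint(true).
# Entries are (checkpoint_id, is_savepoint) tuples, oldest to newest.

def get_latest_checkpoint(checkpoints, prefer_checkpoint):
    """Return the entry recovery should use, or None for an empty store."""
    if not checkpoints:
        return None
    latest = checkpoints[-1]
    if prefer_checkpoint and latest[1]:
        # The latest entry is a savepoint: fall back to the most recent
        # *checkpoint*, if one is retained.
        for candidate in reversed(checkpoints[:-1]):
            if not candidate[1]:
                return candidate
    # Either checkpoints are not preferred, or the latest entry already is
    # a checkpoint. The buggy implementation skipped the latest entry
    # unconditionally, which goes wrong exactly under the three conditions
    # listed in the description (preferred + latest not a savepoint +
    # more than one retained entry).
    return latest
```

With entries {{[cp1, cp2, sp3]}} and the flag set, this model falls back to cp2; with {{[cp1, sp2, cp3]}} it must return cp3, which is where skipping the latest entry unconditionally returns the wrong one.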
[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-14145: Attachment: (was: 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch) > getLatestCheckpoint(true) returns wrong checkpoint > -- > > Key: FLINK-14145 > URL: https://issues.apache.org/jira/browse/FLINK-14145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: > 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > > > The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns > the wrong checkpoint if checkpoints are preferred > ({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a > savepoint. > The current implementation assumes that the latest checkpoint is a savepoint > and skips over it. I attached a patch for > {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > You can apply the patch via {{git am -3 < *.patch}}. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-14145: Attachment: 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > getLatestCheckpoint(true) returns wrong checkpoint > -- > > Key: FLINK-14145 > URL: https://issues.apache.org/jira/browse/FLINK-14145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: > 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > > > The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns > the wrong checkpoint if checkpoints are preferred > ({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a > savepoint. > The current implementation assumes that the latest checkpoint is a savepoint > and skips over it. I attached a patch for > {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > You can apply the patch via {{git am -3 < *.patch}}. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-14145: Description: The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns the wrong checkpoint if checkpoints are preferred ({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a savepoint. The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. You can apply the patch via {{git am -3 < *.patch}}. was: The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns the wrong checkpoint if checkpoints are preferred ({{getLatestCheckpoint(true)}}). The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. You can apply the patch via {{git am -3 < *.patch}}. > getLatestCheckpoint(true) returns wrong checkpoint > -- > > Key: FLINK-14145 > URL: https://issues.apache.org/jira/browse/FLINK-14145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: > 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > > > The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns > the wrong checkpoint if checkpoints are preferred > ({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a > savepoint. > The current implementation assumes that the latest checkpoint is a savepoint > and skips over it. I attached a patch for > {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > You can apply the patch via {{git am -3 < *.patch}}. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-14145: Summary: getLatestCheckpoint(true) returns wrong checkpoint (was: getLatestCheckpoint returns wrong checkpoint) > getLatestCheckpoint(true) returns wrong checkpoint > -- > > Key: FLINK-14145 > URL: https://issues.apache.org/jira/browse/FLINK-14145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: > 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > > > The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns > the wrong checkpoint if checkpoints are preferred > ({{getLatestCheckpoint(true)}}). > The current implementation assumes that the latest checkpoint is a savepoint > and skips over it. I attached a patch for > {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > You can apply the patch via {{git am -3 < *.patch}}. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-14145: Description: The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns the wrong checkpoint if checkpoints are preferred ({{getLatestCheckpoint(true)}}). The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. You can apply the patch via {{git am -3 < *.patch}}. was: The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns the wrong checkpoint if checkpoints are preferred ({{getLatestCheckpoint(true)}}). The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. You can apply the patch via {{git am -3 < *.patch}}). > getLatestCheckpoint returns wrong checkpoint > > > Key: FLINK-14145 > URL: https://issues.apache.org/jira/browse/FLINK-14145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: > 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > > > The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns > the wrong checkpoint if checkpoints are preferred > ({{getLatestCheckpoint(true)}}). > The current implementation assumes that the latest checkpoint is a savepoint > and skips over it. I attached a patch for > {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > You can apply the patch via {{git am -3 < *.patch}}. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-14145: Description: The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns the wrong checkpoint if checkpoints are preferred ({{getLatestCheckpoint(true)}}). The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. You can apply the patch via {{git am -3 < *.patch}}). was: The flag to prefer checkpoints for recovery introduced in FLINK-11159 breaks returns the wrong checkpoint as the latest one if enabled. The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > getLatestCheckpoint returns wrong checkpoint > > > Key: FLINK-14145 > URL: https://issues.apache.org/jira/browse/FLINK-14145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: > 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > > > The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns > the wrong checkpoint if checkpoints are preferred > ({{getLatestCheckpoint(true)}}). > The current implementation assumes that the latest checkpoint is a savepoint > and skips over it. I attached a patch for > {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > You can apply the patch via {{git am -3 < *.patch}}). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-14145: Attachment: 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > getLatestCheckpoint returns wrong checkpoint > > > Key: FLINK-14145 > URL: https://issues.apache.org/jira/browse/FLINK-14145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: > 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch > > > The flag to prefer checkpoints for recovery introduced in FLINK-11159 breaks > returns the wrong checkpoint as the latest one if enabled. > The current implementation assumes that the latest checkpoint is a savepoint > and skips over it. I attached a patch for > {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint
Ufuk Celebi created FLINK-14145: --- Summary: getLatestCheckpoint returns wrong checkpoint Key: FLINK-14145 URL: https://issues.apache.org/jira/browse/FLINK-14145 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.9.0 Reporter: Ufuk Celebi The flag to prefer checkpoints for recovery introduced in FLINK-11159 breaks returns the wrong checkpoint as the latest one if enabled. The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13566) Support checkpoint configuration through flink-conf.yaml
[ https://issues.apache.org/jira/browse/FLINK-13566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900096#comment-16900096 ] Ufuk Celebi commented on FLINK-13566: - I see that choosing a good default checkpointing interval for session clusters with many different kinds of jobs can be challenging, but overall the analogy to default state backend configuration makes sense to me. In application cluster environments, which are becoming more commonly used, the argument about job- vs. cluster-level configuration doesn't even apply. I think this ticket warrants a small summary document describing how this behaves under interesting configuration combinations. For instance, if we can enable checkpointing by default, do we need to allow users to specifically disable checkpointing via their env? Can there be problems activating checkpointing by default? [~gyfora] Are you planning to work on this? [~aljoscha] What do you think about this? > Support checkpoint configuration through flink-conf.yaml > > > Key: FLINK-13566 > URL: https://issues.apache.org/jira/browse/FLINK-13566 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / Configuration >Reporter: Gyula Fora >Priority: Major > > Currently basic checkpointing configuration happens through the > StreamExecutionEnvironment and the CheckpointConfig class. > There is no way to configure checkpointing behaviour purely from the > flink-conf.yaml file (or provide a default checkpointing behaviour) as it > always needs to happen programmatically through the environment. > The checkpoint config settings are then translated down to the > CheckpointCoordinatorConfiguration which will control the runtime behaviour. > As checkpointing related settings are operational features that should not > affect the application logic I think we need to support configuring these > params through the flink-conf yaml. 
> In order to do this we probably need to rework the CheckpointConfig class so > that it distinguishes parameters that the user actually set from the defaults > (to support overriding what was set in the conf). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
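For illustration, the proposal could look like the following hypothetical flink-conf.yaml fragment. The checkpointing keys are invented for this sketch and are not an agreed-upon naming scheme; only the state backend keys mirror how cluster-wide defaults are configured today.

```yaml
# Hypothetical cluster-wide checkpointing defaults, analogous to the
# existing default state backend configuration. Key names below are
# illustrative only and not part of any agreed proposal.
execution.checkpointing.interval: 60s
execution.checkpointing.mode: EXACTLY_ONCE

# Existing-style state backend defaults for comparison.
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints
```

Programmatic settings from the CheckpointConfig would then need to override these defaults, which is why the ticket calls for distinguishing user-set parameters from defaults.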
[jira] [Closed] (FLINK-12813) Add Hadoop profile in building from source docs
[ https://issues.apache.org/jira/browse/FLINK-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-12813. --- Resolution: Won't Fix > Add Hadoop profile in building from source docs > --- > > Key: FLINK-12813 > URL: https://issues.apache.org/jira/browse/FLINK-12813 > Project: Flink > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.8.0 >Reporter: Ufuk Celebi >Priority: Trivial > > The docs for [building from source with > Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions] > omit the {{-Pinclude-hadoop}} profile in two code snippets. > The two code snippets that have {{-Dhadoop.version}} set, but no > {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as > well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-12813) Add Hadoop profile in building from source docs
[ https://issues.apache.org/jira/browse/FLINK-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-12813: Issue Type: Improvement (was: Bug) > Add Hadoop profile in building from source docs > --- > > Key: FLINK-12813 > URL: https://issues.apache.org/jira/browse/FLINK-12813 > Project: Flink > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.8.0 >Reporter: Ufuk Celebi >Priority: Trivial > > The docs for [building from source with > Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions] > omit the {{-Pinclude-hadoop}} profile in two code snippets. > The two code snippets that have {{-Dhadoop.version}} set, but no > {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as > well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-12813) Add Hadoop profile in building from source docs
[ https://issues.apache.org/jira/browse/FLINK-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861856#comment-16861856 ] Ufuk Celebi edited comment on FLINK-12813 at 6/12/19 8:11 AM: -- Makes sense. Since a user was also confused by this: What do you think about explicitly pointing this out in the other two sections? I'm changing this to be an {{Improvement}} ticket for now. If you don't think it's necessary, let's close this ticket, so it does not linger around. was (Author: uce): Makes sense. Since a user was also confused by this: What do you think about explicitly pointing this out in the other two sections? I'm changing this to be an {{Improvement}} ticket for now. If you don't think it's necessary, let's close this ticket, so it does not linger around. > Add Hadoop profile in building from source docs > --- > > Key: FLINK-12813 > URL: https://issues.apache.org/jira/browse/FLINK-12813 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.8.0 >Reporter: Ufuk Celebi >Priority: Trivial > > The docs for [building from source with > Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions] > omit the {{-Pinclude-hadoop}} profile in two code snippets. > The two code snippets that have {{-Dhadoop.version}} set, but no > {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as > well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-12813) Add Hadoop profile in building from source docs
[ https://issues.apache.org/jira/browse/FLINK-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861856#comment-16861856 ] Ufuk Celebi commented on FLINK-12813: - Makes sense. Since a user was also confused by this: What do you think about explicitly pointing this out in the other two sections? I'm changing this to be an {{Improvement}} ticket for now. If you don't think it's necessary, let's close this ticket, so it does not linger around. > Add Hadoop profile in building from source docs > --- > > Key: FLINK-12813 > URL: https://issues.apache.org/jira/browse/FLINK-12813 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.8.0 >Reporter: Ufuk Celebi >Priority: Trivial > > The docs for [building from source with > Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions] > omit the {{-Pinclude-hadoop}} profile in two code snippets. > The two code snippets that have {{-Dhadoop.version}} set, but no > {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as > well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12813) Add Hadoop profile in building from source docs
Ufuk Celebi created FLINK-12813: --- Summary: Add Hadoop profile in building from source docs Key: FLINK-12813 URL: https://issues.apache.org/jira/browse/FLINK-12813 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.8.0 Reporter: Ufuk Celebi The docs for [building from source with Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions] omit the {{-Pinclude-hadoop}} profile in two code snippets. The two code snippets that have {{-Dhadoop.version}} set, but no {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
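Per the ticket, the corrected snippets would combine both flags along these lines (the concrete Hadoop version below is only an example, not taken from the docs):

```sh
# Building from source against a specific Hadoop version: the point of
# FLINK-12813 is that -Dhadoop.version alone is not enough; the
# -Pinclude-hadoop profile must be activated as well.
mvn clean install -DskipTests -Pinclude-hadoop -Dhadoop.version=2.6.5
```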
[jira] [Updated] (FLINK-12313) SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable
[ https://issues.apache.org/jira/browse/FLINK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-12313: Issue Type: Bug (was: Test) > SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints > is unstable > --- > > Key: FLINK-12313 > URL: https://issues.apache.org/jira/browse/FLINK-12313 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Reporter: Ufuk Celebi >Priority: Critical > Labels: test-stability > > {{SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints}} > fails and prints the Thread stack traces due to no output on Travis > occasionally. > {code} > == > Printing stack trace of Java process 10071 > == > 2019-04-24 07:55:29 > Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode): > "Attach Listener" #17 daemon prio=9 os_prio=0 tid=0x7f294892 > nid=0x2cf5 waiting on condition [0x] >java.lang.Thread.State: RUNNABLE > "Async calls on Test Task (1/1)" #15 daemon prio=5 os_prio=0 > tid=0x7f2948dd1800 nid=0x27a9 waiting on condition [0x7f292cea9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x8bb5e558> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > "Async calls on Test Task (1/1)" #14 daemon prio=5 os_prio=0 > tid=0x7f2948dce800 nid=0x27a8 in Object.wait() 
[0x7f292cfaa000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x8bac58f8> (a java.lang.Object) > at java.lang.Object.wait(Object.java:502) > at > org.apache.flink.streaming.runtime.tasks.SynchronousSavepointLatch.blockUntilCheckpointIsAcknowledged(SynchronousSavepointLatch.java:66) > - locked <0x8bac58f8> (a java.lang.Object) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:726) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:604) > at > org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.triggerCheckpoint(SynchronousCheckpointITCase.java:174) > at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1182) > at > java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > "CloseableReaperThread" #13 daemon prio=5 os_prio=0 tid=0x7f2948d9b800 > nid=0x27a7 in Object.wait() [0x7f292d0ab000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock) > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) > - locked <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock) > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164) > at > org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193) > "Test Task (1/1)" #12 prio=5 os_prio=0 tid=0x7f2948d97000 nid=0x27a6 in > Object.wait() [0x7f292d1ac000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x8e63f7d8> (a java.lang.Object) > at 
java.lang.Object.wait(Object.java:502) > at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:63) > - locked <0x8e63f7d8> (a java.lang.Object) > at > org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.
[jira] [Created] (FLINK-12313) SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable
Ufuk Celebi created FLINK-12313: --- Summary: SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable Key: FLINK-12313 URL: https://issues.apache.org/jira/browse/FLINK-12313 Project: Flink Issue Type: Test Components: Runtime / Checkpointing Reporter: Ufuk Celebi {{SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints}} fails and prints the Thread stack traces due to no output on Travis occasionally. {code} == Printing stack trace of Java process 10071 == 2019-04-24 07:55:29 Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode): "Attach Listener" #17 daemon prio=9 os_prio=0 tid=0x7f294892 nid=0x2cf5 waiting on condition [0x] java.lang.Thread.State: RUNNABLE "Async calls on Test Task (1/1)" #15 daemon prio=5 os_prio=0 tid=0x7f2948dd1800 nid=0x27a9 waiting on condition [0x7f292cea9000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x8bb5e558> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) "Async calls on Test Task (1/1)" #14 daemon prio=5 os_prio=0 tid=0x7f2948dce800 nid=0x27a8 in Object.wait() [0x7f292cfaa000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8bac58f8> (a java.lang.Object) at java.lang.Object.wait(Object.java:502) at 
org.apache.flink.streaming.runtime.tasks.SynchronousSavepointLatch.blockUntilCheckpointIsAcknowledged(SynchronousSavepointLatch.java:66) - locked <0x8bac58f8> (a java.lang.Object) at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:726) at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:604) at org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.triggerCheckpoint(SynchronousCheckpointITCase.java:174) at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1182) at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) "CloseableReaperThread" #13 daemon prio=5 os_prio=0 tid=0x7f2948d9b800 nid=0x27a7 in Object.wait() [0x7f292d0ab000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) - locked <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164) at org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193) "Test Task (1/1)" #12 prio=5 os_prio=0 tid=0x7f2948d97000 nid=0x27a6 in Object.wait() [0x7f292d1ac000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8e63f7d8> (a java.lang.Object) at java.lang.Object.wait(Object.java:502) at org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:63) - locked <0x8e63f7d8> (a java.lang.Object) at 
org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.run(SynchronousCheckpointITCase.java:161) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:335) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:724) at java.lang.Thread.run(Thread.java:748) "process reaper" #11 daemon prio=10 os_prio=0 tid=0x7f294885e000 nid=0x2793 waiting on condition [0x7f292d7e5000] java
[jira] [Closed] (FLINK-11534) Don't exit JVM after job termination with standalone job
[ https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-11534. --- Resolution: Fixed Fix Version/s: 1.9.0 > Don't exit JVM after job termination with standalone job > > > Key: FLINK-11534 > URL: https://issues.apache.org/jira/browse/FLINK-11534 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Affects Versions: 1.7.0, 1.8.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 20m > Remaining Estimate: 0h > > If a job deployed in job cluster mode terminates, the JVM running the > StandaloneJobClusterEntryPoint will exit via System.exit(1). > When managing such a job this requires access to external systems for logging > in order to get more details about failure causes or final termination status. > I believe that there is value in having a StandaloneJobClusterEntryPoint > option that does not exit the JVM after the job has terminated. This allows > users to gather further information if they are monitoring the job and > manually tear down the process. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11534) Don't exit JVM after job termination with standalone job
[ https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11534: Affects Version/s: 1.8.0 > Don't exit JVM after job termination with standalone job > > > Key: FLINK-11534 > URL: https://issues.apache.org/jira/browse/FLINK-11534 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Affects Versions: 1.7.0, 1.8.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > > If a job deployed in job cluster mode terminates, the JVM running the > StandaloneJobClusterEntryPoint will exit via System.exit(1). > When managing such a job this requires access to external systems for logging > in order to get more details about failure causes or final termination status. > I believe that there is value in having a StandaloneJobClusterEntryPoint > option that does not exit the JVM after the job has terminated. This allows > users to gather further information if they are monitoring the job and > manually tear down the process. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-8093) flink job fail because of kafka producer create fail of "javax.management.InstanceAlreadyExistsException"
[ https://issues.apache.org/jira/browse/FLINK-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805723#comment-16805723 ] Ufuk Celebi commented on FLINK-8093: Yes, I agree with [~NicoK]. We need to dynamically set this per subtask. Otherwise there is no reliable way to work around this. > flink job fail because of kafka producer create fail of > "javax.management.InstanceAlreadyExistsException" > - > > Key: FLINK-8093 > URL: https://issues.apache.org/jira/browse/FLINK-8093 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.3.2 > Environment: flink 1.3.2, kafka 0.9.1 >Reporter: dongtingting >Priority: Critical > > One TaskManager has multiple task slots; one task fails because creating the > KafkaProducer fails. The reason for the failure is > “javax.management.InstanceAlreadyExistsException: > kafka.producer:type=producer-metrics,client-id=producer-3”. The detailed trace > is: > 2017-11-04 19:41:23,281 INFO org.apache.flink.runtime.taskmanager.Task > - Source: Custom Source -> Filter -> Map -> Filter -> Sink: > dp_client_**_log (7/80) (99551f3f892232d7df5eb9060fa9940c) switched from > RUNNING to FAILED. 
> org.apache.kafka.common.KafkaException: Failed to construct kafka producer > at > org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:321) > at > org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:181) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.getKafkaProducer(FlinkKafkaProducerBase.java:202) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.open(FlinkKafkaProducerBase.java:212) > at > org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36) > at > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:111) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:375) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:252) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.kafka.common.KafkaException: Error registering mbean > kafka.producer:type=producer-metrics,client-id=producer-3 > at > org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:159) > at > org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:77) > at > org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:288) > at org.apache.kafka.common.metrics.Metrics.addMetric(Metrics.java:255) > at org.apache.kafka.common.metrics.Metrics.addMetric(Metrics.java:239) > at > org.apache.kafka.clients.producer.internals.RecordAccumulator.registerMetrics(RecordAccumulator.java:137) > at > org.apache.kafka.clients.producer.internals.RecordAccumulator.(RecordAccumulator.java:111) > at > org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:261) > ... 
9 more > Caused by: javax.management.InstanceAlreadyExistsException: > kafka.producer:type=producer-metrics,client-id=producer-3 > at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324) > at > com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522) > at > org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:157) > ... 16 more > I suspect that tasks in different task slots of one TaskManager use different > classloaders, and the task id may be the same within one process. This leads to > KafkaProducer creation failures within one TaskManager. > Has anybody encountered the same problem? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
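A per-subtask client id, as suggested in the comment above, would give each producer a distinct JMX MBean name and avoid the collision. A minimal sketch — the suffix scheme and the `flink-producer` default are assumptions for illustration, not Flink's actual implementation:

```java
import java.util.Properties;

public class UniqueClientId {
    // Derive a per-subtask client.id so that each KafkaProducer registers a
    // distinct MBean (kafka.producer:type=producer-metrics,client-id=...).
    static Properties withSubtaskClientId(Properties base, int subtaskIndex) {
        Properties props = new Properties();
        props.putAll(base);
        String baseId = base.getProperty("client.id", "flink-producer");
        props.setProperty("client.id", baseId + "-" + subtaskIndex);
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("bootstrap.servers", "broker:9092");
        // Two subtasks in the same JVM now get distinct client ids.
        System.out.println(withSubtaskClientId(base, 0).getProperty("client.id"));
        System.out.println(withSubtaskClientId(base, 1).getProperty("client.id"));
    }
}
```

The derived properties would then be passed to the producer constructor; `client.id` is Kafka's standard producer configuration key for naming the metrics MBean.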
[jira] [Created] (FLINK-12060) Unify change pom version scripts
Ufuk Celebi created FLINK-12060: --- Summary: Unify change pom version scripts Key: FLINK-12060 URL: https://issues.apache.org/jira/browse/FLINK-12060 Project: Flink Issue Type: Bug Components: Deployment / Scripts Affects Versions: 1.8.0 Reporter: Ufuk Celebi We have three places in `tools` that we use to update the pom versions when releasing: 1. https://github.com/apache/flink/blob/048367b/tools/change-version.sh#L31 2. https://github.com/apache/flink/blob/048367b/tools/releasing/create_release_branch.sh#L60 3. https://github.com/apache/flink/blob/048367b/tools/releasing/update_branch_version.sh#L52 The 1st option is buggy (it does not work with the new versioning of the shaded Hadoop build, e.g. {{2.4.1-1.9-SNAPSHOT}} will not be replaced). The 2nd and 3rd work for pom files, but the 2nd one misses a change for the doc version that is present in the 3rd one. I think we should unify these and call them where needed instead of duplicating this code in unexpected ways. An initial quick fix could remove the 1st script and update the 2nd one to match the 3rd one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11953) Introduce Plugin/Loading system and integrate it with FileSystem
[ https://issues.apache.org/jira/browse/FLINK-11953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795743#comment-16795743 ] Ufuk Celebi commented on FLINK-11953: - [~srichter] [Presto's plugin system|https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/server/PluginManager.java] could potentially serve as inspiration. I'm sure you already have a plan here, just bringing it up as FYI. > Introduce Plugin/Loading system and integrate it with FileSystem > > > Key: FLINK-11953 > URL: https://issues.apache.org/jira/browse/FLINK-11953 > Project: Flink > Issue Type: Sub-task > Components: Connectors / FileSystem >Affects Versions: 1.9.0 >Reporter: Stefan Richter >Assignee: Stefan Richter >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11798) Incorrect Kubernetes Documentation
[ https://issues.apache.org/jira/browse/FLINK-11798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786870#comment-16786870 ] Ufuk Celebi commented on FLINK-11798: - Do you mean the README in {{flink-container/kubernetes}}? > Incorrect Kubernetes Documentation > -- > > Key: FLINK-11798 > URL: https://issues.apache.org/jira/browse/FLINK-11798 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.7.2 >Reporter: Pritesh Patel >Priority: Major > > I have been trying to use the kubernetes session cluster manifests provided > in the documentation. The -Dtaskmanager.host flag doesn't seem to pass > through, meaning it uses the pod name as the host name. This won't work. > The current docs state the args should be: > > {code:java} > args: > - taskmanager > - "-Dtaskmanager.host=$(K8S_POD_IP)" > {code} > > I did manage to get it to work by using this manifest for the taskmanager > instead. This wasted a lot of time, as it was very hard to find. > {code:java} > args: > - taskmanager.sh > - -Dtaskmanager.host=$(K8S_POD_IP) > - -Djobmanager.rpc.address=$(JOB_MANAGER_RPC_ADDRESS) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11798) Incorrect Kubernetes Documentation
[ https://issues.apache.org/jira/browse/FLINK-11798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786145#comment-16786145 ] Ufuk Celebi commented on FLINK-11798: - [~1u0] Could you take a look at this? > Incorrect Kubernetes Documentation > -- > > Key: FLINK-11798 > URL: https://issues.apache.org/jira/browse/FLINK-11798 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.7.2 >Reporter: Pritesh Patel >Priority: Major > > I have been trying to use the kubernetes session cluster manifests provided > in the documentation. The -Dtaskmanager.host flag doesn't seem to pass > through, meaning it uses the pod name as the host name. This won't work. > The current docs state the args should be: > > {code:java} > args: > - taskmanager > - "-Dtaskmanager.host=$(K8S_POD_IP)" > {code} > > I did manage to get it to work by using this manifest for the taskmanager > instead. This wasted a lot of time, as it was very hard to find. > {code:java} > args: > - taskmanager.sh > - -Dtaskmanager.host=$(K8S_POD_IP) > - -Djobmanager.rpc.address=$(JOB_MANAGER_RPC_ADDRESS) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
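For the quoted args to resolve {{$(K8S_POD_IP)}}, the pod spec also needs to define that env var via the Kubernetes downward API. A minimal container fragment sketching this (image tag and service name are illustrative, not from the Flink manifests):

```yaml
containers:
  - name: taskmanager
    image: flink:1.7.2
    env:
      # Downward API: expose the pod IP so $(K8S_POD_IP) expands in args.
      - name: K8S_POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: JOB_MANAGER_RPC_ADDRESS
        value: flink-jobmanager
    args:
      - taskmanager.sh
      - -Dtaskmanager.host=$(K8S_POD_IP)
      - -Djobmanager.rpc.address=$(JOB_MANAGER_RPC_ADDRESS)
```

Kubernetes only substitutes `$(VAR)` in `args` for env vars declared on the container, which is why the working manifest depends on both pieces together.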
[jira] [Commented] (FLINK-11847) Docker job-specific image creation instruction make reference to --job-jar parameter that does not actually exist in the script
[ https://issues.apache.org/jira/browse/FLINK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786143#comment-16786143 ] Ufuk Celebi commented on FLINK-11847: - {{README.md}} refers to [https://github.com/apache/flink/blob/release-1.7/flink-container/docker/build.sh] (not the one in {{flink-contrib}} which is independent). > Docker job-specific image creation instruction make reference to --job-jar > parameter that does not actually exist in the script > --- > > Key: FLINK-11847 > URL: https://issues.apache.org/jira/browse/FLINK-11847 > Project: Flink > Issue Type: Bug > Components: Deployment / Docker, Deployment / Kubernetes >Affects Versions: 1.7.2 >Reporter: Frank Wilson >Priority: Major > > Documentation [1] refers to --job-jar parameter that does not exist in the > script [2]. > > > [1]https://github.com/apache/flink/blob/release-1.7/flink-container/docker/README.md#building-the-docker-image > [2] > https://github.com/apache/flink/blob/release-1.7/flink-contrib/docker-flink/build.sh#L24 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
[ https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11533: Release Note: The container entry point will scan the Flink class path for the job jar. Therefore, the --job-classname command line argument is now optional. > Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever > --- > > Key: FLINK-11533 > URL: https://issues.apache.org/jira/browse/FLINK-11533 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.8.0, 1.9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Users running job clusters distribute their user code as part of the shared > classpath of all cluster components. We currently require users running > {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR > manifest entries that specify the main class of a JAR are ignored since they > are simply part of the classpath. > I propose to add another optional command line argument to the > {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file > (such as {{lib/usercode.jar}}) and whose Manifest is respected. > Arguments: > {code} > --job-jar > --job-classname name > {code} > Each argument is optional, but at least one of the two is required. The > job-classname has precedence over job-jar. > Implementation wise we should be able to simply create the PackagedProgram > from the jar file path in ClassPathJobGraphRetriever. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11752) Move flink-python to opt
[ https://issues.apache.org/jira/browse/FLINK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11752: Release Note: flink-python has been moved from lib to opt and is not available on the default class path any more. > Move flink-python to opt > > > Key: FLINK-11752 > URL: https://issues.apache.org/jira/browse/FLINK-11752 > Project: Flink > Issue Type: Improvement > Components: API / Python, Build System >Affects Versions: 1.7.2 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.8.0, 1.9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > As discussed on the [dev mailing > list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html], > we should move flink-python to opt instead of having it in lib by default. > The streaming counterpart (flink-streaming-python) is already only available in opt. > I think we don't have many users of the Python batch API, and this will make > the streaming/batch experience more consistent and result in a cleaner > default classpath. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
[ https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-11533. --- Resolution: Fixed Fix Version/s: 1.8.0 > Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever > --- > > Key: FLINK-11533 > URL: https://issues.apache.org/jira/browse/FLINK-11533 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.8.0, 1.9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Users running job clusters distribute their user code as part of the shared > classpath of all cluster components. We currently require users running > {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR > manifest entries that specify the main class of a JAR are ignored since they > are simply part of the classpath. > I propose to add another optional command line argument to the > {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file > (such as {{lib/usercode.jar}}) and whose Manifest is respected. > Arguments: > {code} > --job-jar > --job-classname name > {code} > Each argument is optional, but at least one of the two is required. The > job-classname has precedence over job-jar. > Implementation wise we should be able to simply create the PackagedProgram > from the jar file path in ClassPathJobGraphRetriever. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
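The manifest lookup described in the ticket can be sketched with the stdlib `java.util.jar` API. The class, jar, and job names here are hypothetical for illustration — this is not Flink's actual ClassPathJobGraphRetriever code:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ManifestMainClass {
    // Read the Main-Class manifest attribute -- the lookup that makes
    // --job-classname optional when the jar declares its entry point.
    static String mainClassOf(File jar) throws Exception {
        try (JarFile jf = new JarFile(jar)) {
            Manifest mf = jf.getManifest();
            if (mf == null) return null;
            return mf.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny jar with a Main-Class entry for demonstration.
        Manifest mf = new Manifest();
        mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        mf.getMainAttributes().put(Attributes.Name.MAIN_CLASS, "com.example.MyJob");
        File jar = File.createTempFile("usercode", ".jar");
        jar.deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar), mf)) {
            // empty jar body; the manifest is all we need here
        }
        System.out.println(mainClassOf(jar));
    }
}
```

With this lookup in place, an explicit `--job-classname` can simply take precedence over the manifest value, matching the precedence rule described in the ticket.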
[jira] [Updated] (FLINK-11752) Move flink-python to opt
[ https://issues.apache.org/jira/browse/FLINK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11752: Fix Version/s: 1.8.0 > Move flink-python to opt > > > Key: FLINK-11752 > URL: https://issues.apache.org/jira/browse/FLINK-11752 > Project: Flink > Issue Type: Improvement > Components: API / Python, Build System >Affects Versions: 1.7.2 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.8.0, 1.9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > As discussed on the [dev mailing > list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html], > we should move flink-python to opt instead of having it in lib by default. > The streaming counterpart (flink-streaming-python) is already only available in opt. > I think we don't have many users of the Python batch API, and this will make > the streaming/batch experience more consistent and result in a cleaner > default classpath. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-4387) Instability in KvStateClientTest.testClientServerIntegration()
[ https://issues.apache.org/jira/browse/FLINK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781394#comment-16781394 ] Ufuk Celebi commented on FLINK-4387: I think as [~yunta] says, this should have been fixed when Netty was upgraded. Furthermore, the referenced test class does not exist anymore. [~rmetzger] I think we can close this. > Instability in KvStateClientTest.testClientServerIntegration() > -- > > Key: FLINK-4387 > URL: https://issues.apache.org/jira/browse/FLINK-4387 > Project: Flink > Issue Type: Bug > Components: Runtime / Queryable State >Affects Versions: 1.1.0, 1.5.0, 1.6.0 >Reporter: Robert Metzger >Assignee: Nico Kruber >Priority: Major > Labels: test-stability > Fix For: 1.2.0, 1.8.0 > > > According to this log: > https://s3.amazonaws.com/archive.travis-ci.org/jobs/151491745/log.txt > the {{KvStateClientTest}} didn't complete. > {code} > "main" #1 prio=5 os_prio=0 tid=0x7fb2b400a000 nid=0x29dc in Object.wait() > [0x7fb2bcb3b000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xf7c049a0> (a > io.netty.util.concurrent.DefaultPromise) > at java.lang.Object.wait(Object.java:502) > at > io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:254) > - locked <0xf7c049a0> (a > io.netty.util.concurrent.DefaultPromise) > at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:32) > at > org.apache.flink.runtime.query.netty.KvStateServer.shutDown(KvStateServer.java:185) > at > org.apache.flink.runtime.query.netty.KvStateClientTest.testClientServerIntegration(KvStateClientTest.java:680) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} > and > {code} > Exception in thread "globalEventExecutor-1-3" java.lang.AssertionError > at > io.netty.util.concurrent.AbstractScheduledEventExecutor.pollScheduledTask(AbstractScheduledEventExecutor.java:83) > at > 
io.netty.util.concurrent.GlobalEventExecutor.fetchFromScheduledTaskQueue(GlobalEventExecutor.java:110) > at > io.netty.util.concurrent.GlobalEventExecutor.takeTask(GlobalEventExecutor.java:95) > at > io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:226) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
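The thread dump shows {{shutDown}} blocking indefinitely on a Netty {{DefaultPromise}}; a common mitigation for such test hangs is to bound the wait. Sketched here with a plain {{CompletableFuture}} standing in for the Netty future (illustrative, not the actual KvStateServer code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedShutdown {
    // Wait for a shutdown future, but give up after a timeout instead of
    // blocking the test (and the CI build) forever.
    static boolean awaitShutdown(CompletableFuture<Void> closed, long timeoutMs) {
        try {
            closed.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // shutdown did not complete in time
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> done = CompletableFuture.completedFuture(null);
        CompletableFuture<Void> stuck = new CompletableFuture<>();
        System.out.println(awaitShutdown(done, 100));   // completes immediately
        System.out.println(awaitShutdown(stuck, 100));  // times out
    }
}
```

Netty futures offer the analogous `await(timeout, unit)`, which returns rather than parking forever, so a bounded variant of the same pattern applies directly to the `DefaultPromise` in the trace.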
[jira] [Closed] (FLINK-1833) Refactor partition availability notification in ExecutionGraph
[ https://issues.apache.org/jira/browse/FLINK-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-1833. -- Resolution: Won't Do > Refactor partition availability notification in ExecutionGraph > -- > > Key: FLINK-1833 > URL: https://issues.apache.org/jira/browse/FLINK-1833 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 0.10.0 >Reporter: Ufuk Celebi >Priority: Major > > The mechanism to notify the JobManager about available result partitions is > hard to understand. There are two parts to this: > 1) JobManager > - The deployment of receivers happens in the Execution class although it is > by now totally unrelated to the state of a specific execution. I propose to > move this to the respective IntermediateResultPartition. > - The deployment information for a receiver is spread across different > components: when creating the TaskDeploymentDescriptor and the "caching" of > partition infos at the consuming vertex. This is very hard to follow and > results in unnecessary messages being sent (which are discarded at the TM). > 2) TaskManager > - Pipelined results notify where you would expect it in the ResultPartition, > but blocking results don't have an extra message and are implicitly > piggy-backed to the final state transition, after which the job manager > deploys receivers if all blocking partitions of a result have been produced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
[ https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11533: Fix Version/s: 1.9.0 > Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever > --- > > Key: FLINK-11533 > URL: https://issues.apache.org/jira/browse/FLINK-11533 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Users running job clusters distribute their user code as part of the shared > classpath of all cluster components. We currently require users running > {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR > manifest entries that specify the main class of a JAR are ignored since they > are simply part of the classpath. > I propose to add another optional command line argument to the > {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file > (such as {{lib/usercode.jar}}) and whose Manifest is respected. > Arguments: > {code} > --job-jar > --job-classname name > {code} > Each argument is optional, but at least one of the two is required. The > job-classname has precedence over job-jar. > Implementation wise we should be able to simply create the PackagedProgram > from the jar file path in ClassPathJobGraphRetriever. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11784) KryoSerializerSnapshotTest occasionally fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-11784. --- Resolution: Cannot Reproduce > KryoSerializerSnapshotTest occasionally fails on Travis > --- > > Key: FLINK-11784 > URL: https://issues.apache.org/jira/browse/FLINK-11784 > Project: Flink > Issue Type: Bug > Components: API / Type Serialization System >Affects Versions: 1.8.0 >Reporter: Ufuk Celebi >Priority: Blocker > Labels: test-stability > Fix For: 1.8.0 > > > {{KryoSerializerSnapshotTest}} fails occasionally with: > {code:java} > 11:37:44.198 [ERROR] > tryingToRestoreWithNonExistingClassShouldBeIncompatible(org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest) > Time elapsed: 0.011 s <<< ERROR! > java.io.EOFException > at > org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.kryoSnapshotWithMissingClass(KryoSerializerSnapshotTest.java:120) > at > org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.tryingToRestoreWithNonExistingClassShouldBeIncompatible(KryoSerializerSnapshotTest.java:105){code} > See [https://travis-ci.org/apache/flink/jobs/499371953] for full build output > (as part of a PR with unrelated changes). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11784) KryoSerializerSnapshotTest occasionally fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11784: Fix Version/s: (was: 1.8.0) > KryoSerializerSnapshotTest occasionally fails on Travis > --- > > Key: FLINK-11784 > URL: https://issues.apache.org/jira/browse/FLINK-11784 > Project: Flink > Issue Type: Bug > Components: API / Type Serialization System >Affects Versions: 1.8.0 >Reporter: Ufuk Celebi >Priority: Blocker > Labels: test-stability > > {{KryoSerializerSnapshotTest}} fails occasionally with: > {code:java} > 11:37:44.198 [ERROR] > tryingToRestoreWithNonExistingClassShouldBeIncompatible(org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest) > Time elapsed: 0.011 s <<< ERROR! > java.io.EOFException > at > org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.kryoSnapshotWithMissingClass(KryoSerializerSnapshotTest.java:120) > at > org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.tryingToRestoreWithNonExistingClassShouldBeIncompatible(KryoSerializerSnapshotTest.java:105){code} > See [https://travis-ci.org/apache/flink/jobs/499371953] for full build output > (as part of a PR with unrelated changes). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11784) KryoSerializerSnapshotTest occasionally fails on Travis
Ufuk Celebi created FLINK-11784: --- Summary: KryoSerializerSnapshotTest occasionally fails on Travis Key: FLINK-11784 URL: https://issues.apache.org/jira/browse/FLINK-11784 Project: Flink Issue Type: Bug Components: API / Type Serialization System Reporter: Ufuk Celebi {{KryoSerializerSnapshotTest}} fails occasionally with: {code:java} 11:37:44.198 [ERROR] tryingToRestoreWithNonExistingClassShouldBeIncompatible(org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest) Time elapsed: 0.011 s <<< ERROR! java.io.EOFException at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.kryoSnapshotWithMissingClass(KryoSerializerSnapshotTest.java:120) at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.tryingToRestoreWithNonExistingClassShouldBeIncompatible(KryoSerializerSnapshotTest.java:105){code} See [https://travis-ci.org/apache/flink/jobs/499371953] for full build output (as part of a PR with unrelated changes). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11752) Move flink-python to opt
[ https://issues.apache.org/jira/browse/FLINK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-11752. --- Resolution: Fixed Fix Version/s: 1.9.0 > Move flink-python to opt > > > Key: FLINK-11752 > URL: https://issues.apache.org/jira/browse/FLINK-11752 > Project: Flink > Issue Type: Improvement > Components: API / Python, Build System >Affects Versions: 1.7.2 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 20m > Remaining Estimate: 0h > > As discussed on the [dev mailing > list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html], > we should move flink-python to opt instead of having it in lib by default. > The streaming counterpart (flink-streaming-python) is already only available in opt. > I think we don't have many users of the Python batch API, and this will make > the streaming/batch experience more consistent and result in a cleaner > default classpath. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-1275) Add support to compress network I/O
[ https://issues.apache.org/jira/browse/FLINK-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778922#comment-16778922 ] Ufuk Celebi commented on FLINK-1275: [~yanghua] It has been a long time since I created this ticket. The original idea was to use lightweight compression (e.g. Snappy) for data that is transferred via the task managers. I haven't been active on the network stack for a while, so I can't judge what the current plans are for this. I think that [~zjwang] or [~pnowojski] can provide more input on this. > Add support to compress network I/O > --- > > Key: FLINK-1275 > URL: https://issues.apache.org/jira/browse/FLINK-1275 > Project: Flink > Issue Type: Improvement > Components: Network >Affects Versions: 0.8.0 >Reporter: Ufuk Celebi >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11752) Move flink-python to opt
Ufuk Celebi created FLINK-11752: --- Summary: Move flink-python to opt Key: FLINK-11752 URL: https://issues.apache.org/jira/browse/FLINK-11752 Project: Flink Issue Type: Improvement Components: Build System, Python API Affects Versions: 1.7.2 Reporter: Ufuk Celebi Assignee: Ufuk Celebi As discussed on the [dev mailing list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html], we should move flink-python to opt instead of having it in lib by default. The streaming counterpart (flink-streaming-python) is already only available in opt. I think we don't have many users of the Python batch API, and this will make the streaming/batch experience more consistent and result in a cleaner default classpath. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11545) Add option to manually set job ID in StandaloneJobClusterEntryPoint
[ https://issues.apache.org/jira/browse/FLINK-11545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-11545. --- Resolution: Fixed Fix Version/s: 1.8.0 > Add option to manually set job ID in StandaloneJobClusterEntryPoint > --- > > Key: FLINK-11545 > URL: https://issues.apache.org/jira/browse/FLINK-11545 > Project: Flink > Issue Type: Sub-task > Components: Cluster Management, Kubernetes >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.8.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Add an option to specify the job ID during job submissions via the > StandaloneJobClusterEntryPoint. The entry point fixes the job ID to be all > zeros currently. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11544) Add option to manually set job ID for job submissions via REST API
[ https://issues.apache.org/jira/browse/FLINK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-11544. --- Resolution: Fixed Fix Version/s: 1.8.0 > Add option to manually set job ID for job submissions via REST API > -- > > Key: FLINK-11544 > URL: https://issues.apache.org/jira/browse/FLINK-11544 > Project: Flink > Issue Type: Sub-task > Components: REST >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > Labels: pull-request-available > Fix For: 1.8.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Add an option to specify the job ID during job submissions via the REST API. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
[ https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-11533: --- Assignee: Ufuk Celebi (was: vinoyang) > Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever > --- > > Key: FLINK-11533 > URL: https://issues.apache.org/jira/browse/FLINK-11533 > Project: Flink > Issue Type: New Feature > Components: Cluster Management >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > > Users running job clusters distribute their user code as part of the shared > classpath of all cluster components. We currently require users running > {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR > manifest entries that specify the main class of a JAR are ignored since they > are simply part of the classpath. > I propose to add another optional command line argument to the > {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file > (such as {{lib/usercode.jar}}) and whose Manifest is respected. > Arguments: > {code} > --job-jar > --job-classname name > {code} > Each argument is optional, but at least one of the two is required. The > job-classname has precedence over job-jar. > Implementation wise we should be able to simply create the PackagedProgram > from the jar file path in ClassPathJobGraphRetriever. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
[ https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763366#comment-16763366 ] Ufuk Celebi commented on FLINK-11533: - [~yanghua] I assigned myself. Sorry for the confusion. As I wrote in the original ticket description, I was waiting to get confirmation for this ticket before providing the implementation (I already have it). > Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever > --- > > Key: FLINK-11533 > URL: https://issues.apache.org/jira/browse/FLINK-11533 > Project: Flink > Issue Type: New Feature > Components: Cluster Management >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > > Users running job clusters distribute their user code as part of the shared > classpath of all cluster components. We currently require users running > {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR > manifest entries that specify the main class of a JAR are ignored since they > are simply part of the classpath. > I propose to add another optional command line argument to the > {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file > (such as {{lib/usercode.jar}}) and whose Manifest is respected. > Arguments: > {code} > --job-jar > --job-classname name > {code} > Each argument is optional, but at least one of the two is required. The > job-classname has precedence over job-jar. > Implementation wise we should be able to simply create the PackagedProgram > from the jar file path in ClassPathJobGraphRetriever. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11534) Don't exit JVM after job termination with standalone job
[ https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763365#comment-16763365 ] Ufuk Celebi commented on FLINK-11534: - [~yanghua] I assigned myself. Sorry for the confusion. As I wrote in the original ticket description, I was waiting to get confirmation for this ticket before providing the implementation (I already have it). > Don't exit JVM after job termination with standalone job > > > Key: FLINK-11534 > URL: https://issues.apache.org/jira/browse/FLINK-11534 > Project: Flink > Issue Type: New Feature > Components: Cluster Management >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > > If a job deployed in job cluster mode terminates, the JVM running the > StandaloneJobClusterEntryPoint will exit via System.exit(1). > When managing such a job this requires access to external systems for logging > in order to get more details about failure causes or final termination status. > I believe that there is value in having a StandaloneJobClusterEntryPoint > option that does not exit the JVM after the job has terminated. This allows > users to gather further information if they are monitoring the job and > manually tear down the process. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-11534) Don't exit JVM after job termination with standalone job
[ https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-11534: --- Assignee: Ufuk Celebi (was: vinoyang) > Don't exit JVM after job termination with standalone job > > > Key: FLINK-11534 > URL: https://issues.apache.org/jira/browse/FLINK-11534 > Project: Flink > Issue Type: New Feature > Components: Cluster Management >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > > If a job deployed in job cluster mode terminates, the JVM running the > StandaloneJobClusterEntryPoint will exit via System.exit(1). > When managing such a job this requires access to external systems for logging > in order to get more details about failure causes or final termination status. > I believe that there is value in having a StandaloneJobClusterEntryPoint > option that does not exit the JVM after the job has terminated. This allows > users to gather further information if they are monitoring the job and > manually tear down the process. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11544) Add option to manually set job ID for job submissions via REST API
[ https://issues.apache.org/jira/browse/FLINK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11544: Summary: Add option to manually set job ID for job submissions via REST API (was: Add option manually set job ID for job submissions via REST API) > Add option to manually set job ID for job submissions via REST API > -- > > Key: FLINK-11544 > URL: https://issues.apache.org/jira/browse/FLINK-11544 > Project: Flink > Issue Type: Sub-task > Components: REST >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > > Add an option to specify the job ID during job submissions via the REST API. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11544) Add option manually set job ID for job submissions via REST API
Ufuk Celebi created FLINK-11544: --- Summary: Add option manually set job ID for job submissions via REST API Key: FLINK-11544 URL: https://issues.apache.org/jira/browse/FLINK-11544 Project: Flink Issue Type: Sub-task Components: REST Affects Versions: 1.7.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Add an option to specify the job ID during job submissions via the REST API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11545) Add option to manually set job ID in StandaloneJobClusterEntryPoint
[ https://issues.apache.org/jira/browse/FLINK-11545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11545: Priority: Minor (was: Major) > Add option to manually set job ID in StandaloneJobClusterEntryPoint > --- > > Key: FLINK-11545 > URL: https://issues.apache.org/jira/browse/FLINK-11545 > Project: Flink > Issue Type: Sub-task > Components: Cluster Management, Kubernetes >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi >Priority: Minor > > Add an option to specify the job ID during job submissions via the > StandaloneJobClusterEntryPoint. The entry point fixes the job ID to be all > zeros currently. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11546) Add option to manually set job ID in CLI
Ufuk Celebi created FLINK-11546: --- Summary: Add option to manually set job ID in CLI Key: FLINK-11546 URL: https://issues.apache.org/jira/browse/FLINK-11546 Project: Flink Issue Type: Sub-task Components: Client Affects Versions: 1.7.0 Reporter: Ufuk Celebi Add an option to specify the job ID during job submissions via the CLI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11545) Add option to manually set job ID in StandaloneJobClusterEntryPoint
Ufuk Celebi created FLINK-11545: --- Summary: Add option to manually set job ID in StandaloneJobClusterEntryPoint Key: FLINK-11545 URL: https://issues.apache.org/jira/browse/FLINK-11545 Project: Flink Issue Type: Sub-task Components: Cluster Management, Kubernetes Affects Versions: 1.7.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Add an option to specify the job ID during job submissions via the StandaloneJobClusterEntryPoint. The entry point fixes the job ID to be all zeros currently. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11537) ExecutionGraph does not reach terminal state when JobMaster lost leadership
[ https://issues.apache.org/jira/browse/FLINK-11537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762483#comment-16762483 ] Ufuk Celebi commented on FLINK-11537: - Is this related to/a duplicate of https://issues.apache.org/jira/browse/FLINK-10439? > ExecutionGraph does not reach terminal state when JobMaster lost leadership > --- > > Key: FLINK-11537 > URL: https://issues.apache.org/jira/browse/FLINK-11537 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.8.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Critical > Fix For: 1.8.0 > > > The {{ExecutionGraph}} sometimes does not reach a terminal state if the > {{JobMaster}} lost the leadership. The reason is that we use the fenced main > thread executor to execute {{ExecutionGraph}} changes and we don't wait for > the {{ExecutionGraph}} to reach the terminal state before we set the fencing > token {{null}}. > One possible solution would be to wait for the {{ExecutionGraph}} to reach > the terminal state before clearing the fencing token. This has, however, the > downside that the {{JobMaster}} is still reachable until the > {{ExecutionGraph}} has been properly terminated. Alternatively, we could use > the unfenced main thread executor to send the cancel calls out. > A Travis run where the problem occurred is here: > https://travis-ci.org/tillrohrmann/flink/jobs/489119926 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11534) Don't exit JVM after job termination with standalone job
[ https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761683#comment-16761683 ] Ufuk Celebi commented on FLINK-11534: - [~till.rohrmann] Yes. This would work. Thanks for the pointer. > Don't exit JVM after job termination with standalone job > > > Key: FLINK-11534 > URL: https://issues.apache.org/jira/browse/FLINK-11534 > Project: Flink > Issue Type: New Feature > Components: Cluster Management >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Priority: Minor > > If a job deployed in job cluster mode terminates, the JVM running the > StandaloneJobClusterEntryPoint will exit via System.exit(1). > When managing such a job this requires access to external systems for logging > in order to get more details about failure causes or final termination status. > I believe that there is value in having a StandaloneJobClusterEntryPoint > option that does not exit the JVM after the job has terminated. This allows > users to gather further information if they are monitoring the job and > manually tear down the process. > If there is agreement to have this feature, I would provide the > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11534) Don't exit JVM after job termination with standalone job
Ufuk Celebi created FLINK-11534: --- Summary: Don't exit JVM after job termination with standalone job Key: FLINK-11534 URL: https://issues.apache.org/jira/browse/FLINK-11534 Project: Flink Issue Type: New Feature Components: Cluster Management Affects Versions: 1.7.0 Reporter: Ufuk Celebi If a job deployed in job cluster mode terminates, the JVM running the StandaloneJobClusterEntryPoint will exit via System.exit(1). When managing such a job this requires access to external systems for logging in order to get more details about failure causes or final termination status. I believe that there is value in having a StandaloneJobClusterEntryPoint option that does not exit the JVM after the job has terminated. This allows users to gather further information if they are monitoring the job and manually tear down the process. If there is agreement to have this feature, I would provide the implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
Ufuk Celebi created FLINK-11533: --- Summary: Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever Key: FLINK-11533 URL: https://issues.apache.org/jira/browse/FLINK-11533 Project: Flink Issue Type: New Feature Components: Cluster Management Affects Versions: 1.7.0 Reporter: Ufuk Celebi Users running job clusters distribute their user code as part of the shared classpath of all cluster components. We currently require users running {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR manifest entries that specify the main class of a JAR are ignored since they are simply part of the classpath. I propose to add another optional command line argument to the {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file (such as {{lib/usercode.jar}}) and whose Manifest is respected. Arguments: {code} --job-jar --job-classname name {code} Each argument is optional, but at least one of the two is required. The job-classname has precedence over job-jar. Implementation wise we should be able to simply create the PackagedProgram from the jar file path in ClassPathJobGraphRetriever. If there is agreement to have this feature, I would provide the implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11525) Add option to manually set job ID
[ https://issues.apache.org/jira/browse/FLINK-11525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760788#comment-16760788 ] Ufuk Celebi commented on FLINK-11525: - [~f.pompermaier] I believe that ticket is orthogonal to what I'm proposing here. > Add option to manually set job ID > -- > > Key: FLINK-11525 > URL: https://issues.apache.org/jira/browse/FLINK-11525 > Project: Flink > Issue Type: New Feature > Components: Client, REST >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Priority: Minor > > When submitting Flink jobs programmatically it is desirable to have the > option to manually set the job ID in order to have idempotent job > submissions. This simplifies failure handling on the user side as duplicate > submissions will be rejected by Flink. In general allowing to manually set > the job ID can be beneficial for third party tooling. > The default behavior should not be altered. The following job submission > entry points should be extended to allow to specify this option: > 1. REST API > 2. StandaloneJobClusterEntrypoint > 3. CLI > Note that for 2. FLINK-10921 already suggested to allow to configure the job > ID manually. > If there is agreement to have this feature, I'll go ahead and create sub > tasks for the mentioned entry points (and provide the implementation for 1. > and 2.). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11525) Add option to manually set job ID
[ https://issues.apache.org/jira/browse/FLINK-11525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11525: Description: When submitting Flink jobs programmatically it is desirable to have the option to manually set the job ID in order to have idempotent job submissions. This simplifies failure handling on the user side as duplicate submissions will be rejected by Flink. In general allowing to manually set the job ID can be beneficial for third party tooling. The default behavior should not be altered. The following job submission entry points should be extended to allow to specify this option: 1. REST API 2. StandaloneJobClusterEntrypoint 3. CLI Note that for 2. FLINK-10921 already suggested to allow to configure the job ID manually. If there is agreement to have this feature, I'll go ahead and create sub tasks for the mentioned entry points (and provide the implementation for 1. and 2.). was: When submitting Flink jobs programmatically it is desirable to have the option to manually set the job ID in order to have idempotent job submissions. This simplifies failure handling on the user side as duplicate submissions will be rejected by Flink. In general allowing to manually set the job ID can be beneficial for third party tooling. The default behavior should not be altered. The following job submission entry points should be extended to allow to specify this option: 1. REST API 2. StandaloneJobClusterEntrypoint 3. CLI Note that for 2., FLINK-10921 already suggested to allow to configure the job ID manually. If there is agreement to have this feature, I'll go ahead and create sub tasks for the mentioned entry points (and provide the implementation for 1. and 2.). 
> Add option to manually set job ID > -- > > Key: FLINK-11525 > URL: https://issues.apache.org/jira/browse/FLINK-11525 > Project: Flink > Issue Type: New Feature > Components: Client, REST >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Priority: Minor > > When submitting Flink jobs programmatically it is desirable to have the > option to manually set the job ID in order to have idempotent job > submissions. This simplifies failure handling on the user side as duplicate > submissions will be rejected by Flink. In general allowing to manually set > the job ID can be beneficial for third party tooling. > The default behavior should not be altered. The following job submission > entry points should be extended to allow to specify this option: > 1. REST API > 2. StandaloneJobClusterEntrypoint > 3. CLI > Note that for 2. FLINK-10921 already suggested to allow to configure the job > ID manually. > If there is agreement to have this feature, I'll go ahead and create sub > tasks for the mentioned entry points (and provide the implementation for 1. > and 2.). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11525) Add option to manually set job ID
Ufuk Celebi created FLINK-11525: --- Summary: Add option to manually set job ID Key: FLINK-11525 URL: https://issues.apache.org/jira/browse/FLINK-11525 Project: Flink Issue Type: New Feature Components: Client, REST Affects Versions: 1.7.0 Reporter: Ufuk Celebi When submitting Flink jobs programmatically it is desirable to have the option to manually set the job ID in order to have idempotent job submissions. This simplifies failure handling on the user side as duplicate submissions will be rejected by Flink. In general allowing to manually set the job ID can be beneficial for third party tooling. The default behavior should not be altered. The following job submission entry points should be extended to allow to specify this option: 1. REST API 2. StandaloneJobClusterEntrypoint 3. CLI Note that for 2., FLINK-10921 already suggested to allow to configure the job ID manually. If there is agreement to have this feature, I'll go ahead and create sub tasks for the mentioned entry points (and provide the implementation for 1. and 2.). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders
[ https://issues.apache.org/jira/browse/FLINK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-11402: Component/s: Local Runtime > User code can fail with an UnsatisfiedLinkError in the presence of multiple > classloaders > > > Key: FLINK-11402 > URL: https://issues.apache.org/jira/browse/FLINK-11402 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination, Local Runtime >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz > > > As reported on the user mailing list thread "[`env.java.opts` not persisting > after job canceled or failed and then > restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]";, > there can be issues with using native libraries and user code class loading. > h2. Steps to reproduce > I was able to reproduce the issue reported on the mailing list using > [snappy-java|https://github.com/xerial/snappy-java] in a user program. > Running the attached user program works fine on initial submission, but > results in a failure when re-executed. > I'm using Flink 1.7.0 using a standalone cluster started via > {{bin/start-cluster.sh}}. > 0. Unpack attached Maven project and build using {{mvn clean package}} *or* > directly use attached {{hello-snappy-1.0-SNAPSHOT.jar}} > 1. Download > [snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar] > and unpack libsnappyjava for your system: > {code} > jar tf snappy-java-1.1.7.2.jar | grep libsnappy > ... > org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so > ... > org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib > ... > {code} > 2. 
Configure system library path to {{libsnappyjava}} in {{flink-conf.yaml}} > (path needs to be adjusted for your system): > {code} > env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64 > {code} > 3. Run attached {{hello-snappy-1.0-SNAPSHOT.jar}} > {code} > bin/flink run hello-snappy-1.0-SNAPSHOT.jar > Starting execution of program > Program execution finished > Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished. > Job Runtime: 359 ms > {code} > 4. Rerun attached {{hello-snappy-1.0-SNAPSHOT.jar}} > {code} > bin/flink run hello-snappy-1.0-SNAPSHOT.jar > Starting execution of program > > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: Job failed. > (JobID: 7d69baca58f33180cb9251449ddcd396) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268) > at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487) > at > org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66) > at com.github.uce.HelloSnappy.main(HelloSnappy.java:18) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421) > at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813) > at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126) > at > org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. > at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265) > ... 17 more > Caused by: java.lang.UnsatisfiedLinkError: Native Library > /.../org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib already loaded > in another classloader > at java.lang.ClassLoader.loadLibrary0(ClassL
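The failure on rerun follows from a JNI rule: a given native library can be linked into at most one classloader per JVM. Each (re)submission creates a fresh user-code classloader; the first run binds `libsnappyjava` to the first classloader, and the second classloader's load attempt then fails with the `UnsatisfiedLinkError` above. A common workaround, which is an assumption here rather than something stated in the ticket, is a deployment change: keep the jar that triggers the native load on the cluster-wide classpath, so it is loaded once by a classloader that survives job restarts.

```shell
# Sketch of a common workaround (not from the ticket): put snappy-java on the
# cluster-wide classpath so its native loader lives in a long-lived parent
# classloader instead of the per-job user-code classloader.
cp snappy-java-1.1.7.2.jar "$FLINK_HOME/lib/"
# Also remove (or mark as provided) the same dependency in the job jar, so the
# user-code classloader never attempts to load the native library itself.
```

This is a config/deployment fragment, not a fix for the underlying classloading behavior; the trade-off is that the library version becomes cluster-wide rather than per-job.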
[jira] [Commented] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders
[ https://issues.apache.org/jira/browse/FLINK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748241#comment-16748241 ] Ufuk Celebi commented on FLINK-11402: - Similar issue for RocksDB (FLINK-5408) > User code can fail with an UnsatisfiedLinkError in the presence of multiple > classloaders > > > Key: FLINK-11402 > URL: https://issues.apache.org/jira/browse/FLINK-11402 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.7.0 >Reporter: Ufuk Celebi >Priority: Major > Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz > > > As reported on the user mailing list thread "[`env.java.opts` not persisting > after job canceled or failed and then > restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]";, > there can be issues with using native libraries and user code class loading. > h2. Steps to reproduce > I was able to reproduce the issue reported on the mailing list using > [snappy-java|https://github.com/xerial/snappy-java] in a user program. > Running the attached user program works fine on initial submission, but > results in a failure when re-executed. > I'm using Flink 1.7.0 using a standalone cluster started via > {{bin/start-cluster.sh}}. > 0. Unpack attached Maven project and build using {{mvn clean package}} *or* > directly use attached {{hello-snappy-1.0-SNAPSHOT.jar}} > 1. Download > [snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar] > and unpack libsnappyjava for your system: > {code} > jar tf snappy-java-1.1.7.2.jar | grep libsnappy > ... > org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so > ... > org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib > ... > {code} > 2. 
Configure system library path to {{libsnappyjava}} in {{flink-conf.yaml}} > (path needs to be adjusted for your system): > {code} > env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64 > {code} > 3. Run attached {{hello-snappy-1.0-SNAPSHOT.jar}} > {code} > bin/flink run hello-snappy-1.0-SNAPSHOT.jar > Starting execution of program > Program execution finished > Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished. > Job Runtime: 359 ms > {code} > 4. Rerun attached {{hello-snappy-1.0-SNAPSHOT.jar}} > {code} > bin/flink run hello-snappy-1.0-SNAPSHOT.jar > Starting execution of program > > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: Job failed. > (JobID: 7d69baca58f33180cb9251449ddcd396) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268) > at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487) > at > org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66) > at com.github.uce.HelloSnappy.main(HelloSnappy.java:18) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421) > at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813) > at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126) > at > org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. > at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) > at > org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265) > ... 17 more > Caused by: java.lang.UnsatisfiedLinkError: Native Library > /.../org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib already loaded > in another classload