[jira] [Assigned] (FLINK-35109) Add support for Flink 1.20-SNAPSHOT in Flink Kafka connector and drop support for 1.17 and 1.18

2024-05-10 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-35109:
---

Assignee: Fabian Paul  (was: Ufuk Celebi)

> Add support for Flink 1.20-SNAPSHOT in Flink Kafka connector and drop support 
> for 1.17 and 1.18
> ---
>
> Key: FLINK-35109
> URL: https://issues.apache.org/jira/browse/FLINK-35109
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / Kafka
>Reporter: Martijn Visser
>Assignee: Fabian Paul
>Priority: Blocker
> Fix For: kafka-4.0.0
>
>
> The Flink Kafka connector currently can't compile against Flink 
> 1.20-SNAPSHOT. An example failure can be found at 
> https://github.com/apache/flink-connector-kafka/actions/runs/8659822490/job/23746484721#step:15:169
> The {code:java}TypeSerializerUpgradeTestBase{code} has had issues before, 
> see FLINK-32455. See also specifically the comment at 
> https://issues.apache.org/jira/browse/FLINK-32455?focusedCommentId=17739785&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17739785
> Next to that, there's also FLINK-25509 which can only be supported with Flink 
> 1.19 and higher. 
> So we should:
> * Drop support for 1.17 and 1.18
> * Refactor the Flink Kafka connector to use the new 
> {code:java}MigrationTest{code}
> We will support the Flink Kafka connector for Flink 1.18 via the v3.1 branch; 
> this change will be a new v4.0 version with support for Flink 1.19 and the 
> upcoming Flink 1.20



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-35239) 1.19 docs show outdated warning

2024-04-26 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi resolved FLINK-35239.
-
Resolution: Fixed

> 1.19 docs show outdated warning
> ---
>
> Key: FLINK-35239
> URL: https://issues.apache.org/jira/browse/FLINK-35239
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.19.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.19.0
>
> Attachments: Screenshot 2024-04-25 at 15.01.57.png
>
>
> The docs for 1.19 are currently marked as outdated although it's the 
> currently stable release.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35239) 1.19 docs show outdated warning

2024-04-25 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-35239:
---

 Summary: 1.19 docs show outdated warning
 Key: FLINK-35239
 URL: https://issues.apache.org/jira/browse/FLINK-35239
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.19.0
Reporter: Ufuk Celebi
Assignee: Ufuk Celebi
 Fix For: 1.19.0
 Attachments: Screenshot 2024-04-25 at 15.01.57.png

The docs for 1.19 are currently marked as outdated although it's the currently 
stable release.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-35109) Add support for Flink 1.20-SNAPSHOT in Flink Kafka connector and drop support for 1.17 and 1.18

2024-04-19 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-35109:
---

Assignee: Ufuk Celebi

> Add support for Flink 1.20-SNAPSHOT in Flink Kafka connector and drop support 
> for 1.17 and 1.18
> ---
>
> Key: FLINK-35109
> URL: https://issues.apache.org/jira/browse/FLINK-35109
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Connectors / Kafka
>Reporter: Martijn Visser
>Assignee: Ufuk Celebi
>Priority: Blocker
> Fix For: kafka-4.0.0
>
>
> The Flink Kafka connector currently can't compile against Flink 
> 1.20-SNAPSHOT. An example failure can be found at 
> https://github.com/apache/flink-connector-kafka/actions/runs/8659822490/job/23746484721#step:15:169
> The {code:java}TypeSerializerUpgradeTestBase{code} has had issues before, 
> see FLINK-32455. See also specifically the comment at 
> https://issues.apache.org/jira/browse/FLINK-32455?focusedCommentId=17739785&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17739785
> Next to that, there's also FLINK-25509 which can only be supported with Flink 
> 1.19 and higher. 
> So we should:
> * Drop support for 1.17 and 1.18
> * Refactor the Flink Kafka connector to use the new 
> {code:java}MigrationTest{code}
> We will support the Flink Kafka connector for Flink 1.18 via the v3.1 branch; 
> this change will be a new v4.0 version with support for Flink 1.19 and the 
> upcoming Flink 1.20



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-35009) Change on getTransitivePredecessors breaks connectors

2024-04-09 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-35009:
---

Assignee: Martijn Visser

> Change on getTransitivePredecessors breaks connectors
> -
>
> Key: FLINK-35009
> URL: https://issues.apache.org/jira/browse/FLINK-35009
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core, Connectors / Kafka
>Affects Versions: 1.18.2, 1.20.0, 1.19.1
>Reporter: Martijn Visser
>Assignee: Martijn Visser
>Priority: Blocker
>
> {code:java}
> Error:  Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile 
> (default-testCompile) on project flink-connector-kafka: Compilation failure: 
> Compilation failure: 
> Error:  
> /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/streaming/connectors/kafka/testutils/DataGenerators.java:[214,24]
>  
> org.apache.flink.streaming.connectors.kafka.testutils.DataGenerators.InfiniteStringsGenerator.MockTransformation
>  is not abstract and does not override abstract method 
> getTransitivePredecessorsInternal() in org.apache.flink.api.dag.Transformation
> Error:  
> /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/streaming/connectors/kafka/testutils/DataGenerators.java:[220,44]
>  getTransitivePredecessors() in 
> org.apache.flink.streaming.connectors.kafka.testutils.DataGenerators.InfiniteStringsGenerator.MockTransformation
>  cannot override getTransitivePredecessors() in 
> org.apache.flink.api.dag.Transformation
> Error:overridden method is final
> {code}
> Example: 
> https://github.com/apache/flink-connector-kafka/actions/runs/8494349338/job/23269406762#step:15:167



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35038) Bump test dependency org.yaml:snakeyaml to 2.2

2024-04-07 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-35038:
---

 Summary: Bump test dependency org.yaml:snakeyaml to 2.2 
 Key: FLINK-35038
 URL: https://issues.apache.org/jira/browse/FLINK-35038
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / Kafka
Affects Versions: 3.1.0
Reporter: Ufuk Celebi
Assignee: Ufuk Celebi
 Fix For: 3.1.0


Usage of SnakeYAML via {{flink-shaded}} was replaced by an explicit test scope 
dependency on {{org.yaml:snakeyaml:1.31}} with FLINK-34193.

This outdated version of SnakeYAML triggers security warnings. Given the test 
scope, these should not pose an actual issue, but we should consider bumping the 
version for security hygiene.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-21928) DuplicateJobSubmissionException after JobManager failover

2021-03-23 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-21928:
---

 Summary: DuplicateJobSubmissionException after JobManager failover
 Key: FLINK-21928
 URL: https://issues.apache.org/jira/browse/FLINK-21928
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.12.2, 1.11.3, 1.10.3
 Environment: StandaloneApplicationClusterEntryPoint using a fixed job 
ID, High Availability enabled
Reporter: Ufuk Celebi


Consider the following scenario:
 * Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, 
high availability enabled
 * Flink job reaches a globally terminal state
 * Flink job is marked as finished in the high-availability service's 
RunningJobsRegistry
 * The JobManager fails over

On recovery, the [Dispatcher throws DuplicateJobSubmissionException, because 
the job is marked as done in the 
RunningJobsRegistry|https://github.com/apache/flink/blob/release-1.12.2/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L332-L340].

When this happens, users cannot get out of the situation without manually 
redeploying the JobManager process and changing the job ID^1^.

The desired semantics are that we don't want to re-execute a job that has 
reached a globally terminal state. In this particular case, we know that the 
job has already reached such a state (as it has been marked in the registry). 
Therefore, we could handle this case by executing the regular termination 
sequence instead of throwing a DuplicateJobSubmissionException.
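The handling described above can be sketched with a small, self-contained model. All class and method names here are hypothetical; this is not Flink's actual Dispatcher API, just an illustration of the proposed recovery behaviour:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposed behavior (not Flink's actual API).
// A job already marked DONE in the registry triggers the regular termination
// sequence on re-submission instead of a DuplicateJobSubmissionException.
public class RecoverySketch {
    public enum JobStatus { NONE, RUNNING, DONE }

    public static final Map<String, JobStatus> runningJobsRegistry = new HashMap<>();

    public static String submit(String jobId) {
        JobStatus status = runningJobsRegistry.getOrDefault(jobId, JobStatus.NONE);
        switch (status) {
            case DONE:
                // Job reached a globally terminal state before the failover:
                // run the regular termination sequence instead of failing.
                return "terminated";
            case RUNNING:
                // A genuinely duplicate submission is still rejected.
                return "duplicate-submission";
            default:
                runningJobsRegistry.put(jobId, JobStatus.RUNNING);
                return "executing";
        }
    }

    public static void main(String[] args) {
        runningJobsRegistry.put("fixed-job-id", JobStatus.DONE);
        // After a JobManager failover the entrypoint re-submits the same job ID.
        System.out.println(submit("fixed-job-id")); // prints "terminated"
    }
}
```

With this, a failover with a fixed job ID converges to a clean shutdown rather than a crash loop.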

---

^1^ With ZooKeeper HA, the respective node is not ephemeral. In Kubernetes HA, 
there is no notion of ephemeral data that is tied to a session in the first 
place afaik.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-21351) Incremental checkpoint data would be lost once a non-stop savepoint completed

2021-03-03 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294788#comment-17294788
 ] 

Ufuk Celebi commented on FLINK-21351:
-

[~roman_khachatryan] The ticket includes fixVersion 1.11.4. Are you planning to 
open a PR against the release-1.11 branch?

> Incremental checkpoint data would be lost once a non-stop savepoint completed
> -
>
> Key: FLINK-21351
> URL: https://issues.apache.org/jira/browse/FLINK-21351
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.3, 1.12.1, 1.13.0
>Reporter: Yun Tang
>Assignee: Roman Khachatryan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.12.2, 1.13.0
>
>
> FLINK-10354 counted a savepoint as a retained checkpoint so that the job 
> could fail over from the latest position. I think this behavior is reasonable; 
> however, the current implementation causes incremental checkpoint data to be 
> lost immediately once a non-stop savepoint completes.
> The current general flow for incremental checkpoints: once a newer checkpoint 
> completes, it is added to the checkpoint store. If the number of completed 
> checkpoints is larger than the max retained limit, the store subsumes the 
> oldest one. This decreases the reference count of the incremental data by 
> one, and the data is deleted once the reference count reaches zero. As we 
> always ensure to register the newer checkpoint before unregistering the older 
> one, the current flow works as expected.
> However, if a non-stop savepoint (an intermediate, manually triggered 
> savepoint) completes, it is also added to the checkpoint store and simply 
> subsumes the previously added checkpoint (in the default retain-one-checkpoint 
> case). This unregisters the older checkpoint without a newer checkpoint being 
> registered, leading to data loss.
> Thanks to [~banmoy] for reporting this problem first.
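The subsumption ordering described in the quoted report can be sketched with a toy reference count. All names here are hypothetical; this is not Flink's actual CompletedCheckpointStore, only an illustration of why subsuming without a new registration loses the shared state:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the registration ordering described above.
// Shared incremental state is reference counted; registering the newer
// checkpoint before unregistering the subsumed one keeps the count above zero.
public class CheckpointStoreSketch {
    public static int sharedStateRefCount = 0;
    public static boolean sharedStateDeleted = false;
    static final int MAX_RETAINED = 1;
    static final Deque<String> retained = new ArrayDeque<>();

    public static void add(String checkpoint, boolean referencesSharedState) {
        if (referencesSharedState) {
            sharedStateRefCount++; // register the newer checkpoint first
        }
        retained.addLast(checkpoint);
        while (retained.size() > MAX_RETAINED) {
            String subsumed = retained.removeFirst();
            if (subsumed.startsWith("chk")) {
                // unregister the subsumed incremental checkpoint
                if (--sharedStateRefCount == 0) {
                    sharedStateDeleted = true; // shared data is gone
                }
            }
        }
    }

    public static void main(String[] args) {
        add("chk-1", true);
        // A savepoint does not reference the incremental shared state, yet it
        // subsumes chk-1, dropping the only remaining reference.
        add("savepoint-2", false);
        System.out.println(sharedStateDeleted); // prints "true"
    }
}
```

In this toy model, adding the savepoint deletes the shared state even though no newer checkpoint referencing it was registered, which is exactly the failure mode the ticket describes.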



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-10841) Reduce the number of ListObjects calls when checkpointing to S3

2021-01-11 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262719#comment-17262719
 ] 

Ufuk Celebi commented on FLINK-10841:
-

[~uberspot] Presto's S3 FileSystem seems to only use v1 requests [1] whereas 
newer versions of the Hadoop S3 FileSystem should use v2 requests by default 
[2]. If both file system plugins are on the class path you have to explicitly 
pick Hadoop via {{s3a://}}. See also: 
[https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/filesystems/s3.html#hadooppresto-s3-file-systems-plugins]

Does this help?

 

[1] 
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java#L523-L527

[2] 
https://github.com/apache/hadoop/blob/rel/release-3.1.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L297-L302

> Reduce the number of ListObjects calls when checkpointing to S3
> ---
>
> Key: FLINK-10841
> URL: https://issues.apache.org/jira/browse/FLINK-10841
> Project: Flink
>  Issue Type: Improvement
>  Components: FileSystems
>Affects Versions: 1.5.5, 1.6.2
>Reporter: Pawel Bartoszek
>Priority: Minor
>
> With S3 configured as the checkpoint store using the S3 AWS Hadoop filesystem, 
> we see loads of ListObjects calls. For instance, a job with ~1600 tasks 
> requires around 23000 ListObjects calls for every checkpoint, including Flink 
> clearing it up. With the checkpoint interval set to 5 minutes, this adds up to 
> hundreds of dollars per month just for ListObjects calls. I am aware that the 
> implementation details might be hidden in the Hadoop jar and may be difficult 
> to change, but at least maybe some workaround might be suggested?
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18192) Upgrade to Avro version 1.10.0 from 1.8.2

2020-10-29 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222800#comment-17222800
 ] 

Ufuk Celebi commented on FLINK-18192:
-

[~trohrmann] Thanks for the ping. If the upgrade is feasible for a 1.11 patch 
release, it would be a clear yes from my side. If not, I would understand. ;) 

> Upgrade to Avro version 1.10.0 from 1.8.2
> -
>
> Key: FLINK-18192
> URL: https://issues.apache.org/jira/browse/FLINK-18192
> Project: Flink
>  Issue Type: Improvement
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Reporter: Lucas Heimberg
>Assignee: Dawid Wysakowicz
>Priority: Major
> Fix For: 1.12.0
>
>
> As of version 1.11, Flink (i.e., flink-avro) still uses Avro version 1.8.2.
> Avro 1.9.2 contains many bugfixes, in particular with respect to the support 
> for logical types. A further advantage is that an upgrade to Avro 1.9.2 
> would also allow using the Confluent Schema Registry client and Avro 
> deserializer in version 5.5.0, which finally support schema references.
> Therefore it would be great if Flink could make use of Avro 1.9.2 or higher 
> in future releases.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s

2020-09-25 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202044#comment-17202044
 ] 

Ufuk Celebi commented on FLINK-18828:
-

[~trohrmann] [~fly_in_gis] Just noticed that this ticket has fix versions for 
1.10 and 1.11. I think it would be better to only do this with a minor version 
upgrade (e.g. 1.12.0), because downstream users may rely on the current 
behaviour (I do).

> Terminate jobmanager process with zero exit code to avoid unexpected 
> restarting by K8s
> --
>
> Key: FLINK-18828
> URL: https://issues.apache.org/jira/browse/FLINK-18828
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Yang Wang
>Priority: Major
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Currently, the Flink jobmanager process terminates with a non-zero exit code 
> if the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in K8s 
> deployments, since a non-zero exit code causes unexpected restarting. Also, 
> from a framework's perspective, a FAILED job does not mean that Flink has 
> failed and, hence, the return code could still be 0.
> > Note:
> This is a special case for standalone K8s deployment. For 
> standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is 
> harmless. And a non-zero exit code could help to check the job result quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s

2020-08-28 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186623#comment-17186623
 ] 

Ufuk Celebi commented on FLINK-18828:
-

Thanks for the summary, Till. I think that's a good way to frame this problem.
{quote}One could argue then that users should configure their restart 
strategies to always restart if they don't want to reach a FAILED state.
{quote}
Practically speaking, I think it's the only option we currently have in the 
context of Kubernetes standalone deployments.

> Terminate jobmanager process with zero exit code to avoid unexpected 
> restarting by K8s
> --
>
> Key: FLINK-18828
> URL: https://issues.apache.org/jira/browse/FLINK-18828
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Yang Wang
>Priority: Major
> Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, the Flink jobmanager process terminates with a non-zero exit code 
> if the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in K8s 
> deployments, since a non-zero exit code causes unexpected restarting. Also, 
> from a framework's perspective, a FAILED job does not mean that Flink has 
> failed and, hence, the return code could still be 0.
> > Note:
> This is a special case for standalone K8s deployment. For 
> standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is 
> harmless. And a non-zero exit code could help to check the job result quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s

2020-08-28 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186532#comment-17186532
 ] 

Ufuk Celebi edited comment on FLINK-18828 at 8/28/20, 1:19 PM:
---

[~fly_in_gis] Thanks for the pointers. The Flink job would only transition to 
FAILED when the Flink-level restart strategy has been exhausted. In your 
example of fixed-delay with 3 attempts, the first three restarts would _not_ 
cause the container to exit; only on the 4th failure would the job transition 
to FAILED and the container exit.

The bigger problem with my proposal to set the policy to Never is that it 
would not restart in other failure scenarios (e.g. getting OOM killed), so I 
don't think it's a viable option.

Overall, I don't see a good way around this problem without your proposed 
change.

---

Maybe as a follow-up we want to resurrect 
https://issues.apache.org/jira/browse/FLINK-10948 ? That way, users would at 
least be able to determine the final Flink job status.


was (Author: uce):
[~fly_in_gis] Thanks for the pointers. The Flink job would only transition to 
FAILED when the Flink-level restart strategy has been exhausted. In your 
example for fixed-delay with 3 attempts, the first three restarts would _not_ 
result in the container to exit, but only on the 4th failure would the job 
transition to FAILED and the container exit.

I think the bigger problem with my proposal to set the policy to Never is that 
it would not restart in other failure scenarios (e.g. OOM killed). So overall, 
I don't think it's a viable option.

So overall, I don't see a good way around this problem without your proposed 
change.

---

Maybe as a follow-up we want to resurrect 
https://issues.apache.org/jira/browse/FLINK-10948? That way, users would at 
least be able to determine the final Flink job status.

> Terminate jobmanager process with zero exit code to avoid unexpected 
> restarting by K8s
> --
>
> Key: FLINK-18828
> URL: https://issues.apache.org/jira/browse/FLINK-18828
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Yang Wang
>Priority: Major
> Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, the Flink jobmanager process terminates with a non-zero exit code 
> if the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in K8s 
> deployments, since a non-zero exit code causes unexpected restarting. Also, 
> from a framework's perspective, a FAILED job does not mean that Flink has 
> failed and, hence, the return code could still be 0.
> > Note:
> This is a special case for standalone K8s deployment. For 
> standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is 
> harmless. And a non-zero exit code could help to check the job result quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s

2020-08-28 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186532#comment-17186532
 ] 

Ufuk Celebi commented on FLINK-18828:
-

[~fly_in_gis] Thanks for the pointers. The Flink job would only transition to 
FAILED when the Flink-level restart strategy has been exhausted. In your 
example of fixed-delay with 3 attempts, the first three restarts would _not_ 
cause the container to exit; only on the 4th failure would the job transition 
to FAILED and the container exit.

The bigger problem with my proposal to set the policy to Never is that it 
would not restart in other failure scenarios (e.g. getting OOM killed), so I 
don't think it's a viable option.

Overall, I don't see a good way around this problem without your proposed 
change.

---

Maybe as a follow-up we want to resurrect 
https://issues.apache.org/jira/browse/FLINK-10948? That way, users would at 
least be able to determine the final Flink job status.

> Terminate jobmanager process with zero exit code to avoid unexpected 
> restarting by K8s
> --
>
> Key: FLINK-18828
> URL: https://issues.apache.org/jira/browse/FLINK-18828
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Yang Wang
>Priority: Major
> Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, the Flink jobmanager process terminates with a non-zero exit code 
> if the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in K8s 
> deployments, since a non-zero exit code causes unexpected restarting. Also, 
> from a framework's perspective, a FAILED job does not mean that Flink has 
> failed and, hence, the return code could still be 0.
> > Note:
> This is a special case for standalone K8s deployment. For 
> standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is 
> harmless. And a non-zero exit code could help to check the job result quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s

2020-08-28 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186327#comment-17186327
 ] 

Ufuk Celebi commented on FLINK-18828:
-

[~fly_in_gis] I think it makes sense to keep a non-zero exit code for failed 
jobs. How would users figure out whether the job has succeeded or not if we 
change the exit code?

Regarding the unexpected restarts: What about updating the {{restartPolicy}} to 
{{Never}} in the spec of the Kubernetes Job 
(https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy)?
 That way, we would still have the information from the exit code and we 
wouldn't see any restarts by default. Users would also have the flexibility to 
change the behaviour depending on their use case by setting {{restartPolicy: 
OnFailure}} again.
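For illustration, such a Kubernetes Job spec could look like the following fragment. The metadata name and image tag are placeholders, not from the ticket:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager        # placeholder name
spec:
  template:
    spec:
      containers:
        - name: jobmanager
          image: flink:1.11     # placeholder image
      # Never restart the pod: the terminal exit code stays observable via
      # the pod status, and no unexpected restarts occur.
      restartPolicy: Never
```

Users who prefer automatic retries on failure could set {{restartPolicy: OnFailure}} here instead.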

 

> Terminate jobmanager process with zero exit code to avoid unexpected 
> restarting by K8s
> --
>
> Key: FLINK-18828
> URL: https://issues.apache.org/jira/browse/FLINK-18828
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.1, 1.12.0, 1.11.1
>Reporter: Yang Wang
>Priority: Major
> Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, the Flink jobmanager process terminates with a non-zero exit code 
> if the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in K8s 
> deployments, since a non-zero exit code causes unexpected restarting. Also, 
> from a framework's perspective, a FAILED job does not mean that Flink has 
> failed and, hence, the return code could still be 0.
> > Note:
> This is a special case for standalone K8s deployment. For 
> standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is 
> harmless. And a non-zero exit code could help to check the job result quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18695) Allow NettyBufferPool to allocate heap buffers

2020-07-29 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166902#comment-17166902
 ] 

Ufuk Celebi commented on FLINK-18695:
-

I agree with Stephan that it should not have much of an impact. In particular, 
for the SSL handler, the comment in the [respective Netty 
PR|https://github.com/netty/netty/commit/39cc7a673939dec96258ff27f5b1874671838af0#diff-2fe7b22a8d650f1ea0bf56a809c061f9R303-R310]
 says that the direct buffer was copied back to a heap byte[] by the SSL engine 
anyway. Therefore, allowing heap buffers should be a net positive in the 
context of the SSL handler.

> Allow NettyBufferPool to allocate heap buffers
> --
>
> Key: FLINK-18695
> URL: https://issues.apache.org/jira/browse/FLINK-18695
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Chesnay Schepler
>Priority: Major
> Fix For: 1.12.0
>
>
> In 4.1.43, Netty made a change to their SslHandler to always use heap buffers 
> for JDK SSLEngine implementations, to avoid an additional memory copy.
> However, our {{NettyBufferPool}} forbids heap buffer allocations.
> We will either have to allow heap buffer allocations, or create a custom 
> SslHandler implementation that does not use heap buffers (although this seems 
> ill-advised?).
> /cc [~sewen] [~uce] [~NicoK] [~zjwang] [~pnowojski]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17500) Deploy JobGraph from file in StandaloneClusterEntrypoint

2020-05-07 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101541#comment-17101541
 ] 

Ufuk Celebi commented on FLINK-17500:
-

[~zhuzh] Good points. I would agree to keep the JobGraph an internal API and 
not make any guarantees about compatibility here. As indicated by the 
proposed configuration option keys, I consider this internal configuration only.

(In our use case we strictly tie the JobGraph to the Flink version and don't 
re-use it across job executions.)

> Deploy JobGraph from file in StandaloneClusterEntrypoint
> 
>
> Key: FLINK-17500
> URL: https://issues.apache.org/jira/browse/FLINK-17500
> Project: Flink
>  Issue Type: Wish
>  Components: Deployment / Docker
>Reporter: Ufuk Celebi
>Priority: Minor
>
> We have a requirement to deploy a pre-generated {{JobGraph}} from a file in 
> {{StandaloneClusterEntrypoint}}.
> Currently, {{StandaloneClusterEntrypoint}} only supports deployment of a 
> Flink job from the class path using {{ClassPathPackagedProgramRetriever}}. 
> Our desired behaviour would be as follows:
> If {{internal.jobgraph-path}} is set, prepare a {{PackagedProgram}} from a 
> local {{JobGraph}} file using {{FileJobGraphRetriever}}. Otherwise, deploy 
> using {{ClassPathPackagedProgramRetriever}} (current behavior).
> ---
> I understand that this requirement is pretty niche, but wanted to get 
> feedback whether the Flink community would be open to supporting this 
> nonetheless.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17500) Deploy JobGraph from file in StandaloneClusterEntrypoint

2020-05-04 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098866#comment-17098866
 ] 

Ufuk Celebi commented on FLINK-17500:
-

# Yes, that makes sense. (y)
 # I just skimmed the code for {{YarnJobClusterEntrypoint}} and it looks like 
this would work nicely.

> Deploy JobGraph from file in StandaloneClusterEntrypoint
> 
>
> Key: FLINK-17500
> URL: https://issues.apache.org/jira/browse/FLINK-17500
> Project: Flink
>  Issue Type: Wish
>  Components: Deployment / Docker
>Reporter: Ufuk Celebi
>Priority: Minor
>
> We have a requirement to deploy a pre-generated {{JobGraph}} from a file in 
> {{StandaloneClusterEntrypoint}}.
> Currently, {{StandaloneClusterEntrypoint}} only supports deployment of a 
> Flink job from the class path using {{ClassPathPackagedProgramRetriever}}. 
> Our desired behaviour would be as follows:
> If {{internal.jobgraph-path}} is set, prepare a {{PackagedProgram}} from a 
> local {{JobGraph}} file using {{FileJobGraphRetriever}}. Otherwise, deploy 
> using {{ClassPathPackagedProgramRetriever}} (current behavior).
> ---
> I understand that this requirement is pretty niche, but wanted to get 
> feedback whether the Flink community would be open to supporting this 
> nonetheless.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17500) Deploy JobGraph from file in StandaloneClusterEntrypoint

2020-05-04 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-17500:
---

 Summary: Deploy JobGraph from file in StandaloneClusterEntrypoint
 Key: FLINK-17500
 URL: https://issues.apache.org/jira/browse/FLINK-17500
 Project: Flink
  Issue Type: Wish
  Components: Deployment / Docker
Reporter: Ufuk Celebi


We have a requirement to deploy a pre-generated {{JobGraph}} from a file in 
{{StandaloneClusterEntrypoint}}.

Currently, {{StandaloneClusterEntrypoint}} only supports deployment of a Flink 
job from the class path using {{ClassPathPackagedProgramRetriever}}. 

Our desired behaviour would be as follows:

If {{internal.jobgraph-path}} is set, prepare a {{PackagedProgram}} from a 
local {{JobGraph}} file using {{FileJobGraphRetriever}}. Otherwise, deploy 
using {{ClassPathPackagedProgramRetriever}} (current behavior).
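A minimal sketch of the proposed selection logic. The option key and retriever names follow the ticket; the surrounding code is hypothetical, not Flink's actual entrypoint:

```java
import java.util.Map;

// Hypothetical sketch of the proposed retriever selection (not Flink's
// actual StandaloneClusterEntrypoint code).
public class EntrypointSketch {
    static String chooseRetriever(Map<String, String> configuration) {
        String jobGraphPath = configuration.get("internal.jobgraph-path");
        if (jobGraphPath != null) {
            // Deploy the pre-generated JobGraph from the local file.
            return "FileJobGraphRetriever(" + jobGraphPath + ")";
        }
        // Current behavior: load the job from the class path.
        return "ClassPathPackagedProgramRetriever";
    }

    public static void main(String[] args) {
        System.out.println(
                chooseRetriever(Map.of("internal.jobgraph-path", "/opt/flink/job.graph")));
        System.out.println(chooseRetriever(Map.of())); // prints "ClassPathPackagedProgramRetriever"
    }
}
```

The "internal." prefix on the option key signals that no compatibility guarantees are made for this configuration.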

---

I understand that this requirement is pretty niche, but wanted to get feedback 
whether the Flink community would be open to supporting this nonetheless.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-16499) Flink shaded hadoop could not work when Yarn timeline service is enabled

2020-03-14 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059342#comment-17059342
 ] 

Ufuk Celebi edited comment on FLINK-16499 at 3/14/20, 1:13 PM:
---

[~fly_in_gis] For proper Hadoop setups, the expected usage is to provide the 
Hadoop classpath as documented in 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/hadoop.html#adding-hadoop-classpaths.
 A similar issue was also reported in FLINK-12493.

Edit: I'm just pointing this out for the general case. I understand that it 
might not be relevant for your usage scenario.
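For reference, the documented approach amounts to exporting the Hadoop classpath before starting Flink. The sketch below is illustrative: it assumes a working {{hadoop}} binary on the PATH, and the session command and its flags are examples, not a prescription.

```shell
# Illustrative sketch of the "adding Hadoop classpaths" setup from the docs
# linked above; assumes the 'hadoop' launcher is installed and on PATH.
export HADOOP_CLASSPATH=$(hadoop classpath)

# Flink's startup scripts pick up HADOOP_CLASSPATH, so dependencies such as
# jersey come from the cluster's own Hadoop installation instead of a
# flink-shaded-hadoop uber jar.
./bin/yarn-session.sh -jm 1024m -tm 2048m
```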


was (Author: uce):
[~fly_in_gis] For proper Hadoop setups, the expected usage is to provide the 
Hadoop classpath as documented in 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/hadoop.html#adding-hadoop-classpaths.
 A similar issue was also reported in FLINK-12493.

> Flink shaded hadoop could not work when Yarn timeline service is enabled
> 
>
> Key: FLINK-16499
> URL: https://issues.apache.org/jira/browse/FLINK-16499
> Project: Flink
>  Issue Type: Bug
>  Components: BuildSystem / Shaded
>Reporter: Yang Wang
>Priority: Major
>
> When the Yarn timeline service is enabled (via 
> {{yarn.timeline-service.enabled=true}} in yarn-site.xml), flink-shaded-hadoop 
> could not work to submit Flink job to Yarn cluster. The following exception 
> will be thrown.
>  
> The root cause is the {{jersey-core-xx.jar}} is not bundled into 
> {{flink-shaded-hadoop-xx}}{{.jar}}.
>  
> {code:java}
> 2020-03-09 03:35:34,396 ERROR org.apache.flink.client.cli.CliFrontend         
>              [] - Fatal error while running command line interface.2020-03-09 
> 03:35:34,396 ERROR org.apache.flink.client.cli.CliFrontend                    
>   [] - Fatal error while running command line 
> interface.java.lang.NoClassDefFoundError: javax/ws/rs/ext/MessageBodyReader 
> at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
> ~[?:1.8.0_242] at 
> java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] 
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at 
> java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at 
> java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at 
> java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at 
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] 
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
> ~[?:1.8.0_242] at 
> java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] 
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at 
> java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at 
> java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at 
> java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at 
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] 
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
> ~[?:1.8.0_242] at 
> java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] 
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at 
> java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at 
> java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at 
> java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at 
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] 
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at 
> 

[jira] [Commented] (FLINK-16499) Flink shaded hadoop could not work when Yarn timeline service is enabled

2020-03-14 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059342#comment-17059342
 ] 

Ufuk Celebi commented on FLINK-16499:
-

[~fly_in_gis] For proper Hadoop setups, the expected usage is to provide the 
Hadoop classpath as documented in 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/hadoop.html#adding-hadoop-classpaths.
 A similar issue was also reported in FLINK-12493.

> Flink shaded hadoop could not work when Yarn timeline service is enabled
> 
>
> Key: FLINK-16499
> URL: https://issues.apache.org/jira/browse/FLINK-16499
> Project: Flink
>  Issue Type: Bug
>  Components: BuildSystem / Shaded
>Reporter: Yang Wang
>Priority: Major
>
> When the Yarn timeline service is enabled (via 
> {{yarn.timeline-service.enabled=true}} in yarn-site.xml), flink-shaded-hadoop 
> could not work to submit Flink job to Yarn cluster. The following exception 
> will be thrown.
>  
> The root cause is the {{jersey-core-xx.jar}} is not bundled into 
> {{flink-shaded-hadoop-xx}}{{.jar}}.
>  
> {code:java}
> 2020-03-09 03:35:34,396 ERROR org.apache.flink.client.cli.CliFrontend         
>              [] - Fatal error while running command line interface.2020-03-09 
> 03:35:34,396 ERROR org.apache.flink.client.cli.CliFrontend                    
>   [] - Fatal error while running command line 
> interface.java.lang.NoClassDefFoundError: javax/ws/rs/ext/MessageBodyReader 
> at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
> ~[?:1.8.0_242] at 
> java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] 
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at 
> java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at 
> java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at 
> java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at 
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] 
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
> ~[?:1.8.0_242] at 
> java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] 
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at 
> java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at 
> java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at 
> java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at 
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] 
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.defineClass(ClassLoader.java:757) ~[?:1.8.0_242] at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
> ~[?:1.8.0_242] at 
> java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_242] 
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_242] 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_242] at 
> java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_242] at 
> java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_242] at 
> java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_242] at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_242] at 
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_242] 
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:1.8.0_242] at 
> org.apache.hadoop.yarn.util.timeline.TimelineUtils.&lt;clinit&gt;(TimelineUtils.java:50)
>  ~[flink-shaded-hadoop-2-uber-2.8.3-7.0.jar:2.8.3-7.0] at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:179)
>  ~[flink-shaded-hadoop-2-uber-2.8.3-7.0.jar:2.8.3-7.0] at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
> ~[flink-shaded-hadoop-2-uber-2.8.3-7.0.jar:2.8.3-7.0] at 
> 

[jira] [Comment Edited] (FLINK-14812) Add custom libs to Flink classpath with an environment variable.

2020-02-10 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033449#comment-17033449
 ] 

Ufuk Celebi edited comment on FLINK-14812 at 2/10/20 9:07 AM:
--

[~elanv] I'm reopening again, because the initial context of this ticket was 
not the Docker entrypoint script, but the general startup scripts. If there is 
agreement that the Docker entrypoint change is enough, we can close this again. 
:) 

[~aljoscha] What do you think?


was (Author: uce):
[~elanv] I'm reopening again, because the initial context of this ticket was 
not the Docker entrypoint script, but the general startup scripts. If there is 
agreement that the Docker entrypoint change is enough, we can close this again. 
:) 

> Add custom libs to Flink classpath with an environment variable.
> 
>
> Key: FLINK-14812
> URL: https://issues.apache.org/jira/browse/FLINK-14812
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes, Deployment / Scripts
>Reporter: Eui Heo
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To use a plugin library, you need to add it to the Flink classpath. The 
> documentation explains to put the jar file in the lib path.
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> However, to deploy metric-enabled Flinks on a kubernetes cluster, we have the 
> burden of creating and managing another container image. It would be more 
> efficient to add the classpath using environment variables inside the 
> constructFlinkClassPath function in the config.sh file.
> In particular, it seems inconvenient for me to create separate images to use 
> the jars, even though the /opt/flink/opt of the stock image already contains 
> them.
> For example, there are metrics libs and file system libs:
> flink-azure-fs-hadoop-1.9.1.jar
> flink-s3-fs-hadoop-1.9.1.jar
> flink-metrics-prometheus-1.9.1.jar
> flink-metrics-influxdb-1.9.1.jar





[jira] [Reopened] (FLINK-14812) Add custom libs to Flink classpath with an environment variable.

2020-02-10 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reopened FLINK-14812:
-

> Add custom libs to Flink classpath with an environment variable.
> 
>
> Key: FLINK-14812
> URL: https://issues.apache.org/jira/browse/FLINK-14812
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes, Deployment / Scripts
>Reporter: Eui Heo
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To use a plugin library, you need to add it to the Flink classpath. The 
> documentation explains to put the jar file in the lib path.
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> However, to deploy metric-enabled Flinks on a kubernetes cluster, we have the 
> burden of creating and managing another container image. It would be more 
> efficient to add the classpath using environment variables inside the 
> constructFlinkClassPath function in the config.sh file.
> In particular, it seems inconvenient for me to create separate images to use 
> the jars, even though the /opt/flink/opt of the stock image already contains 
> them.
> For example, there are metrics libs and file system libs:
> flink-azure-fs-hadoop-1.9.1.jar
> flink-s3-fs-hadoop-1.9.1.jar
> flink-metrics-prometheus-1.9.1.jar
> flink-metrics-influxdb-1.9.1.jar





[jira] [Commented] (FLINK-14812) Add custom libs to Flink classpath with an environment variable.

2020-02-10 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033449#comment-17033449
 ] 

Ufuk Celebi commented on FLINK-14812:
-

[~elanv] I'm reopening again, because the initial context of this ticket was 
not the Docker entrypoint script, but the general startup scripts. If there is 
agreement that the Docker entrypoint change is enough, we can close this again. 
:) 

> Add custom libs to Flink classpath with an environment variable.
> 
>
> Key: FLINK-14812
> URL: https://issues.apache.org/jira/browse/FLINK-14812
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes, Deployment / Scripts
>Reporter: Eui Heo
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To use a plugin library, you need to add it to the Flink classpath. The 
> documentation explains to put the jar file in the lib path.
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> However, to deploy metric-enabled Flinks on a kubernetes cluster, we have the 
> burden of creating and managing another container image. It would be more 
> efficient to add the classpath using environment variables inside the 
> constructFlinkClassPath function in the config.sh file.
> In particular, it seems inconvenient for me to create separate images to use 
> the jars, even though the /opt/flink/opt of the stock image already contains 
> them.
> For example, there are metrics libs and file system libs:
> flink-azure-fs-hadoop-1.9.1.jar
> flink-s3-fs-hadoop-1.9.1.jar
> flink-metrics-prometheus-1.9.1.jar
> flink-metrics-influxdb-1.9.1.jar





[jira] [Assigned] (FLINK-15831) Add Docker image publication to release documentation

2020-02-05 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-15831:
---

Assignee: Patrick Lucas

> Add Docker image publication to release documentation
> -
>
> Key: FLINK-15831
> URL: https://issues.apache.org/jira/browse/FLINK-15831
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Ufuk Celebi
>Assignee: Patrick Lucas
>Priority: Major
>
> The [release documentation in the project 
> Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release]
>  describes the release process.
> We need to add a note to follow up with the Docker image publication process 
> as part of the release checklist. The actual documentation should probably be 
> self-contained in the apache/flink-docker repository, but we should 
> definitely link to it.





[jira] [Closed] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker

2020-02-03 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-15830.
---
Resolution: Fixed

apache/flink-docker and docker-flink/docker-flink now have the same histories up 
to f26d84c when we decided to migrate.

> Migrate docker-flink/docker-flink to apache/flink-docker
> 
>
> Key: FLINK-15830
> URL: https://issues.apache.org/jira/browse/FLINK-15830
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Ufuk Celebi
>Assignee: Patrick Lucas
>Priority: Major
>
> * Migrate contents including commit history.
> * Add note to docker-flink/docker-flink and potentially archive





[jira] [Commented] (FLINK-15829) Request apache/flink-docker repository

2020-01-31 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027376#comment-17027376
 ] 

Ufuk Celebi commented on FLINK-15829:
-

The repository has been created in https://github.com/apache/flink-docker.

There is a small hiccup in the repository description. I asked INFRA to fix it 
in https://issues.apache.org/jira/browse/INFRA-19800.

> Request apache/flink-docker repository
> --
>
> Key: FLINK-15829
> URL: https://issues.apache.org/jira/browse/FLINK-15829
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Major
>






[jira] [Resolved] (FLINK-15829) Request apache/flink-docker repository

2020-01-31 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi resolved FLINK-15829.
-
Resolution: Fixed

> Request apache/flink-docker repository
> --
>
> Key: FLINK-15829
> URL: https://issues.apache.org/jira/browse/FLINK-15829
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Major
>






[jira] [Updated] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker

2020-01-31 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-15830:

Description: 
* Migrate contents including commit history.
* Add note to docker-flink/docker-flink and potentially archive

  was:Migrate contents including commit history.


> Migrate docker-flink/docker-flink to apache/flink-docker
> 
>
> Key: FLINK-15830
> URL: https://issues.apache.org/jira/browse/FLINK-15830
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Ufuk Celebi
>Assignee: Patrick Lucas
>Priority: Major
>
> * Migrate contents including commit history.
> * Add note to docker-flink/docker-flink and potentially archive





[jira] [Updated] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker

2020-01-31 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-15830:

Description: Migrate contents including commit history.

> Migrate docker-flink/docker-flink to apache/flink-docker
> 
>
> Key: FLINK-15830
> URL: https://issues.apache.org/jira/browse/FLINK-15830
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Ufuk Celebi
>Assignee: Patrick Lucas
>Priority: Major
>
> Migrate contents including commit history.





[jira] [Created] (FLINK-15831) Add Docker image publication to release documentation

2020-01-31 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-15831:
---

 Summary: Add Docker image publication to release documentation
 Key: FLINK-15831
 URL: https://issues.apache.org/jira/browse/FLINK-15831
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Reporter: Ufuk Celebi


The [release documentation in the project 
Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release]
 describes the release process.

We need to add a note to follow up with the Docker image publication process as 
part of the release checklist. The actual documentation should probably be 
self-contained in the apache/flink-docker repository, but we should definitely 
link to it.






[jira] [Created] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker

2020-01-31 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-15830:
---

 Summary: Migrate docker-flink/docker-flink to apache/flink-docker
 Key: FLINK-15830
 URL: https://issues.apache.org/jira/browse/FLINK-15830
 Project: Flink
  Issue Type: Sub-task
Reporter: Ufuk Celebi
Assignee: Patrick Lucas








[jira] [Assigned] (FLINK-15829) Request apache/flink-docker repository

2020-01-31 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-15829:
---

Assignee: Ufuk Celebi

> Request apache/flink-docker repository
> --
>
> Key: FLINK-15829
> URL: https://issues.apache.org/jira/browse/FLINK-15829
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Major
>






[jira] [Created] (FLINK-15829) Request apache/flink-docker repository

2020-01-31 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-15829:
---

 Summary: Request apache/flink-docker repository
 Key: FLINK-15829
 URL: https://issues.apache.org/jira/browse/FLINK-15829
 Project: Flink
  Issue Type: Sub-task
Reporter: Ufuk Celebi








[jira] [Created] (FLINK-15828) Integrate docker-flink/docker-flink into Flink release process

2020-01-31 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-15828:
---

 Summary: Integrate docker-flink/docker-flink into Flink release 
process
 Key: FLINK-15828
 URL: https://issues.apache.org/jira/browse/FLINK-15828
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Docker, Release System
Reporter: Ufuk Celebi


This ticket tracks the first phase of Flink Docker image build consolidation.

The goal of this story is to integrate Docker image publication with the Flink 
release process and provide convenience packages of released Flink artifacts on 
DockerHub.

For more details, check the 
[DISCUSS|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html]
 and 
[VOTE|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36982.html]
 threads on the mailing list.





[jira] [Comment Edited] (FLINK-14812) Add metric libs to Flink classpath with an environment variable.

2019-11-20 Thread Ufuk Celebi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978315#comment-16978315
 ] 

Ufuk Celebi edited comment on FLINK-14812 at 11/20/19 11:08 AM:


[~elanv] I think your workaround of adding single JARs to 
{{HADOOP_CLASSPATH}} is the best option currently.

I think the question for this ticket is independent of metrics and Kubernetes. 
The general question is whether we want to add a Hadoop-independent 
environment variable such as {{CLASSPATH_ADDITIONAL_JARS}} that we document and 
commit to maintain for our startup scripts.

Personally, I would find this beneficial in the standalone job use case. 
[~plucas] What do you think?

---

I'm not sure who in the Flink community has good insight on the scripts and can 
help us make a decision here. [~aljoscha] or [~chesnay] maybe?
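As a sketch of what such a variable could look like: the name {{CLASSPATH_ADDITIONAL_JARS}} comes from the comment above and is hypothetical, as is the simplified lib classpath below — this is not how Flink's config.sh is actually written.

```shell
# Hypothetical sketch of how constructFlinkClassPath in config.sh could honor
# an additional-jars variable; the dist jar path is a stand-in.
construct_flink_classpath() {
    local cp="/opt/flink/lib/flink-dist.jar"
    if [ -n "${CLASSPATH_ADDITIONAL_JARS:-}" ]; then
        # Append user-supplied jars without requiring a custom image.
        cp="${cp}:${CLASSPATH_ADDITIONAL_JARS}"
    fi
    echo "$cp"
}

CLASSPATH_ADDITIONAL_JARS="/opt/flink/opt/flink-metrics-prometheus-1.9.1.jar"
construct_flink_classpath
```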


was (Author: uce):
[~elanv] I think your workaround of adding single JARs to 
{{HADOOP_CLASSPATH}} is the best option currently.

I think the question for this ticket is independent of metrics and Kubernetes. 
The general question is whether we want to add a Hadoop-independent 
environment variable such as {{CLASSPATH_ADDITIONAL_JARS}} that we document and 
commit to maintain for our startup scripts.

Personally, I would find this beneficial in the standalone job use case. 
[~plucas] What do you think?

---

I'm not sure who in the Flink community has good insight on the scripts and can 
help us make a decision here. [~aljoscha] or [~chesnay] maybe?

 

 

 

> Add metric libs to Flink classpath with an environment variable.
> 
>
> Key: FLINK-14812
> URL: https://issues.apache.org/jira/browse/FLINK-14812
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes, Deployment / Scripts
>Reporter: Eui Heo
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> To use the Flink metric lib, you need to add it to the Flink classpath. The 
> documentation explains to put the jar file in the lib path.
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> However, to deploy metric-enabled Flinks on a kubernetes cluster, we have the 
> burden of creating and managing another container image. It would be more 
> efficient to add the classpath using environment variables inside the 
> constructFlinkClassPath function in the config.sh file.





[jira] [Assigned] (FLINK-14242) Collapse task names in job graph visualization if too long

2019-09-29 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-14242:
---

Assignee: Yadong Xie

> Collapse task names in job graph visualization if too long
> --
>
> Key: FLINK-14242
> URL: https://issues.apache.org/jira/browse/FLINK-14242
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.0
>Reporter: Paul Lin
>Assignee: Yadong Xie
>Priority: Minor
>
> For some complex jobs, especially SQL jobs, the task names are quite long 
> which makes the job graph hard to read.  We could auto collapse these task 
> names if they exceed a certain length, and provide an uncollapse button for 
> the full task names.





[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-14145:

Description: 
The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
the wrong checkpoint if:
* checkpoints are preferred  ({{getLatestCheckpoint(true)}}),
* the latest checkpoint is *not* a savepoint,
* more than a single checkpoint is retained.

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

You can apply the patch via {{git am -3 < *.patch}}.

 

  was:
The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
the wrong checkpoint if checkpoints are preferred  
({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a savepoint.

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

You can apply the patch via {{git am -3 < *.patch}}.

 


> getLatestCheckpoint(true) returns wrong checkpoint
> --
>
> Key: FLINK-14145
> URL: https://issues.apache.org/jira/browse/FLINK-14145
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: 
> 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch
>
>
> The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
> the wrong checkpoint if:
> * checkpoints are preferred  ({{getLatestCheckpoint(true)}}),
> * the latest checkpoint is *not* a savepoint,
> * more than a single checkpoint is retained.
> The current implementation assumes that the latest checkpoint is a savepoint 
> and skips over it. I attached a patch for 
> {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.
> You can apply the patch via {{git am -3 < *.patch}}.
>  
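The conditions above can be illustrated with a small stand-alone sketch. {{Checkpoint}} below is a stub, not Flink's {{CompletedCheckpoint}}, and the loop shows the intended selection behavior rather than Flink's actual implementation: when checkpoints are preferred, scan from newest to oldest for the first non-savepoint instead of unconditionally skipping the newest entry.

```java
import java.util.List;

// Stub illustration of the checkpoint-preference selection described above.
public class LatestCheckpointSelection {
    record Checkpoint(long id, boolean isSavepoint) {}

    // Prefer the newest retained non-savepoint checkpoint; fall back to the
    // newest entry overall. Assuming the newest entry is always a savepoint
    // (the bug described in the ticket) returns the wrong checkpoint as soon
    // as more than one checkpoint is retained and the newest one is not a
    // savepoint.
    static Checkpoint getLatestCheckpoint(List<Checkpoint> retained, boolean preferCheckpoints) {
        if (preferCheckpoints) {
            for (int i = retained.size() - 1; i >= 0; i--) {
                if (!retained.get(i).isSavepoint()) {
                    return retained.get(i);
                }
            }
        }
        return retained.get(retained.size() - 1);
    }

    public static void main(String[] args) {
        // Two retained checkpoints, the newest is a regular checkpoint:
        // preference must return it rather than skip it.
        List<Checkpoint> retained =
                List.of(new Checkpoint(1, false), new Checkpoint(2, false));
        System.out.println(getLatestCheckpoint(retained, true).id()); // 2
    }
}
```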





[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-14145:

Attachment: 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch

> getLatestCheckpoint(true) returns wrong checkpoint
> --
>
> Key: FLINK-14145
> URL: https://issues.apache.org/jira/browse/FLINK-14145
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: 
> 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch
>
>
> The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
> the wrong checkpoint if checkpoints are preferred  
> ({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a 
> savepoint.
> The current implementation assumes that the latest checkpoint is a savepoint 
> and skips over it. I attached a patch for 
> {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.
> You can apply the patch via {{git am -3 < *.patch}}.
>  





[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-14145:

Attachment: (was: 
0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch)

> getLatestCheckpoint(true) returns wrong checkpoint
> --
>
> Key: FLINK-14145
> URL: https://issues.apache.org/jira/browse/FLINK-14145
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: 
> 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch
>
>
> The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
> the wrong checkpoint if checkpoints are preferred  
> ({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a 
> savepoint.
> The current implementation assumes that the latest checkpoint is a savepoint 
> and skips over it. I attached a patch for 
> {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.
> You can apply the patch via {{git am -3 < *.patch}}.
>  





[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-14145:

Description: 
The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
the wrong checkpoint if checkpoints are preferred  
({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a savepoint.

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

You can apply the patch via {{git am -3 < *.patch}}.

 

  was:
The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
the wrong checkpoint if checkpoints are preferred  
({{getLatestCheckpoint(true)}}).

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

You can apply the patch via {{git am -3 < *.patch}}.

 


> getLatestCheckpoint(true) returns wrong checkpoint
> --
>
> Key: FLINK-14145
> URL: https://issues.apache.org/jira/browse/FLINK-14145
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: 
> 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch
>
>
> The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
> the wrong checkpoint if checkpoints are preferred  
> ({{getLatestCheckpoint(true)}}) and the latest checkpoint is *not* a 
> savepoint.
> The current implementation assumes that the latest checkpoint is a savepoint 
> and skips over it. I attached a patch for 
> {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.
> You can apply the patch via {{git am -3 < *.patch}}.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14145) getLatestCheckpoint(true) returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-14145:

Summary: getLatestCheckpoint(true) returns wrong checkpoint  (was: 
getLatestCheckpoint returns wrong checkpoint)

> getLatestCheckpoint(true) returns wrong checkpoint
> --
>
> Key: FLINK-14145
> URL: https://issues.apache.org/jira/browse/FLINK-14145
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: 
> 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch
>
>
> The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
> the wrong checkpoint if checkpoints are preferred  
> ({{getLatestCheckpoint(true)}}).
> The current implementation assumes that the latest checkpoint is a savepoint 
> and skips over it. I attached a patch for 
> {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.
> You can apply the patch via {{git am -3 < *.patch}}.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-14145:

Description: 
The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
the wrong checkpoint if checkpoints are preferred  
({{getLatestCheckpoint(true)}}).

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

You can apply the patch via {{git am -3 < *.patch}}.

 

  was:
The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
the wrong checkpoint if checkpoints are preferred  
({{getLatestCheckpoint(true)}}).

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

You can apply the patch via {{git am -3 < *.patch}}).

 


> getLatestCheckpoint returns wrong checkpoint
> 
>
> Key: FLINK-14145
> URL: https://issues.apache.org/jira/browse/FLINK-14145
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: 
> 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch
>
>
> The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
> the wrong checkpoint if checkpoints are preferred  
> ({{getLatestCheckpoint(true)}}).
> The current implementation assumes that the latest checkpoint is a savepoint 
> and skips over it. I attached a patch for 
> {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.
> You can apply the patch via {{git am -3 < *.patch}}.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-14145:

Description: 
The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
the wrong checkpoint if checkpoints are preferred  
({{getLatestCheckpoint(true)}}).

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

You can apply the patch via {{git am -3 < *.patch}}).

 

  was:
The flag to prefer checkpoints for recovery introduced in FLINK-11159 breaks 
returns the wrong checkpoint as the latest one if enabled.

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

 


> getLatestCheckpoint returns wrong checkpoint
> 
>
> Key: FLINK-14145
> URL: https://issues.apache.org/jira/browse/FLINK-14145
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: 
> 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch
>
>
> The flag to prefer checkpoints for recovery introduced in FLINK-11159 returns 
> the wrong checkpoint if checkpoints are preferred  
> ({{getLatestCheckpoint(true)}}).
> The current implementation assumes that the latest checkpoint is a savepoint 
> and skips over it. I attached a patch for 
> {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.
> You can apply the patch via {{git am -3 < *.patch}}).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-14145:

Attachment: 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch

> getLatestCheckpoint returns wrong checkpoint
> 
>
> Key: FLINK-14145
> URL: https://issues.apache.org/jira/browse/FLINK-14145
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: 
> 0001-FLINK-14145-runtime-Add-getLatestCheckpoint-test.patch
>
>
> The flag to prefer checkpoints for recovery introduced in FLINK-11159 breaks 
> returns the wrong checkpoint as the latest one if enabled.
> The current implementation assumes that the latest checkpoint is a savepoint 
> and skips over it. I attached a patch for 
> {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint

2019-09-20 Thread Ufuk Celebi (Jira)
Ufuk Celebi created FLINK-14145:
---

 Summary: getLatestCheckpoint returns wrong checkpoint
 Key: FLINK-14145
 URL: https://issues.apache.org/jira/browse/FLINK-14145
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.9.0
Reporter: Ufuk Celebi


The flag to prefer checkpoints for recovery introduced in FLINK-11159 breaks 
returns the wrong checkpoint as the latest one if enabled.

The current implementation assumes that the latest checkpoint is a savepoint 
and skips over it. I attached a patch for 
{{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13566) Support checkpoint configuration through flink-conf.yaml

2019-08-05 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900096#comment-16900096
 ] 

Ufuk Celebi commented on FLINK-13566:
-

I see that having a good default checkpointing interval for session clusters 
running many different kinds of jobs can be challenging, but overall the 
analogy to the default state backend configuration makes sense to me. In 
application cluster environments, which are becoming more commonly used, the 
argument about job- vs. cluster-level configuration doesn't even apply.

I think this ticket warrants a small summary document with how this behaves 
with interesting configuration combinations. For instance, if we can enable 
checkpointing by default, do we need to allow users to specifically disable 
checkpointing via their env? Can there be problems activating checkpointing by 
default?

[~gyfora] Are you planning to work on this?
[~aljoscha] What do you think about this?

> Support checkpoint configuration through flink-conf.yaml
> 
>
> Key: FLINK-13566
> URL: https://issues.apache.org/jira/browse/FLINK-13566
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Configuration
>Reporter: Gyula Fora
>Priority: Major
>
> Currently basic checkpointing configuration happens through the 
> StreamExecutionEnvironment and the CheckpointConfig class.
> There is no way to configure checkpointing behaviour purely from the 
> flink-conf.yaml file (or provide a default checkpointing behaviour) as it 
> always needs to happen programmatically through the environment.
> The checkpoint config settings are then translated down to the 
> CheckpointCoordinatorConfiguration which will control the runtime behaviour.
> As checkpointing related settings are operational features that should not 
> affect the application logic I think we need to support configuring these 
> params through the flink-conf yaml.
> In order to do this we probably need to rework the CheckpointConfig class so 
> that it distinguishes parameters that the user actually set from the defaults 
> (to support overriding what was set in the conf).
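As a rough sketch of the direction, defaults like the following could then live in flink-conf.yaml (the key names below are illustrative assumptions, not option names settled by this ticket):

```yaml
# Hypothetical cluster-wide checkpointing defaults; values set
# programmatically via CheckpointConfig would take precedence over
# these, per the override semantics discussed in this ticket.
execution.checkpointing.interval: 30 s
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.timeout: 10 min
```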



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-12813) Add Hadoop profile in building from source docs

2019-06-12 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-12813.
---
Resolution: Won't Fix

> Add Hadoop profile in building from source docs
> ---
>
> Key: FLINK-12813
> URL: https://issues.apache.org/jira/browse/FLINK-12813
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.8.0
>Reporter: Ufuk Celebi
>Priority: Trivial
>
> The docs for [building from source with 
> Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions]
>  omit the {{-Pinclude-hadoop}} profile in two code snippets.
> The two code snippets that have {{-Dhadoop.version}} set, but no 
> {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as 
> well.
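As a hedged sketch, a corrected snippet would pass both flags together; the Hadoop version below is only an example, not one mandated by the docs:

```shell
# Build Flink with a bundled Hadoop dependency. Setting -Dhadoop.version
# alone is not enough; the -Pinclude-hadoop profile must accompany it.
FLAGS="-DskipTests -Pinclude-hadoop -Dhadoop.version=2.8.3"
echo "mvn clean install $FLAGS"
```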



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12813) Add Hadoop profile in building from source docs

2019-06-12 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-12813:

Issue Type: Improvement  (was: Bug)

> Add Hadoop profile in building from source docs
> ---
>
> Key: FLINK-12813
> URL: https://issues.apache.org/jira/browse/FLINK-12813
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.8.0
>Reporter: Ufuk Celebi
>Priority: Trivial
>
> The docs for [building from source with 
> Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions]
>  omit the {{-Pinclude-hadoop}} profile in two code snippets.
> The two code snippets that have {{-Dhadoop.version}} set, but no 
> {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as 
> well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12813) Add Hadoop profile in building from source docs

2019-06-12 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861856#comment-16861856
 ] 

Ufuk Celebi edited comment on FLINK-12813 at 6/12/19 8:11 AM:
--

Makes sense. Since a user was also confused by this: What do you think about 
explicitly pointing this out in the other two sections? I'm changing this to be 
an {{Improvement}} ticket for now. If you don't think it's necessary, let's 
close this ticket, so it does not linger around.


was (Author: uce):
Makes sense. Since a user was also confused by this: What do you think about 
explicitly pointing this out in the other two sections? I'm changing this to be 
an {{Improvement}} ticket for now. If you don't think it's necessary, let's 
close this ticket, so it does not linger around.

 

> Add Hadoop profile in building from source docs
> ---
>
> Key: FLINK-12813
> URL: https://issues.apache.org/jira/browse/FLINK-12813
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.8.0
>Reporter: Ufuk Celebi
>Priority: Trivial
>
> The docs for [building from source with 
> Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions]
>  omit the {{-Pinclude-hadoop}} profile in two code snippets.
> The two code snippets that have {{-Dhadoop.version}} set, but no 
> {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as 
> well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12813) Add Hadoop profile in building from source docs

2019-06-12 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861856#comment-16861856
 ] 

Ufuk Celebi commented on FLINK-12813:
-

Makes sense. Since a user was also confused by this: What do you think about 
explicitly pointing this out in the other two sections? I'm changing this to be 
an {{Improvement}} ticket for now. If you don't think it's necessary, let's 
close this ticket, so it does not linger around.

 

> Add Hadoop profile in building from source docs
> ---
>
> Key: FLINK-12813
> URL: https://issues.apache.org/jira/browse/FLINK-12813
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.8.0
>Reporter: Ufuk Celebi
>Priority: Trivial
>
> The docs for [building from source with 
> Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions]
>  omit the {{-Pinclude-hadoop}} profile in two code snippets.
> The two code snippets that have {{-Dhadoop.version}} set, but no 
> {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as 
> well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12813) Add Hadoop profile in building from source docs

2019-06-12 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-12813:
---

 Summary: Add Hadoop profile in building from source docs
 Key: FLINK-12813
 URL: https://issues.apache.org/jira/browse/FLINK-12813
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.0
Reporter: Ufuk Celebi


The docs for [building from source with 
Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions]
 omit the {{-Pinclude-hadoop}} profile in two code snippets.

The two code snippets that have {{-Dhadoop.version}} set, but no 
{{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as well.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12313) SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable

2019-04-24 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-12313:

Issue Type: Bug  (was: Test)

> SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints
>  is unstable
> ---
>
> Key: FLINK-12313
> URL: https://issues.apache.org/jira/browse/FLINK-12313
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Reporter: Ufuk Celebi
>Priority: Critical
>  Labels: test-stability
>
> {{SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints}}
>  occasionally fails on Travis due to producing no output, at which point the 
> thread stack traces are printed.
> {code}
> ==
> Printing stack trace of Java process 10071
> ==
> 2019-04-24 07:55:29
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode):
> "Attach Listener" #17 daemon prio=9 os_prio=0 tid=0x7f294892 
> nid=0x2cf5 waiting on condition [0x]
>java.lang.Thread.State: RUNNABLE
> "Async calls on Test Task (1/1)" #15 daemon prio=5 os_prio=0 
> tid=0x7f2948dd1800 nid=0x27a9 waiting on condition [0x7f292cea9000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x8bb5e558> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> "Async calls on Test Task (1/1)" #14 daemon prio=5 os_prio=0 
> tid=0x7f2948dce800 nid=0x27a8 in Object.wait() [0x7f292cfaa000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x8bac58f8> (a java.lang.Object)
>   at java.lang.Object.wait(Object.java:502)
>   at 
> org.apache.flink.streaming.runtime.tasks.SynchronousSavepointLatch.blockUntilCheckpointIsAcknowledged(SynchronousSavepointLatch.java:66)
>   - locked <0x8bac58f8> (a java.lang.Object)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:726)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:604)
>   at 
> org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.triggerCheckpoint(SynchronousCheckpointITCase.java:174)
>   at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1182)
>   at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> "CloseableReaperThread" #13 daemon prio=5 os_prio=0 tid=0x7f2948d9b800 
> nid=0x27a7 in Object.wait() [0x7f292d0ab000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock)
>   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
>   - locked <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock)
>   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
>   at 
> org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193)
> "Test Task (1/1)" #12 prio=5 os_prio=0 tid=0x7f2948d97000 nid=0x27a6 in 
> Object.wait() [0x7f292d1ac000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x8e63f7d8> (a java.lang.Object)
>   at java.lang.Object.wait(Object.java:502)
>   at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:63)
>   - locked <0x8e63f7d8> (a java.lang.Object)
>   at 
> 

[jira] [Created] (FLINK-12313) SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable

2019-04-24 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-12313:
---

 Summary: 
SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints 
is unstable
 Key: FLINK-12313
 URL: https://issues.apache.org/jira/browse/FLINK-12313
 Project: Flink
  Issue Type: Test
  Components: Runtime / Checkpointing
Reporter: Ufuk Celebi


{{SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints}}
 occasionally fails on Travis due to producing no output, at which point the 
thread stack traces are printed.

{code}
==
Printing stack trace of Java process 10071
==
2019-04-24 07:55:29
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode):

"Attach Listener" #17 daemon prio=9 os_prio=0 tid=0x7f294892 nid=0x2cf5 
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"Async calls on Test Task (1/1)" #15 daemon prio=5 os_prio=0 
tid=0x7f2948dd1800 nid=0x27a9 waiting on condition [0x7f292cea9000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x8bb5e558> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

"Async calls on Test Task (1/1)" #14 daemon prio=5 os_prio=0 
tid=0x7f2948dce800 nid=0x27a8 in Object.wait() [0x7f292cfaa000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x8bac58f8> (a java.lang.Object)
at java.lang.Object.wait(Object.java:502)
at 
org.apache.flink.streaming.runtime.tasks.SynchronousSavepointLatch.blockUntilCheckpointIsAcknowledged(SynchronousSavepointLatch.java:66)
- locked <0x8bac58f8> (a java.lang.Object)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:726)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:604)
at 
org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.triggerCheckpoint(SynchronousCheckpointITCase.java:174)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1182)
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

"CloseableReaperThread" #13 daemon prio=5 os_prio=0 tid=0x7f2948d9b800 
nid=0x27a7 in Object.wait() [0x7f292d0ab000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
- locked <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
at 
org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193)

"Test Task (1/1)" #12 prio=5 os_prio=0 tid=0x7f2948d97000 nid=0x27a6 in 
Object.wait() [0x7f292d1ac000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x8e63f7d8> (a java.lang.Object)
at java.lang.Object.wait(Object.java:502)
at 
org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:63)
- locked <0x8e63f7d8> (a java.lang.Object)
at 
org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.run(SynchronousCheckpointITCase.java:161)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:335)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:724)
at java.lang.Thread.run(Thread.java:748)

"process reaper" #11 daemon prio=10 os_prio=0 tid=0x7f294885e000 nid=0x2793 
waiting on condition [0x7f292d7e5000]
   

[jira] [Closed] (FLINK-11534) Don't exit JVM after job termination with standalone job

2019-04-24 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-11534.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

> Don't exit JVM after job termination with standalone job
> 
>
> Key: FLINK-11534
> URL: https://issues.apache.org/jira/browse/FLINK-11534
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Affects Versions: 1.7.0, 1.8.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a job deployed in job cluster mode terminates, the JVM running the 
> StandaloneJobClusterEntryPoint will exit via System.exit(1).
> When managing such a job, this requires access to external logging systems in 
> order to get more details about failure causes or the final termination status.
> I believe that there is value in having a StandaloneJobClusterEntryPoint 
> option that does not exit the JVM after the job has terminated. This allows 
> users to gather further information if they are monitoring the job and 
> manually tear down the process.
> If there is agreement to have this feature, I would provide the 
> implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-11534) Don't exit JVM after job termination with standalone job

2019-04-16 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11534:

Affects Version/s: 1.8.0

> Don't exit JVM after job termination with standalone job
> 
>
> Key: FLINK-11534
> URL: https://issues.apache.org/jira/browse/FLINK-11534
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Affects Versions: 1.7.0, 1.8.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>
> If a job deployed in job cluster mode terminates, the JVM running the 
> StandaloneJobClusterEntryPoint will exit via System.exit(1).
> When managing such a job, this requires access to external logging systems in 
> order to get more details about failure causes or the final termination status.
> I believe that there is value in having a StandaloneJobClusterEntryPoint 
> option that does not exit the JVM after the job has terminated. This allows 
> users to gather further information if they are monitoring the job and 
> manually tear down the process.
> If there is agreement to have this feature, I would provide the 
> implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8093) flink job fail because of kafka producer create fail of "javax.management.InstanceAlreadyExistsException"

2019-03-30 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805723#comment-16805723
 ] 

Ufuk Celebi commented on FLINK-8093:


Yes, I agree with [~NicoK]. We need to dynamically set this per subtask; 
otherwise there is no reliable way to work around this. 
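Setting it per subtask could be sketched as follows (a minimal illustration, not Flink's connector code; `taskName` and `subtaskIndex` stand in for values the operator's runtime context would provide):

```java
import java.util.Properties;

// Sketch of the proposed workaround: give each parallel subtask a unique
// Kafka client.id so the JMX MBean names ("kafka.producer:type=
// producer-metrics,client-id=...") do not collide within one JVM.
public class ClientIdSketch {
    static Properties producerProps(String taskName, int subtaskIndex) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Unique per subtask -> unique producer-metrics MBean name
        props.setProperty("client.id", taskName + "-" + subtaskIndex);
        return props;
    }

    public static void main(String[] args) {
        // prints dp_client_log-7
        System.out.println(
                producerProps("dp_client_log", 7).getProperty("client.id"));
    }
}
```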

> flink job fail because of kafka producer create fail of 
> "javax.management.InstanceAlreadyExistsException"
> -
>
> Key: FLINK-8093
> URL: https://issues.apache.org/jira/browse/FLINK-8093
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.3.2
> Environment: flink 1.3.2, kafka 0.9.1
>Reporter: dongtingting
>Priority: Critical
>
> One TaskManager has multiple task slots, and one task fails because creating 
> the KafkaProducer fails. The reason for the failure is 
> "javax.management.InstanceAlreadyExistsException: 
> kafka.producer:type=producer-metrics,client-id=producer-3". The detailed 
> trace is:
> 2017-11-04 19:41:23,281 INFO  org.apache.flink.runtime.taskmanager.Task   
>   - Source: Custom Source -> Filter -> Map -> Filter -> Sink: 
> dp_client_**_log (7/80) (99551f3f892232d7df5eb9060fa9940c) switched from 
> RUNNING to FAILED.
> org.apache.kafka.common.KafkaException: Failed to construct kafka producer
> at 
> org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:321)
> at 
> org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:181)
> at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.getKafkaProducer(FlinkKafkaProducerBase.java:202)
> at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.open(FlinkKafkaProducerBase.java:212)
> at 
> org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
> at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:111)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:375)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:252)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.kafka.common.KafkaException: Error registering mbean 
> kafka.producer:type=producer-metrics,client-id=producer-3
> at 
> org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:159)
> at 
> org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:77)
> at 
> org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:288)
> at org.apache.kafka.common.metrics.Metrics.addMetric(Metrics.java:255)
> at org.apache.kafka.common.metrics.Metrics.addMetric(Metrics.java:239)
> at 
> org.apache.kafka.clients.producer.internals.RecordAccumulator.registerMetrics(RecordAccumulator.java:137)
> at 
> org.apache.kafka.clients.producer.internals.RecordAccumulator.(RecordAccumulator.java:111)
> at 
> org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:261)
> ... 9 more
> Caused by: javax.management.InstanceAlreadyExistsException: 
> kafka.producer:type=producer-metrics,client-id=producer-3
> at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
> at 
> com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
> at 
> org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:157)
> ... 16 more
> I suspect that tasks in different task slots of one TaskManager use different 
> classloaders, while the task id may be the same within one process. This leads 
> to KafkaProducer creation failing in one TaskManager. 
> Has anybody encountered the same problem? 
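The InstanceAlreadyExistsException above arises because Kafka registers a JMX MBean keyed by the producer's client.id, and MBean names must be unique per JVM: two producers in the same TaskManager process that end up with the same client.id (here producer-3) collide. A common workaround is to assign each producer an explicitly unique client.id. A minimal sketch with plain JDK types, assuming nothing about Flink's connector API (the class name, task name, and broker address are hypothetical):

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

public class UniqueClientId {

    // Counter shared within the JVM; JMX MBean names are global per process,
    // so every producer created in the same TaskManager JVM needs a distinct
    // client.id to avoid the duplicate registration seen above.
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // Returns producer properties with a client.id that is unique within this
    // JVM. "taskName" and the bootstrap servers value are placeholders.
    public static Properties producerProperties(String taskName) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder address
        props.put("client.id", taskName + "-" + COUNTER.incrementAndGet());
        return props;
    }

    public static void main(String[] args) {
        Properties a = producerProperties("sink");
        Properties b = producerProperties("sink");
        // Distinct client.ids, so both producers can register their metrics MBeans.
        System.out.println(a.getProperty("client.id"));
        System.out.println(b.getProperty("client.id"));
    }
}
```

These properties would then be passed to the KafkaProducer constructor instead of relying on the auto-generated producer-N client ids.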



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12060) Unify change pom version scripts

2019-03-28 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-12060:
---

 Summary: Unify change pom version scripts
 Key: FLINK-12060
 URL: https://issues.apache.org/jira/browse/FLINK-12060
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Scripts
Affects Versions: 1.8.0
Reporter: Ufuk Celebi


We have three places in `tools` that we use to update the pom versions when 
releasing:
1. https://github.com/apache/flink/blob/048367b/tools/change-version.sh#L31
2. 
https://github.com/apache/flink/blob/048367b/tools/releasing/create_release_branch.sh#L60
3. 
https://github.com/apache/flink/blob/048367b/tools/releasing/update_branch_version.sh#L52

The 1st option is buggy (it does not work with the new versioning of the shaded 
Hadoop build, e.g. {{2.4.1-1.9-SNAPSHOT}} will not be replaced). The 2nd and 
3rd work for pom files, but the 2nd one misses a change for the doc version 
that is present in the 3rd one.

I think we should unify these and call them where needed instead of duplicating 
this code in unexpected ways.

An initial quick fix could remove the 1st script and update the 2nd one to 
match the 3rd one.
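To make the 1st script's bug concrete: a version-replacement pattern written for plain x.y-SNAPSHOT versions silently skips the shaded Hadoop style 2.4.1-1.9-SNAPSHOT. The patterns below are hypothetical stand-ins for illustration, not the actual sed expressions used in the scripts:

```java
import java.util.regex.Pattern;

public class VersionReplaceDemo {

    // Hypothetical patterns: OLD mirrors a naive replacement that only matches
    // plain Flink versions like "1.9-SNAPSHOT"; FIXED also matches the shaded
    // Hadoop style "<hadoop-version>-<flink-version>-SNAPSHOT".
    static final Pattern OLD = Pattern.compile("^\\d+\\.\\d+-SNAPSHOT$");
    static final Pattern FIXED =
            Pattern.compile("^(\\d+\\.\\d+\\.\\d+-)?\\d+\\.\\d+-SNAPSHOT$");

    public static void main(String[] args) {
        String plain = "1.9-SNAPSHOT";
        String shaded = "2.4.1-1.9-SNAPSHOT"; // shaded Hadoop build version

        System.out.println(OLD.matcher(shaded).matches());   // false: missed by the naive pattern
        System.out.println(FIXED.matcher(plain).matches());  // true
        System.out.println(FIXED.matcher(shaded).matches()); // true
    }
}
```

A unified script would need the more permissive form (or an equivalent sed expression) so that both plain and shaded-Hadoop versions are bumped.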






[jira] [Commented] (FLINK-11953) Introduce Plugin/Loading system and integrate it with FileSystem

2019-03-19 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795743#comment-16795743
 ] 

Ufuk Celebi commented on FLINK-11953:
-

[~srichter] [Presto's plugin 
system|https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/server/PluginManager.java]
 could potentially serve as inspiration. I'm sure you already have a plan here, 
just bringing it up as FYI.

 

> Introduce Plugin/Loading system and integrate it with FileSystem
> 
>
> Key: FLINK-11953
> URL: https://issues.apache.org/jira/browse/FLINK-11953
> Project: Flink
>  Issue Type: Sub-task
>  Components: Connectors / FileSystem
>Affects Versions: 1.9.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
>






[jira] [Commented] (FLINK-11798) Incorrect Kubernetes Documentation

2019-03-07 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786870#comment-16786870
 ] 

Ufuk Celebi commented on FLINK-11798:
-

Do you mean the README in {{flink-container/kubernetes}}?

> Incorrect Kubernetes Documentation
> --
>
> Key: FLINK-11798
> URL: https://issues.apache.org/jira/browse/FLINK-11798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.7.2
>Reporter: Pritesh Patel
>Priority: Major
>
> I have been trying to use the kubernetes session cluster manifests provided 
> in the documentation. The -Dtaskmanager.host flag doesn't seem to pass 
> through, meaning it uses the pod name as the host name. This won't work.
> The current docs state the args should be:
>  
> {code:java}
> args: 
> - taskmanager 
> - "-Dtaskmanager.host=$(K8S_POD_IP)"
> {code}
>  
> I did manage to get it to work by using this manifest for the taskmanager 
> instead. This wasted a lot of time as it was very hard to find.
> {code:java}
> args: 
> - taskmanager.sh
> - -Dtaskmanager.host=$(K8S_POD_IP)
> - -Djobmanager.rpc.address=$(JOB_MANAGER_RPC_ADDRESS) 
> {code}





[jira] [Commented] (FLINK-11798) Incorrect Kubernetes Documentation

2019-03-06 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786145#comment-16786145
 ] 

Ufuk Celebi commented on FLINK-11798:
-

[~1u0] Could you take a look at this? 

> Incorrect Kubernetes Documentation
> --
>
> Key: FLINK-11798
> URL: https://issues.apache.org/jira/browse/FLINK-11798
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.7.2
>Reporter: Pritesh Patel
>Priority: Major
>
> I have been trying to use the kubernetes session cluster manifests provided 
> in the documentation. The -Dtaskmanager.host flag doesn't seem to pass 
> through, meaning it uses the pod name as the host name. This won't work.
> The current docs state the args should be:
>  
> {code:java}
> args: 
> - taskmanager 
> - "-Dtaskmanager.host=$(K8S_POD_IP)"
> {code}
>  
> I did manage to get it to work by using this manifest for the taskmanager 
> instead. This wasted a lot of time as it was very hard to find.
> {code:java}
> args: 
> - taskmanager.sh
> - -Dtaskmanager.host=$(K8S_POD_IP)
> - -Djobmanager.rpc.address=$(JOB_MANAGER_RPC_ADDRESS) 
> {code}





[jira] [Commented] (FLINK-11847) Docker job-specific image creation instruction make reference to --job-jar parameter that does not actually exist in the script

2019-03-06 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786143#comment-16786143
 ] 

Ufuk Celebi commented on FLINK-11847:
-

{{README.md}} refers to 
[https://github.com/apache/flink/blob/release-1.7/flink-container/docker/build.sh]
 (not the one in {{flink-contrib}} which is independent).

> Docker job-specific image creation instruction make reference to --job-jar 
> parameter that does not actually exist in the script
> ---
>
> Key: FLINK-11847
> URL: https://issues.apache.org/jira/browse/FLINK-11847
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Docker, Deployment / Kubernetes
>Affects Versions: 1.7.2
>Reporter: Frank Wilson
>Priority: Major
>
> Documentation [1] refers to --job-jar parameter that does not exist in the 
> script [2].
>  
>  
> [1]https://github.com/apache/flink/blob/release-1.7/flink-container/docker/README.md#building-the-docker-image
> [2] 
> https://github.com/apache/flink/blob/release-1.7/flink-contrib/docker-flink/build.sh#L24





[jira] [Updated] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever

2019-03-01 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11533:

Release Note: The container entry point will scan the Flink class path for 
the job jar. Therefore, the --job-classname command line argument is now 
optional.

> Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
> ---
>
> Key: FLINK-11533
> URL: https://issues.apache.org/jira/browse/FLINK-11533
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.8.0, 1.9.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Users running job clusters distribute their user code as part of the shared 
> classpath of all cluster components. We currently require users running 
> {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR 
> manifest entries that specify the main class of a JAR are ignored since they 
> are simply part of the classpath.
> I propose to add another optional command line argument to the 
> {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file 
> (such as {{lib/usercode.jar}}) and whose Manifest is respected.
> Arguments:
> {code}
> --job-jar 
> --job-classname name
> {code}
> Each argument is optional, but at least one of the two is required. The 
> job-classname has precedence over job-jar.
> Implementation wise we should be able to simply create the PackagedProgram 
> from the jar file path in ClassPathJobGraphRetriever.
> If there is agreement to have this feature, I would provide the 
> implementation.
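The manifest lookup proposed in this ticket can be sketched with plain JDK APIs. This is an illustration only, not the actual ClassPathJobGraphRetriever implementation; the class and the job class name below are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ManifestMainClass {

    // Resolves the job class from a JAR's manifest, mirroring how an entry
    // point could fall back to the Main-Class attribute when --job-classname
    // is not given. Returns null if no main class is declared.
    public static String mainClassOf(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Manifest manifest = jarFile.getManifest();
            return manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    // Builds a throwaway JAR whose manifest declares the given main class.
    public static Path writeJarWithMainClass(String mainClass) throws IOException {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        Path jar = Files.createTempFile("usercode", ".jar");
        try (JarOutputStream jos = new JarOutputStream(Files.newOutputStream(jar), manifest)) {
            // The manifest is written by the constructor; no entries are needed.
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        Path jar = writeJarWithMainClass("com.example.MyJob"); // hypothetical job class
        System.out.println(mainClassOf(jar)); // prints com.example.MyJob
        Files.delete(jar);
    }
}
```

With a helper like this, --job-classname (when given) would take precedence, and the manifest value would serve as the fallback.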





[jira] [Updated] (FLINK-11752) Move flink-python to opt

2019-03-01 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11752:

Release Note: flink-python has been moved from lib to opt and is not 
available on the default class path any more.

> Move flink-python to opt
> 
>
> Key: FLINK-11752
> URL: https://issues.apache.org/jira/browse/FLINK-11752
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Python, Build System
>Affects Versions: 1.7.2
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.8.0, 1.9.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As discussed on the [dev mailing 
> list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html],
>  we should move flink-python to opt instead of having it in lib by default.
> The streaming counterpart (flink-streaming-python) is already only available 
> in opt.
> I think we don't have many users of the Python batch API and this will make 
> the streaming/batch experience more consistent and would result in a cleaner 
> default classpath. 





[jira] [Closed] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever

2019-03-01 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-11533.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

> Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
> ---
>
> Key: FLINK-11533
> URL: https://issues.apache.org/jira/browse/FLINK-11533
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.8.0, 1.9.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Users running job clusters distribute their user code as part of the shared 
> classpath of all cluster components. We currently require users running 
> {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR 
> manifest entries that specify the main class of a JAR are ignored since they 
> are simply part of the classpath.
> I propose to add another optional command line argument to the 
> {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file 
> (such as {{lib/usercode.jar}}) and whose Manifest is respected.
> Arguments:
> {code}
> --job-jar 
> --job-classname name
> {code}
> Each argument is optional, but at least one of the two is required. The 
> job-classname has precedence over job-jar.
> Implementation wise we should be able to simply create the PackagedProgram 
> from the jar file path in ClassPathJobGraphRetriever.
> If there is agreement to have this feature, I would provide the 
> implementation.





[jira] [Updated] (FLINK-11752) Move flink-python to opt

2019-03-01 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11752:

Fix Version/s: 1.8.0

> Move flink-python to opt
> 
>
> Key: FLINK-11752
> URL: https://issues.apache.org/jira/browse/FLINK-11752
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Python, Build System
>Affects Versions: 1.7.2
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.8.0, 1.9.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As discussed on the [dev mailing 
> list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html],
>  we should move flink-python to opt instead of having it in lib by default.
> The streaming counterpart (flink-streaming-python) is already only available 
> in opt.
> I think we don't have many users of the Python batch API and this will make 
> the streaming/batch experience more consistent and would result in a cleaner 
> default classpath. 





[jira] [Commented] (FLINK-4387) Instability in KvStateClientTest.testClientServerIntegration()

2019-02-28 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781394#comment-16781394
 ] 

Ufuk Celebi commented on FLINK-4387:


I think as [~yunta] says, this should have been fixed when Netty was upgraded. 
Furthermore, the referenced test class does not exist anymore. [~rmetzger] I 
think we can close this.

> Instability in KvStateClientTest.testClientServerIntegration()
> --
>
> Key: FLINK-4387
> URL: https://issues.apache.org/jira/browse/FLINK-4387
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Queryable State
>Affects Versions: 1.1.0, 1.5.0, 1.6.0
>Reporter: Robert Metzger
>Assignee: Nico Kruber
>Priority: Major
>  Labels: test-stability
> Fix For: 1.2.0, 1.8.0
>
>
> According to this log: 
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/151491745/log.txt
> the {{KvStateClientTest}} didn't complete.
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7fb2b400a000 nid=0x29dc in Object.wait() 
> [0x7fb2bcb3b000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xf7c049a0> (a 
> io.netty.util.concurrent.DefaultPromise)
>   at java.lang.Object.wait(Object.java:502)
>   at 
> io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:254)
>   - locked <0xf7c049a0> (a 
> io.netty.util.concurrent.DefaultPromise)
>   at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:32)
>   at 
> org.apache.flink.runtime.query.netty.KvStateServer.shutDown(KvStateServer.java:185)
>   at 
> org.apache.flink.runtime.query.netty.KvStateClientTest.testClientServerIntegration(KvStateClientTest.java:680)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}
> and
> {code}
> Exception in thread "globalEventExecutor-1-3" java.lang.AssertionError
>   at 
> io.netty.util.concurrent.AbstractScheduledEventExecutor.pollScheduledTask(AbstractScheduledEventExecutor.java:83)
>   at 
> io.netty.util.concurrent.GlobalEventExecutor.fetchFromScheduledTaskQueue(GlobalEventExecutor.java:110)
>   at 
> io.netty.util.concurrent.GlobalEventExecutor.takeTask(GlobalEventExecutor.java:95)
>   at 
> io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:226)
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>   at java.lang.Thread.run(Thread.java:745)
> {code}





[jira] [Closed] (FLINK-1833) Refactor partition availability notification in ExecutionGraph

2019-02-28 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-1833.
--
Resolution: Won't Do

> Refactor partition availability notification in ExecutionGraph
> --
>
> Key: FLINK-1833
> URL: https://issues.apache.org/jira/browse/FLINK-1833
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 0.10.0
>Reporter: Ufuk Celebi
>Priority: Major
>
> The mechanism to notify the JobManager about available result partitions is 
> hard to understand. There are two parts to this:
> 1) JobManager
> - The deployment of receivers happens in the Execution class although it is 
> by now totally unrelated to the state of a specific execution. I propose to 
> move this to the respective IntermediateResultPartition.
> - The deployment information for a receiver is spread across different 
> components: when creating the TaskDeploymentDescriptor and the "caching" of 
> partition infos at the consuming vertex. This is very hard to follow and 
> results in unnecessary messages being sent (which are discarded at the TM).
> 2) TaskManager
> - Pipelined results notify where you would expect it in the ResultPartition, 
> but blocking results don't have an extra message and are implicitly 
> piggy-backed to the final state transition, after which the job manager 
> deploys receivers if all blocking partitions of a result have been produced.





[jira] [Updated] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever

2019-02-28 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11533:

Fix Version/s: 1.9.0

> Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
> ---
>
> Key: FLINK-11533
> URL: https://issues.apache.org/jira/browse/FLINK-11533
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Users running job clusters distribute their user code as part of the shared 
> classpath of all cluster components. We currently require users running 
> {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR 
> manifest entries that specify the main class of a JAR are ignored since they 
> are simply part of the classpath.
> I propose to add another optional command line argument to the 
> {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file 
> (such as {{lib/usercode.jar}}) and whose Manifest is respected.
> Arguments:
> {code}
> --job-jar 
> --job-classname name
> {code}
> Each argument is optional, but at least one of the two is required. The 
> job-classname has precedence over job-jar.
> Implementation wise we should be able to simply create the PackagedProgram 
> from the jar file path in ClassPathJobGraphRetriever.
> If there is agreement to have this feature, I would provide the 
> implementation.





[jira] [Closed] (FLINK-11784) KryoSerializerSnapshotTest occasionally fails on Travis

2019-02-28 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-11784.
---
Resolution: Cannot Reproduce

> KryoSerializerSnapshotTest occasionally fails on Travis
> ---
>
> Key: FLINK-11784
> URL: https://issues.apache.org/jira/browse/FLINK-11784
> Project: Flink
>  Issue Type: Bug
>  Components: API / Type Serialization System
>Affects Versions: 1.8.0
>Reporter: Ufuk Celebi
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> {{KryoSerializerSnapshotTest}} fails occasionally with:
> {code:java}
> 11:37:44.198 [ERROR] 
> tryingToRestoreWithNonExistingClassShouldBeIncompatible(org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest)
>   Time elapsed: 0.011 s  <<< ERROR!
> java.io.EOFException
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.kryoSnapshotWithMissingClass(KryoSerializerSnapshotTest.java:120)
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.tryingToRestoreWithNonExistingClassShouldBeIncompatible(KryoSerializerSnapshotTest.java:105){code}
> See [https://travis-ci.org/apache/flink/jobs/499371953] for full build output 
> (as part of a PR with unrelated changes).





[jira] [Updated] (FLINK-11784) KryoSerializerSnapshotTest occasionally fails on Travis

2019-02-28 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11784:

Fix Version/s: (was: 1.8.0)

> KryoSerializerSnapshotTest occasionally fails on Travis
> ---
>
> Key: FLINK-11784
> URL: https://issues.apache.org/jira/browse/FLINK-11784
> Project: Flink
>  Issue Type: Bug
>  Components: API / Type Serialization System
>Affects Versions: 1.8.0
>Reporter: Ufuk Celebi
>Priority: Blocker
>  Labels: test-stability
>
> {{KryoSerializerSnapshotTest}} fails occasionally with:
> {code:java}
> 11:37:44.198 [ERROR] 
> tryingToRestoreWithNonExistingClassShouldBeIncompatible(org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest)
>   Time elapsed: 0.011 s  <<< ERROR!
> java.io.EOFException
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.kryoSnapshotWithMissingClass(KryoSerializerSnapshotTest.java:120)
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.tryingToRestoreWithNonExistingClassShouldBeIncompatible(KryoSerializerSnapshotTest.java:105){code}
> See [https://travis-ci.org/apache/flink/jobs/499371953] for full build output 
> (as part of a PR with unrelated changes).





[jira] [Created] (FLINK-11784) KryoSerializerSnapshotTest occasionally fails on Travis

2019-02-28 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11784:
---

 Summary: KryoSerializerSnapshotTest occasionally fails on Travis
 Key: FLINK-11784
 URL: https://issues.apache.org/jira/browse/FLINK-11784
 Project: Flink
  Issue Type: Bug
  Components: API / Type Serialization System
Reporter: Ufuk Celebi


{{KryoSerializerSnapshotTest}} fails occasionally with:
{code:java}
11:37:44.198 [ERROR] 
tryingToRestoreWithNonExistingClassShouldBeIncompatible(org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest)
  Time elapsed: 0.011 s  <<< ERROR!
java.io.EOFException
at 
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.kryoSnapshotWithMissingClass(KryoSerializerSnapshotTest.java:120)
at 
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.tryingToRestoreWithNonExistingClassShouldBeIncompatible(KryoSerializerSnapshotTest.java:105){code}
See [https://travis-ci.org/apache/flink/jobs/499371953] for full build output 
(as part of a PR with unrelated changes).





[jira] [Closed] (FLINK-11752) Move flink-python to opt

2019-02-27 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-11752.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

> Move flink-python to opt
> 
>
> Key: FLINK-11752
> URL: https://issues.apache.org/jira/browse/FLINK-11752
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Python, Build System
>Affects Versions: 1.7.2
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> As discussed on the [dev mailing 
> list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html],
>  we should move flink-python to opt instead of having it in lib by default.
> The streaming counterpart (flink-streaming-python) is already only available 
> in opt.
> I think we don't have many users of the Python batch API and this will make 
> the streaming/batch experience more consistent and would result in a cleaner 
> default classpath. 





[jira] [Commented] (FLINK-1275) Add support to compress network I/O

2019-02-26 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778922#comment-16778922
 ] 

Ufuk Celebi commented on FLINK-1275:


[~yanghua] It has been a long time since I created this ticket. The original 
idea was to use lightweight compression (e.g. Snappy) for data that is 
transferred via the task managers. I haven't been active on the network stack 
for a while, so I can't judge what the current plans are for this.

I think that [~zjwang] or [~pnowojski] can provide more input on this.

> Add support to compress network I/O
> ---
>
> Key: FLINK-1275
> URL: https://issues.apache.org/jira/browse/FLINK-1275
> Project: Flink
>  Issue Type: Improvement
>  Components: Network
>Affects Versions: 0.8.0
>Reporter: Ufuk Celebi
>Priority: Minor
>






[jira] [Created] (FLINK-11752) Move flink-python to opt

2019-02-26 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11752:
---

 Summary: Move flink-python to opt
 Key: FLINK-11752
 URL: https://issues.apache.org/jira/browse/FLINK-11752
 Project: Flink
  Issue Type: Improvement
  Components: Build System, Python API
Affects Versions: 1.7.2
Reporter: Ufuk Celebi
Assignee: Ufuk Celebi


As discussed on the [dev mailing 
list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html],
 we should move flink-python to opt instead of having it in lib by default.

The streaming counterpart (flink-streaming-python) is already only available in 
opt.

I think we don't have many users of the Python batch API and this will make the 
streaming/batch experience more consistent and would result in a cleaner 
default classpath. 






[jira] [Closed] (FLINK-11545) Add option to manually set job ID in StandaloneJobClusterEntryPoint

2019-02-15 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-11545.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

> Add option to manually set job ID in StandaloneJobClusterEntryPoint
> ---
>
> Key: FLINK-11545
> URL: https://issues.apache.org/jira/browse/FLINK-11545
> Project: Flink
>  Issue Type: Sub-task
>  Components: Cluster Management, Kubernetes
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.8.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add an option to specify the job ID during job submissions via the 
> StandaloneJobClusterEntryPoint. The entry point fixes the job ID to be all 
> zeros currently.
>  





[jira] [Closed] (FLINK-11544) Add option to manually set job ID for job submissions via REST API

2019-02-14 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-11544.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

> Add option to manually set job ID for job submissions via REST API
> --
>
> Key: FLINK-11544
> URL: https://issues.apache.org/jira/browse/FLINK-11544
> Project: Flink
>  Issue Type: Sub-task
>  Components: REST
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.8.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add an option to specify the job ID during job submissions via the REST API.
>  





[jira] [Assigned] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever

2019-02-07 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-11533:
---

Assignee: Ufuk Celebi  (was: vinoyang)

> Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
> ---
>
> Key: FLINK-11533
> URL: https://issues.apache.org/jira/browse/FLINK-11533
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>
> Users running job clusters distribute their user code as part of the shared 
> classpath of all cluster components. We currently require users running 
> {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR 
> manifest entries that specify the main class of a JAR are ignored since they 
> are simply part of the classpath.
> I propose to add another optional command line argument to the 
> {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file 
> (such as {{lib/usercode.jar}}) and whose Manifest is respected.
> Arguments:
> {code}
> --job-jar 
> --job-classname name
> {code}
> Each argument is optional, but at least one of the two is required. The 
> job-classname has precedence over job-jar.
> Implementation wise we should be able to simply create the PackagedProgram 
> from the jar file path in ClassPathJobGraphRetriever.
> If there is agreement to have this feature, I would provide the 
> implementation.





[jira] [Commented] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever

2019-02-07 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763366#comment-16763366
 ] 

Ufuk Celebi commented on FLINK-11533:
-

[~yanghua] I assigned myself. Sorry for the confusion. As I wrote in the 
original ticket description, I was waiting to get confirmation for this ticket 
before providing the implementation (I already have it).

> Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
> ---
>
> Key: FLINK-11533
> URL: https://issues.apache.org/jira/browse/FLINK-11533
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>
> Users running job clusters distribute their user code as part of the shared 
> classpath of all cluster components. We currently require users running 
> {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR 
> manifest entries that specify the main class of a JAR are ignored since they 
> are simply part of the classpath.
> I propose to add another optional command line argument to the 
> {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file 
> (such as {{lib/usercode.jar}}) and whose Manifest is respected.
> Arguments:
> {code}
> --job-jar 
> --job-classname name
> {code}
> Each argument is optional, but at least one of the two is required. The 
> job-classname has precedence over job-jar.
> Implementation-wise, we should be able to simply create the PackagedProgram 
> from the jar file path in ClassPathJobGraphRetriever.
> If there is agreement to have this feature, I would provide the 
> implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11534) Don't exit JVM after job termination with standalone job

2019-02-07 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763365#comment-16763365
 ] 

Ufuk Celebi commented on FLINK-11534:
-

[~yanghua] I assigned myself. Sorry for the confusion. As I wrote in the 
original ticket description, I was waiting to get confirmation for this ticket 
before providing the implementation (I already have it).

> Don't exit JVM after job termination with standalone job
> 
>
> Key: FLINK-11534
> URL: https://issues.apache.org/jira/browse/FLINK-11534
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>
> If a job deployed in job cluster mode terminates, the JVM running the 
> StandaloneJobClusterEntryPoint will exit via System.exit(1).
> When managing such a job, this requires access to external systems for logging 
> in order to get more details about failure causes or final termination status.
> I believe that there is value in having a StandaloneJobClusterEntryPoint 
> option that does not exit the JVM after the job has terminated. This allows 
> users to gather further information if they are monitoring the job and 
> manually tear down the process.
> If there is agreement to have this feature, I would provide the 
> implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11534) Don't exit JVM after job termination with standalone job

2019-02-07 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-11534:
---

Assignee: Ufuk Celebi  (was: vinoyang)

> Don't exit JVM after job termination with standalone job
> 
>
> Key: FLINK-11534
> URL: https://issues.apache.org/jira/browse/FLINK-11534
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>
> If a job deployed in job cluster mode terminates, the JVM running the 
> StandaloneJobClusterEntryPoint will exit via System.exit(1).
> When managing such a job, this requires access to external systems for logging 
> in order to get more details about failure causes or final termination status.
> I believe that there is value in having a StandaloneJobClusterEntryPoint 
> option that does not exit the JVM after the job has terminated. This allows 
> users to gather further information if they are monitoring the job and 
> manually tear down the process.
> If there is agreement to have this feature, I would provide the 
> implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-11544) Add option to manually set job ID for job submissions via REST API

2019-02-07 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11544:

Summary: Add option to manually set job ID for job submissions via REST API 
 (was: Add option manually set job ID for job submissions via REST API)

> Add option to manually set job ID for job submissions via REST API
> --
>
> Key: FLINK-11544
> URL: https://issues.apache.org/jira/browse/FLINK-11544
> Project: Flink
>  Issue Type: Sub-task
>  Components: REST
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>
> Add an option to specify the job ID during job submissions via the REST API.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11544) Add option manually set job ID for job submissions via REST API

2019-02-07 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11544:
---

 Summary: Add option manually set job ID for job submissions via 
REST API
 Key: FLINK-11544
 URL: https://issues.apache.org/jira/browse/FLINK-11544
 Project: Flink
  Issue Type: Sub-task
  Components: REST
Affects Versions: 1.7.0
Reporter: Ufuk Celebi
Assignee: Ufuk Celebi


Add an option to specify the job ID during job submissions via the REST API.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-11545) Add option to manually set job ID in StandaloneJobClusterEntryPoint

2019-02-07 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11545:

Priority: Minor  (was: Major)

> Add option to manually set job ID in StandaloneJobClusterEntryPoint
> ---
>
> Key: FLINK-11545
> URL: https://issues.apache.org/jira/browse/FLINK-11545
> Project: Flink
>  Issue Type: Sub-task
>  Components: Cluster Management, Kubernetes
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
>Priority: Minor
>
> Add an option to specify the job ID during job submissions via the 
> StandaloneJobClusterEntryPoint. The entry point currently fixes the job ID to 
> all zeros.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11546) Add option to manually set job ID in CLI

2019-02-07 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11546:
---

 Summary: Add option to manually set job ID in CLI
 Key: FLINK-11546
 URL: https://issues.apache.org/jira/browse/FLINK-11546
 Project: Flink
  Issue Type: Sub-task
  Components: Client
Affects Versions: 1.7.0
Reporter: Ufuk Celebi


Add an option to specify the job ID during job submissions via the CLI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11545) Add option to manually set job ID in StandaloneJobClusterEntryPoint

2019-02-07 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11545:
---

 Summary: Add option to manually set job ID in 
StandaloneJobClusterEntryPoint
 Key: FLINK-11545
 URL: https://issues.apache.org/jira/browse/FLINK-11545
 Project: Flink
  Issue Type: Sub-task
  Components: Cluster Management, Kubernetes
Affects Versions: 1.7.0
Reporter: Ufuk Celebi
Assignee: Ufuk Celebi


Add an option to specify the job ID during job submissions via the 
StandaloneJobClusterEntryPoint. The entry point currently fixes the job ID to 
all zeros.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11537) ExecutionGraph does not reach terminal state when JobMaster lost leadership

2019-02-07 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762483#comment-16762483
 ] 

Ufuk Celebi commented on FLINK-11537:
-

Is this related to/a duplicate of 
https://issues.apache.org/jira/browse/FLINK-10439?

> ExecutionGraph does not reach terminal state when JobMaster lost leadership
> ---
>
> Key: FLINK-11537
> URL: https://issues.apache.org/jira/browse/FLINK-11537
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.8.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Critical
> Fix For: 1.8.0
>
>
> The {{ExecutionGraph}} sometimes does not reach a terminal state if the 
> {{JobMaster}} lost the leadership. The reason is that we use the fenced main 
> thread executor to execute {{ExecutionGraph}} changes and we don't wait for 
> the {{ExecutionGraph}} to reach the terminal state before we set the fencing 
> token to {{null}}.
> One possible solution would be to wait for the {{ExecutionGraph}} to reach 
> the terminal state before clearing the fencing token. This has, however, the 
> downside that the {{JobMaster}} is still reachable until the 
> {{ExecutionGraph}} has been properly terminated. Alternatively, we could use 
> the unfenced main thread executor to send the cancel calls out.
> A Travis run where the problem occurred is here: 
> https://travis-ci.org/tillrohrmann/flink/jobs/489119926



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11534) Don't exit JVM after job termination with standalone job

2019-02-06 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761683#comment-16761683
 ] 

Ufuk Celebi commented on FLINK-11534:
-

[~till.rohrmann] Yes. This would work. Thanks for the pointer.


> Don't exit JVM after job termination with standalone job
> 
>
> Key: FLINK-11534
> URL: https://issues.apache.org/jira/browse/FLINK-11534
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Minor
>
> If a job deployed in job cluster mode terminates, the JVM running the 
> StandaloneJobClusterEntryPoint will exit via System.exit(1).
> When managing such a job, this requires access to external systems for logging 
> in order to get more details about failure causes or final termination status.
> I believe that there is value in having a StandaloneJobClusterEntryPoint 
> option that does not exit the JVM after the job has terminated. This allows 
> users to gather further information if they are monitoring the job and 
> manually tear down the process.
> If there is agreement to have this feature, I would provide the 
> implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11534) Don't exit JVM after job termination with standalone job

2019-02-05 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11534:
---

 Summary: Don't exit JVM after job termination with standalone job
 Key: FLINK-11534
 URL: https://issues.apache.org/jira/browse/FLINK-11534
 Project: Flink
  Issue Type: New Feature
  Components: Cluster Management
Affects Versions: 1.7.0
Reporter: Ufuk Celebi


If a job deployed in job cluster mode terminates, the JVM running the 
StandaloneJobClusterEntryPoint will exit via System.exit(1).

When managing such a job, this requires access to external systems for logging 
in order to get more details about failure causes or final termination status.

I believe that there is value in having a StandaloneJobClusterEntryPoint option 
that does not exit the JVM after the job has terminated. This allows users to 
gather further information if they are monitoring the job and manually tear 
down the process.

If there is agreement to have this feature, I would provide the implementation.
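The opt-out described above could be sketched roughly as follows. This is only an illustration of the proposed behavior; the flag name {{--no-exit-on-finish}} and the helper class are hypothetical, not actual Flink options:

```java
/** Sketch of the proposed opt-out (the flag name is hypothetical, not a real Flink option). */
final class EntryPointExitSketch {

    /** True unless the hypothetical --no-exit-on-finish flag was passed. */
    static boolean shouldExitAfterJob(String[] args) {
        for (String arg : args) {
            if ("--no-exit-on-finish".equals(arg)) {
                return false;
            }
        }
        return true;
    }

    static void onJobTerminated(String[] args, int exitCode) {
        if (shouldExitAfterJob(args)) {
            System.exit(exitCode); // current behavior: the entry point's JVM exits immediately
        }
        // Otherwise keep the process alive so operators can inspect logs and the
        // final termination status before tearing the process down manually.
    }
}
```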



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever

2019-02-05 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11533:
---

 Summary: Retrieve job class name from JAR manifest in 
ClassPathJobGraphRetriever
 Key: FLINK-11533
 URL: https://issues.apache.org/jira/browse/FLINK-11533
 Project: Flink
  Issue Type: New Feature
  Components: Cluster Management
Affects Versions: 1.7.0
Reporter: Ufuk Celebi


Users running job clusters distribute their user code as part of the shared 
classpath of all cluster components. We currently require users running 
{{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR 
manifest entries that specify the main class of a JAR are ignored since they 
are simply part of the classpath.

I propose to add another optional command line argument to the 
{{StandaloneClusterEntryPoint}} that specifies the location of a JAR file (such 
as {{lib/usercode.jar}}) and whose Manifest is respected.

Arguments:
{code}
--job-jar 
--job-classname name
{code}

Each argument is optional, but at least one of the two is required. The 
job-classname has precedence over job-jar.

Implementation-wise, we should be able to simply create the PackagedProgram from 
the jar file path in ClassPathJobGraphRetriever.

If there is agreement to have this feature, I would provide the implementation.
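The manifest lookup the proposal relies on is a few lines with {{java.util.jar}}. A minimal sketch (the class and method names below are illustrative, not the actual ClassPathJobGraphRetriever change):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;

/** Sketch: resolve the job class from a JAR's manifest, as the proposal describes. */
final class ManifestJobClassResolver {

    /** Returns the Main-Class manifest attribute, or null if the JAR has no manifest. */
    static String mainClassOf(Path jarPath) {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            if (jar.getManifest() == null) {
                return null; // no manifest; caller must fall back to --job-classname
            }
            return jar.getManifest().getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Precedence rule from the ticket: an explicit class name wins over the JAR manifest. */
    static String resolve(String jobClassname, Path jobJar) {
        if (jobClassname != null) {
            return jobClassname;
        }
        return jobJar != null ? mainClassOf(jobJar) : null;
    }
}
```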








--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11525) Add option to manually set job ID

2019-02-05 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760788#comment-16760788
 ] 

Ufuk Celebi commented on FLINK-11525:
-

[~f.pompermaier] I believe that ticket is orthogonal to what I'm proposing here.

> Add option to manually set job ID 
> --
>
> Key: FLINK-11525
> URL: https://issues.apache.org/jira/browse/FLINK-11525
> Project: Flink
>  Issue Type: New Feature
>  Components: Client, REST
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Minor
>
> When submitting Flink jobs programmatically it is desirable to have the 
> option to manually set the job ID in order to have idempotent job 
> submissions. This simplifies failure handling on the user side as duplicate 
> submissions will be rejected by Flink. In general allowing to manually set 
> the job ID can be beneficial for third party tooling.
> The default behavior should not be altered. The following job submission 
> entry points should be extended to allow to specify this option:
> 1. REST API
> 2. StandaloneJobClusterEntrypoint
> 3. CLI
> Note that for 2. FLINK-10921 already suggested to allow to configure the job 
> ID manually.
> If there is agreement to have this feature, I'll go ahead and create sub 
> tasks for the mentioned entry points (and provide the implementation for 1. 
> and 2.).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-11525) Add option to manually set job ID

2019-02-05 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11525:

Description: 
When submitting Flink jobs programmatically it is desirable to have the option 
to manually set the job ID in order to have idempotent job submissions. This 
simplifies failure handling on the user side as duplicate submissions will be 
rejected by Flink. In general allowing to manually set the job ID can be 
beneficial for third party tooling.

The default behavior should not be altered. The following job submission entry 
points should be extended to allow to specify this option:
1. REST API
2. StandaloneJobClusterEntrypoint
3. CLI

Note that for 2. FLINK-10921 already suggested to allow to configure the job ID 
manually.

If there is agreement to have this feature, I'll go ahead and create sub tasks 
for the mentioned entry points (and provide the implementation for 1. and 2.).

  was:
When submitting Flink jobs programmatically it is desirable to have the option 
to manually set the job ID in order to have idempotent job submissions. This 
simplifies failure handling on the user side as duplicate submissions will be 
rejected by Flink. In general allowing to manually set the job ID can be 
beneficial for third party tooling.

The default behavior should not be altered. The following job submission entry 
points should be extended to allow to specify this option:
1. REST API
2. StandaloneJobClusterEntrypoint
3. CLI

Note that for 2., FLINK-10921 already suggested to allow to configure the job 
ID manually.

If there is agreement to have this feature, I'll go ahead and create sub tasks 
for the mentioned entry points (and provide the implementation for 1. and 2.).


> Add option to manually set job ID 
> --
>
> Key: FLINK-11525
> URL: https://issues.apache.org/jira/browse/FLINK-11525
> Project: Flink
>  Issue Type: New Feature
>  Components: Client, REST
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Minor
>
> When submitting Flink jobs programmatically it is desirable to have the 
> option to manually set the job ID in order to have idempotent job 
> submissions. This simplifies failure handling on the user side as duplicate 
> submissions will be rejected by Flink. In general allowing to manually set 
> the job ID can be beneficial for third party tooling.
> The default behavior should not be altered. The following job submission 
> entry points should be extended to allow to specify this option:
> 1. REST API
> 2. StandaloneJobClusterEntrypoint
> 3. CLI
> Note that for 2. FLINK-10921 already suggested to allow to configure the job 
> ID manually.
> If there is agreement to have this feature, I'll go ahead and create sub 
> tasks for the mentioned entry points (and provide the implementation for 1. 
> and 2.).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11525) Add option to manually set job ID

2019-02-05 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11525:
---

 Summary: Add option to manually set job ID 
 Key: FLINK-11525
 URL: https://issues.apache.org/jira/browse/FLINK-11525
 Project: Flink
  Issue Type: New Feature
  Components: Client, REST
Affects Versions: 1.7.0
Reporter: Ufuk Celebi


When submitting Flink jobs programmatically it is desirable to have the option 
to manually set the job ID in order to have idempotent job submissions. This 
simplifies failure handling on the user side as duplicate submissions will be 
rejected by Flink. In general, allowing the job ID to be set manually can be 
beneficial for third-party tooling.

The default behavior should not be altered. The following job submission entry 
points should be extended to allow to specify this option:
1. REST API
2. StandaloneJobClusterEntrypoint
3. CLI

Note that for 2., FLINK-10921 already suggested to allow to configure the job 
ID manually.

If there is agreement to have this feature, I'll go ahead and create sub tasks 
for the mentioned entry points (and provide the implementation for 1. and 2.).
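The idempotency argument assumes the client can derive the same ID on every retry. Flink job IDs are 16-byte identifiers, so a client could, for example, hash a stable logical job name. A minimal, hedged sketch (hashing scheme and class name are illustrative, not part of the proposal):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: derive a stable 16-byte job ID from a logical job name, so that
 *  resubmitting the same job yields the same ID and duplicates can be rejected. */
final class DeterministicJobId {

    /** Lower-case hex of the first 16 bytes of SHA-256(name). */
    static String fromName(String name) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(name.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (int i = 0; i < 16; i++) {
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every JVM
        }
    }
}
```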



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders

2019-01-21 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11402:

Component/s: Local Runtime

> User code can fail with an UnsatisfiedLinkError in the presence of multiple 
> classloaders
> 
>
> Key: FLINK-11402
> URL: https://issues.apache.org/jira/browse/FLINK-11402
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, Local Runtime
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz
>
>
> As reported on the user mailing list thread "[`env.java.opts` not persisting 
> after job canceled or failed and then 
> restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]",
>  there can be issues with using native libraries and user code class loading.
> h2. Steps to reproduce
> I was able to reproduce the issue reported on the mailing list using 
> [snappy-java|https://github.com/xerial/snappy-java] in a user program. 
> Running the attached user program works fine on initial submission, but 
> results in a failure when re-executed.
> I'm using Flink 1.7.0 using a standalone cluster started via 
> {{bin/start-cluster.sh}}.
> 0. Unpack attached Maven project and build using {{mvn clean package}} *or* 
> directly use attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> 1. Download 
> [snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar]
>  and unpack libsnappyjava for your system:
> {code}
> jar tf snappy-java-1.1.7.2.jar | grep libsnappy
> ...
> org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so
> ...
> org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib
> ...
> {code}
> 2. Configure system library path to {{libsnappyjava}} in {{flink-conf.yaml}} 
> (path needs to be adjusted for your system):
> {code}
> env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64
> {code}
> 3. Run attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> {code}
> bin/flink run hello-snappy-1.0-SNAPSHOT.jar
> Starting execution of program
> Program execution finished
> Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished.
> Job Runtime: 359 ms
> {code}
> 4. Rerun attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> {code}
> bin/flink run hello-snappy-1.0-SNAPSHOT.jar
> Starting execution of program
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: 7d69baca58f33180cb9251449ddcd396)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>   at com.github.uce.HelloSnappy.main(HelloSnappy.java:18)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813)
>   at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
>   at 
> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
>   ... 17 more
> Caused by: java.lang.UnsatisfiedLinkError: Native Library 
> /.../org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib already loaded 
> in another classloader
>   at 
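The "already loaded in another classloader" failure stems from the JVM binding a JNI library to exactly one classloader, while each job submission runs in a fresh user-code classloader. A minimal, native-free illustration of that classloader isolation (the class names here are illustrative, not Flink code):

```java
import java.io.InputStream;

/** Shows that identical class bytes defined by two classloaders yield two distinct
 *  Class objects -- the isolation that makes a JNI library appear "already loaded
 *  in another classloader" when a job is re-submitted with a fresh user-code loader. */
final class ClassLoaderIsolationDemo {

    /** Stand-in for user code; in the real failure this class calls System.loadLibrary. */
    static class Marker {}

    /** Parent-less loader that defines a class directly from raw bytes. */
    static final class FreshLoader extends ClassLoader {
        FreshLoader() {
            super(null); // no parent: this loader must define the class itself
        }

        Class<?> define(byte[] bytes) {
            return defineClass(null, bytes, 0, bytes.length);
        }
    }

    /** Re-defines Marker in a brand-new classloader and returns the new Class object. */
    static Class<?> redefineMarker() {
        String resource = Marker.class.getName().replace('.', '/') + ".class";
        try (InputStream in = ClassLoaderIsolationDemo.class.getClassLoader()
                .getResourceAsStream(resource)) {
            return new FreshLoader().define(in.readAllBytes());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Since native libraries stick to the first defining loader, a common mitigation is to place such dependencies on the shared classpath ({{lib/}}) so they are loaded once by the parent loader rather than per submission.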

[jira] [Commented] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders

2019-01-21 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748241#comment-16748241
 ] 

Ufuk Celebi commented on FLINK-11402:
-

Similar issue for RocksDB (FLINK-5408)

> User code can fail with an UnsatisfiedLinkError in the presence of multiple 
> classloaders
> 
>
> Key: FLINK-11402
> URL: https://issues.apache.org/jira/browse/FLINK-11402
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz
>
>
> As reported on the user mailing list thread "[`env.java.opts` not persisting 
> after job canceled or failed and then 
> restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]",
>  there can be issues with using native libraries and user code class loading.
> h2. Steps to reproduce
> I was able to reproduce the issue reported on the mailing list using 
> [snappy-java|https://github.com/xerial/snappy-java] in a user program. 
> Running the attached user program works fine on initial submission, but 
> results in a failure when re-executed.
> I'm using Flink 1.7.0 using a standalone cluster started via 
> {{bin/start-cluster.sh}}.
> 0. Unpack attached Maven project and build using {{mvn clean package}} *or* 
> directly use attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> 1. Download 
> [snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar]
>  and unpack libsnappyjava for your system:
> {code}
> jar tf snappy-java-1.1.7.2.jar | grep libsnappy
> ...
> org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so
> ...
> org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib
> ...
> {code}
> 2. Configure system library path to {{libsnappyjava}} in {{flink-conf.yaml}} 
> (path needs to be adjusted for your system):
> {code}
> env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64
> {code}
> 3. Run attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> {code}
> bin/flink run hello-snappy-1.0-SNAPSHOT.jar
> Starting execution of program
> Program execution finished
> Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished.
> Job Runtime: 359 ms
> {code}
> 4. Rerun attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> {code}
> bin/flink run hello-snappy-1.0-SNAPSHOT.jar
> Starting execution of program
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: 7d69baca58f33180cb9251449ddcd396)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>   at com.github.uce.HelloSnappy.main(HelloSnappy.java:18)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813)
>   at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
>   at 
> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
>   ... 17 more
> Caused by: java.lang.UnsatisfiedLinkError: Native Library 
> /.../org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib already loaded 
> in another classloader
>   at 

[jira] [Updated] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders

2019-01-21 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11402:

Attachment: hello-snappy.tgz

> User code can fail with an UnsatisfiedLinkError in the presence of multiple 
> classloaders
> 
>
> Key: FLINK-11402
> URL: https://issues.apache.org/jira/browse/FLINK-11402
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz
>
>
> As reported on the user mailing list thread "[`env.java.opts` not persisting 
> after job canceled or failed and then 
> restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]",
>  there can be issues with using native libraries and user code class loading.
> h2. Steps to reproduce
> I was able to reproduce the issue reported on the mailing list using 
> [snappy-java|https://github.com/xerial/snappy-java] in a user program. 
> Running the attached user program works fine on initial submission, but 
> results in a failure when re-executed.
> I'm using Flink 1.7.0 using a standalone cluster started via 
> {{bin/start-cluster.sh}}.
> 0. Unpack attached Maven project and build using {{mvn clean package}} *or* 
> directly use attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> 1. Download 
> [snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar]
>  and unpack libsnappyjava for your system:
> {code}
> jar tf snappy-java-1.1.7.2.jar | grep libsnappy
> ...
> org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so
> ...
> org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib
> ...
> {code}
> 2. Configure system library path to {{libsnappyjava}} in {{flink-conf.yaml}} 
> (path needs to be adjusted for your system):
> {code}
> env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64
> {code}
> 3. Run attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> {code}
> bin/flink run hello-snappy-1.0-SNAPSHOT.jar
> Starting execution of program
> Program execution finished
> Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished.
> Job Runtime: 359 ms
> {code}
> 4. Rerun attached {{hello-snappy-1.0-SNAPSHOT.jar}}
> {code}
> bin/flink run hello-snappy-1.0-SNAPSHOT.jar
> Starting execution of program
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: 7d69baca58f33180cb9251449ddcd396)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>   at com.github.uce.HelloSnappy.main(HelloSnappy.java:18)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
>   at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813)
>   at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
>   at 
> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
>   ... 17 more
> Caused by: java.lang.UnsatisfiedLinkError: Native Library 
> /.../org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib already loaded 
> in another classloader
>   at 

[jira] [Updated] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders

2019-01-21 Thread Ufuk Celebi (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-11402:

Attachment: hello-snappy-1.0-SNAPSHOT.jar

> User code can fail with an UnsatisfiedLinkError in the presence of multiple 
> classloaders
> 
>
> Key: FLINK-11402
> URL: https://issues.apache.org/jira/browse/FLINK-11402
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Major
> Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz
>
>

[jira] [Created] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders

2019-01-21 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-11402:
---

 Summary: User code can fail with an UnsatisfiedLinkError in the 
presence of multiple classloaders
 Key: FLINK-11402
 URL: https://issues.apache.org/jira/browse/FLINK-11402
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination
Affects Versions: 1.7.0
Reporter: Ufuk Celebi
 Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz

As reported on the user mailing list thread "[`env.java.opts` not persisting 
after job canceled or failed and then 
restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]",
 there can be issues with using native libraries and user code class loading.

h2. Steps to reproduce

I was able to reproduce the issue reported on the mailing list using 
[snappy-java|https://github.com/xerial/snappy-java] in a user program. Running 
the attached user program works fine on initial submission, but results in a 
failure when re-executed.

I'm using Flink 1.7.0 using a standalone cluster started via 
{{bin/start-cluster.sh}}.

0. Unpack attached Maven project and build using {{mvn clean package}} *or* 
directly use attached {{hello-snappy-1.0-SNAPSHOT.jar}}
1. Download 
[snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar]
 and unpack libsnappyjava for your system:
{code}
jar tf snappy-java-1.1.7.2.jar | grep libsnappy
...
org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so
...
org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib
...
{code}
2. Configure system library path to {{libsnappyjava}} in {{flink-conf.yaml}} 
(path needs to be adjusted for your system):
{code}
env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64
{code}
3. Run attached {{hello-snappy-1.0-SNAPSHOT.jar}}
{code}
bin/flink run hello-snappy-1.0-SNAPSHOT.jar
Starting execution of program
Program execution finished
Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished.
Job Runtime: 359 ms
{code}
4. Rerun attached {{hello-snappy-1.0-SNAPSHOT.jar}}
{code}
bin/flink run hello-snappy-1.0-SNAPSHOT.jar
Starting execution of program


 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 
7d69baca58f33180cb9251449ddcd396)
  at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
  at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487)
  at 
org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
  at com.github.uce.HelloSnappy.main(HelloSnappy.java:18)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
  at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
  at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
  at 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813)
  at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287)
  at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
  at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050)
  at 
org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
  at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution 
failed.
  at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
  at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
  ... 17 more
Caused by: java.lang.UnsatisfiedLinkError: Native Library 
/.../org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib already loaded in 
another classloader
  at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1907)
  at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1861)
  at java.lang.Runtime.loadLibrary0(Runtime.java:870)
  at java.lang.System.loadLibrary(System.java:1122)
  at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:182)
  at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:154)
  at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
  at 
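
The error comes from snappy-java's static initializer running once per user-code classloader: Flink creates a fresh child classloader for every submitted job, so the second submission re-runs {{System.loadLibrary}} against a JNI library the JVM has already bound to the first loader. A minimal JDK-only sketch of the underlying mechanism (class and loader names are illustrative; no Flink or snappy involved):

{code:java}
import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderDemo {

    public static void main(String[] args) throws Exception {
        // Location of this class's compiled code (a directory or jar).
        URL codeSource = ClassLoaderDemo.class.getProtectionDomain()
                .getCodeSource().getLocation();

        // Two isolated loaders over the same bytes, mimicking the fresh
        // user-code classloader Flink creates for each job submission.
        URLClassLoader first = new URLClassLoader(new URL[]{codeSource}, null);
        URLClassLoader second = new URLClassLoader(new URL[]{codeSource}, null);

        Class<?> a = first.loadClass("ClassLoaderDemo");
        Class<?> b = second.loadClass("ClassLoaderDemo");

        // Same class file, but distinct Class objects with separate static
        // state: each loader would run a System.loadLibrary call again.
        System.out.println(a == b);                     // false
        System.out.println(a == ClassLoaderDemo.class); // false
    }
}
{code}

Because each loader re-runs static initializers, the second run fires {{System.loadLibrary}} for a native library that is already loaded in another classloader, which is exactly the {{UnsatisfiedLinkError}} above. A common mitigation is to place jars that load native code in Flink's {{lib/}} directory, so the parent classloader initializes them exactly once.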

[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager

2018-12-13 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720158#comment-16720158
 ] 

Ufuk Celebi commented on FLINK-11127:
-

[~plucas] Thanks for the update.

Regarding connecting by IP: I think in order to expose the TM by IP, users need 
to configure the {{taskmanager.host}} option as mentioned by [~spoganshev]. His 
workaround should work, but it is indeed a bit involved.
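
For readers landing here, that workaround is commonly wired up via the Kubernetes downward API. The fragment below is an illustrative sketch, not taken from this thread: the env var name is arbitrary, and whether the {{-D}} flag actually reaches the TaskManager JVM depends on the image's entrypoint.

{code}
# Illustrative TaskManager container fragment: inject the pod IP and pass it
# to Flink as taskmanager.host, so the JM can reach the TM's metrics actor
# by IP instead of an unresolvable pod name.
containers:
  - name: taskmanager
    image: flink:1.7
    env:
      - name: K8S_POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
    args: ["taskmanager", "-Dtaskmanager.host=$(K8S_POD_IP)"]
{code}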

> Make metrics query service establish connection to JobManager
> -
>
> Key: FLINK-11127
> URL: https://issues.apache.org/jira/browse/FLINK-11127
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination, Kubernetes, Metrics
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Major
>
> As part of FLINK-10247, the internal metrics query service has been separated 
> into its own actor system. Before this change, the JobManager (JM) queried 
> TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a 
> separate connection to the TM metrics query service actor.
> In the context of Kubernetes, this is problematic as the JM will typically 
> *not* be able to resolve the TMs by name, resulting in warnings as follows:
> {code}
> 2018-12-11 08:32:33,962 WARN  akka.remote.ReliableDeliverySupervisor  
>   - Association with remote system 
> [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has 
> failed, address is now gated for [50] ms. Reason: [Association failed with 
> [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused 
> by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve]
> {code}
> In order to expose the TMs by name in Kubernetes, users require a service 
> *for each* TM instance, which is not practical.
> This currently results in the web UI not being able to display some basic 
> metrics, such as the number of sent records. You can reproduce this by 
> following the READMEs 
> in {{flink-container/kubernetes}}.
> This worked before, because the JM is typically exposed via a service with a 
> known name and the TMs establish the connection to it which the metrics query 
> service piggybacked on.
> A potential solution to this might be to let the query service connect to the 
> JM similar to how the TMs register.
> I tagged this ticket as an improvement, but in the context of Kubernetes I 
> would consider this to be a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager

2018-12-12 Thread Ufuk Celebi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719004#comment-16719004
 ] 

Ufuk Celebi commented on FLINK-11127:
-

[~spoganshev] Yes, the port configuration is correct.

I'm aware of the {{StatefulSet}} workaround, but in my opinion it is not 
feasible for larger replica counts, because it results in pods being created 
sequentially:

{quote}
For a StatefulSet with N replicas, when Pods are being deployed, they are 
created sequentially, in order from {0..N-1}.
[...]
When the nginx example above is created, three Pods will be deployed in the 
order web-0, web-1, web-2. web-1 will not be deployed before web-0 is [Running 
and Ready|https://kubernetes.io/docs/user-guide/pod-states/], and web-2 will 
not be deployed until web-1 is Running and Ready.
{quote}
(https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees)


> Make metrics query service establish connection to JobManager
> -
>
> Key: FLINK-11127
> URL: https://issues.apache.org/jira/browse/FLINK-11127
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination, Kubernetes, Metrics
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Priority: Major
>




