[jira] [Commented] (FLINK-14380) Type Extractor POJO setter check does not allow for Immutable Case Class

2019-10-16 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952835#comment-16952835
 ] 

Stephan Ewen commented on FLINK-14380:
--

[~elliotvilhelm] What [~aljoscha] mentioned is that the Scala Case Class code 
should not go through this logic in the first place.
It should be handled by the Scala type inference and extractor, which should 
support immutable Scala Case Classes.

> Type Extractor POJO setter check does not allow for Immutable Case Class
> 
>
> Key: FLINK-14380
> URL: https://issues.apache.org/jira/browse/FLINK-14380
> Project: Flink
>  Issue Type: Bug
>  Components: API / Scala
>Affects Versions: 1.8.2, 1.9.0
> Environment: Not relevant
>  
>Reporter: Elliot Pourmand
>Priority: Major
>
> When deciding whether a class qualifies as a POJO, the type extractor checks 
> that the class implements a getter and a setter method for each field. For 
> the setter method, Flink asserts that the return type is `void`. This is an 
> issue when using a case class, because the return type of a case class setter 
> is often a copy of the object's class. Consider the following case class:
> {code:scala}
> case class SomeClass(x: Int) {
>   def x_=(newX: Int): SomeClass = this.copy(x = newX)
> }
> {code}
> This class will be identified as not being a valid POJO, although getter 
> (generated) and setter methods are provided, because the return type of the 
> setter is not void.
> This issue discourages immutability and makes it impossible to use case 
> classes without falling back to the Kryo serializer.
> The issue is located in 
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/java/typeutils/TypeExtractor.java
>  on line 1806. Here is a permalink to the line 
> https://github.com/apache/flink/blob/80b27a150026b7b5cb707bd9fa3e17f565bb8112/flink-core/src/main/java/org/apache/flink/api/java/typeutils/TypeExtractor.java#L1806
> A copy of the if check is here
> {code:java}
> if ((methodNameLow.equals("set" + fieldNameLow) ||
>         methodNameLow.equals(fieldNameLow + "_$eq")) &&
>     m.getParameterTypes().length == 1 && // one parameter of the field's type
>     (m.getGenericParameterTypes()[0].equals(fieldType) ||
>         (fieldTypeWrapper != null && m.getParameterTypes()[0].equals(fieldTypeWrapper)) ||
>         (fieldTypeGeneric != null && m.getGenericParameterTypes()[0].equals(fieldTypeGeneric))) &&
>     // return type is void
>     m.getReturnType().equals(Void.TYPE)) {
>     hasSetter = true;
> }
> {code}
> I believe the 
> {code:java}
> m.getReturnType().equals(Void.TYPE)
> {code}
> should be modified to 
> {code:java}
> m.getReturnType().equals(Void.TYPE) || m.getReturnType().equals(clazz)
> {code}
> This would allow case class setters that return a copy of the object, 
> enabling the use of case classes while maintaining immutability, without 
> being forced to fall back to the Kryo serializer.
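The relaxed check can be sketched outside Flink with plain reflection. The `Point` class and `isValidSetter` helper below are hypothetical illustrations (not Flink code), standing in for what a case-class-style copy setter looks like to the type extractor:

```java
import java.lang.reflect.Method;

// Hypothetical immutable class whose "setter" returns a modified copy,
// similar to the case class setter above.
final class Point {
    private final int x;
    Point(int x) { this.x = x; }
    public int getX() { return x; }
    public Point setX(int newX) { return new Point(newX); } // returns a copy, not void
}

public class SetterCheckSketch {
    // Sketch of the relaxed check: accept a setter whose return type is
    // either void or the declaring class itself.
    static boolean isValidSetter(Method m, Class<?> clazz) {
        return m.getParameterTypes().length == 1
                && (m.getReturnType().equals(Void.TYPE) || m.getReturnType().equals(clazz));
    }

    public static void main(String[] args) throws Exception {
        Method setter = Point.class.getMethod("setX", int.class);
        // Current check: rejected, because the return type is Point, not void.
        System.out.println(setter.getReturnType().equals(Void.TYPE)); // prints false
        // Relaxed check: accepted.
        System.out.println(isValidSetter(setter, Point.class)); // prints true
    }
}
```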



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14118) Reduce the unnecessary flushing when there is no data available for flush

2019-10-03 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943577#comment-16943577
 ] 

Stephan Ewen commented on FLINK-14118:
--

Did we have significant changes in that part of the network stack since the 1.9 
release, such that we would expect subtle issues (only detected after many 
runs) in 1.9 that would not apply to 1.10?

If it is a serious performance issue, users would appreciate the fix in 1.9 as 
well.

> Reduce the unnecessary flushing when there is no data available for flush
> -
>
> Key: FLINK-14118
> URL: https://issues.apache.org/jira/browse/FLINK-14118
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Yingjie Cao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new flush implementation, which works by triggering a netty user event, 
> may cause a performance regression compared to the old synchronization-based 
> one. More specifically, when there is exactly one BufferConsumer in the 
> buffer queue of a subpartition and no new data will be added for a while 
> (perhaps because there is simply no input, or because the operator collects 
> some data for processing and does not emit records immediately), i.e. there 
> is no data to send, the OutputFlusher will continuously signal data 
> availability and wake up the netty thread, even though the pollBuffer method 
> returns no data.
> For some of our production jobs, this incurs 20% to 40% CPU overhead compared 
> to the old implementation. We tried to fix the problem by checking whether 
> new data is available when flushing; if there is none, the netty thread is 
> not notified. This works for our jobs, and the CPU usage falls back to the 
> previous level.
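The fix described in the report can be sketched with a minimal stand-in for the subpartition. The class names below are hypothetical illustrations, not Flink's actual network-stack classes:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal stand-in for a subpartition: the periodic flusher only wakes the
// downstream ("netty") thread when there is actually data to poll.
class SubpartitionSketch {
    private final ConcurrentLinkedQueue<byte[]> buffers = new ConcurrentLinkedQueue<>();
    int wakeups = 0; // counts how often the "netty thread" was notified

    void add(byte[] data) { buffers.add(data); }
    byte[] pollBuffer() { return buffers.poll(); }

    // Fixed behavior: notify only if data is available, instead of
    // unconditionally on every flush tick.
    void flushIfDataAvailable() {
        if (!buffers.isEmpty()) {
            wakeups++;
        }
    }
}

public class FlushSketch {
    public static void main(String[] args) {
        SubpartitionSketch p = new SubpartitionSketch();
        // Ten flusher ticks with an empty queue: no wakeups with the fix.
        for (int i = 0; i < 10; i++) {
            p.flushIfDataAvailable();
        }
        System.out.println(p.wakeups); // prints 0
        p.add(new byte[]{1});
        p.flushIfDataAvailable();
        System.out.println(p.wakeups); // prints 1
    }
}
```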





[jira] [Commented] (FLINK-14225) Travis is unable to parse one of the secure environment variables

2019-10-03 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943574#comment-16943574
 ] 

Stephan Ewen commented on FLINK-14225:
--

Do you know which variable that is?

I am not aware of any changes to the secret environment variables in a while. 
Has that error always been there, or did it appear only recently?

> Travis is unable to parse one of the secure environment variables
> -
>
> Key: FLINK-14225
> URL: https://issues.apache.org/jira/browse/FLINK-14225
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.10.0
>Reporter: Gary Yao
>Priority: Blocker
>
> Example: https://travis-ci.org/apache/flink/jobs/589531009
> {noformat}
> We were unable to parse one of your secure environment variables.
> Please make sure to escape special characters such as ' ' (white space) and $ 
> (dollar symbol) with \ (backslash) .
> For example, thi$isanexample would be typed as thi\$isanexample. See 
> https://docs.travis-ci.com/user/encryption-keys.
> {noformat}





[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5

2019-10-03 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943572#comment-16943572
 ] 

Stephan Ewen commented on FLINK-13417:
--

Sorry for having been out of this thread for a while.

If there is a way to avoid explicitly shaded modules (especially inside Flink), 
I believe that is preferable. We are also trying hard to get other statically 
relocated libraries moved to flink-shaded.

Is ZK used in more than one module (also outside flink-runtime)? If not, then 
it should be sufficient to just shade it inside the runtime package.
If it is used in other places, is it used directly, or through a runtime 
utility? If we can get to the latter case, we could have ZK shaded/relocated 
only in flink-runtime.

> Bump Zookeeper to 3.5.5
> ---
>
> Key: FLINK-13417
> URL: https://issues.apache.org/jira/browse/FLINK-13417
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Konstantin Knauf
>Priority: Major
> Fix For: 1.10.0
>
>
> User might want to secure their Zookeeper connection via SSL.
> This requires a Zookeeper version >= 3.5.1. We might as well try to bump it 
> to 3.5.5, which is the latest version. 





[jira] [Commented] (FLINK-14123) Change taskmanager.memory.fraction default value to 0.6

2019-09-30 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941325#comment-16941325
 ] 

Stephan Ewen commented on FLINK-14123:
--

[~xintongsong][~azagrebin] Given that, in the future, we compute the amount of 
managed memory prior to starting the internal services, it probably makes sense 
to lower the fraction anyway.

It would also be good to gather experience with the Parallel GC, which still 
seems to be the best GC, throughput-wise, for batch processing.

What do you think about this change here?
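As a rough illustration of what the option controls: `taskmanager.memory.fraction` is the share of the free heap that Flink reserves as managed memory for batch operators. The helper and numbers below are illustrative only, not Flink's actual memory calculation:

```java
public class MemoryFractionDemo {
    // Illustrative only: the fraction of the free heap reserved as managed
    // memory for batch operators (sorting, hashing, caching).
    static long managedMemory(long freeHeapBytes, double fraction) {
        return (long) (freeHeapBytes * fraction);
    }

    public static void main(String[] args) {
        long freeHeap = 4L * 1024 * 1024 * 1024; // assume 4 GiB of free heap
        // Current default 0.7 vs. proposed default 0.6: the lower fraction
        // leaves more heap for user code and GC headroom.
        System.out.println(managedMemory(freeHeap, 0.7));
        System.out.println(managedMemory(freeHeap, 0.6));
    }
}
```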

> Change taskmanager.memory.fraction default value to 0.6
> ---
>
> Key: FLINK-14123
> URL: https://issues.apache.org/jira/browse/FLINK-14123
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.9.0
>Reporter: liupengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we are testing Flink batch tasks such as TeraSort; however, the 
> job ran for only a short while before failing due to an OOM.
>  
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: a807e1d635bd4471ceea4282477f8850)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:539)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort$.main(FlinkTeraSort.scala:89)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort.main(FlinkTeraSort.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:604)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:466)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1007)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1080)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1080)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
>   ... 23 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: GC 
> overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
>   at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:82)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
> due to an exception: GC overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:831)
> Caused by: 

[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files

2019-09-27 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939689#comment-16939689
 ] 

Stephan Ewen commented on FLINK-13808:
--

Thank you [~klion26] for the debugging and the suggestion to fix this.

We should definitely have a "cancellation" call from the JobManager to the 
TaskManagers that both cancels the asynchronous materialization threads and 
cleans up the checkpoint's temp files.

I am contemplating whether having an additional 
{{taskmanager.checkpoints.max-concurrent}} value that limits the number of 
concurrent checkpoints on the TM might be a good safety net.

Could the cases where this safety net is needed (i.e., where the cancellation 
messages do not cancel quickly enough) be exactly those dangerous cases where 
the system regresses into a situation where no checkpoints go through any more, 
because different TMs have different checkpoints lingering, and there is always 
at least one TM declining a checkpoint?
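A possible shape for such a safety net, sketched below. Note that `taskmanager.checkpoints.max-concurrent` is only a proposed option name, and the classes here are an illustrative sketch, not Flink code:

```java
import java.util.concurrent.Semaphore;

// Hypothetical TM-side limit on concurrent checkpoints: a checkpoint that
// cannot acquire a permit is declined instead of lingering and leaking files.
class CheckpointLimiter {
    private final Semaphore permits;

    CheckpointLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Returns true if the checkpoint may start; false means "decline". */
    boolean tryStartCheckpoint() {
        return permits.tryAcquire();
    }

    /** Called when the async materialization finishes or is cancelled. */
    void checkpointFinished() {
        permits.release();
    }
}

public class LimiterDemo {
    public static void main(String[] args) {
        CheckpointLimiter limiter = new CheckpointLimiter(2);
        System.out.println(limiter.tryStartCheckpoint()); // prints true
        System.out.println(limiter.tryStartCheckpoint()); // prints true
        System.out.println(limiter.tryStartCheckpoint()); // prints false: limit reached
        limiter.checkpointFinished();
        System.out.println(limiter.tryStartCheckpoint()); // prints true again
    }
}
```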


> Checkpoints expired by timeout may leak RocksDB files
> -
>
> Key: FLINK-13808
> URL: https://issues.apache.org/jira/browse/FLINK-13808
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.8.0, 1.8.1
> Environment: So far only reliably reproducible on a 4-node cluster 
> with parallelism ≥ 100. But do try 
> https://github.com/jcaesar/flink-rocksdb-file-leak
>Reporter: Julius Michaelis
>Priority: Minor
>
> A RocksDB state backend with HDFS checkpoints, with or without local 
> recovery, may leak files in {{io.tmp.dirs}} on checkpoint expiry by timeout.
> If the size of a checkpoint crosses what can be transferred during one 
> checkpoint timeout, checkpoints will continue to fail forever. If this is 
> combined with a quick rollover of SST files (e.g. due to a high density of 
> writes), this may quickly exhaust available disk space (or memory, as /tmp is 
> the default location).
> As a workaround, the jobmanager's REST API can be frequently queried for 
> failed checkpoints, and associated files deleted accordingly.
> I've tried investigating the cause a little bit, but I'm stuck:
>  * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before 
> completing.}} and similar gets printed, so
>  * [{{abortExpired}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547-L549],
>  so
>  * [{{dispose}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L416],
>  so
>  * [{{cancelCaller}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488],
>  so
>  * [the canceler is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497]
>  ([through one more 
> layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
>  so
>  * [{{cleanup}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95],
>  (possibly [not from 
> {{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]),
>  so
>  * [{{cleanupProvidedResources}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162]
>  (this is the indirection that made me give up), so
>  * [this trace 
> log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372]
>  should be printed, but it isn't.
> I have some time to further investigate, but I'd appreciate help on finding 
> out where in this chain things go wrong.





[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files

2019-09-27 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939683#comment-16939683
 ] 

Stephan Ewen commented on FLINK-13808:
--

[~Caesar] yes, each checkpoint spawns a background thread per task to 
asynchronously materialize the state checkpoint.

It is the same problem as with the temp files: Too many concurrent and 
lingering checkpoints.

> Checkpoints expired by timeout may leak RocksDB files
> -
>
> Key: FLINK-13808
> URL: https://issues.apache.org/jira/browse/FLINK-13808
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.8.0, 1.8.1
> Environment: So far only reliably reproducible on a 4-node cluster 
> with parallelism ≥ 100. But do try 
> https://github.com/jcaesar/flink-rocksdb-file-leak
>Reporter: Julius Michaelis
>Priority: Minor
>
> A RocksDB state backend with HDFS checkpoints, with or without local 
> recovery, may leak files in {{io.tmp.dirs}} on checkpoint expiry by timeout.
> If the size of a checkpoint crosses what can be transferred during one 
> checkpoint timeout, checkpoints will continue to fail forever. If this is 
> combined with a quick rollover of SST files (e.g. due to a high density of 
> writes), this may quickly exhaust available disk space (or memory, as /tmp is 
> the default location).
> As a workaround, the jobmanager's REST API can be frequently queried for 
> failed checkpoints, and associated files deleted accordingly.
> I've tried investigating the cause a little bit, but I'm stuck:
>  * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before 
> completing.}} and similar gets printed, so
>  * [{{abortExpired}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547-L549],
>  so
>  * [{{dispose}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L416],
>  so
>  * [{{cancelCaller}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488],
>  so
>  * [the canceler is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497]
>  ([through one more 
> layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
>  so
>  * [{{cleanup}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95],
>  (possibly [not from 
> {{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]),
>  so
>  * [{{cleanupProvidedResources}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162]
>  (this is the indirection that made me give up), so
>  * [this trace 
> log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372]
>  should be printed, but it isn't.
> I have some time to further investigate, but I'd appreciate help on finding 
> out where in this chain things go wrong.





[jira] [Commented] (FLINK-14123) Change taskmanager.memory.fraction default value to 0.6

2019-09-20 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934441#comment-16934441
 ] 

Stephan Ewen commented on FLINK-14123:
--

Would the behavior be the same when using G1 GC? Or CMS?

CMS may be used mainly by streaming-only users, but G1 is definitely also used 
as the default by various batch users.

> Change taskmanager.memory.fraction default value to 0.6
> ---
>
> Key: FLINK-14123
> URL: https://issues.apache.org/jira/browse/FLINK-14123
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.9.0
>Reporter: liupengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we are testing Flink batch tasks such as TeraSort; however, the 
> job ran for only a short while before failing due to an OOM.
>  
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: a807e1d635bd4471ceea4282477f8850)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:539)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort$.main(FlinkTeraSort.scala:89)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort.main(FlinkTeraSort.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:604)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:466)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1007)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1080)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1080)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
>   ... 23 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: GC 
> overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
>   at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:82)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
> due to an exception: GC overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:831)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at 
> org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.deserialize(BytePrimitiveArraySerializer.java:84)
>   at 
> 

[jira] [Commented] (FLINK-14035) Introduce/Change some log for snapshot to better analysis checkpoint problem

2019-09-20 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934202#comment-16934202
 ] 

Stephan Ewen commented on FLINK-14035:
--

[~klion26] no problem and no need to apologize. That is exactly what JIRA 
discussions are for, to check and discuss.

> Introduce/Change some log for snapshot to better analysis checkpoint problem
> 
>
> Key: FLINK-14035
> URL: https://issues.apache.org/jira/browse/FLINK-14035
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Currently, the information for checkpoints is mostly debug-level logging 
> (especially on the TM side). If we want to track the checkpoint steps and the 
> time consumed by each step when a checkpoint fails or takes too long, we need 
> to restart the job with debug logging enabled. This issue wants to improve 
> that situation by changing some existing log statements from debug to info 
> and by adding some more debug logging. We have changed these log levels in 
> our production at Alibaba, and it has caused no problems so far.
>  
> Details
> Change the logs below from debug level to info:
>  * log about {{Starting checkpoint xxx}} on the TM side
>  * log about sync complete on the TM side
>  * log about async complete on the TM side
> Add debug logs:
>  * log about receiving the barrier for exactly-once mode, to align with 
> at-least-once mode
>  
> If this issue is valid, then I'm happy to contribute it.





[jira] [Commented] (FLINK-14118) Reduce the unnecessary flushing when there is no data available for flush

2019-09-20 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934194#comment-16934194
 ] 

Stephan Ewen commented on FLINK-14118:
--

I agree, looking at the code, in theory there should be no regression.

If we merge this now, would you be able to monitor the behavior of the jobs 
after upgrading Flink and report any regressions you find?


> Reduce the unnecessary flushing when there is no data available for flush
> -
>
> Key: FLINK-14118
> URL: https://issues.apache.org/jira/browse/FLINK-14118
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Yingjie Cao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new flush implementation, which works by triggering a netty user event, 
> may cause a performance regression compared to the old synchronization-based 
> one. More specifically, when there is exactly one BufferConsumer in the 
> buffer queue of a subpartition and no new data will be added for a while 
> (perhaps because there is simply no input, or because the operator collects 
> some data for processing and does not emit records immediately), i.e. there 
> is no data to send, the OutputFlusher will continuously signal data 
> availability and wake up the netty thread, even though the pollBuffer method 
> returns no data.
> For some of our production jobs, this incurs 20% to 40% CPU overhead compared 
> to the old implementation. We tried to fix the problem by checking whether 
> new data is available when flushing; if there is none, the netty thread is 
> not notified. This works for our jobs, and the CPU usage falls back to the 
> previous level.





[jira] [Commented] (FLINK-14123) Change taskmanager.memory.fraction default value to 0.6

2019-09-20 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934190#comment-16934190
 ] 

Stephan Ewen commented on FLINK-14123:
--

Thanks!
What GC were you using? G1, CMS, Parallel?
The logs look like you were using Parallel Scavenge GC, is that correct?

> Change taskmanager.memory.fraction default value to 0.6
> ---
>
> Key: FLINK-14123
> URL: https://issues.apache.org/jira/browse/FLINK-14123
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.9.0
>Reporter: liupengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we are testing Flink batch tasks such as TeraSort; however, the 
> job ran for only a short while before failing due to an OOM.
>  
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: a807e1d635bd4471ceea4282477f8850)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:539)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort$.main(FlinkTeraSort.scala:89)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort.main(FlinkTeraSort.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:604)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:466)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1007)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1080)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1080)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
>   ... 23 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: GC 
> overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
>   at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:82)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
> due to an exception: GC overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:831)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at 
> org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.deserialize(BytePrimitiveArraySerializer.java:84)
>   at 
> 

[jira] [Closed] (FLINK-13845) Drop all the content of removed "Checkpointed" interface

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13845.


> Drop all the content of removed "Checkpointed" interface
> 
>
> Key: FLINK-13845
> URL: https://issues.apache.org/jira/browse/FLINK-13845
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> From [FLINK-7461|https://issues.apache.org/jira/browse/FLINK-7461], we have 
> already removed backward compatibility with versions before Flink 1.1, and the 
> deprecated {{Checkpointed}} interface has been removed entirely. However, much 
> content, including Javadocs and documentation, still refers to this 
> non-existent interface. I think it's time to remove that content now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-13845) Drop all the content of removed "Checkpointed" interface

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13845.
--
Fix Version/s: 1.9.1
   Resolution: Fixed

Fixed in
  - 1.9.1 via 4503b6cd3b884e9529a5637fcbcf10836f655f90
  - 1.10.0 via e273d8d89fd47d7d83e9e0e6adec5cb5514a334e

> Drop all the content of removed "Checkpointed" interface
> 
>
> Key: FLINK-13845
> URL: https://issues.apache.org/jira/browse/FLINK-13845
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> From [FLINK-7461|https://issues.apache.org/jira/browse/FLINK-7461], we have 
> already removed backward compatibility with versions before Flink 1.1, and the 
> deprecated {{Checkpointed}} interface has been removed entirely. However, much 
> content, including Javadocs and documentation, still refers to this 
> non-existent interface. I think it's time to remove that content now.





[jira] [Closed] (FLINK-13796) Remove unused variable

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13796.


> Remove unused variable
> --
>
> Key: FLINK-13796
> URL: https://issues.apache.org/jira/browse/FLINK-13796
> Project: Flink
>  Issue Type: Task
>  Components: Deployment / YARN
>Affects Versions: 1.8.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Closed] (FLINK-9941) Flush in ScalaCsvOutputFormat before close method

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-9941.
---

> Flush in ScalaCsvOutputFormat before close method
> -
>
> Key: FLINK-9941
> URL: https://issues.apache.org/jira/browse/FLINK-9941
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Scala
>Affects Versions: 1.5.1, 1.6.4, 1.7.2, 1.8.1
>Reporter: Ryan Tao
>Assignee: Jiayi Liao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Because not every stream's close() method flushes, we need to call flush() 
> manually before close() to ensure that buffered output is not lost.
> I noticed that CsvOutputFormat (Java API) already does this, as follows:
> {code:java}
> //CsvOutputFormat
> public void close() throws IOException {
> if (wrt != null) {
> this.wrt.flush();
> this.wrt.close();
> }
> super.close();
> }
> {code}
>  
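For illustration, here is a minimal, self-contained sketch of the same flush-before-close pattern the Scala format should adopt. The `RecordingWriter` and `closeSafely` names are illustrative stand-ins, not Flink code; the recorder only exists to make the call order observable:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

public class FlushBeforeClose {
    // A writer that records the order of flush/close calls.
    static class RecordingWriter extends StringWriter {
        final List<String> calls = new ArrayList<>();
        @Override public void flush() { calls.add("flush"); super.flush(); }
        @Override public void close() throws IOException { calls.add("close"); super.close(); }
    }

    // Mirrors the close() logic quoted above: flush first, then close.
    static void closeSafely(Writer wrt) throws IOException {
        if (wrt != null) {
            wrt.flush();
            wrt.close();
        }
    }

    public static void main(String[] args) throws IOException {
        RecordingWriter w = new RecordingWriter();
        w.write("a,b,c\n");
        closeSafely(w);
        System.out.println(w.calls); // prints [flush, close]
    }
}
```

The explicit flush guarantees that any data still buffered in the wrapping stream reaches the underlying sink even if that stream's own close() does not flush.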





[jira] [Resolved] (FLINK-13796) Remove unused variable

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13796.
--
Fix Version/s: 1.10.0
   Resolution: Fixed

Fixed via d98132f4e286296a87ab3ae879ca09eb98165c27

> Remove unused variable
> --
>
> Key: FLINK-13796
> URL: https://issues.apache.org/jira/browse/FLINK-13796
> Project: Flink
>  Issue Type: Task
>  Components: Deployment / YARN
>Affects Versions: 1.8.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (FLINK-9941) Flush in ScalaCsvOutputFormat before close method

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-9941.
-
Fix Version/s: 1.9.1
   1.10.0
   Resolution: Fixed

Fixed in
  - 1.9.1 via 916f3a264556b5fac685d6b3f4e89028d6561515
  - 1.10.0 via 6f36df64e5a375ae203ba32662e5b01fcc38e340

> Flush in ScalaCsvOutputFormat before close method
> -
>
> Key: FLINK-9941
> URL: https://issues.apache.org/jira/browse/FLINK-9941
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Scala
>Affects Versions: 1.5.1, 1.6.4, 1.7.2, 1.8.1
>Reporter: Ryan Tao
>Assignee: Jiayi Liao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Because not every stream's close() method flushes, we need to call flush() 
> manually before close() to ensure that buffered output is not lost.
> I noticed that CsvOutputFormat (Java API) already does this, as follows:
> {code:java}
> //CsvOutputFormat
> public void close() throws IOException {
> if (wrt != null) {
> this.wrt.flush();
> this.wrt.close();
> }
> super.close();
> }
> {code}
>  





[jira] [Resolved] (FLINK-13449) Add ARM architecture to MemoryArchitecture

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13449.
--
Fix Version/s: 1.9.1
   1.10.0
   Resolution: Fixed

Fixed in
  - 1.9.1 via 79bc17c3b95302c98676c9a13d33f8450124fb80
  - 1.10.0 via 07f86e6eeeadefc6621e13987b0158dd0018c614

> Add ARM architecture to MemoryArchitecture
> --
>
> Key: FLINK-13449
> URL: https://issues.apache.org/jira/browse/FLINK-13449
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Stephan Ewen
>Assignee: wangxiyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, {{MemoryArchitecture}} recognizes only various versions of x86 and 
> amd64 / ia64.
> We should add aarch64 for ARM to the known architectures.
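As a rough sketch of the requested change: the enum values and the {{os.arch}} parsing below are illustrative assumptions, not Flink's actual MemoryArchitecture implementation, but they show how aarch64 would join the known architectures:

```java
public class ArchDetector {
    public enum ProcessorArchitecture { X86, AMD64, AARCH64, UNKNOWN }

    // Maps the JVM's "os.arch" system property value to a known architecture.
    public static ProcessorArchitecture fromArchName(String archName) {
        switch (archName.toLowerCase()) {
            case "x86":
            case "i386":
            case "i486":
            case "i586":
            case "i686":
                return ProcessorArchitecture.X86;
            case "amd64":
            case "x86_64":
                return ProcessorArchitecture.AMD64;
            case "aarch64": // 64-bit ARM, the value this issue asks to recognize
                return ProcessorArchitecture.AARCH64;
            default:
                return ProcessorArchitecture.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        // On an ARM server JVM, os.arch reports "aarch64".
        System.out.println(fromArchName(System.getProperty("os.arch")));
    }
}
```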





[jira] [Closed] (FLINK-13449) Add ARM architecture to MemoryArchitecture

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13449.


> Add ARM architecture to MemoryArchitecture
> --
>
> Key: FLINK-13449
> URL: https://issues.apache.org/jira/browse/FLINK-13449
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Stephan Ewen
>Assignee: wangxiyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, {{MemoryArchitecture}} recognizes only various versions of x86 and 
> amd64 / ia64.
> We should add aarch64 for ARM to the known architectures.





[jira] [Resolved] (FLINK-11859) Improve SpanningRecordSerializer performance by serializing record length to serialization buffer directly

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-11859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-11859.
--
Resolution: Fixed

Fixed via 35e57a8460c7f03010972f587bb24052ea694cce

> Improve SpanningRecordSerializer performance by serializing record length to 
> serialization buffer directly
> --
>
> Key: FLINK-11859
> URL: https://issues.apache.org/jira/browse/FLINK-11859
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Yingjie Cao
>Assignee: Yingjie Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the current implementation of SpanningRecordSerializer, the length of a 
> record is serialized into an intermediate length buffer and then copied to the 
> target buffer. The length field can instead be serialized directly into the 
> data buffer (serializationBuffer), which avoids the copy of the length buffer. 
> Though the total number of bytes copied remains unchanged, this removes one 
> copy of a small piece of data (the length), which incurs high relative 
> overhead. The flink-benchmarks results show that it improves performance; the 
> test results are as follows.
> Result with the optimization:
> |Benchmark|Mode|Threads|Samples|Score|Score Error (99.9%)|Unit|Param: 
> channelsFlushTimeout|Param: stateBackend|
> |KeyByBenchmarks.arrayKeyBy|thrpt|1|30|2228.049605|77.631804|ops/ms| | |
> |KeyByBenchmarks.tupleKeyBy|thrpt|1|30|3968.361739|193.501755|ops/ms| | |
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|3030.016702|29.272713|ops/ms|
>  |MEMORY|
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|2754.77678|26.215395|ops/ms|
>  |FS|
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|3001.957606|29.288019|ops/ms|
>  |FS_ASYNC|
> |RocksStateBackendBenchmark.stateBackends|thrpt|1|30|123.698984|3.339233|ops/ms|
>  |ROCKS|
> |RocksStateBackendBenchmark.stateBackends|thrpt|1|30|126.252137|1.137735|ops/ms|
>  |ROCKS_INC|
> |SerializationFrameworkMiniBenchmarks.serializerAvro|thrpt|1|30|323.658098|5.855697|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerKryo|thrpt|1|30|183.34423|3.710787|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerPojo|thrpt|1|30|404.380233|5.131744|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerRow|thrpt|1|30|527.193369|10.176726|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerTuple|thrpt|1|30|550.073024|11.724412|ops/ms|
>  | |
> |StreamNetworkBroadcastThroughputBenchmarkExecutor.networkBroadcastThroughput|thrpt|1|30|564.690627|13.766809|ops/ms|
>  | |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|49918.11806|2324.234776|ops/ms|100,100ms|
>  |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|10443.63491|315.835962|ops/ms|100,100ms,SSL|
>  |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|21387.47608|2779.832704|ops/ms|1000,1ms|
>  |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|26585.85453|860.243347|ops/ms|1000,100ms|
>  |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|8252.563405|947.129028|ops/ms|1000,100ms,SSL|
>  |
> |SumLongsBenchmark.benchmarkCount|thrpt|1|30|8806.021402|263.995836|ops/ms| | 
> |
> |WindowBenchmarks.globalWindow|thrpt|1|30|4573.620126|112.099391|ops/ms| | |
> |WindowBenchmarks.sessionWindow|thrpt|1|30|585.246412|7.026569|ops/ms| | |
> |WindowBenchmarks.slidingWindow|thrpt|1|30|449.302134|4.123669|ops/ms| | |
> |WindowBenchmarks.tumblingWindow|thrpt|1|30|2979.806858|33.818909|ops/ms| | |
> |StreamNetworkLatencyBenchmarkExecutor.networkLatency1to1|avgt|1|30|12.842865|0.13796|ms/op|
>  | |
> Result without the optimization:
>  
> |Benchmark|Mode|Threads|Samples|Score|Score Error (99.9%)|Unit|Param: 
> channelsFlushTimeout|Param: stateBackend|
> |KeyByBenchmarks.arrayKeyBy|thrpt|1|30|2060.241715|59.898485|ops/ms| | |
> |KeyByBenchmarks.tupleKeyBy|thrpt|1|30|3645.306819|223.821719|ops/ms| | |
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|2992.698822|36.978115|ops/ms|
>  |MEMORY|
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|2756.10949|27.798937|ops/ms|
>  |FS|
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|2965.969876|44.159793|ops/ms|
>  |FS_ASYNC|
> |RocksStateBackendBenchmark.stateBackends|thrpt|1|30|125.506942|1.245978|ops/ms|
>  |ROCKS|
> |RocksStateBackendBenchmark.stateBackends|thrpt|1|30|127.258737|1.190588|ops/ms|
>  |ROCKS_INC|
> |SerializationFrameworkMiniBenchmarks.serializerAvro|thrpt|1|30|316.497954|8.309241|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerKryo|thrpt|1|30|189.065149|6.302073|ops/ms|
>  | |
> 
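The optimization described above can be sketched with a plain ByteBuffer. This is an illustration of the technique, not Flink's SpanningRecordSerializer code: reserve the length slot in the data buffer, write the payload, then backfill the length, so no intermediate length buffer is copied:

```java
import java.nio.ByteBuffer;

public class LengthPrefixDemo {
    // Serializes a record as [4-byte length][payload] with no intermediate
    // length buffer: the length slot is reserved up front and backfilled.
    public static ByteBuffer serialize(byte[] record) {
        ByteBuffer buf = ByteBuffer.allocate(4 + record.length);
        int lengthPos = buf.position();       // remember where the length goes
        buf.position(lengthPos + 4);          // skip over the reserved length slot
        buf.put(record);                      // write the payload directly
        buf.putInt(lengthPos, record.length); // backfill the length (absolute put)
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer b = serialize(new byte[] {1, 2, 3});
        System.out.println(b.getInt()); // prints 3 (the record length)
    }
}
```

The total bytes written are the same either way; what the backfill removes is the per-record copy of the small 4-byte length buffer, which is exactly the overhead the description calls out.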

[jira] [Closed] (FLINK-11859) Improve SpanningRecordSerializer performance by serializing record length to serialization buffer directly

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-11859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-11859.


> Improve SpanningRecordSerializer performance by serializing record length to 
> serialization buffer directly
> --
>
> Key: FLINK-11859
> URL: https://issues.apache.org/jira/browse/FLINK-11859
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Yingjie Cao
>Assignee: Yingjie Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the current implementation of SpanningRecordSerializer, the length of a 
> record is serialized into an intermediate length buffer and then copied to the 
> target buffer. The length field can instead be serialized directly into the 
> data buffer (serializationBuffer), which avoids the copy of the length buffer. 
> Though the total number of bytes copied remains unchanged, this removes one 
> copy of a small piece of data (the length), which incurs high relative 
> overhead. The flink-benchmarks results show that it improves performance; the 
> test results are as follows.
> Result with the optimization:
> |Benchmark|Mode|Threads|Samples|Score|Score Error (99.9%)|Unit|Param: 
> channelsFlushTimeout|Param: stateBackend|
> |KeyByBenchmarks.arrayKeyBy|thrpt|1|30|2228.049605|77.631804|ops/ms| | |
> |KeyByBenchmarks.tupleKeyBy|thrpt|1|30|3968.361739|193.501755|ops/ms| | |
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|3030.016702|29.272713|ops/ms|
>  |MEMORY|
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|2754.77678|26.215395|ops/ms|
>  |FS|
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|3001.957606|29.288019|ops/ms|
>  |FS_ASYNC|
> |RocksStateBackendBenchmark.stateBackends|thrpt|1|30|123.698984|3.339233|ops/ms|
>  |ROCKS|
> |RocksStateBackendBenchmark.stateBackends|thrpt|1|30|126.252137|1.137735|ops/ms|
>  |ROCKS_INC|
> |SerializationFrameworkMiniBenchmarks.serializerAvro|thrpt|1|30|323.658098|5.855697|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerKryo|thrpt|1|30|183.34423|3.710787|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerPojo|thrpt|1|30|404.380233|5.131744|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerRow|thrpt|1|30|527.193369|10.176726|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerTuple|thrpt|1|30|550.073024|11.724412|ops/ms|
>  | |
> |StreamNetworkBroadcastThroughputBenchmarkExecutor.networkBroadcastThroughput|thrpt|1|30|564.690627|13.766809|ops/ms|
>  | |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|49918.11806|2324.234776|ops/ms|100,100ms|
>  |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|10443.63491|315.835962|ops/ms|100,100ms,SSL|
>  |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|21387.47608|2779.832704|ops/ms|1000,1ms|
>  |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|26585.85453|860.243347|ops/ms|1000,100ms|
>  |
> |StreamNetworkThroughputBenchmarkExecutor.networkThroughput|thrpt|1|30|8252.563405|947.129028|ops/ms|1000,100ms,SSL|
>  |
> |SumLongsBenchmark.benchmarkCount|thrpt|1|30|8806.021402|263.995836|ops/ms| | 
> |
> |WindowBenchmarks.globalWindow|thrpt|1|30|4573.620126|112.099391|ops/ms| | |
> |WindowBenchmarks.sessionWindow|thrpt|1|30|585.246412|7.026569|ops/ms| | |
> |WindowBenchmarks.slidingWindow|thrpt|1|30|449.302134|4.123669|ops/ms| | |
> |WindowBenchmarks.tumblingWindow|thrpt|1|30|2979.806858|33.818909|ops/ms| | |
> |StreamNetworkLatencyBenchmarkExecutor.networkLatency1to1|avgt|1|30|12.842865|0.13796|ms/op|
>  | |
> Result without the optimization:
>  
> |Benchmark|Mode|Threads|Samples|Score|Score Error (99.9%)|Unit|Param: 
> channelsFlushTimeout|Param: stateBackend|
> |KeyByBenchmarks.arrayKeyBy|thrpt|1|30|2060.241715|59.898485|ops/ms| | |
> |KeyByBenchmarks.tupleKeyBy|thrpt|1|30|3645.306819|223.821719|ops/ms| | |
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|2992.698822|36.978115|ops/ms|
>  |MEMORY|
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|2756.10949|27.798937|ops/ms|
>  |FS|
> |MemoryStateBackendBenchmark.stateBackends|thrpt|1|30|2965.969876|44.159793|ops/ms|
>  |FS_ASYNC|
> |RocksStateBackendBenchmark.stateBackends|thrpt|1|30|125.506942|1.245978|ops/ms|
>  |ROCKS|
> |RocksStateBackendBenchmark.stateBackends|thrpt|1|30|127.258737|1.190588|ops/ms|
>  |ROCKS_INC|
> |SerializationFrameworkMiniBenchmarks.serializerAvro|thrpt|1|30|316.497954|8.309241|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerKryo|thrpt|1|30|189.065149|6.302073|ops/ms|
>  | |
> |SerializationFrameworkMiniBenchmarks.serializerPojo|thrpt|1|30|391.51305|7.750728|ops/ms|
>  

[jira] [Closed] (FLINK-13896) Scala 2.11 maven compile should target Java 1.8

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13896.


> Scala 2.11 maven compile should target Java 1.8
> ---
>
> Key: FLINK-13896
> URL: https://issues.apache.org/jira/browse/FLINK-13896
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.9.0
>Reporter: Terry Wang
>Assignee: Terry Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When setting up a TableEnvironment in Scala as follows:
>  
> {code:java}
> // we can reproduce this problem by putting the following code in 
> // org.apache.flink.table.api.scala.internal.StreamTableEnvironmentImplTest
> @Test
> def testCreateEnvironment(): Unit = {
>  val settings = 
> EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
>  val tEnv = TableEnvironment.create(settings);
> }
> {code}
> Then mvn test would fail with an error message like:
>  
> error: Static methods in interface require -target:JVM-1.8
>  
> We can fix this bug by adding the following to the scala-maven-plugin 
> configuration:
> {code:xml}
> <args>
>   <arg>-target:jvm-1.8</arg>
> </args>
> {code}
>  
>  
>  





[jira] [Resolved] (FLINK-13896) Scala 2.11 maven compile should target Java 1.8

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13896.
--
Fix Version/s: 1.10.0
   Resolution: Fixed

Fixed via 289e147c489c3d0c28d5ea55c95d2d08b2d781b0

> Scala 2.11 maven compile should target Java 1.8
> ---
>
> Key: FLINK-13896
> URL: https://issues.apache.org/jira/browse/FLINK-13896
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.9.0
>Reporter: Terry Wang
>Assignee: Terry Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When setting up a TableEnvironment in Scala as follows:
>  
> {code:java}
> // we can reproduce this problem by putting the following code in 
> // org.apache.flink.table.api.scala.internal.StreamTableEnvironmentImplTest
> @Test
> def testCreateEnvironment(): Unit = {
>  val settings = 
> EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
>  val tEnv = TableEnvironment.create(settings);
> }
> {code}
> Then mvn test would fail with an error message like:
>  
> error: Static methods in interface require -target:JVM-1.8
>  
> We can fix this bug by adding the following to the scala-maven-plugin 
> configuration:
> {code:xml}
> <args>
>   <arg>-target:jvm-1.8</arg>
> </args>
> {code}
>  
>  
>  





[jira] [Reopened] (FLINK-13796) Remove unused variable

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen reopened FLINK-13796:
--
  Assignee: Fokko Driesprong

> Remove unused variable
> --
>
> Key: FLINK-13796
> URL: https://issues.apache.org/jira/browse/FLINK-13796
> Project: Flink
>  Issue Type: Task
>  Components: Deployment / YARN
>Affects Versions: 1.8.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (FLINK-13796) Remove unused variable

2019-09-19 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933243#comment-16933243
 ] 

Stephan Ewen commented on FLINK-13796:
--

I think it is okay to merge this. Removing dead code makes sense.

Whether this needs an issue or is a hotfix is debatable, but now that the issue 
already exists, I am fine with keeping it.

> Remove unused variable
> --
>
> Key: FLINK-13796
> URL: https://issues.apache.org/jira/browse/FLINK-13796
> Project: Flink
>  Issue Type: Task
>  Components: Deployment / YARN
>Affects Versions: 1.8.1
>Reporter: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (FLINK-13449) Add ARM architecture to MemoryArchitecture

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen reassigned FLINK-13449:


Assignee: wangxiyuan

> Add ARM architecture to MemoryArchitecture
> --
>
> Key: FLINK-13449
> URL: https://issues.apache.org/jira/browse/FLINK-13449
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Stephan Ewen
>Assignee: wangxiyuan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, {{MemoryArchitecture}} recognizes only various versions of x86 and 
> amd64 / ia64.
> We should add aarch64 for ARM to the known architectures.





[jira] [Commented] (FLINK-13652) Setup instructions for creating an ARM environment

2019-09-19 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933225#comment-16933225
 ] 

Stephan Ewen commented on FLINK-13652:
--

What do you think about writing a guide on how to do this and adding it to 
the documentation under "Flink Development"?

> Setup instructions for creating an ARM environment
> --
>
> Key: FLINK-13652
> URL: https://issues.apache.org/jira/browse/FLINK-13652
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Chesnay Schepler
>Priority: Major
>
> We should provide developers with instructions for setting up an ARM 
> environment, so that they can test their changes locally without having to 
> rely on CI services.





[jira] [Commented] (FLINK-13598) frocksdb doesn't have arm release

2019-09-19 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933222#comment-16933222
 ] 

Stephan Ewen commented on FLINK-13598:
--

I think we are trying to move back to more vanilla RocksDB and keep the 
Flink-specific native methods in a separate library.

[~azagrebin] or [~carp84] probably know more about the status of that effort.
But that would mean we should have an ARM build of the Flink RocksDB 
plugin (probably not too hard, because it is not much code).

> frocksdb doesn't have arm release 
> --
>
> Key: FLINK-13598
> URL: https://issues.apache.org/jira/browse/FLINK-13598
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 1.9.0, 2.0.0
>Reporter: wangxiyuan
>Priority: Major
>
> Flink currently uses frocksdb, a fork of rocksdb, for the 
> *flink-statebackend-rocksdb* module. It does not provide an ARM release.
> rocksdb itself supports ARM starting from 
> [v6.2.2|https://search.maven.org/artifact/org.rocksdb/rocksdbjni/6.2.2/jar].
> Can frocksdb release an ARM package as well?
> Alternatively, AFAIK Flink avoided using rocksdb directly because of some past 
> bugs. Have those bugs been fixed in rocksdb already? Can Flink use upstream 
> rocksdb again now?





[jira] [Assigned] (FLINK-13450) Adjust tests to tolerate arithmetic differences between x86 and ARM

2019-09-19 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen reassigned FLINK-13450:


Assignee: wangxiyuan

> Adjust tests to tolerate arithmetic differences between x86 and ARM
> ---
>
> Key: FLINK-13450
> URL: https://issues.apache.org/jira/browse/FLINK-13450
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.9.0
>Reporter: Stephan Ewen
>Assignee: wangxiyuan
>Priority: Major
>
> Certain arithmetic operations have different precision/rounding on ARM versus 
> x86.
> Tests using floating point numbers should be changed to tolerate a certain 
> minimal deviation.
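A minimal illustration of the suggested test adjustment — comparing floating-point results with an explicit tolerance instead of exact equality, so minor rounding differences between architectures do not fail a test. The epsilon value here is an arbitrary assumption; real tests would choose a tolerance case by case:

```java
public class ToleranceDemo {
    // Illustrative tolerance; per-test values would differ in practice.
    static final double EPSILON = 1e-9;

    // Tolerant comparison instead of expected == actual.
    static boolean closeEnough(double expected, double actual) {
        return Math.abs(expected - actual) <= EPSILON;
    }

    public static void main(String[] args) {
        double result = 0.1 + 0.2; // not exactly 0.3 in IEEE 754
        // Exact comparison would fail; tolerant comparison passes.
        System.out.println(closeEnough(0.3, result)); // prints true
    }
}
```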





[jira] [Commented] (FLINK-14118) Reduce the unnecessary flushing when there is no data available for flush

2019-09-19 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933200#comment-16933200
 ] 

Stephan Ewen commented on FLINK-14118:
--

Great diagnosis and great fix!

Do we have any data as to whether this fix causes any overhead in other cases, 
or is this always strictly better?

> Reduce the unnecessary flushing when there is no data available for flush
> -
>
> Key: FLINK-14118
> URL: https://issues.apache.org/jira/browse/FLINK-14118
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Yingjie Cao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new flush implementation, which works by triggering a netty user event, may 
> cause a performance regression compared to the old synchronization-based one. 
> More specifically, when there is exactly one BufferConsumer in a subpartition's 
> buffer queue and no new data will be added for a while (perhaps because there 
> is simply no input, or because the operator collects data for processing and 
> does not emit records immediately), there is no data to send, yet the 
> OutputFlusher continuously signals data availability and wakes up the netty 
> thread, even though the pollBuffer method returns no data.
> For some of our production jobs, this incurs 20% to 40% CPU overhead compared 
> to the old implementation. We fixed the problem by checking whether new data is 
> available when flushing; if there is none, the netty thread is not notified. 
> This works for our jobs, and CPU usage falls back to the previous level.
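The fix described in the quoted issue can be sketched as follows. `Subpartition`, `flush()`, and `notifyDataAvailable()` are simplified stand-ins for the real Flink classes; the point is the guard that skips the wake-up when nothing new is queued:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class FlushGuardDemo {
    static class Subpartition {
        private final Queue<byte[]> buffers = new ArrayDeque<>();
        private int flushedCount = 0; // how much data the consumer was already told about
        int notifications = 0;        // counts wake-ups of the (simulated) netty thread

        void add(byte[] buffer) { buffers.offer(buffer); }

        // Only wake the consumer if there is data it has not yet been notified of.
        void flush() {
            if (buffers.size() > flushedCount) {
                flushedCount = buffers.size();
                notifyDataAvailable();
            }
        }

        void notifyDataAvailable() { notifications++; }
    }

    public static void main(String[] args) {
        Subpartition p = new Subpartition();
        p.add(new byte[8]);
        p.flush(); // new data -> one notification
        p.flush(); // no new data -> no wake-up
        p.flush(); // still no new data -> no wake-up
        System.out.println(p.notifications); // prints 1
    }
}
```

Without the guard, every periodic flush would wake the consumer thread even when pollBuffer would return nothing, which is the CPU overhead the reporter measured.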





[jira] [Commented] (FLINK-14123) Change taskmanager.memory.fraction default value to 0.6

2019-09-19 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933186#comment-16933186
 ] 

Stephan Ewen commented on FLINK-14123:
--

I get the reasoning behind this; it probably makes sense.

We only have to be careful because changing this value now will cause 
regressions for some users in some cases. So we need to capture this in the 
release notes.

Did you verify that it avoids the GC overhead error with the changed value?
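For context, a rough sketch of what the fraction controls, assuming (as in older Flink versions) a previous default of 0.7 and that the fraction is the share of free heap given to Flink's managed memory. The numbers are purely illustrative:

```java
public class MemoryFractionDemo {
    // Managed memory = free heap * taskmanager.memory.fraction;
    // the remainder stays available for user code.
    static long managedMemory(long freeHeapBytes, double fraction) {
        return (long) (freeHeapBytes * fraction);
    }

    public static void main(String[] args) {
        long heap = 4L << 30; // 4 GiB of free heap, illustrative
        System.out.println(managedMemory(heap, 0.7)); // assumed current default
        System.out.println(managedMemory(heap, 0.6)); // proposed default: more heap left for user code
    }
}
```

Lowering the fraction leaves more heap for user code, which is what the reporter hopes will avoid the "GC overhead limit exceeded" failures.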

> Change taskmanager.memory.fraction default value to 0.6
> ---
>
> Key: FLINK-14123
> URL: https://issues.apache.org/jira/browse/FLINK-14123
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.9.0
>Reporter: liupengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we are testing Flink batch tasks such as TeraSort; however, one ran 
> only a short while before failing due to OOM. 
>  
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: a807e1d635bd4471ceea4282477f8850)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:539)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort$.main(FlinkTeraSort.scala:89)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort.main(FlinkTeraSort.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:604)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:466)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1007)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1080)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1080)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
>   ... 23 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: GC 
> overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
>   at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:82)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
> due to an exception: GC overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:831)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at 
> 

[jira] [Commented] (FLINK-14112) Removing zookeeper state should cause the task manager and job managers to restart

2019-09-18 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932720#comment-16932720
 ] 

Stephan Ewen commented on FLINK-14112:
--

From the RM / JM side (the leader contender), can we treat the disappearance 
of a ZNode as "loss of leadership"? A leader contender should recreate the 
node when applying for leader status.

On the TM side, I am not sure what the issue is. Is it just some exception 
handling / null handling missing?

Side note: I remember in the past we also had some complexity from the fact 
that the "leader lock" is one Znode, but "leader address" and "leader fencing 
token" are different ZNodes. We were thinking to put the leader fencing token 
and the address as payload into the leader lock ZNode as a simplification of 
things.


> Removing zookeeper state should cause the task manager and job managers to 
> restart
> --
>
> Key: FLINK-14112
> URL: https://issues.apache.org/jira/browse/FLINK-14112
> Project: Flink
>  Issue Type: Wish
>  Components: Runtime / Coordination
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Aaron Levin
>Priority: Minor
>
> Suppose you have a flink application running on a cluster with the following 
> configuration:
> {noformat}
> high-availability.zookeeper.path.root: /flink
> {noformat}
> Now suppose you delete all the znodes within {{/flink}}. I experienced the 
> following:
>  * massive amount of logging
>  * application did not restart
>  * task manager did not crash or restart
>  * job manager did not crash or restart
> From this state I had to restart all the task managers and all the job 
> managers in order for the flink application to recover.
> It would be desirable for the Task Managers and Job Managers to crash if the 
> znode is not available (though perhaps you all have thought about this more 
> deeply than I!)





[jira] [Commented] (FLINK-13985) Use native memory for managed memory.

2019-09-17 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931307#comment-16931307
 ] 

Stephan Ewen commented on FLINK-13985:
--

I would put {{long allocate()}} and {{void release(long)}} in a util like 
{{MemoryUtils}}, because we may reuse this in other parts as well.

Instantiating a DirectByteBuffer without a cleaner reflectively is the right 
approach for wrapping.
We can either do this on creation of the memory segment (and then 
duplicate/slice in the wrap() call) or we do this lazily in the wrap() call.
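A minimal sketch of such a helper, under the assumption that {{sun.misc.Unsafe}} is obtained reflectively. The class name and the put/get accessors (added here only to demonstrate that the allocated memory is usable) are illustrative; Flink's actual {{MemoryUtils}} may look different:

```java
import java.lang.reflect.Field;

// Hypothetical MemoryUtils-style helper wrapping Unsafe native allocation.
final class NativeMemoryUtils {
    private static final sun.misc.Unsafe UNSAFE = loadUnsafe();

    private static sun.misc.Unsafe loadUnsafe() {
        try {
            // Classic reflective access to the singleton Unsafe instance.
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (sun.misc.Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new Error("sun.misc.Unsafe not available on this JVM", e);
        }
    }

    // Allocates native (off-heap) memory and returns its address.
    static long allocate(long sizeBytes) {
        return UNSAFE.allocateMemory(sizeBytes);
    }

    // Releases native memory previously obtained from allocate().
    static void release(long address) {
        UNSAFE.freeMemory(address);
    }

    // Demonstration accessors for the allocated region.
    static void putLong(long address, long value) {
        UNSAFE.putLong(address, value);
    }

    static long getLong(long address) {
        return UNSAFE.getLong(address);
    }

    private NativeMemoryUtils() {}
}
```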

> Use native memory for managed memory.
> -
>
> Key: FLINK-13985
> URL: https://issues.apache.org/jira/browse/FLINK-13985
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Xintong Song
>Assignee: Andrey Zagrebin
>Priority: Major
>
> * Allocate memory with {{Unsafe.allocateMemory}}
>  ** {{MemoryManager}}
> Implement this issue in common code paths for the legacy / new mode. This 
> should only affect the GC behavior.





[jira] [Commented] (FLINK-13862) Remove or rewrite Execution Plan docs

2019-09-12 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928682#comment-16928682
 ] 

Stephan Ewen commented on FLINK-13862:
--

Is that plan visualizer something we want to keep supporting?

The new Web UI is pretty decent at the visualization and I would guess subsumes 
it for the majority of users.

> Remove or rewrite Execution Plan docs
> -
>
> Key: FLINK-13862
> URL: https://issues.apache.org/jira/browse/FLINK-13862
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.0
>Reporter: Stephan Ewen
>Priority: Blocker
> Fix For: 1.10.0, 1.9.1
>
>
> The *Execution Plans* section is totally outdated and refers to the old 
> {{tools/planVisalizer.html}} file, which was removed two years ago.
> https://ci.apache.org/projects/flink/flink-docs-master/dev/execution_plans.html





[jira] [Commented] (FLINK-14034) In FlinkKafkaProducer, KafkaTransactionState should be made public or invoke should be made final

2019-09-12 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928515#comment-16928515
 ] 

Stephan Ewen commented on FLINK-14034:
--

I would suggest going with the variant that does not extend the sink (a 
separate function).

The sink is bound to change in the future, so that variant would probably also 
be safer against future changes in Flink.

> In FlinkKafkaProducer, KafkaTransactionState should be made public or invoke 
> should be made final
> -
>
> Key: FLINK-14034
> URL: https://issues.apache.org/jira/browse/FLINK-14034
> Project: Flink
>  Issue Type: Wish
>  Components: Connectors / Kafka
>Affects Versions: 1.9.0
>Reporter: Niels van Kaam
>Priority: Trivial
>
> It is not possible to override the invoke method of the FlinkKafkaProducer, 
> because the first parameter, KafkaTransactionState, is a private inner class. 
> It is also not possible to override the original invoke of SinkFunction, 
> because TwoPhaseCommitSinkFunction (which FlinkKafkaProducer extends) 
> overrides the original invoke method and marks it final.
> [https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaProducer.java]
> If there is a particular reason for this, it would be better to make the 
> invoke method in FlinkKafkaProducer final as well, and to document the reason 
> so that it is clear this is by design (I don't see any overrides in the same 
> package).
> Otherwise, I would make KafkaTransactionState publicly visible. I would like 
> to override the invoke method to create a custom Kafka producer that performs 
> some additional generic validations and transformations (which can also be 
> done in a process function, but a custom sink would simplify the code of 
> jobs).
>  





[jira] [Commented] (FLINK-13025) Elasticsearch 7.x support

2019-09-12 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928511#comment-16928511
 ] 

Stephan Ewen commented on FLINK-13025:
--

[~lilyevsky] would you be up to working closely together with [~yanghua] to 
give him feedback on the implementation, if [~yanghua] drives the 
implementation?

> Elasticsearch 7.x support
> -
>
> Key: FLINK-13025
> URL: https://issues.apache.org/jira/browse/FLINK-13025
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / ElasticSearch
>Affects Versions: 1.8.0
>Reporter: Keegan Standifer
>Priority: Major
>
> Elasticsearch 7.0.0 was released in April of 2019: 
> [https://www.elastic.co/blog/elasticsearch-7-0-0-released]
> The latest elasticsearch connector is 
> [flink-connector-elasticsearch6|https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-elasticsearch6]





[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5

2019-09-12 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928499#comment-16928499
 ] 

Stephan Ewen commented on FLINK-13417:
--

@Tison would it be safer to shade Flink's ZK and then let the HBase client 
bring its own version / config / everything?

I guess in a proper setup, it should work anyways due to inverted class loading 
(HBase connector uses its own dependency in the user jar, not the Flink ZK 
dependency), but not for tests.

> Bump Zookeeper to 3.5.5
> ---
>
> Key: FLINK-13417
> URL: https://issues.apache.org/jira/browse/FLINK-13417
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Konstantin Knauf
>Priority: Blocker
> Fix For: 1.10.0
>
>
> Users might want to secure their ZooKeeper connection via SSL.
> This requires a ZooKeeper version >= 3.5.1. We might as well try to bump it 
> to 3.5.5, which is the latest version. 





[jira] [Commented] (FLINK-14038) ExecutionGraph deploy failed due to akka timeout

2019-09-11 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927487#comment-16927487
 ] 

Stephan Ewen commented on FLINK-14038:
--

What happened here is that the update to a partition failed, meaning that a 
batch shuffle has some results ready, but not all. (I assume you are running a 
DataSet job with {{InputDependencyConstraint.ANY}} or the default config).

Trying to see whether this is a GC problem makes sense. It looks like the "ACK" 
is sent immediately after receiving the "update input channel" message, so it 
should not be affected by the actual asynchronous update.


> ExecutionGraph deploy failed due to akka timeout
> 
>
> Key: FLINK-14038
> URL: https://issues.apache.org/jira/browse/FLINK-14038
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.9.0
> Environment: Flink on yarn
> Flink 1.9.0
>Reporter: liupengcheng
>Priority: Major
>
> When launching the Flink application, the following error was reported. I 
> downloaded the operator logs, but they provided no useful information, and 
> the operator was cancelled directly, so I still have no clue.
> JobManager logs:
> {code:java}
> java.lang.IllegalStateException: Update task on TaskManager 
> container_e860_1567429198842_571077_01_06 @ zjy-hadoop-prc-st320.bj 
> (dataPort=50990) failed due to:
>   at 
> org.apache.flink.runtime.executiongraph.Execution.lambda$sendUpdatePartitionInfoRpcCall$14(Execution.java:1395)
>   at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>   at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>   at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>   at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>   at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.concurrent.CompletionException: 
> akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]]
>  after [1 ms]. Message of type 
> [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason 
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
>   at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>   at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>   at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
>   at akka.dispatch.OnComplete.internal(Future.scala:263)
>   at akka.dispatch.OnComplete.internal(Future.scala:261)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>  

[jira] [Commented] (FLINK-13445) Distinguishing Memory Configuration for TaskManager and JobManager

2019-09-11 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927365#comment-16927365
 ] 

Stephan Ewen commented on FLINK-13445:
--

Agreed, we should do this for the master process the same way as for the 
TaskExecutor in FLIP-49. Otherwise it would be quite unintuitive for users.

> Distinguishing Memory Configuration for TaskManager and JobManager
> --
>
> Key: FLINK-13445
> URL: https://issues.apache.org/jira/browse/FLINK-13445
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Configuration
>Affects Versions: 1.8.1
>Reporter: madong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We use Flink to run some jobs built in non-Java languages, so we increased 
> the value of `containerized.heap-cutoff-ratio` to reserve more memory for the 
> non-Java processes, which also affects the memory allocation for the 
> JobManager. Considering the different behaviors of the TaskManager and 
> JobManager, should this be configurable separately?





[jira] [Commented] (FLINK-13985) Use native memory for managed memory.

2019-09-11 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927357#comment-16927357
 ] 

Stephan Ewen commented on FLINK-13985:
--

Can you share some details about the exact mechanics?

In particular I would suggest to pay attention to the following aspects:
  - Add {{long allocate()}} and {{void release(long)}} to a core util, like 
{{MemoryUtils}} to abstract over Unsafe, so that it is easier to deal with the 
Unsafe removal later
  - Release the memory in the {{free()}} method of the memory segments.
  - Add a safety net that frees the memory segments upon garbage collection. 
Have a look at how the DirectByteBuffer does it with the cleaners and reference 
queues.
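The safety-net idea can be sketched with {{java.lang.ref.Cleaner}} (JDK 9+), which packages the cleaner/reference-queue mechanism that DirectByteBuffer uses internally. The class below is purely illustrative, not Flink's actual memory segment:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative off-heap segment: explicit free() releases the native memory,
// and a Cleaner acts as a GC-time safety net if free() was never called.
final class OffHeapSegment {
    private static final Cleaner CLEANER = Cleaner.create();

    private final AtomicBoolean released = new AtomicBoolean(false);
    private final Cleaner.Cleanable cleanable;

    OffHeapSegment(long address) {
        // The cleanup action must not capture 'this', otherwise the segment
        // could never become unreachable; capture only what it needs.
        final AtomicBoolean releasedRef = released;
        this.cleanable = CLEANER.register(this, () -> {
            if (releasedRef.compareAndSet(false, true)) {
                // A real implementation would call Unsafe.freeMemory(address) here.
            }
        });
    }

    // Explicit release; Cleanable.clean() runs the action at most once,
    // so the GC safety net becomes a no-op afterwards.
    void free() {
        cleanable.clean();
    }

    boolean isReleased() {
        return released.get();
    }
}
```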

> Use native memory for managed memory.
> -
>
> Key: FLINK-13985
> URL: https://issues.apache.org/jira/browse/FLINK-13985
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Xintong Song
>Assignee: Andrey Zagrebin
>Priority: Major
>
> * Allocate memory with {{Unsafe.allocateMemory}}
>  ** {{MemoryManager}}
> Implement this issue in common code paths for the legacy / new mode. This 
> should only affect the GC behavior.





[jira] [Commented] (FLINK-13996) Maven instructions for 3.3+ do not cover all shading special cases

2019-09-11 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927352#comment-16927352
 ] 

Stephan Ewen commented on FLINK-13996:
--

We want to remove the shading logic from the file systems in 1.10 anyway, but 
this would still be relevant for the 1.8 and 1.9 docs.

> Maven instructions for 3.3+ do not cover all shading special cases
> --
>
> Key: FLINK-13996
> URL: https://issues.apache.org/jira/browse/FLINK-13996
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System, Documentation
>Affects Versions: 1.8.0
>Reporter: Chesnay Schepler
>Priority: Major
>
> When building Flink with Maven 3.3+, extra care must be taken to ensure that 
> the shading works as expected. Since 3.3 the dependency graph is immutable; 
> as a result, downstream modules (like flink-dist) see the unaltered set of 
> dependencies of bundled modules, regardless of whether those were bundled or 
> not. Consequently, dependencies may be bundled multiple times (original and 
> relocated versions).
> The [instructions for building Flink with Maven 
> 3.3+|https://ci.apache.org/projects/flink/flink-docs-master/flinkDev/building.html#dependency-shading]
>  correctly point out that flink-dist must be built separately; however, (at 
> the very least) all filesystems relying on {{flink-fs-hadoop-shaded}} are 
> also affected.





[jira] [Commented] (FLINK-13025) Elasticsearch 7.x support

2019-09-11 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927347#comment-16927347
 ] 

Stephan Ewen commented on FLINK-13025:
--

Regarding a committer, we need to look for someone else to mentor this. IIRC, 
[~pnowojski] is not available for a few weeks.

From the contributor side, this would best be implemented by someone with a 
concrete Elasticsearch 7 installation and a Flink use case to try this out 
with.
[~yanghua] do you have such an Elasticsearch 7 deployment and Flink use case?


> Elasticsearch 7.x support
> -
>
> Key: FLINK-13025
> URL: https://issues.apache.org/jira/browse/FLINK-13025
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / ElasticSearch
>Affects Versions: 1.8.0
>Reporter: Keegan Standifer
>Priority: Major
>
> Elasticsearch 7.0.0 was released in April of 2019: 
> [https://www.elastic.co/blog/elasticsearch-7-0-0-released]
> The latest elasticsearch connector is 
> [flink-connector-elasticsearch6|https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-elasticsearch6]





[jira] [Updated] (FLINK-14049) Update error message for failed partition updates to include task name

2019-09-11 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen updated FLINK-14049:
-
Fix Version/s: 1.8.3

> Update error message for failed partition updates to include task name
> --
>
> Key: FLINK-14049
> URL: https://issues.apache.org/jira/browse/FLINK-14049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Critical
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>
> The error message for failed partition updates does not include the task name.
> That makes it useless during debugging.
> Adding the task name is a simple addition that makes this error message much 
> more helpful.





[jira] [Updated] (FLINK-14049) Update error message for failed partition updates to include task name

2019-09-11 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen updated FLINK-14049:
-
Fix Version/s: 1.10.0

> Update error message for failed partition updates to include task name
> --
>
> Key: FLINK-14049
> URL: https://issues.apache.org/jira/browse/FLINK-14049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Critical
> Fix For: 1.10.0, 1.9.1
>
>
> The error message for failed partition updates does not include the task name.
> That makes it useless during debugging.
> Adding the task name is a simple addition that makes this error message much 
> more helpful.





[jira] [Created] (FLINK-14049) Update error message for failed partition updates to include task name

2019-09-11 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-14049:


 Summary: Update error message for failed partition updates to 
include task name
 Key: FLINK-14049
 URL: https://issues.apache.org/jira/browse/FLINK-14049
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 1.9.1


The error message for failed partition updates does not include the task name.
That makes it useless during debugging.

Adding the task name is a simple addition that makes this error message much 
more helpful.





[jira] [Commented] (FLINK-14034) In FlinkKafkaProducer, KafkaTransactionState should be made public or invoke should be made final

2019-09-11 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927324#comment-16927324
 ] 

Stephan Ewen commented on FLINK-14034:
--

I agree; to make things clear in the current design, {{invoke()}} should be 
final. Exposing the transaction state might be a bit fragile in its current 
state.

Can you share a bit more about the use case? What do you plan to do in the sink 
that would not work in a {{MapFunction}}?



> In FlinkKafkaProducer, KafkaTransactionState should be made public or invoke 
> should be made final
> -
>
> Key: FLINK-14034
> URL: https://issues.apache.org/jira/browse/FLINK-14034
> Project: Flink
>  Issue Type: Wish
>  Components: Connectors / Kafka
>Affects Versions: 1.9.0
>Reporter: Niels van Kaam
>Priority: Trivial
>
> It is not possible to override the invoke method of the FlinkKafkaProducer, 
> because the first parameter, KafkaTransactionState, is a private inner class. 
> It is also not possible to override the original invoke of SinkFunction, 
> because TwoPhaseCommitSinkFunction (which FlinkKafkaProducer extends) 
> overrides the original invoke method and marks it final.
> [https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaProducer.java]
> If there is a particular reason for this, it would be better to make the 
> invoke method in FlinkKafkaProducer final as well, and to document the reason 
> so that it is clear this is by design (I don't see any overrides in the same 
> package).
> Otherwise, I would make KafkaTransactionState publicly visible. I would like 
> to override the invoke method to create a custom Kafka producer that performs 
> some additional generic validations and transformations (which can also be 
> done in a process function, but a custom sink would simplify the code of 
> jobs).
>  





[jira] [Commented] (FLINK-14035) Introduce/Change some log for snapshot to better analysis checkpoint problem

2019-09-11 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927322#comment-16927322
 ] 

Stephan Ewen commented on FLINK-14035:
--

-1

I think we originally had these log statements on INFO level and changed them 
to DEBUG because it was flooding the log.

Can you simply configure the log level for the component to DEBUG? Why does 
this need a change in code?
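For example, checkpoint-related debug logging can be enabled without any code change via the logging configuration. The package names below are the usual Flink ones, but verify them against your version:

```properties
# log4j.properties: raise verbosity only for checkpoint-related classes,
# keeping the rest of the log at INFO.
log4j.logger.org.apache.flink.runtime.checkpoint=DEBUG
log4j.logger.org.apache.flink.streaming.runtime.tasks=DEBUG
```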

> Introduce/Change some log for snapshot to better analysis checkpoint problem
> 
>
> Key: FLINK-14035
> URL: https://issues.apache.org/jira/browse/FLINK-14035
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Currently, the checkpoint information is mostly logged at debug level 
> (especially on the TM side). If we want to track the checkpoint steps and the 
> time consumed by each step when a checkpoint fails or takes too long, we need 
> to restart the job with debug logging enabled. This issue wants to improve 
> that situation by changing some existing debug logs to info level and adding 
> some more debug logs. We have made this log-level change in our production at 
> Alibaba, and it has caused no problems so far.
>  
> Details
> Change the logs below from debug level to info:
>  * log about {{Starting checkpoint xxx}} on the TM side
>  * log about sync complete on the TM side
>  * log about async complete on the TM side
> Add debug logs:
>  * log about receiving the barrier for exactly-once mode - align with 
> at-least-once mode
>  
> If this issue is valid, then I'm happy to contribute it.





[jira] [Commented] (FLINK-13025) Elasticsearch 7.x support

2019-09-09 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925864#comment-16925864
 ] 

Stephan Ewen commented on FLINK-13025:
--

Would the other contributors in this discussion agree with that outcome?

Before embarking on that new implementation, I would also like to see if we can 
find a committer that would shepherd this and eventually review and merge the 
code.

> Elasticsearch 7.x support
> -
>
> Key: FLINK-13025
> URL: https://issues.apache.org/jira/browse/FLINK-13025
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / ElasticSearch
>Affects Versions: 1.8.0
>Reporter: Keegan Standifer
>Priority: Major
>
> Elasticsearch 7.0.0 was released in April of 2019: 
> [https://www.elastic.co/blog/elasticsearch-7-0-0-released]
> The latest elasticsearch connector is 
> [flink-connector-elasticsearch6|https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-elasticsearch6]





[jira] [Commented] (FLINK-13445) Distinguishing Memory Configuration for TaskManager and JobManager

2019-09-09 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925783#comment-16925783
 ] 

Stephan Ewen commented on FLINK-13445:
--

Hi David! Flink 1.10 will not have the {{heap-cutoff}} parameters any more.
You can check this design proposal for the details:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors

> Distinguishing Memory Configuration for TaskManager and JobManager
> --
>
> Key: FLINK-13445
> URL: https://issues.apache.org/jira/browse/FLINK-13445
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Configuration
>Affects Versions: 1.8.1
>Reporter: madong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We use Flink to run some jobs built in non-Java languages, so we increased 
> the value of `containerized.heap-cutoff-ratio` to reserve more memory for the 
> non-Java processes, which also affects the memory allocation for the 
> JobManager. Considering the different behaviors of the TaskManager and 
> JobManager, should this be configurable separately?





[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5

2019-09-09 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925583#comment-16925583
 ] 

Stephan Ewen commented on FLINK-13417:
--

@Tison is that a mismatch between ZK versions now? That the HBase tests assume 
a specific version?

Do we need to shade the ZK client either in Flink or in the HBase tests?

> Bump Zookeeper to 3.5.5
> ---
>
> Key: FLINK-13417
> URL: https://issues.apache.org/jira/browse/FLINK-13417
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Konstantin Knauf
>Priority: Blocker
> Fix For: 1.10.0
>
>
> User might want to secure their Zookeeper connection via SSL.
> This requires a Zookeeper version >= 3.5.1. We might as well try to bump it 
> to 3.5.5, which is the latest version. 





[jira] [Commented] (FLINK-13025) Elasticsearch 7.x support

2019-09-09 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925580#comment-16925580
 ] 

Stephan Ewen commented on FLINK-13025:
--

Having an ElasticSearch 7 connector makes sense.
I am also in favor of not trying to hack compatibility through some changes in 
the ES 6.x connector, as that has problematic maintenance implications (as we 
experienced with Kafka).

So +1 from my side to create a new connector.

> Elasticsearch 7.x support
> -
>
> Key: FLINK-13025
> URL: https://issues.apache.org/jira/browse/FLINK-13025
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / ElasticSearch
>Affects Versions: 1.8.0
>Reporter: Keegan Standifer
>Priority: Major
>
> Elasticsearch 7.0.0 was released in April of 2019: 
> [https://www.elastic.co/blog/elasticsearch-7-0-0-released]
> The latest elasticsearch connector is 
> [flink-connector-elasticsearch6|https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-elasticsearch6]





[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2019-09-09 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925516#comment-16925516
 ] 

Stephan Ewen commented on FLINK-13958:
--

We had a similar problem initially with RocksDB.

The strange thing is that the JVM can load a library only once under a given 
file name, but multiple times under different file names. So for RocksDB, we 
rename the library file to something random, and then it can be linked multiple 
times.
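A minimal sketch of that rename trick (illustrative only: the file names and the stand-in library file are hypothetical, and RocksDB's actual logic lives in its own NativeLibraryLoader):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

public class NativeLibraryRelinker {

    /**
     * Copies the native library to a uniquely named temp file. The JVM refuses
     * to load the same library file twice, but happily links a copy that lives
     * under a different file name, so each class loader can load its own copy.
     */
    static Path copyWithUniqueName(Path library, Path tempDir) throws IOException {
        Path target = tempDir.resolve("librocksdb-" + UUID.randomUUID() + ".so");
        Files.copy(library, target, StandardCopyOption.REPLACE_EXISTING);
        return target; // the caller would then call System.load(target.toString())
    }

    public static void main(String[] args) throws IOException {
        Path tempDir = Files.createTempDirectory("native-libs");
        Path fakeLib = Files.createTempFile("fake", ".so"); // stand-in for a real .so
        Path first = copyWithUniqueName(fakeLib, tempDir);
        Path second = copyWithUniqueName(fakeLib, tempDir);
        // Two distinct file names, so System.load could link each copy independently.
        System.out.println(first.equals(second) ? "same name" : "distinct names");
    }
}
```

Each call produces a fresh file name, which is what sidesteps the once-per-file-name restriction.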

> Job class loader may not be reused after batch job recovery
> ---
>
> Key: FLINK-13958
> URL: https://issues.apache.org/jira/browse/FLINK-13958
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.9.0
>Reporter: David Moravek
>Priority: Major
>
> [https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E]
> 1) We have a per-job flink cluster
> 2) We use BATCH execution mode + region failover strategy
> Point 1) should imply single user code class loader per task manager (because 
> there is only single pipeline, that reuses class loader cached in 
> BlobLibraryCacheManager). We need this property, because we have UDFs that 
> access C libraries using JNI (I think this may be fairly common use-case when 
> dealing with legacy code). [JDK 
> internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466]
>  make sure that single library can be only loaded by a single class loader 
> per JVM.
> When region recovery is triggered, vertices that need recovery are first reset 
> back to the CREATED state and then rescheduled. In case all tasks in a task 
> manager are reset, this results in [cached class loader being 
> released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338].
>  This unfortunately causes job failure, because we try to reload a native 
> library in a newly created class loader.
> I believe the correct approach would be not to release cached class loader if 
> the job is recovering, even though there are no tasks currently registered 
> with TM.





[jira] [Created] (FLINK-13862) Remove or rewrite Execution Plan docs

2019-08-26 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-13862:


 Summary: Remove or rewrite Execution Plan docs
 Key: FLINK-13862
 URL: https://issues.apache.org/jira/browse/FLINK-13862
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.9.0
Reporter: Stephan Ewen
 Fix For: 1.10.0, 1.9.1


The *Execution Plans* section is totally outdated and refers to the old 
{{tools/planVisualizer.html}} file that was removed two years ago.

https://ci.apache.org/projects/flink/flink-docs-master/dev/execution_plans.html





[jira] [Resolved] (FLINK-13791) Speed up sidenav by using group_by

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13791.
--
Fix Version/s: 1.10.0
   Resolution: Fixed

Fixed via c64e167b8003b7379545c1b83e54d9491164b7a8

> Speed up sidenav by using group_by
> --
>
> Key: FLINK-13791
> URL: https://issues.apache.org/jira/browse/FLINK-13791
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{_includes/sidenav.html}} parses through {{pages_by_language}} over and over 
> again trying to find children when building the (recursive) side navigation. 
> We could do this once with a {{group_by}} instead.
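In Jekyll this is Liquid's {{group_by}} filter; the same precompute-once pattern can be sketched in Java with {{Collectors.groupingBy}} (the page data below is purely illustrative, not the real sidenav code):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SidenavGrouping {
    public static void main(String[] args) {
        // Each page is {parent, title} -- illustrative data only.
        List<String[]> pages = List.of(
                new String[] {"dev", "DataStream API"},
                new String[] {"dev", "Table API"},
                new String[] {"ops", "Deployment"});

        // Build the parent -> children index once, instead of re-scanning the
        // full page list for children at every level of the navigation tree.
        Map<String, List<String[]>> byParent = pages.stream()
                .collect(Collectors.groupingBy(page -> page[0]));

        System.out.println(byParent.get("dev").size()); // number of "dev" children
    }
}
```

The win is the same in both cases: one pass to build the index, constant-time lookups afterwards.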





[jira] [Closed] (FLINK-13791) Speed up sidenav by using group_by

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13791.


> Speed up sidenav by using group_by
> --
>
> Key: FLINK-13791
> URL: https://issues.apache.org/jira/browse/FLINK-13791
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{_includes/sidenav.html}} parses through {{pages_by_language}} over and over 
> again trying to find children when building the (recursive) side navigation. 
> We could do this once with a {{group_by}} instead.





[jira] [Closed] (FLINK-13726) Build docs with jekyll 4.0.0.pre.beta1

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13726.


> Build docs with jekyll 4.0.0.pre.beta1
> --
>
> Key: FLINK-13726
> URL: https://issues.apache.org/jira/browse/FLINK-13726
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jekyll 4 is way faster in generating the docs than jekyll 3 - probably due to 
> the newly introduced cache. Site generation time goes down by roughly a 
> factor of 2.5 even with the current beta version!





[jira] [Resolved] (FLINK-13726) Build docs with jekyll 4.0.0.pre.beta1

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13726.
--
Fix Version/s: 1.10.0
   Resolution: Fixed

Fixed via ac1b8dbf15c405d0646671a138a53c9953153165

> Build docs with jekyll 4.0.0.pre.beta1
> --
>
> Key: FLINK-13726
> URL: https://issues.apache.org/jira/browse/FLINK-13726
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jekyll 4 is way faster in generating the docs than jekyll 3 - probably due to 
> the newly introduced cache. Site generation time goes down by roughly a 
> factor of 2.5 even with the current beta version!





[jira] [Resolved] (FLINK-13725) Use sassc for faster doc generation

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13725.
--
Fix Version/s: 1.10.0
   Resolution: Fixed

Fixed via 065de4b573a05b0c3436ff2d3af3e0c16589a1a7

> Use sassc for faster doc generation
> ---
>
> Key: FLINK-13725
> URL: https://issues.apache.org/jira/browse/FLINK-13725
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jekyll requires {{sass}} but can optionally also use a C-based implementation 
> provided by {{sassc}}. Although we do not use sass directly, there may be 
> some indirect use inside jekyll. It doesn't seem to hurt to upgrade here.





[jira] [Resolved] (FLINK-13729) Update website generation dependencies

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13729.
--
Fix Version/s: 1.10.0
   Resolution: Fixed

Fixed via ef74a61f54f190926a8388f46db7919e0e94420b

> Update website generation dependencies
> --
>
> Key: FLINK-13729
> URL: https://issues.apache.org/jira/browse/FLINK-13729
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The website generation dependencies are quite old. By upgrading some of them 
> we get improvements like a much nicer code highlighting and prepare for the 
> jekyll update of FLINK-13726 and FLINK-13727.





[jira] [Closed] (FLINK-13729) Update website generation dependencies

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13729.


> Update website generation dependencies
> --
>
> Key: FLINK-13729
> URL: https://issues.apache.org/jira/browse/FLINK-13729
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The website generation dependencies are quite old. By upgrading some of them 
> we get improvements like a much nicer code highlighting and prepare for the 
> jekyll update of FLINK-13726 and FLINK-13727.





[jira] [Closed] (FLINK-13725) Use sassc for faster doc generation

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13725.


> Use sassc for faster doc generation
> ---
>
> Key: FLINK-13725
> URL: https://issues.apache.org/jira/browse/FLINK-13725
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jekyll requires {{sass}} but can optionally also use a C-based implementation 
> provided by {{sassc}}. Although we do not use sass directly, there may be 
> some indirect use inside jekyll. It doesn't seem to hurt to upgrade here.





[jira] [Closed] (FLINK-13723) Use liquid-c for faster doc generation

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13723.


> Use liquid-c for faster doc generation
> --
>
> Key: FLINK-13723
> URL: https://issues.apache.org/jira/browse/FLINK-13723
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jekyll requires {{liquid}} and only optionally uses {{liquid-c}} if 
> available. The latter uses natively-compiled code and reduces generation time 
> by ~5% for me.





[jira] [Resolved] (FLINK-13723) Use liquid-c for faster doc generation

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13723.
--
Fix Version/s: 1.10.0
   Resolution: Fixed

Fixed via ac375e4f94c0d4def84a4016bf9055c6a9f7314c

> Use liquid-c for faster doc generation
> --
>
> Key: FLINK-13723
> URL: https://issues.apache.org/jira/browse/FLINK-13723
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jekyll requires {{liquid}} and only optionally uses {{liquid-c}} if 
> available. The latter uses natively-compiled code and reduces generation time 
> by ~5% for me.





[jira] [Closed] (FLINK-13724) Remove unnecessary whitespace from the docs' sidenav

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13724.


> Remove unnecessary whitespace from the docs' sidenav
> 
>
> Key: FLINK-13724
> URL: https://issues.apache.org/jira/browse/FLINK-13724
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The side navigation generates quite some whitespace that ends up in 
> every HTML page. Removing this reduces final page sizes and also improves 
> site generation speed.





[jira] [Resolved] (FLINK-13724) Remove unnecessary whitespace from the docs' sidenav

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13724.
--
Fix Version/s: 1.10.0
   Resolution: Fixed

Fixed via e670293f90f70a5b2b72b33b48b08e414ef3fd5d

> Remove unnecessary whitespace from the docs' sidenav
> 
>
> Key: FLINK-13724
> URL: https://issues.apache.org/jira/browse/FLINK-13724
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The side navigation generates quite some whitespace that ends up in 
> every HTML page. Removing this reduces final page sizes and also improves 
> site generation speed.





[jira] [Commented] (FLINK-13728) Fix wrong closing tag order in sidenav

2019-08-26 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915723#comment-16915723
 ] 

Stephan Ewen commented on FLINK-13728:
--

Fixed in 1.9.1 via 9ba0a8906e24fa864a89df65edbc95c25ec3f6dd

> Fix wrong closing tag order in sidenav
> --
>
> Key: FLINK-13728
> URL: https://issues.apache.org/jira/browse/FLINK-13728
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.8.1, 1.9.0, 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The order of closing HTML tags in the sidenav is wrong: instead of 
> {{}} it should be {{}}





[jira] [Closed] (FLINK-13728) Fix wrong closing tag order in sidenav

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-13728.


> Fix wrong closing tag order in sidenav
> --
>
> Key: FLINK-13728
> URL: https://issues.apache.org/jira/browse/FLINK-13728
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.8.1, 1.9.0, 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The order of closing HTML tags in the sidenav is wrong: instead of 
> {{}} it should be {{}}





[jira] [Updated] (FLINK-13728) Fix wrong closing tag order in sidenav

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen updated FLINK-13728:
-
Fix Version/s: 1.9.1

> Fix wrong closing tag order in sidenav
> --
>
> Key: FLINK-13728
> URL: https://issues.apache.org/jira/browse/FLINK-13728
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.8.1, 1.9.0, 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The order of closing HTML tags in the sidenav is wrong: instead of 
> {{}} it should be {{}}





[jira] [Updated] (FLINK-13728) Fix wrong closing tag order in sidenav

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen updated FLINK-13728:
-
Fix Version/s: 1.10.0

> Fix wrong closing tag order in sidenav
> --
>
> Key: FLINK-13728
> URL: https://issues.apache.org/jira/browse/FLINK-13728
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.8.1, 1.9.0, 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The order of closing HTML tags in the sidenav is wrong: instead of 
> {{}} it should be {{}}





[jira] [Resolved] (FLINK-13728) Fix wrong closing tag order in sidenav

2019-08-26 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-13728.
--
Resolution: Fixed

Fixed via b1c2e213302cd68758761c60a1ccff85c5c67203

> Fix wrong closing tag order in sidenav
> --
>
> Key: FLINK-13728
> URL: https://issues.apache.org/jira/browse/FLINK-13728
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.8.1, 1.9.0, 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The order of closing HTML tags in the sidenav is wrong: instead of 
> {{}} it should be {{}}





[jira] [Commented] (FLINK-13856) Reduce the delete file api when the checkpoint is completed

2019-08-26 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915667#comment-16915667
 ] 

Stephan Ewen commented on FLINK-13856:
--

In general, that is a good idea.

However, this is an optimization that we can do only for exclusive state 
(which belongs to exactly one checkpoint), but not for shared state (part of 
incremental checkpoints).
I remember wanting to do that and there was some issue with selectively 
deleting the shared state first, and then dropping the exclusive location as a 
whole.

Furthermore, this only works on proper file systems, not on object stores like 
S3.

> Reduce the delete file api when the checkpoint is completed
> ---
>
> Key: FLINK-13856
> URL: https://issues.apache.org/jira/browse/FLINK-13856
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Affects Versions: 1.8.1, 1.9.0
>Reporter: andrew.D.lin
>Priority: Major
> Attachments: f6cc56b7-2c74-4f4b-bb6a-476d28a22096.png
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When the new checkpoint is completed, an old checkpoint will be deleted by 
> calling CompletedCheckpoint.discardOnSubsume().
> When deleting old checkpoints, follow these steps:
> 1, drop the metadata
> 2, discard private state objects
> 3, discard location as a whole
> In some cases, is it possible to delete the checkpoint folder recursively with 
> a single call?
> As far as I know, for full (non-incremental) checkpoints it should be possible 
> to delete the folder directly.
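For a plain file system, the "one call" deletion would look roughly like this (a hedged sketch using java.nio; Flink's actual disposal goes through its own FileSystem abstraction, and the checkpoint directory layout below is made up):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CheckpointCleanup {

    /** Deletes a checkpoint directory recursively, deepest entries first. */
    static void deleteRecursively(Path checkpointDir) throws IOException {
        try (Stream<Path> tree = Files.walk(checkpointDir)) {
            tree.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake checkpoint layout: metadata plus one exclusive state file.
        Path dir = Files.createTempDirectory("chk-42");
        Files.createFile(dir.resolve("_metadata"));
        Files.createFile(dir.resolve("op-state-1"));
        deleteRecursively(dir);
        System.out.println(Files.exists(dir) ? "leaked" : "deleted");
    }
}
```

Note that on object stores like S3 there are no real directories, so a "recursive delete" is still one request per object, which is why the comment above restricts this optimization to proper file systems.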





[jira] [Assigned] (FLINK-13799) Web Job Submit Page displays stream of error messages when web submit is disabled in the config

2019-08-22 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen reassigned FLINK-13799:


Assignee: Yadong Xie

> Web Job Submit Page displays stream of error messages when web submit is 
> disabled in the config
> --
>
> Key: FLINK-13799
> URL: https://issues.apache.org/jira/browse/FLINK-13799
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.0
>Reporter: Stephan Ewen
>Assignee: Yadong Xie
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If you put {{web.submit.enable: false}} into the configuration, the web UI 
> will still display the "SubmitJob" page, but errors will continuously pop up, 
> stating
> "Unable to load requested file /jars."





[jira] [Commented] (FLINK-13799) Web Job Submit Page displays stream of error messages when web submit is disabled in the config

2019-08-22 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913110#comment-16913110
 ] 

Stephan Ewen commented on FLINK-13799:
--

The old web UI hid the "submit job" page altogether when web submit was 
disabled.

Can the same approach work for the new UI as well?

> Web Job Submit Page displays stream of error messages when web submit is 
> disabled in the config
> --
>
> Key: FLINK-13799
> URL: https://issues.apache.org/jira/browse/FLINK-13799
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.0
>Reporter: Stephan Ewen
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If you put {{web.submit.enable: false}} into the configuration, the web UI 
> will still display the "SubmitJob" page, but errors will continuously pop up, 
> stating
> "Unable to load requested file /jars."





[jira] [Commented] (FLINK-13770) Bump Netty to 4.1.39.Final

2019-08-22 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913084#comment-16913084
 ] 

Stephan Ewen commented on FLINK-13770:
--

Side note: we still use "NIO" as the default transport type; we could switch to 
"AUTO" in the future (and select epoll / kqueue when available).

> Bump Netty to 4.1.39.Final
> --
>
> Key: FLINK-13770
> URL: https://issues.apache.org/jira/browse/FLINK-13770
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I quickly went through all the changelogs for Netty 4.1.32 (which we
> currently use) to the latest Netty 4.1.39.Final. Below, you will find a
> list of bug fixes and performance improvements that may affect us. Nice
> changes we could benefit from, also for the Java > 8 efforts. The most
> important ones fixing leaks etc are #8921, #9167, #9274, #9394, and the
> various {{CompositeByteBuf}} fixes. The rest are mostly performance
> improvements.
> Since we are still early in the dev cycle for Flink 1.10, it would be
> nice to update now and verify that the new version works correctly.
> {code}
> Netty 4.1.33.Final
> - Fix ClassCastException and native crash when using kqueue transport
> (#8665)
> - Provide a way to cache the internal nioBuffer of the PooledByteBuffer
> to reduce GC (#8603)
> Netty 4.1.34.Final
> - Do not use GetPrimitiveArrayCritical(...) due multiple not-fixed bugs
> related to GCLocker (#8921)
> - Correctly monkey-patch id also when os / arch is used within library
> name (#8913)
> - Further reduce ensureAccessible() overhead (#8895)
> - Support using an Executor to offload blocking / long-running tasks
> when processing TLS / SSL via the SslHandler (#8847)
> - Minimize memory footprint for AbstractChannelHandlerContext for
> handlers that execute in the EventExecutor (#8786)
> - Fix three bugs in CompositeByteBuf (#8773)
> Netty 4.1.35.Final
> - Fix possible ByteBuf leak when CompositeByteBuf is resized (#8946)
> - Correctly produce ssl alert when certificate validation fails on the
> client-side when using native SSL implementation (#8949)
> Netty 4.1.37.Final
> - Don't filter out TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (#9274)
> - Try to mark child channel writable again once the parent channel
> becomes writable (#9254)
> - Properly debounce wakeups (#9191)
> - Don't read from timerfd and eventfd on each EventLoop tick (#9192)
> - Correctly detect that KeyManagerFactory is not supported when using
> OpenSSL 1.1.0+ (#9170)
> - Fix possible unsafe sharing of internal NIO buffer in CompositeByteBuf
> (#9169)
> - KQueueEventLoop won't unregister active channels reusing a file
> descriptor (#9149)
> - Prefer direct io buffers if direct buffers pooled (#9167)
> Netty 4.1.38.Final
> - Prevent ByteToMessageDecoder from overreading when !isAutoRead (#9252)
> - Correctly take length of ByteBufInputStream into account for
> readLine() / readByte() (#9310)
> - availableSharedCapacity will be slowly exhausted (#9394)
> {code}





[jira] [Assigned] (FLINK-13779) PrometheusPushGatewayReporter support push metrics with groupingKey

2019-08-22 Thread Stephan Ewen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen reassigned FLINK-13779:


Assignee: Kaibo Zhou

> PrometheusPushGatewayReporter support push metrics with groupingKey
> ---
>
> Key: FLINK-13779
> URL: https://issues.apache.org/jira/browse/FLINK-13779
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Metrics
>Reporter: Kaibo Zhou
>Assignee: Kaibo Zhou
>Priority: Minor
>
> The Prometheus push gateway Java SDK supports sending metrics with a _groupingKey_, 
> see 
> [doc|https://prometheus.github.io/client_java/io/prometheus/client/exporter/PushGateway.html#push-io.prometheus.client.CollectorRegistry-java.lang.String-java.util.Map-].
>  
> This feature will make it convenient for users to identify, group, or filter 
> their metrics by defining an optional _groupingKey_. Users do not need to 
> configure this by default, and the default behavior remains the same.
>  





[jira] [Commented] (FLINK-13779) PrometheusPushGatewayReporter support push metrics with groupingKey

2019-08-22 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913083#comment-16913083
 ] 

Stephan Ewen commented on FLINK-13779:
--

Sounds like a useful feature. I assigned the issue to you.

> PrometheusPushGatewayReporter support push metrics with groupingKey
> ---
>
> Key: FLINK-13779
> URL: https://issues.apache.org/jira/browse/FLINK-13779
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Metrics
>Reporter: Kaibo Zhou
>Assignee: Kaibo Zhou
>Priority: Minor
>
> The Prometheus push gateway Java SDK supports sending metrics with a _groupingKey_, 
> see 
> [doc|https://prometheus.github.io/client_java/io/prometheus/client/exporter/PushGateway.html#push-io.prometheus.client.CollectorRegistry-java.lang.String-java.util.Map-].
>  
> This feature will make it convenient for users to identify, group, or filter 
> their metrics by defining an optional _groupingKey_. Users do not need to 
> configure this by default, and the default behavior remains the same.
>  





[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files

2019-08-22 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913079#comment-16913079
 ] 

Stephan Ewen commented on FLINK-13808:
--

[~carp84] FYI

> Checkpoints expired by timeout may leak RocksDB files
> -
>
> Key: FLINK-13808
> URL: https://issues.apache.org/jira/browse/FLINK-13808
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.8.0, 1.8.1
> Environment: So far only reliably reproducible on a 4-node cluster 
> with parallelism ≥ 100. But do try 
> https://github.com/jcaesar/flink-rocksdb-file-leak
>Reporter: Julius Michaelis
>Priority: Minor
>
> A RocksDB state backend with HDFS checkpoints, with or without local 
> recovery, may leak files in {{io.tmp.dirs}} on checkpoint expiry by timeout.
> If the size of a checkpoint crosses what can be transferred during one 
> checkpoint timeout, checkpoints will continue to fail forever. If this is 
> combined with a quick rollover of SST files (e.g. due to a high density of 
> writes), this may quickly exhaust available disk space (or memory, as /tmp is 
> the default location).
> As a workaround, the jobmanager's REST API can be frequently queried for 
> failed checkpoints, and associated files deleted accordingly.
> I've tried investigating the cause a little bit, but I'm stuck:
>  * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before 
> completing.}} and similar gets printed, so
>  * [{{abortExpired}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547-L549],
>  so
>  * [{{dispose}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L416],
>  so
>  * [{{cancelCaller}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488],
>  so
>  * [the canceler is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497]
>  ([through one more 
> layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
>  so
>  * [{{cleanup}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95],
>  (possibly [not from 
> {{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]),
>  so
>  * [{{cleanupProvidedResources}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162]
>  (this is the indirection that made me give up), so
>  * [this trace 
> log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372]
>  should be printed, but it isn't.
> I have some time to further investigate, but I'd appreciate help on finding 
> out where in this chain things go wrong.





[jira] [Commented] (FLINK-13806) Metric Fetcher floods the JM log with errors when TM is lost

2019-08-22 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913077#comment-16913077
 ] 

Stephan Ewen commented on FLINK-13806:
--

Setting to DEBUG level sounds good.

> Metric Fetcher floods the JM log with errors when TM is lost
> 
>
> Key: FLINK-13806
> URL: https://issues.apache.org/jira/browse/FLINK-13806
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.9.0
>Reporter: Stephan Ewen
>Priority: Critical
> Fix For: 1.10.0, 1.9.1
>
>
> When a task manager is lost, the log contains a series of exceptions from the 
> metrics fetcher, making it hard to identify the exceptions from the actual 
> job failure.
> The exception below appears multiple times (in my example, eight times) in 
> a simple 4-TM setup after one TM failure.
> I would suggest suppressing "failed asks" (timeouts) from the metrics fetcher 
> service, because the fetcher does not have enough information to distinguish 
> root-cause exceptions from follow-up exceptions. In most cases, these 
> exceptions are follow-ups to a failure that is already handled in the 
> scheduler/ExecutionGraph, and the additional exception logging only 
> adds noise to the log.
> {code}
> 2019-08-20 22:00:09,865 WARN  
> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl  - 
> Requesting TaskManager's path for query services failed.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: 
> Ask timed out on [Actor[akka://flink/user/dispatcher#-1834666306]] after 
> [1 ms]. Message of type 
> [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason 
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
> at akka.dispatch.OnComplete.internal(Future.scala:263)
> at akka.dispatch.OnComplete.internal(Future.scala:261)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka://flink/user/dispatcher#-1834666306]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> ... 9 more
> {code}
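Until such a change lands, one user-side mitigation is to raise the log threshold for the fetcher in {{log4j.properties}} (logger name taken from the stack trace above; adjust if it differs in your version):

```
log4j.logger.org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl=ERROR
```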





[jira] [Commented] (FLINK-10333) Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)

2019-08-21 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912188#comment-16912188
 ] 

Stephan Ewen commented on FLINK-10333:
--

Great, thanks so much [~Tison] for the insights into the performance impact.

> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, 
> CompletedCheckpoints)
> -
>
> Key: FLINK-10333
> URL: https://issues.apache.org/jira/browse/FLINK-10333
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Major
> Attachments: screenshot-1.png
>
>
> While going over the ZooKeeper based stores 
> ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}}, 
> {{ZooKeeperCompletedCheckpointStore}}) and the underlying 
> {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were 
> introduced with past incremental changes.
> * Depending on whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}} 
> or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization 
> problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of 
> exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case 
> of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be 
> better to move {{RetrievableStateStorageHelper}} out of it for a better 
> separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even 
> if it is locked. This should not happen since it could leave another system 
> in an inconsistent state (imagine a changed {{JobGraph}} which restores from 
> an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}} 
> which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me wonder how reliably these components actually work. 
> Since these components are very important, I propose to refactor them.





[jira] [Created] (FLINK-13806) Metric Fetcher floods the JM log with errors when TM is lost

2019-08-20 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-13806:


 Summary: Metric Fetcher floods the JM log with errors when TM is 
lost
 Key: FLINK-13806
 URL: https://issues.apache.org/jira/browse/FLINK-13806
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.9.0
Reporter: Stephan Ewen
 Fix For: 1.10.0, 1.9.1


When a task manager is lost, the log contains a series of exceptions from the 
metrics fetcher, making it hard to identify the exceptions from the actual job 
failure.

The exception below appears multiple times (in my example, eight times) in a 
simple 4 TM setup after one TM failure.

I would suggest suppressing "failed asks" (timeouts) from the metrics fetcher 
service, because the fetcher does not have enough information to distinguish 
root-cause exceptions from follow-up exceptions. In most cases, these exceptions 
are follow-ups to a failure that is already handled in the 
scheduler/ExecutionGraph, and the additional exception logging only adds noise 
to the log.

{code}
2019-08-20 22:00:09,865 WARN  
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl  - 
Requesting TaskManager's path for query services failed.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on [Actor[akka://flink/user/dispatcher#-1834666306]] after [1 
ms]. Message of type 
[org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason 
for `AskTimeoutException` is that the recipient actor didn't send a reply.
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
at akka.dispatch.OnComplete.internal(Future.scala:263)
at akka.dispatch.OnComplete.internal(Future.scala:261)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
at java.lang.Thread.run(Thread.java:748)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://flink/user/dispatcher#-1834666306]] after [1 ms]. Message of 
type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
reason for `AskTimeoutException` is that the recipient actor didn't send a 
reply.
at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
... 9 more
{code}






[jira] [Created] (FLINK-13805) Bad Error Message when TaskManager is lost

2019-08-20 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-13805:


 Summary: Bad Error Message when TaskManager is lost
 Key: FLINK-13805
 URL: https://issues.apache.org/jira/browse/FLINK-13805
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Stephan Ewen
 Fix For: 1.10.0, 1.9.1


When a TaskManager is lost, the job reports as the failure cause
{code}
org.apache.flink.util.FlinkException: The assigned slot 
6d0e469d55a2630871f43ad0f89c786c_0 was removed.
{code}

That is a pretty bad error message; as a user, I don't know what it means. 
It sounds like it could simply refer to internal bookkeeping, maybe some 
rebalancing.
You need to know a lot about Flink to understand that this actually means 
"TaskManager failure".






[jira] [Commented] (FLINK-13798) Refactor the process of checking stream status while emitting watermark in source

2019-08-20 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911428#comment-16911428
 ] 

Stephan Ewen commented on FLINK-13798:
--

FYI: I believe one outcome of FLIP-27 will be to drop the 
{{TimestampsAndPeriodicWatermarksOperator}} and have Watermark Generation 
strictly in the source operator.

The reason is that this allows for transparent "push down" of watermark 
generation to the source partition level (like Kafka topic partition) and we 
should avoid having two slightly different ways to generate watermarks.

> Refactor the process of checking stream status while emitting watermark in 
> source
> -
>
> Key: FLINK-13798
> URL: https://issues.apache.org/jira/browse/FLINK-13798
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Task
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>
> As we know, a watermark can be emitted downstream only when the stream 
> status is active. For the downstream task we already have the 
> StatusWatermarkValve component in StreamInputProcessor to handle this logic, 
> but for the source task the current implementation seems a bit tricky. There 
> are two scenarios for the source case:
>  * Emitting a watermark via the source context: the WatermarkContext toggles 
> the stream status to active before collecting/emitting records/watermarks. 
> The RecordWriterOutput implementation then checks that the status is active 
> before actually emitting the watermark.
>  * TimestampsAndPeriodicWatermarksOperator: the watermark is triggered by a 
> timer at a fixed interval. When it fires, the output stack is called to emit 
> the watermark, and RecordWriterOutput takes the role of checking the status 
> before actually emitting it.
> So the status-checking logic in RecordWriterOutput is only needed for the 
> second scenario and is redundant for the first, because WatermarkContext 
> always toggles the status to active before emitting. Even worse, the logic in 
> RecordWriterOutput introduces a cyclic dependency with StreamStatusMaintainer, 
> which is a blocker for the following work of integrating source processing on 
> the runtime side.
> The solution is to migrate the checking logic from RecordWriterOutput to 
> TimestampsAndPeriodicWatermarksOperator.





[jira] [Created] (FLINK-13799) Web Job Submit Page displays a stream of error messages when web submit is disabled in the config

2019-08-20 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-13799:


 Summary: Web Job Submit Page displays a stream of error messages when 
web submit is disabled in the config
 Key: FLINK-13799
 URL: https://issues.apache.org/jira/browse/FLINK-13799
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Affects Versions: 1.9.0
Reporter: Stephan Ewen


If you put {{web.submit.enable: false}} into the configuration, the web UI will 
still display the "SubmitJob" page, but errors will continuously pop up, stating

"Unable to load requested file /jars."





[jira] [Commented] (FLINK-10333) Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)

2019-08-20 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911394#comment-16911394
 ] 

Stephan Ewen commented on FLINK-10333:
--

Quick question: Do we have an idea whether the transactional store takes the 
same time to update the ZK entry, or whether it takes significantly longer?

Background: The time to set the "latest checkpoint" in ZK is part of the time 
it takes to complete a checkpoint.
There are some thoughts about bringing the checkpointing time down on the 
TaskManager side, so it would be great to keep the JM side fast/efficient as 
well, so that it does not become the bottleneck for fast checkpoints.

> Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, 
> CompletedCheckpoints)
> -
>
> Key: FLINK-10333
> URL: https://issues.apache.org/jira/browse/FLINK-10333
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Major
>
> While going over the ZooKeeper based stores 
> ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}}, 
> {{ZooKeeperCompletedCheckpointStore}}) and the underlying 
> {{ZooKeeperStateHandleStore}} I noticed several inconsistencies which were 
> introduced with past incremental changes.
> * Depending on whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}} 
> or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization 
> problems will either lead to removing the Znode or not
> * {{ZooKeeperStateHandleStore}} leaves inconsistent state in case of 
> exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case 
> of a failure)
> * {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be 
> better to move {{RetrievableStateStorageHelper}} out of it for a better 
> separation of concerns
> * {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even 
> if it is locked. This should not happen since it could leave another system 
> in an inconsistent state (imagine a changed {{JobGraph}} which restores from 
> an old checkpoint)
> * Redundant but also somewhat inconsistent put logic in the different stores
> * Shadowing of ZooKeeper specific exceptions in {{ZooKeeperStateHandleStore}} 
> which were expected to be caught in {{ZooKeeperSubmittedJobGraphStore}}
> * Getting rid of the {{SubmittedJobGraphListener}} would be helpful
> These problems made me wonder how reliably these components actually work. 
> Since these components are very important, I propose to refactor them.





[jira] [Created] (FLINK-13795) Web UI logs errors when selecting Checkpoint Tab for Batch Jobs

2019-08-20 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-13795:


 Summary: Web UI logs errors when selecting Checkpoint Tab for 
Batch Jobs
 Key: FLINK-13795
 URL: https://issues.apache.org/jira/browse/FLINK-13795
 Project: Flink
  Issue Type: Bug
  Components: Runtime / REST
Affects Versions: 1.9.0
Reporter: Stephan Ewen


The logs of the REST endpoint print errors if you run a batch job and then 
select the "Checkpoints" tab.

I would expect that this simply shows "no checkpoints available for this job" 
and not that an {{ERROR}} level entry appears in the log.

{code}
2019-08-20 12:04:54,195 ERROR 
org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler
  - Exception occurred in REST handler: Checkpointing has not been enabled.
{code}





[jira] [Commented] (FLINK-13717) allow to set taskmanager.host and taskmanager.bind-host separately

2019-08-20 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911137#comment-16911137
 ] 

Stephan Ewen commented on FLINK-13717:
--

My suggestion would be to

  - Set the default for "bind" to "0.0.0.0"
  - Set the default for "host" to "{{InetAddress.getLocalHost()}}" (like 
currently)



> allow to set taskmanager.host and taskmanager.bind-host separately
> --
>
> Key: FLINK-13717
> URL: https://issues.apache.org/jira/browse/FLINK-13717
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Configuration, Runtime / Network
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Robert Fiser
>Priority: Major
>
> We are trying to use Flink in a Docker container with a bridge network.
> Without specifying taskmanager.host, the taskmanager binds a host/address that 
> is not visible in the cluster. The behaviour is the same when taskmanager.host 
> is set to 0.0.0.0.
> When it is set to an external address or host name, the taskmanager cannot 
> bind the address because of the bridge network.
> So we need taskmanager.host, which is reported to the jobmanager, and 
> taskmanager.bind-host, which the taskmanager can bind inside the container.
> It is similar to https://issues.apache.org/jira/browse/FLINK-2821 but the 
> problem is with taskmanagers.
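What the requested split could look like in {{flink-conf.yaml}}. Note that {{taskmanager.bind-host}} is the option proposed in this ticket, not an existing setting, and the hostname is a placeholder:

```
# advertised address, reported to the JobManager (must be reachable from the cluster)
taskmanager.host: tm-1.example.com
# proposed option (this ticket): address the TaskManager binds inside the container
taskmanager.bind-host: 0.0.0.0
```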





[jira] [Commented] (FLINK-13788) Document state migration constraints on keys

2019-08-19 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910539#comment-16910539
 ] 

Stephan Ewen commented on FLINK-13788:
--

Can the state processor API help to "rewrite" a savepoint for key migration?

> Document state migration constraints on keys
> 
>
> Key: FLINK-13788
> URL: https://issues.apache.org/jira/browse/FLINK-13788
> Project: Flink
>  Issue Type: Improvement
>Reporter: Seth Wiesman
>Assignee: Seth Wiesman
>Priority: Major
>
> [https://lists.apache.org/thread.html/79d74334baecbcbd765cead2f90df470d3ffe55b839f208e9695a6e2@%3Cuser.flink.apache.org%3E]
>  
> Key migrations are not allowed, in order to prevent: 
>  
> 1) Key clashes
> 2) Change in key group assignment
>  
>  
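Both constraints follow from how keyed state is partitioned: a key hashes to a key group, and key groups map contiguously to operator instances. A simplified sketch (Flink actually applies a murmur hash to {{key.hashCode()}}; Python's built-in {{hash}} stands in here, and all names are illustrative):

```python
def key_group_for(key, max_parallelism: int) -> int:
    # Flink: murmurHash(key.hashCode()) % maxParallelism.
    # Python's hash() is a stand-in for illustration only.
    return hash(key) % max_parallelism

def operator_index_for(key_group: int, max_parallelism: int, parallelism: int) -> int:
    # Key groups are assigned to operator instances in contiguous ranges.
    return key_group * parallelism // max_parallelism

# If a key's hash changes (e.g. its type or serialized form is migrated),
# it lands in a different key group, and its restored state may sit on a
# different operator instance -- hence the restriction.
assert all(0 <= operator_index_for(kg, 128, 4) < 4 for kg in range(128))
```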





[jira] [Commented] (FLINK-4256) Fine-grained recovery

2019-08-19 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910534#comment-16910534
 ] 

Stephan Ewen commented on FLINK-4256:
-

@Thomas - I think the approach to manually split jobs with a pubsub is an 
option. It could be interesting to add some tooling for that.

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Major
> Fix For: 1.9.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detailed design for version 1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#





[jira] [Commented] (FLINK-13717) allow to set taskmanager.host and taskmanager.bind-host separately

2019-08-18 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909983#comment-16909983
 ] 

Stephan Ewen commented on FLINK-13717:
--

I think having {{taskmanager.bind-host}} configurable makes sense. I was just 
asking whether a default value of {{0.0.0.0}} would fit your use case.

> allow to set taskmanager.host and taskmanager.bind-host separately
> --
>
> Key: FLINK-13717
> URL: https://issues.apache.org/jira/browse/FLINK-13717
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Configuration, Runtime / Network
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Robert Fiser
>Priority: Major
>
> We are trying to use Flink in a Docker container with a bridge network.
> Without specifying taskmanager.host, the taskmanager binds a host/address that 
> is not visible in the cluster. The behaviour is the same when taskmanager.host 
> is set to 0.0.0.0.
> When it is set to an external address or host name, the taskmanager cannot 
> bind the address because of the bridge network.
> So we need taskmanager.host, which is reported to the jobmanager, and 
> taskmanager.bind-host, which the taskmanager can bind inside the container.
> It is similar to https://issues.apache.org/jira/browse/FLINK-2821 but the 
> problem is with taskmanagers.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2019-08-18 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909980#comment-16909980
 ] 

Stephan Ewen commented on FLINK-4256:
-

This is in fact working for streaming as well, not only for batch. It works for 
both on the granularity of "pipelined regions".

However, with blocking "batch" shuffles, a batch job decomposes into many small 
pipelined regions, which can be individually recovered. Streaming programs only 
decompose into multiple pipelined regions when they do not have an all-to-all 
shuffle ({{keyBy()}} or {{rebalance()}}).

Anything beyond that, like more fine grained recovery of streaming jobs is not 
in the scope here, because it would need a mechanism different from Flink's 
current checkpointing mechanism.
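The decomposition described above is essentially connected components over pipelined edges: tasks linked by a pipelined channel share a region, while blocking (batch) edges separate regions. A toy sketch (graph shape and all names are illustrative):

```python
def pipelined_regions(num_tasks: int, pipelined_edges: list) -> list:
    """Group tasks into regions = connected components over pipelined edges."""
    parent = list(range(num_tasks))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pipelined_edges:
        parent[find(a)] = find(b)

    regions = {}
    for t in range(num_tasks):
        regions.setdefault(find(t), []).append(t)
    return list(regions.values())

# 4 tasks; the edge between task 1 and task 2 is blocking, so it is
# omitted from the pipelined edges -> two independently recoverable regions
print(pipelined_regions(4, [(0, 1), (2, 3)]))  # [[0, 1], [2, 3]]
```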

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Major
> Fix For: 1.9.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detailed design for version 1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#





[jira] [Commented] (FLINK-13432) Set max polling interval threshold for job result polling

2019-08-18 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909967#comment-16909967
 ] 

Stephan Ewen commented on FLINK-13432:
--

A "long poll" is a single poll that blocks until a result is available.
If the job is short, and the result available quickly, it returns quickly. If 
the result takes longer, the poll just blocks and waits.

It is a way to get around the trade-off between poll frequency and result 
latency.
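The idea can be sketched with a blocking queue standing in for the job-result endpoint: the client makes one call that simply waits until the result exists, rather than sleeping and retrying (all names are illustrative):

```python
import queue
import threading

results: "queue.Queue[str]" = queue.Queue()

def long_poll(timeout: float = None) -> str:
    # One blocking call: returns as soon as the result is available,
    # instead of polling on a fixed interval.
    return results.get(timeout=timeout)

def finish_job() -> None:
    results.put("job-result")

threading.Timer(0.1, finish_job).start()  # job "completes" after ~0.1 s
print(long_poll(timeout=5))  # job-result, roughly 0.1 s after the call
```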

> Set max polling interval threshold for job result polling
> -
>
> Key: FLINK-13432
> URL: https://issues.apache.org/jira/browse/FLINK-13432
> Project: Flink
>  Issue Type: Improvement
>  Components: Command Line Client
>Affects Versions: 1.9.0
>Reporter: Jeff Zhang
>Priority: Major
>
> Currently, Flink increases the polling interval by one millisecond after each 
> poll. That might be fine for most cases, but it would be better to set a 
> maximum polling-interval threshold.
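The requested threshold amounts to clamping the growing interval; a sketch (the +1 ms growth per poll is from the description above, while the 1000 ms cap is an arbitrary illustrative value):

```python
def next_interval_ms(current_ms: int, max_ms: int = 1000) -> int:
    # Grow by 1 ms per poll as today, but never beyond the cap, so the
    # worst-case result latency for long-running jobs stays bounded.
    return min(current_ms + 1, max_ms)

interval = 0
for _ in range(2000):          # a long-running job: many polls
    interval = next_interval_ms(interval)
print(interval)  # 1000 -- capped instead of growing to 2000
```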





[jira] [Commented] (FLINK-13707) Make max parallelism configurable

2019-08-15 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908149#comment-16908149
 ] 

Stephan Ewen commented on FLINK-13707:
--

I think that is a fair point. Similar to the default parallelism in the 
{{flink-conf.yaml}}, there could also be a default max parallelism.

> Make  max parallelism configurable
> --
>
> Key: FLINK-13707
> URL: https://issues.apache.org/jira/browse/FLINK-13707
> Project: Flink
>  Issue Type: New Feature
>  Components: API / DataStream
>Reporter: xuekang
>Priority: Minor
>
> For now, if a user sets a parallelism larger than 128 and does not set the 
> max parallelism explicitly, the system computes a max parallelism of roughly 
> 1.5 * parallelism. When the job changes its parallelism and recovers from a 
> savepoint, there may be problems restoring state from the savepoint, because 
> the number of key groups has changed.
> To avoid this problem, without modifying the code of existing jobs, we want 
> to configure the default max parallelism in flink-conf.yaml, but it is not 
> configurable now.
> Should we make it configurable? Any comments would be appreciated.
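For reference, the auto-computed default mentioned above can be sketched as follows; the power-of-two rounding and the 128/32768 bounds reflect my reading of {{KeyGroupRangeAssignment}} and should be double-checked against the source:

```python
def round_up_to_power_of_two(x: int) -> int:
    return 1 if x <= 1 else 1 << (x - 1).bit_length()

def default_max_parallelism(parallelism: int) -> int:
    # ~1.5 * parallelism, rounded up to a power of two,
    # clamped to [128, 32768] (bounds assumed -- verify in source).
    return min(max(round_up_to_power_of_two(parallelism + parallelism // 2), 128), 1 << 15)

print(default_max_parallelism(200))  # 512
```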





[jira] [Commented] (FLINK-13717) allow to set taskmanager.host and taskmanager.bind-host separately

2019-08-15 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908148#comment-16908148
 ] 

Stephan Ewen commented on FLINK-13717:
--

Sounds like a good improvement.

Just to double check: Binding to 0.0.0.0 does not include listening at the 
bridge interface?
Asked differently, would binding to 0.0.0.0 by default and advertising the 
configured hostname work?


> allow to set taskmanager.host and taskmanager.bind-host separately
> --
>
> Key: FLINK-13717
> URL: https://issues.apache.org/jira/browse/FLINK-13717
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Configuration, Runtime / Network
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Robert Fiser
>Priority: Major
>
> We are trying to use Flink in a Docker container with a bridge network.
> Without specifying taskmanager.host, the TaskManager binds a host/address 
> that is not visible in the cluster. The behaviour is the same when 
> taskmanager.host is set to 0.0.0.0.
> When it is set to an external address or host name, the TaskManager cannot 
> bind the address because of the bridge network.
> So we need taskmanager.host, which is reported to the JobManager, and 
> taskmanager.bind-host, which the TaskManager can bind inside the container.
> This is similar to https://issues.apache.org/jira/browse/FLINK-2821, but the 
> problem here is with the TaskManagers.
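A minimal flink-conf.yaml sketch of the proposed split; taskmanager.host is an existing key, while taskmanager.bind-host is the hypothetical new key this issue asks for:

```yaml
# advertised address, reported to the JobManager
# (must be reachable from the cluster network)
taskmanager.host: taskmanager-1.example.com

# address the TaskManager actually binds inside the container
# (hypothetical key proposed by this issue)
taskmanager.bind-host: 0.0.0.0
```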



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13707) Make max parallelism configurable

2019-08-15 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907887#comment-16907887
 ] 

Stephan Ewen commented on FLINK-13707:
--

Just to double check: You can set the max parallelism explicitly in the 
execution environment.

Are you looking for a way to define a cluster-wide default?

> Make max parallelism configurable
> --
>
> Key: FLINK-13707
> URL: https://issues.apache.org/jira/browse/FLINK-13707
> Project: Flink
>  Issue Type: New Feature
>  Components: API / DataStream
>Reporter: xuekang
>Priority: Minor
>
> For now, if a user sets a parallelism larger than 128 and does not set the 
> max parallelism explicitly, the system computes a max parallelism of 
> 1.5 * parallelism. When the job changes its parallelism and recovers from a 
> savepoint, restoring state from the savepoint may fail, because the number 
> of key groups has changed.
> To avoid this problem, and to avoid modifying the code of existing jobs, 
> we want to configure the default max parallelism in flink-conf.yaml, but it 
> is not configurable now.
> Should we make it configurable? Any comments would be appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13732) Enhance JobManagerMetricGroup with FLIP-6 architecture

2019-08-15 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907883#comment-16907883
 ] 

Stephan Ewen commented on FLINK-13732:
--

I think that is reasonable. The TaskManagers already have a {{ResourceID}}, 
which is even derived from the environment (like the YARN container ID).

Having something similar on the JobManager / ResourceManager makes sense.

> Enhance JobManagerMetricGroup with FLIP-6 architecture
> --
>
> Key: FLINK-13732
> URL: https://issues.apache.org/jira/browse/FLINK-13732
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Reporter: Biao Liu
>Priority: Major
>
> There is a requirement from the user mailing list [1]. I think it's 
> reasonable enough to support.
> The scenario is that when deploying a Flink cluster on YARN, there might be 
> several {{JM(RM)}}s running on the same host. IMO that's quite a common 
> scenario. However, we can't distinguish the metrics of different 
> {{JobManagerMetricGroup}}s, because "hostname" is the only variable we can 
> use.
> I think there are some problems with the current implementation of 
> {{JobManagerMetricGroup}}. It's still non-FLIP-6 style. We should split the 
> metric group into {{RM}} and {{Dispatcher}} to match the FLIP-6 architecture, 
> and there should be an identification variable, just like {{tm_id}}.
> CC [~StephanEwen], [~till.rohrmann], [~Zentol]
> 1. 
> [http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-metrics-scope-for-YARN-single-job-td29389.html]
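A sketch of what the requested identification variable could look like in the metrics scope configuration; metrics.scope.jm and the <host> variable exist in Flink today, while <jm_id> is the hypothetical new variable analogous to <tm_id>:

```yaml
# today only <host> is available, so several JMs on one host collide:
metrics.scope.jm: <host>.jobmanager

# with a hypothetical <jm_id> variable the groups become distinguishable:
metrics.scope.jm: <host>.jobmanager.<jm_id>
```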



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13698) Rework threading model of CheckpointCoordinator

2019-08-13 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906060#comment-16906060
 ] 

Stephan Ewen commented on FLINK-13698:
--

I would suggest to have the {{CheckpointCoordinator}} and the 
{{Scheduler}}/{{ExecutionGraph}} in the same thread.
That makes a lot of things much easier.

  - The checkpoint timer logic should be factored out into the periodic 
checkpoint trigger.
  - The actual call 
  - All cleanup and writing of state should be delegated to the I/O executor.
-> mainly state release as a "fire and forget" execution in the I/O executor
-> writing out master-hook data and checkpoint metadata needs to happen in 
the I/O executor, with a Future whose completion is again handled in the main 
thread executor.
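The hand-off described above (blocking checkpoint I/O on the I/O executor, completion handled back in the main thread executor) can be sketched in isolation. All class and method names here are illustrative stand-ins, not Flink's actual coordinator code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Blocking work runs on an I/O pool; its completion is re-dispatched onto
// a single "main thread" executor, so coordinator state is only ever
// mutated from one thread.
public class CoordinatorThreadingSketch {
    static final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    static final ExecutorService ioPool = Executors.newFixedThreadPool(2);
    // touched only from the mainThread executor
    static final List<String> completed = new ArrayList<>();

    static String writeMetadata(long checkpointId) {
        return "metadata-" + checkpointId; // stands in for blocking I/O
    }

    static Future<Void> triggerCheckpoint(long checkpointId) {
        return CompletableFuture
                .supplyAsync(() -> writeMetadata(checkpointId), ioPool) // off-thread I/O
                .thenAcceptAsync(completed::add, mainThread);           // hop back
    }

    public static void main(String[] args) throws Exception {
        triggerCheckpoint(17L).get(5, TimeUnit.SECONDS);
        System.out.println(completed); // -> [metadata-17]
        ioPool.shutdown();
        mainThread.shutdown();
    }
}
```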

> Rework threading model of CheckpointCoordinator
> ---
>
> Key: FLINK-13698
> URL: https://issues.apache.org/jira/browse/FLINK-13698
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Biao Liu
>Priority: Critical
>
> Currently {{CheckpointCoordinator}} and {{CheckpointFailureManager}} code is 
> executed by multiple different threads (mostly {{ioExecutor}}, but not only). 
> It's causing multiple concurrency issues, for example: 
> https://issues.apache.org/jira/browse/FLINK-13497
> The proper fix would be to rethink the threading model there. At first 
> glance it doesn't seem that this code should be multi-threaded, except for 
> the parts doing the actual IO operations, so it should be possible to run 
> everything in the single ExecutionGraph thread and just run the necessary 
> IO operations asynchronously with some feedback loop ("mailbox style").
> I would strongly recommend fixing this issue before adding new features to 
> the {{CheckpointCoordinator}} component.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5

2019-08-12 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905334#comment-16905334
 ] 

Stephan Ewen commented on FLINK-13417:
--

The {{flink-jepsen}} readme has some pointers to get started:
https://github.com/apache/flink/tree/master/flink-jepsen

> Bump Zookeeper to 3.5.5
> ---
>
> Key: FLINK-13417
> URL: https://issues.apache.org/jira/browse/FLINK-13417
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Konstantin Knauf
>Priority: Blocker
> Fix For: 1.10.0
>
>
> Users might want to secure their Zookeeper connection via SSL.
> This requires Zookeeper version >= 3.5.1. We might as well try to bump it 
> to 3.5.5, which is the latest version. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13688) HiveCatalogUseBlinkITCase.testBlinkUdf constantly failed with 1.9.0-rc2

2019-08-12 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905024#comment-16905024
 ] 

Stephan Ewen commented on FLINK-13688:
--

This is a common error in many tests.

I would recommend never just launching a program (implicitly using the local 
environment), but always using the mini cluster test rule instead, which 
gives a deterministic test setup.

> HiveCatalogUseBlinkITCase.testBlinkUdf constantly failed with 1.9.0-rc2
> ---
>
> Key: FLINK-13688
> URL: https://issues.apache.org/jira/browse/FLINK-13688
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hive, Tests
>Affects Versions: 1.9.0
> Environment: Linux server
> kernal version: 3.10.0
> java version: "1.8.0_102"
> processor count: 96
>Reporter: Kurt Young
>Assignee: Jingsong Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I tried to build Flink 1.9.0-rc2 from source and ran all tests on a Linux 
> server; HiveCatalogUseBlinkITCase.testBlinkUdf constantly fails. 
>  
> Fail trace:
> {code:java}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 313.228 s <<< FAILURE! - in 
> org.apache.flink.table.catalog.hive.HiveCatalogUseBlinkITCase
> [ERROR] 
> testBlinkUdf(org.apache.flink.table.catalog.hive.HiveCatalogUseBlinkITCase) 
> Time elapsed: 305.155 s <<< ERROR!
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> at 
> org.apache.flink.table.catalog.hive.HiveCatalogUseBlinkITCase.testBlinkUdf(HiveCatalogUseBlinkITCase.java:180)
> Caused by: 
> org.apache.flink.runtime.resourcemanager.exceptions.UnfulfillableSlotRequestException:
>  Could not fulfill slot request 35cf6fdc1b525de9b6eed13894e2e31d. Requested 
> resource profile (ResourceProfile{cpuCores=0.0, heapMemoryInMB=0, 
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0, 
> managedMemoryInMB=128}) is unfulfillable.
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13586) Method ClosureCleaner.clean broke backward compatibility between 1.8.0 and 1.8.1

2019-08-08 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902822#comment-16902822
 ] 

Stephan Ewen commented on FLINK-13586:
--

My first thought would be to go with option (2).

> Method ClosureCleaner.clean broke backward compatibility between 1.8.0 and 
> 1.8.1
> 
>
> Key: FLINK-13586
> URL: https://issues.apache.org/jira/browse/FLINK-13586
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Gaël Renoux
>Priority: Major
>
> Method clean in org.apache.flink.api.java.ClosureCleaner received a new 
> parameter in Flink 1.8.1. This class is marked as internal, but is used in 
> the Kafka connectors (in 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase).
> The Kafka connector library is not provided by the server and must be set 
> up as a dependency with compile scope (see 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#usage,
>  or the Maven project template). Any project using those connectors and built 
> with 1.8.0 cannot be deployed on a 1.8.1 Flink server, because it would 
> target the old method.
> => This method needs a fallback with the original two arguments (setting a 
> default value of RECURSIVE for the level argument).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13586) Method ClosureCleaner.clean broke backward compatibility between 1.8.0 and 1.8.1

2019-08-08 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902800#comment-16902800
 ] 

Stephan Ewen commented on FLINK-13586:
--

Looking at the changes in more detail, I think there was a mistake in the PR: 
the additional parameter was forced into every call site.

Call sites should have stayed the same, by keeping the original method (and 
semantics) and only adding the level in a new method.

We can do the following now:
  -  Add the method {{ClosureCleaner.clean(Object, boolean)}} and change all 
call sites in libraries and connectors to call that method.
  - As for which value we supply for the level there, we have two options:
   1.  {{TOP_LEVEL}} which makes this compatible with 1.8.0
   2.  {{RECURSIVE}} which makes this compatible with 1.8.1

Tough choice.

I don't understand how this got merged into 1.8.1 in the first place, 
changing the semantics from flat cleaning to recursive cleaning.
That is a change in behavior that should not happen in a bugfix release.
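The overload-based fallback discussed here can be sketched with a standalone stand-in class. The names mirror Flink's ClosureCleaner, but this is not the actual Flink code:

```java
// Sketch of restoring binary compatibility: keep the new three-argument
// clean() and re-add the original two-argument overload, which delegates
// with a default level. Stand-in class, not Flink's ClosureCleaner.
public class ClosureCleanerCompat {
    enum ExplicitLevel { TOP_LEVEL, RECURSIVE }

    // method added in 1.8.1, taking the cleaning level explicitly
    static String clean(Object fn, ExplicitLevel level, boolean checkSerializable) {
        return "cleaned " + level;
    }

    // restored 1.8.0 signature: jobs compiled against 1.8.0 link
    // against this method and keep working on a 1.8.1 server
    static String clean(Object fn, boolean checkSerializable) {
        return clean(fn, ExplicitLevel.RECURSIVE, checkSerializable);
    }

    public static void main(String[] args) {
        System.out.println(clean(new Object(), true)); // -> cleaned RECURSIVE
    }
}
```

Whether the delegating overload should default to TOP_LEVEL (1.8.0 behavior) or RECURSIVE (1.8.1 behavior) is exactly the open choice in the comment above.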
 

> Method ClosureCleaner.clean broke backward compatibility between 1.8.0 and 
> 1.8.1
> 
>
> Key: FLINK-13586
> URL: https://issues.apache.org/jira/browse/FLINK-13586
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Gaël Renoux
>Priority: Major
>
> Method clean in org.apache.flink.api.java.ClosureCleaner received a new 
> parameter in Flink 1.8.1. This class is marked as internal, but is used in 
> the Kafka connectors (in 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase).
> The Kafka connector library is not provided by the server and must be set 
> up as a dependency with compile scope (see 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#usage,
>  or the Maven project template). Any project using those connectors and built 
> with 1.8.0 cannot be deployed on a 1.8.1 Flink server, because it would 
> target the old method.
> => This method needs a fallback with the original two arguments (setting a 
> default value of RECURSIVE for the level argument).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


  1   2   3   4   5   6   7   8   9   10   >