[jira] [Commented] (FLINK-9185) Potential null dereference in PrioritizedOperatorSubtaskState#resolvePrioritizedAlternatives

2018-05-25 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491491#comment-16491491
 ] 

Ted Yu commented on FLINK-9185:
---

Can someone assign this to Stephen?

Thanks

> Potential null dereference in 
> PrioritizedOperatorSubtaskState#resolvePrioritizedAlternatives
> 
>
> Key: FLINK-9185
> URL: https://issues.apache.org/jira/browse/FLINK-9185
> Project: Flink
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
>
> {code}
> if (alternative != null
>   && alternative.hasState()
>   && alternative.size() == 1
>   && approveFun.apply(reference, alternative.iterator().next())) {
> {code}
> The return value from approveFun.apply is auto-unboxed; if apply returns null, 
> the unboxing throws a NullPointerException.
> We should check that the return value is not null before unboxing.
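A minimal sketch of a null-safe variant (a suggestion, assuming approveFun returns a possibly-null Boolean; this is not the committed fix):
{code}
// Guard against a null Boolean before auto-unboxing by comparing
// against Boolean.TRUE explicitly.
if (alternative != null
	&& alternative.hasState()
	&& alternative.size() == 1
	&& Boolean.TRUE.equals(approveFun.apply(reference, alternative.iterator().next()))) {
{code}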



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8430) Implement stream-stream non-window full outer join

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491482#comment-16491482
 ] 

ASF GitHub Bot commented on FLINK-8430:
---

GitHub user hequn8128 opened a pull request:

https://github.com/apache/flink/pull/6079

[FLINK-8430] [table] Implement stream-stream non-window full outer join


## What is the purpose of the change

Support stream-stream non-window full outer join.
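For context, a minimal example of the kind of query this enables (the table names and the `tableEnv` handle are illustrative, not part of this PR):

```java
// non-window full outer join over two streams, expressed via SQL
// on two registered tables A and B
Table result = tableEnv.sqlQuery(
    "SELECT a.id, a.v, b.v FROM A a FULL OUTER JOIN B b ON a.id = b.id");
```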


## Brief change log

  - Add full join process functions, including `NonWindowFullJoin` and 
`NonWindowFullJoinWithNonEquiPredicates`.
  - Add IT/UT/harness tests for full outer join.
  - Update the documentation of stream-stream joins.

## Verifying this change

This change added tests and can be verified as follows:

  - Added integration tests for full join with and without non-equi 
predicates.
  - Added harness tests for full join with and without non-equi predicates.
  - Added tests for the AccMode generated by full join.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (yes)
  - If yes, how is the feature documented? (docs)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hequn8128/flink 8430

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6079


commit 8b9e8c09aae7d72fae3a6c6bed3327376029a5a2
Author: hequn8128 
Date:   2018-05-26T02:29:51Z

[FLINK-8430] [table] Implement stream-stream non-window full outer join




> Implement stream-stream non-window full outer join
> --
>
> Key: FLINK-8430
> URL: https://issues.apache.org/jira/browse/FLINK-8430
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #6079: [FLINK-8430] [table] Implement stream-stream non-w...

2018-05-25 Thread hequn8128
GitHub user hequn8128 opened a pull request:

https://github.com/apache/flink/pull/6079

[FLINK-8430] [table] Implement stream-stream non-window full outer join


## What is the purpose of the change

Support stream-stream non-window full outer join.


## Brief change log

  - Add full join process functions, including `NonWindowFullJoin` and 
`NonWindowFullJoinWithNonEquiPredicates`.
  - Add IT/UT/harness tests for full outer join.
  - Update the documentation of stream-stream joins.

## Verifying this change

This change added tests and can be verified as follows:

  - Added integration tests for full join with and without non-equi 
predicates.
  - Added harness tests for full join with and without non-equi predicates.
  - Added tests for the AccMode generated by full join.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (yes)
  - If yes, how is the feature documented? (docs)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hequn8128/flink 8430

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6079


commit 8b9e8c09aae7d72fae3a6c6bed3327376029a5a2
Author: hequn8128 
Date:   2018-05-26T02:29:51Z

[FLINK-8430] [table] Implement stream-stream non-window full outer join




---


[jira] [Assigned] (FLINK-5505) Harmonize ZooKeeper configuration parameters

2018-05-25 Thread vinoyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang reassigned FLINK-5505:
---

Assignee: vinoyang

> Harmonize ZooKeeper configuration parameters
> 
>
> Key: FLINK-5505
> URL: https://issues.apache.org/jira/browse/FLINK-5505
> Project: Flink
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Till Rohrmann
>Assignee: vinoyang
>Priority: Trivial
> Fix For: 1.5.1
>
>
> Since Flink users don't necessarily know all of the Mesos terminology and a 
> JobManager also runs as a task, I would like to rename some of Flink's Mesos 
> configuration parameters. I would propose the following changes:
> {{mesos.initial-tasks}} => {{mesos.initial-taskmanagers}}
> {{mesos.maximum-failed-tasks}} => {{mesos.maximum-failed-taskmanagers}}
> {{mesos.resourcemanager.artifactserver.\*}} => {{mesos.artifactserver.*}}
> {{mesos.resourcemanager.framework.\*}} => {{mesos.framework.*}}
> {{mesos.resourcemanager.tasks.\*}} => {{mesos.taskmanager.*}}
> {{recovery.zookeeper.path.mesos-workers}} => 
> {{mesos.high-availability.zookeeper.path.mesos-workers}}
> What do you think [~eronwright]?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5992: [FLINK-8944] [Kinesis Connector] Use listShards instead o...

2018-05-25 Thread kailashhd
Github user kailashhd commented on the issue:

https://github.com/apache/flink/pull/5992
  
Please note I rebased my changes onto master, as I was having some 
failures when building just the flink-connector-kinesis maven project. I hope 
this will not cause problems with reviewing the code changes. Sorry for the 
inconvenience if it does. :( 


---


[jira] [Commented] (FLINK-8944) Use ListShards for shard discovery in the flink kinesis connector

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491434#comment-16491434
 ] 

ASF GitHub Bot commented on FLINK-8944:
---

Github user kailashhd commented on the issue:

https://github.com/apache/flink/pull/5992
  
Please note I rebased my changes onto master, as I was having some 
failures when building just the flink-connector-kinesis maven project. I hope 
this will not cause problems with reviewing the code changes. Sorry for the 
inconvenience if it does. :( 


> Use ListShards for shard discovery in the flink kinesis connector
> -
>
> Key: FLINK-8944
> URL: https://issues.apache.org/jira/browse/FLINK-8944
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Priority: Minor
>
> Currently, the DescribeStream AWS API used to get the list of shards has a 
> restricted rate limit on AWS (5 requests per second per account). This is 
> problematic when running multiple Flink jobs on the same account, since each 
> subtask calls DescribeStream. Changing this to ListShards will provide more 
> flexibility on rate limits, as ListShards has a limit of 100 requests per 
> second per data stream.
> More details on the mailing list: https://goo.gl/mRXjKh
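For illustration, a minimal sketch of paginated shard discovery via ListShards with the AWS Java SDK (client setup omitted; the stream name is an example, and this is not the connector's actual code):
{code}
ListShardsRequest request = new ListShardsRequest().withStreamName("my-stream");
List<Shard> shards = new ArrayList<>();
ListShardsResult result;
do {
    result = kinesisClient.listShards(request);
    shards.addAll(result.getShards());
    // follow-up pages must be requested with only the next token set
    request = new ListShardsRequest().withNextToken(result.getNextToken());
} while (result.getNextToken() != null);
{code}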



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8944) Use ListShards for shard discovery in the flink kinesis connector

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491432#comment-16491432
 ] 

ASF GitHub Bot commented on FLINK-8944:
---

Github user kailashhd commented on a diff in the pull request:

https://github.com/apache/flink/pull/5992#discussion_r191033584
  
--- Diff: 
flink-connectors/flink-connector-kinesis/src/test/java/org/apache/flink/streaming/connectors/kinesis/proxy/KinesisProxyTest.java
 ---
@@ -26,20 +29,60 @@
 import com.amazonaws.ClientConfigurationFactory;
 import com.amazonaws.services.kinesis.AmazonKinesis;
 import com.amazonaws.services.kinesis.model.ExpiredIteratorException;
+import com.amazonaws.services.kinesis.model.ListShardsRequest;
+import com.amazonaws.services.kinesis.model.ListShardsResult;
 import com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException;
+import com.amazonaws.services.kinesis.model.Shard;
+import org.hamcrest.Description;
+import org.hamcrest.TypeSafeDiagnosingMatcher;
+import org.junit.Assert;
 import org.junit.Test;
 import org.powermock.reflect.Whitebox;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
 import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
 
+import static org.hamcrest.MatcherAssert.assertThat;
+import static org.hamcrest.collection.IsCollectionWithSize.hasSize;
+import static org.hamcrest.collection.IsIterableContainingInAnyOrder.containsInAnyOrder;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertTrue;
+import static org.mockito.Matchers.argThat;
+import static org.mockito.Mockito.doReturn;
+import static org.mockito.Mockito.mock;
 
 /**
  * Test for methods in the {@link KinesisProxy} class.
  */
 public class KinesisProxyTest {
+   private static final String NEXT_TOKEN = "NextToken";
+   private static final String fakeStreamName = "fake-stream";
+   private Set<String> shardIdSet;
+   private List<Shard> shards;
+
+   protected static HashMap<String, String>
+   createInitialSubscribedStreamsToLastDiscoveredShardsState(List<String> streams) {
+   HashMap<String, String> initial = new HashMap<>();
+   for (String stream : streams) {
+   initial.put(stream, null);
+   }
+   return initial;
+   }
+
+   private static ListShardsRequestMatcher initialListShardsRequestMatcher() {
+   return new ListShardsRequestMatcher(null, null);
+   }
+
+   private static ListShardsRequestMatcher listShardsNextToken(final String nextToken) {
--- End diff --

Done


> Use ListShards for shard discovery in the flink kinesis connector
> -
>
> Key: FLINK-8944
> URL: https://issues.apache.org/jira/browse/FLINK-8944
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Priority: Minor
>
> Currently, the DescribeStream AWS API used to get the list of shards has a 
> restricted rate limit on AWS (5 requests per second per account). This is 
> problematic when running multiple Flink jobs on the same account, since each 
> subtask calls DescribeStream. Changing this to ListShards will provide more 
> flexibility on rate limits, as ListShards has a limit of 100 requests per 
> second per data stream.
> More details on the mailing list: https://goo.gl/mRXjKh



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8944) Use ListShards for shard discovery in the flink kinesis connector

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491433#comment-16491433
 ] 

ASF GitHub Bot commented on FLINK-8944:
---

Github user kailashhd commented on a diff in the pull request:

https://github.com/apache/flink/pull/5992#discussion_r191033585
  
--- Diff: 
flink-connectors/flink-connector-kinesis/src/test/java/org/apache/flink/streaming/connectors/kinesis/proxy/KinesisProxyTest.java
 ---
@@ -26,20 +29,60 @@
 import com.amazonaws.ClientConfigurationFactory;
 import com.amazonaws.services.kinesis.AmazonKinesis;
 import com.amazonaws.services.kinesis.model.ExpiredIteratorException;
+import com.amazonaws.services.kinesis.model.ListShardsRequest;
+import com.amazonaws.services.kinesis.model.ListShardsResult;
 import com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException;
+import com.amazonaws.services.kinesis.model.Shard;
+import org.hamcrest.Description;
+import org.hamcrest.TypeSafeDiagnosingMatcher;
+import org.junit.Assert;
 import org.junit.Test;
 import org.powermock.reflect.Whitebox;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
 import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
 
+import static org.hamcrest.MatcherAssert.assertThat;
+import static org.hamcrest.collection.IsCollectionWithSize.hasSize;
+import static org.hamcrest.collection.IsIterableContainingInAnyOrder.containsInAnyOrder;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertTrue;
+import static org.mockito.Matchers.argThat;
+import static org.mockito.Mockito.doReturn;
+import static org.mockito.Mockito.mock;
 
 /**
  * Test for methods in the {@link KinesisProxy} class.
  */
 public class KinesisProxyTest {
+   private static final String NEXT_TOKEN = "NextToken";
+   private static final String fakeStreamName = "fake-stream";
+   private Set<String> shardIdSet;
+   private List<Shard> shards;
--- End diff --

Done


> Use ListShards for shard discovery in the flink kinesis connector
> -
>
> Key: FLINK-8944
> URL: https://issues.apache.org/jira/browse/FLINK-8944
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Priority: Minor
>
> Currently, the DescribeStream AWS API used to get the list of shards has a 
> restricted rate limit on AWS (5 requests per second per account). This is 
> problematic when running multiple Flink jobs on the same account, since each 
> subtask calls DescribeStream. Changing this to ListShards will provide more 
> flexibility on rate limits, as ListShards has a limit of 100 requests per 
> second per data stream.
> More details on the mailing list: https://goo.gl/mRXjKh



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5992: [FLINK-8944] [Kinesis Connector] Use listShards in...

2018-05-25 Thread kailashhd
Github user kailashhd commented on a diff in the pull request:

https://github.com/apache/flink/pull/5992#discussion_r191033585
  
--- Diff: 
flink-connectors/flink-connector-kinesis/src/test/java/org/apache/flink/streaming/connectors/kinesis/proxy/KinesisProxyTest.java
 ---
@@ -26,20 +29,60 @@
 import com.amazonaws.ClientConfigurationFactory;
 import com.amazonaws.services.kinesis.AmazonKinesis;
 import com.amazonaws.services.kinesis.model.ExpiredIteratorException;
+import com.amazonaws.services.kinesis.model.ListShardsRequest;
+import com.amazonaws.services.kinesis.model.ListShardsResult;
 import com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException;
+import com.amazonaws.services.kinesis.model.Shard;
+import org.hamcrest.Description;
+import org.hamcrest.TypeSafeDiagnosingMatcher;
+import org.junit.Assert;
 import org.junit.Test;
 import org.powermock.reflect.Whitebox;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
 import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
 
+import static org.hamcrest.MatcherAssert.assertThat;
+import static org.hamcrest.collection.IsCollectionWithSize.hasSize;
+import static org.hamcrest.collection.IsIterableContainingInAnyOrder.containsInAnyOrder;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertTrue;
+import static org.mockito.Matchers.argThat;
+import static org.mockito.Mockito.doReturn;
+import static org.mockito.Mockito.mock;
 
 /**
  * Test for methods in the {@link KinesisProxy} class.
  */
 public class KinesisProxyTest {
+   private static final String NEXT_TOKEN = "NextToken";
+   private static final String fakeStreamName = "fake-stream";
+   private Set<String> shardIdSet;
+   private List<Shard> shards;
--- End diff --

Done


---


[GitHub] flink pull request #5992: [FLINK-8944] [Kinesis Connector] Use listShards in...

2018-05-25 Thread kailashhd
Github user kailashhd commented on a diff in the pull request:

https://github.com/apache/flink/pull/5992#discussion_r191033584
  
--- Diff: 
flink-connectors/flink-connector-kinesis/src/test/java/org/apache/flink/streaming/connectors/kinesis/proxy/KinesisProxyTest.java
 ---
@@ -26,20 +29,60 @@
 import com.amazonaws.ClientConfigurationFactory;
 import com.amazonaws.services.kinesis.AmazonKinesis;
 import com.amazonaws.services.kinesis.model.ExpiredIteratorException;
+import com.amazonaws.services.kinesis.model.ListShardsRequest;
+import com.amazonaws.services.kinesis.model.ListShardsResult;
 import com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException;
+import com.amazonaws.services.kinesis.model.Shard;
+import org.hamcrest.Description;
+import org.hamcrest.TypeSafeDiagnosingMatcher;
+import org.junit.Assert;
 import org.junit.Test;
 import org.powermock.reflect.Whitebox;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
 import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
 
+import static org.hamcrest.MatcherAssert.assertThat;
+import static org.hamcrest.collection.IsCollectionWithSize.hasSize;
+import static org.hamcrest.collection.IsIterableContainingInAnyOrder.containsInAnyOrder;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertTrue;
+import static org.mockito.Matchers.argThat;
+import static org.mockito.Mockito.doReturn;
+import static org.mockito.Mockito.mock;
 
 /**
  * Test for methods in the {@link KinesisProxy} class.
  */
 public class KinesisProxyTest {
+   private static final String NEXT_TOKEN = "NextToken";
+   private static final String fakeStreamName = "fake-stream";
+   private Set<String> shardIdSet;
+   private List<Shard> shards;
+
+   protected static HashMap<String, String>
+   createInitialSubscribedStreamsToLastDiscoveredShardsState(List<String> streams) {
+   HashMap<String, String> initial = new HashMap<>();
+   for (String stream : streams) {
+   initial.put(stream, null);
+   }
+   return initial;
+   }
+
+   private static ListShardsRequestMatcher initialListShardsRequestMatcher() {
+   return new ListShardsRequestMatcher(null, null);
+   }
+
+   private static ListShardsRequestMatcher listShardsNextToken(final String nextToken) {
--- End diff --

Done


---


[jira] [Created] (FLINK-9440) Allow cancelation and reset of timers

2018-05-25 Thread Elias Levy (JIRA)
Elias Levy created FLINK-9440:
-

 Summary: Allow cancelation and reset of timers
 Key: FLINK-9440
 URL: https://issues.apache.org/jira/browse/FLINK-9440
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.2
Reporter: Elias Levy


Currently the {{TimerService}} allows one to register timers, but it is not 
possible to delete a timer or to reset a timer to a new value.  If one wishes 
to reset a timer, one must also handle the previously inserted timer callbacks 
and ignore them.

It would be useful if the API allowed one to remove and reset timers.
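For reference, a minimal sketch of the current workaround inside a {{ProcessFunction}} (the state name and helper method are illustrative, not from the Flink codebase):
{code}
// Track the single "active" timer timestamp in keyed state and ignore
// callbacks from timers that have since been superseded.
private ValueState<Long> activeTimer;

private void resetTimer(Context ctx, long newTimestamp) throws Exception {
	activeTimer.update(newTimestamp); // supersedes any earlier registration
	ctx.timerService().registerEventTimeTimer(newTimestamp);
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
	Long expected = activeTimer.value();
	if (expected == null || timestamp != expected) {
		return; // stale callback from a timer that was "reset"
	}
	// actual timer logic goes here
}
{code}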



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9344) Support INTERSECT and INTERSECT ALL for streaming

2018-05-25 Thread Ruidong Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruidong Li updated FLINK-9344:
--
Description: support non-window intersect and non-window intersect all for 
both SQL and TableAPI  (was: support intersect and intersect all for both SQL 
and TableAPI)

> Support INTERSECT and INTERSECT ALL for streaming
> -
>
> Key: FLINK-9344
> URL: https://issues.apache.org/jira/browse/FLINK-9344
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Ruidong Li
>Assignee: Ruidong Li
>Priority: Major
>
> support non-window intersect and non-window intersect all for both SQL and 
> TableAPI



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9344) Support INTERSECT and INTERSECT ALL for streaming

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490976#comment-16490976
 ] 

ASF GitHub Bot commented on FLINK-9344:
---

Github user Xpray commented on the issue:

https://github.com/apache/flink/pull/5998
  
Thanks for the review @walterddr @fhueske, I've updated the PR.


> Support INTERSECT and INTERSECT ALL for streaming
> -
>
> Key: FLINK-9344
> URL: https://issues.apache.org/jira/browse/FLINK-9344
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Ruidong Li
>Assignee: Ruidong Li
>Priority: Major
>
> support intersect and intersect all for both SQL and TableAPI



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5998: [FLINK-9344] [TableAPI & SQL] Support INTERSECT and INTER...

2018-05-25 Thread Xpray
Github user Xpray commented on the issue:

https://github.com/apache/flink/pull/5998
  
Thanks for the review @walterddr @fhueske, I've updated the PR.


---


[jira] [Updated] (FLINK-8770) CompletedCheckPoints stored on ZooKeeper is not up-to-date, when JobManager is restarted it fails to recover the job due to "checkpoint FileNotFound exception"

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8770:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> CompletedCheckPoints stored on ZooKeeper is not up-to-date, when JobManager 
> is restarted it fails to recover the job due to "checkpoint FileNotFound 
> exception"
> ---
>
> Key: FLINK-8770
> URL: https://issues.apache.org/jira/browse/FLINK-8770
> Project: Flink
>  Issue Type: Bug
>  Components: Local Runtime
>Affects Versions: 1.4.0
>Reporter: Xinyang Gao
>Priority: Critical
> Fix For: 1.5.1
>
> Attachments: flink-test-jobmanager-3-b2dm8.log
>
>
> Hi, I am running a Flink cluster (1 JobManager + 6 TaskManagers) with HA mode 
> on OpenShift. I have enabled Chaos Monkey, which kills either the JobManager or 
> one of the TaskManagers every 5 minutes; the ZooKeeper quorum is stable, with no 
> chaos monkey enabled. Flink reads data from one Kafka topic and writes data 
> into another Kafka topic. Checkpointing is enabled, with a 1000ms interval. 
> state.checkpoints.num-retained is set to 10. I am using a PVC for the state 
> backend (checkpoint, recovery, etc.), so the checkpoints and states are 
> persistent. The restart strategy for the Flink jobmanager DeploymentConfig is 
> recreate, which means it will kill the old container of the jobmanager before 
> it restarts the new one.
> I ran the Chaos test for one day at first; however, I have seen the exception:
> org.apache.flink.util.FlinkException: Could not retrieve 
> checkpoint *** from state handle under /***. This indicates that the 
> retrieved state handle is broken. Try cleaning the state handle store.
> The root cause is a checkpoint FileNotFound.
> The Flink job then keeps restarting for a few hours and, due to the above 
> error, it cannot be restarted successfully.
> After further investigation, I have found the following facts in my PVC:
>  
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 
> completedCheckpoint0ee95157de00
> -rw-r--r--. 1 flink root 11379 Feb 23 01:51 completedCheckpoint498d0952cf00
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint650fe5b021fe
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint66634149683e
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint67f24c3b018e
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpoint6f64ebf0ae64
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint906ebe1fb337
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpoint98b79ea14b09
> -rw-r--r--. 1 flink root 11379 Feb 23 02:10 completedCheckpointa0d1070e0b6c
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 completedCheckpointbd3a9ba50322
> -rw-r--r--. 1 flink root 11355 Feb 22 17:31 completedCheckpointd433b5e108be
> -rw-r--r--. 1 flink root 11379 Feb 22 22:56 completedCheckpointdd0183ed092b
> -rw-r--r--. 1 flink root 11379 Feb 22 00:00 completedCheckpointe0a5146c3d81
> -rw-r--r--. 1 flink root 11331 Feb 22 17:06 completedCheckpointec82f3ebc2ad
> -rw-r--r--. 1 flink root 11379 Feb 23 02:11 
> completedCheckpointf86e460f6720
>  
> The latest 10 checkpoints were created at about 02:10, if you ignore the old 
> checkpoints which were not deleted successfully (which I do not care about too 
> much).
>  
> However, when checking on ZooKeeper, I see the following in the 
> flink/checkpoints path (I only list one, but the other 9 are similar):
> cZxid = 0x160001ff5d
> ��sr;org.apache.flink.runtime.state.RetrievableStreamStateHandle�U�+LwrappedStreamStateHandlet2Lorg/apache/flink/runtime/state/StreamStateHandle;xpsr9org.apache.flink.runtime.state.filesystem.FileStateHandle�u�b�▒▒J
>  
> stateSizefilePathtLorg/apache/flink/core/fs/Path;xp,ssrorg.apache.flink.core.fs.PathLuritLjava/net/URI;xpsr
>  
> java.net.URI�x.C�I�LstringtLjava/lang/String;xpt=file:/mnt/flink-test/recovery/completedCheckpointd004a3753870x
> [zk: localhost:2181(CONNECTED) 7] ctime = Fri Feb 23 02:08:18 UTC 2018
> mZxid = 0x160001ff5d
> mtime = Fri Feb 23 02:08:18 UTC 2018
> pZxid = 0x1d0c6d
> cversion = 31
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 492
> so the latest completedCheckpoint status stored on ZooKeeper is from about 
> 02:08, which implies that the completed checkpoints at
[jira] [Updated] (FLINK-9301) NotSoMiniClusterIterations job fails on travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9301:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> NotSoMiniClusterIterations job fails on travis
> --
>
> Key: FLINK-9301
> URL: https://issues.apache.org/jira/browse/FLINK-9301
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Chesnay Schepler
>Assignee: Andrey Zagrebin
>Priority: Critical
> Fix For: 1.5.1
>
>
> The high-parallelism-iterations-test fails on Travis. After starting ~55 
> taskmanagers, all memory is used and further memory allocations fail.
> I'm currently letting it run another time; if it fails again, I will disable 
> the test temporarily.
> https://travis-ci.org/zentol/flink-ci/builds/375189790



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8247) Support Hadoop-free variant of Flink on Mesos

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8247:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Support Hadoop-free variant of Flink on Mesos
> -
>
> Key: FLINK-8247
> URL: https://issues.apache.org/jira/browse/FLINK-8247
> Project: Flink
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
>Reporter: Eron Wright 
>Assignee: Eron Wright 
>Priority: Major
> Fix For: 1.4.3, 1.5.1
>
>
> In Hadoop-free mode, Hadoop isn't on the classpath.  The Mesos job manager 
> normally uses the Hadoop UserGroupInformation class to overlay a user context 
> (`HADOOP_USER_NAME`) for the task managers.  
> Detect the absence of Hadoop and skip over the `HadoopUserOverlay`, similar 
> to the logic in `HadoopModuleFactory`. This may require the introduction 
> of an overlay factory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8744) Add annotations for documentation common/advanced options

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8744:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Add annotations for documentation common/advanced options
> -
>
> Key: FLINK-8744
> URL: https://issues.apache.org/jira/browse/FLINK-8744
> Project: Flink
>  Issue Type: New Feature
>  Components: Configuration, Documentation
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Trivial
> Fix For: 1.5.1
>
>
> The {{ConfigDocsGenerator}} only generates [the full configuration 
> reference|https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#full-reference].
>  The 
> [common|https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#common-options]
>  and 
> [advanced|https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#advanced-options]
>  sections are still manually defined in {{config.md}} and thus prone to being 
> outdated / out-of-sync with the full reference.
> I propose adding {{@Common}}/{{@Advanced}} annotations based on which we 
> could generate these sections as well.
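A sketch of what one of the proposed marker annotations could look like (hypothetical shape; {{@Advanced}} would be analogous):
{code}
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// marks a ConfigOption field for inclusion in the generated "common" section
@Target(ElementType.FIELD)
@Retention(RetentionPolicy.RUNTIME)
public @interface Common {}
{code}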



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7697) Add metrics for Elasticsearch Sink

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7697:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Add metrics for Elasticsearch Sink
> --
>
> Key: FLINK-7697
> URL: https://issues.apache.org/jira/browse/FLINK-7697
> Project: Flink
>  Issue Type: Wish
>  Components: ElasticSearch Connector
>Reporter: Hai Zhou
>Assignee: Hai Zhou
>Priority: Critical
> Fix For: 1.5.1
>
>
> We should add metrics to track events written to the ElasticsearchSink,
> e.g.:
> * number of successful bulk sends
> * number of documents inserted
> * number of documents updated
> * number of document version conflicts
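A minimal sketch of how such counters could be registered in a rich sink function (metric names taken from the list above; illustrative only, not the eventual implementation):
{code}
private transient Counter numSuccessfulBulkSends;
private transient Counter numDocumentsInserted;

@Override
public void open(Configuration parameters) {
	numSuccessfulBulkSends = getRuntimeContext().getMetricGroup().counter("numSuccessfulBulkSends");
	numDocumentsInserted = getRuntimeContext().getMetricGroup().counter("numDocumentsInserted");
}
{code}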



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9328) RocksDBStateBackend might use PlaceholderStreamStateHandle to restore due to StateBackendTestBase class not registering snapshots in some UTs

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9328:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> RocksDBStateBackend might use PlaceholderStreamStateHandle to restore due to 
> StateBackendTestBase class not registering snapshots in some UTs
> -
>
> Key: FLINK-9328
> URL: https://issues.apache.org/jira/browse/FLINK-9328
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.4.2
>Reporter: Yun Tang
>Priority: Minor
> Fix For: 1.5.1
>
>
> Currently, the StateBackendTestBase class does not register snapshots with the 
> SharedStateRegistry in the testValueState, testListState, testReducingState, 
> testFoldingState and testMapState UTs, which may cause the RocksDBStateBackend 
> to restore from a PlaceholderStreamStateHandle during the 2nd restore procedure 
> if a specific sst file exists in both the 1st snapshot and the 2nd snapshot 
> handle.
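A minimal sketch of the missing step (assuming {{snapshot}} is the {{KeyedStateHandle}} produced by the test's checkpoint; API as of Flink 1.4/1.5):
{code}
// Register the snapshot's shared states after each checkpoint, as the
// production code path does, so that placeholder handles can later be
// resolved against previously registered sst files.
SharedStateRegistry sharedStateRegistry = new SharedStateRegistry();
snapshot.registerSharedStates(sharedStateRegistry);
{code}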



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-5018) Make source idle timeout user configurable

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-5018:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Make source idle timeout user configurable
> --
>
> Key: FLINK-5018
> URL: https://issues.apache.org/jira/browse/FLINK-5018
> Project: Flink
>  Issue Type: Sub-task
>  Components: DataStream API
>Reporter: Tzu-Li (Gordon) Tai
>Priority: Major
> Fix For: 1.5.1
>
>
> There are 2 cases where sources are considered idle and should emit an idle 
> {{StreamStatus}} downstream, taking the Kafka consumer as an example:
> - The source instance was not assigned any partitions
> - The source instance was assigned partitions, but they currently don't have 
> any data.
> For the second case, we can only consider it idle after a timeout threshold. 
> It would be good to make this timeout user configurable, in addition to 
> providing a default value.
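A hypothetical shape for such a setting (purely illustrative; neither the method name nor the signature exists in Flink):
{code}
// mark the source idle if its assigned partitions produce no data for 30s
consumer.setSourceIdleTimeout(Time.seconds(30));
{code}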



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8779) ClassLoaderITCase.testKMeansJobWithCustomClassLoader fails on Travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8779:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> ClassLoaderITCase.testKMeansJobWithCustomClassLoader fails on Travis
> 
>
> Key: FLINK-8779
> URL: https://issues.apache.org/jira/browse/FLINK-8779
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> The {{ClassLoaderITCase.testKMeansJobWithCustomClassLoader}} fails on Travis 
> by producing no output for 300s. This might indicate a test instability or a 
> problem with Flink which was recently introduced.
> https://api.travis-ci.org/v3/job/344427688/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9215) TaskManager Releasing - org.apache.flink.util.FlinkException

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9215:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> TaskManager Releasing  - org.apache.flink.util.FlinkException
> -
>
> Key: FLINK-9215
> URL: https://issues.apache.org/jira/browse/FLINK-9215
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, ResourceManager, Streaming
>Affects Versions: 1.5.0
>Reporter: Bob Lau
>Assignee: Sihua Zhou
>Priority: Major
> Fix For: 1.5.1
>
>
> The exception stack is as follows:
> {code:java}
> // code placeholder
> {"root-exception":"
> org.apache.flink.util.FlinkException: Releasing TaskManager 
> 0d87aa6fa99a6c12e36775b1d6bceb19.
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool.releaseTaskManagerInternal(SlotPool.java:1067)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool.releaseTaskManager(SlotPool.java:1050)
> at sun.reflect.GeneratedMethodAccessor110.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:210)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
> at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> ","timestamp":1524106438997,
> "all-exceptions":[{"exception":"
> org.apache.flink.util.FlinkException: Releasing TaskManager 
> 0d87aa6fa99a6c12e36775b1d6bceb19.
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool.releaseTaskManagerInternal(SlotPool.java:1067)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool.releaseTaskManager(SlotPool.java:1050)
> at sun.reflect.GeneratedMethodAccessor110.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:210)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
> at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> ","task":"async wait operator 
> (14/20)","location":"slave1:60199","timestamp":1524106438996
> }],"truncated":false}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8073) Test instability FlinkKafkaProducer011ITCase.testScaleDownBeforeFirstCheckpoint()

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8073:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Test instability 
> FlinkKafkaProducer011ITCase.testScaleDownBeforeFirstCheckpoint()
> -
>
> Key: FLINK-8073
> URL: https://issues.apache.org/jira/browse/FLINK-8073
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector, Tests
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> Travis log: https://travis-ci.org/kl0u/flink/jobs/301985988



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9004:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Cluster test: Run general purpose job with failures with Yarn session
> -
>
> Key: FLINK-9004
> URL: https://issues.apache.org/jira/browse/FLINK-9004
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Blocker
> Fix For: 1.5.1
>
>
> Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on 
> a Yarn session cluster and simulate failures.
> The job jar should be ill-packaged, meaning that we include too many 
> dependencies in the user jar. We should include the Scala library, Hadoop and 
> Flink itself to verify that there are no class loading issues.
> The general purpose job should run with misbehavior activated. Additionally, 
> we should simulate at least the following failure scenarios:
> * Kill Flink processes
> * Kill connection to storage system for checkpoints and jobs
> * Simulate network partition
> We should run the test at least with the following state backend: RocksDB 
> incremental async and checkpointing to S3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-6369) Better support for overlay networks

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-6369:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Better support for overlay networks
> ---
>
> Key: FLINK-6369
> URL: https://issues.apache.org/jira/browse/FLINK-6369
> Project: Flink
>  Issue Type: Improvement
>  Components: Docker, Network
>Affects Versions: 1.2.0
>Reporter: Patrick Lucas
>Priority: Major
> Fix For: 1.5.1
>
>
> Running Flink in an environment that utilizes an overlay network 
> (containerized environments like Kubernetes or Docker Compose, or cloud 
> platforms like AWS VPC) poses various challenges related to networking.
> The core problem is that in these environments, applications are frequently 
> addressed by a name different from that with which the application sees 
> itself.
> For instance, it is plausible that the Flink UI (served by the Jobmanager) is 
> accessed via an ELB, which poses a problem in HA mode when the non-leader UI 
> returns an HTTP redirect to the leader—but the user may not be able to 
> connect directly to the leader.
> Or, if a user is using [Docker 
> Compose|https://github.com/apache/flink/blob/aa21f853ab0380ec1f68ae1d0b7c8d9268da4533/flink-contrib/docker-flink/docker-compose.yml],
>  they cannot submit a job via the CLI since there is a mismatch between the 
> name used to address the Jobmanager and what the Jobmanager perceives as its 
> hostname. (see \[1] below for more detail)
> 
> h3. Problems and proposed solutions
> There are four instances of this issue that I've run into so far:
> h4. Jobmanagers must be addressed by the same name they are configured with 
> due to limitations of Akka
> Akka enforces that messages it receives are addressed with the hostname it is 
> configured with. Newer versions of Akka (>= 2.4) than what Flink uses 
> (2.3-custom) have support for accepting messages with the "wrong" hostname, 
> but it is limited to a single "external" hostname.
> In many environments, it is likely that not all parties that want to connect 
> to the Jobmanager have the same way of addressing it (e.g. the ELB example 
> above). Other similarly-used protocols like HTTP generally don't have this 
> restriction: if you connect on a socket and send a well-formed message, the 
> system assumes that it is the desired recipient.
> One solution is to not use Akka at all when communicating with the cluster 
> from the outside, perhaps using an HTTP API instead. This would be somewhat 
> involved, and probably best left as a longer-term goal.
> A more immediate solution would be to override this behavior within Flakka, 
> the custom fork of Akka currently in use by Flink. I'm not sure how much 
> effort this would take.
> h4. The Jobmanager needs to be able to address the Taskmanagers for e.g. 
> metrics collection
> Having the Taskmanagers register themselves by IP is probably the best 
> solution here. It's a reasonable assumption that IPs can always be used for 
> communication between the nodes of a single cluster. Asking that each 
> Taskmanager container have a resolvable hostname is unreasonable.
> h4. Jobmanagers in HA mode send HTTP redirects to URLs that aren't externally 
> resolvable/routable
> If multiple Jobmanagers are used in HA mode, HTTP requests to non-leaders 
> (such as if you put a Kubernetes Service in front of all Jobmanagers in a 
> cluster) get redirected to the (supposed) hostname of the leader, but this is 
> potentially unresolvable/unroutable externally.
> Enabling non-leader Jobmanagers to proxy API calls to the leader would solve 
> this. The non-leaders could even serve static asset requests (e.g. for css or 
> js files) directly.
> h4. Queryable state requests involve direct communication with Taskmanagers
> Currently, queryable state requests involve communication between the client 
> and the Jobmanager (for key partitioning lookups) and between the client and 
> all Taskmanagers.
> If the client is inside the network (as would be common in production 
> use-cases where high-volume lookups are required) this is a non-issue, but 
> problems crop up if the client is outside the network.
> For the communication with the Jobmanager, a similar solution as above can be 
> used: if all Jobmanagers can service all key partitioning lookup requests 
> (e.g. by proxying) then a simple Service can be used.
> The story is a bit different for the Taskmanagers. The partitioning lookup to 
> the Jobmanager would return the name of the particular Taskmanager that owned 
> the desired data, but that name (likely an IP, as proposed in the second 
> section above) is not necessarily resolvable/routable from the client.
> In the context of Kubernetes, where individual containers are generally not 
> 

[jira] [Updated] (FLINK-8336) YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3 test instability

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8336:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3 test instability
> ---
>
> Key: FLINK-8336
> URL: https://issues.apache.org/jira/browse/FLINK-8336
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, Tests, YARN
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Nico Kruber
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.3, 1.5.1
>
>
> The {{YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3}} fails on 
> Travis. I suspect that this has something to do with the consistency 
> guarantees S3 gives us.
> https://travis-ci.org/tillrohrmann/flink/jobs/323930297



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9013) Document yarn.containers.vcores only being effective when adapting YARN config

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9013:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Document yarn.containers.vcores only being effective when adapting YARN config
> --
>
> Key: FLINK-9013
> URL: https://issues.apache.org/jira/browse/FLINK-9013
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.5.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
> Fix For: 1.5.1
>
>
> Even after specifying {{yarn.containers.vcores}} and having Flink request 
> such a container from YARN, it may not take these into account at all and 
> return a container with 1 vcore.
> The YARN configuration needs to be adapted to take the vcores into account, 
> e.g. by setting the {{FairScheduler}} in {{yarn-site.xml}}:
> {code}
> <property>
>   <name>yarn.resourcemanager.scheduler.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
> </property>
> {code}
> This fact should be documented, at least in the configuration parameter 
> documentation of {{yarn.containers.vcores}}.
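For reference, the Flink side of the request stays unchanged; it is just the existing configuration option (flink-conf.yaml):
{code}
yarn.containers.vcores: 4
{code}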



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8046) ContinuousFileMonitoringFunction wrongly ignores files with exact same timestamp

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8046:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> ContinuousFileMonitoringFunction wrongly ignores files with exact same 
> timestamp
> 
>
> Key: FLINK-8046
> URL: https://issues.apache.org/jira/browse/FLINK-8046
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.2
>Reporter: Juan Miguel Cejuela
>Priority: Major
>  Labels: stream
> Fix For: 1.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The current monitoring of files sets the internal variable 
> `globalModificationTime` to filter out files that are "older". However, the 
> current test (to check "older") does 
> `boolean shouldIgnore = modificationTime <= globalModificationTime;` (from 
> `shouldIgnore`).
> The comparison should strictly be SMALLER (NOT smaller or equal). The method 
> documentation also states "This happens if the modification time of the file 
> is _smaller_ than...".
> The equality acceptance for "older" makes some files with the exact same 
> timestamp be ignored. The behavior is also non-deterministic: the first 
> file to be accepted ("first" being pretty much random) makes the rest of the 
> files with the exact same timestamp be ignored.
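The proposed one-character fix, as a sketch:
{code}
// strictly smaller: files whose timestamp equals the latest seen
// modification time are no longer ignored
boolean shouldIgnore = modificationTime < globalModificationTime;
{code}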



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8343) Add support for job cluster deployment

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8343:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Add support for job cluster deployment
> --
>
> Key: FLINK-8343
> URL: https://issues.apache.org/jira/browse/FLINK-8343
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: flip-6
> Fix For: 1.5.1
>
>
> For Flip-6 we have to enable a different job cluster deployment. The 
> difference is that we directly submit the job when we deploy the Flink 
> cluster instead of following a two step approach (first deployment and then 
> submission).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7915) Verify functionality of RollingSinkSecuredITCase

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7915:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Verify functionality of RollingSinkSecuredITCase
> 
>
> Key: FLINK-7915
> URL: https://issues.apache.org/jira/browse/FLINK-7915
> Project: Flink
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> I recently stumbled across the test {{RollingSinkSecuredITCase}}, which will 
> only be executed for Hadoop versions {{>= 3}}. When trying to run it from 
> IntelliJ I immediately ran into a class-not-found exception for 
> {{jdbm.helpers.CachePolicy}}, and even after fixing this problem, the test 
> would not run because it complained about wrong security settings.
> I think we should check whether this test works at all and, if not, we 
> should remove it or replace it with something that works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7550) Give names to REST client/server for clearer logging.

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7550:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Give names to REST client/server for clearer logging.
> -
>
> Key: FLINK-7550
> URL: https://issues.apache.org/jira/browse/FLINK-7550
> Project: Flink
>  Issue Type: Improvement
>  Components: REST
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Priority: Major
> Fix For: 1.5.1
>
>
> This issue proposes to give names to the entities composing a REST-ful 
> service and use these names when logging messages. This will help debugging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9109) Add flink modify command to documentation

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9109:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Add flink modify command to documentation
> -
>
> Key: FLINK-9109
> URL: https://issues.apache.org/jira/browse/FLINK-9109
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: flip-6
> Fix For: 1.5.1
>
>
> We should add documentation for the {{flink modify}} command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8302) Support shift_left and shift_right in TableAPI

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8302:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Support shift_left and shift_right in TableAPI
> --
>
> Key: FLINK-8302
> URL: https://issues.apache.org/jira/browse/FLINK-8302
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: DuBin
>Priority: Major
>  Labels: features
> Fix For: 1.5.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Add shift_left and shift_right support in the Table API; shift_left(input, n) 
> acts as input << n, and shift_right(input, n) acts as input >> n.
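A hypothetical usage example once supported (function names from the proposal; not yet available):
{code}
// shift_left(3, 2) == 3 << 2 == 12; shift_right(8, 1) == 8 >> 1 == 4
Table result = tableEnv.sqlQuery("SELECT shift_left(3, 2), shift_right(8, 1)");
{code}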



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9046) Improve error messages when receiving unexpected messages.

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9046:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Improve error messages when receiving unexpected messages.
> --
>
> Key: FLINK-9046
> URL: https://issues.apache.org/jira/browse/FLINK-9046
> Project: Flink
>  Issue Type: Bug
>  Components: REST, Webfrontend
>Affects Versions: 1.5.0
>Reporter: Kostas Kloudas
>Priority: Major
> Fix For: 1.5.1
>
>
> Currently, in many cases, when a REST handler receives an unexpected 
> message, e.g. for a job that does not exist, it logs the full stack trace, 
> often with misleading messages. This can happen, for example, if we launch a 
> cluster, connect to the WebUI and monitor a job, then kill the cluster while 
> not shutting down the WebUI, and start a new cluster on the same port. In this 
> case the WebUI will keep asking for the previous job, and the 
> JobDetailsHandler will log the following:
>  
> {code:java}
> 2018-03-21 14:27:24,319 ERROR 
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler - Exception 
> occurred in REST handler.
> org.apache.flink.runtime.rest.NotFoundException: Job 
> 548afad8217f4a18db7f50e60d48885a not found
> at 
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:90)
> at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at 
> org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.lambda$getExecutionGraph$0(ExecutionGraphCache.java:133)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:755)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
> at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:20)
> at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:18)
> at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> at 
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107){code}
>  
> The same holds when asking through the WebUI for the logs of TM. In this 
> case, the logs will contain: 
> {code:java}
> 2018-03-21 12:26:05,096 ERROR 
> org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerDetailsHandler - 
> Implementation error: Unhandled exception.
> 

[jira] [Updated] (FLINK-8731) TwoInputStreamTaskTest flaky on travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8731:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> TwoInputStreamTaskTest flaky on travis
> --
>
> Key: FLINK-8731
> URL: https://issues.apache.org/jira/browse/FLINK-8731
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming, Tests
>Affects Versions: 1.5.0
>Reporter: Chesnay Schepler
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> https://travis-ci.org/zentol/flink/builds/344307861
> {code}
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.479 sec <<< 
> FAILURE! - in org.apache.flink.streaming.runtime.tasks.TwoInputStreamTaskTest
> testOpenCloseAndTimestamps(org.apache.flink.streaming.runtime.tasks.TwoInputStreamTaskTest)
>   Time elapsed: 0.05 sec  <<< ERROR!
> java.lang.Exception: error in task
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskTestHarness.waitForTaskCompletion(StreamTaskTestHarness.java:250)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskTestHarness.waitForTaskCompletion(StreamTaskTestHarness.java:233)
>   at 
> org.apache.flink.streaming.runtime.tasks.TwoInputStreamTaskTest.testOpenCloseAndTimestamps(TwoInputStreamTaskTest.java:99)
> Caused by: org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Boolean cannot be returned by getChannelIndex()
> getChannelIndex() should return int
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.UnionInputGate.waitAndGetNextInputGate(UnionInputGate.java:212)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.UnionInputGate.getNextBufferOrEvent(UnionInputGate.java:158)
>   at 
> org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:164)
>   at 
> org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:292)
>   at 
> org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:115)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:308)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskTestHarness$TaskThread.run(StreamTaskTestHarness.java:437)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8971) Create general purpose testing job

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8971:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Create general purpose testing job
> --
>
> Key: FLINK-8971
> URL: https://issues.apache.org/jira/browse/FLINK-8971
> Project: Flink
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.5.1
>
>
> In order to write better end-to-end tests we need a general purpose testing 
> job which comprises as many Flink aspects as possible. These include 
> different types for records and state, user defined components, state types 
> and operators.
> The job should allow activating certain misbehaviors, such as slowing 
> certain paths down or throwing exceptions to simulate failures (a sketch of 
> such a switch follows below).
> The job should come with a data generator which generates input data such 
> that the job can verify its own behavior. This includes the state as well as 
> the input/output records.
> We already have the [heavily misbehaved 
> job|https://github.com/rmetzger/flink-testing-jobs/blob/flink-1.4/basic-jobs/src/main/java/com/dataartisans/heavymisbehaved/HeavyMisbehavedJob.java]
>  which simulates some misbehavior. There is also the [state machine 
> job|https://github.com/dataArtisans/flink-testing-jobs/tree/master/streaming-state-machine]
>  which can verify itself for invalid state changes which indicate data loss. 
> We should incorporate their characteristics into the new general purpose job.
> Additionally, the general purpose job should contain the following aspects:
> * Job containing a sliding window aggregation
> * At least one generic Kryo type
> * At least one generic Avro type
> * At least one Avro specific record type
> * At least one input type for which we register a Kryo serializer
> * At least one input type for which we provide a user defined serializer
> * At least one state type for which we provide a user defined serializer
> * At least one state type which uses the AvroSerializer
> * Include an operator with ValueState
> * Value state changes should be verified (e.g. predictable series of values)
> * Include an operator with operator state
> * Include an operator with broadcast state
> * Broadcast state changes should be verified (e.g. predictable series of 
> values)
> * Include union state
> * User defined watermark assigner
> The job should be made available in the {{flink-end-to-end-tests}} module.
> This issue is intended to serve as an umbrella issue for developing and 
> extending this job.
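> As a rough illustration, a switchable-misbehavior operator could look like the
> following (names and thresholds are illustrative, not part of the actual job):
> {code:java}
> import org.apache.flink.api.common.functions.MapFunction;
>
> public class MisbehavingMapper implements MapFunction<Long, Long> {
>
>     private final boolean throwOnValue;   // simulate failures
>     private final long slowDownMillis;    // simulate slow paths
>
>     public MisbehavingMapper(boolean throwOnValue, long slowDownMillis) {
>         this.throwOnValue = throwOnValue;
>         this.slowDownMillis = slowDownMillis;
>     }
>
>     @Override
>     public Long map(Long value) throws Exception {
>         if (slowDownMillis > 0) {
>             Thread.sleep(slowDownMillis);   // slow this path down
>         }
>         if (throwOnValue && value % 1000 == 0) {
>             throw new RuntimeException("simulated failure at " + value);
>         }
>         return value;
>     }
> }
> {code}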



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-6022) Don't serialise Schema when serialising Avro GenericRecord

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-6022:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Don't serialise Schema when serialising Avro GenericRecord
> --
>
> Key: FLINK-6022
> URL: https://issues.apache.org/jira/browse/FLINK-6022
> Project: Flink
>  Issue Type: Improvement
>  Components: Type Serialization System
>Reporter: Robert Metzger
>Assignee: Stephan Ewen
>Priority: Major
> Fix For: 1.5.1
>
>
> Currently, Flink is serializing the schema for each Avro GenericRecord in the 
> stream.
> This leads to a lot of overhead over the wire/disk + high serialization costs.
> Therefore, I'm proposing to improve the support for GenericRecord in Flink by 
> shipping the schema to each serializer through the AvroTypeInformation.
> We would then only support GenericRecords with the same schema per stream, but 
> the performance would be much better.
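> A minimal sketch of the write side, assuming both producer and consumer
> already share the {{Schema}} (e.g. carried by the type information), so only
> the record payload goes over the wire:
> {code:java}
> import java.io.ByteArrayOutputStream;
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.BinaryEncoder;
> import org.apache.avro.io.EncoderFactory;
>
> public class SchemalessRecordWriter {
>
>     // Serializes the record body only; the schema is shared out of band.
>     public static byte[] serialize(Schema schema, GenericRecord record)
>             throws Exception {
>         GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
>         ByteArrayOutputStream out = new ByteArrayOutputStream();
>         BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
>         writer.write(record, encoder);  // no schema bytes written here
>         encoder.flush();
>         return out.toByteArray();
>     }
> }
> {code}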



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-6997) SavepointITCase fails in master branch sometimes

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-6997:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> SavepointITCase fails in master branch sometimes
> 
>
> Key: FLINK-6997
> URL: https://issues.apache.org/jira/browse/FLINK-6997
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.0, 1.5.0
>Reporter: Ted Yu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> I got the following test failure (with commit 
> a0b781461bcf8c2f1d00b93464995f03eda589f1)
> {code}
> testSavepointForJobWithIteration(org.apache.flink.test.checkpointing.SavepointITCase)
>   Time elapsed: 8.129 sec  <<< ERROR!
> java.io.IOException: java.lang.Exception: Failed to complete savepoint
>   at 
> org.apache.flink.runtime.testingUtils.TestingCluster.triggerSavepoint(TestingCluster.scala:342)
>   at 
> org.apache.flink.runtime.testingUtils.TestingCluster.triggerSavepoint(TestingCluster.scala:316)
>   at 
> org.apache.flink.test.checkpointing.SavepointITCase.testSavepointForJobWithIteration(SavepointITCase.java:827)
> Caused by: java.lang.Exception: Failed to complete savepoint
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anon$7.apply(JobManager.scala:821)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anon$7.apply(JobManager.scala:805)
>   at 
> org.apache.flink.runtime.concurrent.impl.FlinkFuture$5.onComplete(FlinkFuture.java:272)
>   at akka.dispatch.OnComplete.internal(Future.scala:247)
>   at akka.dispatch.OnComplete.internal(Future.scala:245)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>   at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.Exception: Failed to trigger savepoint: Not all required 
> tasks are currently running.
>   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerSavepoint(CheckpointCoordinator.java:382)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:800)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.runtime.testingUtils.TestingJobManagerLike$$anonfun$handleTestingMessage$1.applyOrElse(TestingJobManagerLike.scala:95)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>   at 
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:38)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-5125) ContinuousFileProcessingCheckpointITCase is Flaky

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-5125:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> ContinuousFileProcessingCheckpointITCase is Flaky
> -
>
> Key: FLINK-5125
> URL: https://issues.apache.org/jira/browse/FLINK-5125
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming Connectors
>Reporter: Aljoscha Krettek
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> This is the travis log: 
> https://api.travis-ci.org/jobs/177402367/log.txt?deansi=true
> The relevant section is:
> {code}
> Running org.apache.flink.test.checkpointing.CoStreamCheckpointingITCase
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.571 sec - 
> in org.apache.flink.test.exampleJavaPrograms.EnumTriangleBasicITCase
> Running org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.704 sec - 
> in org.apache.flink.test.checkpointing.CoStreamCheckpointingITCase
> Running 
> org.apache.flink.test.checkpointing.EventTimeAllWindowCheckpointingITCase
> Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.805 sec - 
> in org.apache.flink.test.checkpointing.EventTimeAllWindowCheckpointingITCase
> Running 
> org.apache.flink.test.checkpointing.ContinuousFileProcessingCheckpointITCase
> org.apache.flink.client.program.ProgramInvocationException: The program 
> execution failed: Job execution failed.
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
>   at 
> org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:392)
>   at 
> org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:209)
>   at 
> org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:173)
>   at org.apache.flink.test.util.TestUtils.tryExecute(TestUtils.java:32)
>   at 
> org.apache.flink.test.checkpointing.StreamFaultToleranceTestBase.runCheckpointedProgram(StreamFaultToleranceTestBase.java:106)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
>   at 
> 

[jira] [Updated] (FLINK-7009) dogstatsd mode in statsd reporter

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7009:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> dogstatsd mode in statsd reporter
> -
>
> Key: FLINK-7009
> URL: https://issues.apache.org/jira/browse/FLINK-7009
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
>Affects Versions: 1.4.0
> Environment: org.apache.flink.metrics.statsd.StatsDReporter
>Reporter: David Brinegar
>Priority: Major
> Fix For: 1.5.1
>
>
> The current statsd reporter can only report a subset of Flink metrics, owing 
> to the manner in which Flink variables are handled, mainly invalid characters 
> and overly long metric names.  As an option, it would be quite useful to 
> have a stricter dogstatsd-compliant output.  Dogstatsd metrics are tagged, 
> should be less than 200 characters including tag names and values, and should 
> be alphanumeric plus underbar, delimited by periods.  As a further pragmatic 
> restriction, negative and other invalid values should be ignored rather than 
> sent to the backend.  These restrictions play well with a broad set of 
> collectors and time series databases.
> This mode would:
> * convert output to ascii alphanumeric characters with underbar, delimited by 
> periods.  Runs of invalid characters within a metric segment would be 
> collapsed to a single underbar.
> * report all Flink variables as tags
> * compress overly long segments, say over 50 chars, to a symbolic 
> representation of the metric name, to preserve the unique metric time series 
> but avoid downstream truncation
> * compress 32 character Flink IDs like tm_id, task_id, job_id, 
> task_attempt_id, to the first 8 characters, again to preserve enough 
> distinction amongst metrics while trimming up to 96 characters from the metric
> * remove object references from names, such as the instance hash id of the 
> serializer
> * drop negative or invalid numeric values such as "n/a", "-1" which is used 
> for unknowns like JVM.Memory.NonHeap.Max, and "-9223372036854775808" which is 
> used for unknowns like currentLowWaterMark
> With these in place, it becomes quite reasonable to support LatencyGauge 
> metrics as well.
> One idea for symbolic compression is to take the first 10 valid characters 
> plus a hash of the long name.  For example, a value like this operator_name:
> {code:java}
> TriggerWindow(TumblingProcessingTimeWindows(5000), 
> ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.PojoSerializer@f3395ffa,
>  
> reduceFunction=org.apache.flink.streaming.examples.socket.SocketWindowWordCount$1@4201c465},
>  ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java-301))
> {code}
> would first drop the instance references.  The stable version would be:
>  
> {code:java}
> TriggerWindow(TumblingProcessingTimeWindows(5000), 
> ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.PojoSerializer,
>  
> reduceFunction=org.apache.flink.streaming.examples.socket.SocketWindowWordCount$1},
>  ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java-301))
> {code}
> and then the compressed name would be the first ten valid characters plus the 
> hash of the stable string:
> {code}
> TriggerWin_d8c007da
> {code}
> This is just one way of dealing with unruly default names, the main point 
> would be to preserve the metrics so they are valid, avoid truncation, and can 
> be aggregated along other dimensions even if this particular dimension is 
> hard to parse after the compression.
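> A small sketch of the described sanitize-and-compress scheme (illustrative
> only; the hash function is an assumption, chosen here as CRC32, so the output
> will not necessarily match the {{d8c007da}} example above):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.util.zip.CRC32;
>
> public class MetricNameCompressor {
>
>     // Collapse runs of invalid characters to a single underbar.
>     static String sanitize(String segment) {
>         return segment.replaceAll("[^A-Za-z0-9]+", "_");
>     }
>
>     // First 10 valid characters plus a hash of the stable long name
>     // (assumes maxLen >= 10).
>     static String compress(String stableName, int maxLen) {
>         String clean = sanitize(stableName);
>         if (clean.length() <= maxLen) {
>             return clean;
>         }
>         CRC32 crc = new CRC32();
>         crc.update(stableName.getBytes(StandardCharsets.UTF_8));
>         return clean.substring(0, 10) + "_" + Long.toHexString(crc.getValue());
>     }
> }
> {code}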



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9032) Deprecate ConfigConstants#YARN_CONTAINER_START_COMMAND_TEMPLATE

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9032:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Deprecate ConfigConstants#YARN_CONTAINER_START_COMMAND_TEMPLATE
> ---
>
> Key: FLINK-9032
> URL: https://issues.apache.org/jira/browse/FLINK-9032
> Project: Flink
>  Issue Type: Improvement
>  Components: Configuration, Documentation
>Affects Versions: 1.5.0
>Reporter: Hai Zhou
>Assignee: Hai Zhou
>Priority: Major
> Fix For: 1.6.0, 1.5.1
>
>
> This constant ("yarn.container-start-command-template") has disappeared from 
> the [1.5.0-SNAPSHOT 
> docs|https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html].
> We should restore it, and I think it should be renamed 
> "containerized.start-command-template".
> [~Zentol], what do you think?
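> A minimal sketch of what the renamed option could look like, assuming the
> rename goes through (class name is illustrative; the default shown is the
> template currently used for YARN containers):
> {code:java}
> import org.apache.flink.configuration.ConfigOption;
> import org.apache.flink.configuration.ConfigOptions;
>
> public class ContainerizedOptions {
>
>     // New key, still honoring the old "yarn." key as a deprecated fallback.
>     public static final ConfigOption<String> START_COMMAND_TEMPLATE =
>         ConfigOptions.key("containerized.start-command-template")
>             .defaultValue("%java% %jvmmem% %jvmopts% %logging% %class% %args% %redirects%")
>             .withDeprecatedKeys("yarn.container-start-command-template");
> }
> {code}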



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8067) User code ClassLoader not set before calling ProcessingTimeCallback

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8067:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> User code ClassLoader not set before calling ProcessingTimeCallback
> ---
>
> Key: FLINK-8067
> URL: https://issues.apache.org/jira/browse/FLINK-8067
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Gary Yao
>Priority: Minor
> Fix For: 1.5.1
>
>
> The user code ClassLoader is not set as the context ClassLoader for the 
> thread invoking {{ProcessingTimeCallback#onProcessingTime(long timestamp)}}:
> https://github.com/apache/flink/blob/84a07a34ac22af14f2dd0319447ca5f45de6d0bb/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L222
> This is problematic, for example, if user code dynamically loads classes in 
> {{ProcessFunction#onTimer(long timestamp, OnTimerContext ctx, Collector<O> 
> out)}} using the current thread's context ClassLoader (also see FLINK-8005).
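> A generic sketch of the fix idea (not the actual StreamTask code): wrap the
> callback so the user code ClassLoader is installed before invocation and
> restored afterwards:
> {code:java}
> import org.apache.flink.streaming.runtime.tasks.ProcessingTimeCallback;
>
> public class ClassLoaderSettingTrigger {
>
>     static Runnable wrap(ProcessingTimeCallback callback, long timestamp,
>                          ClassLoader userCodeClassLoader) {
>         return () -> {
>             Thread thread = Thread.currentThread();
>             ClassLoader original = thread.getContextClassLoader();
>             thread.setContextClassLoader(userCodeClassLoader);
>             try {
>                 callback.onProcessingTime(timestamp);
>             } catch (Exception e) {
>                 throw new RuntimeException(e);
>             } finally {
>                 thread.setContextClassLoader(original);  // always restore
>             }
>         };
>     }
> }
> {code}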



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7727) Extend logging in file server handlers

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7727:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Extend logging in file server handlers
> --
>
> Key: FLINK-7727
> URL: https://issues.apache.org/jira/browse/FLINK-7727
> Project: Flink
>  Issue Type: Improvement
>  Components: REST, Webfrontend
>Affects Versions: 1.4.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
> Fix For: 1.5.1
>
>
> The file server handlers check several failure conditions but don't log 
> anything (like the path), making debugging difficult.
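> A minimal sketch of the kind of logging meant here (fragment; handler, logger
> and field names are illustrative):
> {code:java}
> if (!file.exists()) {
>     log.debug("Requested file {} does not exist.", file.getAbsolutePath());
>     // ... respond with 404 (handler-specific, omitted)
> }
> {code}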



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7981) Bump commons-lang3 version to 3.6

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7981:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Bump commons-lang3 version to 3.6
> -
>
> Key: FLINK-7981
> URL: https://issues.apache.org/jira/browse/FLINK-7981
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System
>Affects Versions: 1.4.0
>Reporter: Hai Zhou
>Assignee: Hai Zhou
>Priority: Trivial
> Fix For: 1.5.1
>
>
> Update commons-lang3 from 3.3.2 to 3.6.
> {{SerializationUtils.clone()}} in commons-lang3 (<3.5) has a bug that breaks 
> thread safety: it sometimes gets stuck due to a race condition when 
> initializing a hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.
> **Related**
> [BEAM-2481: Update commons-lang3 dependency to version 
> 3.6|https://issues.apache.org/jira/browse/BEAM-2481]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8417) Support STSAssumeRoleSessionCredentialsProvider in FlinkKinesisConsumer

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8417:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Support STSAssumeRoleSessionCredentialsProvider in FlinkKinesisConsumer
> ---
>
> Key: FLINK-8417
> URL: https://issues.apache.org/jira/browse/FLINK-8417
> Project: Flink
>  Issue Type: New Feature
>  Components: Kinesis Connector
>Reporter: Tzu-Li (Gordon) Tai
>Priority: Major
> Fix For: 1.5.1
>
>
> As discussed in ML: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kinesis-Connectors-With-Temporary-Credentials-td17734.html.
> Users need the functionality to access cross-account AWS Kinesis streams, 
> using AWS Temporary Credentials [1].
> We should add support for 
> {{AWSConfigConstants.CredentialsProvider.STSAssumeRole}}, which internally 
> would use the {{STSAssumeRoleSessionCredentialsProvider}} [2] in 
> {{AWSUtil#getCredentialsProvider(Properties)}}.
> [1] https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html
> [2] 
> https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/STSAssumeRoleSessionCredentialsProvider.html
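> A minimal sketch of what {{AWSUtil#getCredentialsProvider(Properties)}} could
> return for the proposed option (role ARN and session name are placeholders):
> {code:java}
> import com.amazonaws.auth.AWSCredentialsProvider;
> import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;
>
> AWSCredentialsProvider provider =
>     new STSAssumeRoleSessionCredentialsProvider.Builder(
>             "arn:aws:iam::123456789012:role/example-role",  // placeholder ARN
>             "flink-kinesis-session")                        // placeholder name
>         .build();
> {code}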



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8707) Excessive amount of files opened by flink task manager

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8707:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Excessive amount of files opened by flink task manager
> --
>
> Key: FLINK-8707
> URL: https://issues.apache.org/jira/browse/FLINK-8707
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager
>Affects Versions: 1.3.2
> Environment: NAME="Red Hat Enterprise Linux Server"
> VERSION="7.3 (Maipo)"
> Two boxes, each with a Job Manager & Task Manager, using Zookeeper for HA.
> flink.yaml below with some settings (removed exact box names) etc:
> env.log.dir: ...some dir...residing on the same box
> env.pid.dir: some dir...residing on the same box
> metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
> metrics.reporters: jmx
> state.backend: filesystem
> state.backend.fs.checkpointdir: file:///some_nfs_mount
> state.checkpoints.dir: file:///some_nfs_mount
> state.checkpoints.num-retained: 3
> high-availability.cluster-id: /tst
> high-availability.storageDir: file:///some_nfs_mount/ha
> high-availability: zookeeper
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: ...list of zookeeper boxes
> env.java.opts.jobmanager: ...some extra jar args
> jobmanager.archive.fs.dir: some dir...residing on the same box
> jobmanager.web.submit.enable: true
> jobmanager.web.tmpdir:  some dir...residing on the same box
> env.java.opts.taskmanager: some extra jar args
> taskmanager.tmp.dirs:  some dir...residing on the same box/var/tmp
> taskmanager.network.memory.min: 1024MB
> taskmanager.network.memory.max: 2048MB
> blob.storage.directory:  some dir...residing on the same box
>Reporter: Alexander Gardner
>Priority: Critical
> Fix For: 1.5.1
>
> Attachments: AfterRunning-3-jobs-Box2-TM-JCONSOLE.png, 
> AfterRunning-3-jobs-TM-FDs-BOX2.jpg, AfterRunning-3-jobs-lsof-p.box2-TM, 
> AfterRunning-3-jobs-lsof.box2-TM, AterRunning-3-jobs-Box1-TM-JCONSOLE.png, 
> box1-jobmgr-lsof, box1-taskmgr-lsof, box2-jobmgr-lsof, box2-taskmgr-lsof, 
> ll.txt, ll.txt, lsof.txt, lsof.txt, lsofp.txt, lsofp.txt
>
>
> The job manager has fewer FDs than the task manager.
>  
> Hi
> A support alert indicated that there were a lot of open files for the boxes 
> running Flink.
> There were 4 flink jobs that were dormant but had consumed a number of msgs 
> from Kafka using the FlinkKafkaConsumer010.
> A simple general lsof:
> $ lsof | wc -l  ->  returned 153114 open file descriptors.
> Focusing on the TaskManager process (process ID = 12154):
> $ lsof | grep 12154 | wc -l  ->  returned 129322 open FDs
> $ lsof -p 12154 | wc -l  ->  returned 531 FDs
> There were 228 threads running for the task manager.
>  
> Drilling down a bit further, looking at a_inode and FIFO entries: 
> $ lsof -p 12154 | grep a_inode | wc -l  ->  100 FDs
> $ lsof -p 12154 | grep FIFO | wc -l  ->  200 FDs
> /proc/12154/maps had 920 entries.
> Apart from lsof identifying lots of JARs and SOs being referenced there were 
> also 244 child processes for the task manager process.
> We noticed a creep of file descriptors in each environment. Are the above 
> figures deemed excessive for the number of FDs in use? I know Flink uses 
> Netty - is it using a separate Selector for reads & writes? 
> Additionally, Flink uses memory-mapped files or direct byte buffers - are 
> these skewing the numbers of FDs shown?
> Example of one child process ID 6633:
> java 12154 6633 dfdev 387u a_inode 0,9 0 5869 [eventpoll]
>  java 12154 6633 dfdev 388r FIFO 0,8 0t0 459758080 pipe
>  java 12154 6633 dfdev 389w FIFO 0,8 0t0 459758080 pipe
> Lastly, we cannot yet identify the reason for the creep in FDs, even though 
> Flink is fairly dormant or has dormant jobs. Production nodes are not 
> experiencing excessive amounts of throughput yet either.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7816) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7816:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: FLINK-7816
> URL: https://issues.apache.org/jira/browse/FLINK-7816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Scala API
>Reporter: Aljoscha Krettek
>Priority: Major
> Fix For: 1.5.1
>
>
> We have the same problem as Spark: SPARK-14540



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8946) TaskManager stop sending metrics after JobManager failover

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8946:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> TaskManager stop sending metrics after JobManager failover
> --
>
> Key: FLINK-8946
> URL: https://issues.apache.org/jira/browse/FLINK-8946
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics, TaskManager
>Affects Versions: 1.4.2
>Reporter: Truong Duc Kien
>Assignee: vinoyang
>Priority: Critical
> Fix For: 1.5.1
>
>
> Running in Yarn-standalone mode, when the JobManager performs a failover, 
> all TaskManagers inherited from the previous JobManager stop sending 
> metrics to the new JobManager and other registered metric reporters.
>  
> A cursory glance reveals that these lines of code might be the cause:
> [https://github.com/apache/flink/blob/a3478fdfa0f792104123fefbd9bdf01f5029de51/flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala#L1082-L1086]
> Perhaps the TaskManager closes its metrics group when disassociating from the 
> JobManager, but does not create a new one on fail-over association?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-6856) Eagerly check if serializer config snapshots are deserializable when snapshotting

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-6856:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Eagerly check if serializer config snapshots are deserializable when 
> snapshotting
> -
>
> Key: FLINK-6856
> URL: https://issues.apache.org/jira/browse/FLINK-6856
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing
>Affects Versions: 1.3.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Critical
> Fix For: 1.5.1
>
>
> Currently, if serializer config snapshots are not deserializable (for 
> example, if the user did not correctly include the deserialization empty 
> constructor, or the read / write methods are simply wrongly implemented), 
> users would only find this out when restoring from the snapshot.
> We could eagerly check for this when snapshotting, and fail with a good 
> message indicating that the config snapshot cannot be deserialized.
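> A generic sketch of the eager check (the two helpers are hypothetical
> placeholders for the actual snapshot (de)serialization code paths):
> {code:java}
> byte[] bytes = serializeConfigSnapshot(configSnapshot);        // hypothetical helper
> try {
>     deserializeConfigSnapshot(bytes, userCodeClassLoader);     // hypothetical helper
> } catch (Exception e) {
>     throw new IllegalStateException(
>         "Serializer config snapshot " + configSnapshot.getClass().getName()
>             + " cannot be deserialized; check its empty constructor and"
>             + " read/write methods.", e);
> }
> {code}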



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7982) Bump commons-configuration to 2.2

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7982:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Bump commons-configuration to 2.2
> -
>
> Key: FLINK-7982
> URL: https://issues.apache.org/jira/browse/FLINK-7982
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System
>Affects Versions: 1.4.0
>Reporter: Hai Zhou
>Assignee: Hai Zhou
>Priority: Major
> Fix For: 1.5.1
>
>
> Currently we depend on
> {{org.apache.commons:commons-configuration}} (version 1.7, Sep 2011); 
> update to
> {{org.apache.commons:commons-configuration2:2.2}}.
> Reference hadoop:
> [Hadoop Commom: HADOOP-14648 - Bump commons-configuration2 to 
> 2.1.1|https://issues.apache.org/jira/browse/HADOOP-14648]
> [Hadoop Common: HADOOP-13660 - Upgrade commons-configuration version to 
> 2.1|https://issues.apache.org/jira/browse/HADOOP-13660]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-5505) Harmonize ZooKeeper configuration parameters

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-5505:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Harmonize ZooKeeper configuration parameters
> 
>
> Key: FLINK-5505
> URL: https://issues.apache.org/jira/browse/FLINK-5505
> Project: Flink
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Till Rohrmann
>Priority: Trivial
> Fix For: 1.5.1
>
>
> Since Flink users don't necessarily know all of the Mesos terminology, and a 
> JobManager also runs as a task, I would like to rename some of Flink's Mesos 
> configuration parameters. I would propose the following changes:
> {{mesos.initial-tasks}} => {{mesos.initial-taskmanagers}}
> {{mesos.maximum-failed-tasks}} => {{mesos.maximum-failed-taskmanagers}}
> {{mesos.resourcemanager.artifactserver.\*}} => {{mesos.artifactserver.*}}
> {{mesos.resourcemanager.framework.\*}} => {{mesos.framework.*}}
> {{mesos.resourcemanager.tasks.\*}} => {{mesos.taskmanager.*}}
> {{recovery.zookeeper.path.mesos-workers}} => 
> {{mesos.high-availability.zookeeper.path.mesos-workers}}
> What do you think [~eronwright]?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7582) Rest client may buffer responses indefinitely under heavy load

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7582:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Rest client may buffer responses indefinitely under heavy load
> --
>
> Key: FLINK-7582
> URL: https://issues.apache.org/jira/browse/FLINK-7582
> Project: Flink
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 1.4.0
>Reporter: Chesnay Schepler
>Priority: Major
> Fix For: 1.5.1
>
>
> The RestClient uses an executor for sending requests and parsing responses. 
> Under heavy load, i.e. lots of requests being sent, the executor may be used 
> exclusively for sending requests. The responses that are received by the 
> netty threads are thus never parsed and are buffered in memory, until either 
> requests stop coming in or all memory is used up.
> We should let the netty receiver thread do the parsing as well.
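> A rough sketch of the idea, parsing on the netty I/O thread (handler name is
> illustrative; plain netty and Jackson types are used for brevity):
> {code:java}
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
> import io.netty.channel.ChannelHandlerContext;
> import io.netty.channel.SimpleChannelInboundHandler;
> import io.netty.handler.codec.http.FullHttpResponse;
>
> public class InlineParsingHandler extends SimpleChannelInboundHandler<FullHttpResponse> {
>
>     private static final ObjectMapper MAPPER = new ObjectMapper();
>
>     @Override
>     protected void channelRead0(ChannelHandlerContext ctx, FullHttpResponse msg)
>             throws Exception {
>         // Parse right here on the receiver thread, so responses are consumed
>         // as they arrive instead of piling up behind queued send tasks.
>         byte[] body = new byte[msg.content().readableBytes()];
>         msg.content().readBytes(body);
>         JsonNode parsed = MAPPER.readTree(body);
>         // hand the already-parsed result to the caller's future (omitted)
>     }
> }
> {code}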



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-6437) Move history server configuration to a separate file

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-6437:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Move history server configuration to a separate file
> 
>
> Key: FLINK-6437
> URL: https://issues.apache.org/jira/browse/FLINK-6437
> Project: Flink
>  Issue Type: Improvement
>  Components: History Server
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Priority: Major
> Fix For: 1.5.1
>
>
> I suggest keeping {{flink-conf.yaml}} leaner by moving the configuration of 
> the History Server to a different file.
> In general, I would propose to move configurations of separate, independent 
> and optional components to individual config files.
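> For illustration, such a split-out file might look like the following
> (hypothetical file name; the keys are existing history server options):
> {code}
> # history-server.yaml
> historyserver.web.port: 8082
> historyserver.archive.fs.dir: hdfs:///completed-jobs/
> historyserver.archive.fs.refresh-interval: 10000
> {code}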



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7881) flink can't deployed on yarn with ha

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7881:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> flink can't deployed on yarn with ha
> 
>
> Key: FLINK-7881
> URL: https://issues.apache.org/jira/browse/FLINK-7881
> Project: Flink
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.2
>Reporter: deng
>Priority: Critical
> Fix For: 1.5.1
>
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> I start yarn-session.sh on YARN, but it can't read the HDFS logical URL. It 
> always connects to hdfs://master:8020, but it should be 9000; my HDFS 
> defaultFS is hdfs://master.
> I have configured YARN_CONF_DIR and HADOOP_CONF_DIR; it didn't work.
> Is it a bug? I use flink-1.3.0-bin-hadoop27-scala_2.10
> 2017-10-20 11:00:05,395 DEBUG org.apache.hadoop.ipc.Client
>   - IPC Client (1035144464) connection to 
> startdt/173.16.5.215:8020 from admin: closed
> 2017-10-20 11:00:05,398 ERROR 
> org.apache.flink.yarn.YarnApplicationMasterRunner - YARN 
> Application Master initialization failed
> java.net.ConnectException: Call From spark3/173.16.5.216 to master:8020 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-4052) Unstable test ConnectionUtilsTest

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-4052:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Unstable test ConnectionUtilsTest
> -
>
> Key: FLINK-4052
> URL: https://issues.apache.org/jira/browse/FLINK-4052
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.0.2, 1.3.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.1.0, 1.5.1
>
>
> The error is the following:
> {code}
> ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics:59 
> expected: but 
> was:
> {code}
> The probable cause for the failure is that the test tries to select an unused 
> closed port (from the ephemeral range), and then assumes that all connections 
> to that port fail.
> If a concurrent test actually uses that port, connections to the port will 
> succeed.
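> A short sketch of why the assumption is fragile (illustrative only):
> {code:java}
> import java.net.ServerSocket;
>
> try (ServerSocket probe = new ServerSocket(0)) {
>     int port = probe.getLocalPort();
>     // The port is known to be free only while 'probe' is open. Once it is
>     // closed, a concurrent test may bind the same port, so the assumption
>     // "connecting to this port must fail" no longer holds.
> }
> {code}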



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7551) Add VERSION to the REST urls.

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7551:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Add VERSION to the REST urls. 
> --
>
> Key: FLINK-7551
> URL: https://issues.apache.org/jira/browse/FLINK-7551
> Project: Flink
>  Issue Type: Improvement
>  Components: REST
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Priority: Major
> Fix For: 1.5.1
>
>
> This is to guarantee that we can update the REST API without breaking 
> existing third-party clients.
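> For illustration, a versioned URL scheme could look like this (paths are
> hypothetical, not the final layout):
> {code}
> GET /v1/jobs/:jobid   # versioned
> GET /jobs/:jobid      # unversioned (current)
> {code}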



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8718) Non-parallel DataStreamSource does not set max parallelism

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8718:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Non-parallel DataStreamSource does not set max parallelism
> --
>
> Key: FLINK-8718
> URL: https://issues.apache.org/jira/browse/FLINK-8718
> Project: Flink
>  Issue Type: Bug
>  Components: DataStream API, Streaming
>Affects Versions: 1.5.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Major
> Fix For: 1.5.1
>
>
> {{org.apache.flink.streaming.api.datastream.DataStreamSource}} does not set 
> {{maxParallelism}} to 1 if it is non-parallel.
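> A sketch of the fix (fragment; mirrors how the source already pins its
> parallelism, and is an assumption about the eventual patch):
> {code:java}
> if (!isParallel) {
>     setParallelism(1);
>     setMaxParallelism(1);  // previously missing
> }
> {code}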



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8803) Mini Cluster Shutdown with HA unstable, causing test failures

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8803:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Mini Cluster Shutdown with HA unstable, causing test failures
> -
>
> Key: FLINK-8803
> URL: https://issues.apache.org/jira/browse/FLINK-8803
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Stephan Ewen
>Priority: Critical
> Fix For: 1.5.1
>
>
> When the {{FlinkMiniCluster}} is created for HA tests with ZooKeeper, the 
> shutdown is unstable.
> It looks like ZooKeeper may be shut down before the JobManager is shut down, 
> causing the shutdown procedure of the JobManager (specifically 
> {{ZooKeeperSubmittedJobGraphStore.removeJobGraph}}) to block until tests time 
> out.
> Full log: https://api.travis-ci.org/v3/job/346853707/log.txt
> Note that no ZK threads are alive any more; it seems ZK is already shut down.
> Relevant Stack Traces:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f973800a800 nid=0x43b4 waiting on 
> condition [0x7f973eb0b000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x8966cf18> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:212)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:157)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.ready(package.scala:169)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.startInternalShutdown(FlinkMiniCluster.scala:469)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.stop(FlinkMiniCluster.scala:435)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.closeAsync(FlinkMiniCluster.scala:719)
>   at 
> org.apache.flink.test.util.MiniClusterResource.after(MiniClusterResource.java:104)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:50)
> ...
> {code}
> {code}
> "flink-akka.actor.default-dispatcher-2" #1012 prio=5 os_prio=0 
> tid=0x7f97394fa800 nid=0x3328 waiting on condition [0x7f971db29000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x87f82a70> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:336)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:241)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:225)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
>   at 
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.release(ZooKeeperStateHandleStore.java:478)
>   at 
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.releaseAndTryRemove(ZooKeeperStateHandleStore.java:435)
>   at 
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.releaseAndTryRemove(ZooKeeperStateHandleStore.java:405)
>   at 
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.removeJobGraph(ZooKeeperSubmittedJobGraphStore.java:266)
>   - locked <0x807f4258> (a java.lang.Object)
>   at 
> 

[jira] [Updated] (FLINK-9142) Lower the minimum number of buffers for incoming channels to 1

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9142:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Lower the minimum number of buffers for incoming channels to 1
> --
>
> Key: FLINK-9142
> URL: https://issues.apache.org/jira/browse/FLINK-9142
> Project: Flink
>  Issue Type: Sub-task
>  Components: Network
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Priority: Major
> Fix For: 1.5.1
>
>
> Even if we make the floating buffers optional, we still require 
> {{taskmanager.network.memory.buffers-per-channel}} (exclusive) buffers per 
> incoming channel with credit-based flow control; without it, the minimum was 1 
> and this parameter only influenced the maximum number of buffers.
> {{taskmanager.network.memory.buffers-per-channel}} is set to {{2}} by default, 
> with the argument that this way we will have one buffer available for 
> netty to process while a worker thread is processing/deserializing the other 
> buffer. While this seems reasonable, it does increase our minimum 
> requirements. Instead, we could probably live with {{1}} exclusive buffer and 
> up to {{gate.getNumberOfInputChannels() * (networkBuffersPerChannel - 1) + 
> extraNetworkBuffersPerGate}} floating buffers. That way we will have the same 
> memory footprint as before with only slightly changed behaviour.
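> A worked example of the proposed bound (numbers are illustrative):
> {code:java}
> int numberOfInputChannels = 100;
> int networkBuffersPerChannel = 2;   // taskmanager.network.memory.buffers-per-channel
> int extraNetworkBuffersPerGate = 8; // floating buffers per gate
>
> // proposal: only 1 exclusive buffer per incoming channel ...
> int exclusive = numberOfInputChannels;                              // 100
> // ... plus at most this many floating buffers:
> int floatingMax = numberOfInputChannels * (networkBuffersPerChannel - 1)
>         + extraNetworkBuffersPerGate;                               // 108
>
> // worst case total stays at 100 * 2 + 8 = 208 buffers, i.e. the same
> // memory footprint as before
> {code}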



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8101) Elasticsearch 6.x support

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8101:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Elasticsearch 6.x support
> -
>
> Key: FLINK-8101
> URL: https://issues.apache.org/jira/browse/FLINK-8101
> Project: Flink
>  Issue Type: New Feature
>  Components: ElasticSearch Connector
>Affects Versions: 1.4.0
>Reporter: Hai Zhou
>Priority: Major
> Fix For: 1.5.1
>
>
> Recently, elasticsearch 6.0.0 was released: 
> https://www.elastic.co/blog/elasticsearch-6-0-0-released
> The minimum Elasticsearch Java client version compatible with ES 6 is 5.6.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8416) Kinesis consumer doc examples should demonstrate preferred default credentials provider

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8416:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Kinesis consumer doc examples should demonstrate preferred default 
> credentials provider
> ---
>
> Key: FLINK-8416
> URL: https://issues.apache.org/jira/browse/FLINK-8416
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, Kinesis Connector
>Reporter: Tzu-Li (Gordon) Tai
>Priority: Major
> Fix For: 1.4.3, 1.3.4, 1.5.1
>
>
> The Kinesis consumer docs 
> [here](https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/connectors/kinesis.html#kinesis-consumer)
>  demonstrate providing credentials by explicitly supplying the AWS Access ID 
> and Key.
> The preferred approach on AWS, unless running locally, is to automatically 
> fetch the credentials from the AWS environment.
> That is actually the default behaviour of the Kinesis consumer, so the docs 
> should demonstrate it more clearly.
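> A minimal sketch of what the docs could show instead ({{env}} is an assumed
> {{StreamExecutionEnvironment}}; stream name and region are placeholders):
> {code:java}
> import java.util.Properties;
> import org.apache.flink.api.common.serialization.SimpleStringSchema;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
> import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
>
> Properties config = new Properties();
> config.put(AWSConfigConstants.AWS_REGION, "eu-west-1");
> // No access key / secret key here: the default credentials provider chain
> // picks up the credentials shipped with the AWS environment.
> DataStream<String> stream = env.addSource(
>     new FlinkKinesisConsumer<>("example-stream", new SimpleStringSchema(), config));
> {code}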



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9010) NoResourceAvailableException with FLIP-6

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9010:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> NoResourceAvailableException with FLIP-6 
> -
>
> Key: FLINK-9010
> URL: https://issues.apache.org/jira/browse/FLINK-9010
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager
>Affects Versions: 1.5.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Critical
>  Labels: flip-6
> Fix For: 1.5.1
>
>
> I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) 
> with FLIP-6 mode and a checkpointing interval of 1000 and got the following 
> exception:
> {code}
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000101 - Remaining pending container 
> requests: 302
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000101 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
> 2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> /taskmanager.out 2> /taskmanager.err
> 2018-03-16 03:41:20,176 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> 

[jira] [Updated] (FLINK-7351) test instability in JobClientActorRecoveryITCase#testJobClientRecovery

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7351:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> test instability in JobClientActorRecoveryITCase#testJobClientRecovery
> --
>
> Key: FLINK-7351
> URL: https://issues.apache.org/jira/browse/FLINK-7351
> Project: Flink
>  Issue Type: Bug
>  Components: Job-Submission, Tests
>Affects Versions: 1.4.0, 1.3.2
>Reporter: Nico Kruber
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> On a 16-core VM, the following test failed during {{mvn clean verify}}
> {code}
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.814 sec 
> <<< FAILURE! - in org.apache.flink.runtime.client.JobClientActorRecoveryITCase
> testJobClientRecovery(org.apache.flink.runtime.client.JobClientActorRecoveryITCase)
>   Time elapsed: 21.299 sec  <<< ERROR!
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:933)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Not enough free slots available to run the job. You can decrease the operator 
> parallelism or increase the number of slots per TaskManager in the 
> configuration. Resources available to scheduler: Number of instances=0, total 
> number of slots=0, available slots=0
> at 
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:334)
> at 
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.allocateSlot(Scheduler.java:139)
> at 
> org.apache.flink.runtime.executiongraph.Execution.allocateSlotForExecution(Execution.java:368)
> at 
> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:309)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:596)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:450)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleLazy(ExecutionGraph.java:834)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:814)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1425)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1372)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1372)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8357) enable rolling in default log settings

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8357:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> enable rolling in default log settings
> --
>
> Key: FLINK-8357
> URL: https://issues.apache.org/jira/browse/FLINK-8357
> Project: Flink
>  Issue Type: Improvement
>  Components: Logging
>Reporter: Xu Mingmin
>Assignee: mingleizhang
>Priority: Major
> Fix For: 1.5.1
>
>
> The release packages use {{org.apache.log4j.FileAppender}} for log4j and 
> {{ch.qos.logback.core.FileAppender}} for logback, which can result in very 
> large log files. 
> For most cases, if not all, rotation needs to be enabled in a production 
> cluster, so it would be a good idea to make rotation the default.
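> For illustration, a minimal log4j sketch of a rolled default (appender name, 
> size limit, and backup count are assumed values, not settled in this issue):
> {code}
> log4j.appender.file=org.apache.log4j.RollingFileAppender
> log4j.appender.file.file=${log.file}
> log4j.appender.file.MaxFileSize=100MB
> log4j.appender.file.MaxBackupIndex=10
> log4j.appender.file.layout=org.apache.log4j.PatternLayout
> log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
> {code}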



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9253) Make buffer count per InputGate always #channels*buffersPerChannel + ExclusiveBuffers

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9253:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Make buffer count per InputGate always #channels*buffersPerChannel + 
> ExclusiveBuffers
> -
>
> Key: FLINK-9253
> URL: https://issues.apache.org/jira/browse/FLINK-9253
> Project: Flink
>  Issue Type: Sub-task
>  Components: Network
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
> Fix For: 1.5.1
>
>
> The credit-based flow control path assigns exclusive buffers only to remote 
> channels (which makes sense since local channels don't use any buffers of 
> their own). However, this makes it rather opaque how much data may be held in 
> buffers, since that depends on the actual schedule of the job and not on the 
> job graph.
> By adapting the floating buffers to use a maximum of 
> {{#channels*buffersPerChannel + floatingBuffersPerGate - #exclusiveBuffers}}, 
> we would be channel-type agnostic and keep the old behaviour.
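> For illustration (numbers assumed, not from this issue): with 4 input channels 
> of which 2 are remote, {{buffersPerChannel}} = 2 and {{floatingBuffersPerGate}} 
> = 8, the two remote channels hold 2 * 2 = 4 exclusive buffers, so the gate 
> would use at most 4 * 2 + 8 - 4 = 12 floating buffers, i.e. 16 buffers in 
> total, which is exactly {{#channels*buffersPerChannel + floatingBuffersPerGate}} 
> regardless of how many channels happen to be remote.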



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8408) YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n unstable on Travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8408:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n unstable on Travis
> --
>
> Key: FLINK-8408
> URL: https://issues.apache.org/jira/browse/FLINK-8408
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Nico Kruber
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.3, 1.5.1
>
>
> The {{YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n}} is unstable 
> on Travis.
> https://travis-ci.org/tillrohrmann/flink/jobs/327216460



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9098) ClassLoaderITCase unstable

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9098:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> ClassLoaderITCase unstable
> --
>
> Key: FLINK-9098
> URL: https://issues.apache.org/jira/browse/FLINK-9098
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Stephan Ewen
>Priority: Critical
> Fix For: 1.5.1
>
>
> Some savepoint disposal seems to fail; after that, all subsequent tests 
> fail because there are no longer enough slots.
> Full log: https://api.travis-ci.org/v3/job/356900367/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8820) FlinkKafkaConsumer010 reads too many bytes

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8820:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> FlinkKafkaConsumer010 reads too many bytes
> --
>
> Key: FLINK-8820
> URL: https://issues.apache.org/jira/browse/FLINK-8820
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 1.4.0
>Reporter: Fabian Hueske
>Priority: Critical
> Fix For: 1.5.1
>
>
> A user reported that the FlinkKafkaConsumer010 very rarely consumes too many 
> bytes, i.e., the returned message is too large. The application is running 
> for about a year and the problem started to occur after upgrading to Flink 
> 1.4.0.
> The user made a good effort in debugging the problem but was not able to 
> reproduce it in a controlled environment. It seems that the data is correctly 
> stored in Kafka.
> Here's the thread on the user mailing list with a detailed 
> description of the problem and the analysis so far: 
> https://lists.apache.org/thread.html/1d62f616d275e9e23a5215ddf7f5466051be7ea96897d827232fcb4e@%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8418) Kafka08ITCase.testStartFromLatestOffsets() times out on Travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8418:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Kafka08ITCase.testStartFromLatestOffsets() times out on Travis
> --
>
> Key: FLINK-8418
> URL: https://issues.apache.org/jira/browse/FLINK-8418
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector, Tests
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.3, 1.5.1
>
>
> Instance: https://travis-ci.org/kl0u/flink/builds/327733085



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-5789) Make Bucketing Sink independent of Hadoop's FileSystem

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-5789:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Make Bucketing Sink independent of Hadoop's FileSystem
> --
>
> Key: FLINK-5789
> URL: https://issues.apache.org/jira/browse/FLINK-5789
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming Connectors
>Affects Versions: 1.2.0, 1.1.4
>Reporter: Stephan Ewen
>Priority: Major
> Fix For: 1.5.1
>
>
> The {{BucketingSink}} is hard-wired to Hadoop's FileSystem, bypassing Flink's 
> file system abstraction.
> This causes several issues:
>   - The bucketing sink behaves differently from other file sinks with 
> respect to configuration
>   - Directly supported file systems (not accessed through Hadoop), like the 
> MapR File System, do not work with the BucketingSink in the same way as other 
> file systems
>   - The previous point is all the more problematic given the effort to make 
> Hadoop an optional dependency and to run in other stacks (Mesos, Kubernetes, 
> AWS, GCE, Azure) with ideally no Hadoop dependency.
> We should port the {{BucketingSink}} to use Flink's FileSystem classes.
> To support the *truncate* functionality that is needed for the exactly-once 
> semantics of the Bucketing Sink, we should extend Flink's FileSystem 
> abstraction to have the following methods; a sketch follows the list.
>   - {{boolean supportsTruncate()}}
>   - {{void truncate(Path, long)}}
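> A hypothetical sketch of that extension ({{Path}} is Flink's 
> {{org.apache.flink.core.fs.Path}}; the default behaviour is assumed, not 
> specified in this issue):
> {code:java}
> import java.io.IOException;
> 
> public abstract class FileSystem {
>     // ... existing methods elided ...
> 
>     /** Whether this file system supports truncating files to a given length. */
>     public boolean supportsTruncate() {
>         return false; // assumed default: no truncation support
>     }
> 
>     /** Truncates the file at the given path to the given length. */
>     public void truncate(Path path, long length) throws IOException {
>         throw new UnsupportedOperationException(
>             "This file system does not support truncate.");
>     }
> }
> {code}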



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8521) FlinkKafkaProducer011ITCase.testRunOutOfProducersInThePool timed out on Travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8521:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> FlinkKafkaProducer011ITCase.testRunOutOfProducersInThePool timed out on Travis
> --
>
> Key: FLINK-8521
> URL: https://issues.apache.org/jira/browse/FLINK-8521
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Piotr Nowojski
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.3, 1.5.1
>
>
> The {{FlinkKafkaProducer011ITCase.testRunOutOfProducersInThePool}} timed out 
> on Travis with producing no output for longer than 300s.
>  
> https://travis-ci.org/tillrohrmann/flink/jobs/334642014



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-6571) InfiniteSource in SourceStreamOperatorTest should deal with InterruptedExceptions

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-6571:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> InfiniteSource in SourceStreamOperatorTest should deal with 
> InterruptedExceptions
> -
>
> Key: FLINK-6571
> URL: https://issues.apache.org/jira/browse/FLINK-6571
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> So this is a new one: I got a failing test 
> ({{testNoMaxWatermarkOnAsyncStop}}) due to an uncaught InterruptedException.
> {code}
> [00:28:15] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 0.828 sec <<< FAILURE! - in 
> org.apache.flink.streaming.runtime.operators.StreamSourceOperatorTest
> [00:28:15] 
> testNoMaxWatermarkOnAsyncStop(org.apache.flink.streaming.runtime.operators.StreamSourceOperatorTest)
>   Time elapsed: 0 sec  <<< ERROR!
> [00:28:15] java.lang.InterruptedException: sleep interrupted
> [00:28:15]at java.lang.Thread.sleep(Native Method)
> [00:28:15]at 
> org.apache.flink.streaming.runtime.operators.StreamSourceOperatorTest$InfiniteSource.run(StreamSourceOperatorTest.java:343)
> [00:28:15]at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> [00:28:15]at 
> org.apache.flink.streaming.runtime.operators.StreamSourceOperatorTest.testNoMaxWatermarkOnAsyncStop(StreamSourceOperatorTest.java:176)
> {code}
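> A minimal sketch of an interrupt-tolerant variant (class shape assumed, not 
> the actual test code):
> {code:java}
> import org.apache.flink.streaming.api.functions.source.SourceFunction;
> 
> class InfiniteSource<T> implements SourceFunction<T> {
> 
>     private volatile boolean running = true;
> 
>     @Override
>     public void run(SourceContext<T> ctx) throws Exception {
>         while (running) {
>             try {
>                 Thread.sleep(20);
>             } catch (InterruptedException e) {
>                 // an interrupt is expected when the task is stopped or
>                 // cancelled: exit quietly instead of failing the test
>                 if (!running) {
>                     return;
>                 }
>             }
>         }
>     }
> 
>     @Override
>     public void cancel() {
>         running = false;
>     }
> }
> {code}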



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7964) Add Apache Kafka 1.0 connector

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7964:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Add Apache Kafka 1.0 connector
> --
>
> Key: FLINK-7964
> URL: https://issues.apache.org/jira/browse/FLINK-7964
> Project: Flink
>  Issue Type: Improvement
>  Components: Kafka Connector
>Affects Versions: 1.4.0
>Reporter: Hai Zhou
>Assignee: Hai Zhou
>Priority: Major
> Fix For: 1.5.1
>
>
> Kafka 1.0.0 is no mere bump of the version number. The Apache Kafka Project 
> Management Committee has packed a number of valuable enhancements into the 
> release. Here is a summary of a few of them:
> * Since its introduction in version 0.10, the Streams API has become hugely 
> popular among Kafka users, including the likes of Pinterest, Rabobank, 
> Zalando, and The New York Times. In 1.0, the API continues to evolve at a 
> healthy pace. To begin with, the builder API has been improved (KIP-120). A 
> new API has been added to expose the state of active tasks at runtime 
> (KIP-130). The new cogroup API makes it much easier to deal with partitioned 
> aggregates with fewer StateStores and fewer moving parts in your code 
> (KIP-150). Debuggability gets easier with enhancements to the print() and 
> writeAsText() methods (KIP-160). And if that’s not enough, check out KIP-138 
> and KIP-161 too. For more on streams, check out the Apache Kafka Streams 
> documentation, including some helpful new tutorial videos.
> * Operating Kafka at scale requires that the system remain observable, and to 
> make that easier, we’ve made a number of improvements to metrics. These are 
> too many to summarize without becoming tedious, but Connect metrics have been 
> significantly improved (KIP-196), a litany of new health check metrics are 
> now exposed (KIP-188), and we now have a global topic and partition count 
> (KIP-168). Check out KIP-164 and KIP-187 for even more.
> * We now support Java 9, leading, among other things, to significantly faster 
> TLS and CRC32C implementations. Over-the-wire encryption will be faster now, 
> which will keep Kafka fast and compute costs low when encryption is enabled.
> * In keeping with the security theme, KIP-152 cleans up the error handling on 
> Simple Authentication Security Layer (SASL) authentication attempts. 
> Previously, some authentication error conditions were indistinguishable from 
> broker failures and were not logged in a clear way. This is cleaner now.
> * Kafka can now tolerate disk failures better. Historically, JBOD storage 
> configurations have not been recommended, but the architecture has 
> nevertheless been tempting: after all, why not rely on Kafka’s own 
> replication mechanism to protect against storage failure rather than using 
> RAID? With KIP-112, Kafka now handles disk failure more gracefully. A single 
> disk failure in a JBOD broker will not bring the entire broker down; rather, 
> the broker will continue serving any log files that remain on functioning 
> disks.
> * Since release 0.11.0, the idempotent producer (which is the producer used 
> in the presence of a transaction, which of course is the producer we use for 
> exactly-once processing) required max.in.flight.requests.per.connection to be 
> equal to one. As anyone who has written or tested a wire protocol can attest, 
> this put an upper bound on throughput. Thanks to KAFKA-5949, this can now be 
> as large as five, relaxing the throughput constraint quite a bit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8135) Add description to MessageParameter

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8135:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Add description to MessageParameter
> ---
>
> Key: FLINK-8135
> URL: https://issues.apache.org/jira/browse/FLINK-8135
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, REST
>Reporter: Chesnay Schepler
>Assignee: Andrei
>Priority: Major
> Fix For: 1.5.1
>
>
> For documentation purposes we should add a {{getDescription()}} method to 
> the {{MessageParameter}} class, describing what the particular parameter is 
> used for and which values are accepted.
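> A hypothetical sketch of the addition (only the new method; the rest of the 
> class is elided):
> {code:java}
> public abstract class MessageParameter<X> {
>     // ... existing fields and methods elided ...
> 
>     /**
>      * Returns a human-readable description of this parameter, stating what
>      * it is used for and which values are accepted.
>      */
>     public abstract String getDescription();
> }
> {code}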



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8410) Kafka consumer's committedOffsets gauge metric is prematurely set

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8410:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Kafka consumer's committedOffsets gauge metric is prematurely set
> 
>
> Key: FLINK-8410
> URL: https://issues.apache.org/jira/browse/FLINK-8410
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector, Metrics
>Affects Versions: 1.4.0, 1.3.2, 1.5.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Major
> Fix For: 1.4.3, 1.3.4, 1.5.1
>
>
> The {{committedOffset}} metric gauge value is set too early. It is set here:
> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka-0.9/src/main/java/org/apache/flink/streaming/connectors/kafka/internal/Kafka09Fetcher.java#L236
> This sets the committed offset before the actual commit happens. When the 
> commit actually happens varies depending on whether offsets are committed 
> automatically at a periodic interval or on checkpoints. Moreover, in the 0.9+ 
> consumers, the {{KafkaConsumerThread}} may choose to supersede some commit 
> attempts if a commit takes longer than the commit interval.
> While the committed offset back to Kafka is not a critical value used by the 
> consumer, it will be best to have more accurate values as a Flink-shipped 
> metric.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9257) End-to-end tests prints "All tests PASS" even if individual test-script returns non-zero exit code

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9257:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> End-to-end tests prints "All tests PASS" even if individual test-script 
> returns non-zero exit code
> --
>
> Key: FLINK-9257
> URL: https://issues.apache.org/jira/browse/FLINK-9257
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Florian Schmidt
>Assignee: Florian Schmidt
>Priority: Critical
> Fix For: 1.5.1
>
>
> In some cases the test-suite exits with a non-zero exit code but still prints 
> "All tests PASS" to stdout. This happens because of how the test runner works, 
> which is roughly as follows:
>  # Either run-nightly-tests.sh or run-precommit-tests.sh executes a suite of 
> tests consisting of multiple bash scripts.
>  # As soon as one of those bash scripts exits with a non-zero exit code, the 
> tests won't continue to run and the test-suite will also exit with a non-zero 
> exit code.
>  # *During the cleanup hook (trap cleanup EXIT in common.sh) it is checked 
> whether there are non-empty .out files or log files with certain exceptions. 
> If a test fails with a non-zero exit code but does not leave any exceptions 
> or .out files, "All tests PASS" is still printed to stdout, even though the 
> tests did not all pass.*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8768) Change NettyMessageDecoder to inherit from LengthFieldBasedFrameDecoder

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8768:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Change NettyMessageDecoder to inherit from LengthFieldBasedFrameDecoder
> ---
>
> Key: FLINK-8768
> URL: https://issues.apache.org/jira/browse/FLINK-8768
> Project: Flink
>  Issue Type: Improvement
>  Components: Network
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
> Fix For: 1.5.1
>
>
> Let {{NettyMessageDecoder}} inherit from {{LengthFieldBasedFrameDecoder}} 
> instead of being an additional step in the pipeline. This not only 
> removes overhead in the pipeline itself but also allows us to override the 
> {{#extractFrame()}} method to restore the old Netty 4.0.27 behaviour for 
> the non-credit-based code paths, which hit a bug there with Netty >= 4.0.28.
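> A rough sketch of the idea (the framing offsets are illustrative only, not 
> taken from the actual protocol):
> {code:java}
> import io.netty.buffer.ByteBuf;
> import io.netty.channel.ChannelHandlerContext;
> import io.netty.handler.codec.LengthFieldBasedFrameDecoder;
> 
> class NettyMessageDecoder extends LengthFieldBasedFrameDecoder {
> 
>     NettyMessageDecoder() {
>         // assumed framing: a 4-byte length field at offset 0 that is
>         // stripped from the resulting frame (illustrative values)
>         super(Integer.MAX_VALUE, 0, 4, 0, 4);
>     }
> 
>     @Override
>     protected ByteBuf extractFrame(ChannelHandlerContext ctx, ByteBuf buffer, int index, int length) {
>         // restore the Netty 4.0.27 behaviour for the non-credit-based path:
>         // copy the frame instead of returning a slice of the input buffer
>         ByteBuf frame = ctx.alloc().buffer(length);
>         frame.writeBytes(buffer, index, length);
>         return frame;
>     }
> }
> {code}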



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9193) Deprecate non-well-defined output methods on DataStream

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9193:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Deprecate non-well-defined output methods on DataStream
> ---
>
> Key: FLINK-9193
> URL: https://issues.apache.org/jira/browse/FLINK-9193
> Project: Flink
>  Issue Type: Improvement
>  Components: DataStream API
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>Priority: Major
> Fix For: 1.5.1
>
>
> Some output methods on {{DataStream}} that write text to files are not safe 
> to use in a streaming program as they have no consistency guarantees. They 
> are:
>  - {{writeAsText()}}
>  - {{writeAsCsv()}}
>  - {{writeToSocket()}}
>  - {{writeUsingOutputFormat()}}
> Along with those we should also deprecate the {{SinkFunctions}} that they use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9355) Simplify configuration of local recovery to a simple on/off

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9355:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Simplify configuration of local recovery to a simple on/off
> ---
>
> Key: FLINK-9355
> URL: https://issues.apache.org/jira/browse/FLINK-9355
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.5.1
>
>
> We can change the configuration of local recovery to a simple 
> enabled/disabled choice. In the future, if a backend can offer different 
> implementations of local recovery, the backend can offer its own config 
> constants as an expert option.
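> For illustration, the simplified switch could then be a single boolean in 
> flink-conf.yaml (the key name is assumed, not fixed by this issue):
> {code}
> state.backend.local-recovery: true
> {code}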



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8825) Disallow new String() without charset in checkstyle

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8825:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Disallow new String() without charset in checkstyle
> ---
>
> Key: FLINK-8825
> URL: https://issues.apache.org/jira/browse/FLINK-8825
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Aljoscha Krettek
>Assignee: vinoyang
>Priority: Major
> Fix For: 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8452) BucketingSinkTest.testNonRollingAvroKeyValueWithCompressionWriter instable on Travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8452:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> BucketingSinkTest.testNonRollingAvroKeyValueWithCompressionWriter instable on 
> Travis
> 
>
> Key: FLINK-8452
> URL: https://issues.apache.org/jira/browse/FLINK-8452
> Project: Flink
>  Issue Type: Bug
>  Components: DataStream API, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> The {{BucketingSinkTest.testNonRollingAvroKeyValueWithCompressionWriter}} 
> seems to be unstable on Travis:
>  
> https://travis-ci.org/tillrohrmann/flink/jobs/330261310



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7916) Remove NetworkStackThroughputITCase

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7916:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Remove NetworkStackThroughputITCase
> ---
>
> Key: FLINK-7916
> URL: https://issues.apache.org/jira/browse/FLINK-7916
> Project: Flink
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.5.1
>
>
> Flink's code base contains the {{NetworkStackThroughputITCase}}, which is not 
> really a test. Moreover, it is marked as {{Ignored}}. I propose to remove this 
> test because it is more of a benchmark. We could think about creating a 
> benchmark project to which we move these kinds of "tests".
> In general I think we should remove ignored tests if they won't be fixed 
> immediately. The danger is far too high that we forget about them and then 
> only keep the maintenance burden. This is especially true for the above 
> mentioned test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8623) ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics unstable on Travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8623:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics unstable on 
> Travis
> 
>
> Key: FLINK-8623
> URL: https://issues.apache.org/jira/browse/FLINK-8623
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.3, 1.5.1
>
>
> {{ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics}} fails on 
> Travis: https://travis-ci.org/apache/flink/jobs/33932



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9360) HA end-to-end nightly test takes more than 15 min in Travis CI

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9360:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> HA end-to-end nightly test takes more than 15 min in Travis CI
> --
>
> Key: FLINK-9360
> URL: https://issues.apache.org/jira/browse/FLINK-9360
> Project: Flink
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Andrey Zagrebin
>Priority: Major
>  Labels: E2E, Nightly
> Fix For: 1.5.1
>
>
> We have not discussed how long the nightly tests should run. Currently the 
> overall testing build time is around the limit of a free Travis CI VM (50 min).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8160) Extend OperatorHarness to expose metrics

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8160:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Extend OperatorHarness to expose metrics
> 
>
> Key: FLINK-8160
> URL: https://issues.apache.org/jira/browse/FLINK-8160
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics, Streaming
>Reporter: Chesnay Schepler
>Assignee: Tuo Wang
>Priority: Major
> Fix For: 1.5.1
>
>
> To better test interactions between operators and metrics the harness should 
> expose the metrics registered by the operator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8192) Properly annotate APIs of all Flink connectors with @Public / @PublicEvolving / @Internal

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8192:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Properly annotate APIs of all Flink connectors with @Public / @PublicEvolving 
> / @Internal
> -
>
> Key: FLINK-8192
> URL: https://issues.apache.org/jira/browse/FLINK-8192
> Project: Flink
>  Issue Type: Improvement
>  Components: Streaming Connectors
>Reporter: Tzu-Li (Gordon) Tai
>Priority: Major
> Fix For: 1.5.1
>
>
> Currently, the APIs of the Flink connectors have absolutely no annotations on 
> whether their usage is {{Public}} / {{PublicEvolving}} / or {{Internal}}.
> We have, for example, instances in the past where a user was mistakenly using 
> an abstract internal base class in the Elasticsearch connector.
> This JIRA tracks the coverage of API usage annotation for all Flink shipped 
> connectors. Ideally, a separate subtask should be created for each individual 
> connector.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8161) Flakey YARNSessionCapacitySchedulerITCase on Travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8161:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Flakey YARNSessionCapacitySchedulerITCase on Travis
> ---
>
> Key: FLINK-8161
> URL: https://issues.apache.org/jira/browse/FLINK-8161
> Project: Flink
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
> Attachments: 24547.10.tar.gz, 
> api.travis-ci.org_v3_job_380488917_log.tar.xz
>
>
> The {{YARNSessionCapacitySchedulerITCase}} spuriously fails on Travis because 
> the logs now occasionally contain {{2017-11-25 22:49:49,204 WARN  
> akka.remote.transport.netty.NettyTransport- Remote 
> connection to [null] failed with java.nio.channels.NotYetConnectedException}}. 
> I suspect that this is due to switching from 
> Flakka to Akka 2.4.0. In order to solve this problem I propose to add this 
> log statement to the whitelisted log statements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9187) add prometheus pushgateway reporter

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9187:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> add prometheus pushgateway reporter
> ---
>
> Key: FLINK-9187
> URL: https://issues.apache.org/jira/browse/FLINK-9187
> Project: Flink
>  Issue Type: New Feature
>  Components: Metrics
>Affects Versions: 1.4.2
>Reporter: lamber-ken
>Priority: Minor
>  Labels: features
> Fix For: 1.5.1
>
>
> Enable Flink to send metrics to Prometheus via the Pushgateway.
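> For illustration, a hypothetical reporter configuration in flink-conf.yaml 
> (class and key names are assumed, since the reporter does not exist yet):
> {code}
> metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.host: localhost
> metrics.reporter.promgateway.port: 9091
> metrics.reporter.promgateway.jobName: my-flink-job
> {code}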



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8907) Unable to start the process by start-cluster.sh

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8907:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Unable to start the process by start-cluster.sh
> ---
>
> Key: FLINK-8907
> URL: https://issues.apache.org/jira/browse/FLINK-8907
> Project: Flink
>  Issue Type: Bug
>  Components: Startup Shell Scripts
>Reporter: mingleizhang
>Priority: Minor
> Fix For: 1.5.1
>
>
> I built Flink from the latest code on the master branch and got the 
> flink-1.6-SNAPSHOT binary. When I run ./start-cluster.sh with the 
> default flink-conf.yaml, the process does not start and I get the error below.
> {code:java}
>  [root@ricezhang-pjhzf bin]# ./start-cluster.sh 
> >> Starting cluster.
> >> Starting standalonesession daemon on host ricezhang-pjhzf.vclound.com.
> >> : Temporary failure in name resolutionost
> {code}
> *By the way, I built this Flink on a Windows computer and copied the 
> {{flink-1.6-SNAPSHOT}} folder to CentOS.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7386) Flink Elasticsearch 5 connector is not compatible with Elasticsearch 5.2+ client

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7386:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Flink Elasticsearch 5 connector is not compatible with Elasticsearch 5.2+ 
> client
> 
>
> Key: FLINK-7386
> URL: https://issues.apache.org/jira/browse/FLINK-7386
> Project: Flink
>  Issue Type: Improvement
>  Components: ElasticSearch Connector
>Reporter: Dawid Wysakowicz
>Assignee: Fang Yong
>Priority: Critical
> Fix For: 1.5.1
>
>
> In the Elasticsearch 5.2.0 client the class {{BulkProcessor}} was refactored 
> and no longer has the method {{add(ActionRequest)}}.
> For more info see: https://github.com/elastic/elasticsearch/pull/20109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8163:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> The {{NonHAQueryableStateFsBackendITCase}} tests seem to get stuck on Travis, 
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7829) Remove (or at least deprecate) DataStream.writeToFile/Csv

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7829:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Remove (or at least deprecate) DataStream.writeToFile/Csv
> -
>
> Key: FLINK-7829
> URL: https://issues.apache.org/jira/browse/FLINK-7829
> Project: Flink
>  Issue Type: Improvement
>  Components: DataStream API
>Reporter: Aljoscha Krettek
>Priority: Major
> Fix For: 1.5.1
>
>
> These methods are tempting for users, but they should never actually be used 
> in a production streaming job. For those cases the {{BucketingSink}} should 
> be used instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8822) RotateLogFile may not work well when sed version is below 4.2

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8822:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> RotateLogFile may not work well when sed version is below 4.2
> -
>
> Key: FLINK-8822
> URL: https://issues.apache.org/jira/browse/FLINK-8822
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Xin Liu
>Priority: Major
> Fix For: 1.5.1
>
>
> In bin/config.sh, rotateLogFilesWithPrefix() uses an extended regular 
> expression with "sed -E" to process file names, but when the sed version is 
> below 4.2 this fails with "sed: invalid option -- 'E'",
> and log file rotation won't work properly: there will be only one log file no 
> matter what $MAX_LOG_FILE_NUMBER is.
> Using "sed -r" instead may be more suitable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8672) Support continuous processing in CSV table source

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8672:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Support continuous processing in CSV table source
> -
>
> Key: FLINK-8672
> URL: https://issues.apache.org/jira/browse/FLINK-8672
> Project: Flink
>  Issue Type: New Feature
>  Components: Table API  SQL
>Reporter: Aljoscha Krettek
>Priority: Major
> Fix For: 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7565) Add support for HTTP 1.1 (Chunked transfer encoding) to Flink web UI

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7565:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Add support for HTTP 1.1 (Chunked transfer encoding) to Flink web UI
> 
>
> Key: FLINK-7565
> URL: https://issues.apache.org/jira/browse/FLINK-7565
> Project: Flink
>  Issue Type: Improvement
>  Components: Webfrontend
>Reporter: Robert Metzger
>Priority: Major
> Fix For: 1.5.1
>
>
> I tried sending some REST calls to Flink's web UI, using an HTTP 1.1 client 
> (Jersey-client 2.23.1) with chunked transfer encoding (which means the 
> content-length header is not set). 
> Flink's web UI treats those requests as empty requests (with no body), 
> leading to errors and exceptions.
> It's probably "just" an upgrade of netty-router or so to fix this.
> I've also found a good workaround for my use case (by setting 
> setChunkedEncodingEnabled(false)) (see also 
> https://stackoverflow.com/questions/18157218/jersey-2-0-content-length-not-set)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7064) test instability in WordCountMapreduceITCase

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7064:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> test instability in WordCountMapreduceITCase
> 
>
> Key: FLINK-7064
> URL: https://issues.apache.org/jira/browse/FLINK-7064
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.4.0
>Reporter: Nico Kruber
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> Although already mentioned in FLINK-7004, this does not seem fixed yet and 
> has apparently now become an unstable test:
> {code}
> Running org.apache.flink.test.hadoop.mapred.WordCountMapredITCase
> Inflater has been closed
> java.lang.NullPointerException: Inflater has been closed
>   at java.util.zip.Inflater.ensureOpen(Inflater.java:389)
>   at java.util.zip.Inflater.inflate(Inflater.java:257)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:152)
>   at java.io.FilterInputStream.read(FilterInputStream.java:133)
>   at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:154)
>   at java.io.BufferedReader.readLine(BufferedReader.java:317)
>   at java.io.BufferedReader.readLine(BufferedReader.java:382)
>   at 
> javax.xml.parsers.FactoryFinder.findJarServiceProvider(FactoryFinder.java:319)
>   at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:255)
>   at 
> javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:121)
>   at 
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2467)
>   at 
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2444)
>   at 
> org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2361)
>   at org.apache.hadoop.conf.Configuration.get(Configuration.java:968)
>   at 
> org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:2010)
>   at org.apache.hadoop.mapred.JobConf.(JobConf.java:423)
>   at 
> org.apache.flink.hadoopcompatibility.HadoopInputs.readHadoopFile(HadoopInputs.java:63)
>   at 
> org.apache.flink.test.hadoop.mapred.WordCountMapredITCase.internalRun(WordCountMapredITCase.java:79)
>   at 
> org.apache.flink.test.hadoop.mapred.WordCountMapredITCase.testProgram(WordCountMapredITCase.java:67)
>   at 
> org.apache.flink.test.util.JavaProgramTestBase.testJobWithObjectReuse(JavaProgramTestBase.java:127)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
>   at 
> 

[jira] [Updated] (FLINK-8439) Document using a custom AWS Credentials Provider with flink-s3-fs-hadoop

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8439:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Document using a custom AWS Credentials Provider with flink-s3-fs-hadoop
> 
>
> Key: FLINK-8439
> URL: https://issues.apache.org/jira/browse/FLINK-8439
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Dyana Rose
>Priority: Critical
> Fix For: 1.4.3, 1.5.1
>
>
> This came up when using S3 for the file system backend and running under 
> ECS.
> With no credentials in the container, hadoop-aws will default to EC2 instance 
> level credentials when accessing S3. However when running under ECS, you will 
> generally want to default to the task definition's IAM role.
> In this case you need to set the hadoop property
> {code:java}
> fs.s3a.aws.credentials.provider{code}
> to a fully qualified class name(s). see [hadoop-aws 
> docs|https://github.com/apache/hadoop/blob/1ba491ff907fc5d2618add980734a3534e2be098/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md]
> This works as expected when you add this setting to flink-conf.yaml but there 
> is a further 'gotcha.'  Because the AWS sdk is shaded, the actual full class 
> name for, in this case, the ContainerCredentialsProvider is
> {code:java}
> org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.auth.ContainerCredentialsProvider{code}
>  
> meaning the full setting is:
> {code:java}
> fs.s3a.aws.credentials.provider: 
> org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.auth.ContainerCredentialsProvider{code}
> If you instead set it to the unshaded class name you will see a very 
> confusing error stating that the ContainerCredentialsProvider doesn't 
> implement AWSCredentialsProvider (which it most certainly does.)
> Adding this information (how to specify alternate Credential Providers, and 
> the name space gotcha) to the [AWS deployment 
> docs|https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/aws.html]
>  would be useful to anyone else using S3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9159) Sanity check default timeout values

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9159:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> Sanity check default timeout values
> ---
>
> Key: FLINK-9159
> URL: https://issues.apache.org/jira/browse/FLINK-9159
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Fabian Hueske
>Priority: Blocker
> Fix For: 1.5.1
>
>
> Check that the default timeout values for resource release are sanely chosen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9258) ConcurrentModificationException in ComponentMetricGroup.getAllVariables

2018-05-25 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9258:
-
Fix Version/s: (was: 1.5.0)
   1.5.1

> ConcurrentModificationException in ComponentMetricGroup.getAllVariables
> ---
>
> Key: FLINK-9258
> URL: https://issues.apache.org/jira/browse/FLINK-9258
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics
>Affects Versions: 1.4.2
>Reporter: Narayanan Arunachalam
>Assignee: Chesnay Schepler
>Priority: Major
> Fix For: 1.4.3, 1.5.1
>
>
> Seeing this exception at job startup time. It looks like there is a race 
> condition when the metric variables are constructed.
> The error is intermittent.
> Exception in thread "main" 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>     at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply$mcV$sp(JobManager.scala:897)
>     at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:840)
>     at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:840)
>     at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>     at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.ConcurrentModificationException
>     at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
>     at java.util.HashMap$EntryIterator.next(HashMap.java:1471)
>     at java.util.HashMap$EntryIterator.next(HashMap.java:1469)
>     at java.util.HashMap.putMapEntries(HashMap.java:511)
>     at java.util.HashMap.putAll(HashMap.java:784)
>     at 
> org.apache.flink.runtime.metrics.groups.ComponentMetricGroup.getAllVariables(ComponentMetricGroup.java:63)
>     at 
> org.apache.flink.runtime.metrics.groups.ComponentMetricGroup.getAllVariables(ComponentMetricGroup.java:63)
>     at 
> com.netflix.spaas.metrics.MetricsReporterRegistry.getTags(MetricsReporterRegistry.java:147)
>     at 
> com.netflix.spaas.metrics.MetricsReporterRegistry.mergeWithSourceAndSinkTags(MetricsReporterRegistry.java:170)
>     at 
> com.netflix.spaas.metrics.MetricsReporterRegistry.addReporter(MetricsReporterRegistry.java:75)
>     at 
> com.netflix.spaas.nfflink.connector.kafka.source.Kafka010Consumer.createFetcher(Kafka010Consumer.java:69)
>     at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:549)
>     at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:86)
>     at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
>     at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:94)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>     at java.lang.Thread.run(Thread.java:748)
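> For illustration, the failing pattern boils down to copying a parent map that 
> another thread may still be mutating (a simplified sketch, not the actual 
> Flink code; {{parent}} stands for the enclosing metric group):
> {code:java}
> Map<String, String> getAllVariables() {
>     Map<String, String> variables = new HashMap<>();
>     // HashMap.putAll iterates over the parent's map; if another thread is
>     // still adding entries to that map, the iteration throws the
>     // ConcurrentModificationException seen in the stack trace above
>     variables.putAll(parent.getAllVariables());
>     return variables;
> }
> {code}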



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

