[jira] [Created] (FLINK-4223) Rearrange scaladoc and javadoc for Scala API

2016-07-15 Thread Chiwan Park (JIRA)
Chiwan Park created FLINK-4223:
--

 Summary: Rearrange scaladoc and javadoc for Scala API
 Key: FLINK-4223
 URL: https://issues.apache.org/jira/browse/FLINK-4223
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Reporter: Chiwan Park
Priority: Minor


Currently, some documentation for the Scala APIs (Gelly Scala API, FlinkML, 
Streaming Scala API) is found in the Javadoc rather than the Scaladoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4222) Allow Kinesis configuration to get credentials from AWS Metadata

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380312#comment-15380312
 ] 

ASF GitHub Bot commented on FLINK-4222:
---

GitHub user chadnickbok opened a pull request:

https://github.com/apache/flink/pull/2260

[FLINK-4222] Allow Kinesis configuration to get credentials from AWS 
Metadata   

When called without credentials, the AmazonKinesisClient tries to configure 
itself automatically, searching for credentials in environment variables and 
Java system properties, and finally falling back to instance-profile 
credentials delivered through the Amazon EC2 metadata service.

This change adds the AWSConfigConstant "AUTO", which creates an 
AmazonKinesisClient without explicit AWSCredentials, enabling this 
auto-discovery mechanism and allowing Kinesis credentials to be obtained from 
the AWS EC2 metadata service.
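The discovery order described above can be sketched as a simple provider chain. This is a stdlib-only mimic for illustration; the real AWS SDK encapsulates this in DefaultAWSCredentialsProviderChain, and the stubbed values below are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Stdlib-only sketch of the credential lookup order described above. */
public class CredentialChainSketch {

    /** Return the first non-null credential produced by the chain. */
    static String resolve(List<Supplier<String>> providers) {
        for (Supplier<String> provider : providers) {
            String credential = provider.get();
            if (credential != null) {
                return credential;
            }
        }
        throw new IllegalStateException("No credentials found in chain");
    }

    public static void main(String[] args) {
        List<Supplier<String>> chain = new ArrayList<>();
        chain.add(() -> null);               // 1. environment variables: unset here
        chain.add(() -> null);               // 2. Java system properties: unset here
        chain.add(() -> "instance-profile"); // 3. EC2 metadata service: stubbed hit
        System.out.println(resolve(chain));  // prints "instance-profile"
    }
}
```

The "AUTO" setting simply means no explicit provider is pinned, so the chain falls through to whatever source answers first.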

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chadnickbok/flink aws-metadata-auth-kinesis

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2260.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2260


commit 8d43f05ce65b88067fd5a4808c773cafc693c0f2
Author: Nick Chadwick 
Date:   2016-07-15T23:24:19Z

Support automatic AWS Credentials discovery.

When called without credentials, the AmazonKinesisClient tries to configure 
itself automatically, searching for credentials in environment variables and 
Java system properties, and finally falling back to instance-profile 
credentials delivered through the Amazon EC2 metadata service.

This change adds the AWSConfigConstant "AUTO", which creates an 
AmazonKinesisClient without explicit AWSCredentials, enabling this 
auto-discovery mechanism and allowing Kinesis credentials to be obtained from 
the AWS EC2 metadata service.




> Allow Kinesis configuration to get credentials from AWS Metadata
> 
>
> Key: FLINK-4222
> URL: https://issues.apache.org/jira/browse/FLINK-4222
> Project: Flink
>  Issue Type: Improvement
>  Components: Streaming Connectors
>Affects Versions: 1.0.3
>Reporter: Nick Chadwick
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When deploying Flink TaskManagers in an EC2 environment, it would be nice to 
> be able to use the EC2 IAM Role credentials provided by the EC2 Metadata 
> service.
> This allows for credentials to be automatically discovered by services 
> running on EC2 instances at runtime, and removes the need to explicitly 
> create and assign credentials to TaskManagers.
> This should be a fairly small change to the configuration of the 
> flink-connector-kinesis, which will greatly improve the ease of deployment to 
> Amazon EC2





[GitHub] flink pull request #2260: [FLINK-4222] Allow Kinesis configuration to get cr...

2016-07-15 Thread chadnickbok
GitHub user chadnickbok opened a pull request:

https://github.com/apache/flink/pull/2260

[FLINK-4222] Allow Kinesis configuration to get credentials from AWS 
Metadata   

When called without credentials, the AmazonKinesisClient tries to configure 
itself automatically, searching for credentials in environment variables and 
Java system properties, and finally falling back to instance-profile 
credentials delivered through the Amazon EC2 metadata service.

This change adds the AWSConfigConstant "AUTO", which creates an 
AmazonKinesisClient without explicit AWSCredentials, enabling this 
auto-discovery mechanism and allowing Kinesis credentials to be obtained from 
the AWS EC2 metadata service.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chadnickbok/flink aws-metadata-auth-kinesis

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2260.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2260


commit 8d43f05ce65b88067fd5a4808c773cafc693c0f2
Author: Nick Chadwick 
Date:   2016-07-15T23:24:19Z

Support automatic AWS Credentials discovery.

When called without credentials, the AmazonKinesisClient tries to configure 
itself automatically, searching for credentials in environment variables and 
Java system properties, and finally falling back to instance-profile 
credentials delivered through the Amazon EC2 metadata service.

This change adds the AWSConfigConstant "AUTO", which creates an 
AmazonKinesisClient without explicit AWSCredentials, enabling this 
auto-discovery mechanism and allowing Kinesis credentials to be obtained from 
the AWS EC2 metadata service.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (FLINK-4222) Allow Kinesis configuration to get credentials from AWS Metadata

2016-07-15 Thread Nick Chadwick (JIRA)
Nick Chadwick created FLINK-4222:


 Summary: Allow Kinesis configuration to get credentials from AWS 
Metadata
 Key: FLINK-4222
 URL: https://issues.apache.org/jira/browse/FLINK-4222
 Project: Flink
  Issue Type: Improvement
  Components: Streaming Connectors
Affects Versions: 1.0.3
Reporter: Nick Chadwick
Priority: Minor


When deploying Flink TaskManagers in an EC2 environment, it would be nice to be 
able to use the EC2 IAM Role credentials provided by the EC2 Metadata service.

This allows for credentials to be automatically discovered by services running 
on EC2 instances at runtime, and removes the need to explicitly create and 
assign credentials to TaskManagers.

This should be a fairly small change to the configuration of the 
flink-connector-kinesis, which will greatly improve the ease of deployment to 
Amazon EC2





[jira] [Updated] (FLINK-4209) Fix issue on docker with multiple NICs and remove supervisord dependency

2016-07-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated FLINK-4209:

Summary: Fix issue on docker with multiple NICs and remove supervisord 
dependency  (was: Docker image breaks with multiple NICs)

> Fix issue on docker with multiple NICs and remove supervisord dependency
> 
>
> Key: FLINK-4209
> URL: https://issues.apache.org/jira/browse/FLINK-4209
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-contrib
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Host resolution is currently done by IP in the Docker image scripts, which is 
> an issue when the system has multiple network cards. Resolving the host by 
> name instead fixes this.





[jira] [Closed] (FLINK-4208) Support Running Flink processes in foreground mode

2016-07-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía closed FLINK-4208.
---
Resolution: Fixed

I decided to drop the foreground-mode change and instead have the Docker script 
use an extra 'wait', so the process is not lost in the background.

> Support Running Flink processes in foreground mode
> --
>
> Key: FLINK-4208
> URL: https://issues.apache.org/jira/browse/FLINK-4208
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Flink clusters are started in daemon mode by default; however, if we want to 
> start containers based on Flink, the execution context gets lost. Running 
> Flink processes in the foreground can fix this.





[jira] [Commented] (FLINK-4208) Support Running Flink processes in foreground mode

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380064#comment-15380064
 ] 

ASF GitHub Bot commented on FLINK-4208:
---

Github user iemejia closed the pull request at:

https://github.com/apache/flink/pull/2239


> Support Running Flink processes in foreground mode
> --
>
> Key: FLINK-4208
> URL: https://issues.apache.org/jira/browse/FLINK-4208
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Flink clusters are started in daemon mode by default; however, if we want to 
> start containers based on Flink, the execution context gets lost. Running 
> Flink processes in the foreground can fix this.





[GitHub] flink pull request #2239: [FLINK-4208] Support Running Flink processes in fo...

2016-07-15 Thread iemejia
Github user iemejia closed the pull request at:

https://github.com/apache/flink/pull/2239




[jira] [Commented] (FLINK-4209) Docker image breaks with multiple NICs

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380060#comment-15380060
 ] 

ASF GitHub Bot commented on FLINK-4209:
---

Github user iemejia commented on the issue:

https://github.com/apache/flink/pull/2240
  
I added this last commit to remove the dependency on supervisord; thanks to 
@greghogan for the 'wait' idea. Now Flink has the thinnest Docker image 
possible :).


> Docker image breaks with multiple NICs
> --
>
> Key: FLINK-4209
> URL: https://issues.apache.org/jira/browse/FLINK-4209
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-contrib
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Host resolution is currently done by IP in the Docker image scripts, which is 
> an issue when the system has multiple network cards. Resolving the host by 
> name instead fixes this.





[GitHub] flink issue #2240: [FLINK-4209] Fix issue on docker with multiple NICs and r...

2016-07-15 Thread iemejia
Github user iemejia commented on the issue:

https://github.com/apache/flink/pull/2240
  
I added this last commit to remove the dependency on supervisord; thanks to 
@greghogan for the 'wait' idea. Now Flink has the thinnest Docker image 
possible :).




[GitHub] flink issue #2239: [FLINK-4208] Support Running Flink processes in foregroun...

2016-07-15 Thread iemejia
Github user iemejia commented on the issue:

https://github.com/apache/flink/pull/2239
  
Thanks Greg, I was probably too tired last night because I put the 'wait' in a 
weird place. I just tried now and everything is working. It is still not 'real' 
foreground, since Ctrl-C gets captured by the 'wait', but it fixes the issue 
for the Docker image, so I am going to close this pull request and the issue.
If you or anybody else is still interested I can keep it open, but I am going 
to close it if you don't mind.





[jira] [Commented] (FLINK-4208) Support Running Flink processes in foreground mode

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380057#comment-15380057
 ] 

ASF GitHub Bot commented on FLINK-4208:
---

Github user iemejia commented on the issue:

https://github.com/apache/flink/pull/2239
  
Thanks Greg, I was probably too tired last night because I put the 'wait' in a 
weird place. I just tried now and everything is working. It is still not 'real' 
foreground, since Ctrl-C gets captured by the 'wait', but it fixes the issue 
for the Docker image, so I am going to close this pull request and the issue.
If you or anybody else is still interested I can keep it open, but I am going 
to close it if you don't mind.



> Support Running Flink processes in foreground mode
> --
>
> Key: FLINK-4208
> URL: https://issues.apache.org/jira/browse/FLINK-4208
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Flink clusters are started in daemon mode by default; however, if we want to 
> start containers based on Flink, the execution context gets lost. Running 
> Flink processes in the foreground can fix this.





[GitHub] flink issue #1947: [FLINK-1502] [WIP] Basic Metric System

2016-07-15 Thread sumitchawla
Github user sumitchawla commented on the issue:

https://github.com/apache/flink/pull/1947
  
Thanks a lot @zentol .. this is great.. 




[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379850#comment-15379850
 ] 

ASF GitHub Bot commented on FLINK-1502:
---

Github user sumitchawla commented on the issue:

https://github.com/apache/flink/pull/1947
  
Thanks a lot @zentol .. this is great.. 


> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Chesnay Schepler
>Priority: Minor
> Fix For: 1.1.0
>
>
> The metrics library makes it easy to expose collected metrics to other 
> systems such as Graphite, Ganglia, or JMX tools (e.g. VisualVM).





[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379835#comment-15379835
 ] 

ASF GitHub Bot commented on FLINK-1502:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/1947
  
@sumitchawla sure you can, as described here: 
https://ci.apache.org/projects/flink/flink-docs-master/apis/metrics.html#registering-metrics
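The registration pattern from the linked docs (register a named counter once, then increment it per record) can be mimicked with only the standard library. The class and metric names below imitate Flink's but are hypothetical, not the Flink API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Stdlib-only mimic of registering and incrementing a named counter. */
public class MetricRegistrationSketch {

    /** Hands out one counter per name, creating it on first request. */
    static class MetricGroup {
        final Map<String, AtomicLong> counters = new HashMap<>();

        AtomicLong counter(String name) {
            return counters.computeIfAbsent(name, k -> new AtomicLong());
        }
    }

    public static void main(String[] args) {
        MetricGroup group = new MetricGroup();
        AtomicLong records = group.counter("numRecordsMapped"); // register once, in open()
        for (String record : Arrays.asList("a", "b", "c")) {
            records.incrementAndGet();                          // inc() per map() call
        }
        System.out.println(group.counter("numRecordsMapped").get()); // prints 3
    }
}
```

In the real API the group comes from the runtime context rather than being constructed directly; see the linked documentation page for the exact calls.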


> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Chesnay Schepler
>Priority: Minor
> Fix For: 1.1.0
>
>
> The metrics library makes it easy to expose collected metrics to other 
> systems such as Graphite, Ganglia, or JMX tools (e.g. VisualVM).





[GitHub] flink issue #1947: [FLINK-1502] [WIP] Basic Metric System

2016-07-15 Thread zentol
Github user zentol commented on the issue:

https://github.com/apache/flink/pull/1947
  
@sumitchawla sure you can, as described here: 
https://ci.apache.org/projects/flink/flink-docs-master/apis/metrics.html#registering-metrics




[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379829#comment-15379829
 ] 

ASF GitHub Bot commented on FLINK-1502:
---

Github user sumitchawla commented on the issue:

https://github.com/apache/flink/pull/1947
  
@zentol .. by job writer I meant an end user writing jobs using the Flink API. 
As of now I can create custom accumulators using 
`getRuntimeContext().addAccumulator(ACCUMULATOR_NAME, ...)`; can I do something 
similar to register custom metrics in my transformations?


> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Chesnay Schepler
>Priority: Minor
> Fix For: 1.1.0
>
>
> The metrics library makes it easy to expose collected metrics to other 
> systems such as Graphite, Ganglia, or JMX tools (e.g. VisualVM).





[GitHub] flink issue #1947: [FLINK-1502] [WIP] Basic Metric System

2016-07-15 Thread sumitchawla
Github user sumitchawla commented on the issue:

https://github.com/apache/flink/pull/1947
  
@zentol .. by job writer I meant an end user writing jobs using the Flink API. 
As of now I can create custom accumulators using 
`getRuntimeContext().addAccumulator(ACCUMULATOR_NAME, ...)`; can I do something 
similar to register custom metrics in my transformations?




[jira] [Commented] (FLINK-1267) Add crossGroup operator

2016-07-15 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379790#comment-15379790
 ] 

Stephan Ewen commented on FLINK-1267:
-

Should we move the discussion to one of the two issues and mark the other as 
duplicate?

> Add crossGroup operator
> ---
>
> Key: FLINK-1267
> URL: https://issues.apache.org/jira/browse/FLINK-1267
> Project: Flink
>  Issue Type: New Feature
>  Components: DataSet API, Local Runtime, Optimizer
>Affects Versions: 0.7.0-incubating
>Reporter: Fabian Hueske
>Assignee: pietro pinoli
>Priority: Minor
>
> A common operation is to pair-wise compare or combine all elements of a group 
> (there were two questions about this on the user mailing list recently). 
> Right now, this can be done in two ways:
> 1. {{groupReduce}}: consume and store the complete iterator in memory and 
> build all pairs
> 2. do a self-{{Join}}: the engine builds all pairs of the full symmetric 
> Cartesian product.
> Both approaches have drawbacks. The {{groupReduce}} variant requires that the 
> full group fits into memory and is more cumbersome to implement for the user, 
> but pairs can be arbitrarily built. The self-{{Join}} approach pushes most of 
> the work into the system, but the execution strategy does not treat the 
> self-join differently from a regular join (both identical inputs are shuffled, 
> etc.) and always builds the full symmetric Cartesian product.
> I propose to add a dedicated {{crossGroup()}} operator, that offers this 
> functionality in a proper way.
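Approach 1 above (groupReduce) can be sketched in plain Java, independent of the Flink API; it also shows why the whole group must fit in memory before any pair can be emitted:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the groupReduce approach: materialize the group, emit all pairs. */
public class GroupPairsSketch {

    static <T> List<List<T>> pairs(Iterable<T> group) {
        List<T> buffer = new ArrayList<>();      // the whole group is held in memory
        for (T element : group) {
            buffer.add(element);
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < buffer.size(); i++) {
            for (int j = i + 1; j < buffer.size(); j++) {
                // j starts at i + 1, so self-pairs and symmetric duplicates are skipped,
                // unlike the full Cartesian product built by a self-join.
                out.add(Arrays.asList(buffer.get(i), buffer.get(j)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs(Arrays.asList("a", "b", "c")));
        // prints [[a, b], [a, c], [b, c]]
    }
}
```

A dedicated crossGroup() operator would push this pairing into the engine so the group need not be fully materialized by user code.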





[jira] [Commented] (FLINK-3397) Failed streaming jobs should fall back to the most recent checkpoint/savepoint

2016-07-15 Thread Ufuk Celebi (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379767#comment-15379767
 ] 

Ufuk Celebi commented on FLINK-3397:


Thanks for the reminder! :-)

The description of checkpoints and savepoints is mostly correct. Minor changes:

"Every time a new checkpoint is taken the older ones are discarded and only the 
latest is considered for any restoration on failure."
=> This is also configurable, you can keep around multiple completed 
checkpoints.

"These checkpointed state are never cleared unless the user wants to delete a 
savepoint and create a new one."
=> I would remove the last part "and create a new one" as this is independent 
of when savepoints are cleared. The important thing is that they are not 
automatically cleared.

The rest of the description is not correct:

"Any job submitted checks if there was a savepoint already available in the 
back end store."
=> This is not checked automatically; the user provides the savepoint path to 
resume from.

Regarding resuming jobs: if a job was submitted with a savepoint path to 
recover from, it will always fall back to that state in the worst case. What 
does not happen is a fallback to any newer savepoints (even if some were 
triggered). This is what you describe on page 2. In general, though, I would 
refrain from any time-based reasoning when talking about this; the checkpoint 
ID description is good.

All in all, it's great to see that you looked into the code before doing this. 
I fear, though, that these changes require some more consideration of how 
savepoints are stored and accessed. They are currently mostly independent of 
the job from which they were created.
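For illustration, the semantics proposed in this ticket (treat savepoints and checkpoints uniformly and restore from the highest checkpoint ID) might look like the following stdlib-only sketch; all names are hypothetical, not Flink's:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Sketch: pick the restore point with the highest checkpoint ID, whether it
 *  is a periodic checkpoint or a manually triggered savepoint. */
public class RestorePointSketch {

    static final class Snapshot {
        final long checkpointId;
        final boolean savepoint;

        Snapshot(long checkpointId, boolean savepoint) {
            this.checkpointId = checkpointId;
            this.savepoint = savepoint;
        }
    }

    /** The kind of snapshot is irrelevant; only the checkpoint ID orders them. */
    static Optional<Snapshot> latest(List<Snapshot> completed) {
        return completed.stream()
                .max(Comparator.comparingLong((Snapshot s) -> s.checkpointId));
    }

    public static void main(String[] args) {
        List<Snapshot> done = List.of(
                new Snapshot(7, false),  // periodic checkpoint
                new Snapshot(9, true),   // manually triggered savepoint
                new Snapshot(8, false)); // periodic checkpoint
        System.out.println(latest(done).get().checkpointId); // prints 9: savepoint wins
    }
}
```

Ordering by checkpoint ID rather than wall-clock time matches the "refrain from any time-based reasoning" point above.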

> Failed streaming jobs should fall back to the most recent checkpoint/savepoint
> --
>
> Key: FLINK-3397
> URL: https://issues.apache.org/jira/browse/FLINK-3397
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Streaming
>Affects Versions: 1.0.0
>Reporter: Gyula Fora
>Priority: Minor
> Attachments: FLINK-3397.pdf
>
>
> The current fallback behaviour in case of a streaming job failure is slightly 
> counterintuitive:
> If a job fails, it will fall back to the most recent checkpoint (if any), even 
> if a more recent savepoint was taken. This means that savepoints are not 
> regarded as checkpoints by the system, only as points from which a job can be 
> manually restarted.
> I suggest changing this so that savepoints are also regarded as checkpoints in 
> case of a failure and are also used to automatically restore the streaming job.





[GitHub] flink pull request #2259: [hotfix] Fixes broken TopSpeedWindowing scala exam...

2016-07-15 Thread kl0u
GitHub user kl0u opened a pull request:

https://github.com/apache/flink/pull/2259

[hotfix] Fixes broken TopSpeedWindowing scala example

The problem was that, when no input file was provided, the `fromCollection()` 
source never exited and the actual program never ran. The new source is 
identical to the corresponding Java test. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kl0u/flink fix_example

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2259


commit f235631596afaefc05c91f91fda05a7e301db661
Author: kl0u 
Date:   2016-07-15T16:20:04Z

[hotfix] Fixes broken TopSpeedWindowing scala example






[jira] [Commented] (FLINK-4201) Checkpoints for jobs in non-terminal state (e.g. suspended) get deleted

2016-07-15 Thread Ufuk Celebi (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379639#comment-15379639
 ] 

Ufuk Celebi commented on FLINK-4201:


The shutdown hook is actually not a problem, because it is only active in 
standalone recovery mode. The issue is that a suspended execution graph shuts 
down the checkpoint coordinator, which discards all checkpoints on shutdown. We 
still need to call shutdown in order to free resources like the timer task, but 
we have to skip discarding the checkpoints if the execution graph is suspended 
and not in a globally terminal state.
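A stdlib-only sketch of the fix described here (hypothetical names, not the actual CheckpointCoordinator code): shutdown always frees resources, but completed checkpoints are discarded only when the job reached a globally terminal state.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: conditional checkpoint discard on coordinator shutdown. */
public class CoordinatorShutdownSketch {

    enum JobStatus {
        FINISHED, CANCELED, FAILED, SUSPENDED;

        boolean isGloballyTerminal() {
            return this != SUSPENDED;
        }
    }

    final List<String> completedCheckpoints = new ArrayList<>();
    boolean timerStopped = false;

    void shutdown(JobStatus status) {
        timerStopped = true;                  // always free resources (e.g. timer task)
        if (status.isGloballyTerminal()) {
            completedCheckpoints.clear();     // job will never resume: safe to discard
        }
        // SUSPENDED: keep the checkpoints so the job can be recovered later
    }

    public static void main(String[] args) {
        CoordinatorShutdownSketch coordinator = new CoordinatorShutdownSketch();
        coordinator.completedCheckpoints.add("chk-42");
        coordinator.shutdown(JobStatus.SUSPENDED);
        System.out.println(coordinator.completedCheckpoints); // prints [chk-42]: retained
    }
}
```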

> Checkpoints for jobs in non-terminal state (e.g. suspended) get deleted
> ---
>
> Key: FLINK-4201
> URL: https://issues.apache.org/jira/browse/FLINK-4201
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Reporter: Stefan Richter
>Assignee: Ufuk Celebi
>Priority: Blocker
>
> For example, when shutting down a Yarn session, according to the logs 
> checkpoints for jobs that did not terminate are deleted. In the shutdown 
> hook, removeAllCheckpoints is called and removes checkpoints that should 
> still be kept.





[jira] [Assigned] (FLINK-4201) Checkpoints for jobs in non-terminal state (e.g. suspended) get deleted

2016-07-15 Thread Ufuk Celebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-4201:
--

Assignee: Ufuk Celebi

> Checkpoints for jobs in non-terminal state (e.g. suspended) get deleted
> ---
>
> Key: FLINK-4201
> URL: https://issues.apache.org/jira/browse/FLINK-4201
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Reporter: Stefan Richter
>Assignee: Ufuk Celebi
>Priority: Blocker
>
> For example, when shutting down a Yarn session, according to the logs 
> checkpoints for jobs that did not terminate are deleted. In the shutdown 
> hook, removeAllCheckpoints is called and removes checkpoints that should 
> still be kept.





[jira] [Commented] (FLINK-4213) Provide CombineHint in Gelly algorithms

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379631#comment-15379631
 ] 

ASF GitHub Bot commented on FLINK-4213:
---

Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2248
  
Yes, and the job plan attached to the ticket shows the only combine as a 
"Sorted Combine" (from distinct).


> Provide CombineHint in Gelly algorithms
> ---
>
> Key: FLINK-4213
> URL: https://issues.apache.org/jira/browse/FLINK-4213
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>
> Many graph algorithms will see better {{reduce}} performance with the 
> hash-combine compared with the still default sort-combine, e.g. HITS and 
> LocalClusteringCoefficient.
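For context, the hash-combine mentioned above pre-aggregates each partition in a hash table keyed by the grouping key, instead of sorting the partition first (the sort-combine). A stdlib-only illustration, not Flink's actual operator code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Sketch of a hash-based combine on one partition of (key, value) records. */
public class HashCombineSketch {

    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> partition,
                                        BinaryOperator<Integer> reduceFn) {
        Map<String, Integer> accumulator = new HashMap<>();
        for (Map.Entry<String, Integer> record : partition) {
            // One hash lookup per record; no sort of the partition is needed.
            accumulator.merge(record.getKey(), record.getValue(), reduceFn);
        }
        return accumulator;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> partition = Arrays.asList(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        System.out.println(combine(partition, Integer::sum)); // a -> 4, b -> 2
    }
}
```

Whether this beats the sort-combine depends on the number of distinct keys per partition, which is why exposing the hint per algorithm is useful.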





[GitHub] flink issue #2248: [FLINK-4213] [gelly] Provide CombineHint in Gelly algorit...

2016-07-15 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2248
  
Yes, and the job plan attached to the ticket shows the only combine as a 
"Sorted Combine" (from distinct).




[GitHub] flink issue #2256: [FLINK-4150] [runtime] Don't clean up BlobStore on BlobSe...

2016-07-15 Thread uce
Github user uce commented on the issue:

https://github.com/apache/flink/pull/2256
  
I don't know if we "want to", but it is the current behaviour. A job should 
only fail if its restart strategy is exhausted though. Do you think we should 
change that behaviour? 




[jira] [Commented] (FLINK-4150) Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379618#comment-15379618
 ] 

ASF GitHub Bot commented on FLINK-4150:
---

Github user uce commented on the issue:

https://github.com/apache/flink/pull/2256
  
I don't know if we "want to", but it is the current behaviour. A job should 
only fail if its restart strategy is exhausted though. Do you think we should 
change that behaviour? 


> Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
> 
>
> Key: FLINK-4150
> URL: https://issues.apache.org/jira/browse/FLINK-4150
> Project: Flink
>  Issue Type: Bug
>  Components: Job-Submission
>Reporter: Stefan Richter
>Assignee: Ufuk Celebi
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Submitting a job in Yarn with HA can lead to the following exception:
> {code}
> org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load 
> user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
> ClassLoader info: URL ClassLoader:
> file: 
> '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0'
>  (invalid JAR: zip file is empty)
> Class not resolvable through given classloader.
>   at 
> org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Some job information, including the Blob ids, is stored in Zookeeper. The 
> actual Blobs are stored in a dedicated BlobStore if the recovery mode is set 
> to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the 
> cluster is shut down, the path for the BlobStore is deleted. When the cluster 
> is then restarted, recovering jobs cannot restore because their Blob ids 
> stored in Zookeeper now point to deleted files.
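The failure mode described above can be sketched with plain file I/O. Everything here is illustrative only (file names, directory layout), not Flink's actual BlobStore or ZooKeeper handling: a reference to a blob survives in the metadata store, but the file it points to is deleted with the BlobStore path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StaleBlobReference {
    // Simulates recovery: the blob id (here, a file path) is still present in
    // the metadata store, but the file it points to has been deleted.
    public static String recover(Path blobPath) {
        try {
            return new String(Files.readAllBytes(blobPath));
        } catch (IOException e) {
            // This is what recovery hits: the id is valid, the file is gone.
            return "RECOVERY FAILED: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("blobStore");
        Path blob = Files.write(dir.resolve("blob_7fafffe9"), "job jar bytes".getBytes());

        System.out.println(recover(blob));  // succeeds while the blob exists

        Files.delete(blob);                 // "cluster shutdown" deletes the BlobStore path
        Files.delete(dir);
        System.out.println(recover(blob));  // the stored reference is now stale
    }
}
```

The fix discussed in the ticket amounts to making the lifetime of the BlobStore path match the lifetime of the references kept in ZooKeeper.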



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3279) Optionally disable DistinctOperator combiner

2016-07-15 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379606#comment-15379606
 ] 

Greg Hogan commented on FLINK-3279:
---

I fixed the wording of my comment. I think Fabian's suggestion was to 
investigate changing {{DistinctOperator}} from using {{GroupReduce}} to using 
{{Reduce}}. Then we could add {{setCombineHint}} to {{DistinctOperator}} rather 
than my suggestion above.

> Optionally disable DistinctOperator combiner
> 
>
> Key: FLINK-3279
> URL: https://issues.apache.org/jira/browse/FLINK-3279
> Project: Flink
>  Issue Type: New Feature
>  Components: DataSet API
>Affects Versions: 1.0.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>Priority: Minor
>
> Calling {{DataSet.distinct()}} executes {{DistinctOperator.DistinctFunction}} 
> which is a combinable {{RichGroupReduceFunction}}. Sometimes we know that 
> there will be few duplicate records and disabling the combine would improve 
> performance.
> I propose adding {{public DistinctOperator setCombinable(boolean 
> combinable)}} to {{DistinctOperator}}.





[jira] [Comment Edited] (FLINK-3279) Optionally disable DistinctOperator combiner

2016-07-15 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379599#comment-15379599
 ] 

Gabor Gevay edited comment on FLINK-3279 at 7/15/16 4:00 PM:
-

I think no. And a Jira is also needed for the sum, max, etc. aggregations. 
(Maybe these two things can be in one Jira.)


was (Author: ggevay):
https://issues.apache.org/jira/browse/FLINK-3479?






[jira] [Commented] (FLINK-3279) Optionally disable DistinctOperator combiner

2016-07-15 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379599#comment-15379599
 ] 

Gabor Gevay commented on FLINK-3279:


https://issues.apache.org/jira/browse/FLINK-3479?






[jira] [Comment Edited] (FLINK-3279) Optionally disable DistinctOperator combiner

2016-07-15 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379590#comment-15379590
 ] 

Greg Hogan edited comment on FLINK-3279 at 7/15/16 3:54 PM:


Do we have a ticket for porting Distinct from using groupReduce to using reduce?


was (Author: greghogan):
Do we have a ticket for porting groupReduce to reduce?






[jira] [Commented] (FLINK-3279) Optionally disable DistinctOperator combiner

2016-07-15 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379590#comment-15379590
 ] 

Greg Hogan commented on FLINK-3279:
---

Do we have a ticket for porting groupReduce to reduce?






[jira] [Resolved] (FLINK-4196) Remove "recoveryTimestamp"

2016-07-15 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-4196.
-
   Resolution: Fixed
Fix Version/s: 1.1.0

Fixed via de6a3d33ecfa689fd0da1ef661bbf6edb68e9d0b

> Remove "recoveryTimestamp"
> --
>
> Key: FLINK-4196
> URL: https://issues.apache.org/jira/browse/FLINK-4196
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.0.3
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
> Fix For: 1.1.0
>
>
> I think we should remove the {{recoveryTimestamp}} that is attached to state 
> restore calls.
> Given that this is a wall-clock timestamp taken on a master node, which may 
> change when clocks are adjusted and may differ between master nodes during a 
> leader change, this is an unsafe concept.
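Why a wall-clock stamp is unsafe across a leader change can be sketched with two hypothetical masters whose clocks disagree. All names and the skew value below are illustrative, not taken from Flink:

```java
public class RecoveryTimestampHazard {
    // Hypothetical skew: master B's wall clock runs five seconds behind master A's.
    static long masterAClock(long trueTimeMillis) { return trueTimeMillis; }
    static long masterBClock(long trueTimeMillis) { return trueTimeMillis - 5_000; }

    // A checkpoint stamped by A and a later recovery stamped by the new leader B
    // can compare in the wrong order, even though the recovery truly happened later.
    public static boolean recoveryAppearsBeforeCheckpoint(long checkpointTrueTime,
                                                          long recoveryTrueTime) {
        return masterBClock(recoveryTrueTime) < masterAClock(checkpointTrueTime);
    }
}
```

Any consumer that orders events by these stamps can be misled, which is the argument for removing the timestamp rather than trying to correct it.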





[jira] [Closed] (FLINK-4196) Remove "recoveryTimestamp"

2016-07-15 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-4196.
---






[jira] [Commented] (FLINK-1267) Add crossGroup operator

2016-07-15 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379563#comment-15379563
 ] 

Greg Hogan commented on FLINK-1267:
---

I think my FLINK-3910 is a duplicate of this idea. I am both buoyed by Fabian 
having submitted this idea and deflated by Stephan's assessments.

Could this be implemented as {{cross()}} in the same way that {{reduce()}} can 
be applied to either a grouped or full DataSet?

> Add crossGroup operator
> ---
>
> Key: FLINK-1267
> URL: https://issues.apache.org/jira/browse/FLINK-1267
> Project: Flink
>  Issue Type: New Feature
>  Components: DataSet API, Local Runtime, Optimizer
>Affects Versions: 0.7.0-incubating
>Reporter: Fabian Hueske
>Assignee: pietro pinoli
>Priority: Minor
>
> A common operation is to pair-wise compare or combine all elements of a group 
> (there were two questions about this on the user mailing list recently). 
> Right now, this can be done in two ways:
> 1. {{groupReduce}}: consume and store the complete iterator in memory and 
> build all pairs
> 2. do a self-{{Join}}: the engine builds all pairs of the full symmetric 
> Cartesian product.
> Both approaches have drawbacks. The {{groupReduce}} variant requires that the 
> full group fits into memory and is more cumbersome to implement for the user, 
> but pairs can be built arbitrarily. The self-{{Join}} approach pushes most of 
> the work into the system, but the execution strategy does not treat the 
> self-join differently from a regular join (both identical inputs are shuffled, 
> etc.) and always builds the full symmetric Cartesian product.
> I propose to add a dedicated {{crossGroup()}} operator, that offers this 
> functionality in a proper way.
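The {{groupReduce}} variant described above looks roughly like this in plain Java. This is a sketch of the mechanics only, not Flink's API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class GroupPairs {
    // The groupReduce approach: buffer the whole group, then emit all pairs.
    // A group of n elements is held in memory and yields n*(n-1)/2 pairs.
    public static <T> List<List<T>> pairs(Iterator<T> group) {
        List<T> buffer = new ArrayList<>();
        group.forEachRemaining(buffer::add);   // the full group must fit in memory
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < buffer.size(); i++) {
            for (int j = i + 1; j < buffer.size(); j++) {
                out.add(Arrays.asList(buffer.get(i), buffer.get(j)));
            }
        }
        return out;
    }
}
```

A dedicated {{crossGroup()}} operator could stream or spill the group instead of buffering it, and avoid the full symmetric product that the self-join strategy builds.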





[jira] [Commented] (FLINK-4213) Provide CombineHint in Gelly algorithms

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379557#comment-15379557
 ] 

ASF GitHub Bot commented on FLINK-4213:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2248
  
Does the deadlock occur only with the combiner, or also with the sorter?


> Provide CombineHint in Gelly algorithms
> ---
>
> Key: FLINK-4213
> URL: https://issues.apache.org/jira/browse/FLINK-4213
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>
> Many graph algorithms will see better {{reduce}} performance with the 
> hash-combine compared with the still default sort-combine, e.g. HITS and 
> LocalClusteringCoefficient.
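The difference between the two combine strategies can be sketched in plain Java. In Flink the actual switch is the {{setCombineHint(CombineHint.HASH)}} call on a reduce; the sketch below only contrasts the underlying mechanics:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombineStrategies {
    // Sort-based combine: sort records by key, then merge adjacent equal keys.
    // The O(n log n) sort dominates, but output comes out grouped and ordered.
    public static Map<String, Integer> sortCombine(List<Map.Entry<String, Integer>> records) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(records);
        sorted.sort(Map.Entry.comparingByKey());
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : sorted) {
            out.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return out;
    }

    // Hash-based combine: aggregate straight into a hash table, no sort at all.
    // Typically faster when the number of distinct keys fits in memory.
    public static Map<String, Integer> hashCombine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            out.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return out;
    }
}
```

Both strategies produce the same aggregates; the hash variant skips the sort, which is where the speedup for algorithms like HITS comes from.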





[jira] [Resolved] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-3466.
-
   Resolution: Fixed
Fix Version/s: 1.1.0

Fixed in e9f660d1ff5540c7ef829f2de5bb870b787c18b7

> Job might get stuck in restoreState() from HDFS due to interrupt
> 
>
> Key: FLINK-3466
> URL: https://issues.apache.org/jira/browse/FLINK-3466
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.1.0
>Reporter: Robert Metzger
>Assignee: Stephan Ewen
>Priority: Blocker
> Fix For: 1.1.0
>
>
> A user reported the following issue with a failing job:
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
> java.io.DataInputStream.read(DataInputStream.java:149)
> org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69)
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
> java.io.ObjectInputStream.(ObjectInputStream.java:299)
> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162)
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
> java.lang.Thread.run(Thread.java:745)
> {code}
> and 
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> java.lang.Throwable.fillInStackTrace(Native Method)
> java.lang.Throwable.fillInStackTrace(Throwable.java:783)
> java.lang.Throwable.(Throwable.java:250)
> java.lang.Exception.(Exception.java:54)
> java.lang.InterruptedException.(InterruptedException.java:57)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038)
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
> 

[GitHub] flink issue #2248: [FLINK-4213] [gelly] Provide CombineHint in Gelly algorit...

2016-07-15 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2248
  
Does the deadlock occur only with the combiner, or also with the sorter?




[jira] [Closed] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-3466.
---


[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379553#comment-15379553
 ] 

ASF GitHub Bot commented on FLINK-3466:
---

Github user StephanEwen closed the pull request at:

https://github.com/apache/flink/pull/2252



[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379552#comment-15379552
 ] 

ASF GitHub Bot commented on FLINK-3466:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2252
  
Manually merged in e9f660d1ff5540c7ef829f2de5bb870b787c18b7



[GitHub] flink pull request #2252: [FLINK-3466] [runtime] Cancel state handled on sta...

2016-07-15 Thread StephanEwen
Github user StephanEwen closed the pull request at:

https://github.com/apache/flink/pull/2252




[jira] [Updated] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen updated FLINK-3466:

Affects Version/s: (was: 0.10.2)
   (was: 1.0.0)
   1.1.0

> Job might get stuck in restoreState() from HDFS due to interrupt
> 
>
> Key: FLINK-3466
> URL: https://issues.apache.org/jira/browse/FLINK-3466
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.1.0
>Reporter: Robert Metzger
>Assignee: Stephan Ewen
>Priority: Blocker
>
> A user reported the following issue with a failing job:
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
> java.io.DataInputStream.read(DataInputStream.java:149)
> org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69)
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
> java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:55)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162)
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
> java.lang.Thread.run(Thread.java:745)
> {code}
> and 
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> java.lang.Throwable.fillInStackTrace(Native Method)
> java.lang.Throwable.fillInStackTrace(Throwable.java:783)
> java.lang.Throwable.<init>(Throwable.java:250)
> java.lang.Exception.<init>(Exception.java:54)
> java.lang.InterruptedException.<init>(InterruptedException.java:57)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038)
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
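The first trace above shows the task parked in `ConditionObject.awaitUninterruptibly`, which explains why the cancelling signal has no effect: an uninterruptible wait swallows `Thread.interrupt()` and only returns once signalled. A minimal standalone sketch of that behaviour (class name and timings are illustrative, not taken from Flink or HDFS):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class UninterruptibleWaitDemo {
    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();

        Thread waiter = new Thread(() -> {
            lock.lock();
            try {
                // awaitUninterruptibly() ignores interrupts: the thread keeps
                // waiting until it is signalled, which is why the cancel signal
                // cannot break the task out of the short-circuit slot allocation.
                cond.awaitUninterruptibly();
                System.out.println("woke up; interrupted flag = "
                        + Thread.currentThread().isInterrupted());
            } finally {
                lock.unlock();
            }
        });
        waiter.start();

        Thread.sleep(200);
        waiter.interrupt();          // has no visible effect on the wait
        Thread.sleep(200);

        lock.lock();
        try {
            cond.signal();           // only a signal releases the waiter
        } finally {
            lock.unlock();
        }
        waiter.join();
        // prints: woke up; interrupted flag = true
    }
}
```

The interrupt is merely re-asserted as a status flag once the signal arrives, so code that never checks the flag, like the HDFS slot allocation in the trace, simply keeps waiting.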
> 

[jira] [Updated] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen updated FLINK-3466:

Priority: Blocker  (was: Major)

> Job might get stuck in restoreState() from HDFS due to interrupt
> 
>
> Key: FLINK-3466
> URL: https://issues.apache.org/jira/browse/FLINK-3466
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.1.0
>Reporter: Robert Metzger
>Assignee: Stephan Ewen
>Priority: Blocker
>

[GitHub] flink issue #2252: [FLINK-3466] [runtime] Cancel state handled on state rest...

2016-07-15 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2252
  
Manually merged in e9f660d1ff5540c7ef829f2de5bb870b787c18b7




[jira] [Commented] (FLINK-3204) TaskManagers are not shutting down properly on YARN

2016-07-15 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379516#comment-15379516
 ] 

Robert Metzger commented on FLINK-3204:
---

I think this error still persists and also shows up in travis tests: 
https://s3.amazonaws.com/archive.travis-ci.org/jobs/144954182/log.txt

(From the /yarn-tests/container_1468587486405_0005_01_01/jobmanager.log 
file in 
https://s3.amazonaws.com/flink-logs-us/travis-artifacts/rmetzger/flink/1600/1600.5.tar.gz)

{code}
2016-07-15 12:59:46,808 INFO  org.apache.flink.yarn.YarnFlinkResourceManager
- Shutting down cluster with status SUCCEEDED : Flink YARN Client 
requested shutdown
2016-07-15 12:59:46,809 INFO  org.apache.flink.yarn.YarnFlinkResourceManager
- Unregistering application from the YARN Resource Manager
2016-07-15 12:59:46,817 INFO  
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Waiting for 
application to be successfully unregistered.
2016-07-15 12:59:46,832 INFO  org.apache.flink.yarn.YarnJobManager  
- Stopping JobManager 
akka.tcp://flink@172.17.3.43:49869/user/jobmanager.
2016-07-15 12:59:46,846 INFO  org.apache.flink.runtime.blob.BlobServer  
- Stopped BLOB server at 0.0.0.0:50790
2016-07-15 12:59:46,864 ERROR org.apache.flink.yarn.YarnJobManager  
- Executor could not execute task
java.util.concurrent.RejectedExecutionException
at 
scala.concurrent.forkjoin.ForkJoinPool.fullExternalPush(ForkJoinPool.java:1870)
at 
scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1834)
at 
scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
at 
scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:107)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
at 
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-07-15 12:59:46,920 INFO  
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - Interrupted 
while waiting for queue
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at 
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275)
2016-07-15 12:59:46,932 INFO  
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
Closing proxy : 
testing-worker-linux-docker-99c17e61-3364-linux-5.prod.travis-ci.org:38889
2016-07-15 12:59:46,987 INFO  
akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down 
remote daemon.
2016-07-15 12:59:46,987 INFO  
akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon 
shut down; proceeding with flushing remote transports.
2016-07-15 12:59:47,037 INFO  
akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut 
down.
{code}
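The `RejectedExecutionException` in the log above is the generic failure mode of handing work to an executor that has already begun shutdown; here a remote message arrives after the JobManager's execution context is stopped. A minimal standalone illustration (names are illustrative, unrelated to Akka or Flink internals):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedAfterShutdownDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.shutdown();    // mirrors the actor system tearing down its dispatcher

        try {
            pool.execute(() -> System.out.println("never runs"));
        } catch (RejectedExecutionException e) {
            // A late callback hitting an already-stopped executor fails like this.
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
        // prints: rejected: RejectedExecutionException
    }
}
```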


> TaskManagers are not shutting down properly on YARN
> ---
>
> Key: FLINK-3204
> URL: https://issues.apache.org/jira/browse/FLINK-3204
> Project: Flink
>  Issue Type: Bug
>  Components: YARN Client
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Robert Metzger
>  Labels: test-stability
>
> While running some experiments on a YARN cluster, I saw the following error
> {code}
> 10:15:24,741 INFO  
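The "Interrupted while waiting for queue" entries come from a callback thread parked in `LinkedBlockingQueue.take()` being interrupted during shutdown. A minimal standalone sketch of that mechanism (class and message names are illustrative, not from the YARN client):

```java
import java.util.concurrent.LinkedBlockingQueue;

public class InterruptedTakeDemo {
    public static void main(String[] args) throws Exception {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

        Thread handler = new Thread(() -> {
            try {
                // Parks until an element arrives, like the AMRMClient
                // callback-handler thread waiting on its event queue.
                queue.take();
            } catch (InterruptedException e) {
                // Shutdown interrupts the thread and take() aborts,
                // producing the "Interrupted while waiting for queue" log.
                System.out.println("interrupted while waiting for queue");
            }
        });
        handler.start();

        Thread.sleep(100);
        handler.interrupt();
        handler.join();
        // prints: interrupted while waiting for queue
    }
}
```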

[jira] [Updated] (FLINK-3204) TaskManagers are not shutting down properly on YARN

2016-07-15 Thread Robert Metzger (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-3204:
--
Labels: test-stability  (was: )

> TaskManagers are not shutting down properly on YARN
> ---
>
> Key: FLINK-3204
> URL: https://issues.apache.org/jira/browse/FLINK-3204
> Project: Flink
>  Issue Type: Bug
>  Components: YARN Client
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Robert Metzger
>  Labels: test-stability
>
> While running some experiments on a YARN cluster, I saw the following error
> {code}
> 10:15:24,741 INFO  org.apache.flink.yarn.YarnJobManager   
>- Stopping YARN JobManager with status SUCCEEDED and diagnostic Flink YARN 
> Client requested shutdown.
> 10:15:24,748 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  
>- Waiting for application to be successfully unregistered.
> 10:15:24,852 INFO  
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - 
> Interrupted while waiting for queue
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275)
> 10:15:24,875 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_10 when
> stopping NMClientImpl
> 10:15:24,899 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_07 when
> stopping NMClientImpl
> 10:15:24,954 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_06 when
> stopping NMClientImpl
> 10:15:24,982 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_09 when
> stopping NMClientImpl
> 10:15:25,013 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_11 when
> stopping NMClientImpl
> 10:15:25,037 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_08 when
> stopping NMClientImpl
> 10:15:25,041 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_12 when
> stopping NMClientImpl
> 10:15:25,072 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_05 when
> stopping NMClientImpl
> 10:15:25,075 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_03 when
> stopping NMClientImpl
> 10:15:25,077 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_04 when
> stopping NMClientImpl
> 10:15:25,079 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl
>- Failed to stop Container container_1452019681933_0002_01_02 when
> stopping NMClientImpl
> 10:15:25,080 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-0.c.astral-sorter-757.internal:8041
> 10:15:25,080 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-1.c.astral-sorter-757.internal:8041
> 10:15:25,080 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-master.c.astral-sorter-757.internal:8041
> 10:15:25,080 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-4.c.astral-sorter-757.internal:8041
> 10:15:25,081 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-2.c.astral-sorter-757.internal:8041
> 10:15:25,081 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-3.c.astral-sorter-757.internal:8041
> 10:15:25,081 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-5.c.astral-sorter-757.internal:8041
> 10:15:25,085 INFO  org.apache.flink.yarn.YarnJobManager   
>- 

[jira] [Updated] (FLINK-3204) TaskManagers are not shutting down properly on YARN

2016-07-15 Thread Robert Metzger (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-3204:
--
Affects Version/s: 1.1.0

> TaskManagers are not shutting down properly on YARN
> ---
>
> Key: FLINK-3204
> URL: https://issues.apache.org/jira/browse/FLINK-3204
> Project: Flink
>  Issue Type: Bug
>  Components: YARN Client
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Robert Metzger
>  Labels: test-stability
>

[jira] [Commented] (FLINK-4186) Expose Kafka metrics through Flink metrics

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379474#comment-15379474
 ] 

ASF GitHub Bot commented on FLINK-4186:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/2236


> Expose Kafka metrics through Flink metrics
> --
>
> Key: FLINK-4186
> URL: https://issues.apache.org/jira/browse/FLINK-4186
> Project: Flink
>  Issue Type: Improvement
>  Components: Kafka Connector
>Affects Versions: 1.1.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
> Fix For: 1.1.0
>
>
> Currently, we expose the Kafka metrics through Flink's accumulators.
> We can now use the metrics system in Flink to report Kafka metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (FLINK-4186) Expose Kafka metrics through Flink metrics

2016-07-15 Thread Robert Metzger (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger resolved FLINK-4186.
---
   Resolution: Fixed
Fix Version/s: 1.1.0

Resolved in http://git-wip-us.apache.org/repos/asf/flink/commit/41f58182

> Expose Kafka metrics through Flink metrics
> --
>
> Key: FLINK-4186
> URL: https://issues.apache.org/jira/browse/FLINK-4186
> Project: Flink
>  Issue Type: Improvement
>  Components: Kafka Connector
>Affects Versions: 1.1.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
> Fix For: 1.1.0
>
>
> Currently, we expose the Kafka metrics through Flink's accumulators.
> We can now use the metrics system in Flink to report Kafka metrics.





[GitHub] flink pull request #2236: [FLINK-4186] Use Flink metrics to report Kafka met...

2016-07-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/2236




[GitHub] flink issue #2236: [FLINK-4186] Use Flink metrics to report Kafka metrics

2016-07-15 Thread rmetzger
Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/2236
  
I'm merging the change ...




[jira] [Commented] (FLINK-4186) Expose Kafka metrics through Flink metrics

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379463#comment-15379463
 ] 

ASF GitHub Bot commented on FLINK-4186:
---

Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/2236
  
I'm merging the change ...


> Expose Kafka metrics through Flink metrics
> --
>
> Key: FLINK-4186
> URL: https://issues.apache.org/jira/browse/FLINK-4186
> Project: Flink
>  Issue Type: Improvement
>  Components: Kafka Connector
>Affects Versions: 1.1.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>
> Currently, we expose the Kafka metrics through Flink's accumulators.
> We can now use the metrics system in Flink to report Kafka metrics.





[GitHub] flink issue #2220: [FLINK-4184] [metrics] Replace invalid characters in Sche...

2016-07-15 Thread zentol
Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2220
  
I would appreciate it if you would give me time to respond before going ahead with a merge.




[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379459#comment-15379459
 ] 

ASF GitHub Bot commented on FLINK-4184:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2220
  
I would appreciate it if you would give me time to respond before going ahead with a merge.


> Ganglia and GraphiteReporter report metric names with invalid characters
> 
>
> Key: FLINK-4184
> URL: https://issues.apache.org/jira/browse/FLINK-4184
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
> Fix For: 1.1.0
>
>
> Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with 
> names which contain invalid characters. For example, quotes are not filtered 
> out which can be problematic for Ganglia. Moreover, dots are not replaced 
> which causes Graphite to think that an IP address is actually a scoped metric 
> name.
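As a rough illustration of the kind of filtering this issue calls for, a reporter can sanitize metric names before submission: strip the quotes that break Ganglia and replace the dots that Graphite treats as scope separators. The replacement rules below are assumptions for illustration, not Flink's actual implementation:

```java
public class MetricNameSanitizer {
    // Hypothetical filter mirroring the problems described above; the exact
    // character set and replacement rules in Flink's reporters may differ.
    static String sanitize(String name) {
        return name.replace("\"", "").replace('.', '-');
    }

    public static void main(String[] args) {
        // A host-scoped metric name: quotes removed, IP dots no longer
        // create spurious Graphite hierarchy levels.
        System.out.println(sanitize("taskmanager.\"10.0.0.1\".numRecordsIn"));
        // prints: taskmanager-10-0-0-1-numRecordsIn
    }
}
```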





[jira] [Commented] (FLINK-4104) Restructure Gelly docs

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379458#comment-15379458
 ] 

ASF GitHub Bot commented on FLINK-4104:
---

Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2258
  
What do you think, @vasia?

Is it a problem to rename `gelly_guide.html` to `gelly/index.html` as done 
with ML? Also, I manufactured a TOC on the top-level Gelly page.


> Restructure Gelly docs
> --
>
> Key: FLINK-4104
> URL: https://issues.apache.org/jira/browse/FLINK-4104
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>Priority: Minor
> Fix For: 1.1.0
>
>
> The Gelly documentation has grown sufficiently long to suggest dividing into 
> sub-pages. Leave "Using Gelly" on the main page and link to the following 
> topics as sub-pages:
> * Graph API
> * Iterative Graph Processing
> * Library Methods
> * Graph Algorithms
> * Graph Generators





[GitHub] flink issue #2258: [FLINK-4104] [docs] Restructure Gelly docs

2016-07-15 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2258
  
What do you think, @vasia?

Is it a problem to rename `gelly_guide.html` to `gelly/index.html` as done 
with ML? Also, I manufactured a TOC on the top-level Gelly page.




[jira] [Commented] (FLINK-4104) Restructure Gelly docs

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379454#comment-15379454
 ] 

ASF GitHub Bot commented on FLINK-4104:
---

GitHub user greghogan opened a pull request:

https://github.com/apache/flink/pull/2258

[FLINK-4104] [docs] Restructure Gelly docs

Split the Gelly documentation into five sub-pages.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/greghogan/flink 4104_restructure_gelly_docs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2258


commit 844cbd18c8aaa08b6fa13e43d5974781b7a05197
Author: Greg Hogan 
Date:   2016-07-15T14:19:42Z

[FLINK-4104] [docs] Restructure Gelly docs

Split the Gelly documentation into five sub-pages.




> Restructure Gelly docs
> --
>
> Key: FLINK-4104
> URL: https://issues.apache.org/jira/browse/FLINK-4104
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>Priority: Minor
> Fix For: 1.1.0
>
>
> The Gelly documentation has grown sufficiently long to suggest dividing into 
> sub-pages. Leave "Using Gelly" on the main page and link to the following 
> topics as sub-pages:
> * Graph API
> * Iterative Graph Processing
> * Library Methods
> * Graph Algorithms
> * Graph Generators






[jira] [Commented] (FLINK-4150) Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379447#comment-15379447
 ] 

ASF GitHub Bot commented on FLINK-4150:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/2256
  
Just a quick question. Do we also want to remove failed jobs from the 
BlobStore and ZK? Or only finished or cancelled jobs?


> Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
> 
>
> Key: FLINK-4150
> URL: https://issues.apache.org/jira/browse/FLINK-4150
> Project: Flink
>  Issue Type: Bug
>  Components: Job-Submission
>Reporter: Stefan Richter
>Assignee: Ufuk Celebi
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Submitting a job in Yarn with HA can lead to the following exception:
> {code}
> org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load 
> user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
> ClassLoader info: URL ClassLoader:
> file: 
> '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0'
>  (invalid JAR: zip file is empty)
> Class not resolvable through given classloader.
>   at 
> org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Some job information, including the Blob ids, is stored in Zookeeper. The 
> actual Blobs are stored in a dedicated BlobStore if the recovery mode is set 
> to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the 
> cluster is shut down, the path for the BlobStore is deleted. When the cluster 
> is then restarted, recovering jobs cannot be restored because their Blob ids 
> stored in Zookeeper now point to deleted files.






[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379445#comment-15379445
 ] 

ASF GitHub Bot commented on FLINK-4184:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/2220#discussion_r70979472
  
--- Diff: 
flink-metrics/flink-metrics-dropwizard/src/main/java/org/apache/flink/dropwizard/ScheduledDropwizardReporter.java
 ---
@@ -74,6 +75,15 @@ protected ScheduledDropwizardReporter() {
}
 
// 

+   //  Getters
+   // 

+
+   // used for testing purposes
+   Map getCounters() {
--- End diff --

You could also access the protected registry and get the counters from 
there.

We may not even need the gauges/counters/histograms fields in the 
ScheduledDropwizardReporter.


> Ganglia and GraphiteReporter report metric names with invalid characters
> 
>
> Key: FLINK-4184
> URL: https://issues.apache.org/jira/browse/FLINK-4184
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
> Fix For: 1.1.0
>
>
> Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with 
> names which contain invalid characters. For example, quotes are not filtered 
> out which can be problematic for Ganglia. Moreover, dots are not replaced 
> which causes Graphite to think that an IP address is actually a scoped metric 
> name.
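A minimal sketch of the kind of character filtering such a fix needs is shown below. The replacement rules here are illustrative assumptions, not the exact filter used by Flink's reporters:

```java
// Sketch of sanitizing metric names for Ganglia/Graphite.
// The replacement rules below are illustrative assumptions,
// not Flink's actual filter implementation.
public class MetricNameSanitizer {
    public static String sanitize(String name) {
        return name
            .replace("\"", "")  // quotes are problematic for Ganglia
            .replace(".", "-")  // Graphite treats dots as scope separators
            .replace(" ", "_"); // spaces are unsafe in metric names
    }

    public static void main(String[] args) {
        // e.g. a host scope containing an IP address
        System.out.println(sanitize("taskmanager.\"127.0.0.1\".numRecords"));
        // prints: taskmanager-127-0-0-1-numRecords
    }
}
```

Without the dot replacement, Graphite would split `127.0.0.1` into four nested scopes, which is exactly the misbehaviour described in the issue.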






[jira] [Closed] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters

2016-07-15 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-4184.

Resolution: Fixed

Fixed via 70094a1818b532f4e0ff31f5debc550ed4336286

> Ganglia and GraphiteReporter report metric names with invalid characters
> 
>
> Key: FLINK-4184
> URL: https://issues.apache.org/jira/browse/FLINK-4184
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
> Fix For: 1.1.0
>
>
> Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with 
> names which contain invalid characters. For example, quotes are not filtered 
> out which can be problematic for Ganglia. Moreover, dots are not replaced 
> which causes Graphite to think that an IP address is actually a scoped metric 
> name.





[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379416#comment-15379416
 ] 

ASF GitHub Bot commented on FLINK-4184:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/2220


> Ganglia and GraphiteReporter report metric names with invalid characters
> 
>
> Key: FLINK-4184
> URL: https://issues.apache.org/jira/browse/FLINK-4184
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
> Fix For: 1.1.0
>
>
> Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with 
> names which contain invalid characters. For example, quotes are not filtered 
> out which can be problematic for Ganglia. Moreover, dots are not replaced 
> which causes Graphite to think that an IP address is actually a scoped metric 
> name.







[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379409#comment-15379409
 ] 

ASF GitHub Bot commented on FLINK-4184:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/2220
  
Thanks for the review @zentol. Will be merging this PR then.


> Ganglia and GraphiteReporter report metric names with invalid characters
> 
>
> Key: FLINK-4184
> URL: https://issues.apache.org/jira/browse/FLINK-4184
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
> Fix For: 1.1.0
>
>
> Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with 
> names which contain invalid characters. For example, quotes are not filtered 
> out which can be problematic for Ganglia. Moreover, dots are not replaced 
> which causes Graphite to think that an IP address is actually a scoped metric 
> name.





[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379407#comment-15379407
 ] 

ASF GitHub Bot commented on FLINK-4184:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/2220#discussion_r70977551
  
--- Diff: 
flink-metrics/flink-metrics-dropwizard/src/main/java/org/apache/flink/dropwizard/ScheduledDropwizardReporter.java
 ---
@@ -74,6 +75,15 @@ protected ScheduledDropwizardReporter() {
}
 
// 

+   //  Getters
+   // 

+
+   // used for testing purposes
+   Map getCounters() {
--- End diff --

We can, if we mark the counters, gauges and histograms fields as protected. 
But then we would expose the implementation details to all sub-classes instead 
of having a getter which is package private. I think the latter option is a bit 
nicer, because it hides the implementation details.
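The trade-off discussed here, a package-private getter versus a protected field, can be illustrated as follows (class and field names are hypothetical, not the actual ScheduledDropwizardReporter code):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reporter illustrating the encapsulation trade-off:
// the field stays private, and tests in the same package can still
// inspect it through a package-private getter.
public class ReporterSketch {
    private final Map<String, Long> counters = new HashMap<>();

    public void inc(String name) {
        counters.merge(name, 1L, Long::sum);
    }

    // Package-private: visible to tests in this package,
    // invisible to sub-classes in other packages.
    Map<String, Long> getCounters() {
        return Collections.unmodifiableMap(counters);
    }
}
```

Marking the field itself `protected` would instead expose the mutable map to every sub-class, which is the implementation leak the comment argues against.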


> Ganglia and GraphiteReporter report metric names with invalid characters
> 
>
> Key: FLINK-4184
> URL: https://issues.apache.org/jira/browse/FLINK-4184
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
> Fix For: 1.1.0
>
>
> Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with 
> names which contain invalid characters. For example, quotes are not filtered 
> out which can be problematic for Ganglia. Moreover, dots are not replaced 
> which causes Graphite to think that an IP address is actually a scoped metric 
> name.







[jira] [Commented] (FLINK-4152) TaskManager registration exponential backoff doesn't work

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379396#comment-15379396
 ] 

ASF GitHub Bot commented on FLINK-4152:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/2257

[FLINK-4152] Allow re-registration of TMs at resource manager

The `YarnFlinkResourceManager` does not allow the `JobManager` to 
re-register task managers that had been registered at the resource manager 
before. The consequence of the refusal is that the job manager rejects the 
registrations of these task managers. 

Such a scenario can happen if a `JobManager` loses leadership in an HA 
setting after it registered 
some TMs. The old behaviour was that the resource manager clears the list 
of registered workers and only accepts new registrations of task managers which 
have been started in a fresh container. However, in case the previously 
registered TMs didn't die, they will try to reconnect to the new leader. The 
new leader will then ask the resource manager whether the TMs represent valid 
resources. Since the resource manager forgot about the already started 
containers, it rejects the TMs.

This PR changes the behaviour of the resource manager such that it can no 
longer reject TMs. Instead of being asked it will simply be informed about the 
registered TMs by the JM. If the TM happens to be running in a container which 
was started by the RM, then it will monitor this container. In case that this 
container dies, the RM will notify the JM about the death of the TM.

In that sense, the RM no longer has the authority to interfere with the 
JM-TM interactions; instead, it is simply used as an additional monitoring 
service to detect dead TMs as a result of a failed container.

Furthermore, the PR adds a de-duplication method to filter out concurrent 
registration runs on the task manager. Before, it happened that a 
`RefusedRegistration` triggered a new registration run without cancelling the 
old registration run. This could lead to a massive amount of registration 
messages if the TaskManager's registration was refused multiple times.

The mechanism to de-duplicate `TriggerTaskManagerRegistration` works by 
assigning a registration run id which is changed whenever a new registration 
run is started. `TriggerTaskManagerRegistration` messages which have an 
outdated registration run id are then filtered out.
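The run-id mechanism described above can be sketched roughly like this (names are hypothetical; the actual TaskManager code differs in detail):

```java
import java.util.UUID;

// Sketch of filtering outdated registration-trigger messages by run id.
// Names are illustrative, not Flink's actual TaskManager implementation.
public class RegistrationRuns {
    private UUID currentRunId;

    // Starting a new registration run invalidates all pending triggers
    // that still carry the previous run id.
    public synchronized UUID startNewRun() {
        currentRunId = UUID.randomUUID();
        return currentRunId;
    }

    // A TriggerTaskManagerRegistration-style message is only processed
    // if its run id matches the current one; stale triggers are dropped.
    public synchronized boolean isCurrent(UUID messageRunId) {
        return messageRunId != null && messageRunId.equals(currentRunId);
    }
}
```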


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink 
FLINK-4152_YarnResourceManager

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2257


commit 2a29fa4df98fd63a7f27e1607152cd6e54d25ad1
Author: Till Rohrmann 
Date:   2016-07-15T08:51:59Z

Add YarnFlinkResourceManager test to reaccept task manager registrations 
from a re-elected job manager

commit 6462d4750d79512fc93bbc60ca754c99142d1794
Author: Till Rohrmann 
Date:   2016-07-15T09:50:35Z

Remove unnecessary sync logic between JobManager and ResourceManager

commit 55ce0c01783d9c4927a9f4677309c805e2b624e7
Author: Till Rohrmann 
Date:   2016-07-15T10:12:12Z

Avoid duplicate registration attempts in case of a refused registration

commit aeb6ae5435dd88c640dc314b9ec31357815d080c
Author: Till Rohrmann 
Date:   2016-07-15T13:04:54Z

Add test case to check that not an excessive amount of RegisterTaskManager 
messages are sent




> TaskManager registration exponential backoff doesn't work
> -
>
> Key: FLINK-4152
> URL: https://issues.apache.org/jira/browse/FLINK-4152
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, TaskManager, YARN Client
>Reporter: Robert Metzger
>Assignee: Till Rohrmann
> Attachments: logs.tgz
>
>
> While testing Flink 1.1 I've found that the TaskManagers are logging many 
> messages when registering at the JobManager.
> This is the log file: 
> https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294
> It's logging more than 3000 messages in less than a minute. I don't think that 
> this is the expected behavior.
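For reference, a correctly working exponential backoff between registration attempts would look roughly like the sketch below. The initial and maximum delays are illustrative assumptions, not Flink's configured values:

```java
import java.time.Duration;

// Sketch of exponential backoff for registration attempts:
// the delay doubles per attempt until it hits a cap,
// so retries cannot flood the JobManager with messages.
public class RegistrationBackoff {
    private static final long INITIAL_MS = 500;   // assumed initial delay
    private static final long MAX_MS = 30_000;    // assumed cap

    // Delay to wait before the given (0-based) retry attempt.
    public static Duration delayFor(int attempt) {
        long shift = Math.min(attempt, 20); // guard against shift overflow
        long delay = INITIAL_MS << shift;
        return Duration.ofMillis(Math.min(delay, MAX_MS));
    }
}
```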





[jira] [Commented] (FLINK-4150) Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379374#comment-15379374
 ] 

ASF GitHub Bot commented on FLINK-4150:
---

GitHub user uce opened a pull request:

https://github.com/apache/flink/pull/2256

[FLINK-4150] [runtime] Don't clean up BlobStore on BlobServer shut down

The `BlobServer` acts as a local cache for uploaded BLOBs. The life-cycle 
of each BLOB is bound to the life-cycle of the `BlobServer`. If the BlobServer 
shuts down (on JobManager shut down), all local files will be removed.

With HA, BLOBs are persisted to another file system (e.g. HDFS) via the 
`BlobStore` in order to have BLOBs available after a JobManager failure (or 
shut down). These BLOBs are only allowed to be removed when the job that 
requires them enters a globally terminal state (`FINISHED`, `CANCELLED`, 
`FAILED`).

This commit removes the `BlobStore` clean up call from the `BlobServer` 
shutdown. The `BlobStore` files will only be cleaned up via the 
`BlobLibraryCacheManager`'s clean up task (periodically or on 
BlobLibraryCacheManager shutdown). This means that there is a chance that BLOBs 
will linger around after the job has terminated, if the job manager fails 
before the clean up.
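The life-cycle rule stated above, that BLOBs may only be removed once the owning job reaches a globally terminal state, can be expressed as a small check (names are illustrative, not Flink's actual API):

```java
// Sketch: HA-persisted BLOBs may only be deleted once the owning job
// has reached a globally terminal state. Names are illustrative.
public class BlobCleanupRule {
    public enum JobStatus { CREATED, RUNNING, FINISHED, CANCELLED, FAILED }

    public static boolean isGloballyTerminal(JobStatus s) {
        return s == JobStatus.FINISHED
            || s == JobStatus.CANCELLED
            || s == JobStatus.FAILED;
    }

    // Cleanup must be a no-op while the job can still recover,
    // e.g. during a JobManager failover.
    public static boolean mayDeleteBlobs(JobStatus s) {
        return isGloballyTerminal(s);
    }
}
```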

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uce/flink 4150-blobstore

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2256.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2256


commit 0d4522270881dbbb7164130f47f9d4df617c19c5
Author: Ufuk Celebi 
Date:   2016-07-14T14:29:49Z

[FLINK-4150] [runtime] Don't clean up BlobStore on BlobServer shut down

The `BlobServer` acts as a local cache for uploaded BLOBs. The life-cycle of
each BLOB is bound to the life-cycle of the `BlobServer`. If the BlobServer
shuts down (on JobManager shut down), all local files will be removed.

With HA, BLOBs are persisted to another file system (e.g. HDFS) via the
`BlobStore` in order to have BLOBs available after a JobManager failure (or
shut down). These BLOBs are only allowed to be removed when the job that
requires them enters a globally terminal state (`FINISHED`, `CANCELLED`,
`FAILED`).

This commit removes the `BlobStore` clean up call from the `BlobServer`
shutdown. The `BlobStore` files will only be cleaned up via the
`BlobLibraryCacheManager`'s clean up task (periodically or on
BlobLibraryCacheManager shutdown). This means that there is a chance that
BLOBs will linger around after the job has terminated, if the job manager
fails before the clean up.




> Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
> 
>
> Key: FLINK-4150
> URL: https://issues.apache.org/jira/browse/FLINK-4150
> Project: Flink
>  Issue Type: Bug
>  Components: Job-Submission
>Reporter: Stefan Richter
>Assignee: Ufuk Celebi
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Submitting a job in Yarn with HA can lead to the following exception:
> {code}
> org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load 
> user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
> ClassLoader info: URL ClassLoader:
> file: 
> '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0'
>  (invalid JAR: zip file is empty)
> Class not resolvable through given classloader.
>   at 
> org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Some job information, including the Blob ids, is stored in Zookeeper. The 
> actual Blobs are stored in a dedicated BlobStore if the recovery mode is set 
> to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the 
> cluster is shut down, the path for the BlobStore is deleted. When the cluster 
> is then restarted, recovering jobs cannot be restored because their Blob ids 
> stored in Zookeeper now point to deleted files.






[jira] [Assigned] (FLINK-4150) Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown

2016-07-15 Thread Ufuk Celebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi reassigned FLINK-4150:
--

Assignee: Ufuk Celebi

> Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
> 
>
> Key: FLINK-4150
> URL: https://issues.apache.org/jira/browse/FLINK-4150
> Project: Flink
>  Issue Type: Bug
>  Components: Job-Submission
>Reporter: Stefan Richter
>Assignee: Ufuk Celebi
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Submitting a job in Yarn with HA can lead to the following exception:
> {code}
> org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load 
> user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
> ClassLoader info: URL ClassLoader:
> file: 
> '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0'
>  (invalid JAR: zip file is empty)
> Class not resolvable through given classloader.
>   at 
> org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Some job information, including the Blob ids, is stored in Zookeeper. The 
> actual Blobs are stored in a dedicated BlobStore, if the recovery mode is set 
> to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the 
> cluster is shut down, the path for the BlobStore is deleted. When the cluster 
> is then restarted, recovering jobs cannot restore because their Blob ids 
> stored in Zookeeper now point to deleted files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4211) Dynamic Properties not working for jobs submitted to Yarn session

2016-07-15 Thread Ufuk Celebi (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379336#comment-15379336
 ] 

Ufuk Celebi commented on FLINK-4211:


The problem is that the YARN session config and the CLI config might diverge, 
e.g. a user starts a YARN session with 
{{recovery.zookeeper.path.root=/flink/xyz}}, which is important both for the 
started cluster and for the client to discover that cluster.

I don't think this is a blocker for the release, but it should be addressed.

> Dynamic Properties not working for jobs submitted to Yarn session
> -
>
> Key: FLINK-4211
> URL: https://issues.apache.org/jira/browse/FLINK-4211
> Project: Flink
>  Issue Type: Bug
>  Components: YARN Client
>Reporter: Stefan Richter
>
> The command line argument for dynamic properties (-D) is not working when 
> submitting jobs to a flink session.
> Example:
> {code}
> bin/flink run -p 4 myJob.jar -D recovery.zookeeper.path.root=/flink/xyz
> {code}





[jira] [Closed] (FLINK-4142) Recovery problem in HA on Hadoop Yarn 2.4.1

2016-07-15 Thread Ufuk Celebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-4142.
--

Added note to docs about this in d08b189 (master).

> Recovery problem in HA on Hadoop Yarn 2.4.1
> ---
>
> Key: FLINK-4142
> URL: https://issues.apache.org/jira/browse/FLINK-4142
> Project: Flink
>  Issue Type: Bug
>  Components: YARN Client
>Affects Versions: 1.0.3
>Reporter: Stefan Richter
>Assignee: Robert Metzger
> Fix For: 1.1.0
>
>
> On Hadoop Yarn 2.4.1, recovery in HA fails in the following scenario:
> 1) Kill application master, let it recover normally.
> 2) After that, kill a task manager.
> Now, Yarn tries to restart the killed task manager in an endless loop. 





[jira] [Closed] (FLINK-4182) HA recovery not working properly under ApplicationMaster failures.

2016-07-15 Thread Ufuk Celebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-4182.
--
Resolution: Duplicate

The quoted stack trace is a duplicate of FLINK-4150. The inconsistent result 
might be caused by accidentally running multiple jobs. I'm closing this for 
now. If inconsistencies still occur, we need some more information to re-open 
this.

> HA recovery not working properly under ApplicationMaster failures.
> --
>
> Key: FLINK-4182
> URL: https://issues.apache.org/jira/browse/FLINK-4182
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, State Backends, Checkpointing
>Affects Versions: 1.0.3
>Reporter: Stefan Richter
>Priority: Blocker
>
> When randomly killing TaskManager and ApplicationMaster, a job sometimes does 
> not properly recover in HA mode.
> There can be different symptoms for this. For example, in one case the job is 
> dying with the following exception:
> {code}
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: The program 
> execution failed: Cannot set up the user code libraries: Cannot get library 
> with hash 7fafffe9595cd06aff213b81b5da7b1682e1d6b0
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:413)
>   at 
> org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:208)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:389)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1381)
>   at 
> da.testing.StreamingStateMachineJob.main(StreamingStateMachineJob.java:61)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:509)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:331)
>   at 
> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:738)
>   at org.apache.flink.client.CliFrontend.run(CliFrontend.java:251)
>   at 
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:966)
>   at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1009)
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Cannot set 
> up the user code libraries: Cannot get library with hash 
> 7fafffe9595cd06aff213b81b5da7b1682e1d6b0
>   at 
> org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1089)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:506)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:105)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>   at 
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at 

[jira] [Commented] (FLINK-4218) Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." causes task restarting

2016-07-15 Thread Sergii Koshel (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379330#comment-15379330
 ] 

Sergii Koshel commented on FLINK-4218:
--

According to 
https://github.com/Aloisius/hadoop-s3a/blob/master/src/main/java/org/apache/hadoop/fs/s3a/S3AOutputStream.java
 *close()* means the data has been uploaded to the S3 bucket. And according to 
*read-after-write* consistency, it should be visible from everywhere.

So I still don't see how a *java.io.FileNotFoundException* may happen 
after *close()*.

> Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." 
> causes task restarting
> --
>
> Key: FLINK-4218
> URL: https://issues.apache.org/jira/browse/FLINK-4218
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sergii Koshel
>
> Sporadically I see an exception as below, and the task restarts because of it.
> {code:title=Exception|borderStyle=solid}
> java.lang.RuntimeException: Error triggering a checkpoint as the result of 
> receiving checkpoint barrier
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:785)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:775)
>   at 
> org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:203)
>   at 
> org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:129)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:183)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:66)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:265)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3:///flink/checkpoints/ece317c26960464ba5de75f3bbc38cb2/chk-8810/96eebbeb-de14-45c7-8ebb-e7cde978d6d3
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:996)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
>   at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:351)
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFileStateHandle.getFileSize(AbstractFileStateHandle.java:93)
>   at 
> org.apache.flink.runtime.state.filesystem.FileStreamStateHandle.getStateSize(FileStreamStateHandle.java:58)
>   at 
> org.apache.flink.runtime.state.AbstractStateBackend$DataInputViewHandle.getStateSize(AbstractStateBackend.java:482)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskStateList.getStateSize(StreamTaskStateList.java:77)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:604)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:779)
>   ... 8 more
> {code}
> File actually exists on S3. 
> I suppose it is related to some race conditions with S3, but it would be good 
> to retry a few times before stopping task execution.
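The retry suggested above could be sketched as a small wrapper around the metadata lookup, retrying on `FileNotFoundException` with a short backoff. This is a minimal sketch under the assumption that the listing is only eventually consistent; all names here are illustrative and not Flink API.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

/** Sketch: retry an S3 metadata lookup that may briefly lag behind a
 *  successful close(). Illustrative only, not the real Flink code. */
public class RetryingLookup {
    static <T> T retryOnNotFound(Callable<T> action, int attempts, long backoffMs)
            throws Exception {
        FileNotFoundException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return action.call();
            } catch (FileNotFoundException e) {
                last = e;                    // listing may be eventually consistent
                Thread.sleep(backoffMs);     // brief pause before the next attempt
            }
        }
        throw last;                          // give up after the final attempt
    }

    public static void main(String[] args) throws Exception {
        // Simulate a lookup that fails twice before the key becomes visible.
        final int[] calls = {0};
        long size = retryOnNotFound(() -> {
            if (++calls[0] < 3) {
                throw new FileNotFoundException("key not visible yet");
            }
            return 1024L;
        }, 5, 10L);
        System.out.println("size=" + size + " after " + calls[0] + " attempts");
    }
}
```

A bounded attempt count keeps a genuinely missing file from blocking the checkpoint forever, while tolerating a short visibility lag.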





[GitHub] flink issue #2251: [FLINK-4212] [scripts] Lock PID file when starting daemon...

2016-07-15 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2251
  
Looks like it is available in Linux and BSD but not MacOS. We can skip 
locking if `flock` is not available. The race condition only manifests when 
starting duplicate managers on the same system in parallel.




[jira] [Commented] (FLINK-4212) Lock PID file when starting daemons

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379313#comment-15379313
 ] 

ASF GitHub Bot commented on FLINK-4212:
---

Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2251
  
Looks like it is available in Linux and BSD but not MacOS. We can skip 
locking if `flock` is not available. The race condition only manifests when 
starting duplicate managers on the same system in parallel.


> Lock PID file when starting daemons
> ---
>
> Key: FLINK-4212
> URL: https://issues.apache.org/jira/browse/FLINK-4212
> Project: Flink
>  Issue Type: Improvement
>  Components: Startup Shell Scripts
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>
> As noted on the mailing list (0), when multiple TaskManagers are started in 
> parallel (using pdsh) there is a race condition on updating the pid: 1) the 
> pid file is first read to parse the process' index, 2) the process is 
> started, and 3) on success the daemon pid is appended to the pid file.
> We could use a tool such as {{flock}} to lock on the pid file while starting 
> the Flink daemon.
> 0: 
> http://mail-archives.apache.org/mod_mbox/flink-user/201607.mbox/%3CCA%2BssbKXw954Bz_sBRwP6db0FntWyGWzTyP7wJZ5nhOeQnof3kg%40mail.gmail.com%3E
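The `flock`-based fix discussed in this thread could be sketched as below: hold an exclusive lock on the pid file while appending to it, and fall back to an unlocked append where `flock` is unavailable (e.g. stock MacOS, as noted above). The variable and function names are illustrative, not the actual Flink script.

```shell
#!/usr/bin/env bash
# Sketch: serialize concurrent pid-file updates with flock(1), falling back
# to an unlocked update when flock is not installed. Illustrative names only.
FLINK_PID_FILE="${FLINK_PID_FILE:-/tmp/flink-daemon.pid}"

append_pid() {
    echo "$1" >> "$FLINK_PID_FILE"
}

if command -v flock >/dev/null 2>&1; then
    # Open the pid file on fd 200 and hold an exclusive lock while updating it.
    (
        flock 200
        append_pid "$$"
    ) 200>>"$FLINK_PID_FILE"
else
    # Without flock the original race remains possible, but startup still works.
    append_pid "$$"
fi
```

Since the lock is taken on the pid file itself, concurrent daemon starts serialize on the read-modify-append sequence instead of interleaving their writes.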





[jira] [Commented] (FLINK-4212) Lock PID file when starting daemons

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379288#comment-15379288
 ] 

ASF GitHub Bot commented on FLINK-4212:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2251
  
Is the `flock` command available by default on the common UNIX-style OSs?

Ubuntu has it by default, what about MacOS?


> Lock PID file when starting daemons
> ---
>
> Key: FLINK-4212
> URL: https://issues.apache.org/jira/browse/FLINK-4212
> Project: Flink
>  Issue Type: Improvement
>  Components: Startup Shell Scripts
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>
> As noted on the mailing list (0), when multiple TaskManagers are started in 
> parallel (using pdsh) there is a race condition on updating the pid: 1) the 
> pid file is first read to parse the process' index, 2) the process is 
> started, and 3) on success the daemon pid is appended to the pid file.
> We could use a tool such as {{flock}} to lock on the pid file while starting 
> the Flink daemon.
> 0: 
> http://mail-archives.apache.org/mod_mbox/flink-user/201607.mbox/%3CCA%2BssbKXw954Bz_sBRwP6db0FntWyGWzTyP7wJZ5nhOeQnof3kg%40mail.gmail.com%3E





[GitHub] flink issue #2251: [FLINK-4212] [scripts] Lock PID file when starting daemon...

2016-07-15 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2251
  
Is the `flock` command available by default on the common UNIX-style OSs?

Ubuntu has it by default, what about MacOS?




[jira] [Commented] (FLINK-4210) Move close()/isClosed() out of MetricGroup interface

2016-07-15 Thread Chesnay Schepler (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379276#comment-15379276
 ] 

Chesnay Schepler commented on FLINK-4210:
-

Initially there was a separate group for them; however, since some features are 
implemented with user-defined functions as well, this proved to be problematic. 
Some time has passed since then, though, and I don't know how valid this 
problem is anymore.

> Move close()/isClosed() out of MetricGroup interface
> 
>
> Key: FLINK-4210
> URL: https://issues.apache.org/jira/browse/FLINK-4210
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Minor
> Fix For: 1.1.0
>
>
> The (user-facing) MetricGroup interface currently exposes a close() and 
> isClosed() method which generally users shouldn't need to call. They are an 
> internal thing, and thus should be moved into the AbstractMetricGroup class.





[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379270#comment-15379270
 ] 

ASF GitHub Bot commented on FLINK-3466:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2252
  
Thanks, I'll address your comments and merge this...


> Job might get stuck in restoreState() from HDFS due to interrupt
> 
>
> Key: FLINK-3466
> URL: https://issues.apache.org/jira/browse/FLINK-3466
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.0.0, 0.10.2
>Reporter: Robert Metzger
>Assignee: Stephan Ewen
>
> A user reported the following issue with a failing job:
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
> java.io.DataInputStream.read(DataInputStream.java:149)
> org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69)
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
> java.io.ObjectInputStream.(ObjectInputStream.java:299)
> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162)
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
> java.lang.Thread.run(Thread.java:745)
> {code}
> and 
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> java.lang.Throwable.fillInStackTrace(Native Method)
> java.lang.Throwable.fillInStackTrace(Throwable.java:783)
> java.lang.Throwable.(Throwable.java:250)
> java.lang.Exception.(Exception.java:54)
> java.lang.InterruptedException.(InterruptedException.java:57)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038)
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> 

[GitHub] flink issue #2252: [FLINK-3466] [runtime] Cancel state handled on state rest...

2016-07-15 Thread StephanEwen
Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2252
  
Thanks, I'll address your comments and merge this...




[jira] [Commented] (FLINK-3397) Failed streaming jobs should fall back to the most recent checkpoint/savepoint

2016-07-15 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379266#comment-15379266
 ] 

ramkrishna.s.vasudevan commented on FLINK-3397:
---

[~uce]
Just a gentle reminder, in case you have had some time to give feedback. If you 
are in the middle of the release, sorry to bother you; I will wait some more. 
Thanks.

> Failed streaming jobs should fall back to the most recent checkpoint/savepoint
> --
>
> Key: FLINK-3397
> URL: https://issues.apache.org/jira/browse/FLINK-3397
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Streaming
>Affects Versions: 1.0.0
>Reporter: Gyula Fora
>Priority: Minor
> Attachments: FLINK-3397.pdf
>
>
> The current fallback behaviour in case of a streaming job failure is slightly 
> counterintuitive:
> If a job fails it will fall back to the most recent checkpoint (if any) even 
> if there were more recent savepoint taken. This means that savepoints are not 
> regarded as checkpoints by the system only points from where a job can be 
> manually restarted.
> I suggest to change this so that savepoints are also regarded as checkpoints 
> in case of a failure and they will also be used to automatically restore the 
> streaming job.





[jira] [Commented] (FLINK-4210) Move close()/isClosed() out of MetricGroup interface

2016-07-15 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379258#comment-15379258
 ] 

Stephan Ewen commented on FLINK-4210:
-

So, the Flink runtime metrics for an operator are in the same group as the 
user's metrics? I thought the user metrics were a subgroup of the operator group 
- wouldn't that make sense?

Aside from that, we expose close() on so many things (like the collectors) and 
it has never been an issue.

> Move close()/isClosed() out of MetricGroup interface
> 
>
> Key: FLINK-4210
> URL: https://issues.apache.org/jira/browse/FLINK-4210
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Minor
> Fix For: 1.1.0
>
>
> The (user-facing) MetricGroup interface currently exposes a close() and 
> isClosed() method which generally users shouldn't need to call. They are an 
> internal thing, and thus should be moved into the AbstractMetricGroup class.





[jira] [Commented] (FLINK-4218) Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." causes task restarting

2016-07-15 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379254#comment-15379254
 ] 

Stephan Ewen commented on FLINK-4218:
-

The read-after-write consistency seems to hold only for new keys. The check for 
the parent directory may actually be considered a modification/update operation 
and hence fall under eventual consistency.

Concerning the S3AFileSystem: do you know whether a "close()" call ensures that 
data is properly written? In HDFS, "close()" returns successfully only once the 
data is persistent.

> Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." 
> causes task restarting
> --
>
> Key: FLINK-4218
> URL: https://issues.apache.org/jira/browse/FLINK-4218
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sergii Koshel
>
> Sporadically I see an exception as below, and the task restarts because of it.
> {code:title=Exception|borderStyle=solid}
> java.lang.RuntimeException: Error triggering a checkpoint as the result of 
> receiving checkpoint barrier
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:785)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:775)
>   at 
> org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:203)
>   at 
> org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:129)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:183)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:66)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:265)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3:///flink/checkpoints/ece317c26960464ba5de75f3bbc38cb2/chk-8810/96eebbeb-de14-45c7-8ebb-e7cde978d6d3
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:996)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
>   at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:351)
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFileStateHandle.getFileSize(AbstractFileStateHandle.java:93)
>   at 
> org.apache.flink.runtime.state.filesystem.FileStreamStateHandle.getStateSize(FileStreamStateHandle.java:58)
>   at 
> org.apache.flink.runtime.state.AbstractStateBackend$DataInputViewHandle.getStateSize(AbstractStateBackend.java:482)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskStateList.getStateSize(StreamTaskStateList.java:77)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:604)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:779)
>   ... 8 more
> {code}
> File actually exists on S3. 
> I suppose it is related to some race conditions with S3, but it would be good 
> to retry a few times before stopping task execution.





[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379249#comment-15379249
 ] 

ASF GitHub Bot commented on FLINK-3466:
---

Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/2252#discussion_r70956861
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/AbstractFileStateHandle.java
 ---
@@ -20,18 +20,21 @@
 
 import org.apache.flink.core.fs.FileSystem;
 import org.apache.flink.core.fs.Path;
+import org.apache.flink.runtime.state.AbstractCloseableHandle;
+import org.apache.flink.runtime.state.StateObject;
 
 import java.io.IOException;
+import java.io.Serializable;
 
 import static java.util.Objects.requireNonNull;
 
 /**
  * Base class for state that is stored in a file.
  */
-public abstract class AbstractFileStateHandle implements 
java.io.Serializable {
-   
+public abstract class AbstractFileStateHandle extends 
AbstractCloseableHandle implements StateObject, Serializable {
--- End diff --

True, will remove this.


> Job might get stuck in restoreState() from HDFS due to interrupt
> 
>
> Key: FLINK-3466
> URL: https://issues.apache.org/jira/browse/FLINK-3466
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.0.0, 0.10.2
>Reporter: Robert Metzger
>Assignee: Stephan Ewen
>
> A user reported the following issue with a failing job:
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
> java.io.DataInputStream.read(DataInputStream.java:149)
> org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69)
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
> java.io.ObjectInputStream.(ObjectInputStream.java:299)
> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162)
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
> java.lang.Thread.run(Thread.java:745)
> {code}
> and 
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> java.lang.Throwable.fillInStackTrace(Native Method)
> java.lang.Throwable.fillInStackTrace(Throwable.java:783)
> java.lang.Throwable.<init>(Throwable.java:250)
> java.lang.Exception.<init>(Exception.java:54)
> java.lang.InterruptedException.<init>(InterruptedException.java:57)
> 

[GitHub] flink pull request #2252: [FLINK-3466] [runtime] Cancel state handled on sta...

2016-07-15 Thread StephanEwen
Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/2252#discussion_r70956813
  
--- Diff: 
flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/tasks/InterruptSensitiveRestoreTest.java
 ---
@@ -0,0 +1,223 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.streaming.runtime.tasks;
+
+import org.apache.flink.api.common.ExecutionConfig;
+import org.apache.flink.api.common.JobID;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.testutils.OneShotLatch;
+import org.apache.flink.runtime.blob.BlobKey;
+import org.apache.flink.runtime.broadcast.BroadcastVariableManager;
+import org.apache.flink.runtime.deployment.InputGateDeploymentDescriptor;
+import 
org.apache.flink.runtime.deployment.ResultPartitionDeploymentDescriptor;
+import org.apache.flink.runtime.deployment.TaskDeploymentDescriptor;
+import org.apache.flink.runtime.execution.ExecutionState;
+import 
org.apache.flink.runtime.execution.librarycache.FallbackLibraryCacheManager;
+import org.apache.flink.runtime.executiongraph.ExecutionAttemptID;
+import org.apache.flink.runtime.filecache.FileCache;
+import org.apache.flink.runtime.instance.ActorGateway;
+import org.apache.flink.runtime.io.disk.iomanager.IOManager;
+import org.apache.flink.runtime.io.network.NetworkEnvironment;
+import org.apache.flink.runtime.jobgraph.JobVertexID;
+import org.apache.flink.runtime.memory.MemoryManager;
+import 
org.apache.flink.runtime.operators.testutils.UnregisteredTaskMetricsGroup;
+import org.apache.flink.runtime.state.StateHandle;
+import org.apache.flink.runtime.taskmanager.Task;
+import org.apache.flink.runtime.taskmanager.TaskManagerRuntimeInfo;
+import org.apache.flink.runtime.util.EnvironmentInformation;
+import org.apache.flink.runtime.util.SerializableObject;
+import org.apache.flink.streaming.api.TimeCharacteristic;
+import org.apache.flink.streaming.api.checkpoint.Checkpointed;
+import org.apache.flink.streaming.api.functions.source.SourceFunction;
+import org.apache.flink.streaming.api.graph.StreamConfig;
+import org.apache.flink.streaming.api.operators.StreamSource;
+import org.apache.flink.util.SerializedValue;
+
+import org.junit.Test;
+
+import scala.concurrent.duration.FiniteDuration;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.net.URL;
+import java.util.Collections;
+import java.util.concurrent.TimeUnit;
+
+import static org.junit.Assert.*;
+import static org.mockito.Mockito.*;
+
+/**
+ * This test checks that task restores that get stuck in the presence of 
interrupts
+ * are handled properly.
+ *
+ * In practice, reading from HDFS is interrupt sensitive: The HDFS code 
frequently deadlocks
+ * or livelocks if it is interrupted.
+ */
+public class InterruptSensitiveRestoreTest {
+
+   private static final OneShotLatch IN_RESTORE_LATCH = new OneShotLatch();
+
+   @Test
+   public void testRestoreWithInterrupt() throws Exception {
+
+   Configuration taskConfig = new Configuration();
+   StreamConfig cfg = new StreamConfig(taskConfig);
+   cfg.setTimeCharacteristic(TimeCharacteristic.ProcessingTime);
+   cfg.setStreamOperator(new StreamSource<>(new TestSource()));
+
+   StateHandle lockingHandle = new 
InterruptLockingStateHandle();
+   StreamTaskState opState = new StreamTaskState();
+   opState.setFunctionState(lockingHandle);
+   StreamTaskStateList taskState = new StreamTaskStateList(new 
StreamTaskState[] { opState });
+
+   TaskDeploymentDescriptor tdd = 
createTaskDeploymentDescriptor(taskConfig, taskState);
+   Task task = createTask(tdd);
+
+   // start the task and wait until it is in "restore"
+   task.startTaskThread();
+   
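The quoted test coordinates the test thread and the task thread with a `OneShotLatch`. As a minimal stand-in (assumed `trigger()`/`await()` semantics; Flink's real class lives in `org.apache.flink.core.testutils`), the same behavior can be sketched on top of `CountDownLatch`:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Minimal stand-in for the OneShotLatch used in the quoted test
// (assumption: trigger()/await() semantics, not Flink's actual code).
// One thread signals exactly once; any number of threads wait for it.
public class OneShotLatchSketch {
    private final CountDownLatch latch = new CountDownLatch(1);

    public void trigger() {
        latch.countDown(); // idempotent after the first call
    }

    public void await() throws InterruptedException {
        latch.await();
    }

    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return latch.await(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        OneShotLatchSketch inRestore = new OneShotLatchSketch();
        Thread task = new Thread(inRestore::trigger); // task signals "in restore"
        task.start();
        inRestore.await(); // test thread blocks until the task reaches restore
        System.out.println("latch triggered");
    }
}
```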

[GitHub] flink pull request #2252: [FLINK-3466] [runtime] Cancel state handled on sta...

2016-07-15 Thread StephanEwen
Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/2252#discussion_r70956861
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/AbstractFileStateHandle.java
 ---
@@ -20,18 +20,21 @@
 
 import org.apache.flink.core.fs.FileSystem;
 import org.apache.flink.core.fs.Path;
+import org.apache.flink.runtime.state.AbstractCloseableHandle;
+import org.apache.flink.runtime.state.StateObject;
 
 import java.io.IOException;
+import java.io.Serializable;
 
 import static java.util.Objects.requireNonNull;
 
 /**
  * Base class for state that is stored in a file.
  */
-public abstract class AbstractFileStateHandle implements 
java.io.Serializable {
-   
+public abstract class AbstractFileStateHandle extends 
AbstractCloseableHandle implements StateObject, Serializable {
--- End diff --

True, will remove this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379248#comment-15379248
 ] 

ASF GitHub Bot commented on FLINK-3466:
---

Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/2252#discussion_r70956828
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/AbstractCloseableHandle.java
 ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.state;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;
+
+/**
+ * A simple base for closable handles.
+ * 
+ * Offers to register a stream (or other closable object) that close calls 
are delegated to if
+ * the handel is closed or was already closed.
--- End diff --

Thanks, will fix it.


> Job might get stuck in restoreState() from HDFS due to interrupt
> 
>
> Key: FLINK-3466
> URL: https://issues.apache.org/jira/browse/FLINK-3466
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.0.0, 0.10.2
>Reporter: Robert Metzger
>Assignee: Stephan Ewen
>
> A user reported the following issue with a failing job:
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
> java.io.DataInputStream.read(DataInputStream.java:149)
> org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69)
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
> java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:55)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162)
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
> 

[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379247#comment-15379247
 ] 

ASF GitHub Bot commented on FLINK-3466:
---

Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/2252#discussion_r70956813
  
--- Diff: 
flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/tasks/InterruptSensitiveRestoreTest.java
 ---
@@ -0,0 +1,223 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.streaming.runtime.tasks;
+
+import org.apache.flink.api.common.ExecutionConfig;
+import org.apache.flink.api.common.JobID;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.testutils.OneShotLatch;
+import org.apache.flink.runtime.blob.BlobKey;
+import org.apache.flink.runtime.broadcast.BroadcastVariableManager;
+import org.apache.flink.runtime.deployment.InputGateDeploymentDescriptor;
+import 
org.apache.flink.runtime.deployment.ResultPartitionDeploymentDescriptor;
+import org.apache.flink.runtime.deployment.TaskDeploymentDescriptor;
+import org.apache.flink.runtime.execution.ExecutionState;
+import 
org.apache.flink.runtime.execution.librarycache.FallbackLibraryCacheManager;
+import org.apache.flink.runtime.executiongraph.ExecutionAttemptID;
+import org.apache.flink.runtime.filecache.FileCache;
+import org.apache.flink.runtime.instance.ActorGateway;
+import org.apache.flink.runtime.io.disk.iomanager.IOManager;
+import org.apache.flink.runtime.io.network.NetworkEnvironment;
+import org.apache.flink.runtime.jobgraph.JobVertexID;
+import org.apache.flink.runtime.memory.MemoryManager;
+import 
org.apache.flink.runtime.operators.testutils.UnregisteredTaskMetricsGroup;
+import org.apache.flink.runtime.state.StateHandle;
+import org.apache.flink.runtime.taskmanager.Task;
+import org.apache.flink.runtime.taskmanager.TaskManagerRuntimeInfo;
+import org.apache.flink.runtime.util.EnvironmentInformation;
+import org.apache.flink.runtime.util.SerializableObject;
+import org.apache.flink.streaming.api.TimeCharacteristic;
+import org.apache.flink.streaming.api.checkpoint.Checkpointed;
+import org.apache.flink.streaming.api.functions.source.SourceFunction;
+import org.apache.flink.streaming.api.graph.StreamConfig;
+import org.apache.flink.streaming.api.operators.StreamSource;
+import org.apache.flink.util.SerializedValue;
+
+import org.junit.Test;
+
+import scala.concurrent.duration.FiniteDuration;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.net.URL;
+import java.util.Collections;
+import java.util.concurrent.TimeUnit;
+
+import static org.junit.Assert.*;
+import static org.mockito.Mockito.*;
+
+/**
+ * This test checks that task restores that get stuck in the presence of 
interrupts
+ * are handled properly.
+ *
+ * In practice, reading from HDFS is interrupt sensitive: The HDFS code 
frequently deadlocks
+ * or livelocks if it is interrupted.
+ */
+public class InterruptSensitiveRestoreTest {
+
+   private static final OneShotLatch IN_RESTORE_LATCH = new OneShotLatch();
+
+   @Test
+   public void testRestoreWithInterrupt() throws Exception {
+
+   Configuration taskConfig = new Configuration();
+   StreamConfig cfg = new StreamConfig(taskConfig);
+   cfg.setTimeCharacteristic(TimeCharacteristic.ProcessingTime);
+   cfg.setStreamOperator(new StreamSource<>(new TestSource()));
+
+   StateHandle lockingHandle = new 
InterruptLockingStateHandle();
+   StreamTaskState opState = new StreamTaskState();
+   opState.setFunctionState(lockingHandle);
+   StreamTaskStateList taskState = new StreamTaskStateList(new 
StreamTaskState[] { opState });
+
+   TaskDeploymentDescriptor tdd = 

[GitHub] flink pull request #2252: [FLINK-3466] [runtime] Cancel state handled on sta...

2016-07-15 Thread StephanEwen
Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/2252#discussion_r70956828
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/AbstractCloseableHandle.java
 ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.state;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;
+
+/**
+ * A simple base for closable handles.
+ * 
+ * Offers to register a stream (or other closable object) that close calls 
are delegated to if
+ * the handel is closed or was already closed.
--- End diff --

Thanks, will fix it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
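The `AbstractCloseableHandle` diff above imports `AtomicIntegerFieldUpdater`, which suggests a compact idempotent-close pattern. A hedged sketch of that pattern (field and method names are illustrative, not Flink's actual implementation):

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Sketch of an idempotent close flag driven by an AtomicIntegerFieldUpdater
// (assumption: illustrative names, not the real AbstractCloseableHandle).
// The updater avoids one AtomicInteger object per handle instance.
public class CloseableHandleSketch {
    private static final AtomicIntegerFieldUpdater<CloseableHandleSketch> CLOSER =
            AtomicIntegerFieldUpdater.newUpdater(CloseableHandleSketch.class, "closed");

    // updater fields must be volatile ints in the updated class
    private volatile int closed;

    /** Returns true only for the first caller that closes the handle. */
    public boolean close() {
        return CLOSER.compareAndSet(this, 0, 1);
    }

    public boolean isClosed() {
        return closed != 0;
    }

    public static void main(String[] args) {
        CloseableHandleSketch h = new CloseableHandleSketch();
        System.out.println(h.isClosed());  // false
        System.out.println(h.close());     // true  (first close wins)
        System.out.println(h.close());     // false (already closed)
        System.out.println(h.isClosed());  // true
    }
}
```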


[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379245#comment-15379245
 ] 

ASF GitHub Bot commented on FLINK-3466:
---

Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/2252#discussion_r70956526
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/memory/AbstractMemStateSnapshot.java
 ---
@@ -54,6 +56,8 @@
 
/** The serialized data of the state key/value pairs */
private final byte[] data;
+   
+   private transient boolean closed;
--- End diff --

I think it is not crucial to have a strict barrier here. If the reading 
thread eventually notices the flag, it is enough. And since volatile accesses 
are much more expensive, I wanted to avoid that.


> Job might get stuck in restoreState() from HDFS due to interrupt
> 
>
> Key: FLINK-3466
> URL: https://issues.apache.org/jira/browse/FLINK-3466
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.0.0, 0.10.2
>Reporter: Robert Metzger
>Assignee: Stephan Ewen
>
> A user reported the following issue with a failing job:
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016)
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
> java.io.DataInputStream.read(DataInputStream.java:149)
> org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69)
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
> java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:55)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52)
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162)
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440)
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
> java.lang.Thread.run(Thread.java:745)
> {code}
> and 
> {code}
> 10:46:09,223 WARN  org.apache.flink.runtime.taskmanager.Task  
>- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck 
> in method:
> java.lang.Throwable.fillInStackTrace(Native Method)
> java.lang.Throwable.fillInStackTrace(Throwable.java:783)
> java.lang.Throwable.<init>(Throwable.java:250)
> java.lang.Exception.<init>(Exception.java:54)
> java.lang.InterruptedException.<init>(InterruptedException.java:57)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038)
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> 

[GitHub] flink pull request #2252: [FLINK-3466] [runtime] Cancel state handled on sta...

2016-07-15 Thread StephanEwen
Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/2252#discussion_r70956526
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/memory/AbstractMemStateSnapshot.java
 ---
@@ -54,6 +56,8 @@
 
/** The serialized data of the state key/value pairs */
private final byte[] data;
+   
+   private transient boolean closed;
--- End diff --

I think it is not crucial to have a strict barrier here. If the reading 
thread eventually notices the flag, it is enough. And since volatile accesses 
are much more expensive, I wanted to avoid that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
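The comment above trades strict visibility for cheap reads: the snapshot's `closed` flag is deliberately not volatile, so the reading thread may notice it with a delay. A hedged single-threaded sketch of that pattern (names are illustrative, not the actual `AbstractMemStateSnapshot` code):

```java
// Sketch of the tradeoff described above (assumption: illustrative names,
// not Flink's AbstractMemStateSnapshot). The close flag is a plain boolean:
// writes may become visible to other threads only eventually, but reads on
// the hot path avoid the cost of a volatile access.
public class MemSnapshotFlagSketch {
    private transient boolean closed; // deliberately NOT volatile

    public void close() {
        closed = true; // other threads will see this eventually, not immediately
    }

    public void ensureNotClosed() throws java.io.IOException {
        // best-effort check on the read path; cheap because 'closed' is plain
        if (closed) {
            throw new java.io.IOException("snapshot was closed");
        }
    }

    public static void main(String[] args) {
        MemSnapshotFlagSketch s = new MemSnapshotFlagSketch();
        try {
            s.ensureNotClosed(); // passes: not closed yet
            s.close();
            s.ensureNotClosed(); // throws: a thread always sees its own write
        } catch (java.io.IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```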


[GitHub] flink issue #2226: [FLINK-4192] - Move Metrics API to separate module

2016-07-15 Thread zentol
Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2226
  
@StephanEwen I've changed the `MetricConfig` to extend `Properties` and 
removed the `setString()` method. I've kept the other methods for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
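The change described above — a `MetricConfig` that extends `Properties` while keeping its typed getters — can be sketched as follows (method names are assumptions for illustration; the real class ships with flink-metrics):

```java
// Hedged sketch of a MetricConfig extending java.util.Properties with a
// typed getter (assumption: illustrative API, not the actual Flink class).
public class MetricConfigSketch extends java.util.Properties {
    /** Reads an int property, falling back to the default on absence or parse error. */
    public int getInteger(String key, int defaultValue) {
        String value = getProperty(key);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        MetricConfigSketch cfg = new MetricConfigSketch();
        cfg.setProperty("port", "9249"); // plain Properties API still works
        System.out.println(cfg.getInteger("port", 0));      // 9249
        System.out.println(cfg.getInteger("missing", 42));  // 42
    }
}
```

Extending `Properties` lets reporters pass the config anywhere a `Properties` is expected, which is the point of dropping the custom setter.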


[jira] [Commented] (FLINK-4192) Move Metrics API to separate module

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379224#comment-15379224
 ] 

ASF GitHub Bot commented on FLINK-4192:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2226
  
@StephanEwen I've changed the `MetricConfig` to extend `Properties` and 
removed the `setString()` method. I've kept the other methods for now.


> Move Metrics API to separate module
> ---
>
> Key: FLINK-4192
> URL: https://issues.apache.org/jira/browse/FLINK-4192
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
>Affects Versions: 1.1.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
> Fix For: 1.1.0
>
>
> All metrics code currently resides in flink-core. If a user implements a 
> reporter and wants a fat jar it will now have to include the entire 
> flink-core module.
> Instead, we could move several interfaces into a separate module.
> These interfaces to move include:
> * Counter, Gauge, Histogram(Statistics)
> * MetricGroup
> * MetricReporter, Scheduled, AbstractReporter
> In addition a new MetricRegistry interface will be required as well as a 
> replacement for the Configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4053) Return value from Connection should be checked against null

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379216#comment-15379216
 ] 

ASF GitHub Bot commented on FLINK-4053:
---

Github user mushketyk commented on the issue:

https://github.com/apache/flink/pull/2128
  
Thank you @zentol !


> Return value from Connection should be checked against null
> ---
>
> Key: FLINK-4053
> URL: https://issues.apache.org/jira/browse/FLINK-4053
> Project: Flink
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ivan Mushketyk
>Priority: Minor
> Fix For: 1.1.0
>
>
> In RMQSource.java and RMQSink.java, there is code in the following pattern:
> {code}
>   connection = factory.newConnection();
>   channel = connection.createChannel();
> {code}
> According to 
> https://www.rabbitmq.com/releases/rabbitmq-java-client/current-javadoc/com/rabbitmq/client/Connection.html#createChannel()
>  :
> {code}
> Returns:
> a new channel descriptor, or null if none is available
> {code}
> The return value should be checked against null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

