[jira] [Created] (FLINK-4223) Rearrange scaladoc and javadoc for Scala API
Chiwan Park created FLINK-4223:
--
Summary: Rearrange scaladoc and javadoc for Scala API
Key: FLINK-4223
URL: https://issues.apache.org/jira/browse/FLINK-4223
Project: Flink
Issue Type: Improvement
Components: Documentation
Reporter: Chiwan Park
Priority: Minor

Currently, some documentation for the Scala APIs (Gelly Scala API, FlinkML, Streaming Scala API) is written as javadoc rather than scaladoc.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4222) Allow Kinesis configuration to get credentials from AWS Metadata
[ https://issues.apache.org/jira/browse/FLINK-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380312#comment-15380312 ]

ASF GitHub Bot commented on FLINK-4222:
---
GitHub user chadnickbok opened a pull request:

    https://github.com/apache/flink/pull/2260

[FLINK-4222] Allow Kinesis configuration to get credentials from AWS Metadata

When called without credentials, the AmazonKinesisClient tries to configure itself automatically, searching for credentials from environment variables, java system properties, and finally from instance profile credentials delivered through the Amazon EC2 metadata service.

Add the AWSConfigConstant "AUTO", which supports creating an AmazonKinesisClient without any AWSCredentials, which allows for this auto-discovery mechanism to take place and supports getting kinesis credentials from the AWS EC2 metadata service.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chadnickbok/flink aws-metadata-auth-kinesis

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2260.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2260

commit 8d43f05ce65b88067fd5a4808c773cafc693c0f2
Author: Nick Chadwick
Date: 2016-07-15T23:24:19Z

Support automatic AWS Credentials discovery.

When called without credentials, the AmazonKinesisClient tries to configure itself automatically, searching for credentials from environment variables, java system properties, and finally from instance profile credentials delivered through the Amazon EC2 metadata service.

Add the AWSConfigConstant "AUTO", which supports creating an AmazonKinesisClient without any AWSCredentials, which allows for this auto-discovery mechanism to take place and supports getting kinesis credentials from the AWS EC2 metadata service.
> Allow Kinesis configuration to get credentials from AWS Metadata
>
> Key: FLINK-4222
> URL: https://issues.apache.org/jira/browse/FLINK-4222
> Project: Flink
> Issue Type: Improvement
> Components: Streaming Connectors
> Affects Versions: 1.0.3
> Reporter: Nick Chadwick
> Priority: Minor
> Labels: easyfix
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> When deploying Flink TaskManagers in an EC2 environment, it would be nice to be able to use the EC2 IAM Role credentials provided by the EC2 Metadata service.
> This allows for credentials to be automatically discovered by services running on EC2 instances at runtime, and removes the need to explicitly create and assign credentials to TaskManagers.
> This should be a fairly small change to the configuration of the flink-connector-kinesis, which will greatly improve the ease of deployment to Amazon EC2.
[GitHub] flink pull request #2260: [FLINK-4222] Allow Kinesis configuration to get cr...
GitHub user chadnickbok opened a pull request:

    https://github.com/apache/flink/pull/2260

[FLINK-4222] Allow Kinesis configuration to get credentials from AWS Metadata

When called without credentials, the AmazonKinesisClient tries to configure itself automatically, searching for credentials from environment variables, java system properties, and finally from instance profile credentials delivered through the Amazon EC2 metadata service.

Add the AWSConfigConstant "AUTO", which supports creating an AmazonKinesisClient without any AWSCredentials, which allows for this auto-discovery mechanism to take place and supports getting kinesis credentials from the AWS EC2 metadata service.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chadnickbok/flink aws-metadata-auth-kinesis

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2260.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2260

commit 8d43f05ce65b88067fd5a4808c773cafc693c0f2
Author: Nick Chadwick
Date: 2016-07-15T23:24:19Z

Support automatic AWS Credentials discovery.

When called without credentials, the AmazonKinesisClient tries to configure itself automatically, searching for credentials from environment variables, java system properties, and finally from instance profile credentials delivered through the Amazon EC2 metadata service.

Add the AWSConfigConstant "AUTO", which supports creating an AmazonKinesisClient without any AWSCredentials, which allows for this auto-discovery mechanism to take place and supports getting kinesis credentials from the AWS EC2 metadata service.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Created] (FLINK-4222) Allow Kinesis configuration to get credentials from AWS Metadata
Nick Chadwick created FLINK-4222:
--
Summary: Allow Kinesis configuration to get credentials from AWS Metadata
Key: FLINK-4222
URL: https://issues.apache.org/jira/browse/FLINK-4222
Project: Flink
Issue Type: Improvement
Components: Streaming Connectors
Affects Versions: 1.0.3
Reporter: Nick Chadwick
Priority: Minor

When deploying Flink TaskManagers in an EC2 environment, it would be nice to be able to use the EC2 IAM Role credentials provided by the EC2 Metadata service.

This allows for credentials to be automatically discovered by services running on EC2 instances at runtime, and removes the need to explicitly create and assign credentials to TaskManagers.

This should be a fairly small change to the configuration of the flink-connector-kinesis, which will greatly improve the ease of deployment to Amazon EC2.
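The lookup order described in the PR (environment variables, then Java system properties, then the EC2 instance metadata service) is the behavior of the AWS SDK's default provider chain. As a rough, self-contained sketch of that pattern — the class and method names below are illustrative, not the actual AWS SDK API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of a credentials provider chain: each source is
// tried in order and the first non-empty result wins, mirroring how the
// AmazonKinesisClient auto-discovers credentials when none are given.
public class CredentialsChainSketch {

    // Try each provider in order; return the first credentials found.
    static Optional<String> resolve(List<Supplier<Optional<String>>> providers) {
        for (Supplier<Optional<String>> provider : providers) {
            Optional<String> creds = provider.get();
            if (creds.isPresent()) {
                return creds;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> creds = resolve(Arrays.asList(
            // 1. environment variables
            () -> Optional.ofNullable(System.getenv("AWS_ACCESS_KEY_ID")),
            // 2. Java system properties
            () -> Optional.ofNullable(System.getProperty("aws.accessKeyId")),
            // 3. EC2 instance metadata service (stubbed out here)
            () -> Optional.empty()
        ));
        System.out.println("resolved: " + creds.orElse("<none>"));
    }
}
```

The "AUTO" constant in the PR effectively means: skip explicit credentials and let a chain like this run.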
[jira] [Updated] (FLINK-4209) Fix issue on docker with multiple NICs and remove supervisord dependency
[ https://issues.apache.org/jira/browse/FLINK-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismaël Mejía updated FLINK-4209:
Summary: Fix issue on docker with multiple NICs and remove supervisord dependency (was: Docker image breaks with multiple NICs)

> Fix issue on docker with multiple NICs and remove supervisord dependency
>
> Key: FLINK-4209
> URL: https://issues.apache.org/jira/browse/FLINK-4209
> Project: Flink
> Issue Type: Improvement
> Components: flink-contrib
> Reporter: Ismaël Mejía
> Priority: Minor
>
> The resolution of the host is done by IP today in the docker image scripts, this is an issue when the system has multiple network cards, if the hostname resolution is done by name, this is fixed.
[jira] [Closed] (FLINK-4208) Support Running Flink processes in foreground mode
[ https://issues.apache.org/jira/browse/FLINK-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismaël Mejía closed FLINK-4208.
---
Resolution: Fixed

I decided to remove the issue for foreground mode and change the docker script to use an extra wait to avoid losing the process in the background.

> Support Running Flink processes in foreground mode
>
> Key: FLINK-4208
> URL: https://issues.apache.org/jira/browse/FLINK-4208
> Project: Flink
> Issue Type: Improvement
> Reporter: Ismaël Mejía
> Priority: Minor
>
> Flink clusters are started automatically in daemon mode, this is definitely the default case, however if we want to start containers based on flinks, the execution context gets lost. Running flink as foreground processes can fix this.
[jira] [Commented] (FLINK-4208) Support Running Flink processes in foreground mode
[ https://issues.apache.org/jira/browse/FLINK-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380064#comment-15380064 ]

ASF GitHub Bot commented on FLINK-4208:
---
Github user iemejia closed the pull request at:

    https://github.com/apache/flink/pull/2239

> Support Running Flink processes in foreground mode
>
> Key: FLINK-4208
> URL: https://issues.apache.org/jira/browse/FLINK-4208
> Project: Flink
> Issue Type: Improvement
> Reporter: Ismaël Mejía
> Priority: Minor
>
> Flink clusters are started automatically in daemon mode, this is definitely the default case, however if we want to start containers based on flinks, the execution context gets lost. Running flink as foreground processes can fix this.
[GitHub] flink pull request #2239: [FLINK-4208] Support Running Flink processes in fo...
Github user iemejia closed the pull request at: https://github.com/apache/flink/pull/2239
[jira] [Commented] (FLINK-4209) Docker image breaks with multiple NICs
[ https://issues.apache.org/jira/browse/FLINK-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380060#comment-15380060 ]

ASF GitHub Bot commented on FLINK-4209:
---
Github user iemejia commented on the issue:

    https://github.com/apache/flink/pull/2240

I added this last commit to remove the dependency on supervisord, thanks to @greghogan for the 'wait' idea. Now flink has the thinnest docker image possible :).

> Docker image breaks with multiple NICs
>
> Key: FLINK-4209
> URL: https://issues.apache.org/jira/browse/FLINK-4209
> Project: Flink
> Issue Type: Improvement
> Components: flink-contrib
> Reporter: Ismaël Mejía
> Priority: Minor
>
> The resolution of the host is done by IP today in the docker image scripts, this is an issue when the system has multiple network cards, if the hostname resolution is done by name, this is fixed.
[GitHub] flink issue #2240: [FLINK-4209] Fix issue on docker with multiple NICs and r...
Github user iemejia commented on the issue: https://github.com/apache/flink/pull/2240 I added this last commit to remove the dependency on supervisord, thanks to @greghogan for the 'wait' idea. Now flink has the thinnest docker image possible :).
[GitHub] flink issue #2239: [FLINK-4208] Support Running Flink processes in foregroun...
Github user iemejia commented on the issue: https://github.com/apache/flink/pull/2239 Thanks Greg, I was probably too tired last night because I put the wait in a weird place, I just tried now and everything is working, it is still not 'real' foreground, since Ctrl-C gets captured by the wait, but it fixes the issue for the docker image so I am going to close this pull request and the issue. If you or anybody else is still interested I will keep it, but I am going to close it if you don't mind.
[jira] [Commented] (FLINK-4208) Support Running Flink processes in foreground mode
[ https://issues.apache.org/jira/browse/FLINK-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380057#comment-15380057 ]

ASF GitHub Bot commented on FLINK-4208:
---
Github user iemejia commented on the issue:

    https://github.com/apache/flink/pull/2239

Thanks Greg, I was probably too tired last night because I put the wait in a weird place, I just tried now and everything is working, it is still not 'real' foreground, since Ctrl-C gets captured by the wait, but it fixes the issue for the docker image so I am going to close this pull request and the issue. If you or anybody else is still interested I will keep it, but I am going to close it if you don't mind.

> Support Running Flink processes in foreground mode
>
> Key: FLINK-4208
> URL: https://issues.apache.org/jira/browse/FLINK-4208
> Project: Flink
> Issue Type: Improvement
> Reporter: Ismaël Mejía
> Priority: Minor
>
> Flink clusters are started automatically in daemon mode, this is definitely the default case, however if we want to start containers based on flinks, the execution context gets lost. Running flink as foreground processes can fix this.
[GitHub] flink issue #1947: [FLINK-1502] [WIP] Basic Metric System
Github user sumitchawla commented on the issue: https://github.com/apache/flink/pull/1947 Thanks a lot @zentol .. this is great..
[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.
[ https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379850#comment-15379850 ]

ASF GitHub Bot commented on FLINK-1502:
---
Github user sumitchawla commented on the issue:

    https://github.com/apache/flink/pull/1947

Thanks a lot @zentol .. this is great..

> Expose metrics to graphite, ganglia and JMX.
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
> Issue Type: Sub-task
> Components: JobManager, TaskManager
> Affects Versions: 0.9
> Reporter: Robert Metzger
> Assignee: Chesnay Schepler
> Priority: Minor
> Fix For: 1.1.0
>
> The metrics library allows to expose collected metrics easily to other systems such as graphite, ganglia or Java's JVM (VisualVM).
[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.
[ https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379835#comment-15379835 ]

ASF GitHub Bot commented on FLINK-1502:
---
Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/1947

@sumitchawla sure you can, as described here: https://ci.apache.org/projects/flink/flink-docs-master/apis/metrics.html#registering-metrics

> Expose metrics to graphite, ganglia and JMX.
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
> Issue Type: Sub-task
> Components: JobManager, TaskManager
> Affects Versions: 0.9
> Reporter: Robert Metzger
> Assignee: Chesnay Schepler
> Priority: Minor
> Fix For: 1.1.0
>
> The metrics library allows to expose collected metrics easily to other systems such as graphite, ganglia or Java's JVM (VisualVM).
[GitHub] flink issue #1947: [FLINK-1502] [WIP] Basic Metric System
Github user zentol commented on the issue: https://github.com/apache/flink/pull/1947 @sumitchawla sure you can, as described here: https://ci.apache.org/projects/flink/flink-docs-master/apis/metrics.html#registering-metrics
[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.
[ https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379829#comment-15379829 ]

ASF GitHub Bot commented on FLINK-1502:
---
Github user sumitchawla commented on the issue:

    https://github.com/apache/flink/pull/1947

@zentol .. by job writer i meant end user writing jobs using Flink API. As of now i can create custom accumulators using `getRuntimeContext().addAccumulator(ACCUMULATOR_NAME,...` can i do something similar to register custom metrics in my transformations

> Expose metrics to graphite, ganglia and JMX.
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
> Issue Type: Sub-task
> Components: JobManager, TaskManager
> Affects Versions: 0.9
> Reporter: Robert Metzger
> Assignee: Chesnay Schepler
> Priority: Minor
> Fix For: 1.1.0
>
> The metrics library allows to expose collected metrics easily to other systems such as graphite, ganglia or Java's JVM (VisualVM).
[GitHub] flink issue #1947: [FLINK-1502] [WIP] Basic Metric System
Github user sumitchawla commented on the issue: https://github.com/apache/flink/pull/1947 @zentol .. by job writer i meant end user writing jobs using Flink API. As of now i can create custom accumulators using `getRuntimeContext().addAccumulator(ACCUMULATOR_NAME,...` can i do something similar to register custom metrics in my transformations
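The registration pattern the linked docs describe — register a metric once by name when the function is opened, then update it per record — is analogous to the accumulator API the question mentions. A self-contained stand-in for that lifecycle (these are simplified stand-in classes, not Flink's actual `MetricGroup`/`Counter` API; see the linked documentation for the real calls):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of register-then-update metrics: a counter is
// registered once under a name (in Flink this would happen in a
// RichFunction's open() via the runtime context's metric group) and
// then incremented on every processed record.
public class MetricSketch {

    static final class Counter {
        private final AtomicLong count = new AtomicLong();
        void inc() { count.incrementAndGet(); }
        long getCount() { return count.get(); }
    }

    static final class MetricGroup {
        private final Map<String, Counter> metrics = new HashMap<>();

        // Register a counter, or return the existing one for that name.
        Counter counter(String name) {
            return metrics.computeIfAbsent(name, k -> new Counter());
        }
    }

    public static void main(String[] args) {
        MetricGroup group = new MetricGroup();
        Counter seen = group.counter("recordsSeen"); // registration, once
        for (int i = 0; i < 3; i++) {
            seen.inc();                              // per-record update
        }
        System.out.println("recordsSeen = " + seen.getCount());
    }
}
```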
[jira] [Commented] (FLINK-1267) Add crossGroup operator
[ https://issues.apache.org/jira/browse/FLINK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379790#comment-15379790 ]

Stephan Ewen commented on FLINK-1267:
---
Should we move the discussion to one of the two issues and mark the other as duplicate?

> Add crossGroup operator
>
> Key: FLINK-1267
> URL: https://issues.apache.org/jira/browse/FLINK-1267
> Project: Flink
> Issue Type: New Feature
> Components: DataSet API, Local Runtime, Optimizer
> Affects Versions: 0.7.0-incubating
> Reporter: Fabian Hueske
> Assignee: pietro pinoli
> Priority: Minor
>
> A common operator is to pair-wise compare or combine all elements of a group (there were two questions about this on the user mailing list, recently). Right now, this can be done in two ways:
> 1. {{groupReduce}}: consume and store the complete iterator in memory and build all pairs
> 2. do a self-{{Join}}: the engine builds all pairs of the full symmetric Cartesian product.
> Both approaches have drawbacks. The {{groupReduce}} variant requires that the full group fits into memory and is more cumbersome to implement for the user, but pairs can be arbitrarily built. The self-{{Join}} approach pushes most of the work into the system, but the execution strategy does not treat the self-join different from a regular join (both identical inputs are shuffled, etc.) and always builds the full symmetric Cartesian product.
> I propose to add a dedicated {{crossGroup()}} operator, that offers this functionality in a proper way.
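To make the proposed semantics concrete, here is a plain-Java sketch of what a `crossGroup()` would compute over already-grouped records: every unordered pair of values within each group, and no pairs across groups. The class and method names are illustrative only; a real operator would stream one group at a time rather than materialize everything, which this sketch does purely for clarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of crossGroup semantics: for each group, emit all
// unordered pairs of its values. Unlike a full self-join, no pairs are
// built between values belonging to different groups.
public class CrossGroupSketch {

    static List<int[]> crossGroup(Map<String, List<Integer>> groups) {
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> values : groups.values()) {
            for (int i = 0; i < values.size(); i++) {
                for (int j = i + 1; j < values.size(); j++) {
                    pairs.add(new int[] { values.get(i), values.get(j) });
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        groups.put("a", Arrays.asList(1, 2, 3)); // yields (1,2), (1,3), (2,3)
        groups.put("b", Arrays.asList(4, 5));    // yields (4,5)
        System.out.println(crossGroup(groups).size() + " pairs");
    }
}
```

A group of n values yields n*(n-1)/2 pairs, versus n*n for the full symmetric Cartesian product a plain self-join would build.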
[jira] [Commented] (FLINK-3397) Failed streaming jobs should fall back to the most recent checkpoint/savepoint
[ https://issues.apache.org/jira/browse/FLINK-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379767#comment-15379767 ]

Ufuk Celebi commented on FLINK-3397:
---
Thanks for the reminder! :-)

The description of checkpoints and savepoints are mostly correct. Minor changes:

"Every time a new checkpoint is taken the older ones are discarded and only the latest is considered for any restoration on failure." => This is also configurable, you can keep around multiple completed checkpoints.

"These checkpointed state are never cleared unless the user wants to delete a savepoint and create a new one." => I would remove the last part "and create a new one" as this is independent of when savepoints are cleared. The important thing is that they are not automatically cleared.

The rest of the description is not correct:

"Any job submitted checks if there was a savepoint already available in the back end store." This is not checked automatically, but the user provides the savepoint path to resume from.

Regarding resuming jobs: if a job was submitted with a savepoint path to recover from, it will always fall back to that state in the worst case. What does not happen is that it is falling back to any newer savepoints (even if some were triggered). This is what you describe on page 2. In general though I would refrain from any time consideration when talking about this, the checkpoint ID description is good though.

All in all it's great to see that you looked into the code before doing this. I fear though that these changes require some more consideration about how savepoints are stored/accessed. They are currently mostly independent of the job from which they were created.
> Failed streaming jobs should fall back to the most recent checkpoint/savepoint
>
> Key: FLINK-3397
> URL: https://issues.apache.org/jira/browse/FLINK-3397
> Project: Flink
> Issue Type: Improvement
> Components: State Backends, Checkpointing, Streaming
> Affects Versions: 1.0.0
> Reporter: Gyula Fora
> Priority: Minor
> Attachments: FLINK-3397.pdf
>
> The current fallback behaviour in case of a streaming job failure is slightly counterintuitive: if a job fails it will fall back to the most recent checkpoint (if any) even if there were more recent savepoint taken. This means that savepoints are not regarded as checkpoints by the system, only points from where a job can be manually restarted.
> I suggest to change this so that savepoints are also regarded as checkpoints in case of a failure and they will also be used to automatically restore the streaming job.
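The retention behaviour Ufuk describes above — by default only the latest completed checkpoint is kept, but the number retained is configurable — can be sketched as a bounded store that discards the oldest entry when a new one completes. The class and method names below are illustrative, not Flink's actual classes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a completed-checkpoint store with a configurable retention
// limit: adding a new checkpoint discards the oldest ones beyond the
// limit, and recovery restores from the most recent retained checkpoint.
public class CheckpointStoreSketch {
    private final int maxRetained;
    private final Deque<Long> completed = new ArrayDeque<>(); // IDs, oldest first

    CheckpointStoreSketch(int maxRetained) {
        this.maxRetained = maxRetained;
    }

    void add(long checkpointId) {
        completed.addLast(checkpointId);
        while (completed.size() > maxRetained) {
            completed.removeFirst(); // older checkpoints are discarded
        }
    }

    // On failure, the job restores from the most recent retained checkpoint.
    long latest() {
        return completed.getLast();
    }

    int size() {
        return completed.size();
    }

    public static void main(String[] args) {
        // Default-like behaviour: keep only the latest completed checkpoint.
        CheckpointStoreSketch store = new CheckpointStoreSketch(1);
        store.add(1);
        store.add(2);
        store.add(3);
        System.out.println("retained=" + store.size() + ", latest=" + store.latest());
    }
}
```

The issue's proposal amounts to also feeding savepoints into this recovery path instead of keeping them outside it.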
[GitHub] flink pull request #2259: [hotfix] Fixes broken TopSpeedWindowing scala exam...
GitHub user kl0u opened a pull request:

    https://github.com/apache/flink/pull/2259

[hotfix] Fixes broken TopSpeedWindowing scala example

The problem was that in the case where no input file was provided, the `fromCollection()` source was never exiting and the actual program was never running. The new source is identical to the corresponding java test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kl0u/flink fix_example

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2259.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2259

commit f235631596afaefc05c91f91fda05a7e301db661
Author: kl0u
Date: 2016-07-15T16:20:04Z

[hotfix] Fixes broken TopSpeedWindowing scala example
[jira] [Commented] (FLINK-4201) Checkpoints for jobs in non-terminal state (e.g. suspended) get deleted
[ https://issues.apache.org/jira/browse/FLINK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379639#comment-15379639 ]

Ufuk Celebi commented on FLINK-4201:
---
The shut down hook is actually not a problem, because it is only active in standalone recovery mode. The issue is that a suspended execution graph will shut down the checkpoint coordinator, which discards all checkpoints on shut down. We still need to call shutdown in order to free some resources like the timer task, but have to skip discarding the checkpoints if the execution graph is suspended and not in a globally terminal state.

> Checkpoints for jobs in non-terminal state (e.g. suspended) get deleted
>
> Key: FLINK-4201
> URL: https://issues.apache.org/jira/browse/FLINK-4201
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Reporter: Stefan Richter
> Assignee: Ufuk Celebi
> Priority: Blocker
>
> For example, when shutting down a Yarn session, according to the logs checkpoints for jobs that did not terminate are deleted. In the shutdown hook, removeAllCheckpoints is called and removes checkpoints that should still be kept.
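The fix Ufuk outlines — always release resources on shutdown, but only discard checkpoints when the job reached a globally terminal state — boils down to a status check like the following sketch. Names here are illustrative, not Flink's actual classes:

```java
// Sketch of the proposed shutdown rule: SUSPENDED is not globally
// terminal (the job may be resumed by another JobManager), so its
// checkpoints must survive coordinator shutdown; FINISHED, CANCELED
// and FAILED are globally terminal, so their checkpoints can go.
public class CoordinatorShutdownSketch {

    enum JobStatus {
        FINISHED, CANCELED, FAILED, SUSPENDED;

        boolean isGloballyTerminal() {
            return this != SUSPENDED;
        }
    }

    static boolean shouldDiscardCheckpoints(JobStatus statusAtShutdown) {
        // Resources (e.g. the timer task) are freed unconditionally;
        // only the checkpoint discard is gated on the job status.
        return statusAtShutdown.isGloballyTerminal();
    }

    public static void main(String[] args) {
        for (JobStatus status : JobStatus.values()) {
            System.out.println(status + " -> discard checkpoints: "
                    + shouldDiscardCheckpoints(status));
        }
    }
}
```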
[jira] [Assigned] (FLINK-4201) Checkpoints for jobs in non-terminal state (e.g. suspended) get deleted
[ https://issues.apache.org/jira/browse/FLINK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ufuk Celebi reassigned FLINK-4201:
--
Assignee: Ufuk Celebi

> Checkpoints for jobs in non-terminal state (e.g. suspended) get deleted
>
> Key: FLINK-4201
> URL: https://issues.apache.org/jira/browse/FLINK-4201
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Reporter: Stefan Richter
> Assignee: Ufuk Celebi
> Priority: Blocker
>
> For example, when shutting down a Yarn session, according to the logs checkpoints for jobs that did not terminate are deleted. In the shutdown hook, removeAllCheckpoints is called and removes checkpoints that should still be kept.
[jira] [Commented] (FLINK-4213) Provide CombineHint in Gelly algorithms
[ https://issues.apache.org/jira/browse/FLINK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379631#comment-15379631 ]

ASF GitHub Bot commented on FLINK-4213:
---
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2248

Yes, and the job plan attached to the ticket shows the only combine as a "Sorted Combine" (from distinct).

> Provide CombineHint in Gelly algorithms
>
> Key: FLINK-4213
> URL: https://issues.apache.org/jira/browse/FLINK-4213
> Project: Flink
> Issue Type: Improvement
> Components: Gelly
> Affects Versions: 1.1.0
> Reporter: Greg Hogan
> Assignee: Greg Hogan
>
> Many graph algorithms will see better {{reduce}} performance with the hash-combine compared with the still default sort-combine, e.g. HITS and LocalClusteringCoefficient.
[GitHub] flink issue #2248: [FLINK-4213] [gelly] Provide CombineHint in Gelly algorit...
Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2248 Yes, and the job plan attached to the ticket shows the only combine as a "Sorted Combine" (from distinct).
[GitHub] flink issue #2256: [FLINK-4150] [runtime] Don't clean up BlobStore on BlobSe...
Github user uce commented on the issue: https://github.com/apache/flink/pull/2256 I don't know if we "want to", but it is the current behaviour. A job should only fail if its restart strategy is exhausted though. Do you think we should change that behaviour?
[jira] [Commented] (FLINK-4150) Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
[ https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379618#comment-15379618 ]

ASF GitHub Bot commented on FLINK-4150:
---
Github user uce commented on the issue:

    https://github.com/apache/flink/pull/2256

I don't know if we "want to", but it is the current behaviour. A job should only fail if its restart strategy is exhausted though. Do you think we should change that behaviour?

> Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
>
> Key: FLINK-4150
> URL: https://issues.apache.org/jira/browse/FLINK-4150
> Project: Flink
> Issue Type: Bug
> Components: Job-Submission
> Reporter: Stefan Richter
> Assignee: Ufuk Celebi
> Priority: Blocker
> Fix For: 1.1.0
>
> Submitting a job in Yarn with HA can lead to the following exception:
> {code}
> org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
> ClassLoader info: URL ClassLoader:
> file: '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0' (invalid JAR: zip file is empty)
> Class not resolvable through given classloader.
> at org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Some job information, including the Blob ids, are stored in Zookeeper. The actual Blobs are stored in a dedicated BlobStore, if the recovery mode is set to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the cluster is shut down, the path for the BlobStore is deleted. When the cluster is then restarted, recovering jobs cannot restore because it's Blob ids stored in Zookeeper now point to deleted files.
[jira] [Commented] (FLINK-3279) Optionally disable DistinctOperator combiner
[ https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379606#comment-15379606 ]

Greg Hogan commented on FLINK-3279:
---
I fixed the wording of my comment. I think Fabian's suggestion was to investigate changing {{DistinctOperator}} from using {{GroupReduce}} to using {{Reduce}}. Then we could add {{setCombineHint}} to {{DistinctOperator}} rather than my suggestion above.

> Optionally disable DistinctOperator combiner
>
> Key: FLINK-3279
> URL: https://issues.apache.org/jira/browse/FLINK-3279
> Project: Flink
> Issue Type: New Feature
> Components: DataSet API
> Affects Versions: 1.0.0
> Reporter: Greg Hogan
> Assignee: Greg Hogan
> Priority: Minor
>
> Calling {{DataSet.distinct()}} executes {{DistinctOperator.DistinctFunction}} which is a combinable {{RichGroupReduceFunction}}. Sometimes we know that there will be few duplicate records and disabling the combine would improve performance.
> I propose adding {{public DistinctOperator setCombinable(boolean combinable)}} to {{DistinctOperator}}.
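The hash-combine versus sort-combine distinction behind these threads can be sketched in plain Java: a hash-based combiner pre-aggregates records with equal keys in a hash table before the shuffle, with no sorting of the partition. For `distinct()`, the reduce function simply keeps one of two equal records. This is an illustrative sketch of the idea, not Flink's combiner implementation:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Sketch of a hash-based combiner: records with equal keys are merged
// in a hash table as they arrive, instead of first sorting the
// partition by key (the sort-combine strategy).
public class HashCombineSketch {

    static <K, V> List<Map.Entry<K, V>> hashCombine(
            List<Map.Entry<K, V>> partition, BinaryOperator<V> reduce) {
        Map<K, V> table = new HashMap<>();
        for (Map.Entry<K, V> record : partition) {
            table.merge(record.getKey(), record.getValue(), reduce);
        }
        return new ArrayList<>(table.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> partition = new ArrayList<>();
        partition.add(new AbstractMap.SimpleEntry<>("a", 1));
        partition.add(new AbstractMap.SimpleEntry<>("a", 2));
        partition.add(new AbstractMap.SimpleEntry<>("b", 5));
        // Sum combine merges the two "a" records; for distinct,
        // the reduce function would be (x, y) -> x.
        System.out.println(hashCombine(partition, Integer::sum));
    }
}
```

When there are few duplicate keys the table barely shrinks the data, which is exactly the case where the issue proposes letting users disable (or pick) the combine.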
[jira] [Comment Edited] (FLINK-3279) Optionally disable DistinctOperator combiner
[ https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379599#comment-15379599 ] Gabor Gevay edited comment on FLINK-3279 at 7/15/16 4:00 PM: - I think no. And a Jira is also needed for the sum, max, etc. aggregations. (Maybe these two things can be in one Jira.) was (Author: ggevay): https://issues.apache.org/jira/browse/FLINK-3479? > Optionally disable DistinctOperator combiner > > > Key: FLINK-3279 > URL: https://issues.apache.org/jira/browse/FLINK-3279 > Project: Flink > Issue Type: New Feature > Components: DataSet API >Affects Versions: 1.0.0 >Reporter: Greg Hogan >Assignee: Greg Hogan >Priority: Minor > > Calling {{DataSet.distinct()}} executes {{DistinctOperator.DistinctFunction}} > which is a combinable {{RichGroupReduceFunction}}. Sometimes we know that > there will be few duplicate records and disabling the combine would improve > performance. > I propose adding {{public DistinctOperator setCombinable(boolean > combinable)}} to {{DistinctOperator}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3279) Optionally disable DistinctOperator combiner
[ https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379599#comment-15379599 ] Gabor Gevay commented on FLINK-3279: https://issues.apache.org/jira/browse/FLINK-3479? > Optionally disable DistinctOperator combiner > > > Key: FLINK-3279 > URL: https://issues.apache.org/jira/browse/FLINK-3279 > Project: Flink > Issue Type: New Feature > Components: DataSet API >Affects Versions: 1.0.0 >Reporter: Greg Hogan >Assignee: Greg Hogan >Priority: Minor > > Calling {{DataSet.distinct()}} executes {{DistinctOperator.DistinctFunction}} > which is a combinable {{RichGroupReduceFunction}}. Sometimes we know that > there will be few duplicate records and disabling the combine would improve > performance. > I propose adding {{public DistinctOperator setCombinable(boolean > combinable)}} to {{DistinctOperator}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (FLINK-3279) Optionally disable DistinctOperator combiner
[ https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379590#comment-15379590 ] Greg Hogan edited comment on FLINK-3279 at 7/15/16 3:54 PM: Do we have a ticket for porting Distinct from using groupReduce to using reduce? was (Author: greghogan): Do we have a ticket for porting groupReduce to reduce? > Optionally disable DistinctOperator combiner > > > Key: FLINK-3279 > URL: https://issues.apache.org/jira/browse/FLINK-3279 > Project: Flink > Issue Type: New Feature > Components: DataSet API >Affects Versions: 1.0.0 >Reporter: Greg Hogan >Assignee: Greg Hogan >Priority: Minor > > Calling {{DataSet.distinct()}} executes {{DistinctOperator.DistinctFunction}} > which is a combinable {{RichGroupReduceFunction}}. Sometimes we know that > there will be few duplicate records and disabling the combine would improve > performance. > I propose adding {{public DistinctOperator setCombinable(boolean > combinable)}} to {{DistinctOperator}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3279) Optionally disable DistinctOperator combiner
[ https://issues.apache.org/jira/browse/FLINK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379590#comment-15379590 ] Greg Hogan commented on FLINK-3279: --- Do we have a ticket for porting groupReduce to reduce? > Optionally disable DistinctOperator combiner > > > Key: FLINK-3279 > URL: https://issues.apache.org/jira/browse/FLINK-3279 > Project: Flink > Issue Type: New Feature > Components: DataSet API >Affects Versions: 1.0.0 >Reporter: Greg Hogan >Assignee: Greg Hogan >Priority: Minor > > Calling {{DataSet.distinct()}} executes {{DistinctOperator.DistinctFunction}} > which is a combinable {{RichGroupReduceFunction}}. Sometimes we know that > there will be few duplicate records and disabling the combine would improve > performance. > I propose adding {{public DistinctOperator setCombinable(boolean > combinable)}} to {{DistinctOperator}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
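The trade-off discussed in this thread can be sketched outside of Flink. Below is a minimal Python model (illustrative only, not the DataSet API) of a combinable distinct: each partition pre-deduplicates before the shuffle. With many duplicates the combiner shrinks the shuffled data; with mostly unique records it emits as many records as it read, and the extra pass is pure overhead, which is what a `setCombinable(false)` or `setCombineHint` knob would let users avoid.

```python
def distinct_with_combine(partitions):
    """Model of a combinable distinct: dedup per partition (combine phase),
    then dedup globally after the 'shuffle' (reduce phase)."""
    combined = [set(p) for p in partitions]            # combine: per-partition dedup
    shuffled = [x for part in combined for x in part]  # records sent over the network
    return set(shuffled), len(shuffled)

# Many duplicates: the combiner shrinks the shuffle from 8 records to 2.
_, sent_dup = distinct_with_combine([[1, 1, 1, 1], [2, 2, 2, 2]])

# All unique: all 8 records are shuffled anyway; the combine pass bought nothing.
result, sent_unique = distinct_with_combine([[1, 2, 3, 4], [5, 6, 7, 8]])
```

The same reasoning motivates Fabian's suggestion of porting `DistinctOperator` from `GroupReduce` to `Reduce`: a `Reduce`-based distinct could then expose the existing combine-strategy hints instead of a bespoke flag.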
[jira] [Resolved] (FLINK-4196) Remove "recoveryTimestamp"
[ https://issues.apache.org/jira/browse/FLINK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-4196. - Resolution: Fixed Fix Version/s: 1.1.0 Fixed via de6a3d33ecfa689fd0da1ef661bbf6edb68e9d0b > Remove "recoveryTimestamp" > -- > > Key: FLINK-4196 > URL: https://issues.apache.org/jira/browse/FLINK-4196 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.0.3 >Reporter: Stephan Ewen >Assignee: Stephan Ewen > Fix For: 1.1.0 > > > I think we should remove the {{recoveryTimestamp}} that is attached on state > restore calls. > Given that this is a wall clock timestamp from a master node, which may > change when clocks are adjusted, and between different master nodes during > leader change, this is an unsafe concept. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-4196) Remove "recoveryTimestamp"
[ https://issues.apache.org/jira/browse/FLINK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen closed FLINK-4196. --- > Remove "recoveryTimestamp" > -- > > Key: FLINK-4196 > URL: https://issues.apache.org/jira/browse/FLINK-4196 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.0.3 >Reporter: Stephan Ewen >Assignee: Stephan Ewen > Fix For: 1.1.0 > > > I think we should remove the {{recoveryTimestamp}} that is attached on state > restore calls. > Given that this is a wall clock timestamp from a master node, which may > change when clocks are adjusted, and between different master nodes during > leader change, this is an unsafe concept. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1267) Add crossGroup operator
[ https://issues.apache.org/jira/browse/FLINK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379563#comment-15379563 ] Greg Hogan commented on FLINK-1267: --- I think my FLINK-3910 is a duplicate of this idea. I am both buoyed by Fabian having submitted this idea and deflated by Stephan's assessments. Could this be implemented as {{cross()}} in the same way that {{reduce()}} can be applied to either a grouped or full DataSet? > Add crossGroup operator > --- > > Key: FLINK-1267 > URL: https://issues.apache.org/jira/browse/FLINK-1267 > Project: Flink > Issue Type: New Feature > Components: DataSet API, Local Runtime, Optimizer >Affects Versions: 0.7.0-incubating >Reporter: Fabian Hueske >Assignee: pietro pinoli >Priority: Minor > > A common operation is to pair-wise compare or combine all elements of a group > (there were two questions about this on the user mailing list recently). > Right now, this can be done in two ways: > 1. {{groupReduce}}: consume and store the complete iterator in memory and > build all pairs > 2. do a self-{{Join}}: the engine builds all pairs of the full symmetric > Cartesian product. > Both approaches have drawbacks. The {{groupReduce}} variant requires that the > full group fits into memory and is more cumbersome to implement for the user, > but pairs can be arbitrarily built. The self-{{Join}} approach pushes most of > the work into the system, but the execution strategy does not treat the > self-join differently from a regular join (both identical inputs are shuffled, > etc.) and always builds the full symmetric Cartesian product. > I propose to add a dedicated {{crossGroup()}} operator that offers this > functionality in a proper way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
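The gap between the proposed {{crossGroup()}} and today's self-{{Join}} workaround can be sketched in a few lines of Python (illustrative only, not Flink API): a dedicated per-group cross could emit each unordered pair once, while the self-join builds the full symmetric Cartesian product of the group, i.e. every pair twice.

```python
from itertools import combinations, product

group = ["a", "b", "c", "d"]  # one group of n = 4 elements

# Proposed crossGroup: each unordered pair exactly once -> n*(n-1)/2 pairs.
cross_group_pairs = list(combinations(group, 2))

# Self-join workaround: full symmetric Cartesian product -> n*(n-1) pairs,
# on top of the engine shuffling both (identical) inputs.
self_join_pairs = [(x, y) for x, y in product(group, group) if x != y]
```

For a group of 4 that is 6 pairs versus 12; the factor of two (plus the redundant shuffle) is exactly the work a self-join-aware execution strategy or a dedicated operator could save.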
[jira] [Commented] (FLINK-4213) Provide CombineHint in Gelly algorithms
[ https://issues.apache.org/jira/browse/FLINK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379557#comment-15379557 ] ASF GitHub Bot commented on FLINK-4213: --- Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2248 Does the deadlock occur only with the combiner, or also with the sorter? > Provide CombineHint in Gelly algorithms > --- > > Key: FLINK-4213 > URL: https://issues.apache.org/jira/browse/FLINK-4213 > Project: Flink > Issue Type: Improvement > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > > Many graph algorithms will see better {{reduce}} performance with the > hash-combine compared with the still default sort-combine, e.g. HITS and > LocalClusteringCoefficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-3466. - Resolution: Fixed Fix Version/s: 1.1.0 Fixed in e9f660d1ff5540c7ef829f2de5bb870b787c18b7 > Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Stephan Ewen >Priority: Blocker > Fix For: 1.1.0 > > > A user reported the following issue with a failing job: > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800) > 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848) > java.io.DataInputStream.read(DataInputStream.java:149) > org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69) > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) > java.io.ObjectInputStream.(ObjectInputStream.java:299) > org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > java.lang.Thread.run(Thread.java:745) > {code} > and > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.(Throwable.java:250) > java.lang.Exception.(Exception.java:54) > java.lang.InterruptedException.(InterruptedException.java:57) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038) > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) >
[GitHub] flink issue #2248: [FLINK-4213] [gelly] Provide CombineHint in Gelly algorit...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2248 Does the deadlock occur only with the combiner, or also with the sorter? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Closed] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen closed FLINK-3466. --- > Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Stephan Ewen >Priority: Blocker > Fix For: 1.1.0 > > > A user reported the following issue with a failing job: > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800) > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848) > 
java.io.DataInputStream.read(DataInputStream.java:149) > org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69) > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) > java.io.ObjectInputStream.(ObjectInputStream.java:299) > org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > java.lang.Thread.run(Thread.java:745) > {code} > and > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.(Throwable.java:250) > java.lang.Exception.(Exception.java:54) > java.lang.InterruptedException.(InterruptedException.java:57) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038) > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) >
[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379553#comment-15379553 ] ASF GitHub Bot commented on FLINK-3466: --- Github user StephanEwen closed the pull request at: https://github.com/apache/flink/pull/2252 > Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Stephan Ewen >Priority: Blocker > > A user reported the following issue with a failing job: > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800) > 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848) > java.io.DataInputStream.read(DataInputStream.java:149) > org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69) > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) > java.io.ObjectInputStream.(ObjectInputStream.java:299) > org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > java.lang.Thread.run(Thread.java:745) > {code} > and > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.(Throwable.java:250) > java.lang.Exception.(Exception.java:54) > java.lang.InterruptedException.(InterruptedException.java:57) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038) > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) >
[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379552#comment-15379552 ] ASF GitHub Bot commented on FLINK-3466: --- Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2252 Manually merged in e9f660d1ff5540c7ef829f2de5bb870b787c18b7 > Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Stephan Ewen >Priority: Blocker > > A user reported the following issue with a failing job: > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) > 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800) > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848) > java.io.DataInputStream.read(DataInputStream.java:149) > org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69) > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) > java.io.ObjectInputStream.(ObjectInputStream.java:299) > org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > java.lang.Thread.run(Thread.java:745) > {code} > and > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.(Throwable.java:250) > java.lang.Exception.(Exception.java:54) > java.lang.InterruptedException.(InterruptedException.java:57) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038) > 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) >
[GitHub] flink pull request #2252: [FLINK-3466] [runtime] Cancel state handled on sta...
Github user StephanEwen closed the pull request at: https://github.com/apache/flink/pull/2252
[jira] [Updated] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-3466: Affects Version/s: (was: 0.10.2) (was: 1.0.0) 1.1.0 > Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Stephan Ewen >Priority: Blocker > > A user reported the following issue with a failing job: > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800) > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848) > 
java.io.DataInputStream.read(DataInputStream.java:149) > org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69) > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) > java.io.ObjectInputStream.(ObjectInputStream.java:299) > org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > java.lang.Thread.run(Thread.java:745) > {code} > and > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.(Throwable.java:250) > java.lang.Exception.(Exception.java:54) > java.lang.InterruptedException.(InterruptedException.java:57) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038) > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > 
org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) >
[jira] [Updated] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-3466: Priority: Blocker (was: Major) > Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Stephan Ewen >Priority: Blocker
[GitHub] flink issue #2252: [FLINK-3466] [runtime] Cancel state handled on state rest...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2252 Manually merged in e9f660d1ff5540c7ef829f2de5bb870b787c18b7 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3204) TaskManagers are not shutting down properly on YARN
[ https://issues.apache.org/jira/browse/FLINK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379516#comment-15379516 ] Robert Metzger commented on FLINK-3204: --- I think this error still persists and also shows up in travis tests: https://s3.amazonaws.com/archive.travis-ci.org/jobs/144954182/log.txt (From the /yarn-tests/container_1468587486405_0005_01_01/jobmanager.log file in https://s3.amazonaws.com/flink-logs-us/travis-artifacts/rmetzger/flink/1600/1600.5.tar.gz) {code} 2016-07-15 12:59:46,808 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Shutting down cluster with status SUCCEEDED : Flink YARN Client requested shutdown 2016-07-15 12:59:46,809 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Unregistering application from the YARN Resource Manager 2016-07-15 12:59:46,817 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Waiting for application to be successfully unregistered. 2016-07-15 12:59:46,832 INFO org.apache.flink.yarn.YarnJobManager - Stopping JobManager akka.tcp://flink@172.17.3.43:49869/user/jobmanager. 
2016-07-15 12:59:46,846 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:50790 2016-07-15 12:59:46,864 ERROR org.apache.flink.yarn.YarnJobManager - Executor could not execute task java.util.concurrent.RejectedExecutionException at scala.concurrent.forkjoin.ForkJoinPool.fullExternalPush(ForkJoinPool.java:1870) at scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1834) at scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955) at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:107) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89) at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) at akka.dispatch.Mailbox.run(Mailbox.scala:221) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2016-07-15 12:59:46,920 INFO org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - Interrupted while waiting for queue java.lang.InterruptedException at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275) 2016-07-15 12:59:46,932 INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - Closing proxy : testing-worker-linux-docker-99c17e61-3364-linux-5.prod.travis-ci.org:38889 2016-07-15 12:59:46,987 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon. 2016-07-15 12:59:46,987 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports. 2016-07-15 12:59:47,037 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down. {code} > TaskManagers are not shutting down properly on YARN > --- > > Key: FLINK-3204 > URL: https://issues.apache.org/jira/browse/FLINK-3204 > Project: Flink > Issue Type: Bug > Components: YARN Client >Affects Versions: 1.0.0, 1.1.0 >Reporter: Robert Metzger > Labels: test-stability > > While running some experiments on a YARN cluster, I saw the following error > {code} > 10:15:24,741 INFO
[jira] [Updated] (FLINK-3204) TaskManagers are not shutting down properly on YARN
[ https://issues.apache.org/jira/browse/FLINK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-3204: -- Labels: test-stability (was: ) > TaskManagers are not shutting down properly on YARN > --- > > Key: FLINK-3204 > URL: https://issues.apache.org/jira/browse/FLINK-3204 > Project: Flink > Issue Type: Bug > Components: YARN Client >Affects Versions: 1.0.0, 1.1.0 >Reporter: Robert Metzger > Labels: test-stability > > While running some experiments on a YARN cluster, I saw the following error > {code} > 10:15:24,741 INFO org.apache.flink.yarn.YarnJobManager >- Stopping YARN JobManager with status SUCCEEDED and diagnostic Flink YARN > Client requested shutdown. > 10:15:24,748 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl >- Waiting for application to be successfully unregistered. > 10:15:24,852 INFO > org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - > Interrupted while waiting for queue > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275) > 10:15:24,875 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_10 when > stopping NMClientImpl > 10:15:24,899 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_07 when > stopping NMClientImpl > 10:15:24,954 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_06 when > stopping NMClientImpl > 10:15:24,982 ERROR 
org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_09 when > stopping NMClientImpl > 10:15:25,013 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_11 when > stopping NMClientImpl > 10:15:25,037 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_08 when > stopping NMClientImpl > 10:15:25,041 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_12 when > stopping NMClientImpl > 10:15:25,072 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_05 when > stopping NMClientImpl > 10:15:25,075 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_03 when > stopping NMClientImpl > 10:15:25,077 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_04 when > stopping NMClientImpl > 10:15:25,079 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl >- Failed to stop Container container_1452019681933_0002_01_02 when > stopping NMClientImpl > 10:15:25,080 INFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - > Closing proxy : cdh544-worker-0.c.astral-sorter-757.internal:8041 > 10:15:25,080 INFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - > Closing proxy : cdh544-worker-1.c.astral-sorter-757.internal:8041 > 10:15:25,080 INFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - > Closing proxy : cdh544-master.c.astral-sorter-757.internal:8041 > 10:15:25,080 INFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - > Closing proxy : cdh544-worker-4.c.astral-sorter-757.internal:8041 > 10:15:25,081 INFO > 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - > Closing proxy : cdh544-worker-2.c.astral-sorter-757.internal:8041 > 10:15:25,081 INFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - > Closing proxy : cdh544-worker-3.c.astral-sorter-757.internal:8041 > 10:15:25,081 INFO > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - > Closing proxy : cdh544-worker-5.c.astral-sorter-757.internal:8041 > 10:15:25,085 INFO org.apache.flink.yarn.YarnJobManager >-
[jira] [Updated] (FLINK-3204) TaskManagers are not shutting down properly on YARN
[ https://issues.apache.org/jira/browse/FLINK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-3204: -- Affects Version/s: 1.1.0 > TaskManagers are not shutting down properly on YARN > --- > > Key: FLINK-3204 > URL: https://issues.apache.org/jira/browse/FLINK-3204 > Project: Flink > Issue Type: Bug > Components: YARN Client >Affects Versions: 1.0.0, 1.1.0 >Reporter: Robert Metzger > Labels: test-stability
[jira] [Commented] (FLINK-4186) Expose Kafka metrics through Flink metrics
[ https://issues.apache.org/jira/browse/FLINK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379474#comment-15379474 ] ASF GitHub Bot commented on FLINK-4186: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2236 > Expose Kafka metrics through Flink metrics > -- > > Key: FLINK-4186 > URL: https://issues.apache.org/jira/browse/FLINK-4186 > Project: Flink > Issue Type: Improvement > Components: Kafka Connector >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Robert Metzger > Fix For: 1.1.0 > > > Currently, we expose the Kafka metrics through Flink's accumulators. > We can now use the metrics system in Flink to report Kafka metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-4186) Expose Kafka metrics through Flink metrics
[ https://issues.apache.org/jira/browse/FLINK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger resolved FLINK-4186. --- Resolution: Fixed Fix Version/s: 1.1.0 Resolved in http://git-wip-us.apache.org/repos/asf/flink/commit/41f58182 > Expose Kafka metrics through Flink metrics > -- > > Key: FLINK-4186 > URL: https://issues.apache.org/jira/browse/FLINK-4186 > Project: Flink > Issue Type: Improvement > Components: Kafka Connector >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Robert Metzger > Fix For: 1.1.0 > > > Currently, we expose the Kafka metrics through Flink's accumulators. > We can now use the metrics system in Flink to report Kafka metrics.
[GitHub] flink pull request #2236: [FLINK-4186] Use Flink metrics to report Kafka met...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2236
[GitHub] flink issue #2236: [FLINK-4186] Use Flink metrics to report Kafka metrics
Github user rmetzger commented on the issue: https://github.com/apache/flink/pull/2236 I'm merging the change ...
[jira] [Commented] (FLINK-4186) Expose Kafka metrics through Flink metrics
[ https://issues.apache.org/jira/browse/FLINK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379463#comment-15379463 ] ASF GitHub Bot commented on FLINK-4186: --- Github user rmetzger commented on the issue: https://github.com/apache/flink/pull/2236 I'm merging the change ... > Expose Kafka metrics through Flink metrics > -- > > Key: FLINK-4186 > URL: https://issues.apache.org/jira/browse/FLINK-4186 > Project: Flink > Issue Type: Improvement > Components: Kafka Connector >Affects Versions: 1.1.0 >Reporter: Robert Metzger >Assignee: Robert Metzger > > Currently, we expose the Kafka metrics through Flink's accumulators. > We can now use the metrics system in Flink to report Kafka metrics.
[GitHub] flink issue #2220: [FLINK-4184] [metrics] Replace invalid characters in Sche...
Github user zentol commented on the issue: https://github.com/apache/flink/pull/2220 I would appreciate it if you would give me time to answer to your response before going ahead with a merge.
[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters
[ https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379459#comment-15379459 ] ASF GitHub Bot commented on FLINK-4184: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/2220 I would appreciate it if you would give me time to answer to your response before going ahead with a merge. > Ganglia and GraphiteReporter report metric names with invalid characters > > > Key: FLINK-4184 > URL: https://issues.apache.org/jira/browse/FLINK-4184 > Project: Flink > Issue Type: Bug > Components: Metrics >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.1.0 > > > Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with > names which contain invalid characters. For example, quotes are not filtered > out which can be problematic for Ganglia. Moreover, dots are not replaced > which causes Graphite to think that an IP address is actually a scoped metric > name.
[jira] [Commented] (FLINK-4104) Restructure Gelly docs
[ https://issues.apache.org/jira/browse/FLINK-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379458#comment-15379458 ] ASF GitHub Bot commented on FLINK-4104: --- Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2258 What do you think, @vasia? Is it a problem to rename `gelly_guide.html` to `gelly/index.html` as done with ML? Also, I manufactured a TOC on the top-level Gelly page. > Restructure Gelly docs > -- > > Key: FLINK-4104 > URL: https://issues.apache.org/jira/browse/FLINK-4104 > Project: Flink > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan >Priority: Minor > Fix For: 1.1.0 > > > The Gelly documentation has grown sufficiently long to suggest dividing into > sub-pages. Leave "Using Gelly" on the main page and link to the following > topics as sub-pages: > * Graph API > * Iterative Graph Processing > * Library Methods > * Graph Algorithms > * Graph Generators
[GitHub] flink issue #2258: [FLINK-4104] [docs] Restructure Gelly docs
Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2258 What do you think, @vasia? Is it a problem to rename `gelly_guide.html` to `gelly/index.html` as done with ML? Also, I manufactured a TOC on the top-level Gelly page.
[jira] [Commented] (FLINK-4104) Restructure Gelly docs
[ https://issues.apache.org/jira/browse/FLINK-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379454#comment-15379454 ] ASF GitHub Bot commented on FLINK-4104: --- GitHub user greghogan opened a pull request: https://github.com/apache/flink/pull/2258 [FLINK-4104] [docs] Restructure Gelly docs Split the Gelly documentation into five sub-pages. You can merge this pull request into a Git repository by running: $ git pull https://github.com/greghogan/flink 4104_restructure_gelly_docs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2258.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2258 commit 844cbd18c8aaa08b6fa13e43d5974781b7a05197 Author: Greg Hogan Date: 2016-07-15T14:19:42Z [FLINK-4104] [docs] Restructure Gelly docs Split the Gelly documentation into five sub-pages. > Restructure Gelly docs > -- > > Key: FLINK-4104 > URL: https://issues.apache.org/jira/browse/FLINK-4104 > Project: Flink > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan >Priority: Minor > Fix For: 1.1.0 > > > The Gelly documentation has grown sufficiently long to suggest dividing into > sub-pages. Leave "Using Gelly" on the main page and link to the following > topics as sub-pages: > * Graph API > * Iterative Graph Processing > * Library Methods > * Graph Algorithms > * Graph Generators
[GitHub] flink pull request #2258: [FLINK-4104] [docs] Restructure Gelly docs
GitHub user greghogan opened a pull request: https://github.com/apache/flink/pull/2258 [FLINK-4104] [docs] Restructure Gelly docs Split the Gelly documentation into five sub-pages. You can merge this pull request into a Git repository by running: $ git pull https://github.com/greghogan/flink 4104_restructure_gelly_docs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2258.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2258 commit 844cbd18c8aaa08b6fa13e43d5974781b7a05197 Author: Greg Hogan Date: 2016-07-15T14:19:42Z [FLINK-4104] [docs] Restructure Gelly docs Split the Gelly documentation into five sub-pages.
[jira] [Commented] (FLINK-4150) Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
[ https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379447#comment-15379447 ] ASF GitHub Bot commented on FLINK-4150: --- Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2256 Just a quick question. Do we want to remove also failed jobs from the BlobStore and ZK? Or only finished or cancelled jobs? > Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown > > > Key: FLINK-4150 > URL: https://issues.apache.org/jira/browse/FLINK-4150 > Project: Flink > Issue Type: Bug > Components: Job-Submission >Reporter: Stefan Richter >Assignee: Ufuk Celebi >Priority: Blocker > Fix For: 1.1.0 > > > Submitting a job in Yarn with HA can lead to the following exception: > {code} > org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load > user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09 > ClassLoader info: URL ClassLoader: > file: > '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0' > (invalid JAR: zip file is empty) > Class not resolvable through given classloader. > at > org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588) > at java.lang.Thread.run(Thread.java:745) > {code} > Some job information, including the Blob ids, are stored in Zookeeper. The > actual Blobs are stored in a dedicated BlobStore, if the recovery mode is set > to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the > cluster is shut down, the path for the BlobStore is deleted. When the cluster > is then restarted, recovering jobs cannot restore because it's Blob ids > stored in Zookeeper now point to deleted files.
[GitHub] flink issue #2256: [FLINK-4150] [runtime] Don't clean up BlobStore on BlobSe...
Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2256 Just a quick question. Do we want to remove also failed jobs from the BlobStore and ZK? Or only finished or cancelled jobs?
[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters
[ https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379445#comment-15379445 ] ASF GitHub Bot commented on FLINK-4184: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/2220#discussion_r70979472 --- Diff: flink-metrics/flink-metrics-dropwizard/src/main/java/org/apache/flink/dropwizard/ScheduledDropwizardReporter.java --- @@ -74,6 +75,15 @@ protected ScheduledDropwizardReporter() { } // + // Getters + // + + // used for testing purposes + MapgetCounters() { --- End diff -- You could also access the protected registry and get the counters from there. We may not even need the gauges/counters/histograms fields in the ScheduledDropwizardReporter. > Ganglia and GraphiteReporter report metric names with invalid characters > > > Key: FLINK-4184 > URL: https://issues.apache.org/jira/browse/FLINK-4184 > Project: Flink > Issue Type: Bug > Components: Metrics >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.1.0 > > > Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with > names which contain invalid characters. For example, quotes are not filtered > out which can be problematic for Ganglia. Moreover, dots are not replaced > which causes Graphite to think that an IP address is actually a scoped metric > name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request #2220: [FLINK-4184] [metrics] Replace invalid characters ...
Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/2220#discussion_r70979472 --- Diff: flink-metrics/flink-metrics-dropwizard/src/main/java/org/apache/flink/dropwizard/ScheduledDropwizardReporter.java --- @@ -74,6 +75,15 @@ protected ScheduledDropwizardReporter() { } // + // Getters + // + + // used for testing purposes + Map getCounters() { --- End diff -- You could also access the protected registry and get the counters from there. We may not even need the gauges/counters/histograms fields in the ScheduledDropwizardReporter.
[jira] [Closed] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters
[ https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-4184. Resolution: Fixed Fixed via 70094a1818b532f4e0ff31f5debc550ed4336286 > Ganglia and GraphiteReporter report metric names with invalid characters > > > Key: FLINK-4184 > URL: https://issues.apache.org/jira/browse/FLINK-4184 > Project: Flink > Issue Type: Bug > Components: Metrics >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.1.0 > > > Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with > names which contain invalid characters. For example, quotes are not filtered > out which can be problematic for Ganglia. Moreover, dots are not replaced > which causes Graphite to think that an IP address is actually a scoped metric > name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
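The fix quoted above amounts to filtering metric names before they reach the reporters: quotes are problematic for Ganglia, and Graphite treats dots as scope separators, so an IP-address-like component explodes into several metric levels. A minimal sketch of that idea, assuming a hypothetical `MetricNameSanitizer` helper (this is not Flink's actual reporter code, which applies its own filtering rules):

```java
// Hypothetical sketch, NOT Flink's actual implementation: drop quotes
// and replace dots so a name stays one flat reporter component.
public class MetricNameSanitizer {

    public static String sanitize(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (char c : name.toCharArray()) {
            if (c == '"') {
                continue;          // quotes are invalid for Ganglia: drop them
            } else if (c == '.') {
                sb.append('-');    // dots would create Graphite sub-scopes
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An IP-like host scope would otherwise become four Graphite levels.
        System.out.println(sanitize("\"10.0.0.1\".numRecords"));
    }
}
```

With this sketch, `"10.0.0.1".numRecords` becomes a single flat component instead of a nested Graphite path.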
[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters
[ https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379416#comment-15379416 ] ASF GitHub Bot commented on FLINK-4184: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2220 > Ganglia and GraphiteReporter report metric names with invalid characters > > > Key: FLINK-4184 > URL: https://issues.apache.org/jira/browse/FLINK-4184 > Project: Flink > Issue Type: Bug > Components: Metrics >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.1.0 > > > Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with > names which contain invalid characters. For example, quotes are not filtered > out which can be problematic for Ganglia. Moreover, dots are not replaced > which causes Graphite to think that an IP address is actually a scoped metric > name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request #2220: [FLINK-4184] [metrics] Replace invalid characters ...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2220
[GitHub] flink issue #2220: [FLINK-4184] [metrics] Replace invalid characters in Sche...
Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2220 Thanks for the review @zentol. Will be merging this PR then.
[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters
[ https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379409#comment-15379409 ] ASF GitHub Bot commented on FLINK-4184: --- Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2220 Thanks for the review @zentol. Will be merging this PR then. > Ganglia and GraphiteReporter report metric names with invalid characters > > > Key: FLINK-4184 > URL: https://issues.apache.org/jira/browse/FLINK-4184 > Project: Flink > Issue Type: Bug > Components: Metrics >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.1.0 > > > Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with > names which contain invalid characters. For example, quotes are not filtered > out which can be problematic for Ganglia. Moreover, dots are not replaced > which causes Graphite to think that an IP address is actually a scoped metric > name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4184) Ganglia and GraphiteReporter report metric names with invalid characters
[ https://issues.apache.org/jira/browse/FLINK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379407#comment-15379407 ] ASF GitHub Bot commented on FLINK-4184: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/2220#discussion_r70977551 --- Diff: flink-metrics/flink-metrics-dropwizard/src/main/java/org/apache/flink/dropwizard/ScheduledDropwizardReporter.java --- @@ -74,6 +75,15 @@ protected ScheduledDropwizardReporter() { } // + // Getters + // + + // used for testing purposes + MapgetCounters() { --- End diff -- We can, if we mark the counters, gauges and histograms fields as protected. But then we would expose the implementation details to all sub-classes instead of having a getter which is package private. I think the latter option is a bit nicer, because it hides the implementation details. > Ganglia and GraphiteReporter report metric names with invalid characters > > > Key: FLINK-4184 > URL: https://issues.apache.org/jira/browse/FLINK-4184 > Project: Flink > Issue Type: Bug > Components: Metrics >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > Fix For: 1.1.0 > > > Flink's {{GangliaReporter}} and {{GraphiteReporter}} report metrics with > names which contain invalid characters. For example, quotes are not filtered > out which can be problematic for Ganglia. Moreover, dots are not replaced > which causes Graphite to think that an IP address is actually a scoped metric > name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request #2220: [FLINK-4184] [metrics] Replace invalid characters ...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/2220#discussion_r70977551 --- Diff: flink-metrics/flink-metrics-dropwizard/src/main/java/org/apache/flink/dropwizard/ScheduledDropwizardReporter.java --- @@ -74,6 +75,15 @@ protected ScheduledDropwizardReporter() { } // + // Getters + // + + // used for testing purposes + Map getCounters() { --- End diff -- We can, if we mark the counters, gauges and histograms fields as protected. But then we would expose the implementation details to all sub-classes instead of having a getter which is package private. I think the latter option is a bit nicer, because it hides the implementation details.
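The package-private getter discussed in this diff is a common Java pattern: test code in the same package can inspect state, while the field itself stays private and invisible to subclasses. A minimal sketch with illustrative names (`Reporter`, `inc`), not the actual `ScheduledDropwizardReporter` API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a package-private test getter; not Flink code.
public class Reporter {
    // Implementation detail: private, so subclasses cannot depend on it.
    private final Map<String, Long> counters = new HashMap<>();

    public void inc(String name) {
        counters.merge(name, 1L, Long::sum);
    }

    // Package-private: visible to tests in the same package,
    // hidden from subclasses in other packages.
    Map<String, Long> getCounters() {
        return counters;
    }

    public static void main(String[] args) {
        Reporter r = new Reporter();
        r.inc("numRecords");
        r.inc("numRecords");
        System.out.println(r.getCounters().get("numRecords"));
    }
}
```

The alternative raised in the review, making the fields `protected`, would let every subclass reach into the maps; the getter keeps that surface limited to the package.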
[GitHub] flink pull request #2257: [FLINK-4152] Allow re-registration of TMs at resou...
GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2257 [FLINK-4152] Allow re-registration of TMs at resource manager The `YarnFlinkResourceManager` does not allow the `JobManager` to re-register task managers that had been registered at the resource manager before. The consequence of the refusal is that the job manager rejects the registrations of these task managers. Such a scenario can happen if a `JobManager` loses leadership in an HA setting after it registered some TMs. The old behaviour was that the resource manager clears the list of registered workers and only accepts new registrations of task managers which have been started in a fresh container. However, if the previously registered TMs didn't die, they will try to reconnect to the new leader. The new leader will then ask the resource manager whether the TMs represent valid resources. Since the resource manager forgot about the already started containers, it rejects the TMs. This PR changes the behaviour of the resource manager such that it can no longer reject TMs. Instead of being asked, it will simply be informed about the registered TMs by the JM. If the TM happens to be running in a container which was started by the RM, then the RM will monitor this container. If this container dies, the RM will notify the JM about the death of the TM. In that sense, the RM no longer has the authority to interfere with the JM-TM interactions and is instead simply used as an additional monitoring service to detect dead TMs resulting from a failed container. Furthermore, the PR adds a de-duplication mechanism to filter out concurrent registration runs on the task manager. Before, a `RefusedRegistration` triggered a new registration run without cancelling the old one. This could lead to a massive number of registration messages if the TaskManager's registration was refused multiple times.
The mechanism to de-duplicate `TriggerTaskManagerRegistration` works by assigning a registration run id which is changed whenever a new registration run is started. `TriggerTaskManagerRegistration` messages which have an outdated registration run id are then filtered out. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink FLINK-4152_YarnResourceManager Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2257.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2257 commit 2a29fa4df98fd63a7f27e1607152cd6e54d25ad1 Author: Till Rohrmann Date: 2016-07-15T08:51:59Z Add YarnFlinkResourceManager test to reaccept task manager registrations from a re-elected job manager commit 6462d4750d79512fc93bbc60ca754c99142d1794 Author: Till Rohrmann Date: 2016-07-15T09:50:35Z Remove unnecessary sync logic between JobManager and ResourceManager commit 55ce0c01783d9c4927a9f4677309c805e2b624e7 Author: Till Rohrmann Date: 2016-07-15T10:12:12Z Avoid duplicate reigstration attempts in case of a refused registration commit aeb6ae5435dd88c640dc314b9ec31357815d080c Author: Till Rohrmann Date: 2016-07-15T13:04:54Z Add test case to check that not an excessive amount of RegisterTaskManager messages are sent
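The registration-run-id de-duplication described in the PR can be sketched as follows. This is an illustrative model only (the class and method names are invented, not Flink's actual TaskManager message handling): every new registration run gets a fresh id, and trigger messages carrying a stale id are dropped instead of spawning duplicate registration attempts.

```java
import java.util.UUID;

// Illustrative sketch of run-id-based message de-duplication; not Flink code.
public class RegistrationRunFilter {
    private UUID currentRunId;

    // Starting a new run (e.g. after a RefusedRegistration) invalidates
    // all trigger messages belonging to older runs.
    public UUID startNewRun() {
        currentRunId = UUID.randomUUID();
        return currentRunId;
    }

    // A TriggerTaskManagerRegistration-style message is only processed
    // if it carries the id of the current run.
    public boolean accept(UUID messageRunId) {
        return currentRunId != null && currentRunId.equals(messageRunId);
    }

    public static void main(String[] args) {
        RegistrationRunFilter filter = new RegistrationRunFilter();
        UUID firstRun = filter.startNewRun();
        UUID secondRun = filter.startNewRun(); // a refusal started a new run
        System.out.println(filter.accept(firstRun));  // stale trigger: dropped
        System.out.println(filter.accept(secondRun)); // current run: processed
    }
}
```

Because stale triggers are simply discarded, a refused registration can no longer fan out into an ever-growing number of concurrent registration loops.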
[jira] [Commented] (FLINK-4152) TaskManager registration exponential backoff doesn't work
[ https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379396#comment-15379396 ] ASF GitHub Bot commented on FLINK-4152: --- GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2257 [FLINK-4152] Allow re-registration of TMs at resource manager The `YarnFlinkResourceManager` does not allow the `JobManager` to re-register task managers have had been registered at the resource manager before. The consequence of the refusal is that the job manager rejects the registrations of these task managers. Such a scenario can happen if a `JobManager` loses leadership in an HA setting after it registered some TMs. The old behaviour was that the resource manager clears the list of registered workers and only accepts new registrations of task manager which have been started in a fresh container. However, in case that the previously registered TMs didn't die, they will try to reconnect to the new leader. The new leader will then ask the resource manager whether the TMs represent valid resources. Since the resource manager forgot about the already started containers, it rejects the TMs. This PR changes the behaviour of the resource manager such that it can no longer reject TMs. Instead of being asked it will simply be informed about the registered TMs by the JM. If the TM happens to be running in a container which was started by the RM, then it will monitor this container. In case that this container dies, the RM will notify the JM about the death of the TM. In that sense, the RM has no longer the authority to interfere with the JM-TM interactions and instead it is simply used as an additional monitoring service to detect dead TMs as a result of a failed container. Furthermore, the PR adds a de-duplication method to filter out concurrent registration runs on the task manager. 
Before, it happened that a `RefusedRegistration` triggers a new registration run without cancelling the old registration run. This could lead to a massive amount of registration messages if the TaskManager's registration was refused multiple times. The mechanism to de-duplicate `TriggerTaskManagerRegistration` works by assigning a registration run id which is changed whenever a new registration run is started. `TriggerTaskManagerRegistration` messages which have an outdated registration run id are then filtered out. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink FLINK-4152_YarnResourceManager Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2257.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2257 commit 2a29fa4df98fd63a7f27e1607152cd6e54d25ad1 Author: Till RohrmannDate: 2016-07-15T08:51:59Z Add YarnFlinkResourceManager test to reaccept task manager registrations from a re-elected job manager commit 6462d4750d79512fc93bbc60ca754c99142d1794 Author: Till Rohrmann Date: 2016-07-15T09:50:35Z Remove unnecessary sync logic between JobManager and ResourceManager commit 55ce0c01783d9c4927a9f4677309c805e2b624e7 Author: Till Rohrmann Date: 2016-07-15T10:12:12Z Avoid duplicate reigstration attempts in case of a refused registration commit aeb6ae5435dd88c640dc314b9ec31357815d080c Author: Till Rohrmann Date: 2016-07-15T13:04:54Z Add test case to check that not an excessive amount of RegisterTaskManager messages are sent > TaskManager registration exponential backoff doesn't work > - > > Key: FLINK-4152 > URL: https://issues.apache.org/jira/browse/FLINK-4152 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination, TaskManager, YARN Client >Reporter: Robert Metzger >Assignee: Till Rohrmann > Attachments: logs.tgz > > > While testing Flink 1.1 
I've found that the TaskManagers are logging many > messages when registering at the JobManager. > This is the log file: > https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294 > It's logging more than 3000 messages in less than a minute. I don't think that > this is the expected behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4150) Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
[ https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379374#comment-15379374 ] ASF GitHub Bot commented on FLINK-4150: --- GitHub user uce opened a pull request: https://github.com/apache/flink/pull/2256 [FLINK-4150] [runtime] Don't clean up BlobStore on BlobServer shut down The `BlobServer` acts as a local cache for uploaded BLOBs. The life-cycle of each BLOB is bound to the life-cycle of the `BlobServer`. If the BlobServer shuts down (on JobManager shut down), all local files will be removed. With HA, BLOBs are persisted to another file system (e.g. HDFS) via the `BlobStore` in order to have BLOBs available after a JobManager failure (or shut down). These BLOBs are only allowed to be removed when the job that requires them enters a globally terminal state (`FINISHED`, `CANCELLED`, `FAILED`). This commit removes the `BlobStore` clean up call from the `BlobServer` shutdown. The `BlobStore` files will only be cleaned up via the `BlobLibraryCacheManager`'s' clean up task (periodically or on BlobLibraryCacheManager shutdown). This means that there is a chance that BLOBs will linger around after the job has terminated, if the job manager fails before the clean up. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/flink 4150-blobstore Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2256.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2256 commit 0d4522270881dbbb7164130f47f9d4df617c19c5 Author: Ufuk CelebiDate: 2016-07-14T14:29:49Z [FLINK-4150] [runtime] Don't clean up BlobStore on BlobServer shut down The `BlobServer` acts as a local cache for uploaded BLOBs. The life-cycle of each BLOB is bound to the life-cycle of the `BlobServer`. 
If the BlobServer shuts down (on JobManager shut down), all local files will be removed. With HA, BLOBs are persisted to another file system (e.g. HDFS) via the `BlobStore` in order to have BLOBs available after a JobManager failure (or shut down). These BLOBs are only allowed to be removed when the job that requires them enters a globally terminal state (`FINISHED`, `CANCELLED`, `FAILED`). This commit removes the `BlobStore` clean up call from the `BlobServer` shutdown. The `BlobStore` files will only be cleaned up via the `BlobLibraryCacheManager`'s' clean up task (periodically or on BlobLibraryCacheManager shutdown). This means that there is a chance that BLOBs will linger around after the job has terminated, if the job manager fails before the clean up. > Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown > > > Key: FLINK-4150 > URL: https://issues.apache.org/jira/browse/FLINK-4150 > Project: Flink > Issue Type: Bug > Components: Job-Submission >Reporter: Stefan Richter >Assignee: Ufuk Celebi >Priority: Blocker > Fix For: 1.1.0 > > > Submitting a job in Yarn with HA can lead to the following exception: > {code} > org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load > user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09 > ClassLoader info: URL ClassLoader: > file: > '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0' > (invalid JAR: zip file is empty) > Class not resolvable through given classloader. > at > org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588) > at java.lang.Thread.run(Thread.java:745) > {code} > Some job information, including the Blob ids, are stored in Zookeeper. 
The > actual Blobs are stored in a dedicated BlobStore, if the recovery mode is set > to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the > cluster is shut down, the path for the BlobStore is deleted. When the cluster > is then restarted, recovering jobs cannot restore because their Blob ids > stored in Zookeeper now point to deleted files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request #2256: [FLINK-4150] [runtime] Don't clean up BlobStore on...
GitHub user uce opened a pull request: https://github.com/apache/flink/pull/2256 [FLINK-4150] [runtime] Don't clean up BlobStore on BlobServer shut down The `BlobServer` acts as a local cache for uploaded BLOBs. The life-cycle of each BLOB is bound to the life-cycle of the `BlobServer`. If the BlobServer shuts down (on JobManager shut down), all local files will be removed. With HA, BLOBs are persisted to another file system (e.g. HDFS) via the `BlobStore` in order to have BLOBs available after a JobManager failure (or shut down). These BLOBs are only allowed to be removed when the job that requires them enters a globally terminal state (`FINISHED`, `CANCELLED`, `FAILED`). This commit removes the `BlobStore` clean up call from the `BlobServer` shutdown. The `BlobStore` files will only be cleaned up via the `BlobLibraryCacheManager`'s' clean up task (periodically or on BlobLibraryCacheManager shutdown). This means that there is a chance that BLOBs will linger around after the job has terminated, if the job manager fails before the clean up. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/flink 4150-blobstore Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2256.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2256 commit 0d4522270881dbbb7164130f47f9d4df617c19c5 Author: Ufuk CelebiDate: 2016-07-14T14:29:49Z [FLINK-4150] [runtime] Don't clean up BlobStore on BlobServer shut down The `BlobServer` acts as a local cache for uploaded BLOBs. The life-cycle of each BLOB is bound to the life-cycle of the `BlobServer`. If the BlobServer shuts down (on JobManager shut down), all local files will be removed. With HA, BLOBs are persisted to another file system (e.g. HDFS) via the `BlobStore` in order to have BLOBs available after a JobManager failure (or shut down). 
These BLOBs are only allowed to be removed when the job that requires them enters a globally terminal state (`FINISHED`, `CANCELLED`, `FAILED`). This commit removes the `BlobStore` clean up call from the `BlobServer` shutdown. The `BlobStore` files will only be cleaned up via the `BlobLibraryCacheManager`'s clean up task (periodically or on BlobLibraryCacheManager shutdown). This means that there is a chance that BLOBs will linger around after the job has terminated, if the job manager fails before the clean up.
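The two life-cycles described in this PR can be modeled in a few lines: the local BlobServer cache dies with the server, while entries in the HA BlobStore survive until the owning job reaches a globally terminal state. The sketch below is a toy model with invented names (`BlobLifecycle`, `recoverable`), not the actual BlobServer/BlobStore API.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the BlobServer (local cache) vs. BlobStore (HA) life-cycles.
public class BlobLifecycle {
    private final Set<String> localCache = new HashSet<>(); // per-server, e.g. /tmp
    private final Set<String> haStore = new HashSet<>();    // persistent, e.g. HDFS

    public void put(String blobKey) {
        localCache.add(blobKey);
        haStore.add(blobKey);
    }

    // On BlobServer shutdown only the local cache is dropped; the fix in
    // this PR is precisely that the HA store is NOT touched here.
    public void shutdownServer() {
        localCache.clear();
    }

    // HA files are removed only once the owning job is terminal
    // (FINISHED, CANCELLED, FAILED).
    public void jobTerminated(String blobKey) {
        haStore.remove(blobKey);
    }

    public boolean recoverable(String blobKey) {
        return haStore.contains(blobKey);
    }

    public static void main(String[] args) {
        BlobLifecycle blobs = new BlobLifecycle();
        blobs.put("blob-abc");
        blobs.shutdownServer();                            // cluster shut down
        System.out.println(blobs.recoverable("blob-abc")); // HA copy survives
    }
}
```

Under the pre-fix behaviour, `shutdownServer()` would also have cleared `haStore`, which is exactly the FLINK-4150 failure: the blob keys recovered from ZooKeeper would point at deleted files.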
[jira] [Assigned] (FLINK-4150) Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
[ https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi reassigned FLINK-4150: -- Assignee: Ufuk Celebi > Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown > > > Key: FLINK-4150 > URL: https://issues.apache.org/jira/browse/FLINK-4150 > Project: Flink > Issue Type: Bug > Components: Job-Submission >Reporter: Stefan Richter >Assignee: Ufuk Celebi >Priority: Blocker > Fix For: 1.1.0 > > > Submitting a job in Yarn with HA can lead to the following exception: > {code} > org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load > user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09 > ClassLoader info: URL ClassLoader: > file: > '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0' > (invalid JAR: zip file is empty) > Class not resolvable through given classloader. > at > org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588) > at java.lang.Thread.run(Thread.java:745) > {code} > Some job information, including the Blob ids, are stored in Zookeeper. The > actual Blobs are stored in a dedicated BlobStore, if the recovery mode is set > to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the > cluster is shut down, the path for the BlobStore is deleted. When the cluster > is then restarted, recovering jobs cannot restore because it's Blob ids > stored in Zookeeper now point to deleted files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4211) Dynamic Properties not working for jobs submitted to Yarn session
[ https://issues.apache.org/jira/browse/FLINK-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379336#comment-15379336 ] Ufuk Celebi commented on FLINK-4211: The problem is that the YARN session config and the CLI config might diverge, e.g. a user starts a YARN session with {{recovery.zookeeper.path.root=/flink/xyz}}, which is important both for the started cluster and for the client that needs to discover that cluster. I don't think this is a blocker for the release, but it should be addressed. > Dynamic Properties not working for jobs submitted to Yarn session > - > > Key: FLINK-4211 > URL: https://issues.apache.org/jira/browse/FLINK-4211 > Project: Flink > Issue Type: Bug > Components: YARN Client >Reporter: Stefan Richter > > The command line argument for dynamic properties (-D) is not working when > submitting jobs to a Flink session. > Example: > {code} > bin/flink run -p 4 myJob.jar -D recovery.zookeeper.path.root=/flink/xyz > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
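The divergence described in this comment can be illustrated with a toy config overlay: dynamic `-D` properties must override the base config on both the cluster side and the client side, or the two resolve different ZooKeeper root paths and the client cannot discover the cluster. The names below (`ConfigOverlay`, `overlay`) are invented for illustration and are not Flink's configuration API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of why ignored -D properties break HA discovery.
public class ConfigOverlay {

    // Dynamic properties should win over the base config on BOTH sides.
    public static Map<String, String> overlay(Map<String, String> base,
                                              Map<String, String> dynamic) {
        Map<String, String> merged = new HashMap<>(base);
        merged.putAll(dynamic);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("recovery.zookeeper.path.root", "/flink");
        Map<String, String> dynamic = new HashMap<>();
        dynamic.put("recovery.zookeeper.path.root", "/flink/xyz");

        Map<String, String> cluster = overlay(base, dynamic);
        // The bug: the CLI drops the -D property, so the client keeps the base path.
        Map<String, String> client = overlay(base, new HashMap<>());

        // Cluster registers under /flink/xyz, client looks under /flink.
        System.out.println(cluster.get("recovery.zookeeper.path.root")
                .equals(client.get("recovery.zookeeper.path.root")));
    }
}
```

The fix is simply to apply the same overlay in both code paths, so the session and the submitting client agree on every HA-relevant key.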
[jira] [Closed] (FLINK-4142) Recovery problem in HA on Hadoop Yarn 2.4.1
[ https://issues.apache.org/jira/browse/FLINK-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-4142. -- Added note to docs about this in d08b189 (master). > Recovery problem in HA on Hadoop Yarn 2.4.1 > --- > > Key: FLINK-4142 > URL: https://issues.apache.org/jira/browse/FLINK-4142 > Project: Flink > Issue Type: Bug > Components: YARN Client >Affects Versions: 1.0.3 >Reporter: Stefan Richter >Assignee: Robert Metzger > Fix For: 1.1.0 > > > On Hadoop Yarn 2.4.1, recovery in HA fails in the following scenario: > 1) Kill application master, let it recover normally. > 2) After that, kill a task manager. > Now, Yarn tries to restart the killed task manager in an endless loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-4182) HA recovery not working properly under ApplicationMaster failures.
[ https://issues.apache.org/jira/browse/FLINK-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-4182. -- Resolution: Duplicate The quoted stack trace is a duplicate of FLINK-4150. The inconsistent result might be caused by accidentally running multiple jobs. I'm closing this for now. If inconsistencies still occur, we need some more information to re-open this. > HA recovery not working properly under ApplicationMaster failures. > -- > > Key: FLINK-4182 > URL: https://issues.apache.org/jira/browse/FLINK-4182 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination, State Backends, Checkpointing >Affects Versions: 1.0.3 >Reporter: Stefan Richter >Priority: Blocker > > When randomly killing TaskManager and ApplicationMaster, a job sometimes does > not properly recover in HA mode. > There can be different symptoms for this. For example, in one case the job is > dying with the following exception: > {code} > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: The program > execution failed: Cannot set up the user code libraries: Cannot get library > with hash 7fafffe9595cd06aff213b81b5da7b1682e1d6b0 > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:413) > at > org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:208) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:389) > at > org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66) > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1381) > at > da.testing.StreamingStateMachineJob.main(StreamingStateMachineJob.java:61) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:509) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:331) > at > org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:738) > at org.apache.flink.client.CliFrontend.run(CliFrontend.java:251) > at > org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:966) > at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1009) > Caused by: org.apache.flink.runtime.client.JobSubmissionException: Cannot set > up the user code libraries: Cannot get library with hash > 7fafffe9595cd06aff213b81b5da7b1682e1d6b0 > at > org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1089) > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:506) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:105) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at
[jira] [Commented] (FLINK-4218) Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." causes task restarting
[ https://issues.apache.org/jira/browse/FLINK-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379330#comment-15379330 ] Sergii Koshel commented on FLINK-4218: -- According to https://github.com/Aloisius/hadoop-s3a/blob/master/src/main/java/org/apache/hadoop/fs/s3a/S3AOutputStream.java, *close()* means the data has been uploaded to the S3 bucket. And given *read-after-write* consistency, it should then be visible from everywhere. So I still don't see how a *java.io.FileNotFoundException* can happen after *close()*. > Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." > causes task restarting > -- > > Key: FLINK-4218 > URL: https://issues.apache.org/jira/browse/FLINK-4218 > Project: Flink > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sergii Koshel > > Sporadically see exception as below. And restart of task because of it. > {code:title=Exception|borderStyle=solid} > java.lang.RuntimeException: Error triggering a checkpoint as the result of > receiving checkpoint barrier > at > org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:785) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:775) > at > org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:203) > at > org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:129) > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:183) > at > org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:66) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:265) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.FileNotFoundException: No such file or directory: > 
s3:///flink/checkpoints/ece317c26960464ba5de75f3bbc38cb2/chk-8810/96eebbeb-de14-45c7-8ebb-e7cde978d6d3 > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:996) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:351) > at > org.apache.flink.runtime.state.filesystem.AbstractFileStateHandle.getFileSize(AbstractFileStateHandle.java:93) > at > org.apache.flink.runtime.state.filesystem.FileStreamStateHandle.getStateSize(FileStreamStateHandle.java:58) > at > org.apache.flink.runtime.state.AbstractStateBackend$DataInputViewHandle.getStateSize(AbstractStateBackend.java:482) > at > org.apache.flink.streaming.runtime.tasks.StreamTaskStateList.getStateSize(StreamTaskStateList.java:77) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:604) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:779) > ... 8 more > {code} > File actually exists on S3. > I suppose it is related to some race conditions with S3 but would be good to > retry a few times before stop task execution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink issue #2251: [FLINK-4212] [scripts] Lock PID file when starting daemon...
Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2251 Looks like it is available in Linux and BSD but not MacOS. We can skip locking if `flock` is not available. The race condition only manifests when starting duplicate managers on the same system in parallel. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-4212) Lock PID file when starting daemons
[ https://issues.apache.org/jira/browse/FLINK-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379313#comment-15379313 ] ASF GitHub Bot commented on FLINK-4212: --- Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2251 Looks like it is available in Linux and BSD but not MacOS. We can skip locking if `flock` is not available. The race condition only manifests when starting duplicate managers on the same system in parallel. > Lock PID file when starting daemons > --- > > Key: FLINK-4212 > URL: https://issues.apache.org/jira/browse/FLINK-4212 > Project: Flink > Issue Type: Improvement > Components: Startup Shell Scripts >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > > As noted on the mailing list (0), when multiple TaskManagers are started in > parallel (using pdsh) there is a race condition on updating the pid: 1) the > pid file is first read to parse the process' index, 2) the process is > started, and 3) on success the daemon pid is appended to the pid file. > We could use a tool such as {{flock}} to lock on the pid file while starting > the Flink daemon. > 0: > http://mail-archives.apache.org/mod_mbox/flink-user/201607.mbox/%3CCA%2BssbKXw954Bz_sBRwP6db0FntWyGWzTyP7wJZ5nhOeQnof3kg%40mail.gmail.com%3E
[jira] [Commented] (FLINK-4212) Lock PID file when starting daemons
[ https://issues.apache.org/jira/browse/FLINK-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379288#comment-15379288 ] ASF GitHub Bot commented on FLINK-4212: --- Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2251 Is the `flock` command available by default on the common UNIX-style OSs? Ubuntu has it by default, what about MacOS?
[GitHub] flink issue #2251: [FLINK-4212] [scripts] Lock PID file when starting daemon...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2251 Is the `flock` command available by default on the common UNIX-style OSs? Ubuntu has it by default, what about MacOS?
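The locking approach discussed in this thread can be sketched roughly as follows. This is an illustration, not Flink's actual bin/ scripts: the pid-file location, the `append_pid` helper, and the index bookkeeping are invented for the example. Only the pattern reflects the discussion: one exclusive `flock` around the whole read-then-append sequence, with an unlocked fallback where `flock` is unavailable (e.g. stock macOS), accepting the race in that case.

```shell
#!/usr/bin/env bash
set -u

# Demo pid file; the real scripts would use the configured pid directory.
PID_FILE="${PID_FILE:-$(mktemp)}"

append_pid() {
    local pid=$1
    if command -v flock >/dev/null 2>&1; then
        # flock(1) holds an exclusive lock on fd 200 for the whole
        # read-then-append sequence, closing the race between daemons
        # started in parallel (e.g. via pdsh).
        (
            flock 200
            local index
            index=$(grep -c '' "$PID_FILE")   # current number of pid entries
            echo "$pid" >> "$PID_FILE"
            echo "daemon #$index has pid $pid"
        ) 200>>"$PID_FILE"
    else
        # No flock on this platform: fall back to an unlocked append and
        # accept the (small) race when daemons start concurrently.
        echo "$pid" >> "$PID_FILE"
    fi
}

append_pid "$$"
```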
[jira] [Commented] (FLINK-4210) Move close()/isClosed() out of MetricGroup interface
[ https://issues.apache.org/jira/browse/FLINK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379276#comment-15379276 ] Chesnay Schepler commented on FLINK-4210: - Initially there was a separate group for them; however, since some features are implemented with user-defined functions as well, this proved to be problematic. Some time has passed since then, though, and I don't know how valid this concern still is. > Move close()/isClosed() out of MetricGroup interface > > > Key: FLINK-4210 > URL: https://issues.apache.org/jira/browse/FLINK-4210 > Project: Flink > Issue Type: Improvement > Components: Metrics >Affects Versions: 1.1.0 >Reporter: Chesnay Schepler >Assignee: Chesnay Schepler >Priority: Minor > Fix For: 1.1.0 > > > The (user-facing) MetricGroup interface currently exposes a close() and > isClosed() method which generally users shouldn't need to call. They are an > internal thing, and thus should be moved into the AbstractMetricGroup class.
[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379270#comment-15379270 ] ASF GitHub Bot commented on FLINK-3466: --- Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2252 Thanks, I'll address your comments and merge this... > Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.0.0, 0.10.2 >Reporter: Robert Metzger >Assignee: Stephan Ewen > > A user reported the following issue with a failing job: > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) > 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800) > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848) > java.io.DataInputStream.read(DataInputStream.java:149) > org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69) > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) > java.io.ObjectInputStream.(ObjectInputStream.java:299) > org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.(InstantiationUtil.java:55) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > java.lang.Thread.run(Thread.java:745) > {code} > and > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.(Throwable.java:250) > java.lang.Exception.(Exception.java:54) > java.lang.InterruptedException.(InterruptedException.java:57) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038) > 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) >
[GitHub] flink issue #2252: [FLINK-3466] [runtime] Cancel state handled on state rest...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2252 Thanks, I'll address your comments and merge this...
[jira] [Commented] (FLINK-3397) Failed streaming jobs should fall back to the most recent checkpoint/savepoint
[ https://issues.apache.org/jira/browse/FLINK-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379266#comment-15379266 ] ramkrishna.s.vasudevan commented on FLINK-3397: --- [~uce] Just a gentle reminder, in case you have had some time to give feedback. If you are in the middle of a release, sorry to bother you; I will wait some more time. Thanks. > Failed streaming jobs should fall back to the most recent checkpoint/savepoint > -- > > Key: FLINK-3397 > URL: https://issues.apache.org/jira/browse/FLINK-3397 > Project: Flink > Issue Type: Improvement > Components: State Backends, Checkpointing, Streaming >Affects Versions: 1.0.0 >Reporter: Gyula Fora >Priority: Minor > Attachments: FLINK-3397.pdf > > > The current fallback behaviour in case of a streaming job failure is slightly > counterintuitive: > If a job fails it will fall back to the most recent checkpoint (if any) even > if a more recent savepoint was taken. This means that savepoints are not > regarded as checkpoints by the system, only as points from where a job can be > manually restarted. > I suggest to change this so that savepoints are also regarded as checkpoints > in case of a failure and they will also be used to automatically restore the > streaming job.
[jira] [Commented] (FLINK-4210) Move close()/isClosed() out of MetricGroup interface
[ https://issues.apache.org/jira/browse/FLINK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379258#comment-15379258 ] Stephan Ewen commented on FLINK-4210: - So, the Flink runtime metrics for an operator are in the same group as the user's metrics? I thought the user metrics were a subgroup of the operator group - wouldn't that make sense? Aside from that, we expose close() on so many things (like the collectors) and it has never been an issue.
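The refactoring proposed in FLINK-4210 can be sketched as below. This is illustrative only: the names mirror the discussion (MetricGroup, AbstractMetricGroup), but the signatures are simplified and are not Flink's actual metrics API; the concrete subclass is invented so the sketch can be exercised.

```java
// User-facing interface: only registration methods, no lifecycle methods.
interface MetricGroup {
    void counter(String name);
}

// Abstract base class: close()/isClosed() live here, with package-private
// visibility, so the runtime can manage the lifecycle while user code only
// ever sees the MetricGroup interface.
abstract class AbstractMetricGroup implements MetricGroup {
    private volatile boolean closed;

    @Override
    public void counter(String name) {
        if (closed) {
            return; // silently ignore registrations on a closed group
        }
        // ... register the metric with the configured reporters ...
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

// Minimal concrete subclass (invented for the example).
final class GenericMetricGroupSketch extends AbstractMetricGroup {}
```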
[jira] [Commented] (FLINK-4218) Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." causes task restarting
[ https://issues.apache.org/jira/browse/FLINK-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379254#comment-15379254 ] Stephan Ewen commented on FLINK-4218: - The read-after-write consistency seems to hold only for new keys. The check for the parent directory may actually be considered a modification/update operation and hence fall under eventual consistency. Concerning the S3AFileSystem: do you know whether a close() call ensures that the data is properly written? In HDFS, close() returns successfully only once the data is persistent.
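The retry suggested in FLINK-4218 ("retry a few times before stopping task execution") could look roughly like the helper below. This is a hypothetical sketch, not Flink code: the class name, attempt count, and linear backoff are invented for the example; only the idea of tolerating a few transient FileNotFoundExceptions from an eventually consistent store is taken from the issue.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

// Hypothetical helper: retry an action that may transiently fail with
// FileNotFoundException on an eventually consistent store such as S3.
final class EventualConsistencyRetry {

    private EventualConsistencyRetry() {}

    static <T> T retry(Callable<T> action, int maxAttempts, long backoffMillis)
            throws Exception {
        FileNotFoundException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (FileNotFoundException e) {
                // Likely an eventual-consistency artifact: back off and retry.
                last = e;
                Thread.sleep(backoffMillis * attempt);
            }
        }
        throw last != null ? last : new IllegalArgumentException("maxAttempts < 1");
    }
}
```

A caller would wrap the flaky lookup, e.g. a file-status check, in the `Callable` and let the final exception propagate only after all attempts fail.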
[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379249#comment-15379249 ] ASF GitHub Bot commented on FLINK-3466: --- Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/2252#discussion_r70956861 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/AbstractFileStateHandle.java --- @@ -20,18 +20,21 @@ import org.apache.flink.core.fs.FileSystem; import org.apache.flink.core.fs.Path; +import org.apache.flink.runtime.state.AbstractCloseableHandle; +import org.apache.flink.runtime.state.StateObject; import java.io.IOException; +import java.io.Serializable; import static java.util.Objects.requireNonNull; /** * Base class for state that is stored in a file. */ -public abstract class AbstractFileStateHandle implements java.io.Serializable { - +public abstract class AbstractFileStateHandle extends AbstractCloseableHandle implements StateObject, Serializable { --- End diff -- True, will remove this. 
[GitHub] flink pull request #2252: [FLINK-3466] [runtime] Cancel state handled on sta...
Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/2252#discussion_r70956813 --- Diff: flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/tasks/InterruptSensitiveRestoreTest.java --- @@ -0,0 +1,223 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.flink.streaming.runtime.tasks; + +import org.apache.flink.api.common.ExecutionConfig; +import org.apache.flink.api.common.JobID; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.core.testutils.OneShotLatch; +import org.apache.flink.runtime.blob.BlobKey; +import org.apache.flink.runtime.broadcast.BroadcastVariableManager; +import org.apache.flink.runtime.deployment.InputGateDeploymentDescriptor; +import org.apache.flink.runtime.deployment.ResultPartitionDeploymentDescriptor; +import org.apache.flink.runtime.deployment.TaskDeploymentDescriptor; +import org.apache.flink.runtime.execution.ExecutionState; +import org.apache.flink.runtime.execution.librarycache.FallbackLibraryCacheManager; +import org.apache.flink.runtime.executiongraph.ExecutionAttemptID; +import org.apache.flink.runtime.filecache.FileCache; +import org.apache.flink.runtime.instance.ActorGateway; +import org.apache.flink.runtime.io.disk.iomanager.IOManager; +import org.apache.flink.runtime.io.network.NetworkEnvironment; +import org.apache.flink.runtime.jobgraph.JobVertexID; +import org.apache.flink.runtime.memory.MemoryManager; +import org.apache.flink.runtime.operators.testutils.UnregisteredTaskMetricsGroup; +import org.apache.flink.runtime.state.StateHandle; +import org.apache.flink.runtime.taskmanager.Task; +import org.apache.flink.runtime.taskmanager.TaskManagerRuntimeInfo; +import org.apache.flink.runtime.util.EnvironmentInformation; +import org.apache.flink.runtime.util.SerializableObject; +import org.apache.flink.streaming.api.TimeCharacteristic; +import org.apache.flink.streaming.api.checkpoint.Checkpointed; +import org.apache.flink.streaming.api.functions.source.SourceFunction; +import org.apache.flink.streaming.api.graph.StreamConfig; +import org.apache.flink.streaming.api.operators.StreamSource; +import org.apache.flink.util.SerializedValue; + +import org.junit.Test; + +import scala.concurrent.duration.FiniteDuration; + +import 
java.io.IOException; +import java.io.Serializable; +import java.net.URL; +import java.util.Collections; +import java.util.concurrent.TimeUnit; + +import static org.junit.Assert.*; +import static org.mockito.Mockito.*; + +/** + * This test checks that task restores that get stuck in the presence of interrupts + * are handled properly. + * + * In practice, reading from HDFS is interrupt sensitive: The HDFS code frequently deadlocks + * or livelocks if it is interrupted. + */ +public class InterruptSensitiveRestoreTest { + + private static final OneShotLatch IN_RESTORE_LATCH = new OneShotLatch(); + + @Test + public void testRestoreWithInterrupt() throws Exception { + + Configuration taskConfig = new Configuration(); + StreamConfig cfg = new StreamConfig(taskConfig); + cfg.setTimeCharacteristic(TimeCharacteristic.ProcessingTime); + cfg.setStreamOperator(new StreamSource<>(new TestSource())); + + StateHandle lockingHandle = new InterruptLockingStateHandle(); + StreamTaskState opState = new StreamTaskState(); + opState.setFunctionState(lockingHandle); + StreamTaskStateList taskState = new StreamTaskStateList(new StreamTaskState[] { opState }); + + TaskDeploymentDescriptor tdd = createTaskDeploymentDescriptor(taskConfig, taskState); + Task task = createTask(tdd); + + // start the task and wait until it is in "restore" + task.startTaskThread(); +
[GitHub] flink pull request #2252: [FLINK-3466] [runtime] Cancel state handled on sta...
Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/2252#discussion_r70956861 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/AbstractFileStateHandle.java --- @@ -20,18 +20,21 @@ import org.apache.flink.core.fs.FileSystem; import org.apache.flink.core.fs.Path; +import org.apache.flink.runtime.state.AbstractCloseableHandle; +import org.apache.flink.runtime.state.StateObject; import java.io.IOException; +import java.io.Serializable; import static java.util.Objects.requireNonNull; /** * Base class for state that is stored in a file. */ -public abstract class AbstractFileStateHandle implements java.io.Serializable { - +public abstract class AbstractFileStateHandle extends AbstractCloseableHandle implements StateObject, Serializable { --- End diff -- True, will remove this.
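The behavior the PR describes for AbstractCloseableHandle can be sketched as follows. This is a rough approximation under stated assumptions: it uses plain synchronization where the actual class in the PR imports an AtomicIntegerFieldUpdater, and the class and method names here are invented, not Flink's. The point of the pattern is that closing the handle also closes any registered stream, which unblocks a reader stuck in an interrupt-insensitive read (as with HDFS), and that a stream registered after closing is closed immediately.

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of a closeable handle: close() calls are delegated to a registered
// stream, and registration after close fails fast.
class CloseableHandleSketch implements Closeable {

    private Closeable delegate;
    private boolean closed;

    synchronized void registerCloseable(Closeable c) throws IOException {
        if (closed) {
            c.close(); // handle already closed: close the stream eagerly
            throw new IOException("handle is closed");
        }
        this.delegate = c;
    }

    @Override
    public synchronized void close() throws IOException {
        closed = true;
        if (delegate != null) {
            // Closing the stream unblocks a reader stuck inside it, which is
            // the purpose of the fix for restores that ignore interrupts.
            delegate.close();
        }
    }

    synchronized boolean isClosed() {
        return closed;
    }
}
```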
[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379248#comment-15379248 ] ASF GitHub Bot commented on FLINK-3466: --- Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/2252#discussion_r70956828 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/state/AbstractCloseableHandle.java --- @@ -0,0 +1,131 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.runtime.state; + +import java.io.Closeable; +import java.io.IOException; +import java.io.Serializable; +import java.util.concurrent.atomic.AtomicIntegerFieldUpdater; + +/** + * A simple base for closable handles. + * + * Offers to register a stream (or other closable object) that close calls are delegated to if + * the handel is closed or was already closed. --- End diff -- Thanks, will fix it. 
> Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.0.0, 0.10.2 >Reporter: Robert Metzger >Assignee: Stephan Ewen > > A user reported the following issue with a failing job: > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800) > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848) > java.io.DataInputStream.read(DataInputStream.java:149) > org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69) > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > 
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) > java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) > org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:55) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) >
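For context, the handle class reviewed above registers a stream so that closing the handle (for instance on cancellation) also closes the stream, and so that a registration after close fails fast. A hedged sketch of that delegation pattern, with illustrative names rather than the actual Flink implementation:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Illustrative sketch of a closeable handle that delegates close() to a
// registered stream; not the actual AbstractCloseableHandle code.
class SketchCloseableHandle implements Closeable {

    private static final AtomicIntegerFieldUpdater<SketchCloseableHandle> CLOSER =
            AtomicIntegerFieldUpdater.newUpdater(SketchCloseableHandle.class, "closed");

    private volatile int closed;           // 0 = open, 1 = closed
    private volatile Closeable toClose;    // the registered stream, if any

    public void registerCloseable(Closeable c) throws IOException {
        toClose = c;
        // If the handle was already closed (or closed concurrently),
        // close the stream immediately instead of leaking it.
        if (closed != 0) {
            c.close();
            throw new IOException("handle already closed");
        }
    }

    @Override
    public void close() throws IOException {
        // Only the first close() call delegates to the registered stream.
        if (CLOSER.compareAndSet(this, 0, 1)) {
            Closeable c = toClose;
            if (c != null) {
                c.close();
            }
        }
    }
}

public class CloseableHandleDemo {
    static int closeCalls = 0;

    public static void main(String[] args) throws IOException {
        SketchCloseableHandle handle = new SketchCloseableHandle();
        handle.registerCloseable(() -> closeCalls++);
        handle.close();
        handle.close();                    // idempotent: second close is a no-op
        System.out.println(closeCalls);    // expect 1
    }
}
```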
[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379247#comment-15379247 ] ASF GitHub Bot commented on FLINK-3466: --- Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/2252#discussion_r70956813 --- Diff: flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/tasks/InterruptSensitiveRestoreTest.java --- @@ -0,0 +1,223 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.flink.streaming.runtime.tasks; + +import org.apache.flink.api.common.ExecutionConfig; +import org.apache.flink.api.common.JobID; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.core.testutils.OneShotLatch; +import org.apache.flink.runtime.blob.BlobKey; +import org.apache.flink.runtime.broadcast.BroadcastVariableManager; +import org.apache.flink.runtime.deployment.InputGateDeploymentDescriptor; +import org.apache.flink.runtime.deployment.ResultPartitionDeploymentDescriptor; +import org.apache.flink.runtime.deployment.TaskDeploymentDescriptor; +import org.apache.flink.runtime.execution.ExecutionState; +import org.apache.flink.runtime.execution.librarycache.FallbackLibraryCacheManager; +import org.apache.flink.runtime.executiongraph.ExecutionAttemptID; +import org.apache.flink.runtime.filecache.FileCache; +import org.apache.flink.runtime.instance.ActorGateway; +import org.apache.flink.runtime.io.disk.iomanager.IOManager; +import org.apache.flink.runtime.io.network.NetworkEnvironment; +import org.apache.flink.runtime.jobgraph.JobVertexID; +import org.apache.flink.runtime.memory.MemoryManager; +import org.apache.flink.runtime.operators.testutils.UnregisteredTaskMetricsGroup; +import org.apache.flink.runtime.state.StateHandle; +import org.apache.flink.runtime.taskmanager.Task; +import org.apache.flink.runtime.taskmanager.TaskManagerRuntimeInfo; +import org.apache.flink.runtime.util.EnvironmentInformation; +import org.apache.flink.runtime.util.SerializableObject; +import org.apache.flink.streaming.api.TimeCharacteristic; +import org.apache.flink.streaming.api.checkpoint.Checkpointed; +import org.apache.flink.streaming.api.functions.source.SourceFunction; +import org.apache.flink.streaming.api.graph.StreamConfig; +import org.apache.flink.streaming.api.operators.StreamSource; +import org.apache.flink.util.SerializedValue; + +import org.junit.Test; + +import scala.concurrent.duration.FiniteDuration; + +import 
java.io.IOException; +import java.io.Serializable; +import java.net.URL; +import java.util.Collections; +import java.util.concurrent.TimeUnit; + +import static org.junit.Assert.*; +import static org.mockito.Mockito.*; + +/** + * This test checks that task restores that get stuck in the presence of interrupts + * are handled properly. + * + * In practice, reading from HDFS is interrupt sensitive: The HDFS code frequently deadlocks + * or livelocks if it is interrupted. + */ +public class InterruptSensitiveRestoreTest { + + private static final OneShotLatch IN_RESTORE_LATCH = new OneShotLatch(); + + @Test + public void testRestoreWithInterrupt() throws Exception { + + Configuration taskConfig = new Configuration(); + StreamConfig cfg = new StreamConfig(taskConfig); + cfg.setTimeCharacteristic(TimeCharacteristic.ProcessingTime); + cfg.setStreamOperator(new StreamSource<>(new TestSource())); + + StateHandle lockingHandle = new InterruptLockingStateHandle(); + StreamTaskState opState = new StreamTaskState(); + opState.setFunctionState(lockingHandle); + StreamTaskStateList taskState = new StreamTaskStateList(new StreamTaskState[] { opState }); + + TaskDeploymentDescriptor tdd =
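The test above verifies that a restore blocked inside an interrupt-sensitive read is unblocked by the cancel interrupt. The underlying pattern can be sketched in plain Java, with `CountDownLatch` standing in for Flink's `OneShotLatch` (this is a simplified model, not the actual InterruptSensitiveRestoreTest):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch: a "restore" blocks until interrupted, a latch signals that the
// thread has entered the blocking section, and the test then cancels
// (interrupts) and checks that the thread terminates.
public class InterruptibleRestoreSketch {

    static final CountDownLatch IN_RESTORE = new CountDownLatch(1);

    static void blockingRestore() {
        IN_RESTORE.countDown();            // "I am now inside restore"
        try {
            Thread.sleep(Long.MAX_VALUE);  // stands in for a blocking HDFS read
        } catch (InterruptedException expected) {
            // a well-behaved restore reacts to the cancel interrupt and returns
        }
    }

    public static void main(String[] args) throws Exception {
        Thread restorer = new Thread(InterruptibleRestoreSketch::blockingRestore);
        restorer.start();

        IN_RESTORE.await();                // wait until the thread is inside restore
        restorer.interrupt();              // the "cancel" signal
        restorer.join(TimeUnit.SECONDS.toMillis(10));

        System.out.println(restorer.isAlive() ? "stuck" : "terminated");
    }
}
```

The HDFS code paths in the stack traces above are exactly the kind of blocking section that, unlike this sketch, does not always react to the interrupt, which is what the fix guards against.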
[jira] [Commented] (FLINK-3466) Job might get stuck in restoreState() from HDFS due to interrupt
[ https://issues.apache.org/jira/browse/FLINK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379245#comment-15379245 ] ASF GitHub Bot commented on FLINK-3466: --- Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/2252#discussion_r70956526 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/state/memory/AbstractMemStateSnapshot.java --- @@ -54,6 +56,8 @@ /** The serialized data of the state key/value pairs */ private final byte[] data; + + private transient boolean closed; --- End diff -- I think it is not crucial to have a strict barrier here. If the reading thread eventually notices the flag, it is enough. And since volatile accesses are much more expensive, I wanted to avoid that. > Job might get stuck in restoreState() from HDFS due to interrupt > > > Key: FLINK-3466 > URL: https://issues.apache.org/jira/browse/FLINK-3466 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.0.0, 0.10.2 >Reporter: Robert Metzger >Assignee: Stephan Ewen > > A user reported the following issue with a failing job: > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is stuck > in method: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1979) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1016) > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:477) > 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783) > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717) > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421) > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332) > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576) > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800) > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848) > java.io.DataInputStream.read(DataInputStream.java:149) > org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:69) > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794) > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801) > java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) > org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:55) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:52) > org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35) > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:440) > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) > org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > java.lang.Thread.run(Thread.java:745) > {code} > and > {code} > 10:46:09,223 WARN org.apache.flink.runtime.taskmanager.Task >- Task 'XXX -> YYY (3/5)' did not react to cancelling signal, but is
stuck > in method: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.<init>(Throwable.java:250) > java.lang.Exception.<init>(Exception.java:54) > java.lang.InterruptedException.<init>(InterruptedException.java:57) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2038) > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) >
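The `transient boolean closed` exchange above is about memory semantics: `volatile` guarantees the reading thread sees the flag promptly, at a per-access cost, while a plain field relies on the write eventually becoming visible. The pattern being discussed, a long-running read loop that checks a cancellation flag between chunks, can be sketched as follows (illustrative code, using `volatile` for the conservative choice; not the actual AbstractMemStateSnapshot):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a cancellable read loop: the flag is checked between reads so a
// concurrent close() can abort a long-running deserialization.
public class CancellableReadSketch {

    private volatile boolean closed;   // the field whose memory semantics are debated above

    public void close() {
        closed = true;
    }

    public int readAll(InputStream in) throws IOException {
        byte[] buf = new byte[4];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            // Check the cancellation flag between chunks; with volatile this
            // is promptly visible, with a plain field only eventually.
            if (closed) {
                throw new IOException("stream closed during read");
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        CancellableReadSketch s = new CancellableReadSketch();
        int total = s.readAll(new ByteArrayInputStream(new byte[10]));
        System.out.println(total);     // expect 10
    }
}
```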
[jira] [Commented] (FLINK-4192) Move Metrics API to separate module
[ https://issues.apache.org/jira/browse/FLINK-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379224#comment-15379224 ] ASF GitHub Bot commented on FLINK-4192: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/2226 @StephanEwen I've changed the `MetricConfig` to extend `Properties` and removed the `setString()` method. I've kept the other methods for now. > Move Metrics API to separate module > --- > > Key: FLINK-4192 > URL: https://issues.apache.org/jira/browse/FLINK-4192 > Project: Flink > Issue Type: Improvement > Components: Metrics >Affects Versions: 1.1.0 >Reporter: Chesnay Schepler >Assignee: Chesnay Schepler > Fix For: 1.1.0 > > > All metrics code currently resides in flink-core. If a user implements a > reporter and wants a fat jar it will now have to include the entire > flink-core module. > Instead, we could move several interfaces into a separate module. > These interfaces to move include: > * Counter, Gauge, Histogram(Statistics) > * MetricGroup > * MetricReporter, Scheduled, AbstractReporter > In addition a new MetricRegistry interface will be required as well as a > replacement for the Configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
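The change described above, making the reporter config extend `Properties`, means string values come from the inherited `getProperty()`/`setProperty()` and only typed convenience getters need to be added. A hedged sketch of the idea (the class name and getter are illustrative, not Flink's actual MetricConfig API):

```java
import java.util.Properties;

// Sketch: extending Properties removes the need for a custom setString(),
// since setProperty()/getProperty() are inherited.
class MetricConfigSketch extends Properties {

    // Typed convenience getter with a default value.
    public int getInteger(String key, int defaultValue) {
        String value = getProperty(key);
        return value == null ? defaultValue : Integer.parseInt(value);
    }
}

public class MetricConfigDemo {
    public static void main(String[] args) {
        MetricConfigSketch cfg = new MetricConfigSketch();
        cfg.setProperty("interval", "30");            // inherited from Properties
        System.out.println(cfg.getInteger("interval", 10));   // prints 30
        System.out.println(cfg.getInteger("missing", 10));    // prints 10 (default)
    }
}
```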
[jira] [Commented] (FLINK-4053) Return value from Connection should be checked against null
[ https://issues.apache.org/jira/browse/FLINK-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379216#comment-15379216 ] ASF GitHub Bot commented on FLINK-4053: --- Github user mushketyk commented on the issue: https://github.com/apache/flink/pull/2128 Thank you @zentol ! > Return value from Connection should be checked against null > --- > > Key: FLINK-4053 > URL: https://issues.apache.org/jira/browse/FLINK-4053 > Project: Flink > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ivan Mushketyk >Priority: Minor > Fix For: 1.1.0 > > > In RMQSource.java and RMQSink.java, there is code in the following pattern: > {code} > connection = factory.newConnection(); > channel = connection.createChannel(); > {code} > According to > https://www.rabbitmq.com/releases/rabbitmq-java-client/current-javadoc/com/rabbitmq/client/Connection.html#createChannel() > : > {code} > Returns: > a new channel descriptor, or null if none is available > {code} > The return value should be checked against null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
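The fix requested in FLINK-4053 is to treat a null return from `createChannel()` as an error at connection time rather than failing later with a NullPointerException. A minimal sketch, with local stand-in interfaces in place of the real `com.rabbitmq.client` types:

```java
import java.io.IOException;

// Stand-ins for the RabbitMQ client types; only the shape needed here.
interface Channel {}

interface Connection {
    Channel createChannel() throws IOException;   // may return null per the RabbitMQ javadoc
}

public class NullChannelCheckDemo {

    // Checks the createChannel() return value, as the ticket asks for.
    static Channel openChannel(Connection connection) throws IOException {
        Channel channel = connection.createChannel();
        if (channel == null) {
            throw new IOException("No RabbitMQ channel available");
        }
        return channel;
    }

    public static void main(String[] args) {
        Connection exhausted = () -> null;   // a connection with no channels left
        try {
            openChannel(exhausted);
            System.out.println("no exception");
        } catch (IOException e) {
            System.out.println(e.getMessage());   // prints the descriptive error
        }
    }
}
```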