[jira] [Resolved] (FLINK-1546) Failed job causes JobManager to shutdown due to uncatched WebFrontend exception
[ https://issues.apache.org/jira/browse/FLINK-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1546. - Resolution: Fixed Fix Version/s: 0.9 Job Archiving was fixed in 8ae0dc2d768aecfa3129df553f43d827792b65d7 Failed job causes JobManager to shutdown due to uncatched WebFrontend exception --- Key: FLINK-1546 URL: https://issues.apache.org/jira/browse/FLINK-1546 Project: Flink Issue Type: Bug Components: JobManager Affects Versions: 0.9 Reporter: Robert Metzger Fix For: 0.9 {code} 16:59:26,588 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - Status of job ef19b2b201d4b81f031334cb76eadc78 (Basic Page Rank Example) changed to FAILEDCleanup job ef19b2b201d4b81f031334cb76eadc78.. 16:59:26,591 ERROR akka.actor.OneForOneStrategy - Can only archive the job from a terminal state java.lang.IllegalStateException: Can only archive the job from a terminal state at org.apache.flink.runtime.executiongraph.ExecutionGraph.prepareForArchiving(ExecutionGraph.java:648) at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$removeJob(JobManager.scala:508) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:271) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.yarn.YarnJobManager$$anonfun$receiveYarnMessages$1.applyOrElse(YarnJobManager.scala:70) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at 
org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:86) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) at akka.dispatch.Mailbox.run(Mailbox.scala:221) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 16:59:26,595 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - Stopping webserver. 16:59:26,654 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - Stopped webserver. 16:59:26,656 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - Stopping job manager akka://flink/user/jobmanager. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1635) Remove Apache Thrift dependency from Flink
[ https://issues.apache.org/jira/browse/FLINK-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366023#comment-14366023 ] Stephan Ewen commented on FLINK-1635: - [~rmetzger] This is removed and fixed, if I understand correctly? Remove Apache Thrift dependency from Flink -- Key: FLINK-1635 URL: https://issues.apache.org/jira/browse/FLINK-1635 Project: Flink Issue Type: Bug Components: Java API Affects Versions: 0.9 Reporter: Robert Metzger I've added Thrift and Protobuf to Flink to support them out of the box with Kryo. However, after trying to access an HCatalog/Hive table yesterday using Flink, I found that there is a dependency conflict between Flink and Hive (on Thrift). Maybe it makes more sense to properly document our serialization framework and provide a copy/paste solution on how to get Thrift/Protobuf et al. to work with Flink. Please chime in if you are against removing the out-of-the-box support for protobuf and kryo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1584) Spurious failure of TaskManagerFailsITCase
[ https://issues.apache.org/jira/browse/FLINK-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1584. - Resolution: Fixed Fixed with the switch to the newer akka version (enabled by shading away conflicting dependencies) 84e76f4d3274e07176f7377b7b739b6f180c6296 Spurious failure of TaskManagerFailsITCase -- Key: FLINK-1584 URL: https://issues.apache.org/jira/browse/FLINK-1584 Project: Flink Issue Type: Bug Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 0.9 The {{TaskManagerFailsITCase}} fails spuriously on Travis. The reason might be that different test cases try to access the same {{JobManager}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1500) exampleScalaPrograms.EnumTriangleOptITCase does not finish on Travis
[ https://issues.apache.org/jira/browse/FLINK-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366001#comment-14366001 ] Stephan Ewen commented on FLINK-1500: - Have we seen this again, or was this an artifact of one of the bugs we fixed in the past few weeks (like the intermediate result partition lookup)? exampleScalaPrograms.EnumTriangleOptITCase does not finish on Travis Key: FLINK-1500 URL: https://issues.apache.org/jira/browse/FLINK-1500 Project: Flink Issue Type: Bug Reporter: Till Rohrmann The test case org.apache.flink.test.exampleScalaPrograms.EnumTriangleOptITCase does not finish on Travis. This problem is non-deterministic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1535) Use usercode class loader to serialize/deserialize accumulators
[ https://issues.apache.org/jira/browse/FLINK-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-1535: Priority: Blocker (was: Major) Use usercode class loader to serialize/deserialize accumulators --- Key: FLINK-1535 URL: https://issues.apache.org/jira/browse/FLINK-1535 Project: Flink Issue Type: Bug Components: JobManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Blocker Fix For: 0.9 Currently, accumulators are transferred via simple Akka Messages. Since the accumulators may be user defined types, we should use the user code class loader for code loading when deserializing them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
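The comment above proposes deserializing accumulators with the user-code class loader. A minimal, Flink-independent sketch of the standard technique (a hypothetical helper, not Flink's actual code) is an `ObjectInputStream` that overrides `resolveClass` to consult the given class loader first:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

// Hypothetical helper illustrating deserialization against a chosen class
// loader, as the issue suggests for user-defined accumulator types.
public class ClassLoaderObjectInputStream extends ObjectInputStream {
    private final ClassLoader classLoader;

    public ClassLoaderObjectInputStream(InputStream in, ClassLoader classLoader) throws IOException {
        super(in);
        this.classLoader = classLoader;
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc) throws IOException, ClassNotFoundException {
        // Resolve classes against the user-code class loader first,
        // falling back to the default resolution on failure.
        try {
            return Class.forName(desc.getName(), false, classLoader);
        } catch (ClassNotFoundException e) {
            return super.resolveClass(desc);
        }
    }
}
```

Without such an override, `ObjectInputStream` resolves classes via the caller's class loader, which cannot see classes shipped only in the user's job jar.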
[jira] [Resolved] (FLINK-1459) Collect DataSet to client
[ https://issues.apache.org/jira/browse/FLINK-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1459. - Resolution: Fixed Fix Version/s: 0.9 Assignee: Stephan Ewen This has been implemented a while back, actually. Sorry for the late update. Implemented in 3dc2fe1dc300146e5209023274c0b0d04277f9ee Collect DataSet to client - Key: FLINK-1459 URL: https://issues.apache.org/jira/browse/FLINK-1459 Project: Flink Issue Type: Improvement Reporter: John Sandiford Assignee: Stephan Ewen Fix For: 0.9 Hi, I may well have missed something obvious here but I cannot find an easy way to extract the values in a DataSet to the client. Spark has collect, collectAsMap etc... (I need to pass the values from a small aggregated DataSet back to a machine learning library which is controlling the iterations.) The only way I could find to do this was to implement my own in-memory OutputFormat. This is not ideal, but does work. Many thanks, John val env = ExecutionEnvironment.getExecutionEnvironment val data: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0, 4.0) val result = data.reduce((a, b) => a) val valuesOnClient = result.??? env.execute("Simple example") -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1467) Job deployment fails with NPE on JobManager, if TMs did not start properly
[ https://issues.apache.org/jira/browse/FLINK-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366026#comment-14366026 ] Stephan Ewen commented on FLINK-1467: - The null-pointer exception has been fixed in one of the TaskManager / Akka exception reworks. The fix for the root cause (TaskManagers fail fast when memory initialization fails) is part of [FLINK-1580]. I am closing this as a duplicate. Job deployment fails with NPE on JobManager, if TMs did not start properly -- Key: FLINK-1467 URL: https://issues.apache.org/jira/browse/FLINK-1467 Project: Flink Issue Type: Bug Components: JobManager Reporter: Robert Metzger I have a Flink cluster started where all TaskManagers died (misconfiguration). The JobManager needs more than 200 seconds to realize that (on the TaskManagers overview, you see timeouts > 200). When submitting a job, you'll get the following exception: {code} org.apache.flink.client.program.ProgramInvocationException: The program execution failed: java.lang.Exception: Failed to deploy the task CHAIN DataSource (Generator: class io.airlift.tpch.NationGenerator) -> Map (Map at writeAsFormattedText(DataSet.java:1132)) (1/1) - execution #0 to slot SubSlot 0 (f8d11026ec5a11f0b273184c74ec4f29 (0) - ALLOCATED/ALIVE): java.lang.NullPointerException at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$submitTask(TaskManager.scala:346) at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:248) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.yarn.YarnTaskManager$$anonfun$receiveYarnMessages$1.applyOrElse(YarnTaskManager.scala:32) at 
scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:41) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:27) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:27) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:78) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) at akka.dispatch.Mailbox.run(Mailbox.scala:221) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) at org.apache.flink.runtime.executiongraph.Execution$2.onComplete(Execution.java:311) at akka.dispatch.OnComplete.internal(Future.scala:247) at akka.dispatch.OnComplete.internal(Future.scala:244) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) at org.apache.flink.client.program.Client.run(Client.java:345) at 
org.apache.flink.client.program.Client.run(Client.java:304) at org.apache.flink.client.program.Client.run(Client.java:298) at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) at flink.generators.programs.TPCHGenerator.main(TPCHGenerator.java:80) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
[jira] [Resolved] (FLINK-947) Add support for Named Datasets
[ https://issues.apache.org/jira/browse/FLINK-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-947. Resolution: Fixed Fix Version/s: 0.9 Merged under {{flink-staging/flink-expressions}} Add support for Named Datasets Key: FLINK-947 URL: https://issues.apache.org/jira/browse/FLINK-947 Project: Flink Issue Type: New Feature Components: Java API Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek Priority: Minor Fix For: 0.9 This would create an API that is a mix between SQL-like declarativity and the power of user-defined functions. Example user code could look like this: {code:Java} NamedDataSet one = ... NamedDataSet two = ... NamedDataSet result = one.join(two).where("key").equalTo("otherKey") .project("a", "b", "c") .map( (UserTypeIn in) -> new UserTypeOut(...) ) .print(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1018) Logistic Regression deadlocks
[ https://issues.apache.org/jira/browse/FLINK-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1018. - Resolution: Fixed Fix Version/s: 0.9 Fixed via 9c77f0785e43326521da5e535f9ab1f05a9c6280 Logistic Regression deadlocks - Key: FLINK-1018 URL: https://issues.apache.org/jira/browse/FLINK-1018 Project: Flink Issue Type: Bug Reporter: Markus Holzemer Fix For: 0.9 Attachments: LogisticRegression.java We are currently running our implementation of logistic regression with batch gradient descent on the cluster. Unfortunately, for datasets > 1GB it seems to deadlock inside the iteration. This means the first iteration is never finished. The iteration does a map over all points; the map gets the iteration input as a broadcast variable. The result of the map is reduced, and the result of the reducer (1 tuple) is crossed with the iteration input. There should be no reason for the deadlock, since the data is still quite small compared to the cluster size (4 nodes à 32GB). Also, the data size stays constant throughout the algorithm. Here is the generated plan. I will also attach the full algorithm. {code} { nodes: [ { id: 2, type: source, pact: Data Source, contents: [([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0., parallelism: 1, subtasks_per_instance: 1, global_properties: [ { name: Partitioning, value: RANDOM }, { name: Partitioning Order, value: (none) }, { name: Uniqueness, value: not unique } ], local_properties: [ { name: Order, value: (none) }, { name: Grouping, value: not grouped }, { name: Uniqueness, value: not unique } ], estimates: [ { name: Est. Output Size, value: (unknown) }, { name: Est. 
Cardinality, value: (unknown) } ], costs: [ { name: Network, value: 0.0 B }, { name: Disk I/O, value: 0.0 B }, { name: CPU, value: 0.0 }, { name: Cumulative Network, value: 0.0 B }, { name: Cumulative Disk I/O, value: 0.0 B }, { name: Cumulative CPU, value: 0.0 } ], compiler_hints: [ { name: Output Size (bytes), value: (none) }, { name: Output Cardinality, value: (none) }, { name: Avg. Output Record Size (bytes), value: (none) }, { name: Filter Factor, value: (none) } ] }, { step_function: [ { id: 8, type: source, pact: Data Source, contents: TextInputFormat (hdfs://cloud-7:45010/tmp/input/higgs.M.txt) - UTF-8, parallelism: 64, subtasks_per_instance: 16, global_properties: [ { name: Partitioning, value: RANDOM }, { name: Partitioning Order, value: (none) }, { name: Uniqueness, value: not unique } ], local_properties: [ { name: Order, value: (none) }, { name: Grouping, value: not grouped }, { name: Uniqueness, value: not unique } ], estimates: [ { name: Est. Output Size, value: 8.0.31 GB }, { name: Est. Cardinality, value: 109.90 M } ], costs: [ { name: Network, value: 0.0 B }, { name: Disk I/O, value: 8.0.31 GB }, { name: CPU, value: 0.0 }, { name: Cumulative Network, value: 0.0 B }, { name: Cumulative Disk I/O, value: 8.0.31 GB }, { name: Cumulative CPU, value: 0.0 } ], compiler_hints: [ { name: Output Size (bytes), value: (none) }, { name: Output Cardinality, value: (none) }, { name: Avg. Output Record Size (bytes), value: (none) }, { name: Filter Factor, value: (none) } ] }, { id: 7, type: pact, pact: Map, contents: de.tu_berlin.impro3.stratosphere.classification.logreg.LogisticRegression$6,
[jira] [Resolved] (FLINK-952) TypeExtractor requires the argument types of the UDF to be identical to the parameter types
[ https://issues.apache.org/jira/browse/FLINK-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-952. Resolution: Fixed Fix Version/s: 0.9 Fixed with the introduction of the subclass aware pojo and generic type serializers. TypeExtractor requires the argument types of the UDF to be identical to the parameter types --- Key: FLINK-952 URL: https://issues.apache.org/jira/browse/FLINK-952 Project: Flink Issue Type: Bug Components: Local Runtime Reporter: Till Rohrmann Fix For: 0.9 The TypeExtractor checks for each operation whether the DataSet element types are valid arguments for the UDF. However, it checks for strict equality instead of a subtype relationship. Thus the following computation would not work even though it should semantically be correct. DataSet[B].map(new MapFunction[A,A](){ A map(A x)}) with B being a sub class of A. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
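The bug above boils down to validating UDF argument types with strict equality instead of assignability. In plain Java terms (placeholder classes `A`/`B` taken from the issue text, not Flink's actual TypeExtractor code), the difference is roughly:

```java
// Illustrates the distinction the issue describes: a strict-equality check
// rejects subclasses, while Class.isAssignableFrom accepts them.
public class TypeCheckSketch {
    static class A {}
    static class B extends A {}

    // The overly strict check: declared parameter type must equal the element type.
    static boolean strictCheck(Class<?> declared, Class<?> actual) {
        return declared.equals(actual);
    }

    // The subtype-aware check that makes DataSet[B].map(MapFunction[A, A]) valid.
    static boolean subtypeAwareCheck(Class<?> declared, Class<?> actual) {
        return declared.isAssignableFrom(actual);
    }
}
```

With `B` a subclass of `A`, `strictCheck(A.class, B.class)` is `false` while `subtypeAwareCheck(A.class, B.class)` is `true`, which is exactly the semantically correct behavior the issue asks for.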
[jira] [Resolved] (FLINK-1090) Join deadlocks when used inside Delta iteration
[ https://issues.apache.org/jira/browse/FLINK-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1090. - Resolution: Fixed Fix Version/s: 0.9 Solved via 9c77f0785e43326521da5e535f9ab1f05a9c6280 Join deadlocks when used inside Delta iteration --- Key: FLINK-1090 URL: https://issues.apache.org/jira/browse/FLINK-1090 Project: Flink Issue Type: Bug Components: Distributed Runtime, Iterations, Scala API Affects Versions: 0.6-incubating Environment: Ubuntu 14.04, Flink 0.6-incubating LocalExecutor JVM 1.7 with 7 GB RAM assigned Reporter: Stefan Bunk Fix For: 0.9 I have a join inside a delta iteration, which hangs, i.e. I think it's deadlocked. If I do the join without a delta iteration, it works. _Why I think it's a deadlock_: - no output in the logs - CPU idles - no IO (measured using iotop) - stacktrace (when starting in debug mode and stopping at arbitrary points) looks deadlockish !http://i.imgur.com/4TgSK3x.png! _Join properties_: - size of the operands: 6.1 GB, 257 MB - estimated result size: 50 MB - the deadlock only occurs for big inputs; if I decrease the size of the first operand to something smaller, e.g. 1MB, it works. I am using the Scala API. Let me know which further information you need. The code is basically the one I posted on the [mailing list|http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Delta-Iteration-Runtime-Error-quot-Could-not-set-up-runtime-strategy-for-input-channel-to-node-quot-td6.html], but I could provide a compilable version if that's necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1088) Iteration head deadlock
[ https://issues.apache.org/jira/browse/FLINK-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1088. - Resolution: Fixed Fix Version/s: 0.9 Fixed via 9c77f0785e43326521da5e535f9ab1f05a9c6280 Iteration head deadlock --- Key: FLINK-1088 URL: https://issues.apache.org/jira/browse/FLINK-1088 Project: Flink Issue Type: Bug Components: Iterations Affects Versions: 0.7.0-incubating Reporter: Márton Balassi Fix For: 0.9 Flink hangs up for an iterative algorithm for which Stratosphere 0.5 was working. For the code please check out the following repo: https://github.com/mbalassi/als-comparison The stacktrace includes the following on Brokers: Join(Sends the rows of p with multiple keys)) (1/1) daemon prio=10 tid=0x7f8928014800 nid=0x998 waiting on condition [0x7f8912eed000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007d2668ea0 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:374) at org.apache.flink.runtime.iterative.concurrent.Broker.get(Broker.java:63) at org.apache.flink.runtime.iterative.task.IterationIntermediatePactTask.run(IterationIntermediatePactTask.java:84) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:375) at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:265) at java.lang.Thread.run(Thread.java:744) This part waits for the iteration head which has not been started yet and thus induces a deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
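The `Broker.get` call in the stack trace above hands a value from the iteration head to the intermediate task via a blocking queue. A simplified sketch of that handover pattern (not Flink's actual `Broker`, just the idea) makes the failure mode visible: if the head task is never started, `get()` parks forever.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified handover broker: get() blocks until another task (the
// iteration head) calls hand(). A head that is never scheduled leaves
// every consumer parked in take() -- the deadlock described above.
public class HandoverBroker<V> {
    private final BlockingQueue<V> mediation = new ArrayBlockingQueue<>(1);

    public void hand(V value) throws InterruptedException {
        mediation.put(value);
    }

    public V get() throws InterruptedException {
        // Parks the calling thread until the head hands over a value.
        return mediation.take();
    }
}
```

In the reported run, the intermediate task reached `get()` while the head vertex had not been deployed yet, so the blocking `take()` never returned.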
[jira] [Created] (FLINK-1756) Rename Stream Monitoring to Stream Checkpointing
Stephan Ewen created FLINK-1756: --- Summary: Rename Stream Monitoring to Stream Checkpointing Key: FLINK-1756 URL: https://issues.apache.org/jira/browse/FLINK-1756 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 Currently, to enable the streaming checkpointing, you have to set monitoring on. I vote to call it checkpointing, because that describes it better and is more intuitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1710) Expression API Tests take very long
[ https://issues.apache.org/jira/browse/FLINK-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371133#comment-14371133 ] Stephan Ewen commented on FLINK-1710: - Maybe there are some known tweaks / best practices to - speed up compilation - serialize generated code and integrate it into class loaders Expression API Tests take very long --- Key: FLINK-1710 URL: https://issues.apache.org/jira/browse/FLINK-1710 Project: Flink Issue Type: Bug Components: Expression API Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Aljoscha Krettek Fix For: 0.9 The tests of the Expression API take an immense amount of time, compared to the other API tests. Is that because they execute on large (generated) data sets, because the program compilation overhead is high, or because there is an inefficiency in the execution still? Running org.apache.flink.api.scala.expressions.AggregationsITCase Running org.apache.flink.api.scala.expressions.SelectITCase Running org.apache.flink.api.scala.expressions.AsITCase Running org.apache.flink.api.scala.expressions.StringExpressionsITCase Running org.apache.flink.api.scala.expressions.CastingITCase Running org.apache.flink.api.scala.expressions.JoinITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 34.652 sec - in org.apache.flink.api.scala.expressions.AsITCase Running org.apache.flink.api.scala.expressions.GroupedAggreagationsITCase Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 55.797 sec - in org.apache.flink.api.scala.expressions.StringExpressionsITCase Running org.apache.flink.api.scala.expressions.PageRankExpressionITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 63.072 sec - in org.apache.flink.api.scala.expressions.SelectITCase Running org.apache.flink.api.scala.expressions.ExpressionsITCase Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 65.628 sec - in org.apache.flink.api.scala.expressions.CastingITCase Running 
org.apache.flink.api.scala.expressions.FilterITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 74.174 sec - in org.apache.flink.api.scala.expressions.AggregationsITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 93.878 sec - in org.apache.flink.api.scala.expressions.JoinITCase Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 63.4 sec - in org.apache.flink.api.scala.expressions.GroupedAggreagationsITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 4, Time elapsed: 44.179 sec - in org.apache.flink.api.scala.expressions.FilterITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 53.801 sec - in org.apache.flink.api.scala.expressions.ExpressionsITCase Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 124.365 sec - in org.apache.flink.api.scala.expressions.PageRankExpressionITCase -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1430) Add test for streaming scala api completeness
[ https://issues.apache.org/jira/browse/FLINK-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371256#comment-14371256 ] Stephan Ewen commented on FLINK-1430: - The completeness test is to ensure that the Java and Scala API are in sync, not the Batch and the Streaming API. So, I think we should include these utility methods that you mentioned. Add test for streaming scala api completeness - Key: FLINK-1430 URL: https://issues.apache.org/jira/browse/FLINK-1430 Project: Flink Issue Type: Test Components: Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Mingliang Qi Currently the completeness of the streaming scala api is not tested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
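The idea of such a completeness test is to compare, via reflection, the public methods of one API class against its counterpart and flag anything missing. A rough stdlib-only sketch of that mechanism (the class pair is a placeholder for e.g. the Java and Scala DataStream classes; this is not Flink's actual test):

```java
import java.lang.reflect.Method;
import java.util.HashSet;
import java.util.Set;

// Rough sketch of an API completeness check: collect the public method
// names of a reference class and report those the counterpart lacks.
public class CompletenessCheckSketch {
    static Set<String> publicMethodNames(Class<?> clazz) {
        Set<String> names = new HashSet<>();
        for (Method m : clazz.getMethods()) {
            names.add(m.getName());
        }
        return names;
    }

    // Method names present on `reference` but absent from `counterpart`.
    static Set<String> missingIn(Class<?> reference, Class<?> counterpart) {
        Set<String> missing = publicMethodNames(reference);
        missing.removeAll(publicMethodNames(counterpart));
        return missing;
    }
}
```

A real test would additionally compare parameter types and maintain an explicit exclusion list for methods that intentionally differ between the two APIs.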
[jira] [Commented] (FLINK-1759) Execution statistics for vertex-centric iterations
[ https://issues.apache.org/jira/browse/FLINK-1759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371300#comment-14371300 ] Stephan Ewen commented on FLINK-1759: - Robert has started an effort to integrate more profiling into the system, and I am working (side project, will bring it to the mailing list soon) on getting a new, extended version of the web frontend in that visualizes all that information. As part of that, we should be able to display that information. Execution statistics for vertex-centric iterations -- Key: FLINK-1759 URL: https://issues.apache.org/jira/browse/FLINK-1759 Project: Flink Issue Type: Improvement Components: Gelly Affects Versions: 0.9 Reporter: Vasia Kalavri Assignee: Andra Lungu Priority: Minor It would be nice to add an option for gathering execution statistics from VertexCentricIteration. In particular, the following metrics could be useful: - total number of supersteps - number of messages sent (total / per superstep) - bytes of messages exchanged (total / per superstep) - execution time (total / per superstep) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1760) Add support for building Flink with Scala 2.11
[ https://issues.apache.org/jira/browse/FLINK-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1760. - Resolution: Fixed Fixed via 2cd5e93daa9dc7b1e024ec7c1f1fc665f953510a Add support for building Flink with Scala 2.11 -- Key: FLINK-1760 URL: https://issues.apache.org/jira/browse/FLINK-1760 Project: Flink Issue Type: Improvement Components: Build System Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Alexander Alexandrov Fix For: 0.9 Pull request https://github.com/apache/flink/pull/477 is implementing this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1762) Make Statefulness of a Streaming Function explicit
Stephan Ewen created FLINK-1762: --- Summary: Make Statefulness of a Streaming Function explicit Key: FLINK-1762 URL: https://issues.apache.org/jira/browse/FLINK-1762 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Currently, the state of streaming functions is stored in the {{StreamingRuntimeContext}}. That is rather inexplicit: a function may or may not make use of the state. This also hides from the system whether a function is stateful or not. How about we make this explicit by letting stateful functions extend a special interface (see below)? That would allow the stream graph to already know which functions are stateful. Certain vertices would not participate in the checkpointing, if they only contain stateless functions. We can set up the ExecutionGraph to expect confirmations only from the participating vertices, saving messages. {code} public interface Statehandle { get, put, ... } public interface Stateful { void setStateHandle(Statehandle handle); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-974) TaskManager startup script does not respect JAVA_HOME
[ https://issues.apache.org/jira/browse/FLINK-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-974. Resolution: Invalid This is invalid. The TaskManager takes JAVA_HOME from its environment. In a local setup (ssh to localhost), this may be different from the user environment, but this is expected and intended. TaskManager startup script does not respect JAVA_HOME - Key: FLINK-974 URL: https://issues.apache.org/jira/browse/FLINK-974 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.6-incubating Reporter: Stephan Ewen Assignee: Ufuk Celebi Labels: Starter The TaskManager startup script does not respect JAVA_HOME, while the JobManager startup script does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1720) Integrate ScalaDoc in Scala sources into overall JavaDoc
[ https://issues.apache.org/jira/browse/FLINK-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1720. - Resolution: Fixed Fixed via ce39c190a2befa837fd03436a8420d40edf8d713 Integrate ScalaDoc in Scala sources into overall JavaDoc Key: FLINK-1720 URL: https://issues.apache.org/jira/browse/FLINK-1720 Project: Flink Issue Type: Improvement Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1760) Add support for building Flink with Scala 2.11
Stephan Ewen created FLINK-1760: --- Summary: Add support for building Flink with Scala 2.11 Key: FLINK-1760 URL: https://issues.apache.org/jira/browse/FLINK-1760 Project: Flink Issue Type: Improvement Components: Build System Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Alexander Alexandrov Fix For: 0.9 Pull request https://github.com/apache/flink/pull/477 is implementing this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1764) Rework record copying logic in streaming API
[ https://issues.apache.org/jira/browse/FLINK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371673#comment-14371673 ] Stephan Ewen commented on FLINK-1764: - The chained filter invokable does that in my case: once in the {{collect()}} method, once in the {{filter()}} method. Rework record copying logic in streaming API Key: FLINK-1764 URL: https://issues.apache.org/jira/browse/FLINK-1764 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen The logic for chained tasks in the streaming API does a lot of copying of records. In some cases, a record is copied multiple times before being passed to a function. This seems unnecessary, in the general case. In any case, multiple copies seem incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1764) Rework record copying logic in streaming API
[ https://issues.apache.org/jira/browse/FLINK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371706#comment-14371706 ] Stephan Ewen commented on FLINK-1764: - I think that very much depends on the contract that you define: - After a source function gave a record to the collector, should it be guaranteed to still be the same? If you do not promise that, you need not copy. - Do you want to guarantee that the value emitted by a map function is never changed? That is only ever a problem anyways if the MapFunction retains a reference to that value (by storing it in a list or so). I am unsure whether always copying is a good way to go. The initial use cases here all use very small records (often with immutable types anyways) where copying comes cheap. As soon as someone uses heavier objects, this can be pretty heavy on the performance. I am curious whether we can avoid that by making the copies optional. It can be either on or off by default. Rework record copying logic in streaming API Key: FLINK-1764 URL: https://issues.apache.org/jira/browse/FLINK-1764 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen The logic for chained tasks in the streaming API does a lot of copying of records. In some cases, a record is copied multiple times before being passed to a function. This seems unnecessary in the general case. In any case, multiple copies seem incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
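The optional-copy idea in the comment above can be sketched outside Flink. The following is a hypothetical illustration (not Flink's chaining code) of a chain that makes a defensive copy before each operator only when copying is switched on:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: an operator chain that copies records between
// chained operators only when object copying is enabled. Illustrative
// only; this is not Flink's actual chaining implementation.
public class CopyChain {
    public static final class Record {
        public int value;
        public Record(int value) { this.value = value; }
        public Record copy() { return new Record(value); }
    }

    private final boolean copyBetweenOperators;
    private final List<UnaryOperator<Record>> operators = new ArrayList<>();

    public CopyChain(boolean copyBetweenOperators) {
        this.copyBetweenOperators = copyBetweenOperators;
    }

    public CopyChain add(UnaryOperator<Record> op) {
        operators.add(op);
        return this;
    }

    // Pushes a record through the chain, copying before each operator
    // only if copies were requested; otherwise the same instance flows
    // through (cheap, but mutations become visible upstream).
    public Record process(Record record) {
        for (UnaryOperator<Record> op : operators) {
            record = op.apply(copyBetweenOperators ? record.copy() : record);
        }
        return record;
    }
}
```

With copying on, a mutating operator cannot change the record the source still holds; with copying off, heavier objects avoid the per-operator copy cost.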
[jira] [Created] (FLINK-1765) Reducer grouping is skipped when parallelism is one
Stephan Ewen created FLINK-1765: --- Summary: Reducer grouping is skipped when parallelism is one Key: FLINK-1765 URL: https://issues.apache.org/jira/browse/FLINK-1765 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 This program (note the parallelism) incorrectly runs a non-grouped reduce and fails with a NullPointerException. {code} StreamExecutionEnvironment env = ... env.setDegreeOfParallelism(1); DataStream<String> stream = env.addSource(...); stream .filter(...) .map(...) .groupBy(someField) .reduce(new ReduceFunction() {...} ) .addSink(...); env.execute(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
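For reference, the grouped reduce that the program above should run, even at parallelism 1, keeps one running aggregate per key. A plain-Java model of that semantics (illustrative only, not Flink's implementation):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java model of a keyed rolling reduce: one running aggregate per
// key, updated on every record. This is the semantics the grouped reduce
// above must preserve regardless of parallelism; not Flink's actual code.
public class KeyedReduce<K, V> {
    private final Map<K, V> state = new HashMap<>();
    private final BinaryOperator<V> reducer;

    public KeyedReduce(BinaryOperator<V> reducer) {
        this.reducer = reducer;
    }

    // Folds the value into the key's running aggregate and returns the
    // updated aggregate (the value a keyed reduce would emit downstream).
    public V apply(K key, V value) {
        return state.merge(key, value, reducer);
    }
}
```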
[jira] [Commented] (FLINK-1710) Expression API Tests take very long
[ https://issues.apache.org/jira/browse/FLINK-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371543#comment-14371543 ] Stephan Ewen commented on FLINK-1710: - That would be a major rewrite, I guess? Is this part not dependent on certain Scala compiler features? Expression API Tests take very long --- Key: FLINK-1710 URL: https://issues.apache.org/jira/browse/FLINK-1710 Project: Flink Issue Type: Bug Components: Expression API Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Aljoscha Krettek Fix For: 0.9 The tests of the Expression API take an immense amount of time, compared to the other API tests. Is that because they execute on large (generated) data sets, because the program compilation overhead is high, or because there is an inefficiency in the execution still? Running org.apache.flink.api.scala.expressions.AggregationsITCase Running org.apache.flink.api.scala.expressions.SelectITCase Running org.apache.flink.api.scala.expressions.AsITCase Running org.apache.flink.api.scala.expressions.StringExpressionsITCase Running org.apache.flink.api.scala.expressions.CastingITCase Running org.apache.flink.api.scala.expressions.JoinITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 34.652 sec - in org.apache.flink.api.scala.expressions.AsITCase Running org.apache.flink.api.scala.expressions.GroupedAggreagationsITCase Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 55.797 sec - in org.apache.flink.api.scala.expressions.StringExpressionsITCase Running org.apache.flink.api.scala.expressions.PageRankExpressionITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 63.072 sec - in org.apache.flink.api.scala.expressions.SelectITCase Running org.apache.flink.api.scala.expressions.ExpressionsITCase Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 65.628 sec - in org.apache.flink.api.scala.expressions.CastingITCase Running 
org.apache.flink.api.scala.expressions.FilterITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 74.174 sec - in org.apache.flink.api.scala.expressions.AggregationsITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 93.878 sec - in org.apache.flink.api.scala.expressions.JoinITCase Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 63.4 sec - in org.apache.flink.api.scala.expressions.GroupedAggreagationsITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 4, Time elapsed: 44.179 sec - in org.apache.flink.api.scala.expressions.FilterITCase Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 53.801 sec - in org.apache.flink.api.scala.expressions.ExpressionsITCase Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 124.365 sec - in org.apache.flink.api.scala.expressions.PageRankExpressionITCase -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1763) Remove cancel from streaming SinkFunction
Stephan Ewen created FLINK-1763: --- Summary: Remove cancel from streaming SinkFunction Key: FLINK-1763 URL: https://issues.apache.org/jira/browse/FLINK-1763 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 Since the streaming sink function is called individually for each record, it does not require a {{cancel()}} function. The system can cancel between calls to that function (which it cannot do for the source function). Removing this method removes the need to always implement the unnecessary, and usually empty, method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1764) Rework record copying logic in streaming API
Stephan Ewen created FLINK-1764: --- Summary: Rework record copying logic in streaming API Key: FLINK-1764 URL: https://issues.apache.org/jira/browse/FLINK-1764 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen The logic for chained tasks in the streaming API does a lot of copying of records. In some cases, a record is copied multiple times before being passed to a function. This seems unnecessary, in the general case. In any case, multiple copies seem incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1668) Add a config option to specify delays between restarts
Stephan Ewen created FLINK-1668: --- Summary: Add a config option to specify delays between restarts Key: FLINK-1668 URL: https://issues.apache.org/jira/browse/FLINK-1668 Project: Flink Issue Type: Improvement Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The system currently introduces a short delay between a failed task execution and the restarted execution. The reason is that this delay seemed to help in letting the problems surface that led to the failed task. As an example, if a TaskManager fails, tasks fail due to data transfer errors. The TaskManager is not immediately recognized as failed, though (it takes a bit until heartbeats time out). Immediately re-deploying tasks has a very high chance of assigning work to the TaskManager that is actually not responding, causing the execution retry to fail again. The delay gives the system time to figure out that the TaskManager was lost and does not take it into account upon the retry. Currently, the system uses the heartbeat timeout as the default delay value. This may make sense as a default value for critical task failures, but is actually quite high for other types of failures. In any case, I would like to add an option for users to specify the delay (even set it to 0, if desired). The delay is not the best solution, in my opinion; we should eventually move to something better. Ideas are to put TaskManagers responsible for failed tasks into a probationary mode until they have reported back that everything is good (still alive, disk space available, etc.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
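The proposed option could look roughly like this. The config key name and class below are invented for illustration; the only behavior taken from the issue is defaulting to the heartbeat timeout and allowing a delay of 0:

```java
import java.util.Properties;

// Hypothetical sketch of a configurable restart delay. The key name
// "execution-retries.delay" is invented here; only the default-to-
// heartbeat-timeout behavior comes from the issue text.
public class RestartPolicy {
    public static final String DELAY_KEY = "execution-retries.delay";

    private final long delayMillis;

    public RestartPolicy(Properties config, long heartbeatTimeoutMillis) {
        // Default to the heartbeat timeout, mirroring the current behavior.
        this.delayMillis = Long.parseLong(
                config.getProperty(DELAY_KEY, Long.toString(heartbeatTimeoutMillis)));
    }

    public long delayMillis() {
        return delayMillis;
    }

    // Waits out the configured delay (giving failure detection time to
    // catch up) before re-deploying; a delay of 0 restarts immediately.
    public void restart(Runnable redeploy) throws InterruptedException {
        if (delayMillis > 0) {
            Thread.sleep(delayMillis);
        }
        redeploy.run();
    }
}
```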
[jira] [Created] (FLINK-1667) Add tests for recovery with distributed process failure
Stephan Ewen created FLINK-1667: --- Summary: Add tests for recovery with distributed process failure Key: FLINK-1667 URL: https://issues.apache.org/jira/browse/FLINK-1667 Project: Flink Issue Type: Improvement Components: test Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The system should have a test that actually spawns multiple TaskManager processes (JVMs) and tests how recovery works when one of them is killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1690) ProcessFailureBatchRecoveryITCase.testTaskManagerProcessFailure spuriously fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357146#comment-14357146 ] Stephan Ewen commented on FLINK-1690: - I will create a patch with increased timeout... ProcessFailureBatchRecoveryITCase.testTaskManagerProcessFailure spuriously fails on Travis -- Key: FLINK-1690 URL: https://issues.apache.org/jira/browse/FLINK-1690 Project: Flink Issue Type: Bug Reporter: Till Rohrmann Priority: Minor I got the following error on Travis. {code} ProcessFailureBatchRecoveryITCase.testTaskManagerProcessFailure:244 The program did not finish in time {code} I think we have to increase the timeouts for this test case to make it reliably run on Travis. The log of the failed Travis build can be found [here|https://api.travis-ci.org/jobs/53952486/log.txt?deansi=true] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1691) Improve CountCollectITCase
Stephan Ewen created FLINK-1691: --- Summary: Improve CountCollectITCase Key: FLINK-1691 URL: https://issues.apache.org/jira/browse/FLINK-1691 Project: Flink Issue Type: Bug Components: test Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Maximilian Michels Fix For: 0.9 The CountCollectITCase logs heavily and does not reuse the same cluster across multiple tests. Both can be addressed by letting it extend the MultipleProgramsTestBase. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1671) Add execution modes for programs
Stephan Ewen created FLINK-1671: --- Summary: Add execution modes for programs Key: FLINK-1671 URL: https://issues.apache.org/jira/browse/FLINK-1671 Project: Flink Issue Type: Bug Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 Currently, there is a single way that programs get executed: Pipelined. With the new code for batch shuffles (https://github.com/apache/flink/pull/471), we have much more flexibility and I would like to expose that. I suggest to add more execution modes that can be chosen on the `ExecutionEnvironment`: - {{BATCH}} A mode where every shuffle is executed in a batch way, meaning preceding operators must be done before successors start. Only for the batch programs (d'oh). - {{PIPELINED}} This is the mode corresponding to the current execution mode. It pipelines where possible and batches, where deadlocks would otherwise happen. Initially, I would make this the default (be close to the current behavior). Only available for batch programs. - {{PIPELINED_WITH_BATCH_FALLBACK}} This would start out with pipelining shuffles and fall back to batch shuffles upon failure and recovery, or once it sees that not enough slots are available to bring up all operators at once (requirement for pipelining). - {{STREAMING}} This is the default and only way for streaming programs. All communication is pipelined, and the special streaming checkpointing code is activated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
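The four proposed modes, written out as a plain enum. The constant names follow the issue text; the helper method is an illustrative addition:

```java
// The execution modes proposed in FLINK-1671 as an enum. The constants
// are named as in the issue; the helper method is illustrative only.
public enum ExecutionMode {
    BATCH,                          // every shuffle fully materialized; batch programs only
    PIPELINED,                      // pipeline where possible, batch where deadlocks loom (proposed default)
    PIPELINED_WITH_BATCH_FALLBACK,  // pipeline first, fall back to batch on failure or slot shortage
    STREAMING;                      // always pipelined, streaming checkpointing active

    // Per the proposal, batch programs may use the first three modes;
    // STREAMING is the default and only mode for streaming programs.
    public boolean usableForBatchPrograms() {
        return this != STREAMING;
    }
}
```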
[jira] [Commented] (FLINK-1690) ProcessFailureBatchRecoveryITCase.testTaskManagerProcessFailure spuriously fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357144#comment-14357144 ] Stephan Ewen commented on FLINK-1690: - I retract my statement. After careful log defusing with [~uce], I think that things go as planned. The test actually has too limited a time budget. It takes almost 8 seconds on my machine, so I grant that Travis may not be able to complete it in 30 seconds. ProcessFailureBatchRecoveryITCase.testTaskManagerProcessFailure spuriously fails on Travis -- Key: FLINK-1690 URL: https://issues.apache.org/jira/browse/FLINK-1690 Project: Flink Issue Type: Bug Reporter: Till Rohrmann Priority: Minor I got the following error on Travis. {code} ProcessFailureBatchRecoveryITCase.testTaskManagerProcessFailure:244 The program did not finish in time {code} I think we have to increase the timeouts for this test case to make it reliably run on Travis. The log of the failed Travis build can be found [here|https://api.travis-ci.org/jobs/53952486/log.txt?deansi=true] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1648) Add a mode where the system automatically sets the parallelism to the available task slots
[ https://issues.apache.org/jira/browse/FLINK-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1648. - Resolution: Implemented Implemented in d8d642fd6d7d9b8526325d4efff1015f636c5ddb Add a mode where the system automatically sets the parallelism to the available task slots -- Key: FLINK-1648 URL: https://issues.apache.org/jira/browse/FLINK-1648 Project: Flink Issue Type: New Feature Components: JobManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 This is basically a port of this code from the 0.8 release: https://github.com/apache/flink/pull/410 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1705) InstanceConnectionInfo returns wrong hostname when no DNS entry exists
Stephan Ewen created FLINK-1705: --- Summary: InstanceConnectionInfo returns wrong hostname when no DNS entry exists Key: FLINK-1705 URL: https://issues.apache.org/jira/browse/FLINK-1705 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 If there is no DNS entry for an address (like 10.4.122.43), then the {{InstanceConnectionInfo}} returns the first octet ({{10}}) as the hostname. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
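The symptom is consistent with code that truncates a name at the first dot, which is only valid for real host names, not for raw IP addresses. A minimal reproduction of that failure mode, with a possible guard (hypothetical; this is not the actual InstanceConnectionInfo code):

```java
// Minimal reproduction of the reported failure mode: truncating a name at
// the first dot turns the IP "10.4.122.43" into the bogus hostname "10".
// Illustrative sketch, not the actual InstanceConnectionInfo code.
public class HostnameShortening {
    // Naive shortening: everything before the first '.'.
    public static String naiveShortName(String name) {
        int dot = name.indexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }

    // Safer variant: leave the name untouched if it looks like a raw
    // IPv4 address, where dots are not domain separators.
    public static String safeShortName(String name) {
        return name.matches("\\d{1,3}(\\.\\d{1,3}){3}")
                ? name
                : naiveShortName(name);
    }
}
```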
[jira] [Commented] (FLINK-1106) Deprecate old Record API
[ https://issues.apache.org/jira/browse/FLINK-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355063#comment-14355063 ] Stephan Ewen commented on FLINK-1106: - A bit of test coverage depends on the deprecated API. We would need to port at least some of the tests to the new API. We can probably drop some subsumed / obsolete tests. Deprecate old Record API Key: FLINK-1106 URL: https://issues.apache.org/jira/browse/FLINK-1106 Project: Flink Issue Type: Task Components: Java API Affects Versions: 0.7.0-incubating Reporter: Robert Metzger Assignee: Robert Metzger Priority: Critical Fix For: 0.7.0-incubating For the upcoming 0.7 release, we should mark all user-facing methods from the old Record Java API as deprecated, with a warning that we are going to remove it at some point. I would suggest to wait one or two releases from the 0.7 release (given our current release cycle). I'll start a mailing-list discussion at some point regarding this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1675) Rework Accumulators
Stephan Ewen created FLINK-1675: --- Summary: Rework Accumulators Key: FLINK-1675 URL: https://issues.apache.org/jira/browse/FLINK-1675 Project: Flink Issue Type: Bug Components: JobManager, TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 The accumulators need an overhaul to address various issues: 1. User-defined Accumulator classes crash the client, because it is not using the user code classloader to decode the received message. 2. They should be attached to the ExecutionGraph, not the dedicated AccumulatorManager. That makes them accessible also for archived execution graphs. 3. Accumulators should be sent periodically, as part of the heartbeat that sends metrics. This allows them to be updated in real time. 4. Accumulators should be stored fine-grained (per ExecutionVertex, or per Execution) and the final value should only be computed by merging all involved ones. This allows users to access the per-subtask accumulators, which is often interesting. 5. Accumulators should subsume the aggregators by allowing them to be versioned with a superstep. The versioned ones should be redistributed to the cluster after each superstep. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1659) Rename classes and packages that contain Pact
[ https://issues.apache.org/jira/browse/FLINK-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360185#comment-14360185 ] Stephan Ewen commented on FLINK-1659: - I am working on getting the configurable data exchanges (shuffles / broadcasts) in BATCH and PIPELINING into the optimizer right now. After that, I would offer to rename the classes. My feeling is that we should call it *Optimizer*, not *Compiler*, as that confused people - it sounds too much like a language compiler (Java / Scala). Rename classes and packages that contain Pact -- Key: FLINK-1659 URL: https://issues.apache.org/jira/browse/FLINK-1659 Project: Flink Issue Type: Task Reporter: Henry Saputra Priority: Minor We have several class names that contain or start with Pact. Pact is the previous term for Flink's data model and user-defined functions / operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-441) Renaming in pact-compiler
[ https://issues.apache.org/jira/browse/FLINK-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-441: --- Issue Type: Sub-task (was: Improvement) Parent: FLINK-1659 Renaming in pact-compiler - Key: FLINK-441 URL: https://issues.apache.org/jira/browse/FLINK-441 Project: Flink Issue Type: Sub-task Reporter: GitHub Import Priority: Minor Labels: github-import Fix For: pre-apache I would like to do a cleanup and renaming in the pact-compiler. Most of the work is in line with the recent global renaming, but I also want to clear and organize the various representation structures for the optimized plan. I open this issue to keep track and discuss the suggested renaming. We'll have to coordinate the merging of this issue because some renamings (e.g. PactCompiler - Compiler) seem to affect a lot of other packages. ### Global Scope (Wide Dependencies) The following names are part of the public API of stratosphere-compiler. Their renaming will probably affect a lot of other modules. In ```eu.stratosphere.compiler```: * ```PactCompiler``` ⇒ ```Compiler``` ### Module Scope (Narrow Dependencies) The following names are part of the internal API of stratosphere-compiler. Their renaming will probably affect only stratosphere-compiler and stratosphere-tests. In ```eu.stratosphere.compiler```: * ```DataStatistics``` ⇒ ```StatsStore``` This should be developed as an API for data stats over *expressions* instead of just over *data sources*. * ```NonCachingDataStatistics``` ⇒ *delete*. This class does not seem to be used. Imported from GitHub Url: https://github.com/stratosphere/stratosphere/issues/441 Created by: [aalexandrov|https://github.com/aalexandrov] Labels: Created at: Mon Jan 27 12:33:50 CET 2014 State: open -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1650) Suppress Akka's Netty Shutdown Errors through the log config
[ https://issues.apache.org/jira/browse/FLINK-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360230#comment-14360230 ] Stephan Ewen commented on FLINK-1650: - Is this YARN specific? Does the shading affect the netty classes and hence change the classnames and the logger config? Suppress Akka's Netty Shutdown Errors through the log config Key: FLINK-1650 URL: https://issues.apache.org/jira/browse/FLINK-1650 Project: Flink Issue Type: Bug Components: other Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 I suggest to set the logging for `org.jboss.netty.channel.DefaultChannelPipeline` to error, in order to get rid of the misleading stack trace caused by an akka/netty hiccup on shutdown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1668) Add a config option to specify delays between restarts
[ https://issues.apache.org/jira/browse/FLINK-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1668. - Resolution: Implemented Implemented in abbb0a93ca67da17197dc5372e6d95edd8149d44 Add a config option to specify delays between restarts -- Key: FLINK-1668 URL: https://issues.apache.org/jira/browse/FLINK-1668 Project: Flink Issue Type: Improvement Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The system currently introduces a short delay between a failed task execution and the restarted execution. The reason is that this delay seemed to help in letting the problems surface that led to the failed task. As an example, if a TaskManager fails, tasks fail due to data transfer errors. The TaskManager is not immediately recognized as failed, though (it takes a bit until heartbeats time out). Immediately re-deploying tasks has a very high chance of assigning work to the TaskManager that is actually not responding, causing the execution retry to fail again. The delay gives the system time to figure out that the TaskManager was lost and does not take it into account upon the retry. Currently, the system uses the heartbeat timeout as the default delay value. This may make sense as a default value for critical task failures, but is actually quite high for other types of failures. In any case, I would like to add an option for users to specify the delay (even set it to 0, if desired). The delay is not the best solution, in my opinion; we should eventually move to something better. Ideas are to put TaskManagers responsible for failed tasks into a probationary mode until they have reported back that everything is good (still alive, disk space available, etc.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1776) APIs provide invalid Semantic Properties for Operators with SelectorFunction keys
[ https://issues.apache.org/jira/browse/FLINK-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377789#comment-14377789 ] Stephan Ewen commented on FLINK-1776: - That is critical, I agree. Can we make the assumption that all fields are shifted by one (or by as many values as the key selector returns), since the data gets wrapped in a Tuple2(key, value)? APIs provide invalid Semantic Properties for Operators with SelectorFunction keys - Key: FLINK-1776 URL: https://issues.apache.org/jira/browse/FLINK-1776 Project: Flink Issue Type: Bug Components: Java API, Scala API Affects Versions: 0.9 Reporter: Fabian Hueske Assignee: Fabian Hueske Priority: Critical Fix For: 0.9 Semantic properties are defined by users and evaluated by the optimizer. Semantic properties such as forwarded or read fields are bound to the input type of a function. In the case of operators with selector function keys, a user function is wrapped by a wrapping function that has a different input type than the original user function. However, the user-defined semantic properties are forwarded verbatim to the optimizer. Since the properties refer to a specific type which is changed by the wrapping function, and the semantic properties are not adapted, the optimizer uses wrong properties and might produce invalid plans. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
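The shift suggested in the comment amounts to offsetting the user-declared field indices by the key arity: if records are wrapped as Tuple2(key, value), a forwarded field i of the original value sits at position i + keyArity in the wrapped type. A hypothetical sketch of that adaptation:

```java
// Sketch of the index shift discussed above: forwarded-field indices of
// the original function, offset by the key arity of the Tuple2(key, value)
// wrapping. Illustrative only, not Flink's semantic-properties code.
public class ForwardedFieldShift {
    public static int[] shiftForWrappedTuple(int[] forwardedFields, int keyArity) {
        int[] shifted = new int[forwardedFields.length];
        for (int i = 0; i < forwardedFields.length; i++) {
            // field i of the value becomes field i + keyArity of the wrapper
            shifted[i] = forwardedFields[i] + keyArity;
        }
        return shifted;
    }
}
```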
[jira] [Resolved] (FLINK-1761) IndexOutOfBoundsException when receiving empty buffer at remote channel
[ https://issues.apache.org/jira/browse/FLINK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1761. - Resolution: Fixed Fix Version/s: 0.9 Fixed in 380ef878c850f83b5e12176e465d59c737066e20 and 925481fb1c88f3c45b289cdf5ef203190492031a IndexOutOfBoundsException when receiving empty buffer at remote channel --- Key: FLINK-1761 URL: https://issues.apache.org/jira/browse/FLINK-1761 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: master Reporter: Ufuk Celebi Assignee: Ufuk Celebi Fix For: 0.9 Receiving buffers from remote input channels with size 0 results in an {{IndexOutOfBoundsException}}. {code} Caused by: java.lang.IndexOutOfBoundsException: index: 30 (expected: range(0, 30)) at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1123) at io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:156) at io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:151) at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:179) at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:717) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.decodeBufferOrEvent(PartitionRequestClientHandler.java:205) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.decodeMsg(PartitionRequestClientHandler.java:164) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelRead(PartitionRequestClientHandler.java:118) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1769) Maven deploy is broken (build artifacts are cleaned in docs-and-sources profile)
[ https://issues.apache.org/jira/browse/FLINK-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377791#comment-14377791 ] Stephan Ewen commented on FLINK-1769: - Anyone looking into this already? This effectively means that we do not have snapshots right now, correct? Maven deploy is broken (build artifacts are cleaned in docs-and-sources profile) Key: FLINK-1769 URL: https://issues.apache.org/jira/browse/FLINK-1769 Project: Flink Issue Type: Bug Components: Build System Reporter: Robert Metzger Priority: Critical The issue has been introduced by FLINK-1720. This change broke the deployment to maven snapshots / central. {code} [ERROR] Failed to execute goal org.apache.maven.plugins:maven-install-plugin:2.5.1:install (default-install) on project flink-shaded-include-yarn: Failed to install artifact org.apache.flink:flink-shaded-include-yarn:pom:0.9-SNAPSHOT: /home/robert/incubator-flink/flink-shaded-hadoop/flink-shaded-include-yarn/target/dependency-reduced-pom.xml (No such file or directory) - [Help 1] {code} The issue is that maven is now executing {{clean}} after {{shade}} and then {{install}} can not store the result of {{shade}} anymore (because it has been deleted) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1778) Improve normalized keys in composite key case
Stephan Ewen created FLINK-1778: --- Summary: Improve normalized keys in composite key case Key: FLINK-1778 URL: https://issues.apache.org/jira/browse/FLINK-1778 Project: Flink Issue Type: Improvement Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 Currently, if we have a key (String, long), the String will take up the entire normalized key space, without being fully discerning anyways. Limiting the key prefix in size and giving space to the second key field should in most cases improve the comparison efficiency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
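One way to realize this is to cap the String at a fixed prefix length so the long still contributes to byte-wise comparisons. A sketch under that assumption (illustrative; Flink's normalized-key support differs in detail):

```java
import java.nio.charset.StandardCharsets;

// Sketch of a composite (String, long) normalized key: the String is
// truncated/zero-padded to a fixed prefix so the long still occupies the
// tail of the key and contributes to byte-wise comparisons. Illustrative
// only; Flink's NormalizedKeySorter differs in detail.
public class CompositeNormalizedKey {
    public static byte[] build(String s, long l, int stringPrefixLen) {
        byte[] key = new byte[stringPrefixLen + 8];
        byte[] sb = s.getBytes(StandardCharsets.US_ASCII);
        // String part: truncated or zero-padded to the fixed prefix length.
        System.arraycopy(sb, 0, key, 0, Math.min(sb.length, stringPrefixLen));
        // Long part: flip the sign bit so unsigned byte order matches
        // signed numeric order, then write big-endian.
        long flipped = l ^ Long.MIN_VALUE;
        for (int i = 0; i < 8; i++) {
            key[stringPrefixLen + i] = (byte) (flipped >>> (56 - 8 * i));
        }
        return key;
    }
}
```

Two keys with the same String prefix now still differ in the long's bytes, which a full-length String prefix would have crowded out.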
[jira] [Created] (FLINK-1782) Change Quickstart Java version to 1.7
Stephan Ewen created FLINK-1782: --- Summary: Change Quickstart Java version to 1.7 Key: FLINK-1782 URL: https://issues.apache.org/jira/browse/FLINK-1782 Project: Flink Issue Type: Improvement Components: quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 The quickstarts refer to the outdated Java 1.6 source and binary version. We should upgrade this to 1.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1781) Quickstarts broken due to Scala Version Variables
Stephan Ewen created FLINK-1781: --- Summary: Quickstarts broken due to Scala Version Variables Key: FLINK-1781 URL: https://issues.apache.org/jira/browse/FLINK-1781 Project: Flink Issue Type: Bug Components: quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen Priority: Blocker Fix For: 0.9 The quickstart archetype resources refer to the Scala version variables. When creating a maven project standalone, these variables are not defined, and the pom is invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1783) Quickstart shading should not create shaded jar and dependency-reduced pom
Stephan Ewen created FLINK-1783: --- Summary: Quickstart shading should not create shaded jar and dependency-reduced pom Key: FLINK-1783 URL: https://issues.apache.org/jira/browse/FLINK-1783 Project: Flink Issue Type: Improvement Components: quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1774) Remove the redundant code in try{} block
[ https://issues.apache.org/jira/browse/FLINK-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1774. - Resolution: Fixed Fixed in c89c657ae16bbe89da54669a234713a3811813ee Thank you for the patch! Remove the redundant code in try{} block Key: FLINK-1774 URL: https://issues.apache.org/jira/browse/FLINK-1774 Project: Flink Issue Type: Improvement Affects Versions: master Reporter: Sibao Hong Assignee: Sibao Hong Priority: Minor Fix For: master Remove the redundant code fos.close(); fos = null; from the try block, because fos.close() always executes in the finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
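The cleanup pattern after the patch, as a hedged sketch with invented names: the stream is closed exactly once, in the finally block, with no duplicate close inside the try body:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch of the pattern the patch establishes: no fos.close() inside the
// try block, because the finally block always runs and closes exactly
// once, on both the success and the exception path. Names are illustrative.
public class CloseOnceInFinally {
    public static void write(OutputStream fos, byte[] data) throws IOException {
        try {
            fos.write(data);
            // redundant "fos.close(); fos = null;" removed from here
        } finally {
            fos.close();
        }
    }
}
```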
[jira] [Resolved] (FLINK-1779) Rename the function name from getCurrentyActiveConnections to getCurrentActiveConnections in org.apache.flink.runtime.blob
[ https://issues.apache.org/jira/browse/FLINK-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1779. - Resolution: Fixed Fixed in fb3f3ee845a3aae295c9aae00f3d406d9f1d5813 Thank you for the patch! Rename the function name from getCurrentyActiveConnections to getCurrentActiveConnections in org.apache.flink.runtime.blob --- Key: FLINK-1779 URL: https://issues.apache.org/jira/browse/FLINK-1779 Project: Flink Issue Type: Improvement Reporter: Sibao Hong Assignee: Sibao Hong Priority: Minor Fix For: master I think the function name getCurrentyActiveConnections in 'org.apache.flink.runtime.blob' is a misspelling; getCurrentActiveConnections would be better. I also added some comments about the function and the tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1801) NetworkEnvironment should start without JobManager association
Stephan Ewen created FLINK-1801: --- Summary: NetworkEnvironment should start without JobManager association Key: FLINK-1801 URL: https://issues.apache.org/jira/browse/FLINK-1801 Project: Flink Issue Type: Sub-task Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The NetworkEnvironment should be able to start without a dedicated JobManager association and gain one / lose one as the TaskManager connects to different JobManagers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1348) Move Stream Connector Jars from lib to Client JARs
[ https://issues.apache.org/jira/browse/FLINK-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1348. - Resolution: Fixed Fix Version/s: 0.9 Fixed with the updated dependency management in the rewrite to use shading. Move Stream Connector Jars from lib to Client JARs Key: FLINK-1348 URL: https://issues.apache.org/jira/browse/FLINK-1348 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Márton Balassi Fix For: 0.9 Right now, the connectors and all dependencies are put into the lib folder and are part of the system at startup time. This is a large bunch of dependencies, and they may actually conflict with the dependencies of custom connectors (for example, with a different version of RabbitMQ). We could fix that if we remove the dependencies from the lib folder and set up archetypes that build fat jars with the dependencies. That way, each job (with its custom class loader) will get the dependencies it needs and will not see all the other (potentially conflicting) ones in the namespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1754) Deadlock in job execution
[ https://issues.apache.org/jira/browse/FLINK-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1754. - Resolution: Not a Problem Is actually a known bug in 0.8 and fixed in 0.9 Deadlock in job execution - Key: FLINK-1754 URL: https://issues.apache.org/jira/browse/FLINK-1754 Project: Flink Issue Type: Bug Affects Versions: 0.8.1 Reporter: Sebastian Kruse I have encountered a reproducible deadlock in the execution of one of my jobs. The part of the plan where this happens is the following:
{code:java}
/** Performs the reduction via creating transitive INDs and removing them from the original IND set. */
private DataSet<Tuple2<Integer, int[]>> calculateTransitiveReduction1(DataSet<Tuple2<Integer, int[]>> inclusionDependencies) {
    // Concatenate INDs (only one hop).
    DataSet<Tuple2<Integer, int[]>> transitiveInds = inclusionDependencies
        .flatMap(new SplitInds())
        .joinWithTiny(inclusionDependencies)
        .where(1).equalTo(0)
        .with(new ConcatenateInds());
    // Remove the concatenated INDs to come up with a transitive reduction of the INDs.
    return inclusionDependencies
        .coGroup(transitiveInds)
        .where(0).equalTo(0)
        .with(new RemoveTransitiveInds());
}
{code}
Seemingly, the flatMap operator waits indefinitely for a free buffer to write to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1435) TaskManager does not log missing memory error on start up
[ https://issues.apache.org/jira/browse/FLINK-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1435. - Resolution: Not a Problem TaskManager does not log missing memory error on start up - Key: FLINK-1435 URL: https://issues.apache.org/jira/browse/FLINK-1435 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.7.0-incubating Reporter: Malte Schwarzer Priority: Minor Labels: memorymanager, starter When using bin/start-cluster.sh to start TaskManagers and a worker node fails to start because of missing memory, you do not receive any error message in the log files. The worker node has only 15000M memory available, but it is configured with Maximum heap size: 4 MiBytes. The task manager does not join the cluster and the process hangs. The last lines of the log look like this: ... ... - - Starting with 12 incoming and 12 outgoing connection threads. ... - Setting low water mark to 16384 and high water mark to 32768 bytes. ... - Instantiated PooledByteBufAllocator with direct arenas: 24, heap arenas: 0, page size (bytes): 65536, chunk size (bytes): 16777216. ... - Using 0.7 of the free heap space for managed memory. ... - Initializing memory manager with 24447 megabytes of memory. Page size is 32768 bytes. (END) An error message about not enough memory is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1801) NetworkEnvironment should start without JobManager association
[ https://issues.apache.org/jira/browse/FLINK-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1801. - Resolution: Implemented Implemented in ee273dbe01e95d2b260fa690e21e2c244a2a5711 NetworkEnvironment should start without JobManager association -- Key: FLINK-1801 URL: https://issues.apache.org/jira/browse/FLINK-1801 Project: Flink Issue Type: Sub-task Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The NetworkEnvironment should be able to start without a dedicated JobManager association and gain one / lose one as the TaskManager connects to different JobManagers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1465) GlobalBufferPool reports negative memory allocation
[ https://issues.apache.org/jira/browse/FLINK-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1465. - Resolution: Fixed Fix Version/s: 0.9 Assignee: Stephan Ewen (was: Ufuk Celebi) Fixed via ee273dbe01e95d2b260fa690e21e2c244a2a5711 GlobalBufferPool reports negative memory allocation --- Key: FLINK-1465 URL: https://issues.apache.org/jira/browse/FLINK-1465 Project: Flink Issue Type: Bug Components: Local Runtime, TaskManager Affects Versions: 0.9 Reporter: Robert Metzger Assignee: Stephan Ewen Fix For: 0.9 I've got this error message when starting Flink. It does not really help me. I suspect that my configuration files (which worked with 0.8) aren't working with 0.9 anymore. Still, the exception is reporting weird stuff: {code} 11:41:02,516 INFO org.apache.flink.yarn.YarnUtils$$anonfun$startActorSystemAndTaskManager$1$$anon$1 - TaskManager successfully registered at JobManager akka.tcp://fl...@cloud-18.dima.tu-berlin.de:39674/user/jobmanager. 11:41:25,230 ERROR org.apache.flink.yarn.YarnUtils$$anonfun$startActorSystemAndTaskManager$1$$anon$1 - Failed to instantiate network environment. java.io.IOException: Failed to instantiate network buffer pool: Could not allocate enough memory segments for GlobalBufferPool (required (Mb): 0, allocated (Mb): -965, missing (Mb): 965). 
at org.apache.flink.runtime.io.network.NetworkEnvironment.init(NetworkEnvironment.java:81) at org.apache.flink.runtime.taskmanager.TaskManager.setupNetworkEnvironment(TaskManager.scala:508) at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$finishRegistration(TaskManager.scala:479) at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:226) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.yarn.YarnTaskManager$$anonfun$receiveYarnMessages$1.applyOrElse(YarnTaskManager.scala:32) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:41) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:27) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:27) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:78) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) at akka.dispatch.Mailbox.run(Mailbox.scala:221) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: 
java.lang.OutOfMemoryError: Could not allocate enough memory segments for GlobalBufferPool (required (Mb): 0, allocated (Mb): -965, missing (Mb): 965). at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.init(NetworkBufferPool.java:76) at org.apache.flink.runtime.io.network.NetworkEnvironment.init(NetworkEnvironment.java:78) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1435) TaskManager does not log missing memory error on start up
[ https://issues.apache.org/jira/browse/FLINK-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386949#comment-14386949 ] Stephan Ewen commented on FLINK-1435: - I think we misunderstood this issue initially. It seems the TaskManager was started with a heap size that exceeds the physical memory of the machine. That is possible if the OS has enough swap space. The process hangs because it is incredibly slow due to non-stop swapping. Inside the JVM, you do not see that memory is missing, because it is not missing; it just comes from swap space. This is not a Flink bug; such a misconfiguration is entirely possible. TaskManager does not log missing memory error on start up - Key: FLINK-1435 URL: https://issues.apache.org/jira/browse/FLINK-1435 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.7.0-incubating Reporter: Malte Schwarzer Priority: Minor Labels: memorymanager, starter When using bin/start-cluster.sh to start TaskManagers and a worker node fails to start because of missing memory, you do not receive any error message in the log files. The worker node has only 15000M memory available, but it is configured with Maximum heap size: 4 MiBytes. The task manager does not join the cluster and the process hangs. The last lines of the log look like this: ... ... - - Starting with 12 incoming and 12 outgoing connection threads. ... - Setting low water mark to 16384 and high water mark to 32768 bytes. ... - Instantiated PooledByteBufAllocator with direct arenas: 24, heap arenas: 0, page size (bytes): 65536, chunk size (bytes): 16777216. ... - Using 0.7 of the free heap space for managed memory. ... - Initializing memory manager with 24447 megabytes of memory. Page size is 32768 bytes. (END) An error message about not enough memory is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1465) GlobalBufferPool reports negative memory allocation
[ https://issues.apache.org/jira/browse/FLINK-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386966#comment-14386966 ] Stephan Ewen commented on FLINK-1465: - This is actually an integer overflow issue. I have a fix coming up... GlobalBufferPool reports negative memory allocation --- Key: FLINK-1465 URL: https://issues.apache.org/jira/browse/FLINK-1465 Project: Flink Issue Type: Bug Components: Local Runtime, TaskManager Affects Versions: 0.9 Reporter: Robert Metzger Assignee: Ufuk Celebi I've got this error message when starting Flink. It does not really help me. I suspect that my configuration files (which worked with 0.8) aren't working with 0.9 anymore. Still, the exception is reporting weird stuff: {code} 11:41:02,516 INFO org.apache.flink.yarn.YarnUtils$$anonfun$startActorSystemAndTaskManager$1$$anon$1 - TaskManager successfully registered at JobManager akka.tcp://fl...@cloud-18.dima.tu-berlin.de:39674/user/jobmanager. 11:41:25,230 ERROR org.apache.flink.yarn.YarnUtils$$anonfun$startActorSystemAndTaskManager$1$$anon$1 - Failed to instantiate network environment. java.io.IOException: Failed to instantiate network buffer pool: Could not allocate enough memory segments for GlobalBufferPool (required (Mb): 0, allocated (Mb): -965, missing (Mb): 965). 
at org.apache.flink.runtime.io.network.NetworkEnvironment.init(NetworkEnvironment.java:81) at org.apache.flink.runtime.taskmanager.TaskManager.setupNetworkEnvironment(TaskManager.scala:508) at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$finishRegistration(TaskManager.scala:479) at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:226) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.yarn.YarnTaskManager$$anonfun$receiveYarnMessages$1.applyOrElse(YarnTaskManager.scala:32) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:41) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:27) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:27) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:78) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) at akka.dispatch.Mailbox.run(Mailbox.scala:221) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: 
java.lang.OutOfMemoryError: Could not allocate enough memory segments for GlobalBufferPool (required (Mb): 0, allocated (Mb): -965, missing (Mb): 965). at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.init(NetworkBufferPool.java:76) at org.apache.flink.runtime.io.network.NetworkEnvironment.init(NetworkEnvironment.java:78) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
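The integer overflow Stephan refers to can be illustrated in isolation: computing a total buffer size as a product of two int values wraps around before any widening to long, which is how a memory total can come out negative as in the exception above. The method names and numbers below are illustrative, not Flink's actual NetworkBufferPool arithmetic.

```java
public class OverflowDemo {
    /** Buggy variant: the int multiplication overflows, then the wrapped value is widened. */
    static long brokenBytes(int numBuffers, int segmentSize) {
        return numBuffers * segmentSize; // int * int overflows before the long conversion
    }

    /** Fixed variant: widen one operand first so the multiplication happens in long. */
    static long fixedBytes(int numBuffers, int segmentSize) {
        return (long) numBuffers * segmentSize; // no overflow
    }

    public static void main(String[] args) {
        int numBuffers = 70_000;     // illustrative count exceeding Integer.MAX_VALUE bytes total
        int segmentSize = 32_768;    // page size matching the logs above
        System.out.println(brokenBytes(numBuffers, segmentSize)); // negative due to wrap-around
        System.out.println(fixedBytes(numBuffers, segmentSize));  // 2293760000
    }
}
```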
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382269#comment-14382269 ] Stephan Ewen commented on FLINK-1319: - I think there is no reason they are not available in the Scala API. They absolutely should be ;-) I vote to move them to the core project. Add static code analysis for UDFs - Key: FLINK-1319 URL: https://issues.apache.org/jira/browse/FLINK-1319 Project: Flink Issue Type: New Feature Components: Java API, Scala API Reporter: Stephan Ewen Assignee: Timo Walther Priority: Minor Flink's Optimizer takes information that tells it for UDFs which fields of the input elements are accessed, modified, or forwarded/copied. This information frequently helps to reuse partitionings, sorts, etc. It may speed up programs significantly, as it can frequently eliminate sorts and shuffles, which are costly. Right now, users can add lightweight annotations to UDFs to provide this information (such as adding {{@ConstantFields(0-3, 1, 2-1)}}). We worked with static code analysis of UDFs before, to determine this information automatically. This is an incredible feature, as it magically makes programs faster. For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the Soot toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far. I propose to add this functionality to Flink, in the form of a drop-in addition, to work around the LGPL incompatibility with ASL 2.0. Users could simply download a special flink-code-analysis.jar and drop it into the lib folder to enable this functionality. We may even add a script to tools that downloads that library automatically into the lib folder. 
This should be legally fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility with ASL 2.0 is mainly in the patentability, if I remember correctly). Prior work on this has been done by [~aljoscha] and [~skunert], which could provide a code base to start with. *Appendix* Homepage of the Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/ Papers on static analysis for optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf (Section 6) Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf (Sections 4.3 and 5.3) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1783) Quickstart shading should not create shaded jar and dependency-reduced pom
[ https://issues.apache.org/jira/browse/FLINK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382294#comment-14382294 ] Stephan Ewen commented on FLINK-1783: - It does not affect users; it only causes Maven to create additional artifacts in the {{target}} directory: a reduced pom, an original jar, and a reduced jar. I saw people being confused about which jar to use... Quickstart shading should not create shaded jar and dependency-reduced pom --- Key: FLINK-1783 URL: https://issues.apache.org/jira/browse/FLINK-1783 Project: Flink Issue Type: Improvement Components: Quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1783) Quickstart shading should not create shaded jar and dependency-reduced pom
[ https://issues.apache.org/jira/browse/FLINK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382295#comment-14382295 ] Stephan Ewen commented on FLINK-1783: - Even though they were identical in the end... Quickstart shading should not create shaded jar and dependency-reduced pom --- Key: FLINK-1783 URL: https://issues.apache.org/jira/browse/FLINK-1783 Project: Flink Issue Type: Improvement Components: Quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1789) Allow adding of URLs to the usercode class loader
[ https://issues.apache.org/jira/browse/FLINK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382291#comment-14382291 ] Stephan Ewen commented on FLINK-1789: - I think that is a great idea. We can even use that for lazy class loading for interactive jobs. Brilliant idea, actually! Allow adding of URLs to the usercode class loader - Key: FLINK-1789 URL: https://issues.apache.org/jira/browse/FLINK-1789 Project: Flink Issue Type: Improvement Components: Distributed Runtime Reporter: Timo Walther Assignee: Timo Walther Priority: Minor Currently, there is no option to add custom classpath URLs to the FlinkUserCodeClassLoader. JARs always need to be shipped to the cluster even if they are already present on all nodes. It would be great if RemoteEnvironment also accepted valid classpath URLs and forwarded them to BlobLibraryCacheManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
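The idea can be sketched with the JDK's own URLClassLoader, the standard mechanism for resolving classes from classpath URLs instead of shipping jars. The jar path below is hypothetical, and this is a plain-JDK illustration, not Flink's FlinkUserCodeClassLoader.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;

public class ClasspathUrlDemo {
    /** Loads a class through a loader that also searches the given local classpath URL. */
    static String loadVia(URL extraClasspathUrl, String className) throws Exception {
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { extraClasspathUrl },
                ClasspathUrlDemo.class.getClassLoader())) {
            return loader.loadClass(className).getName();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical path to a jar already present on the node; the file does not
        // need to exist for URL construction or for parent-delegated loads.
        URL localJar = Paths.get("lib/connector.jar").toUri().toURL();
        System.out.println(loadVia(localJar, "java.lang.String")); // resolved via parent delegation
    }
}
```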
[jira] [Commented] (FLINK-1782) Change Quickstart Java version to 1.7
[ https://issues.apache.org/jira/browse/FLINK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382276#comment-14382276 ] Stephan Ewen commented on FLINK-1782: - I'll take a quick stab at fixing the quickstart related issues... Change Quickstart Java version to 1.7 - Key: FLINK-1782 URL: https://issues.apache.org/jira/browse/FLINK-1782 Project: Flink Issue Type: Improvement Components: Quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 The quickstarts refer to the outdated Java 1.6 source and bin version. We should upgrade this to 1.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1796) Local mode TaskManager should have a process reaper
[ https://issues.apache.org/jira/browse/FLINK-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1796. - Resolution: Fixed Fixed via 8c32142528590a030693529c7c8d93f194968c0a Local mode TaskManager should have a process reaper --- Key: FLINK-1796 URL: https://issues.apache.org/jira/browse/FLINK-1796 Project: Flink Issue Type: Bug Components: JobManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 We use process reaper actors (a typical Akka design pattern) to shut down the JVM processes when the core actors die, as this is currently unrecoverable. The local mode uses the process reaper only for the JobManager actor, not for the TaskManager actor. This may lead to stale JVMs on critical TaskManager errors and makes debugging harder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1795) Solution set allows duplicates upon construction.
[ https://issues.apache.org/jira/browse/FLINK-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1795. - Resolution: Fixed Fixed via 923a2ae259bd72a2d48639ae0e64db0a04a4aa91 Solution set allows duplicates upon construction. - Key: FLINK-1795 URL: https://issues.apache.org/jira/browse/FLINK-1795 Project: Flink Issue Type: Bug Components: Iterations Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The solution set identifies entries uniquely by key. During construction, it does not eliminate duplicates. The duplicates do not get updated during the iterations (since only the first match is considered), but are contained in the final result. This contradicts the definition of the solution set. It should not contain duplicates to begin with. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
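The fix described above amounts to eliminating duplicates by key at construction time, keeping only the first record per key. A minimal plain-Java sketch of that semantics, not Flink's actual solution-set hash table:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SolutionSetDedup {
    /** Keeps only the first record per key (field 0), as a solution set should. */
    static Map<Integer, int[]> dedupByKey(List<int[]> records) {
        Map<Integer, int[]> solutionSet = new LinkedHashMap<>();
        for (int[] rec : records) {
            solutionSet.putIfAbsent(rec[0], rec); // later duplicates of a key are dropped
        }
        return solutionSet;
    }

    public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
            new int[]{1, 10}, new int[]{2, 20}, new int[]{1, 99}); // key 1 appears twice
        System.out.println(dedupByKey(records).size()); // 2 distinct keys remain
    }
}
```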
[jira] [Resolved] (FLINK-1669) Streaming tests for recovery with distributed process failure
[ https://issues.apache.org/jira/browse/FLINK-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1669. - Resolution: Fixed Fixed in c284745ee4612054339842789b0d87eb4f9a821d Streaming tests for recovery with distributed process failure - Key: FLINK-1669 URL: https://issues.apache.org/jira/browse/FLINK-1669 Project: Flink Issue Type: Sub-task Components: Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Márton Balassi Fix For: 0.9 Multiple JVM test for streaming recovery from failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1608) TaskManagers may pick wrong network interface when starting before JobManager
[ https://issues.apache.org/jira/browse/FLINK-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1608. - Resolution: Fixed Assignee: Stephan Ewen Resolved in 861ebe753ff982b4cbf7c3c5b8c43fa306ac89f0 TaskManagers may pick wrong network interface when starting before JobManager - Key: FLINK-1608 URL: https://issues.apache.org/jira/browse/FLINK-1608 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The TaskManager uses a NetUtils routine to pick a network interface that lets it talk to the JobManager. However, if the JobManager is not online yet, the TaskManager falls back to an arbitrary non-localhost device. In cases where the TaskManagers start faster than the JobManager, they may pick the wrong interface (and associated address and hostname). The later logic (that tries to connect to the JobManager actor) does several retries. I think we need similar logic when looking for a suitable network interface to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1628) Strange behavior of where function during a join
[ https://issues.apache.org/jira/browse/FLINK-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14343584#comment-14343584 ] Stephan Ewen commented on FLINK-1628: - Is this specific to the combination of tuple positions and a key selector in the where clause, or does it also occur when using tuple positions for both inputs (where and equalTo clauses)? Strange behavior of where function during a join -- Key: FLINK-1628 URL: https://issues.apache.org/jira/browse/FLINK-1628 Project: Flink Issue Type: Bug Affects Versions: 0.9 Reporter: Daniel Bali Labels: batch Hello! If I use the `where` function with a field list during a join, it exhibits strange behavior. Here is the sample code that triggers the error: https://gist.github.com/balidani/d9789b713e559d867d5c This example joins a DataSet with itself, then counts the number of rows. If I use `.where(0, 1)` the result is (22), which is not correct. If I use `EdgeKeySelector`, I get the correct result (101). When I pass a field list to the `equalTo` function (but not `where`), everything works again. If I don't include the `groupBy` and `reduceGroup` parts, everything works. Also, when working with large DataSets, passing a field list to `where` makes it incredibly slow, even though I don't see any exceptions in the log (in DEBUG mode). Does anybody know what might cause this problem? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
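For clarity, joining with a field list like {{.where(0, 1).equalTo(0, 1)}} means two records match only when both listed fields agree. The sketch below illustrates that composite-key semantics with plain JDK collections; it is not Flink API code and does not reproduce the reported bug:

```java
import java.util.Arrays;
import java.util.List;

public class CompositeKeyJoin {
    /** Counts pairs whose fields 0 and 1 match on both sides, like .where(0, 1).equalTo(0, 1). */
    static int joinCount(List<int[]> left, List<int[]> right) {
        int matches = 0;
        for (int[] l : left) {
            for (int[] r : right) {
                if (l[0] == r[0] && l[1] == r[1]) { // composite key: fields 0 AND 1 must agree
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<int[]> left  = Arrays.asList(new int[]{1, 2, 9}, new int[]{1, 3, 8});
        List<int[]> right = Arrays.asList(new int[]{1, 2, 7});
        System.out.println(joinCount(left, right)); // only the records with key (1, 2) match
    }
}
```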
[jira] [Created] (FLINK-1631) Port collisions in ProcessReaping tests
Stephan Ewen created FLINK-1631: --- Summary: Port collisions in ProcessReaping tests Key: FLINK-1631 URL: https://issues.apache.org/jira/browse/FLINK-1631 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The process reaping tests for the JobManager spawn a process that starts a webserver on the default port. It may happen that this port is not available, due to another concurrently running task. To prevent this, I suggest adding an option to not start the webserver, by setting the webserver port to {{-1}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1616) Action list -r gives IOException when there are running jobs
[ https://issues.apache.org/jira/browse/FLINK-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345457#comment-14345457 ] Stephan Ewen commented on FLINK-1616: - There are a few issues with the CLI frontend. I am doing a major cleanup today. Action list -r gives IOException when there are running jobs -- Key: FLINK-1616 URL: https://issues.apache.org/jira/browse/FLINK-1616 Project: Flink Issue Type: Bug Affects Versions: 0.9 Reporter: Vasia Kalavri Priority: Minor Here's the full exception: java.io.IOException: Could not retrieve running jobs from job manager. at org.apache.flink.client.CliFrontend.list(CliFrontend.java:528) at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1089) at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.20.0.25:13245/user/jobmanager#-1517424081]] after [10 ms] at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) at akka.actor.LightArrayRevolverScheduler$TaskHolder.run(Scheduler.scala:476) at akka.actor.LightArrayRevolverScheduler$$anonfun$close$1.apply(Scheduler.scala:282) at akka.actor.LightArrayRevolverScheduler$$anonfun$close$1.apply(Scheduler.scala:281) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at akka.actor.LightArrayRevolverScheduler.close(Scheduler.scala:280) at akka.actor.ActorSystemImpl.stopScheduler(ActorSystem.scala:688) at akka.actor.ActorSystemImpl$$anonfun$liftedTree2$1$1.apply$mcV$sp(ActorSystem.scala:617) at akka.actor.ActorSystemImpl$$anonfun$liftedTree2$1$1.apply(ActorSystem.scala:617) at 
akka.actor.ActorSystemImpl$$anonfun$liftedTree2$1$1.apply(ActorSystem.scala:617) at akka.actor.ActorSystemImpl$$anon$3.run(ActorSystem.scala:641) at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.runNext$1(ActorSystem.scala:808) at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply$mcV$sp(ActorSystem.scala:811) at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply(ActorSystem.scala:804) at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply(ActorSystem.scala:804) at akka.util.ReentrantGuard.withGuard(LockUtil.scala:15) at akka.actor.ActorSystemImpl$TerminationCallbacks.run(ActorSystem.scala:804) at akka.actor.ActorSystemImpl$$anonfun$terminationCallbacks$1.apply(ActorSystem.scala:638) at akka.actor.ActorSystemImpl$$anonfun$terminationCallbacks$1.apply(ActorSystem.scala:638) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) If there are no running jobs, no exception is thrown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1631) Port collisions in ProcessReaping tests
[ https://issues.apache.org/jira/browse/FLINK-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1631. - Resolution: Fixed Fixed via 94a66d570e4bb40824813911a4f1bb47a8bf8b90 Port collisions in ProcessReaping tests --- Key: FLINK-1631 URL: https://issues.apache.org/jira/browse/FLINK-1631 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The process reaping tests for the JobManager spawn a process that starts a webserver on the default port. It may happen that this port is not available, due to another concurrently running task. To prevent this, I suggest adding an option to not start the webserver, by setting the webserver port to {{-1}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1648) Add a mode where the system automatically sets the parallelism to the available task slots
Stephan Ewen created FLINK-1648: --- Summary: Add a mode where the system automatically sets the parallelism to the available task slots Key: FLINK-1648 URL: https://issues.apache.org/jira/browse/FLINK-1648 Project: Flink Issue Type: New Feature Components: JobManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 This is basically a port of this code form the 0.8 release: https://github.com/apache/flink/pull/410 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1650) Suppress Akka's Netty Shutdown Errors through the log config
[ https://issues.apache.org/jira/browse/FLINK-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14347450#comment-14347450 ] Stephan Ewen commented on FLINK-1650: - The logged error is below. Setting the log level to ERROR should allow us to see critical messages and suppress the confusing warnings that seem to be an akka/netty bug. {code} Feb 18, 2015 5:25:18 PM org.jboss.netty.channel.DefaultChannelPipeline WARNING: An exception was thrown by an exception handler. java.util.concurrent.RejectedExecutionException: Worker has already been shutdown at org.jboss.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:120) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:72) at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:56) at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36) at org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34) at org.jboss.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:496) at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:46) at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:54) at org.jboss.netty.channel.Channels.disconnect(Channels.java:781) at org.jboss.netty.channel.AbstractChannel.disconnect(AbstractChannel.java:211) at akka.remote.transport.netty.NettyTransport$$anonfun$gracefulClose$1.apply(NettyTransport.scala:223) at akka.remote.transport.netty.NettyTransport$$anonfun$gracefulClose$1.apply(NettyTransport.scala:222) at scala.util.Success.foreach(Try.scala:205) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:204) at scala.concurrent.Future$$anonfun$foreach$1.apply(Future.scala:204) at 
scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} Suppress Akka's Netty Shutdown Errors through the log config Key: FLINK-1650 URL: https://issues.apache.org/jira/browse/FLINK-1650 Project: Flink Issue Type: Bug Components: other Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 I suggest setting the log level for `org.jboss.netty.channel.DefaultChannelPipeline` to ERROR, in order to get rid of the misleading stack trace caused by an akka/netty hiccup on shutdown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
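A minimal sketch of the suggested log configuration, assuming log4j is the active logging backend (as in Flink's default distribution); the logger name is taken from the warning quoted above:

```properties
# Raise this logger to ERROR so the harmless-but-scary
# RejectedExecutionException warnings from netty's shutdown are suppressed.
log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR
```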
[jira] [Created] (FLINK-1649) Give a good error message when a user program emits a null record
Stephan Ewen created FLINK-1649: --- Summary: Give a good error message when a user program emits a null record Key: FLINK-1649 URL: https://issues.apache.org/jira/browse/FLINK-1649 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
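A hypothetical sketch of the kind of check the issue asks for (illustrative only, not the actual Flink collector API): fail fast with a descriptive message at the point where the record is emitted, instead of an opaque NullPointerException deep in the serializer stack.

```java
public class NullRecordCheck {
    // Illustrative guard at the point where a user function hands a record
    // to the runtime; a null here would otherwise surface much later.
    public static <T> T checkNotNull(T record) {
        if (record == null) {
            throw new NullPointerException(
                "The user function must not emit a null record.");
        }
        return record;
    }
}
```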
[jira] [Commented] (FLINK-1572) Output directories are created before input paths are checked
[ https://issues.apache.org/jira/browse/FLINK-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348675#comment-14348675 ] Stephan Ewen commented on FLINK-1572: - Can we always remove the output directory? What if it was already there and only the files (1, 2, 3, ...) were created by Flink? Output directories are created before input paths are checked - Key: FLINK-1572 URL: https://issues.apache.org/jira/browse/FLINK-1572 Project: Flink Issue Type: Improvement Components: JobManager Affects Versions: 0.9 Reporter: Robert Metzger Priority: Minor Flink first creates the output directories for a job before creating the input splits. If a job's input directories are wrong, the system will have created output directories for a failed job. It would be much better if the system created the output directories on demand, right before data is actually written. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
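A hypothetical sketch of the on-demand behavior (illustrative names, not Flink's actual output format API): the directory comes into existence only when the first record arrives, so a job that fails while checking its input paths leaves no empty output directory behind.

```java
import java.io.File;
import java.io.IOException;

public class LazyOutputDir {
    private final File dir;
    private boolean created = false;

    public LazyOutputDir(File dir) {
        this.dir = dir;
    }

    // Creating the directory is deferred until data is actually written.
    public void write(String record) throws IOException {
        if (!created) {
            if (!dir.exists() && !dir.mkdirs()) {
                throw new IOException("Could not create output directory " + dir);
            }
            created = true;
        }
        // ... append the record to a file inside dir ...
    }
}
```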
[jira] [Commented] (FLINK-1601) Sometimes the YARNSessionFIFOITCase fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348685#comment-14348685 ] Stephan Ewen commented on FLINK-1601: - [~rmetzger] Any insight where this may come from? Sometimes the YARNSessionFIFOITCase fails on Travis --- Key: FLINK-1601 URL: https://issues.apache.org/jira/browse/FLINK-1601 Project: Flink Issue Type: Bug Reporter: Till Rohrmann Assignee: Robert Metzger Sometimes the {{YARNSessionFIFOITCase}} fails on Travis with the following exception. {code} Tests run: 8, Failures: 8, Errors: 0, Skipped: 0, Time elapsed: 71.375 sec FAILURE! - in org.apache.flink.yarn.YARNSessionFIFOITCase perJobYarnCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: 60.707 sec FAILURE! java.lang.AssertionError: During the timeout period of 60 seconds the expected string did not show up at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.flink.yarn.YarnTestBase.runWithArgs(YarnTestBase.java:315) at org.apache.flink.yarn.YARNSessionFIFOITCase.perJobYarnCluster(YARNSessionFIFOITCase.java:185) testQueryCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: 0.507 sec FAILURE! 
java.lang.AssertionError: There is at least one application on the cluster is not finished at org.junit.Assert.fail(Assert.java:88) at org.apache.flink.yarn.YarnTestBase.checkClusterEmpty(YarnTestBase.java:146) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124) at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103) {code} The result is {code} Failed tests: YARNSessionFIFOITCase.perJobYarnCluster:185-YarnTestBase.runWithArgs:315 During the timeout period of 60 seconds the expected string did not show up YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on
[jira] [Resolved] (FLINK-1649) Give a good error message when a user program emits a null record
[ https://issues.apache.org/jira/browse/FLINK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1649. - Resolution: Fixed Fixed via 482766e949d69e282ed862bd97f2a8378b2f699e Give a good error message when a user program emits a null record - Key: FLINK-1649 URL: https://issues.apache.org/jira/browse/FLINK-1649 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1635) Remove Apache Thrift dependency from Flink
[ https://issues.apache.org/jira/browse/FLINK-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344841#comment-14344841 ] Stephan Ewen commented on FLINK-1635: - I agree, let's remove this. The way we currently support it out-of-the-box ties us to a specific version, which is bad for both protobuf and thrift. Remove Apache Thrift dependency from Flink -- Key: FLINK-1635 URL: https://issues.apache.org/jira/browse/FLINK-1635 Project: Flink Issue Type: Bug Components: Java API Affects Versions: 0.9 Reporter: Robert Metzger I've added Thrift and Protobuf to Flink to support them out of the box with Kryo. However, after trying to access an HCatalog/Hive table yesterday using Flink, I found that there is a dependency conflict between Flink and Hive (on thrift). Maybe it makes more sense to properly document our serialization framework and provide a copy-paste solution for how to get thrift/protobuf et al. to work with Flink. Please chime in if you are against removing the out-of-the-box support for protobuf and kryo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1608) TaskManagers may pick wrong network interface when starting before JobManager
[ https://issues.apache.org/jira/browse/FLINK-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-1608: Description: The taskmanagers use a NetUtils routine to find an interface that lets them talk to the Jobmanager. However, if the JobManager is not online yet, they fall back to some non-localhost device. In cases where the TaskManagers start faster than the JobManager, they pick the wrong hostname and interface. The later logic (that tries to connect to the JobManager actor) has a logic with retries. I think we need a similar logic here... was: The taskmanagers use a NetUtils routine to find an interface that lets them talk to the Jobmanager. However, if the JobManager is not online yet, they fall back to localhost. In cases where the TaskManagers start faster than the JobManager, they pick the wrong hostname and interface. TaskManagers may pick wrong network interface when starting before JobManager - Key: FLINK-1608 URL: https://issues.apache.org/jira/browse/FLINK-1608 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 The taskmanagers use a NetUtils routine to find an interface that lets them talk to the Jobmanager. However, if the JobManager is not online yet, they fall back to some non-localhost device. In cases where the TaskManagers start faster than the JobManager, they pick the wrong hostname and interface. The later logic (that tries to connect to the JobManager actor) has a logic with retries. I think we need a similar logic here... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1580) Cleanup TaskManager initialization logic
[ https://issues.apache.org/jira/browse/FLINK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335468#comment-14335468 ] Stephan Ewen commented on FLINK-1580: - Partially solved in ed8b26bf2e8dd7c187c24ad0d8ff3e67f6a7478c Cleanup TaskManager initialization logic Key: FLINK-1580 URL: https://issues.apache.org/jira/browse/FLINK-1580 Project: Flink Issue Type: Bug Reporter: Till Rohrmann Currently, the TaskManager initializes many heavyweight objects upon registration at the JobManager. If an exception occurs during the initialization, it takes quite long until the {{JobManager}} detects the {{TaskManager}} failure. Therefore, it would be better if we could rearrange the initialization logic so that the {{TaskManager}} only registers at the {{JobManager}} if all objects could be initialized successfully. Moreover, it would be worthwhile to move some of the initialization work out of the {{TaskManager}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
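A hypothetical sketch of the proposed ordering (illustrative names, not the actual TaskManager code): initialize every component first, and only register with the JobManager once all of them came up successfully.

```java
public class StartupOrder {
    public interface Component {
        void init() throws Exception;
    }

    // Returns true only if registration happened, i.e. all components came up.
    public static boolean initThenRegister(Component[] components, Runnable register) {
        try {
            for (Component c : components) {
                c.init(); // any failure here prevents registration entirely
            }
        } catch (Exception e) {
            return false; // the JobManager never sees a half-initialized TaskManager
        }
        register.run();
        return true;
    }
}
```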
[jira] [Updated] (FLINK-1608) TaskManagers may pick wrong network interface when starting before JobManager
[ https://issues.apache.org/jira/browse/FLINK-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-1608: Description: The TaskManager uses a NetUtils routine to pick a network interface that lets it talk to the Jobmanager. However, if the JobManager is not online yet, the TaskManager falls back to an arbitrary non-localhost device. In cases where the TaskManagers start faster than the JobManager, they may pick the wrong interface (and associated address and hostname). The later logic (that tries to connect to the JobManager actor) does several retries. I think we need similar logic when looking for a suitable network interface to use. was: The taskmanagers use a NetUtils routine to find an interface that lets them talk to the Jobmanager. However, if the JobManager is not online yet, they fall back to some non-localhost device. In cases where the TaskManagers start faster than the JobManager, they pick the wrong hostname and interface. The later logic (that tries to connect to the JobManager actor) has a logic with retries. I think we need a similar logic here... TaskManagers may pick wrong network interface when starting before JobManager - Key: FLINK-1608 URL: https://issues.apache.org/jira/browse/FLINK-1608 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 The TaskManager uses a NetUtils routine to pick a network interface that lets it talk to the Jobmanager. However, if the JobManager is not online yet, the TaskManager falls back to an arbitrary non-localhost device. In cases where the TaskManagers start faster than the JobManager, they may pick the wrong interface (and associated address and hostname). The later logic (that tries to connect to the JobManager actor) does several retries. I think we need similar logic when looking for a suitable network interface to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1590) Log environment information also in YARN mode
[ https://issues.apache.org/jira/browse/FLINK-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1590. - Resolution: Fixed Fix Version/s: 0.9 Assignee: Stephan Ewen (was: Robert Metzger) Fixed via ed8b26bf2e8dd7c187c24ad0d8ff3e67f6a7478c Log environment information also in YARN mode - Key: FLINK-1590 URL: https://issues.apache.org/jira/browse/FLINK-1590 Project: Flink Issue Type: Improvement Components: YARN Client Affects Versions: 0.9 Reporter: Robert Metzger Assignee: Stephan Ewen Priority: Minor Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1608) TaskManagers may pick wrong network interface when starting before JobManager
[ https://issues.apache.org/jira/browse/FLINK-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335442#comment-14335442 ] Stephan Ewen commented on FLINK-1608: - As a safety fallback, I suggest that we allow the TaskManager hostname to be specified in the configuration. To make proper use of this, each TaskManager would need a distinct configuration. Not a standard scenario, but a fallback solution if the automatic methods fail. TaskManagers may pick wrong network interface when starting before JobManager - Key: FLINK-1608 URL: https://issues.apache.org/jira/browse/FLINK-1608 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 The taskmanagers use a NetUtils routine to find an interface that lets them talk to the Jobmanager. However, if the JobManager is not online yet, they fall back to some non-localhost device. In cases where the TaskManagers start faster than the JobManager, they pick the wrong hostname and interface. The later logic (that tries to connect to the JobManager actor) retries several times. I think we need similar logic here... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
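A sketch of what the suggested fallback could look like in a per-machine `flink-conf.yaml`. Both the key name `taskmanager.hostname` and the host value are assumptions for illustration; they are not taken from the issue:

```yaml
# Hypothetical per-machine override: pin the address this TaskManager
# announces, bypassing the automatic network interface selection.
taskmanager.hostname: worker-3.cluster.internal
```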
[jira] [Commented] (FLINK-1610) Java docs do not build
[ https://issues.apache.org/jira/browse/FLINK-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336993#comment-14336993 ] Stephan Ewen commented on FLINK-1610: - May be related to referring to Scala classes (and objects) from Java. Some tools have hiccups there, due to the fact that the Scala class file names do not match the class name in the code (objects, for example, append a `$`). Java docs do not build -- Key: FLINK-1610 URL: https://issues.apache.org/jira/browse/FLINK-1610 Project: Flink Issue Type: Bug Components: Build System, Documentation Affects Versions: 0.9 Reporter: Max Michels Fix For: 0.9 Among a bunch of warnings, I get the following error which prevents the Javadoc generation from finishing: {code} javadoc: error - com.sun.tools.doclets.internal.toolkit.util.DocletAbortException: com.sun.tools.javac.code.Symbol$CompletionFailure: class file for akka.testkit.TestKit not found Command line was: /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/bin/javadoc -Xdoclint:none @options @packages at org.apache.maven.plugin.javadoc.AbstractJavadocMojo.executeJavadocCommandLine(AbstractJavadocMojo.java:5074) at org.apache.maven.plugin.javadoc.AbstractJavadocMojo.executeReport(AbstractJavadocMojo.java:1999) at org.apache.maven.plugin.javadoc.JavadocReport.generate(JavadocReport.java:130) at org.apache.maven.plugin.javadoc.JavadocReport.execute(JavadocReport.java:315) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80) at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355) at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584) at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216) at org.apache.maven.cli.MavenCli.main(MavenCli.java:160) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1162) Cannot serialize Scala classes with Avro serializer
[ https://issues.apache.org/jira/browse/FLINK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333167#comment-14333167 ] Stephan Ewen commented on FLINK-1162: - Yep, this one should be resolved. Cannot serialize Scala classes with Avro serializer --- Key: FLINK-1162 URL: https://issues.apache.org/jira/browse/FLINK-1162 Project: Flink Issue Type: Bug Components: Local Runtime, Scala API Reporter: Till Rohrmann The problem occurs for class names containing a '$' dollar sign, as is sometimes the case for Scala classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1596) FileIOChannel introduces space in temp file name
[ https://issues.apache.org/jira/browse/FLINK-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1596. - Resolution: Fixed Fixed in 0.9 in 98bc7b951b30961871958a4483e0b186bfb785b8 FileIOChannel introduces space in temp file name Key: FLINK-1596 URL: https://issues.apache.org/jira/browse/FLINK-1596 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9, 0.8.1 Reporter: Johannes Assignee: Johannes Priority: Minor Labels: easyfix FLINK-1483 introduced separate directories for all threads. Unfortunately, this seems not to work on Windows, due to spaces in the file name. Stacktrace: {code} Caused by: java.io.IOException: Channel to path '\AppData\Local\Temp\flink-io-366dee63-092c-415c-b119-a138506dec86\ fa44b17b98c3b1b1e30185fd92be5d01.02.channel' could not be opened. at org.apache.flink.runtime.io.disk.iomanager.AbstractFileIOChannel.init(AbstractFileIOChannel.java:61) at org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannel.init(AsynchronousFileIOChannel.java:77) at org.apache.flink.runtime.io.disk.iomanager.AsynchronousBulkBlockReader.init(AsynchronousBulkBlockReader.java:46) at org.apache.flink.runtime.io.disk.iomanager.AsynchronousBulkBlockReader.init(AsynchronousBulkBlockReader.java:39) at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.createBulkBlockChannelReader(IOManagerAsync.java:236) at org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:747) at org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508) at org.apache.flink.runtime.operators.hash.ReOpenableMutableHashTable.prepareNextPartition(ReOpenableMutableHashTable.java:167) at org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:541) at
org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104) at org.apache.flink.runtime.operators.AbstractCachedBuildSideMatchDriver.run(AbstractCachedBuildSideMatchDriver.java:155) at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496) at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139) at org.apache.flink.runtime.iterative.task.IterationIntermediatePactTask.run(IterationIntermediatePactTask.java:92) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:204) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: AppData\Local\Temp\flink-io-366dee63-092c-415c-b119-a138506dec86\ fa44b17b98c3b1b1e30185fd92be5d01.02.channel (Das System kann die angegebene Datei nicht finden) at java.io.RandomAccessFile.open(Native Method) at java.io.RandomAccessFile.init(RandomAccessFile.java:241) at java.io.RandomAccessFile.init(RandomAccessFile.java:122) at org.apache.flink.runtime.io.disk.iomanager.AbstractFileIOChannel.init(AbstractFileIOChannel.java:57) ... 16 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
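A hypothetical illustration of the underlying problem (not the actual FileIOChannel code): names for temp files built from random strings must not contain characters such as spaces, which the stack trace above shows breaking the Windows path.

```java
public class ChannelNames {
    // Illustrative sanitizer: keep only characters that are safe in file
    // names on all platforms; the trace above shows a leading space in the
    // generated channel name making the file unopenable on Windows.
    public static String channelName(String randomId, int ordinal) {
        String safe = randomId.trim().replaceAll("[^a-zA-Z0-9.-]", "_");
        return String.format("%s.%02d.channel", safe, ordinal);
    }
}
```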
[jira] [Resolved] (FLINK-1598) Give better error messages when serializers run out of space.
[ https://issues.apache.org/jira/browse/FLINK-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1598. - Resolution: Fixed Fixed via 9528a521f56e0c6b0c70d43e62ad84b19c048c36 Give better error messages when serializers run out of space. - Key: FLINK-1598 URL: https://issues.apache.org/jira/browse/FLINK-1598 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1613) Cannot submit to remote ExecutionEnvironment from IDE
[ https://issues.apache.org/jira/browse/FLINK-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338714#comment-14338714 ] Stephan Ewen commented on FLINK-1613: - This is somewhat expected, currently. The remote executor needs a JAR file with all the user-defined classes. It will ship that JAR to the cluster. There is work to collect these classes automatically. If you want, pick up the pull request and use the jarfile creator to automatically gather the classes and generate the JAR to ship. https://github.com/apache/flink/pull/35 Cannot submit to remote ExecutionEnvironment from IDE -- Key: FLINK-1613 URL: https://issues.apache.org/jira/browse/FLINK-1613 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.8.1 Environment: * Ubuntu Linux 14.04 * Flink 0.9-SNAPSHOT or 0.8.1 running in standalone mode on localhost Reporter: Alexander Alexandrov Fix For: 0.9, 0.8.2 I am reporting this as [~rmetzler] mentioned offline that it was working in the past. At the moment it is not possible to submit jobs directly from the IDE. Both the Java and the Scala quickstart guides fail on both 0.8.1 and 0.9-SNAPSHOT with ClassNotFoundException exceptions.
To reproduce the error, run the quickstart scripts and change the ExecutionEnvironment initialization: {code:java} env = ExecutionEnvironment.createRemoteEnvironment("localhost", 6123); {code} This is the cause for Java: {noformat} Caused by: java.lang.ClassNotFoundException: org.myorg.quickstart.WordCount$LineSplitter at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:54) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:274) at org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:236) at org.apache.flink.runtime.operators.util.TaskConfig.getStubWrapper(TaskConfig.java:281) {noformat} This is for Scala: {noformat} java.lang.ClassNotFoundException: org.myorg.quickstart.WordCount$$anon$2$$anon$1 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:54) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:274) at org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:236) at org.apache.flink.api.java.typeutils.runtime.RuntimeSerializerFactory.readParametersFromConfig(RuntimeSerializerFactory.java:76) at
[jira] [Resolved] (FLINK-1781) Quickstarts broken due to Scala Version Variables
[ https://issues.apache.org/jira/browse/FLINK-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1781. - Resolution: Fixed Fixed in 1aba942c1fd7c4dbf1c4d4f30602d69c2cb3540e Quickstarts broken due to Scala Version Variables - Key: FLINK-1781 URL: https://issues.apache.org/jira/browse/FLINK-1781 Project: Flink Issue Type: Bug Components: Quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Blocker Fix For: 0.9 The quickstart archetype resources refer to the scala version variables. When creating a maven project standalone, these variables are not defined, and the pom is invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1754) Deadlock in job execution
[ https://issues.apache.org/jira/browse/FLINK-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384450#comment-14384450 ] Stephan Ewen commented on FLINK-1754: - Is this issue resolved? The issue with deadlocks in 0.8.1 is (I think) that the runtime does not obey the assumptions from the optimizer. The hash table building requires (for some reason) data availability on the probe side as well. Deadlock in job execution - Key: FLINK-1754 URL: https://issues.apache.org/jira/browse/FLINK-1754 Project: Flink Issue Type: Bug Affects Versions: 0.8.1 Reporter: Sebastian Kruse I have encountered a reproducible deadlock in the execution of one of my jobs. The part of the plan where this happens is the following: {code:java}
/** Performs the reduction via creating transitive INDs and removing them from the original IND set. */
private DataSet<Tuple2<Integer, int[]>> calculateTransitiveReduction1(DataSet<Tuple2<Integer, int[]>> inclusionDependencies) {
    // Concatenate INDs (only one hop).
    DataSet<Tuple2<Integer, int[]>> transitiveInds = inclusionDependencies
        .flatMap(new SplitInds())
        .joinWithTiny(inclusionDependencies)
        .where(1).equalTo(0)
        .with(new ConcatenateInds());
    // Remove the concatenated INDs to come up with a transitive reduction of the INDs.
    return inclusionDependencies
        .coGroup(transitiveInds)
        .where(0).equalTo(0)
        .with(new RemoveTransitiveInds());
}
{code} Seemingly, the flatmap operator waits infinitely for a free buffer to write on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1796) Local mode TaskManager should have a process reaper
[ https://issues.apache.org/jira/browse/FLINK-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384458#comment-14384458 ] Stephan Ewen commented on FLINK-1796: - I have a fix for this coming up... Local mode TaskManager should have a process reaper --- Key: FLINK-1796 URL: https://issues.apache.org/jira/browse/FLINK-1796 Project: Flink Issue Type: Bug Components: JobManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 We use process reaper actors (a typical Akka design pattern) to shut down the JVM processes when the core actors die, as this is currently unrecoverable. The local mode uses the process reaper only for the JobManager actor, not for the TaskManager actor. This may leave stale, dead JVMs behind after critical TaskManager errors and makes debugging harder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1783) Quickstart shading should not create shaded jar and dependency-reduced pom
[ https://issues.apache.org/jira/browse/FLINK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1783. - Resolution: Fixed Assignee: Stephan Ewen Fixed in d11e0910880a48bbd5c452e4c76ffdca000f5614 Quickstart shading should not create shaded jar and dependency-reduced pom --- Key: FLINK-1783 URL: https://issues.apache.org/jira/browse/FLINK-1783 Project: Flink Issue Type: Improvement Components: Quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1795) Solution set allows duplicates upon construction.
[ https://issues.apache.org/jira/browse/FLINK-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384456#comment-14384456 ] Stephan Ewen commented on FLINK-1795: - I have a fix and test coming up... Solution set allows duplicates upon construction. - Key: FLINK-1795 URL: https://issues.apache.org/jira/browse/FLINK-1795 Project: Flink Issue Type: Bug Components: Iterations Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The solution set identifies entries uniquely by key. During construction, it does not eliminate duplicates. The duplicates do not get updated during the iterations (since only the first match is considered), but are contained in the final result. This contradicts the definition of the solution set. It should not contain duplicates to begin with. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1796) Local mode TaskManager should have a process reaper
Stephan Ewen created FLINK-1796: --- Summary: Local mode TaskManager should have a process reaper Key: FLINK-1796 URL: https://issues.apache.org/jira/browse/FLINK-1796 Project: Flink Issue Type: Bug Components: JobManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 We use process reaper actors (a typical Akka design pattern) to shut down the JVM processes when the core actors die, as this is currently unrecoverable. The local mode uses the process reaper only for the JobManager actor, not for the TaskManager actor. This may leave stale, dead JVMs behind after critical TaskManager errors and makes debugging harder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
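The process-reaper pattern described above can be sketched in plain Java (illustrative only; the real implementation is an Akka actor watching the TaskManager actor, and it would call System.exit rather than print):

```java
// Illustrative sketch of the "process reaper" pattern (not Akka code):
// a watchdog observes a critical worker and forcibly terminates the
// JVM when the worker dies, instead of leaving a stale process behind.
public class ProcessReaperSketch {
    public static void main(String[] args) throws Exception {
        Thread taskManager = new Thread(() -> {
            throw new RuntimeException("simulated fatal TaskManager error");
        }, "taskmanager");

        // Swallow the error so the demo's own output stays clean.
        taskManager.setUncaughtExceptionHandler((t, e) -> { });
        taskManager.start();

        // The reaper joins on the critical thread; in the real pattern it
        // would now call System.exit(-1). We print instead, so that this
        // sketch terminates normally.
        Thread reaper = new Thread(() -> {
            try {
                taskManager.join();
                System.out.println("reaper: critical actor died, killing process");
            } catch (InterruptedException ignored) { }
        }, "reaper");
        reaper.start();
        reaper.join();
    }
}
```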
[jira] [Created] (FLINK-1795) Solution set allows duplicates upon construction.
Stephan Ewen created FLINK-1795: --- Summary: Solution set allows duplicates upon construction. Key: FLINK-1795 URL: https://issues.apache.org/jira/browse/FLINK-1795 Project: Flink Issue Type: Bug Components: Iterations Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The solution set identifies entries uniquely by key. During construction, it does not eliminate duplicates. The duplicates do not get updated during the iterations (since only the first match is considered), but are contained in the final result. This contradicts the definition of the solution set. It should not contain duplicates to begin with. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
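The intended construction-time behavior can be modeled with a plain map (illustrative Java, not the actual CompactingHashTable code): inserting a record whose key is already present must replace the existing entry instead of storing a second copy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model of what building the solution set with unique keys
// should guarantee: inserting two records with the same key leaves
// exactly one entry (here: last write wins), not both.
public class SolutionSetDedupSketch {
    public static void main(String[] args) {
        Map<Integer, String> solutionSet = new LinkedHashMap<>();
        String[][] initial = { {"1", "a"}, {"2", "b"}, {"1", "c"} }; // key 1 appears twice

        for (String[] rec : initial) {
            // put() replaces the existing entry for the key -- the
            // dedup-on-construction step the hash table was missing.
            solutionSet.put(Integer.parseInt(rec[0]), rec[1]);
        }
        System.out.println(solutionSet.size() + " entries: " + solutionSet);
    }
}
```

Without the replacement semantics, the duplicate for key 1 would survive construction, never be updated (only the first match is considered), and still appear in the final result, exactly the contradiction the issue describes.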
[jira] [Commented] (FLINK-1782) Change Quickstart Java version to 1.7
[ https://issues.apache.org/jira/browse/FLINK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384254#comment-14384254 ] Stephan Ewen commented on FLINK-1782: - This is currently not possible, since the builds with Java 6 fail to run the quickstart tests when the quickstart pom specifies compiler version 1.7 (Java 7). Change Quickstart Java version to 1.7 - Key: FLINK-1782 URL: https://issues.apache.org/jira/browse/FLINK-1782 Project: Flink Issue Type: Improvement Components: Quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen The quickstarts refer to the outdated Java 1.6 source and bin version. We should upgrade this to 1.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
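For reference, the change the ticket asks for would be a standard maven-compiler-plugin configuration in the quickstart POM (shown for illustration); it is exactly this setting that the Java 6 CI builds cannot run against:

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <!-- Raised from 1.6; a Java 6 JDK cannot build or test this -->
        <source>1.7</source>
        <target>1.7</target>
    </configuration>
</plugin>
```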
[jira] [Resolved] (FLINK-1782) Change Quickstart Java version to 1.7
[ https://issues.apache.org/jira/browse/FLINK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen resolved FLINK-1782. - Resolution: Won't Fix Fix Version/s: (was: 0.9) Change Quickstart Java version to 1.7 - Key: FLINK-1782 URL: https://issues.apache.org/jira/browse/FLINK-1782 Project: Flink Issue Type: Improvement Components: Quickstarts Affects Versions: 0.9 Reporter: Stephan Ewen The quickstarts refer to the outdated Java 1.6 source and bin version. We should upgrade this to 1.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376899#comment-14376899 ] Stephan Ewen commented on FLINK-1319: - Very nice result, a very much anticipated feature. Can you tell us how many functions are currently analyzed by this? Does the basic mechanism work with record-at-a-time functions only, or also with group-at-a-time functions? To proceed:
- Do we need an extra project for this? I would actually not mind having this in core / java. It is sort of lightweight and we have the ASM dependency anyways (closure cleaning).
- To activate or deactivate it, I would use the ExecutionConfig in the ExecutionEnvironment. From my experience with users, no one bothers to call any of the parametrization methods ever (withForwardFields, withName, analyzeUdf, ...). If we make it dependent on that, it will effectively not be used.
- I would have it deactivated by default initially. Users can activate it globally with the ExecutionConfig. We should have it activated in all tests to get code coverage with our test UDFs. This can be done centrally, where the test context environments are created.
- We can activate it by default in the next release, once we have given this some testing and exposure.
Other comments:
- I would vote to throw an exception (or at least print a warning) if you detect that any path in the program returns a null value.
- The ASM dependency version needs to be set by a variable (defined in the root pom, interaction with shading)
- Can you format the POM xml like the other POMs (tabs)?
Add static code analysis for UDFs - Key: FLINK-1319 URL: https://issues.apache.org/jira/browse/FLINK-1319 Project: Flink Issue Type: New Feature Components: Java API, Scala API Reporter: Stephan Ewen Assignee: Timo Walther Priority: Minor Flink's Optimizer takes information that tells it for UDFs which fields of the input elements are accessed, modified, or forwarded/copied.
This information frequently helps to reuse partitionings, sorts, etc. It may speed up programs significantly, as it can frequently eliminate sorts and shuffles, which are costly. Right now, users can add lightweight annotations to UDFs to provide this information (such as adding {{@ConstantFields(0->3, 1, 2->1)}}). We worked with static code analysis of UDFs before, to determine this information automatically. This is an incredible feature, as it magically makes programs faster. For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the Soot toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far. I propose to add this functionality to Flink, in the form of a drop-in addition, to work around the LGPL incompatibility with ASL 2.0. Users could simply download a special flink-code-analysis.jar and drop it into the lib folder to enable this functionality. We may even add a script to tools that downloads that library automatically into the lib folder. This should be legally fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility with ASL 2.0 is mainly in the patentability, if I remember correctly). Prior work on this has been done by [~aljoscha] and [~skunert], which could provide a code base to start with. *Appendix* Homepage of the Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/ Papers on static analysis and for optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf (Section 6) Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf (Sections 4.3 and 5.3) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
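What such a forward-field fact buys the optimizer can be shown with a toy example (plain Java with a hypothetical UDF; this is not the proposed analyzer itself): if analysis proves that input field 0 reaches output field 0 unchanged, a partitioning or sort order on that field survives the operator and need not be recreated.

```java
// Illustrative (non-Flink) model of what a forwarded-field declaration
// like "0->0" asserts: the UDF copies input field 0 to output field 0
// unchanged, so partitioning/sorting on that field survives the operator.
public class ForwardedFieldsSketch {
    // Hypothetical UDF: record modeled as int[] in, int[] out.
    static int[] mapUdf(int[] record) {
        return new int[] { record[0], record[1] * 2 }; // field 0 forwarded, field 1 modified
    }

    public static void main(String[] args) {
        int[] in = { 7, 3 };
        int[] out = mapUdf(in);
        // The property a static analyzer (or a manual annotation) would establish:
        System.out.println("field 0 forwarded: " + (in[0] == out[0]));
    }
}
```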
[jira] [Commented] (FLINK-1916) EOFException when running delta-iteration job
[ https://issues.apache.org/jira/browse/FLINK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505338#comment-14505338 ] Stephan Ewen commented on FLINK-1916: - Confirmed, this is a bug in the {{CompactingHashTable}} class. [~knub] - do you have a minimal example that is able to reproduce this bug? Then I'll try and fix it. EOFException when running delta-iteration job - Key: FLINK-1916 URL: https://issues.apache.org/jira/browse/FLINK-1916 Project: Flink Issue Type: Bug Environment: 0.9-milestone-1 Exception on the cluster, local execution works Reporter: Stefan Bunk The delta-iteration program in [1] ends with a
{code}
java.io.EOFException
	at org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333)
	at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140)
	at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223)
	at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeLong(AbstractPagedOutputView.java:291)
	at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeDouble(AbstractPagedOutputView.java:307)
	at org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:62)
	at org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:26)
	at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:89)
	at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:29)
	at org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:219)
	at org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536)
	at org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
	at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
	at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
	at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
	at java.lang.Thread.run(Thread.java:745)
{code}
For logs and the accompanying mailing list discussion see below. When running with a slightly different memory configuration, as hinted on the mailing list, I sometimes also get this exception:
{code}
19.Apr. 13:39:29 INFO Task - IterationHead(WorksetIteration (Resolved-Redirects)) (10/10) switched to FAILED : java.lang.IndexOutOfBoundsException: Index: 161, Size: 161
	at java.util.ArrayList.rangeCheck(ArrayList.java:635)
	at java.util.ArrayList.get(ArrayList.java:411)
	at org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.resetTo(InMemoryPartition.java:352)
	at org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.access$100(InMemoryPartition.java:301)
	at org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:226)
	at org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536)
	at org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
	at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
	at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
	at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
	at java.lang.Thread.run(Thread.java:745)
{code}
[1] Flink job: https://gist.github.com/knub/0bfec859a563009c1d57 [2] Job manager logs:
https://gist.github.com/knub/01e3a4b0edb8cde66ff4 [3] One task manager's logs: https://gist.github.com/knub/8f2f953da95c8d7adefc [4] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/EOFException-when-running-Flink-job-td1092.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
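The mechanism at the top of the first stack trace can be modeled standalone (illustrative Java; the segment size and all names are invented, this is not the Flink internals): writes fill fixed-size memory segments, and advancing to the next segment must either provide a fresh one or fail, which is where the EOFException surfaces if the table's bookkeeping hands out fewer segments than the writer needs.

```java
import java.io.EOFException;
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the paged-output-view mechanism in the trace:
// writeByte() fills fixed-size segments; when one is full, the writer
// asks for the next segment and fails with EOFException if none is left.
public class PagedOutputSketch {
    static final int SEGMENT_SIZE = 4;

    final List<byte[]> segments = new ArrayList<>();
    final int maxSegments;
    int positionInSegment = SEGMENT_SIZE; // forces an advance on first write

    PagedOutputSketch(int maxSegments) { this.maxSegments = maxSegments; }

    void writeByte(byte b) throws EOFException {
        if (positionInSegment == SEGMENT_SIZE) {
            // nextSegment(): where FLINK-1916's EOFException surfaces
            if (segments.size() == maxSegments) {
                throw new EOFException("no more segments available");
            }
            segments.add(new byte[SEGMENT_SIZE]);
            positionInSegment = 0;
        }
        segments.get(segments.size() - 1)[positionInSegment++] = b;
    }

    public static void main(String[] args) {
        PagedOutputSketch view = new PagedOutputSketch(2); // 2 segments = 8 bytes
        try {
            for (int i = 0; i < 9; i++) {   // the 9th byte exceeds capacity
                view.writeByte((byte) i);
            }
        } catch (EOFException e) {
            System.out.println("EOFException after "
                    + (view.segments.size() * SEGMENT_SIZE) + " bytes");
        }
    }
}
```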
[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script
[ https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505313#comment-14505313 ] Stephan Ewen commented on FLINK-1908: - I don't think that this issue will be fixed in 0.8.x. @DarkKnightCZ Can you verify whether 0.9 works for you? JobManager startup delay isn't considered when using start-cluster.sh script Key: FLINK-1908 URL: https://issues.apache.org/jira/browse/FLINK-1908 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.9, 0.8.1 Environment: Linux Reporter: Lukas Raska Priority: Minor Original Estimate: 5m Remaining Estimate: 5m When starting a Flink cluster via the start-cluster.sh script, JobManager startup can be delayed (as it's started asynchronously), which can result in failed startup of several task managers. The solution is to wait a certain amount of time, periodically checking whether the RPC port is accessible, and then proceed with starting the task managers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
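The proposed fix, polling the JobManager's RPC port before starting the TaskManagers, can be sketched as follows (illustrative Java; in practice this would be a loop in start-cluster.sh, and the host, port, and timeouts here are placeholders):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the fix: instead of starting TaskManagers after a fixed
// delay, poll the JobManager's RPC port until it accepts connections.
public class WaitForPortSketch {
    static boolean waitForPort(String host, int port, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 250);
                return true;                 // JobManager is reachable
            } catch (IOException e) {
                Thread.sleep(100);           // not up yet, retry
            }
        }
        return false;                        // give up after the timeout
    }

    public static void main(String[] args) throws Exception {
        // A listening socket stands in for the JobManager's RPC endpoint.
        try (ServerSocket jobManager = new ServerSocket(0)) {
            boolean up = waitForPort("127.0.0.1", jobManager.getLocalPort(), 5000);
            System.out.println("jobmanager reachable: " + up);
        }
    }
}
```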